
working on gf compiler doc

This commit is contained in:
aarne
2006-10-22 14:17:50 +00:00
parent 6f7617bc72
commit 6b00e78f66


Compiling GF
Aarne Ranta
Proglog meeting, 1 November 2006
% to compile: txt2tags -thtml compiling-gf.txt ; htmls compiling-gf.html
%!target:html
%!postproc(html): #NEW <!-- NEW -->
#NEW
==The compilation task==
GF is a grammar formalism, i.e. a special purpose programming language
for writing grammars.
Other grammar formalisms:
- BNF, YACC, Happy (grammars for programming languages);
- PATR, HPSG, LFG (grammars for natural languages).
The grammar compiler prepares a GF grammar for two computational tasks:
- linearization: take syntax trees to strings
- parsing: take strings to syntax trees
The grammar gives a declarative description of these functionalities,
on a high abstraction level that helps grammar writing
productivity.
Some of the ideas in GF and experience gained from it
can be useful for other special-purpose functional languages.
#NEW
==Characteristics of GF language==
Functional language with types, both built-in and user-defined.
```
Str : Type
param Number = Sg | Pl
param AdjForm = ASg Gender | APl
Noun : Type = {s : Number => Str ; g : Gender}
```
Pattern matching.
```
svart_A = table {
ASg _ => "svart" ;
_ => "svarta"
}
```
Higher-order functions.
Module system reminiscent of ML (signatures, structures, functors).
#NEW
==GF vs. Haskell==
Some things that (standard) Haskell doesn't have:
Some things that GF doesn't have:
- classes, polymorphism
#NEW
==GF vs. most linguistic grammar formalisms==
GF separates abstract syntax from concrete syntax.
GF has a module system with separate compilation.
GF is generation-oriented (as opposed to parsing).
GF has unidirectional matching (as opposed to unification).
GF has a static type system (as opposed to a type-free universe).
"I was - and I still am - firmly convinced that a program composed
out of statically type-checked parts is more likely to faithfully
weakly-typed interfaces or dynamically-checked interfaces."
(B. Stroustrup, 1994, p. 107)
#NEW
==The computation model==
An abstract syntax defines a free algebra of trees (using
dependent types, recursion, higher-order abstract syntax:
GF includes a complete Logical Framework).
A concrete syntax defines a homomorphism (compositional mapping)
from the abstract syntax to a system of tuples of strings.
The parsing problem can be reduced to that of PMCFG (Parallel
Multiple Context-Free Grammars); see P. Ljunglöf's thesis (2004).
#NEW
==The compilation task, again==
+ From a GF source grammar, derive a canonical GF grammar
Moreover, we generate supplementary formats such as grammars
required by various speech recognition systems.
#NEW
==An overview of compilation phases==
Legend:
[gf-compiler.png]
#NEW
==Using the compiler==
Batch mode (cf. GHC).
Interactive mode, building the grammar incrementally from
different files, with the possibility of testing them
(cf. GHCI).
The interactive mode came first, built on the model of ALF-2
(L. Magnusson), and there was no file output of compiled
grammars.
#NEW
==Modules and separate compilation==
The above diagram shows what happens to each module.
details (although it would be beneficial to make explicit the
rules that are right now just in the implementation...)
Separate compilation is //extremely important// when developing
big grammars, especially when using grammar libraries. Example: compiling
the GF resource grammar library takes 5 minutes, whereas reading
in the compiled image takes 10 seconds.
#NEW
==Techniques used==
BNFC is used for generating both the parsers and printers.
This has helped to make the formats portable.
"Almost compositional functions" (``composOp``) are used in
many compiler passes, making them easier to write and understand.
A grep on the sources reveals 40 uses (outside the definition
of ``composOp`` itself).
The key algorithmic ideas are
- type-driven partial evaluation in GF-to-GFC generation
- some ideas in GFC-to-MCFG encoding
#NEW
==Type-driven partial evaluation==
Each abstract syntax category in GF has a corresponding linearization type:
and library functions.
The compilation technique proceeds as follows:
- use eta-expansion on ``t`` to determine the canonical form of the term
```
\ $C1, ...., $Cn -> (t $C1 .... $Cn)
```
with unique variables ``$C1 .... $Cn`` for the arguments; repeat this
inside the term for records and tables
- evaluate the resulting term using the computation rules of GF
- what remains is a canonical term with ``$C1 .... $Cn`` the only
variables (the run-time input of the linearization function)
#NEW
==Eta-expanding records and tables==
For records that are valid via subtyping, eta expansion
eliminates superfluous fields:
```
{r1 = t1 ; r2 = t2} : {r1 : T1} ----> {r1 = t1}
```
For tables, the effect is always expansion, since
pattern matching can be used to represent tables
compactly:
```
table {n => "fish"} : Number => Str --->
table {
Sg => "fish" ;
Pl => "fish"
}
```
This blow-up can be mitigated by back-end optimizations (see below).
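Both eta steps can be sketched in Python (a toy model for illustration, not the GF implementation; `expand_record` and `expand_table` are invented names):

```python
# Toy model of the two eta steps (not the GF implementation).

def expand_record(record, field_types):
    """Subtyping step: keep only the fields the expected type asks for."""
    return {f: record[f] for f in field_types}

def expand_table(branches, param_values):
    """Expansion step: `branches` maps patterns to values, with "_" as
    the wildcard; the result lists every parameter value explicitly."""
    return {p: branches.get(p, branches.get("_")) for p in param_values}

# {r1 = t1 ; r2 = t2} : {r1 : T1}  --->  {r1 = t1}
assert expand_record({"r1": "t1", "r2": "t2"}, {"r1": "T1"}) == {"r1": "t1"}

# table {n => "fish"} : Number => Str expands to one branch per value:
assert expand_table({"_": "fish"}, ["Sg", "Pl"]) == {"Sg": "fish", "Pl": "fish"}
```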
#NEW
==Eliminating functions==
"Everything is finite": parameter types, records, tables;
finite number of string tokens per grammar.
But "infinite" types such as function types are useful when
writing grammars, to enable abstractions.
Since function types do not appear in linearization types,
we want functions to be eliminated from linearization terms.
This is similar to the **subformula property** in logic.
The main problem is also similar: a function depending on
a run-time variable,
```
(table {P => f ; Q => g} ! x) a
```
This is not a redex, but we can make it closer to one by moving
the application inside the table,
```
table {P => f a ; Q => g a} ! x
```
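The rewrite can be illustrated on a toy term representation in Python (invented for this sketch; the actual compiler works on its own term datatype):

```python
# Toy term representation (invented for this sketch).
from dataclasses import dataclass

@dataclass
class Table:
    branches: dict            # pattern -> right-hand side

@dataclass
class Sel:
    table: Table              # the table being selected from
    arg: str                  # a run-time variable: blocks computation

@dataclass
class App:
    fun: object
    arg: object

def push_application(term):
    """Rewrite (t ! x) a  into  table {.. => rhs a ..} ! x, so the
    applications become redexes inside the branches."""
    if isinstance(term, App) and isinstance(term.fun, Sel):
        sel = term.fun
        pushed = {p: App(rhs, term.arg) for p, rhs in sel.table.branches.items()}
        return Sel(Table(pushed), sel.arg)
    return term

# (table {P => f ; Q => g} ! x) a  --->  table {P => f a ; Q => g a} ! x
before = App(Sel(Table({"P": "f", "Q": "g"}), "x"), "a")
after = push_application(before)
assert after == Sel(Table({"P": App("f", "a"), "Q": App("g", "a")}), "x")
```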
The transformation is the same as Prawitz's (1965) elimination
of maximal segments:
```
          A       B                     A          B
 A v B  C -> D  C -> D               C -> D  C  C -> D  C
 ---------------------               ---------  ---------
       C -> D        C     ===>     A v B   D       D
       ---------------              ---------------------
              D                               D
```
#NEW
==Size effects of partial evaluation==
Irrelevant table branches are thrown away, which can reduce the size.
But, since tables are expanded and auxiliary functions are inlined,
the size can grow exponentially.
How can we keep the first and eliminate the second?
#NEW
==Parametrization of tables==
Algorithm: for each branch in a table, consider replacing the
argument by a variable:
```
table { table {
P => t ; ---> x => t[P->x] ;
Q => t x => t[Q->x]
} }
```
If the resulting branches are all equal, you can replace the table
by a lambda abstract
```
\\x => t[P->x]
```
If each created variable ``x`` is unique in the grammar, computation
with the lambda abstract is efficient.
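The check can be sketched in Python (a toy model with terms as nested tuples; `parametrize` is an invented name):

```python
# Toy model: terms are strings or nested tuples; patterns are strings.

def parametrize(branches, fresh="x"):
    """Try to replace a table by one branch over a fresh variable:
    return the common body t[P->x] if all branches agree, else None."""
    def subst(term, old, new):
        if term == old:
            return new
        if isinstance(term, tuple):
            return tuple(subst(s, old, new) for s in term)
        return term
    rewritten = [subst(t, p, fresh) for p, t in branches.items()]
    return rewritten[0] if all(r == rewritten[0] for r in rewritten) else None

# table {P => ("s", P) ; Q => ("s", Q)}  --->  \\x => ("s", x)
assert parametrize({"P": ("s", "P"), "Q": ("s", "Q")}) == ("s", "x")
# Genuinely different branches cannot be parametrized:
assert parametrize({"P": "t", "Q": "u"}) is None
```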
#NEW
==Common subexpression elimination==
Algorithm:
+ Go through all terms and subterms in a module, creating
a symbol table mapping terms to the number of occurrences.
+ For each subterm appearing at least twice, create a fresh
constant defined as that subterm.
+ Go through all rules (incl. rules for the new constants),
replacing largest possible subterms with such new constants.
This algorithm, in a way, creates the strongest possible abstractions.
In general, the new constants have open terms as definitions.
But since all variables (and constants) are unique, they can
be computed by simple replacement.
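The three steps can be sketched in Python on terms represented as nested tuples (illustration only; the fresh names `_0`, `_1`, ... mimic the generated constants):

```python
# Toy common subexpression elimination on terms as nested tuples.
from collections import Counter
from itertools import count

def subterms(term):
    """Yield a term and all its subterms."""
    yield term
    if isinstance(term, tuple):
        for s in term:
            yield from subterms(s)

def cse(rules):
    # Step 1: count occurrences of every compound subterm in all rules.
    counts = Counter(s for t in rules.values() for s in subterms(t)
                     if isinstance(s, tuple))
    # Step 2: a fresh constant for each subterm seen at least twice.
    fresh = count()
    defs = {s: f"_{next(fresh)}" for s, n in counts.items() if n >= 2}
    # Step 3: replace largest possible subterms by the new constants.
    def replace(term):
        if isinstance(term, tuple) and term in defs:
            return defs[term]
        if isinstance(term, tuple):
            return tuple(replace(s) for s in term)
        return term
    new_rules = {f: replace(t) for f, t in rules.items()}
    # Abbreviate inside the new definitions too (proper subterms only).
    new_defs = {name: tuple(replace(s) for s in body)
                for body, name in defs.items()}
    return new_rules, new_defs

rules = {"apa":    ("+", "ap",    ("a", "an", "or", "orna")),
         "blomma": ("+", "blomm", ("a", "an", "or", "orna"))}
new_rules, new_defs = cse(rules)
assert new_rules == {"apa": ("+", "ap", "_0"),
                     "blomma": ("+", "blomm", "_0")}
assert new_defs == {"_0": ("a", "an", "or", "orna")}
```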
#NEW
==Course-of-values tables==
By maintaining a canonical order of parameters in a type, we can
eliminate the left hand sides of branches.
```
table {               table T [
  P => t ;    --->      t ;
  Q => u                u
}                     ]
```
The treatment is similar to ``Enum`` instances in Haskell.
Actually, all parameter types could be translated to
initial segments of integers. This is done in the GFCC format.
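A minimal Python sketch of the idea (invented helper names; the canonical parameter order plays the role of the `Enum` instance):

```python
# Toy model: the canonical parameter order stands in for the Enum.

def to_course_of_values(branches, param_order):
    """Drop the left-hand sides: keep the values in canonical order."""
    return [branches[p] for p in param_order]

def select(values, param_order, p):
    """Selection t ! p becomes an integer-indexed lookup."""
    return values[param_order.index(p)]

order = ["P", "Q"]
tbl = to_course_of_values({"P": "t", "Q": "u"}, order)
assert tbl == ["t", "u"]          # table T [t ; u]
assert select(tbl, order, "Q") == "u"
```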
#NEW
==Size effects of optimizations==
Example: the German resource grammar
``LangGer``
|| optimization | lines | characters | size % | blow-up |
| none | 5394 | 3208435 | 100 | 25 |
| all | 5394 | 750277 | 23 | 6 |
| none_subs | 5772 | 1290866 | 40 | 10 |
| all_subs | 5644 | 414119 | 13 | 3 |
| gfcc | 3279 | 190004 | 6 | 1.5 |
| gf source | 3976 | 121939 | 4 | 1 |
Optimization "all" means parametrization + course-of-values.
The source code size is an estimate, since it includes
potentially irrelevant library modules.
The GFCC format is not reusable in separate compilation.
#NEW
==The shared prefix optimization==
This is currently performed in GFCC only.
The idea works for languages that have a rich morphology
based on suffixes. Then we can replace a course of values
with a pair of a prefix and a suffix set:
```
["apa", "apan", "apor", "aporna"] --->
("ap" + ["a", "an", "or", "orna"])
```
The real gain comes via common subexpression elimination:
```
_34 = ["a", "an", "or", "orna"]
apa = ("ap" + _34)
blomma = ("blomm" + _34)
flicka = ("flick" + _34)
```
Notice that it now matters a lot how grammars are written.
For instance, if German verbs are treated as a one-dimensional
table,
```
["lieben", "liebe", "liebst", ...., "geliebt", "geliebter",...]
```
no shared prefix optimization is possible. A better form is
separate tables for non-"ge" and "ge" forms:
```
[["lieben", "liebe", "liebst", ....], ["geliebt", "geliebter",...]]
```
#NEW
==Reuse of grammars as libraries==
The idea of resource grammars: take care of all aspects of
surface grammaticality (inflection, agreement, word order).
Reuse in application grammar: via translations
```
cat C ---> oper C : Type = T
lincat C = T
fun f : A ---> oper f : A* = t
lin f = t
```
The user only needs to know the type signatures (abstract syntax).
However, this does not quite guarantee grammaticality, because
different categories can have the same lincat:
```
lincat Conj = {s : Str}
lincat Adv = {s : Str}
```
Thus someone may by accident use "and" as an adverb!
#NEW
==Forcing the type checker to act as a grammar checker==
We just have to make linearization types unique for each category.
The technique is reminiscent of Haskell's ``newtype`` but uses
records instead: we add **lock fields** e.g.
```
lincat Conj = {s : Str ; lock_Conj : {}}
lincat Adv = {s : Str ; lock_Adv : {}}
```
Thanks to record subtyping, the translation is simple:
```
fun f : C1 -> ... -> Cn -> C
lin f = t
--->
oper f : C1* -> ... -> Cn* -> C* =
\x1,...,xn -> (t x1 ... xn) ** {lock_C = {}}
```
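A rough dynamic analogue in Python (invented for illustration; GF's lock fields are checked statically, while this sketch enforces the distinction at run time):

```python
# Dynamic stand-in for the lock-field check (GF's check is static).
from dataclasses import dataclass, field

@dataclass
class Conj:
    s: str
    lock_Conj: dict = field(default_factory=dict)   # the "lock field"

@dataclass
class Adv:
    s: str
    lock_Adv: dict = field(default_factory=dict)

def coord(x, c, y):
    """Coordination wants a Conj; the lock distinguishes it from an
    Adv even though both otherwise contain just a string."""
    if not isinstance(c, Conj):
        raise TypeError("lock mismatch: expected a Conj")
    return f"{x} {c.s} {y}"

assert coord("black", Conj("and"), "white") == "black and white"
try:
    coord("black", Adv("and"), "white")   # "and" used as an adverb
except TypeError:
    pass
else:
    raise AssertionError("lock check failed to fire")
```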