From 6b00e78f66cb0aa82ffd3796b5d4198614f6029b Mon Sep 17 00:00:00 2001
From: aarne
Date: Sun, 22 Oct 2006 14:17:50 +0000
Subject: [PATCH] working on gf compiler doc

---
 doc/compiling-gf.txt | 327 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 309 insertions(+), 18 deletions(-)

diff --git a/doc/compiling-gf.txt b/doc/compiling-gf.txt
index bba946b14..6c179115c 100644
--- a/doc/compiling-gf.txt
+++ b/doc/compiling-gf.txt
@@ -1,13 +1,23 @@
 Compiling GF
 Aarne Ranta
+Proglog meeting, 1 November 2006
+
+% to compile: txt2tags -thtml compiling-gf.txt ; htmls compiling-gf.html
+
+%!target:html
+%!postproc(html): #NEW
+
+#NEW
 
 ==The compilation task==
 
 GF is a grammar formalism, i.e. a special purpose programming
 language for writing grammars.
 
-Cf: BNF, YACC, Happy (grammars for programming languages);
-PATR, HPSG, LFG (grammars for natural languages).
+Other grammar formalisms:
+- BNF, YACC, Happy (grammars for programming languages);
+- PATR, HPSG, LFG (grammars for natural languages).
+
 
 The grammar compiler prepares a GF grammar for two computational tasks:
 - linearization: take syntax trees to strings
@@ -15,19 +25,40 @@ The grammar compiler prepares a GF grammar for two computational tasks:
 
 The grammar gives a declarative description of these functionalities,
-preferably on a high abstraction level enhancing grammar writing
+on a high abstraction level that helps grammar writing
 productivity.
 
+Some of the ideas in GF and experience gained from it
+can be useful for other special-purpose functional languages.
+
+#NEW
+
 ==Characteristics of GF language==
 
 Functional language with types, both built-in and user-defined.
+```
+  Str : Type
 
-Pattern matching and higher-order functions.
+  param Number = Sg | Pl
+
+  param AdjForm = ASg Gender | APl
+
+  Noun : Type = {s : Number => Str ; g : Gender}
+```
+Pattern matching.
+```
+  svart_A = table {
+    ASg _ => "svart" ;
+    _     => "svarta"
+    }
+```
+Higher-order functions.
 Module system reminiscent of ML (signatures, structures, functors).
 
+#NEW
+
 ==GF vs. Haskell==
 
 Some things that (standard) Haskell hasn't:
@@ -43,17 +74,19 @@ Some things that GF hasn't:
 - classes, polymorphism
 
+#NEW
+
 ==GF vs. most linguistic grammar formalisms==
 
-GF separates abstract syntax from concrete syntax
+GF separates abstract syntax from concrete syntax.
 
-GF has a module system with separate compilation
+GF has a module system with separate compilation.
 
-GF is generation-oriented (as opposed to parsing)
+GF is generation-oriented (as opposed to parsing).
 
-GF has unidirectional matching (as opposed to unification)
+GF has unidirectional matching (as opposed to unification).
 
-GF has a static type system (as opposed to a type-free universe)
+GF has a static type system (as opposed to a type-free universe).
 
 "I was - and I still am - firmly convinced that a program composed
 out of statically type-checked parts is more likely to faithfully
@@ -62,11 +95,13 @@
 weakly-typed interfaces or dynamically-checked interfaces."
 (B. Stroustrup, 1994, p. 107)
 
+#NEW
+
 ==The computation model==
 
 An abstract syntax defines a free algebra of trees (using
-dependent types, recursion, higher-order abstract syntax: GF has a
-complete Logical Framework).
+dependent types, recursion, higher-order abstract syntax:
+GF includes a complete Logical Framework).
 
 A concrete syntax defines a homomorphism (compositional mapping)
 from the abstract syntax to a system of tuples of strings.
@@ -77,6 +112,8 @@
 The parsing problem can be reduced to that of MPCFG (Multiple
 Parallel Context Free Grammars), see P. Ljunglöf's thesis (2004).
 
+#NEW
+
 ==The compilation task, again==
 
 1. From a GF source grammar, derive a canonical GF grammar
@@ -97,6 +134,8 @@
 Moreover, we generate supplementary formats such as
 grammars required by various speech recognition systems.
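The homomorphism view described in the computation model can be made concrete with a small sketch. The following Python fragment is purely illustrative (the grammar, constructor names, and data layout are invented; GF's real representation differs): trees are constructor/children tuples, and linearization maps each constructor compositionally onto a record of strings, with finite tables as dictionaries.

```python
# Minimal sketch of linearization as a homomorphism (illustrative only,
# not GF's actual data structures). Trees are (constructor, children)
# pairs; each constructor has a linearization that combines the
# linearizations of its children compositionally.

def lin_noun(sg, pl):
    # a noun linearizes to a table from number to string
    return {"s": {"Sg": sg, "Pl": pl}}

LIN = {
    # Fish : N
    "Fish": lambda: lin_noun("fish", "fish"),
    # Apple : N
    "Apple": lambda: lin_noun("apple", "apples"),
    # These : N -> NP  (the determiner selects the plural form)
    "These": lambda n: {"s": "these " + n["s"]["Pl"]},
}

def linearize(tree):
    """Compositional mapping: linearize children first, then combine."""
    fun, *args = tree
    return LIN[fun](*(linearize(a) for a in args))

print(linearize(("These", ("Apple",)))["s"])   # these apples
```

The point of the sketch is compositionality: the linearization of a tree depends only on the linearizations of its subtrees, which is what makes the later compilation to MCFG-style formats possible.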
+#NEW
+
 ==An overview of compilation phases==
 
 Legend:
@@ -110,19 +149,23 @@ Legend:
 
 [gf-compiler.png]
 
+#NEW
+
 ==Using the compiler==
 
-Batch mode (cf. GHC)
+Batch mode (cf. GHC).
 
 Interactive mode, building the grammar incrementally
 from different files, with the possibility of testing them
-(cf. GHCI)
+(cf. GHCI).
 
 The interactive mode was first, built on the model of
 ALF-2 (L. Magnusson), and there was no file output of
 compiled grammars.
 
+#NEW
+
 ==Modules and separate compilation==
 
 The above diagram shows what happens to each module.
@@ -137,11 +180,13 @@
 details (although it would be beneficial to make explicit
 the rules that are right now just in the implementation...)
 
 Separate compilation is //extremely important// when developing
-big grammars, especially when using grammar libraries. Compiling
+big grammars, especially when using grammar libraries. Example: compiling
 the GF resource grammar library takes 5 minutes, whereas reading
 in the compiled image takes 10 seconds.
 
+#NEW
+
 ==Techniques used==
 
 BNFC is used for generating both the parsers and printers.
@@ -149,7 +194,8 @@
 This has helped to make the formats portable.
 
 "Almost compositional functions" (``composOp``) are used in many
 compiler passes, making them easier to write and understand.
-A grep on the sources reveals 40 uses (outside the definition itself).
+A grep on the sources reveals 40 uses (outside the definition
+of ``composOp`` itself).
 
 The key algorithmic ideas are
 - type-driven partial evaluation in GF-to-GFC generation
@@ -157,6 +203,8 @@
 - some ideas in GFC-to-MCFG encoding
 
+#NEW
+
 ==Type-driven partial evaluation==
 
 Each abstract syntax category in GF has a corresponding linearization type:
@@ -180,10 +228,253 @@
 and library functions. The compilation technique proceeds as follows:
 - use eta-expansion on ``t`` to determine the canonical form of the term
 ```
-  \ , ...., -> (t .... )
+  \ $C1, ...., $Cn -> (t $C1 .... $Cn)
 ```
-with unique variables `` .... `` for the arguments; repeat this
+with unique variables ``$C1 .... $Cn`` for the arguments; repeat this
 inside the term for records and tables
 - evaluate the resulting term using the computation rules of GF
-- what remains is a canonical term with `` .... `` the only
+- what remains is a canonical term with ``$C1 .... $Cn`` the only
 variables (the run-time input of the linearization function)
+
+
+#NEW
+
+==Eta-expanding records and tables==
+
+For records that are valid via subtyping, eta expansion
+eliminates superfluous fields:
+```
+  {r1 = t1 ; r2 = t2} : {r1 : T1}  ---->  {r1 = t1}
+```
+For tables, the effect is always expansion, since
+pattern matching can be used to represent tables
+compactly:
+```
+  table {n => "fish"} : Number => Str  --->
+
+  table {
+    Sg => "fish" ;
+    Pl => "fish"
+    }
+```
+This can be helped by back-end optimizations (see below).
+
+
+#NEW
+
+==Eliminating functions==
+
+"Everything is finite": parameter types, records, tables;
+finite number of string tokens per grammar.
+
+But "infinite types" such as function types are useful when
+writing grammars, to enable abstractions.
+
+Since function types do not appear in linearization types,
+we want functions to be eliminated from linearization terms.
+
+This is similar to the **subformula property** in logic.
+Also the main problem is similar: a function depending on
+a run-time variable,
+```
+  (table {P => f ; Q => g} ! x) a
+```
+This is not a redex, but we can make it closer to one by moving
+the application inside the table,
+```
+  table {P => f a ; Q => g a} ! x
+```
+The transformation is the same as Prawitz's (1965) elimination
+of maximal segments:
+```
+                                    A            B
+                               C -> D   C   C -> D   C
+    A       B                  ---------    ---------
+  A v B   C -> D   C -> D        A v B    D          D
+  ---------------------  ===>  -------------------------
+      C -> D         C                    D
+  --------------------
+          D
+```
+
+
+#NEW
+
+==Size effects of partial evaluation==
+
+Irrelevant table branches are thrown away, which can reduce the size.
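The expansion behind this trade-off can be sketched in a few lines. The following Python fragment is a toy model (the function names and table representation are invented, not GF's API): a compact pattern-matching table over a finite parameter type is eta-expanded into one explicit branch per parameter value.

```python
# Illustrative sketch of table eta-expansion: a compact table with
# wildcard patterns is expanded to one explicit branch per value of a
# finite parameter type, which is why sizes can grow.

NUMBER = ["Sg", "Pl"]          # a finite parameter type

def expand(table, param_values):
    """Expand patterns into one explicit branch per parameter value."""
    out = {}
    for v in param_values:
        # first matching branch wins; "_" matches anything
        for pat, rhs in table:
            if pat == "_" or pat == v:
                out[v] = rhs
                break
    return out

# table {n => "fish"} : Number => Str
print(expand([("_", "fish")], NUMBER))  # {'Sg': 'fish', 'Pl': 'fish'}
```

The one-branch source table becomes a two-branch expanded table; over parameter types with dozens of values (case, number, gender combined), the same expansion accounts for the exponential growth discussed next.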
+
+But, since tables are expanded and auxiliary functions are inlined,
+the size can grow exponentially.
+
+How can we keep the first and eliminate the second?
+
+
+#NEW
+
+==Parametrization of tables==
+
+Algorithm: for each branch in a table, consider replacing the
+argument by a variable:
+```
+  table {             table {
+    P => t ;   --->     x => t[P->x] ;
+    Q => t              x => t[Q->x]
+    }                   }
+```
+If the resulting branches are all equal, you can replace the table
+by a lambda abstract
+```
+  \\x => t[P->x]
+```
+If each created variable ``x`` is unique in the grammar, computation
+with the lambda abstract is efficient.
+
+
+#NEW
+
+==Common subexpression elimination==
+
+Algorithm:
++ Go through all terms and subterms in a module, creating
+  a symbol table mapping terms to the number of occurrences.
++ For each subterm appearing at least twice, create a fresh
+  constant defined as that subterm.
++ Go through all rules (incl. rules for the new constants),
+  replacing largest possible subterms with such new constants.
+
+
+This algorithm, in a way, creates the strongest possible abstractions.
+
+In general, the new constants have open terms as definitions.
+But since all variables (and constants) are unique, they can
+be computed by simple replacement.
+
+
+#NEW
+
+==Course-of-values tables==
+
+By maintaining a canonical order of parameters in a type, we can
+eliminate the left-hand sides of branches.
+```
+  table {            table T [
+    P => t ;   --->    t ;
+    Q => u             u
+    }                  ]
+```
+The treatment is similar to ``Enum`` instances in Haskell.
+
+Actually, all parameter types could be translated to
+initial segments of integers. This is done in the GFCC format.
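The course-of-values encoding just described can be sketched in a few lines of Python (illustrative names only; GFCC's actual encoding differs): fix a canonical order for the values of the parameter type, store only the right-hand sides, and turn selection into integer indexing.

```python
# Sketch of the course-of-values representation: left-hand sides of
# table branches are dropped; a canonical order of parameter values
# determines which position holds which form.

NUMBER = ["Sg", "Pl"]                       # canonical order of the type
INDEX = {v: i for i, v in enumerate(NUMBER)}

def to_course_of_values(table):
    """Keep only the right-hand sides, in canonical order."""
    return [table[v] for v in NUMBER]

def select(cov, value):
    """Selection becomes integer indexing, as with Enum in Haskell."""
    return cov[INDEX[value]]

cov = to_course_of_values({"Sg": "fish", "Pl": "fishes"})
print(cov)                # ['fish', 'fishes']
print(select(cov, "Pl"))  # fishes
```

Translating every parameter type to an initial segment of the integers, as in GFCC, is exactly the `INDEX` mapping applied grammar-wide.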
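The three-step common-subexpression algorithm above can likewise be sketched on a toy term language (tuples as applications, strings as atoms; all names are invented, and the sketch simplifies by not re-processing the new constants' own definitions):

```python
# Sketch of common subexpression elimination: count subterm occurrences,
# introduce a fresh constant for each subterm seen at least twice, then
# replace largest possible subterms by the new constants.
from collections import Counter

def subterms(t):
    yield t
    if isinstance(t, tuple):
        for child in t:
            yield from subterms(child)

def cse(rules):
    counts = Counter()
    for rhs in rules.values():
        counts.update(s for s in subterms(rhs) if isinstance(s, tuple))
    fresh = {t: "_%d" % i
             for i, t in enumerate(t for t, n in counts.items() if n >= 2)}

    def replace(t):
        if t in fresh:                 # largest subterm first: top-down
            return fresh[t]
        if isinstance(t, tuple):
            return tuple(replace(c) for c in t)
        return t

    new_rules = {f: replace(rhs) for f, rhs in rules.items()}
    new_rules.update({name: term for term, name in fresh.items()})
    return new_rules

rules = {
    "apa":    ("+", "ap",    ("a", "an", "or", "orna")),
    "blomma": ("+", "blomm", ("a", "an", "or", "orna")),
}
print(cse(rules)["apa"])   # ('+', 'ap', '_0')
```

The shared suffix tuple is factored out into the constant `_0`, mirroring how the shared prefix optimization pays off once combined with this pass.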
+
+
+#NEW
+
+==Size effects of optimizations==
+
+Example: the German resource grammar
+``LangGer``
+
+|| optimization | lines | characters | size % | blow-up |
+| none      | 5394 | 3208435 | 100 | 25  |
+| all       | 5394 |  750277 |  23 |  6  |
+| none_subs | 5772 | 1290866 |  40 | 10  |
+| all_subs  | 5644 |  414119 |  13 |  3  |
+| gfcc      | 3279 |  190004 |   6 | 1.5 |
+| gf source | 3976 |  121939 |   4 |  1  |
+
+
+Optimization "all" means parametrization + course-of-values.
+
+The source code size is an estimate, since it includes
+potentially irrelevant library modules.
+
+The GFCC format is not reusable in separate compilation.
+
+
+#NEW
+
+==The shared prefix optimization==
+
+This is currently performed in GFCC only.
+
+The idea works for languages that have a rich morphology
+based on suffixes. Then we can replace a course of values
+with a pair of a prefix and a suffix set:
+```
+  ["apa", "apan", "apor", "aporna"]  --->
+  ("ap" + ["a", "an", "or", "orna"])
+```
+The real gain comes via common subexpression elimination:
+```
+  _34    = ["a", "an", "or", "orna"]
+  apa    = ("ap"    + _34)
+  blomma = ("blomm" + _34)
+  flicka = ("flick" + _34)
+```
+Notice that it now matters a lot how grammars are written.
+For instance, if German verbs are treated as a one-dimensional
+table,
+```
+  ["lieben", "liebe", "liebst", ...., "geliebt", "geliebter",...]
+```
+no shared prefix optimization is possible. A better form is
+separate tables for non-"ge" and "ge" forms:
+```
+  [["lieben", "liebe", "liebst", ....], ["geliebt", "geliebter",...]]
+```
+
+
+#NEW
+
+==Reuse of grammars as libraries==
+
+The idea of resource grammars: take care of all aspects of
+surface grammaticality (inflection, agreement, word order).
+
+Reuse in application grammar: via translations
+```
+  cat C          --->  oper C : Type = T
+  lincat C = T
+
+  fun f : A      --->  oper f : A* = t
+  lin f = t
+```
+The user only needs to know the type signatures (abstract syntax).
+
+However, this does not quite guarantee grammaticality, because
+different categories can have the same lincat:
+```
+  lincat Conj = {s : Str}
+  lincat Adv  = {s : Str}
+```
+Thus someone may by accident use "and" as an adverb!
+
+
+#NEW
+
+==Forcing the type checker to act as a grammar checker==
+
+We just have to make linearization types unique for each category.
+
+The technique is reminiscent of Haskell's ``newtype`` but uses
+records instead: we add **lock fields**, e.g.
+```
+  lincat Conj = {s : Str ; lock_Conj : {}}
+  lincat Adv  = {s : Str ; lock_Adv  : {}}
+```
+Thanks to record subtyping, the translation is simple:
+```
+  fun f : C1 -> ... -> Cn -> C
+  lin f = t
+
+  --->
+
+  oper f : C1* -> ... -> Cn* -> C* =
+    \x1,...,xn -> (t x1 ... xn) ** {lock_C = {}}
+```
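The effect of lock fields can be mimicked in a few lines of Python (purely illustrative: in GF the check is static, performed by the type checker at compile time, whereas here it is a dynamic check on dictionary keys):

```python
# Sketch of the lock-field idea: two records with identical s-fields
# are kept apart by a category-specific lock field.

conj_and = {"s": "and",  "lock_Conj": {}}
adv_here = {"s": "here", "lock_Adv": {}}

def use_as_conj(record):
    """Accept only records carrying the Conj lock field."""
    if "lock_Conj" not in record:
        raise TypeError("not a Conj")
    return record["s"]

print(use_as_conj(conj_and))     # and
try:
    use_as_conj(adv_here)        # same s-field, but the wrong lock
except TypeError as err:
    print(err)                   # not a Conj
```

Record subtyping makes the scheme cheap: adding the empty lock field with `**` costs nothing at run time, yet it is enough to make the linearization types of distinct categories incompatible.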