forked from GitHub/gf-core
working on gf compiler doc
This commit is contained in:
@@ -1,13 +1,23 @@
|
|||||||
Compiling GF
|
Compiling GF
|
||||||
Aarne Ranta
|
Aarne Ranta
|
||||||
|
Proglog meeting, 1 November 2006
|
||||||
|
|
||||||
|
% to compile: txt2tags -thtml compiling-gf.txt ; htmls compiling-gf.html
|
||||||
|
|
||||||
|
%!target:html
|
||||||
|
%!postproc(html): #NEW <!-- NEW -->
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
==The compilation task==
|
==The compilation task==
|
||||||
|
|
||||||
GF is a grammar formalism, i.e. a special purpose programming language
|
GF is a grammar formalism, i.e. a special purpose programming language
|
||||||
for writing grammars.
|
for writing grammars.
|
||||||
|
|
||||||
Cf: BNF, YACC, Happy (grammars for programming languages);
|
Other grammar formalisms:
|
||||||
PATR, HPSG, LFG (grammars for natural languages).
|
- BNF, YACC, Happy (grammars for programming languages);
|
||||||
|
- PATR, HPSG, LFG (grammars for natural languages).
|
||||||
|
|
||||||
|
|
||||||
The grammar compiler prepares a GF grammar for two computational tasks:
|
The grammar compiler prepares a GF grammar for two computational tasks:
|
||||||
- linearization: take syntax trees to strings
|
- linearization: take syntax trees to strings
|
||||||
@@ -15,19 +25,40 @@ The grammar compiler prepares a GF grammar for two computational tasks:
|
|||||||
|
|
||||||
|
|
||||||
The grammar gives a declarative description of these functionalities,
|
The grammar gives a declarative description of these functionalities,
|
||||||
preferably on a high abstraction level enhancing grammar writing
|
on a high abstraction level that helps grammar writing
|
||||||
productivity.
|
productivity.
|
||||||
|
|
||||||
|
Some of the ideas in GF and experience gained from it
|
||||||
|
can be useful for other special-purpose functional languages.
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
==Characteristics of GF language==
|
==Characteristics of GF language==
|
||||||
|
|
||||||
Functional language with types, both built-in and user-defined.
|
Functional language with types, both built-in and user-defined.
|
||||||
|
```
|
||||||
|
Str : Type
|
||||||
|
|
||||||
Pattern matching and higher-order functions.
|
param Number = Sg | Pl
|
||||||
|
|
||||||
|
param AdjForm = ASg Gender | APl
|
||||||
|
|
||||||
|
Noun : Type = {s : Number => Str ; g : Gender}
|
||||||
|
```
|
||||||
|
Pattern matching.
|
||||||
|
```
|
||||||
|
svart_A = table {
|
||||||
|
ASg _ => "svart" ;
|
||||||
|
_ => "svarta"
|
||||||
|
}
|
||||||
|
```
|
||||||
|
Higher-order functions.
|
||||||
|
|
||||||
Module system reminiscent of ML (signatures, structures, functors).
|
Module system reminiscent of ML (signatures, structures, functors).
|
||||||
|
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
==GF vs. Haskell==
|
==GF vs. Haskell==
|
||||||
|
|
||||||
Some things that (standard) Haskell hasn't:
|
Some things that (standard) Haskell hasn't:
|
||||||
@@ -43,17 +74,19 @@ Some things that GF hasn't:
|
|||||||
- classes, polymorphism
|
- classes, polymorphism
|
||||||
|
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
==GF vs. most linguistic grammar formalisms==
|
==GF vs. most linguistic grammar formalisms==
|
||||||
|
|
||||||
GF separates abstract syntax from concrete syntax
|
GF separates abstract syntax from concrete syntax.
|
||||||
|
|
||||||
GF has a module system with separate compilation
|
GF has a module system with separate compilation.
|
||||||
|
|
||||||
GF is generation-oriented (as opposed to parsing)
|
GF is generation-oriented (as opposed to parsing).
|
||||||
|
|
||||||
GF has unidirectional matching (as opposed to unification)
|
GF has unidirectional matching (as opposed to unification).
|
||||||
|
|
||||||
GF has a static type system (as opposed to a type-free universe)
|
GF has a static type system (as opposed to a type-free universe).
|
||||||
|
|
||||||
"I was - and I still am - firmly convinced that a program composed
|
"I was - and I still am - firmly convinced that a program composed
|
||||||
out of statically type-checked parts is more likely to faithfully
|
out of statically type-checked parts is more likely to faithfully
|
||||||
@@ -62,11 +95,13 @@ weakly-typed interfaces or dynamically-checked interfaces."
|
|||||||
(B. Stroustrup, 1994, p. 107)
|
(B. Stroustrup, 1994, p. 107)
|
||||||
|
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
==The computation model==
|
==The computation model==
|
||||||
|
|
||||||
An abstract syntax defines a free algebra of trees (using
|
An abstract syntax defines a free algebra of trees (using
|
||||||
dependent types, recursion, higher-order abstract syntax: GF has a
|
dependent types, recursion, higher-order abstract syntax:
|
||||||
complete Logical Framework).
|
GF includes a complete Logical Framework).
|
||||||
|
|
||||||
A concrete syntax defines a homomorphism (compositional mapping)
|
A concrete syntax defines a homomorphism (compositional mapping)
|
||||||
from the abstract syntax to a system of tuples of strings.
|
from the abstract syntax to a system of tuples of strings.
|
||||||
@@ -77,6 +112,8 @@ The parsing problem can be reduced to that of MPCFG (Multiple
|
|||||||
Parallel Context Free Grammars), see P. Ljunglöf's thesis (2004).
|
Parallel Context Free Grammars), see P. Ljunglöf's thesis (2004).
|
||||||
|
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
==The compilation task, again==
|
==The compilation task, again==
|
||||||
|
|
||||||
1. From a GF source grammar, derive a canonical GF grammar
|
1. From a GF source grammar, derive a canonical GF grammar
|
||||||
@@ -97,6 +134,8 @@ Moreover, we generate supplementary formats such as grammars
|
|||||||
required by various speech recognition systems.
|
required by various speech recognition systems.
|
||||||
|
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
==An overview of compilation phases==
|
==An overview of compilation phases==
|
||||||
|
|
||||||
Legend:
|
Legend:
|
||||||
@@ -110,19 +149,23 @@ Legend:
|
|||||||
[gf-compiler.png]
|
[gf-compiler.png]
|
||||||
|
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
==Using the compiler==
|
==Using the compiler==
|
||||||
|
|
||||||
Batch mode (cf. GHC)
|
Batch mode (cf. GHC).
|
||||||
|
|
||||||
Interactive mode, building the grammar incrementally from
|
Interactive mode, building the grammar incrementally from
|
||||||
different files, with the possibility of testing them
|
different files, with the possibility of testing them
|
||||||
(cf. GHCI)
|
(cf. GHCI).
|
||||||
|
|
||||||
The interactive mode was first, built on the model of ALF-2
|
The interactive mode was first, built on the model of ALF-2
|
||||||
(L. Magnusson), and there was no file output of compiled
|
(L. Magnusson), and there was no file output of compiled
|
||||||
grammars.
|
grammars.
|
||||||
|
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
==Modules and separate compilation==
|
==Modules and separate compilation==
|
||||||
|
|
||||||
The above diagram shows what happens to each module.
|
The above diagram shows what happens to each module.
|
||||||
@@ -137,11 +180,13 @@ details (although it would be beneficial to make explicit the
|
|||||||
rules that are right now just in the implementation...)
|
rules that are right now just in the implementation...)
|
||||||
|
|
||||||
Separate compilation is //extremely important// when developing
|
Separate compilation is //extremely important// when developing
|
||||||
big grammars, especially when using grammar libraries. Compiling
|
big grammars, especially when using grammar libraries. Example: compiling
|
||||||
the GF resource grammar library takes 5 minutes, whereas reading
|
the GF resource grammar library takes 5 minutes, whereas reading
|
||||||
in the compiled image takes 10 seconds.
|
in the compiled image takes 10 seconds.
|
||||||
|
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
==Techniques used==
|
==Techniques used==
|
||||||
|
|
||||||
BNFC is used for generating both the parsers and printers.
|
BNFC is used for generating both the parsers and printers.
|
||||||
@@ -149,7 +194,8 @@ This has helped to make the formats portable.
|
|||||||
|
|
||||||
"Almost compositional functions" (``composOp``) are used in
|
"Almost compositional functions" (``composOp``) are used in
|
||||||
many compiler passes, making them easier to write and understand.
|
many compiler passes, making them easier to write and understand.
|
||||||
A grep on the sources reveals 40 uses (outside the definition itself).
|
A grep on the sources reveals 40 uses (outside the definition
|
||||||
|
of ``composOp`` itself).
|
||||||
|
|
||||||
The key algorithmic ideas are
|
The key algorithmic ideas are
|
||||||
- type-driven partial evaluation in GF-to-GFC generation
|
- type-driven partial evaluation in GF-to-GFC generation
|
||||||
@@ -157,6 +203,8 @@ The key algorithmic ideas are
|
|||||||
- some ideas in GFC-to-MCFG encoding
|
- some ideas in GFC-to-MCFG encoding
|
||||||
|
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
==Type-driven partial evaluation==
|
==Type-driven partial evaluation==
|
||||||
|
|
||||||
Each abstract syntax category in GF has a corresponding linearization type:
|
Each abstract syntax category in GF has a corresponding linearization type:
|
||||||
@@ -180,10 +228,253 @@ and library functions.
|
|||||||
The compilation technique proceeds as follows:
|
The compilation technique proceeds as follows:
|
||||||
- use eta-expansion on ``t`` to determine the canonical form of the term
|
- use eta-expansion on ``t`` to determine the canonical form of the term
|
||||||
```
|
```
|
||||||
\ <C1>, ...., <Cn> -> (t <C1> .... <Cn>)
|
\ $C1, ...., $Cn -> (t $C1 .... $Cn)
|
||||||
```
|
```
|
||||||
with unique variables ``<C1> .... <Cn>`` for the arguments; repeat this
|
with unique variables ``$C1 .... $Cn`` for the arguments; repeat this
|
||||||
inside the term for records and tables
|
inside the term for records and tables
|
||||||
- evaluate the resulting term using the computation rules of GF
|
- evaluate the resulting term using the computation rules of GF
|
||||||
- what remains is a canonical term with ``<C1> .... <Cn>`` the only
|
- what remains is a canonical term with ``$C1 .... $Cn`` the only
|
||||||
variables (the run-time input of the linearization function)
|
variables (the run-time input of the linearization function)
|
||||||
|
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
|
==Eta-expanding records and tables==
|
||||||
|
|
||||||
|
For records that are valied via subtyping, eta expansion
|
||||||
|
eliminates superfluous fields:
|
||||||
|
```
|
||||||
|
{r1 = t1 ; r2 = t2} : {r1 : T1} ----> {r1 = t1}
|
||||||
|
```
|
||||||
|
For tables, the effect is always expansion, since
|
||||||
|
pattern matching can be used to represent tables
|
||||||
|
compactly:
|
||||||
|
```
|
||||||
|
table {n => "fish"} : Number => Str --->
|
||||||
|
|
||||||
|
table {
|
||||||
|
Sg => "fish" ;
|
||||||
|
Pl => "fish"
|
||||||
|
}
|
||||||
|
```
|
||||||
|
This can be helped by back-end optimizations (see below).
|
||||||
|
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
|
==Eliminating functions==
|
||||||
|
|
||||||
|
"Everything is finite": parameter types, records, tables;
|
||||||
|
finite number of string tokens per grammar.
|
||||||
|
|
||||||
|
But "inifinite types" such as function types are useful when
|
||||||
|
writing grammars, to enable abstractions.
|
||||||
|
|
||||||
|
Since function types do not appear in linearization types,
|
||||||
|
we want functions to be eliminated from linearization terms.
|
||||||
|
|
||||||
|
This is similar to the **subformula property** in logic.
|
||||||
|
Also the main problem is similar: function depending on
|
||||||
|
a run-time variable,
|
||||||
|
```
|
||||||
|
(table {P => f ; Q = g} ! x) a
|
||||||
|
```
|
||||||
|
This is not a redex, but we can make it closer to one by moving
|
||||||
|
the application inside the table,
|
||||||
|
```
|
||||||
|
table {P => f a ; Q = g a} ! x
|
||||||
|
```
|
||||||
|
The transformation is the same as Prawitz's (1965) elimination
|
||||||
|
of maximal segments:
|
||||||
|
```
|
||||||
|
A B
|
||||||
|
C -> D C C -> D C
|
||||||
|
A B --------- ---------
|
||||||
|
A v B C -> D C -> D A v B D D
|
||||||
|
--------------------- ===> -------------------------
|
||||||
|
C -> D C D
|
||||||
|
--------------------
|
||||||
|
D
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
|
==Size effects of partial evaluation==
|
||||||
|
|
||||||
|
Irrelevant table branches are thrown away, which can reduce the size.
|
||||||
|
|
||||||
|
But, since tables are expanded and auxiliary functions are inlined,
|
||||||
|
the size can grow exponentially.
|
||||||
|
|
||||||
|
How can we keep the first and eliminate the second?
|
||||||
|
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
|
==Parametrization of tables==
|
||||||
|
|
||||||
|
Algorithm: for each branch in a table, consider replacing the
|
||||||
|
argument by a variable:
|
||||||
|
```
|
||||||
|
table { table {
|
||||||
|
P => t ; ---> x => t[P->x] ;
|
||||||
|
Q => t x => t[Q->x]
|
||||||
|
} }
|
||||||
|
```
|
||||||
|
If the resulting branches are all equal, you can replace the table
|
||||||
|
by a lambda abstract
|
||||||
|
```
|
||||||
|
\\x => t[P->x]
|
||||||
|
```
|
||||||
|
If each created variable ``x`` is unique in the grammar, computation
|
||||||
|
with the lambda abstract is efficient.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
|
==Common subexpression elimination==
|
||||||
|
|
||||||
|
Algorithm:
|
||||||
|
+ Go through all terms and subterms in a module, creating
|
||||||
|
a symbol table mapping terms to the number of occurrences.
|
||||||
|
+ For each subterm appearing at least twice, create a fresh
|
||||||
|
constant defined as that subterm.
|
||||||
|
+ Go through all rules (incl. rules for the new constants),
|
||||||
|
replacing largest possible subterms with such new constants.
|
||||||
|
|
||||||
|
|
||||||
|
This algorithm, in a way, creates the strongest possible abstractions.
|
||||||
|
|
||||||
|
In general, the new constants have open terms as definitions.
|
||||||
|
But since all variables (and constants) are unique, they can
|
||||||
|
be computed by simple replacement.
|
||||||
|
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
|
==Course-of-values tables==
|
||||||
|
|
||||||
|
By maintaining a canonical order of parameters in a type, we can
|
||||||
|
eliminate the left hand sides of branches.
|
||||||
|
```
|
||||||
|
table { table T [
|
||||||
|
P => t ; ---> P ;
|
||||||
|
Q => u Q
|
||||||
|
} ]
|
||||||
|
```
|
||||||
|
The treatment is similar to ``Enum`` instances in Haskell.
|
||||||
|
|
||||||
|
Actually, all parameter types could be translated to
|
||||||
|
initial segments of integers. This is done in the GFCC format.
|
||||||
|
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
|
==Size effects of optimizations==
|
||||||
|
|
||||||
|
Example: the German resource grammar
|
||||||
|
``LangGer``
|
||||||
|
|
||||||
|
|| optimization | lines | characters | size % | blow-up |
|
||||||
|
| none | 5394 | 3208435 | 100 | 25 |
|
||||||
|
| all | 5394 | 750277 | 23 | 6 |
|
||||||
|
| none_subs | 5772 | 1290866 | 40 | 10 |
|
||||||
|
| all_subs | 5644 | 414119 | 13 | 3 |
|
||||||
|
| gfcc | 3279 | 190004 | 6 | 1.5 |
|
||||||
|
| gf source | 3976 | 121939 | 4 | 1 |
|
||||||
|
|
||||||
|
|
||||||
|
Optimization "all" means parametrization + course-of-values.
|
||||||
|
|
||||||
|
The source code size is an estimate, since it includes
|
||||||
|
potentially irrelevant library modules.
|
||||||
|
|
||||||
|
The GFCC format is not reusable in separate compilation.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
|
==The shared prefix optimization==
|
||||||
|
|
||||||
|
This is currently performed in GFCC only.
|
||||||
|
|
||||||
|
The idea works for languages that have a rich morphology
|
||||||
|
based on suffixes. Then we can replace a course of values
|
||||||
|
with a pair of a prefix and a suffix set:
|
||||||
|
```
|
||||||
|
["apa", "apan", "apor", "aporna"] --->
|
||||||
|
("ap" + ["a", "an", "or", "orna"])
|
||||||
|
```
|
||||||
|
The real gain comes via common subexpression elimination:
|
||||||
|
```
|
||||||
|
_34 = ["a", "an", "or", "orna"]
|
||||||
|
apa = ("ap" + _34)
|
||||||
|
blomma = ("blomm" + _34)
|
||||||
|
flicka = ("flick" + _34)
|
||||||
|
```
|
||||||
|
Notice that it now matters a lot how grammars are written.
|
||||||
|
For instance, if German verbs are treated as a one-dimensional
|
||||||
|
table,
|
||||||
|
```
|
||||||
|
["lieben", "liebe", "liebst", ...., "geliebt", "geliebter",...]
|
||||||
|
```
|
||||||
|
no shared prefix optimization is possible. A better form is
|
||||||
|
separate tables for non-"ge" and "ge" forms:
|
||||||
|
```
|
||||||
|
[["lieben", "liebe", "liebst", ....], ["geliebt", "geliebter",...]]
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
|
==Reuse of grammars as libraries==
|
||||||
|
|
||||||
|
The idea of resource grammars: take care of all aspects of
|
||||||
|
surface grammaticality (inflection, agreement, word order).
|
||||||
|
|
||||||
|
Reuse in application grammar: via translations
|
||||||
|
```
|
||||||
|
cat C ---> oper C : Type = T
|
||||||
|
lincat C = T
|
||||||
|
|
||||||
|
fun f : A ---> oper f : A* = t
|
||||||
|
lin f = t
|
||||||
|
```
|
||||||
|
The user only needs to know the type signatures (abstract syntax).
|
||||||
|
|
||||||
|
However, this does not quite guarantee grammaticality, because
|
||||||
|
different categories can have the same lincat:
|
||||||
|
```
|
||||||
|
lincat Conj = {s : Str}
|
||||||
|
lincat Adv = {s : Str}
|
||||||
|
```
|
||||||
|
Thus someone may by accident use "and" as an adverb!
|
||||||
|
|
||||||
|
|
||||||
|
#NEW
|
||||||
|
|
||||||
|
==Forcing the type checker to act as a grammar checker==
|
||||||
|
|
||||||
|
We just have to make linearization types unique for each category.
|
||||||
|
|
||||||
|
The technique is reminiscent of Haskell's ``newtype`` but uses
|
||||||
|
records instead: we add **lock fields** e.g.
|
||||||
|
```
|
||||||
|
lincat Conj = {s : Str ; lock_Conj : {}}
|
||||||
|
lincat Adv = {s : Str ; lock_Adv : {}}
|
||||||
|
```
|
||||||
|
Thanks to record subtyping, the translation is simple:
|
||||||
|
```
|
||||||
|
fun f : C1 -> ... -> Cn -> C
|
||||||
|
lin f = t
|
||||||
|
|
||||||
|
--->
|
||||||
|
|
||||||
|
oper f : C1* -> ... -> Cn* -> C* =
|
||||||
|
\x1,...,xn -> (t x1 ... xn) ** {lock_C = {}}
|
||||||
|
```
|
||||||
|
|
||||||
|
|||||||
Reference in New Issue
Block a user