Compiling GF

Aarne Ranta

Proglog meeting, 1 November 2006

% to compile: txt2tags -thtml compiling-gf.txt ; htmls compiling-gf.html

%!target:html
%!postproc(html): #NEW <!-- NEW -->

#NEW

==The compilation task==

GF is a grammar formalism, i.e. a special-purpose programming language
for writing grammars.

Other grammar formalisms:
- BNF, YACC, Happy (grammars for programming languages);
- PATR, HPSG, LFG (grammars for natural languages).


The grammar compiler prepares a GF grammar for two computational tasks:
- linearization: take syntax trees to strings
- parsing: take strings to syntax trees

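The two tasks can be illustrated outside GF. Below is a minimal Python sketch (a hypothetical one-rule mini-grammar, not GF code) showing linearization as a recursive map from syntax trees to strings:

```python
# Syntax trees as nested tuples; linearization maps each rule to a
# string-building operation. "Pred", "John", "walk" are made-up rules.

def linearize(tree):
    """Map a syntax tree to a string, rule by rule."""
    fun, *args = tree
    if fun == "Pred":            # Pred : NP -> V -> S
        np, v = (linearize(a) for a in args)
        return np + " " + v
    if fun == "John":
        return "John"
    if fun == "walk":
        return "walks"
    raise ValueError(fun)

print(linearize(("Pred", ("John",), ("walk",))))   # John walks
```

Parsing is the inverse problem: recovering the tree from the string.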

The grammar gives a declarative description of these functionalities,
on a high abstraction level that helps grammar writing
productivity.

Some of the ideas in GF and experience gained from it
can be useful for other special-purpose functional languages.

#NEW

==Characteristics of GF language==

Functional language with types, both built-in and user-defined.
```
Str : Type

param Number = Sg | Pl

param AdjForm = ASg Gender | APl

Noun : Type = {s : Number => Str ; g : Gender}
```
Pattern matching.
```
svart_A = table {
  ASg _ => "svart" ;
  _ => "svarta"
  }
```
Higher-order functions.

Module system reminiscent of ML (signatures, structures, functors).

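The table above can be read as a finite function on parameter values. A Python sketch of that reading (parameter values modelled, by assumption, as tuples such as ``("ASg", "Utr")``, with ``"Utr"`` a hypothetical Gender value):

```python
# A GF table is a finite function from parameter values to strings;
# the "_" pattern is a catch-all, tried after the more specific cases.

def svart_A(form):
    if form[0] == "ASg":        # ASg _ => "svart"
        return "svart"
    return "svarta"             # _ => "svarta"

print(svart_A(("ASg", "Utr")))   # svart
print(svart_A(("APl",)))         # svarta
```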
#NEW

==GF vs. Haskell==

Some things that (standard) Haskell hasn't:

Some things that GF hasn't:
- classes, polymorphism

#NEW

==GF vs. most linguistic grammar formalisms==

GF separates abstract syntax from concrete syntax.

GF has a module system with separate compilation.

GF is generation-oriented (as opposed to parsing).

GF has unidirectional matching (as opposed to unification).

GF has a static type system (as opposed to a type-free universe).

"I was - and I still am - firmly convinced that a program composed
out of statically type-checked parts is more likely to faithfully
[...] weakly-typed interfaces or dynamically-checked interfaces."
(B. Stroustrup, 1994, p. 107)

#NEW

==The computation model==

An abstract syntax defines a free algebra of trees (using
dependent types, recursion, higher-order abstract syntax:
GF includes a complete Logical Framework).

A concrete syntax defines a homomorphism (compositional mapping)
from the abstract syntax to a system of tuples of strings.

The parsing problem can be reduced to that of PMCFG (Parallel
Multiple Context-Free Grammars), see P. Ljunglöf's thesis (2004).

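A Python sketch of the compositionality claim (a hypothetical two-rule grammar, not GF code): the linearization of a tree is computed from the linearizations of its immediate subtrees, and each category is linearized to a tuple of strings:

```python
# Concrete syntax as one operation on string tuples per abstract rule;
# "fish" and "Plur" are made-up rules for illustration.

CONCRETE = {
    "fish": lambda: ("fish", "fish"),   # (singular, plural) forms
    "Plur": lambda n: (n[1],),          # pick the plural form
}

def lin(tree):
    """Homomorphism: apply the rule's operation to linearized subtrees."""
    fun, *args = tree
    return CONCRETE[fun](*[lin(a) for a in args])

print(lin(("Plur", ("fish",))))   # ('fish',)
```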
#NEW

==The compilation task, again==

1. From a GF source grammar, derive a canonical GF grammar

Moreover, we generate supplementary formats such as grammars
required by various speech recognition systems.

#NEW

==An overview of compilation phases==

Legend:

[gf-compiler.png]

#NEW

==Using the compiler==

Batch mode (cf. GHC).

Interactive mode, building the grammar incrementally from
different files, with the possibility of testing them
(cf. GHCi).

The interactive mode came first, built on the model of ALF-2
(L. Magnusson), and there was no file output of compiled
grammars.

#NEW

==Modules and separate compilation==

The above diagram shows what happens to each module.

details (although it would be beneficial to make explicit the
rules that are right now just in the implementation...)

Separate compilation is //extremely important// when developing
big grammars, especially when using grammar libraries. Example: compiling
the GF resource grammar library takes 5 minutes, whereas reading
in the compiled image takes 10 seconds.

#NEW

==Techniques used==

BNFC is used for generating both the parsers and printers.
This has helped to make the formats portable.

"Almost compositional functions" (``composOp``) are used in
many compiler passes, making them easier to write and understand.
A grep on the sources reveals 40 uses (outside the definition
of ``composOp`` itself).

The key algorithmic ideas are
- type-driven partial evaluation in GF-to-GFC generation
- some ideas in GFC-to-MCFG encoding

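The idea behind ``composOp`` can be sketched in Python (a simplified stand-in for the Haskell original, with terms as nested tuples): a generic one-level traversal lets each pass spell out only the cases it cares about:

```python
# composOp rebuilds a term, applying f to the immediate subterms only.
# A pass then handles its interesting constructors explicitly and
# delegates everything else to composOp.

def composOp(f, t):
    """Apply f to each immediate subterm of t, rebuilding t."""
    if isinstance(t, tuple):
        return (t[0],) + tuple(f(a) for a in t[1:])
    return t

def rename(t):
    if isinstance(t, str):           # the one interesting case
        return t.upper()
    return composOp(rename, t)       # all other cases: compositionally

print(rename(("App", ("Var", "x"), ("Var", "y"))))
# ('App', ('Var', 'X'), ('Var', 'Y'))
```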
#NEW

==Type-driven partial evaluation==

Each abstract syntax category in GF has a corresponding linearization type:

and library functions.

The compilation technique proceeds as follows:
- use eta-expansion on ``t`` to determine the canonical form of the term
```
\ $C1, ...., $Cn -> (t $C1 .... $Cn)
```
with unique variables ``$C1 .... $Cn`` for the arguments; repeat this
inside the term for records and tables
- evaluate the resulting term using the computation rules of GF
- what remains is a canonical term with ``$C1 .... $Cn`` the only
variables (the run-time input of the linearization function)

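A rough Python sketch of the eta-expand-then-evaluate step (hypothetical names, terms as nested tuples; not the GF implementation): applying the linearization to a unique symbolic variable and evaluating inlines the grammar-writer's helper, leaving a canonical term in which only the run-time variable remains:

```python
# 'cc2' plays the role of a library oper; 'lin_def' a linearization rule.
# Evaluation computes cc2 away; only the variable "$C1" survives.

def cc2(a, b):                        # oper cc2 : Str -> Str -> Str
    return ("concat", a, b)

def lin_def(x):                       # lin Def x = cc2 "the" x.s
    return {"s": cc2("the", ("proj", x, "s"))}

canonical = lin_def("$C1")            # eta-expansion with variable $C1
print(canonical)
# {'s': ('concat', 'the', ('proj', '$C1', 's'))}
```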
#NEW

==Eta-expanding records and tables==

For records that are valid via subtyping, eta expansion
eliminates superfluous fields:
```
{r1 = t1 ; r2 = t2} : {r1 : T1} ----> {r1 = t1}
```
For tables, the effect is always expansion, since
pattern matching can be used to represent tables
compactly:
```
table {n => "fish"} : Number => Str --->

table {
  Sg => "fish" ;
  Pl => "fish"
  }
```
This can be helped by back-end optimizations (see below).

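Both expansions can be sketched in Python (records and tables modelled as dicts; illustrative only, not the GF implementation):

```python
# Record eta-expansion against a type keeps exactly the typed fields;
# table eta-expansion spells out one branch per parameter value.

def expand_record(rec, typ_fields):
    # {r1 = t1 ; r2 = t2} : {r1 : T1}  --->  {r1 = t1}
    return {f: rec[f] for f in typ_fields}

def expand_table(branch, param_values):
    # table {n => "fish"} ---> table {Sg => "fish" ; Pl => "fish"}
    return {v: branch(v) for v in param_values}

print(expand_record({"r1": "t1", "r2": "t2"}, ["r1"]))
print(expand_table(lambda n: "fish", ["Sg", "Pl"]))
```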
#NEW

==Eliminating functions==

"Everything is finite": parameter types, records, tables;
finite number of string tokens per grammar.

But "infinite types" such as function types are useful when
writing grammars, to enable abstractions.

Since function types do not appear in linearization types,
we want functions to be eliminated from linearization terms.

This is similar to the **subformula property** in logic.
Also the main problem is similar: a function depending on
a run-time variable,
```
(table {P => f ; Q => g} ! x) a
```
This is not a redex, but we can make it closer to one by moving
the application inside the table,
```
table {P => f a ; Q => g a} ! x
```
The transformation is the same as Prawitz's (1965) elimination
of maximal segments:
```
                                         A            B
                                      C -> D  C    C -> D  C
   A        B                         ---------    ---------
 A v B   C -> D   C -> D         A v B    D            D
 -----------------------   ===>  -------------------------
       C -> D         C                   D
       ----------------
              D
```

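A Python sketch of the transformation (tables as dicts from pattern to body, terms as tuples; illustrative only):

```python
# (table {P => f ; Q => g} ! x) a  is not a redex when x is a run-time
# variable, but the application 'a' can be moved into every branch.

def push_application(table, arg):
    # ---> table {P => f a ; Q => g a} ! x
    return {pat: ("app", body, arg) for pat, body in table.items()}

t = {"P": "f", "Q": "g"}
print(push_application(t, "a"))
# {'P': ('app', 'f', 'a'), 'Q': ('app', 'g', 'a')}
```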
#NEW

==Size effects of partial evaluation==

Irrelevant table branches are thrown away, which can reduce the size.

But, since tables are expanded and auxiliary functions are inlined,
the size can grow exponentially.

How can we keep the first and eliminate the second?

#NEW

==Parametrization of tables==

Algorithm: for each branch in a table, consider replacing the
argument by a variable:
```
table {               table {
  P => t ;    --->      x => t[P->x] ;
  Q => t                x => t[Q->x]
  }                   }
```
If the resulting branches are all equal, we can replace the table
by a lambda abstract
```
\\x => t[P->x]
```
If each created variable ``x`` is unique in the grammar, computation
with the lambda abstract is efficient.

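A Python sketch of the algorithm (branches as strings, substitution as textual replacement; illustrative only):

```python
# Replace each branch's pattern by a variable; if all branches then
# coincide, the whole table collapses to one parametrized template.

def parametrize(table):
    """table: dict pattern -> term (terms as strings mentioning the
    pattern). Returns the shared template, or None if branches differ."""
    templates = {term.replace(pat, "<x>") for pat, term in table.items()}
    if len(templates) == 1:
        return templates.pop()
    return None

print(parametrize({"P": "h P s", "Q": "h Q s"}))   # h <x> s
print(parametrize({"P": "h P s", "Q": "k Q s"}))   # None
```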
#NEW

==Common subexpression elimination==

Algorithm:
+ Go through all terms and subterms in a module, creating
a symbol table mapping terms to the number of occurrences.
+ For each subterm appearing at least twice, create a fresh
constant defined as that subterm.
+ Go through all rules (incl. rules for the new constants),
replacing largest possible subterms with such new constants.


This algorithm, in a way, creates the strongest possible abstractions.

In general, the new constants have open terms as definitions.
But since all variables (and constants) are unique, they can
be computed by simple replacement.

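The three steps can be sketched in Python (terms as nested tuples; a simplification of the real pass, e.g. the definitions of the new constants are not themselves rewritten here):

```python
# Step 1: count subterm occurrences; step 2: name repeated subterms;
# step 3: rewrite, checking whole terms before their parts so that the
# largest possible subterms are replaced.
from collections import Counter

def subterms(t):
    """Yield t and all of its subterms."""
    yield t
    if isinstance(t, tuple):
        for a in t[1:]:
            yield from subterms(a)

def cse(terms):
    counts = Counter(s for t in terms for s in subterms(t)
                     if isinstance(s, tuple))
    names = {s: "_%d" % i
             for i, s in enumerate(s for s, n in counts.items() if n >= 2)}

    def rewrite(t):
        if t in names:                      # largest subterm first
            return names[t]
        if isinstance(t, tuple):
            return (t[0],) + tuple(rewrite(a) for a in t[1:])
        return t

    return [rewrite(t) for t in terms], {v: k for k, v in names.items()}

out, defs = cse([("+", ("f", "x"), ("f", "x")), ("g", ("f", "x"))])
print(out)    # [('+', '_0', '_0'), ('g', '_0')]
print(defs)   # {'_0': ('f', 'x')}
```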
#NEW

==Course-of-values tables==

By maintaining a canonical order of parameters in a type, we can
eliminate the left-hand sides of branches.
```
table {               table T [
  P => t ;    --->      t ;
  Q => u                u
  }                   ]
```
The treatment is similar to ``Enum`` instances in Haskell.

Actually, all parameter types could be translated to
initial segments of integers. This is done in the GFCC format.

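A Python sketch (parameter values as strings, with the canonical order given explicitly):

```python
# With a fixed order of parameter values, a table becomes a plain list
# of values; selection is list indexing via the value's position
# (cf. fromEnum in Haskell).

NUMBER = ["Sg", "Pl"]                  # canonical order of the param type

def to_course_of_values(table):
    # table {Sg => t ; Pl => u} ---> [t ; u]
    return [table[v] for v in NUMBER]

def select(values, v):
    return values[NUMBER.index(v)]

cov = to_course_of_values({"Sg": "fish", "Pl": "fishes"})
print(cov)                  # ['fish', 'fishes']
print(select(cov, "Pl"))    # fishes
```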
#NEW

==Size effects of optimizations==

Example: the German resource grammar ``LangGer``.

|| optimization | lines | characters | size % | blow-up |
| none          | 5394  | 3208435    | 100    | 25      |
| all           | 5394  | 750277     | 23     | 6       |
| none_subs     | 5772  | 1290866    | 40     | 10      |
| all_subs      | 5644  | 414119     | 13     | 3       |
| gfcc          | 3279  | 190004     | 6      | 1.5     |
| gf source     | 3976  | 121939     | 4      | 1       |


Optimization "all" means parametrization + course-of-values.

The source code size is an estimate, since it includes
potentially irrelevant library modules.

The GFCC format is not reusable in separate compilation.

#NEW

==The shared prefix optimization==

This is currently performed in GFCC only.

The idea works for languages that have a rich morphology
based on suffixes. Then we can replace a course of values
with a pair of a prefix and a suffix set:
```
["apa", "apan", "apor", "aporna"] --->
("ap" + ["a", "an", "or", "orna"])
```
The real gain comes via common subexpression elimination:
```
_34 = ["a", "an", "or", "orna"]
apa = ("ap" + _34)
blomma = ("blomm" + _34)
flicka = ("flick" + _34)
```
Notice that it now matters a lot how grammars are written.
For instance, if German verbs are treated as a one-dimensional
table,
```
["lieben", "liebe", "liebst", ...., "geliebt", "geliebter",...]
```
no shared prefix optimization is possible. A better form is
separate tables for non-"ge" and "ge" forms:
```
[["lieben", "liebe", "liebst", ....], ["geliebt", "geliebter",...]]
```

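A Python sketch of the factorization (using ``os.path.commonprefix`` for the longest common prefix):

```python
# Split a course of values into (longest common prefix, suffix list),
# so that CSE can then share the suffix list between words.
import os

def share_prefix(forms):
    p = os.path.commonprefix(forms)
    return p, [f[len(p):] for f in forms]

print(share_prefix(["apa", "apan", "apor", "aporna"]))
# ('ap', ['a', 'an', 'or', 'orna'])
```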
#NEW

==Reuse of grammars as libraries==

The idea of resource grammars: take care of all aspects of
surface grammaticality (inflection, agreement, word order).

Reuse in application grammars: via translations
```
cat C          --->  oper C : Type = T
lincat C = T

fun f : A      --->  oper f : A* = t
lin f = t
```
The user only needs to know the type signatures (abstract syntax).

However, this does not quite guarantee grammaticality, because
different categories can have the same lincat:
```
lincat Conj = {s : Str}
lincat Adv = {s : Str}
```
Thus someone may by accident use "and" as an adverb!

#NEW

==Forcing the type checker to act as a grammar checker==

We just have to make linearization types unique for each category.

The technique is reminiscent of Haskell's ``newtype`` but uses
records instead: we add **lock fields**, e.g.
```
lincat Conj = {s : Str ; lock_Conj : {}}
lincat Adv = {s : Str ; lock_Adv : {}}
```
Thanks to record subtyping, the translation is simple:
```
fun f : C1 -> ... -> Cn -> C
lin f = t

--->

oper f : C1* -> ... -> Cn* -> C* =
  \x1,...,xn -> (t x1 ... xn) ** {lock_C = {}}
```

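The effect of lock fields can be mimicked at run time in Python (a sketch of the idea only; in GF the check is static, done by the type checker):

```python
# Records carry an empty lock field named after their category; a
# checker for category X demands the field lock_X.

def mk(cat, s):
    return {"s": s, "lock_" + cat: {}}

def check(cat, rec):
    if "lock_" + cat not in rec:
        raise TypeError("not a " + cat)
    return rec

conj = mk("Conj", "and")
check("Conj", conj)              # fine
try:
    check("Adv", conj)           # "and" used as an adverb: rejected
except TypeError as e:
    print(e)
```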