Author's address:
http://www.cs.chalmers.se/~aarne
History:
GFCC is a low-level format for GF grammars. Its aim is to contain the minimum that is needed to process GF grammars at runtime. This minimality has three advantages:
The idea is that all embedded GF applications are compiled to GFCC. The GF system would be primarily used as a compiler and as a grammar development tool.
Since GFCC is implemented in BNFC, a parser of the format is readily available for C, C++, Haskell, Java, and OCaml. Also an XML representation is generated in BNFC. A reference implementation of linearization and some other functions has been written in Haskell.
GFCC is aimed to replace GFC as the run-time grammar format. GFC was designed to be a run-time format, but also to support separate compilation of grammars, i.e. to store the results of compiling individual GF modules. But this means that GFC has to contain extra information, such as type annotations, which is only needed in compilation and not at run-time. In particular, the pattern matching syntax and semantics of GFC is complex and therefore difficult to implement in new platforms.
The main differences of GFCC compared with GFC can be summarized as follows:
Here is an example of a GF grammar, consisting of three modules, as translated to GFCC. The representations are aligned, with the exceptions due to the alphabetical sorting of GFCC grammars.
grammar Ex(Eng,Swe);
abstract Ex = { abstract {
cat
S ; NP ; VP ;
fun
Pred : NP -> VP -> S ; Pred : NP,VP -> S = (Pred);
She, They : NP ; She : -> NP = (She);
Sleep : VP ; Sleep : -> VP = (Sleep);
They : -> NP = (They);
} } ;
concrete Eng of Ex = { concrete Eng {
lincat
S = {s : Str} ;
NP = {s : Str ; n : Num} ;
VP = {s : Num => Str} ;
param
Num = Sg | Pl ;
lin
Pred np vp = { Pred = [(($0!1),(($1!0)!($0!0)))];
s = np.s ++ vp.s ! np.n} ;
She = {s = "she" ; n = Sg} ; She = [0, "she"];
They = {s = "they" ; n = Pl} ;
Sleep = {s = table { Sleep = [("sleep" + ["s",""])];
Sg => "sleeps" ;
Pl => "sleep" They = [1, "they"];
} } ;
} ;
}
concrete Swe of Ex = { concrete Swe {
lincat
S = {s : Str} ;
NP = {s : Str} ;
VP = {s : Str} ;
param
Num = Sg | Pl ;
lin
Pred np vp = { Pred = [(($0!0),($1!0))];
s = np.s ++ vp.s} ;
She = {s = "hon"} ; She = ["hon"];
They = {s = "de"} ; They = ["de"];
Sleep = {s = "sover"} ; Sleep = ["sover"];
} } ;
A grammar has a header telling the name of the abstract syntax (often specifying an application domain), and the names of the concrete languages. The abstract syntax and the concrete syntaxes themselves follow.
Grammar ::= Header ";" Abstract ";" [Concrete] ;
Header ::= "grammar" CId "(" [CId] ")" ;
Abstract ::= "abstract" "{" [AbsDef] "}" ;
Concrete ::= "concrete" CId "{" [CncDef] "}" ;
Abstract syntax judgements give typings and semantic definitions. Concrete syntax judgements give linearizations.
AbsDef ::= CId ":" Type "=" Exp ;
CncDef ::= CId "=" Term ;
Also flags are possible, local to each "module" (i.e. abstract and concretes).
AbsDef ::= "%" CId "=" String ;
CncDef ::= "%" CId "=" String ;
For the run-time system, the reference implementation in Haskell uses a structure that gives efficient look-up:
data GFCC = GFCC {
absname :: CId ,
cncnames :: [CId] ,
abstract :: Abstr ,
concretes :: Map CId Concr
}
data Abstr = Abstr {
funs :: Map CId Type, -- find the type of a fun
cats :: Map CId [CId] -- find the funs giving a cat
}
type Concr = Map CId Term
Types are first-order function types built from
category symbols. Syntax trees (Exp) are
rose trees with the head (Atom) either a function
constant, a metavariable, or a string, integer, or float
literal.
Type ::= [CId] "->" CId ;
Exp ::= "(" Atom [Exp] ")" ;
Atom ::= CId ; -- function constant
Atom ::= "?" ; -- metavariable
Atom ::= String ; -- string literal
Atom ::= Integer ; -- integer literal
Atom ::= Double ; -- float literal
Linearization terms (Term) are built as follows.
Constructor names are shown to make the later code
examples readable.
R. Term ::= "[" [Term] "]" ; -- array
P. Term ::= "(" Term "!" Term ")" ; -- access to indexed field
S. Term ::= "(" [Term] ")" ; -- sequence with ++
K. Term ::= Tokn ; -- token
V. Term ::= "$" Integer ; -- argument
C. Term ::= Integer ; -- array index
FV. Term ::= "[|" [Term] "|]" ; -- free variation
TM. Term ::= "?" ; -- linearization of metavariable
Tokens are strings or (maybe obsolescent) prefix-dependent variant lists.
KS. Tokn ::= String ;
KP. Tokn ::= "[" "pre" [String] "[" [Variant] "]" "]" ;
Var. Variant ::= [String] "/" [String] ;
Three special forms of terms are introduced by the compiler as optimizations. They can in principle be eliminated, but their presence makes grammars much more compact. Their semantics will be explained in a later section.
F. Term ::= CId ; -- global constant
W. Term ::= "(" String "+" Term ")" ; -- prefix + suffix table
RP. Term ::= "(" Term "@" Term ")"; -- record parameter alias
Identifiers are like Ident in GF and GFC, except that
the compiler produces constants prefixed with _ in
the common subterm elimination optimization.
token CId (('_' | letter) (letter | digit | '\'' | '_')*) ;
The linearization algorithm is essentially the same as in GFC: a tree is linearized by evaluating its linearization term in the environment of the linearizations of the subtrees. Literal atoms are linearized in the obvious way. The function also needs to know the language (i.e. concrete syntax) in which linearization is performed.
linExp :: GFCC -> CId -> Exp -> Term
linExp mcfg lang tree@(Tr at trees) = case at of
AC fun -> comp (Prelude.map lin trees) $ look fun
AS s -> R [kks (show s)] -- quoted
AI i -> R [kks (show i)]
AF d -> R [kks (show d)]
AM -> TM
where
lin = linExp mcfg lang
comp = compute mcfg lang
look = lookLin mcfg lang
The result of linearization is usually a record, which is realized as a string using the following algorithm.
realize :: Term -> String
realize trm = case trm of
R (t:_) -> realize t
S ss -> unwords $ Prelude.map realize ss
K (KS s) -> s
K (KP s _) -> unwords s ---- prefix choice TODO
W s t -> s ++ realize t
FV (t:_) -> realize t
TM -> "?"
Since the order of record fields is not necessarily the same as in GF source, this realization does not work securely for categories whose lincats more than one field.
Evaluation follows call-by-value order, with two environments needed:
The code is presented in one-level pattern matching, to enable reimplementations in languages that do not permit deep patterns (such as Java and C++).
compute :: GFCC -> CId -> [Term] -> Term -> Term
compute mcfg lang args = comp where
comp trm = case trm of
P r p -> proj (comp r) (comp p)
RP i t -> RP (comp i) (comp t)
W s t -> W s (comp t)
R ts -> R $ Prelude.map comp ts
V i -> idx args (fromInteger i) -- already computed
F c -> comp $ look c -- not computed (if contains V)
FV ts -> FV $ Prelude.map comp ts
S ts -> S $ Prelude.filter (/= S []) $ Prelude.map comp ts
_ -> trm
look = lookLin mcfg lang
idx xs i = xs !! i
proj r p = case (r,p) of
(_, FV ts) -> FV $ Prelude.map (proj r) ts
(W s t, _) -> kks (s ++ getString (proj t p))
_ -> comp $ getField r (getIndex p)
getString t = case t of
K (KS s) -> s
_ -> trace ("ERROR in grammar compiler: string from "++ show t) "ERR"
getIndex t = case t of
C i -> fromInteger i
RP p _ -> getIndex p
TM -> 0 -- default value for parameter
_ -> trace ("ERROR in grammar compiler: index from " ++ show t) 0
getField t i = case t of
R rs -> idx rs i
RP _ r -> getField r i
TM -> TM
_ -> trace ("ERROR in grammar compiler: field from " ++ show t) t
The three forms introduced by the compiler may a need special explanation.
Global constants
Term ::= CId ;
are shorthands for complex terms. They are produced by the compiler by (iterated) common subexpression elimination. They are often more powerful than hand-devised code sharing in the source code. They could be computed off-line by replacing each identifier by its definition.
Prefix-suffix tables
Term ::= "(" String "+" Term ")" ;
represent tables of word forms divided to the longest common prefix and its array of suffixes. In the example grammar above, we have
Sleep = [("sleep" + ["s",""])]
which in fact is equal to the array of full forms
["sleeps", "sleep"]
The power of this construction comes from the fact that suffix sets tend to be repeated in a language, and can therefore be collected by common subexpression elimination. It is this technique that explains the used syntax rather than the more accurate
"(" String "+" [String] ")"
since we want the suffix part to be a Term for the optimization to
take effect.
The most curious construct of GFCC is the parameter array alias,
Term ::= "(" Term "@" Term ")";
This form is used as the value of parameter records, such as the type
{n : Number ; p : Person}
The problem with parameter records is their double role. They can be used like parameter values, as indices in selection,
VP.s ! {n = Sg ; p = P3}
but also as records, from which parameters can be projected:
{n = Sg ; p = P3}.n
Whichever use is selected as primary, a prohibitively complex case expression must be generated at compilation to GFCC to get the other use. The adopted solution is to generate a pair containing both a parameter value index and an array of indices of record fields. For instance, if we have
param Number = Sg | Pl ; Person = P1 | P2 | P3 ;
we get the encoding
{n = Sg ; p = P3} ---> (2 @ [0,2])
The GFCC computation rules are essentially
(t ! (i @ _)) = (t ! i)
((_ @ r) ! j) =(r ! j)
Compilation to GFCC is performed by the GF grammar compiler, and GFCC interpreters need not know what it does. For grammar writers, however, it might be interesting to know what happens to the grammars in the process.
The compilation phases are the following
values optimization to normalize tables
coding flag)
Notice that a major part of the compilation is done within GFC, so that GFC-related tasks (such as parser generation) could be performed by using the old algorithms.
Two major problems had to be solved in compiling GFC to GFCC:
The current implementation is still experimental and may fail to generate correct code. Any errors remaining are likely to be related to the two problems just mentioned.
The order problem is solved in different ways for tables and records.
For tables, the values optimization of GFC already manages to
maintain a canonical order. But this order can be destroyed by the
share optimization. To make sure that GFCC compilation works properly,
it is safest to recompile the GF grammar by using the values
optimization flag.
Records can be canonically ordered by sorting them by labels.
In fact, this was done in connection of the GFCC work as a part
of the GFC generation, to guarantee consistency. This means that
e.g. the s field will in general no longer appear as the first
field, even if it does so in the GF source code. But relying on the
order of fields in a labelled record would be misplaced anyway.
The canonical form of records is further complicated by lock fields,
i.e. dummy fields of form lock_C = <>, which are added to grammar
libraries to force intensionality of linearization types. The problem
is that the absence of a lock field only generates a warning, not
an error. Therefore a GFC grammar can contain objects of the same
type with and without a lock field. This problem was solved in GFCC
generation by just removing all lock fields (defined as fields whose
type is the empty record type). This has the further advantage of
(slightly) reducing the grammar size. More importantly, it is safe
to remove lock fields, because they are never used in computation,
and because intensional types are only needed in grammars reused
as libraries, not in grammars used at runtime.
While the order problem is rather bureaucratic in nature, run-time variables are an interesting problem. They arise in the presence of complex parameter values, created by argument-taking constructors and parameter records. To give an example, consider the GF parameter type system
Number = Sg | Pl ;
Person = P1 | P2 | P3 ;
Agr = Ag Number Person ;
The values can be translated to integers in the expected way,
Sg = 0, Pl = 1
P1 = 0, P2 = 1, P3 = 2
Ag Sg P1 = 0, Ag Sg P2 = 1, Ag Sg P3 = 2,
Ag Pl P1 = 3, Ag Pl P2 = 4, Ag Pl P3 = 5
However, an argument of Agr can be a run-time variable, as in
Ag np.n P3
This expression must first be translated to a case expression,
case np.n of {
0 => 2 ;
1 => 5
}
which can then be translated to the GFCC term
([2,5] ! ($0 ! $1))
assuming that the variable np is the first argument and that its
Number field is the second in the record.
This transformation of course has to be performed recursively, since there can be several run-time variables in a parameter value:
Ag np.n np.p
A similar transformation would be possible to deal with the double role of parameter records discussed above. Thus the type
RNP = {n : Number ; p : Person}
could be uniformly translated into the set {0,1,2,3,4,5}
as Agr above. Selections would be simple instances of indexing.
But any projection from the record should be translated into
a case expression,
rnp.n ===>
case rnp of {
0 => 0 ;
1 => 0 ;
2 => 0 ;
3 => 1 ;
4 => 1 ;
5 => 1
}
To avoid the code bloat resulting from this, we chose the alias representation which is easy enough to deal with in interpreters.
Linearization types (lincat) are not needed when generating with
GFCC, but they have been added to enable parser generation directly from
GFCC. The linearization type definitions are shown as a part of the
concrete syntax, by using terms to represent types. Here is the table
showing how different linearization types are encoded.
P* = size(P) -- parameter type
{_ : I ; __ : R}* = (I* @ R*) -- record of parameters
{r1 : T1 ; ... ; rn : Tn}* = [T1*,...,Tn*] -- other record
(P => T)* = [T* ,...,T*] -- size(P) times
Str* = ()
The category symbols are prefixed with two underscores (__).
For example, the linearization type present/CatEng.NP is
translated as follows:
NP = {
a : { -- 6 = 2*3 values
n : {ParamX.Number} ; -- 2 values
p : {ParamX.Person} -- 3 values
} ;
s : {ResEng.Case} => Str -- 3 values
}
__NP = [(6@[2,3]),[(),(),()]]
GFCC generation is a part of the
developers' version
of GF since September 2006. To invoke the compiler, the flag
-printer=gfcc to the command
pm = print_multi is used. It is wise to recompile the grammar from
source, since previously compiled libraries may not obey the canonical
order of records. To strip the grammar before
GFCC translation removes unnecessary interface references.
Here is an example, performed in
example/bronzeage.
i -src -path=.:prelude:resource-1.0/* -optimize=all_subs BronzeageEng.gf
i -src -path=.:prelude:resource-1.0/* -optimize=all_subs BronzeageGer.gf
strip
pm -printer=gfcc | wf bronze.gfcc
The reference interpreter written in Haskell consists of the following files:
-- source file for BNFC
GFCC.cf -- labelled BNF grammar of gfcc
-- files generated by BNFC
AbsGFCC.hs -- abstrac syntax of gfcc
ErrM.hs -- error monad used internally
LexGFCC.hs -- lexer of gfcc files
ParGFCC.hs -- parser of gfcc files and syntax trees
PrintGFCC.hs -- printer of gfcc files and syntax trees
-- hand-written files
DataGFCC.hs -- post-parser grammar creation, linearization and evaluation
GenGFCC.hs -- random and exhaustive generation, generate-and-test parsing
RunGFCC.hs -- main function - a simple command interpreter
It is included in the
developers' version
of GF, in the subdirectory GF/src/GF/Canon/GFCC.
To compile the interpreter, type
make gfcc
in GF/src. To run it, type
./gfcc <GFCC-file>
The available commands are
gr <Cat> <Int>: generate a number of random trees in category.
and show their linearizations in all languages
grt <Cat> <Int>: generate a number of random trees in category.
and show the trees and their linearizations in all languages
gt <Cat> <Int>: generate a number of trees in category from smallest,
and show their linearizations in all languages
gtt <Cat> <Int>: generate a number of trees in category from smallest,
and show the trees and their linearizations in all languages
p <Int> <Cat> <String>: "parse", i.e. generate trees until match or
until the given number have been generated
<Tree>: linearize tree in all languages, also showing full records
quit: terminate the system cleanly
A base-line interpreter in C++ has been started. Its main functionality is random generation of trees and linearization of them.
Here are some results from running the different interpreters, compared
to running the same grammar in GF, saved in .gfcm format.
The grammar contains the English, German, and Norwegian
versions of Bronzeage. The experiment was carried out on
Ubuntu Linux laptop with 1.5 GHz Intel centrino processor.
| GF | gfcc(hs) | gfcc++ | |
|---|---|---|---|
| program size | 7249k | 803k | 113k |
| grammar size | 336k | 119k | 119k |
| read grammar | 1150ms | 510ms | 100ms |
| generate 222 | 9500ms | 450ms | 800ms |
| memory | 21M | 10M | 20M |
To summarize:
Interpreter in Java.
Parsing via MCFG
Hand-written parsers for GFCC grammars to reduce code size (and efficiency?) of interpreters.
Binary format and/or file compression of GFCC output.
Syntax editor based on GFCC.
Rewriting of resource libraries in order to exploit the word-suffix sharing better (depth-one tables, as in FM).