Grammatical Framework Version 2

Release of Version 2.0

Planned: 24 June 2004

Aarne Ranta

Highlights

Module system.

Separate compilation to canonical GF.

Improved GUI.

Improved parser generation.

Improved shell (new commands and options, help, error messages).

Accurate language specification (also of GFC).

Extended resource library.

Extended Numerals library.

Module system

Separate modules for abstract, concrete, and resource.

Replaces the file-based include system

Name space handling with qualified names

Hierarchic structure (single inheritance **) + cross-cutting reuse (open)

Separate compilation, one module per file

Reuse of abstract+concrete as resource

Parametrized modules: interface, instance, incomplete.

New experimental module types: transfer, union.

Canonical format GFC

The target of the GF compiler; to reuse a compiled module, just read it in.

Readable by Haskell/Java/C++/C applications (by BNFC generated parsers).

New features in expression language

In addition to the module system:

Disjunctive patterns P | ... | Q.

String patterns "foo".

(?) Integer patterns 74.

Binding token &+ to glue separate tokens at unlexing phase, and unlexer to resolve this.

New syntax alternatives for local definitions: let without braces and where.

Pattern variables can be used on lhs's of oper definitions.

New Unicode transliterations (by Harald Hammarström).
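The effect of the binding token &+ at the unlexing phase can be pictured with a small Python sketch. This is only an illustration of the idea, not GF's unlexer: the function name unlex and the choice of a single space as the default separator are assumptions.

```python
BIND = "&+"  # the binding token named in the text

def unlex(tokens):
    """Join tokens with spaces, except that a binding token glues its neighbours."""
    out = []
    glue = False
    for tok in tokens:
        if tok == BIND:
            glue = True          # the next token attaches to the previous one
            continue
        if glue and out:
            out[-1] += tok       # glue onto the previous token, no space
        else:
            out.append(tok)
        glue = False
    return " ".join(out)

print(unlex(["the", "dog", "&+", "s", "bark"]))  # the dogs bark
```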

New shell commands and command functionalities

pi = print_info: information on an identifier in scope.

h = help now in long or short form, and on individual commands.

gt = generate_trees: all trees of a given category or instantiations of a given incomplete term, up to a given depth.

gr = generate_random can now be given an incomplete term as an argument, to constrain generation.

so = show_opers shows all oper operations with a given value type.

pm = print_multi prints the multilingual grammar resident in the current state to a ready-compiled .gfcm file.

All commands have both long and short names (see help). Short names are easier to type, whereas long names make scripts more readable.

Meaningless command options generate warnings.
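What gt = generate_trees computes can be sketched in Python for a toy signature with one constant and one binary function. The table FUNS, the function name, and the exact meaning of "depth" (a constant has depth 1) are assumptions made for the illustration; GF's own depth counting may differ.

```python
from itertools import product

# Hypothetical toy signature: function name -> (argument categories, value category)
FUNS = {
    "One": ([], "Exp"),
    "plus": (["Exp", "Exp"], "Exp"),
}

def generate_trees(cat, depth):
    """Return all trees of category `cat` whose depth is at most `depth`."""
    if depth <= 0:
        return []
    trees = []
    for name, (args, val) in FUNS.items():
        if val != cat:
            continue
        # every argument position ranges over the trees of smaller depth
        arg_choices = [generate_trees(a, depth - 1) for a in args]
        for combo in product(*arg_choices):
            trees.append((name, *combo))
    return trees

for tree in generate_trees("Exp", 2):
    print(tree)
```

Trees are plain tuples here, with the function name first and the subtrees after it.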

New editor features

Active text field: click the middle button in the focus to send in a refinement through the parser.

Clipboard: copy complex terms into the refine menu.

Two-step refinements generated by the "Generate" operation.

Improved implementation

Haskell source code is organized into subdirectories.

BNF Converter is used for defining the languages GF and GFC, which also give reliable LaTeX documentation.

Lexical rules sorted out by option -cflexer for efficient parsing with large lexica.

GHC optimizations and strictness flags are used for improving performance.

New parser (work in progress)

By Peter Ljunglöf, based on MCFG.

Much more efficient for morphology and discontinuous constituents.

Treatment of cyclic rules.

Currently lots of alternative parsers via flags -parser=newX.

Status (21/6/2004)

Grammar compiler, editor GUIs, and shell work for all platforms (with restrictions for Solaris).

The updated HelpFile (accessible through h command) marks unsupported features present in GF 1.2 with *. They will be supported again if interested users appear.

GF1 grammars can be automatically translated to GF2 (although the result is not as good as a manual translation, since indentation and comments are destroyed). The results can be saved in GF2 files, but this is not necessary. Some rarely used GF1 features are no longer supported (see next section).

It is also possible to write a GF2 grammar back to GF1, with the command pg -printer=old. Resource libraries and some example grammars have been converted. Most old example grammars work without any changes. There is a new resource API with many new constructions.

A make facility works, finding out which modules have to be recompiled.

Checking of module dependencies for soundness and completeness is not yet fully implemented. This means that some errors may show up too late.

The environment variable GF_LIB_PATH needs some more work.

LaTeX and XML printing of grammars do not work yet.

How to use GF 1.* files

Backward compatibility with respect to old GF grammars has been a central goal. All GF grammars, from version 0.9 onwards, should work in the old way in GF2. The main exceptions are some features that are rarely used:

The package system introduced in GF 1.2 cannot be interpreted in the module system of GF 2.0, since packages are in mutual scope with the top level.
tokenizer pragmas cannot be parsed any more. In GF 1.2, they were already replaced by lexer flags.
var pragmas cannot be parsed any more.

Very old GF grammars (from versions before 0.9), with the completely different notation, do not work. They should be first converted to GF1 by using GF version 1.2.

The import command i can be given the option -old. E.g.

  i -old tut1.Eng.g2

But this is no longer necessary: GF2 detects automatically if a grammar is in the GF1 format.

Importing a set of GF1 files generates, internally, three modules:

  abstract tut1 = ...
  resource ResEng = ...
  concrete Eng of tut1 = open ResEng in ...

(The names are different if the file name has fewer parts.)

The option -o causes GF2 to write these modules into files. The flags -abs, -cnc, and -res can be used to give custom names to the modules. In particular, it is good to use the -abs flag to guarantee that the abstract syntax module has the same name for all grammars in a multilingual environment:

  i -old -abs=Numerals hungarian.gf
  i -old -abs=Numerals tamil.gf
  i -old -abs=Numerals sanskrit.gf

The same flags as in the import command can be used when invoking GF2 from the system shell. Many grammars can be imported on the same command line, e.g.

  % gf2 -old -abs=Tutorial tut1.Eng.gf tut1.Fin.gf tut1.Fra.gf

To write a GF2 grammar back to GF1 (as one big file), use the command

  > pg -old

GF2 has more reserved words than GF 1.2. When old files are read, a preprocessor replaces every identifier that has the shape of a new reserved word with a variant where the last letter is replaced by Z, e.g. instance is replaced by instancZ. This method is of course unsafe and should be replaced by something better.
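The renaming step can be sketched in Python as follows. Only instance is confirmed by the text as a new reserved word; the other entries in NEW_RESERVED are taken from the module type names mentioned earlier and are an assumption, as are the function name and the identifier pattern.

```python
import re

# Assumption: illustrative subset of new reserved words; only "instance" is confirmed
NEW_RESERVED = {"instance", "interface", "incomplete", "transfer", "union"}

# Assumption: a simple identifier shape (letter or underscore, then letters, digits, _ or ')
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_']*")

def rename_reserved(source):
    """Replace the last letter of every identifier that looks like a new reserved word by Z."""
    def fix(match):
        word = match.group(0)
        return word[:-1] + "Z" if word in NEW_RESERVED else word
    return IDENT.sub(fix, source)

print(rename_reserved("oper instance : Str = union ;"))
# oper instancZ : Str = unioZ ;
```

The sketch also shows one way the method is unsafe: it rewrites matching words inside string literals too, since it does not distinguish them from identifiers.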

Abstract, concrete, and resource modules

Judgement forms are sorted as follows:

abstract: cat, fun, def, data, flags
concrete: lincat, lin, printname, flags
resource: param, oper, flags

Example:

  abstract Sums = {
    cat 
      Exp ;
    fun 
      One : Exp ;
      plus : Exp -> Exp -> Exp ;
  }

  concrete EnglishSums of Sums = open ResEng in {
    lincat 
      Exp = {s : Str ; n : Number} ;
    lin
      One = expSg "one" ;
      plus x y = expSg ("the" ++ "sum" ++ "of" ++ x.s ++ "and" ++ y.s) ;
  }

  resource ResEng = {
    param 
      Number = Sg | Pl ;
    oper 
      expSg : Str -> {s : Str ; n : Number} = \s -> {s = s ; n = Sg} ;
  }

Opening and extending modules

A concrete or resource can open a resource. This means that

the names defined in resource can be used ("become visible")
but: these names are not included in ("exported from") the opening module

A module of any type can moreover extend a module of the same type. This means that

the names defined in the extended module can be used ("become visible")
and also: these names are included in ("exported from") the extending module

Examples of extension:

  abstract Products = Sums ** {
    fun times : Exp -> Exp -> Exp ;
  }
  -- names exported: Exp, One, plus, times

  concrete English of Products = EnglishSums ** open ResEng in {
    lin times x y = expSg ("the" ++ "product" ++ "of" ++ x.s ++ "and" ++ y.s) ;
  }

Another important difference:

extension is single

opening can be multiple: open Foo, Bar, Baz in {...}

Moreover:

opening can be qualified

Example of qualified opening:

  concrete NumberSystems of Systems = open (Bin = Binary), (Dec = Decimal) in {
    lin 
      BZero = Bin.Zero ;
      DZero = Dec.Zero ;
  }
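The difference between opening and extending can be modelled with a small Python sketch, assuming a simplified view in which a module is just a set of names plus at most one extended module and any number of opened modules. The table MODULES and the function names are hypothetical, and qualified opening is left out.

```python
# Simplified model: own names, at most one extended module, any number of opened modules
MODULES = {
    "Sums":        {"own": {"Exp", "One", "plus"}, "extends": None, "opens": []},
    "Products":    {"own": {"times"}, "extends": "Sums", "opens": []},
    "ResEng":      {"own": {"Number", "Sg", "Pl", "expSg"}, "extends": None, "opens": []},
    "EnglishSums": {"own": {"Exp", "One", "plus"}, "extends": None, "opens": ["ResEng"]},
}

def exported(name):
    """Names exported from a module: its own names plus those inherited by extension."""
    m = MODULES[name]
    inherited = exported(m["extends"]) if m["extends"] else set()
    return m["own"] | inherited

def visible(name):
    """Names usable inside a module: exported names plus names of opened modules."""
    names = exported(name)
    for opened in MODULES[name]["opens"]:
        names |= exported(opened)
    return names

print(sorted(exported("Products")))            # ['Exp', 'One', 'plus', 'times']
print("expSg" in visible("EnglishSums"))       # True
print("expSg" in exported("EnglishSums"))      # False
```

Extension contributes to both sets, whereas opening contributes only to the visible set.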

Compiling modules

Separate compilation assumes there is one module per file.

The module header is the beginning of the module code up to the first left brace ({). The header gives

the module type: abstract, concrete (of A), or resource
the name of the module (next to the module type keyword)
the name of the extended module (between = and **)
the names of opened modules
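Reading these four pieces of information from a header can be sketched in Python. This is a rough approximation under stated assumptions, not GF's parser: it covers only the three module types listed above, ignores comments, and the function name and result format are hypothetical.

```python
import re

HEADER = re.compile(
    r"^\s*(?P<type>abstract|concrete|resource)\s+(?P<name>\w+)"
    r"(?:\s+of\s+(?P<of>\w+))?\s*=\s*(?P<rest>.*)$",
    re.S,
)

def parse_header(source):
    """Extract module type, name, extended module, and opened modules."""
    header = source.split("{", 1)[0]          # everything before the first left brace
    m = HEADER.match(header)
    if m is None:
        raise ValueError("not a recognised module header")
    rest = m.group("rest").strip()
    extends = None
    if "**" in rest:                          # extended module sits between = and **
        left, rest = rest.split("**", 1)
        extends = left.strip()
        rest = rest.strip()
    opens = []
    om = re.match(r"open\s+(.*?)\s+in$", rest, re.S)
    if om:
        for item in om.group(1).split(","):
            item = item.strip().strip("()")
            # a qualified opening has the form (Q = Module); keep the module name
            opens.append(item.split("=")[-1].strip())
    return {"type": m.group("type"), "name": m.group("name"),
            "of": m.group("of"), "extends": extends, "opens": opens}

print(parse_header("concrete English of Products = EnglishSums ** open ResEng in { }"))
```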

filename = modulename . extension

File name extensions:

gf: GF source file (uses GF syntax, is type checked and compiled)
gfc: canonical GF file (uses GFC syntax, is simply read in instead of compiled; produced from all kinds of modules)
gfr: GF resource file (uses GF syntax, is only read in; produced from resource modules)
gfcm: canonical multilingual GF file (uses GFC syntax, is only read in; produced from a set of abstract and concrete modules)

Only gf files should ever be written/edited manually!

What the make facility does when compiling Foo.gf:

read the module header of Foo.gf, and recursively all headers from the modules it depends on (i.e. extends or opens)
build a dependency graph of these modules, and do topological sorting
starting from the first module in topological order, compare the modification times of each gf and gfc file:
- if gf is later, compile the module and all modules depending on it
- if gfc is later, just read in the module
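The steps above can be sketched in Python (3.9 or later, for graphlib). The function plan_build and its inputs are hypothetical: dependencies and modification times are passed in as plain dictionaries instead of being read from headers and the file system.

```python
from graphlib import TopologicalSorter

def plan_build(deps, gf_mtime, gfc_mtime):
    """Decide, in topological order, whether each module is compiled or just read in.

    deps maps a module to the set of modules it extends or opens;
    gf_mtime and gfc_mtime map a module to the modification time of its file.
    """
    order = TopologicalSorter(deps).static_order()   # dependencies come first
    recompiled = set()
    plan = []
    for mod in order:
        # a missing gfc file counts as older than any gf file
        stale = gf_mtime[mod] > gfc_mtime.get(mod, float("-inf"))
        # recompile also when any module this one depends on was recompiled
        if stale or recompiled & set(deps.get(mod, ())):
            recompiled.add(mod)
            plan.append((mod, "compile"))
        else:
            plan.append((mod, "read"))
    return plan

deps = {"English": {"EnglishSums", "ResEng"}, "EnglishSums": {"ResEng"}, "ResEng": set()}
gf_times = {"ResEng": 10, "EnglishSums": 1, "English": 1}
gfc_times = {"ResEng": 5, "EnglishSums": 2, "English": 2}
print(plan_build(deps, gf_times, gfc_times))
```

Because the modules are visited in topological order, recompilation propagates to everything that depends on a changed module, directly or indirectly.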

Inside the GF shell, also time stamps of modules read into memory are taken into account. Thus a module need not be read from a file if the module is in the memory and the file has not been modified. If the compilation of a grammar fails at some module, the state of the GF shell contains all modules read up to that point. This makes it faster to compile the faulty module again after fixing it.

Use the command po = print_options to see what modules are in the state.

To force compilation:

The flag -src in the import command forces compilation from source even if more recent object files exist. This is useful when testing new versions of GF.
The flag -retain in the import command forces reading in gfr files in addition to gfc files. This is useful when testing operations with the cc command.

Module search paths

Modules can reside in different directories. Use the path flag to extend the directory search path. For instance,

  -path=.:../resource/russian:../prelude

enables files to be found in three different directories. By default, only the current directory is included. If a path flag is given, the current directory . must be explicitly included if it is wanted.

The path flag can be set in any of the following places:

when invoking GF: gf -path=xxx
when importing a module: i -path=xxx Foo.gf
as a pragma in a topmost file: --# -path=xxx

A flag set on a command line overrides ones set in files.

The value of the environment variable GF_LIB_PATH is appended to the user-given path.
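The lookup rules of this section can be sketched in Python. The function names are hypothetical, and one point is an assumption: the value of GF_LIB_PATH is treated here as one more directory at the end of the list, which is only one possible reading of "appended".

```python
import os

def effective_path(cmdline_path=None, pragma_path=None, lib_path=None):
    """Return the list of directories to search, in order."""
    # a flag set on a command line overrides one set in a file
    chosen = cmdline_path if cmdline_path is not None else pragma_path
    # by default, only the current directory is included
    dirs = chosen.split(":") if chosen else ["."]
    if lib_path:
        dirs.append(lib_path)   # assumption: GF_LIB_PATH adds one trailing directory
    return dirs

def find_module(module, extension, dirs, exists=os.path.exists):
    """Return the first file named module.extension found along dirs, or None."""
    for d in dirs:
        candidate = os.path.join(d, module + "." + extension)
        if exists(candidate):
            return candidate
    return None

dirs = effective_path(".:../resource/russian:../prelude",
                      lib_path=os.environ.get("GF_LIB_PATH"))
print(dirs)
print(find_module("Foo", "gf", dirs))
```

The exists parameter only makes the lookup easy to try out without touching the file system.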

To do

Testing

Documentation

Packaging

Nasty details

Readline in Solaris

Proper treatment of file search paths

Unicode fonts in GUIs

Directionality of Semitic alphabets