GF Resource Grammar Library

Second Version, Gothenburg, 1 March 2005
First Draft, Gothenburg, 7 February 2005

Aarne Ranta

aarne@cs.chalmers.se

GF = Grammatical Framework

A grammar formalism based on functional programming and type theory.

Designed to be nice for ordinary programmers to use.

Mission: to make natural-language applications available for ordinary programmers, in tasks like

software documentation
domain-specific translation
human-computer interaction
dialogue systems

Thus not primarily another theoretical framework for linguists.

Multilingual grammars

Abstract syntax: language-independent representation

  cat Prop ; Nat ;
  fun Even : Nat -> Prop ;

Concrete syntax: mapping from abstract syntax trees to strings in a language (English, French, German, Swedish,...)

  lin Even x = {s = x.s ++ "is" ++ "even"} ; 

  lin Even x = {s = x.s ++ "est" ++ "pair"} ;

  lin Even x = {s = x.s ++ "ist" ++ "gerade"} ;

  lin Even x = {s = x.s ++ "är" ++ "jämnt"} ;
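
Put together as complete modules, the fragments above might look as follows. This is only a minimal sketch: the module names Arith and ArithEng and the constant Zero are hypothetical, added to make the grammar self-contained.

  abstract Arith = {
    cat Prop ; Nat ;
    fun Even : Nat -> Prop ;
    fun Zero : Nat ;               -- hypothetical constant, not in the original
  }

  concrete ArithEng of Arith = {
    lincat Prop, Nat = {s : Str} ; -- every category is a plain string record
    lin Even x = {s = x.s ++ "is" ++ "even"} ;
    lin Zero = {s = "zero"} ;
  }

A French, German, or Swedish concrete syntax would be a further module "of Arith" with the same shape, differing only in its lin rules.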

We can translate between languages via the abstract syntax.

Is it really so simple?

Difficulties with concrete syntax

Most languages have rules of inflection, agreement, and word order, which have to be obeyed when putting together expressions.

The previous multilingual grammar breaks these rules in many situations:

2 and 3 is even
la somme de 3 et de 5 est pair
wenn 2 ist gerade, dann 2+2 ist gerade
om 2 är jämnt, 2+2 är jämnt

Solving the difficulties

GF has tools for expressing the linguistic rules that are needed to produce correct translations in different languages.

Instead of just strings, we need

parameters, tables, and record types. For instance, French:

  param Mod = Ind | Subj ;
  param Gen = Masc | Fem ;

  lincat Nat = {s : Str ; g : Gen} ;
  lincat Prop = {s : Mod => Str} ;

  lin Even x = {s =
      table {
        m => x.s ++
             case m   of {Ind  => "est" ;  Subj => "soit"} ++
             case x.g of {Masc => "pair" ; Fem  => "paire"}
      }
    } ;
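
To see how the gender field drives agreement, here is a sketch of some rules that build Nat expressions. The functions Three, Five, and Sum are hypothetical additions for illustration, not part of the original grammar.

  fun Three, Five : Nat ;
  fun Sum : Nat -> Nat -> Nat ;

  lin Three = {s = "3" ; g = Masc} ;  -- numerals are masculine
  lin Five  = {s = "5" ; g = Masc} ;
  lin Sum x y =                       -- "la somme" is feminine
    {s = "la" ++ "somme" ++ "de" ++ x.s ++ "et" ++ "de" ++ y.s ; g = Fem} ;

Under these assumptions, Even (Sum Three Five) selects "paire" because the g field of the sum is Fem, so the indicative form reads "la somme de 3 et de 5 est paire".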

Language + Libraries

Writing natural language grammars still requires theoretical knowledge about the language.

Which kind of programmer is easier to find?

one who can write a sorting algorithm
one who can write a grammar for Swedish determiners

In mainstream programming, sorting algorithms are not written by hand but taken from libraries.

In the same way, we want to create grammar libraries that encapsulate basic linguistic facts.

Cf. the Java success story: the language is only half of the success; the libraries are the other half.

Example of library-based grammar writing

To define a Swedish expression of a mathematical predicate from scratch:

  lin Even x =
    let jämn = case <x.n,x.g> of {
      <Sg,Utr>   => "jämn" ;
      <Sg,Neutr> => "jämnt" ;
      <Pl,_>     => "jämna"
      }
    in
    {s = table {
      Main => x.s ! Nom ++ "är" ++ jämn ;
      Inv  => "är" ++ x.s ! Nom ++ jämn ;
      Sub  => x.s ! Nom ++ "är" ++ jämn
      }
    } ;
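
The definition above presupposes parameter types and linearization types roughly like the following. This is a sketch: the constructors Sg, Pl, Utr, Neutr, Nom, Main, Inv, and Sub come from the rule itself, whereas the type names and the extra case Gen are assumptions.

  param Number = Sg | Pl ;
  param Gender = Utr | Neutr ;
  param Case   = Nom | Gen ;          -- Gen (genitive) is assumed
  param Order  = Main | Inv | Sub ;   -- main, inverted, subordinate clause

  lincat Nat  = {s : Case => Str ; n : Number ; g : Gender} ;
  lincat Prop = {s : Order => Str} ;

All of this has to be known and written out by the grammarian when no library is available.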

To use library functions for syntax and morphology:

  lin Even = predA (regA "jämn") ;

Questions in grammar library design

What should there be in the library?

morphology, lexicon, syntax, semantics,...

How do we organize and present the library?

division into modules, level of granularity

"school grammar" vs. sophisticated linguistic concepts

Where do we get the data from?

automatic extraction or hand-writing?

reuse of existing resources?

Extra constraint: we want open-source free software.

The scope of the resource grammar library

All morphological paradigms

Basic lexicon of structural, common, and irregular words

Basic syntactic structures

Currently,

no semantics,

no language-specific structures unless they are necessary for expressivity.

Success criteria

Grammatical correctness

Semantic coverage: you can express whatever you want.

Usability as a library for non-linguists.

(Bonus for linguists:) nice generalizations w.r.t. language families, using the module system of GF.

These are not our success criteria

Language coverage: you can parse all expressions. Example: the French passé simple tense, although covered by the morphology, is not used in the language-independent API; only the passé composé is.

Semantic correctness

  colourless green ideas sleep furiously

  the time is seventy past forty-two

(Warning for linguists:) theoretical innovation in syntax (and it will all be hidden anyway!)

So where is semantics?

GF incorporates a Logical Framework and is therefore capable of expressing logical semantics à la Montague or any other flavour, including anaphora and discourse.

But we do not try to give semantics once and for all for the whole language.

Instead, we expect semantics to be given in application grammars built on semantic models of different domains.

Example application: number theory

  fun Even : Nat -> Prop ;         -- a mathematical predicate

  lin Even = predA (regA "even") ; -- English translation
  lin Even = predA (regA "pair") ; -- French translation
  lin Even = predA (regA "jämn") ; -- Swedish translation

How could the resource predict that just these translations are correct in this domain?

Application grammars are built by experts in these domains who, thanks to resource grammars, no longer need to be experts in linguistics.

Languages

The current GF Resource Project covers ten languages:

Danish
English
Finnish
French
German
Italian
Norwegian
Russian
Spanish
Swedish

The first three letters of each language name (Dan, Eng, etc.) are used in grammar module names.

Library structure 1: language-independent API

syntactic Categories (parts of speech, word classes), e.g.

  V ; NP ; CN ; Det ;  -- verb, noun phrase, common noun, determiner

Rules for combining words and phrases, e.g.

  DetNP : Det -> CN -> NP ; -- combine Det and CN into NP

the most common Structural words (determiners, conjunctions, pronouns), e.g.

  and_Conj : Conj ;

Library structure 2: language-dependent modules

morphological Paradigms, e.g.

  mkN : Str -> Str -> Str -> Str -> Gender -> N ; -- worst-case nouns
  mkN : Str -> N ;                                -- regular nouns

irregular Verbs, e.g.

  angripa_V = irregV "angripa" "angrep" "angripit" ;

Lexicon of frequent words

  man_N = mkN "man" "mannen" "män" "männen" masculine ;

Extended syntax with language-specific rules

  PassBli : V2 -> NP -> VP ;  -- bli överkörd av ngn

How much can be language-independent?

For the ten languages we have considered, it is possible to implement the current API.

Reservations:

does not necessarily extend to all other languages
does not necessarily cover the most idiomatic expressions of each language
may not be the easiest API to implement (e.g. negation and inversion with do in English suggest that some other structure would be more natural)

does not guarantee that the same structure has the same semantics in different languages

Library structure: language-independent API

Library structure: test bed for the language-independent API

API documentation

Paradigms documentation

English paradigms
example use of English paradigms
English verbs

French paradigms
example use of French paradigms
French verbs

Italian paradigms
example use of Italian paradigms
Italian verb conjugations

Norwegian paradigms
example use of Norwegian paradigms
Norwegian verbs

Spanish paradigms
Spanish verb conjugations

Swedish paradigms
example use of Swedish paradigms
Swedish verbs

Use as top-level grammar: testing

Import a set of LangX grammars:

  i english/LangEng.gf
  i swedish/LangSwe.gf

Test with random generation, translation, morphological analysis...

Use as top-level grammar: language learning quizzes

Morpho quiz with words:

Morpho quiz with phrases:

Translation quiz with sentences:

Use as library

Import directly by open:

  concrete AppNor of App = open LangNor, ParadigmsNor in {...}

No more dummy reuse modules and bulky .gfr files!

If you need to convert resource category records to/from strings, use

  Predef.toStr : (L : Type) -> L -> Str ;

L must be a linearization type. For instance,

  toStr LangNor.CN (ModAP (PositADeg old_ADeg) (UseN car_N))
  ---> "gammel bil"

Use as library through parser

Use the parser when developing a resource.

  > p -cat=S -v "jag ska åka till Chalmers"
  unknown tokens [TS "åka",TS "Chalmers"]

  > p -cat=S "jag ska gå till Danmark"
  UseCl (PosTP TFuture ASimul)
    (AdvCl (SPredV i_NP go_V)
    (AdvPP (PrepNP to_Prep (UsePN (PNCountry Denmark)))))

Extend the vocabulary as needed.

  åka_V = lexV "åker" ; 
  Chalmers = regPN "Chalmers" neutrum ;

Example application: a small translation system

You can say things like the following:

  who chases mice ?
  whom does the lion chase ?
  the dog chases cats

Source modules:

Compiling the example application

The resources are bulky, and it therefore takes a lot of time and memory to load the grammars. However, they can be compiled into the gfcm (GF canonical multilingual) format, which is almost one thousand times smaller and faster to load for this set of grammars.

Just issue the following GF commands:

  i -src AnimalsEng.gf ;; s
  i -src AnimalsFre.gf ;; s
  i -src AnimalsSwe.gf ;; s
  pm | wf animals.gfcm

and you get an end-user grammar animals.gfcm.

You can also write the commands in a gfs (GF script) file, say mkAnimals.gfs, and then call GF with

  gf <mkAnimals.gfs

Further simplifications of the application grammar

Step 1: use a simplified access to present-tense sentences, SentenceX (to be written...)

Step 2: factor out the categories and purely combinational rules into an incomplete module (to be shown... but this does not work for French, which uses different structures: e.g. Qui aime les lions ? with a definite phrase where English has Who loves lions?)

Implementation details: the structure of low-level files

The use of parametrized modules

In two language families:

Romance: French, Italian, Spanish
Scandinavian: Danish, Norwegian, Swedish

Current status

Language    v0.6   API   Paradigms   Basic lex   Verbs
Danish             X
English     X      X     X           X           X
Finnish     X
French      X      X     X           X           X
German      X            *
Italian     X      X     X           X           X
Norwegian          X     X           X           X
Russian     X      *     *
Spanish            X     X                       X
Swedish     X      X     X           X           X

Obtaining it

Now on CVS at Chalmers:

  cvs -d /users/cs/aarne/cvs checkout GF2.0/lib

To appear later at GF Homepage:

http://www.cs.chalmers.se/~aarne/GF