The GF Resource Grammar Library Version 1.0

Author: Aarne Ranta <aarne (at) cs.chalmers.se>
Last update: Wed Mar 8 08:31:08 2006

Plan

Purpose

Background

Coverage

Structure

How to use

How to implement a new language

How to extend the API

Purpose

Library for applications

High-level access to grammatical rules

E.g. "You have k new messages" rendered in each of ten languages X

    render X (Have (You (Number (k (New Message)))))

Usability for different purposes

translation systems
software localization
dialogue systems
language teaching

Grammar as parser

Often in NLP, a grammar is just high-level code for a parser.

But writing a grammar can be inadequate for parsing:

too much manual work
too inefficient
not robust
too ambiguous

Moreover, a grammar fine-tuned for parsing may not be reusable

for generation
for specialized grammars
as library

Grammar as language definition

Linguistic ontology: abstract syntax

E.g. adjectival modification

    AdjCN : AP -> CN -> CN ;

Rendering in different languages: concrete syntax

Resource grammars take the perspective of generation rather than parsing

abstract syntax serves as a key to expressions in different languages
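
As a minimal sketch of this split, the toy grammar below has one abstract syntax and two concrete syntaxes that differ only in word order. The module names Toy, ToyEng and ToyFre and the two lexical constants are hypothetical and not part of the library. Agreement is ignored, so this is far simpler than what the library's own AdjCN has to do. Each module would go in a file of its own (Toy.gf, ToyEng.gf, ToyFre.gf).

    -- hypothetical toy modules, for illustration only
    abstract Toy = {
      cat AP ; CN ;
      fun
        AdjCN    : AP -> CN -> CN ;   -- same type as in the resource API
        red_AP   : AP ;
        house_CN : CN ;
    }

    concrete ToyEng of Toy = {
      lincat AP, CN = {s : Str} ;
      lin
        AdjCN ap cn = {s = ap.s ++ cn.s} ;   -- adjective before noun
        red_AP      = {s = "red"} ;
        house_CN    = {s = "house"} ;
    }

    concrete ToyFre of Toy = {
      lincat AP, CN = {s : Str} ;
      lin
        AdjCN ap cn = {s = cn.s ++ ap.s} ;   -- adjective after noun
        red_AP      = {s = "rouge"} ;
        house_CN    = {s = "maison"} ;
    }

The single tree AdjCN red_AP house_CN is then the key to both "red house" (ToyEng) and "maison rouge" (ToyFre).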

Usability by non-linguists

Division of labour: resource grammars hide linguistic details

AdjCN : AP -> CN -> CN hides agreement, word order,...

Presentation: "school grammar" concepts, dictionary-like conventions

    bird_N = reg2N "Vogel" "Vögel" masculine

API = Application Programmer's Interface

Documentation: gfdoc

produces html from gf

IDE = Interactive Development Environment (forthcoming)

library browser and syntax editor for grammar writing

Example-based grammar writing

    render Ita (parse Eng "you have k messages")

Scientific interest

Linguistics

definition of linguistic ontology
describing language on this level of abstraction
coping with different problems in different languages
sharing concrete-syntax code between languages
creating a resource for other NLP applications

Computer science

datastructures for grammar rules
type systems for grammars
algorithms: parsing, generation, grammar compilation
domain-specific programming language (GF)
module system

Background

History

2002: v. 0.2

English, French, German, Swedish

2003: v. 0.6

module system
added Finnish, Italian, Russian
used in KeY

2005: v. 0.9

tenses
added Danish, Norwegian, Spanish; no German
used in WebALT

2006: v. 1.0

approximate CLE coverage
reorganized module system and implementation
not yet (4/3/2006) for Danish and Russian

Authors

Janna Khegai (Russian modules, forthcoming), Björn Bringert (many Swadesh lexica), Inger Andersson and Therese Söderberg (Spanish morphology), Ludmilla Bogavac (Russian morphology), Carlos Gonzalia (Spanish cardinals), Patrik Jansson (Swedish cardinals), Aarne Ranta.

We are grateful to several other people who have used this and previous versions of the resource library for their contributions and comments, including Ana Bove, David Burke, Lauri Carlson, Gloria Casanellas, Karin Cavallin, Hans-Joachim Daniels, Kristofer Johannisson, Anni Laine, Wanjiku Ng'ang'a, Jordi Saludes.

Related work

CLE (Core Language Engine, Book 1992)

English, Swedish, French, Danish
uses Definite Clause Grammars, implementation in Prolog
coverage for SACTI corpus, Spoken Language Translator (2001)
grammar specialization via explanation-based learning

Slightly less related work

LinGO Grammar Matrix

English, German, Japanese, Spanish, ...
uses HPSG, implementation in LKB
a check list for parallel grammar implementations

Pargram

Aimed: Arabic, Chinese, English, French, German, Hungarian, Japanese, Malagasy, Norwegian, Turkish, Urdu, Vietnamese, and Welsh
uses LFG
one set of big grammars, transfer rules

Rosetta Machine Translation (Book 1994)

Dutch, English, French
uses M-grammars, compositional translation inspired by Montague
compositional transfer rules

Coverage

Languages

The current GF Resource Project covers ten languages:

Danish
English
Finnish
French
German
Italian
Norwegian (bokmål)
Russian
Spanish
Swedish

In addition, parts (morphology) of Arabic, Estonian, Latin, and Urdu

API 1.0 not yet implemented for Danish and Russian

Morphology and lexicon

Complete inflection engine

all word classes
all forms
all inflectional paradigms

Basic lexicon

100 structural words
340 content words, mainly for testing
these include the 207 Swadesh words

It is more important to enable lexicon extensions than to provide a huge lexicon.

technical lexica can have very special words, which tend to be regular

Syntactic structures

Texts: sequences of phrases with punctuation

Phrases: declaratives, questions, imperatives, vocatives

Tense, mood, and polarity: present, past, future, conditional ; simultaneous, anterior ; positive, negative

Questions: yes-no, "wh" ; direct, indirect

Clauses: main, relative, embedded (subject, object, adverbial)

Verb phrases: intransitive, transitive, ditransitive, prepositional

Noun phrases: proper names, pronouns, determiners, possessives, cardinals and ordinals

Quantitative measures

67 categories

150 abstract syntax combination rules

100 structural words

340 content words in a test lexicon

35 kLines of source code (4/3/2006):

    abstract     1131
    english      2344
    german       2386
    finnish      3396
    norwegian    1257
    swedish      1465
    scandinavian 1023
    french       3246 -- Besch + Irreg + Morpho 2111
    italian      7797 -- Besch 6512
    spanish      7120 -- Besch 5877
    romance      1066

Structure of the API

Language-independent ground API

The structure of a text sentence

  John walks.
  
  TFullStop              : Phr -> Text -> Text              | TQuestMark, TExclMark
    (PhrUtt              : PConj -> Utt -> Voc -> Phr       | PhrYes, PhrNo, ...
      NoPConj                                               | but_PConj, ...
      (UttS              : S -> Utt                         | UttQS, UttImp, UttNP, ...
        (UseCl           : Tense -> Anter -> Pol -> Cl -> S
          TPres              
          ASimul 
          PPos 
          (PredVP        : NP -> VP -> Cl                   | ImpersNP, ExistNP, ...
            (UsePN       : PN -> NP 
              john_PN) 
            (UseV        : V  -> VP                         | ComplV2, UseComp, ...
              walk_V)))) 
      NoVoc)                                                | VocNP, please_Voc, ...
    TEmpty
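
Written on one line, the same tree can be handed to the linearizer. The session below is only a sketch: it assumes the compiled langs.gfcm described under Compiling, that john_PN and walk_V are in its lexicon, and that the l command accepts a tree as its argument.

    > i -nocf langs.gfcm
    > l -multi TFullStop (PhrUtt NoPConj (UttS (UseCl TPres ASimul PPos (PredVP (UsePN john_PN) (UseV walk_V)))) NoVoc) TEmpty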

Structure in syntax editor

Language-dependent paradigm modules

Regular paradigms

Every language implements these regular patterns that take "dictionary forms" as arguments.

    regN : Str -> N
    regA : Str -> A 
    regV : Str -> V

Their usefulness varies. For instance, they are all quite good in Finnish and English. In Swedish, less so:

    regN "val" ---> val, valen, valar, valarna

Initializing a lexicon with regXs is usually a good starting point in grammar development.

Regular paradigms

In Swedish, giving the gender of the noun improves the result a lot

    regGenN "val" neutrum ---> val, valet, val, valen

There are also special constructs taking other forms:

    mk2N : (nyckel,nycklar : Str) -> N
  
    mk1N : (bilarna : Str) -> N
  
    irregV : (dricka, drack, druckit : Str) -> V

Regular verbs are actually implemented the Lexin way

    regV : (talar : Str) -> V
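
A lexicon initialized with these paradigms could look as follows. This is a sketch under assumptions: the module names MiniLex and MiniLexSwe and the three constants are hypothetical, while Cat, CatSwe and ParadigmsSwe are taken to be the library modules that provide N, V and the paradigm functions shown above.

    -- hypothetical modules, one per file (MiniLex.gf, MiniLexSwe.gf)
    abstract MiniLex = Cat ** {
      fun
        whale_N, election_N : N ;
        speak_V : V ;
    }

    concrete MiniLexSwe of MiniLex = CatSwe ** open ParadigmsSwe in {
      lin
        whale_N    = regN "val" ;             -- val, valen, valar, valarna
        election_N = regGenN "val" neutrum ;  -- val, valet, val, valen
        speak_V    = regV "talar" ;           -- present tense form as argument
    }

The two nouns share the dictionary form "val" and differ only in gender, which is why the gender argument is needed for the second one.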

Worst-case paradigms

To cover all situations, worst-case paradigms are given. E.g. Swedish

    mkN : (apa,apan,apor,aporna : Str) -> N
  
    mkA : (liten, litet, lilla, små, mindre, minst, minsta : Str) -> A
  
    mkV : (supa,super,sup,söp,supit,supen : Str) -> V
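
A worst-case function is called with all its forms spelled out. As a sketch, following the cc usage shown under Morphology and assuming that mkN is exported by ParadigmsSwe with the signature above, the noun "man" (man, mannen, män, männen), which the regular paradigms would get wrong, can be inspected like this:

    gf -retain -path=alltenses:prelude alltenses/ParadigmsSwe.gfr
    > cc mkN "man" "mannen" "män" "männen"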

Irregular words

Irregular words are collected in IrregX, e.g. Swedish:

      draga_V : V = 
        mkV (variants { "dra"; "draga"}) (variants { "drar" ; "drager"}) 
            (variants { "dra" ; "drag" }) "drog" "dragit" "dragen" ;

Goal: eliminate the user's need of worst-case functions.

Language-dependent syntax extensions

Syntactic structures that are not shared by all languages.

Not implemented yet.

Candidates:

Nor post-possessives: bilen min
Fre question forms: est-ce que tu dors ?

Special-purpose APIs

Mathematical

Multimodal

Present

Minimal

Shallow

How to use the resource as top-level grammar

Compiling

It is a good idea to compile the library, so that it can be opened faster

    GF/lib/resource-1.0% make
  
    writes GF/lib/alltenses
           GF/lib/present
           GF/lib/resource-1.0/langs.gfcm

If you don't intend to change the library, you never need to process the source files again. Just run one of

    gf -nocf langs.gfcm                                    -- all 8 languages
   
    gf -nocf -path=alltenses:prelude alltenses/LangSwe.gfc -- Swedish only
  
    gf -nocf -path=alltenses:prelude present/LangSwe.gfc   -- Swedish only, present tense only

Parsing

The default parser does not work!

The MCFG parser works in some languages, after a wait of approximately 20 seconds

    p -mcfg -lang=LangEng -cat=S "I would see her"
  
    p -mcfg -lang=LangSwe -cat=S "jag skulle se henne"

Parsing in present/ versions is quicker.
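
Parsing and linearization can be chained with a pipe, in the same way as the treebank commands are chained elsewhere in these notes. The line below is a sketch that assumes langs.gfcm is loaded and that l accepts a -lang flag in the same way as p does:

    p -mcfg -lang=LangEng -cat=S "I would see her" | l -lang=LangSwe   -- parse English, linearize the trees in Swedish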

Treebank generation

Multilingual treebank entry = tree + linearizations

Some examples on treebank generation, assuming langs.gfcm

    gr -cat=S   -number=10 -cf | tb                  -- 10 random S
  
    gt -cat=Phr -depth=4       | tb -xml | wf ex.xml -- all Phr to depth 4, into file ex.xml

Regression testing

    rf ex.xml | tb -c      -- read treebank from file and compare to present grammars

Updating a treebank

    rf old.xml | tb -trees | tb -xml | wf new.xml    -- read old from file, write new to file

Treebank-based parsing

Brute-force method that helps if real parsing is more expensive.

    make treebank                     -- make treebank with all languages
  
    gf -treebank langs.xml            -- start GF by reading the treebank
  
    > ut -strings -treebank=LangIta   -- show all Ita strings
  
    > ut -treebank=LangIta -raw "Quello non si romperebbe" -- look up a string
  
    > i -nocf langs.gfcm              -- read grammar to be able to linearize
  
    > ut -treebank=LangIta "Quello non si romperebbe" | l -multi  -- translate to all

Morphology

Use morphological analyser

    gf -nocf -retain -path=alltenses:prelude alltenses/LangSwe.gf
    > ma "jag kan inte höra vad du säger"

Try out a morphology quiz

    > mq -cat=V

Try out inflection patterns

    gf -retain -path=alltenses:prelude alltenses/ParadigmsSwe.gfr
    > cc regV "lyser"

Syntax editing

We start a demo by

  gfeditor langs.gfcm

Efficient parsing via application grammar

Get rid of discontinuous constituents

Examples: mathematical/Predication, examples/bronzeage

How to use as library

Specialization through parametrized modules

The application grammar is implemented with reference to the resource API

Individual languages are instantiations

Example: tram
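
The idea can be sketched with a much smaller grammar than tram. The module names Hello, HelloI, HelloEng and HelloSwe are hypothetical; the sketch assumes that Lang exposes the functions used in the sentence tree shown earlier, including john_PN and walk_V, and that an abstract/concrete pair such as Lang and LangEng can be used to instantiate a parametrized module.

    -- hypothetical application grammar, one module per file
    abstract Hello = {
      cat Msg ;
      fun JohnWalks : Msg ;
    }

    -- parametrized module: written once against the resource API
    incomplete concrete HelloI of Hello = open Lang in {
      lincat Msg = Phr ;
      lin JohnWalks =
        PhrUtt NoPConj
          (UttS (UseCl TPres ASimul PPos
             (PredVP (UsePN john_PN) (UseV walk_V))))
          NoVoc ;
    }

    -- individual languages are instantiations
    concrete HelloEng of Hello = HelloI with (Lang = LangEng) ;
    concrete HelloSwe of Hello = HelloI with (Lang = LangSwe) ;

Adding a language then costs one line, as long as the resource already covers it.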

Compile-time transfer

Instead of parametrized modules:

select resource functions differently for different languages

Example: imperative vs. infinitive in mathematical exercises

A natural division into modules

Lexicon in language-dependent modules

Combination rules in a parametrized module

Example-based grammar writing

Example: animal

How to implement a new language

Ordinary modules

Parametrized modules

The kernel of the API

How to proceed

How to extend the API

Extend old modules or add a new one?