The GF Resource Grammar Library Version 1.0
Author: Aarne Ranta <aarne (at) cs.chalmers.se>
Last update: %%date(%c)

% NOTE: this is a txt2tags file.
% Create an html file from this file using:
% txt2tags --toc clt2006.txt

%!target:html

%!postproc(html): #NEW <!-- NEW -->


#NEW 

==Plan==

Purpose

Background

Coverage

Structure

How to use

How to implement a new language

How to extend the API


#NEW

==Purpose==

===Library for applications===

High-level access to grammatical rules

E.g. //You have k new messages// rendered in ten languages //X//
```
  render X (Have (You (Number (k (New Message)))))
```

Usability for different purposes
- translation systems
- software localization
- dialogue systems
- language teaching


#NEW

===Grammar as parser===

Often in NLP, a grammar is just high-level code for a parser.

But writing a grammar can be an inadequate way to build a parser:
- too much manual work
- too inefficient
- not robust
- too ambiguous


Moreover, a grammar fine-tuned for parsing may not be reusable
- for generation
- for specialized grammars
- as library


#NEW

===Grammar as language definition===

Linguistic ontology: **abstract syntax**

E.g. adjectival modification
```
  AdjCN : AP -> CN -> CN ;
```

Rendering in different languages: **concrete syntax**
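
A minimal sketch of what this means, using a self-contained toy grammar rather than the resource library itself. The module names ``Toy``, ``ToyEng``, ``ToyFre`` and the lexical items are made up for illustration; only the type of ``AdjCN`` is taken from above. Each module would go in its own file.
```
abstract Toy = {
  cat AP ; CN ;
  fun
    AdjCN    : AP -> CN -> CN ;  -- adjectival modification
    red_AP   : AP ;
    house_CN : CN ;
}

concrete ToyEng of Toy = {
  lincat AP, CN = {s : Str} ;
  lin
    AdjCN ap cn = {s = ap.s ++ cn.s} ;  -- adjective before noun
    red_AP      = {s = "red"} ;
    house_CN    = {s = "house"} ;
}

concrete ToyFre of Toy = {
  lincat AP, CN = {s : Str} ;
  lin
    AdjCN ap cn = {s = cn.s ++ ap.s} ;  -- adjective after noun
    red_AP      = {s = "rouge"} ;
    house_CN    = {s = "maison"} ;
}
```
The single tree ``AdjCN red_AP house_CN`` is the key to both renderings; the toy ignores agreement.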

Resource grammars take the generation perspective rather than the parsing perspective
- abstract syntax serves as a key to expressions in different languages


#NEW

===Usability by non-linguists===

Division of labour: resource grammars hide linguistic details
- ``AdjCN : AP -> CN -> CN`` hides agreement, word order,...


Presentation: "school grammar" concepts, dictionary-like conventions
```
  bird_N = reg2N "Vogel" "Vögel" masculine
```
API = Application Programmer's Interface

Documentation: ``gfdoc`` 
- produces html from gf


IDE = Interactive Development Environment (forthcoming)
- library browser and syntax editor for grammar writing


Example-based grammar writing
```
  render Ita (parse Eng "you have k messages")
```


#NEW

===Scientific interest===

Linguistics
- definition of linguistic ontology
- describing language on this level of abstraction
- coping with different problems in different languages
- sharing concrete-syntax code between languages
- creating a resource for other NLP applications


Computer science
- data structures for grammar rules
- type systems for grammars
- algorithms: parsing, generation, grammar compilation
- domain-specific programming language (GF)
- module system


#NEW

==Background==

===History===

2002: v. 0.2
- English, French, German, Swedish


2003: v. 0.6
- module system
- added Finnish, Italian, Russian
- used in KeY


2005: v. 0.9 
- tenses
- added Danish, Norwegian, Spanish; no German
- used in WebALT


2006: v. 1.0
- approximate CLE coverage
- reorganized module system and implementation
- not yet (4/3/2006) for Danish and Russian


#NEW

===Authors===

Janna Khegai (Russian modules, forthcoming),
Björn Bringert (many Swadesh lexica),
Inger Andersson and Therese Söderberg (Spanish morphology),
Ludmilla Bogavac (Russian morphology),
Carlos Gonzalia (Spanish cardinals), 
Patrik Jansson (Swedish cardinals),
Aarne Ranta.

We are grateful for contributions and
comments from several other people who have used this and
the previous versions of the resource library, including
Ana Bove,
David Burke,
Lauri Carlson,
Gloria Casanellas,
Karin Cavallin,
Hans-Joachim Daniels,
Kristofer Johannisson,
Anni Laine,
Wanjiku Ng'ang'a,
Jordi Saludes.


#NEW

===Related work===

CLE (Core Language Engine, 
[Book 1992 http://mitpress.mit.edu/catalog/item/default.asp?tid=7739&ttype=2])
- English, Swedish, French, Danish
- uses Definite Clause Grammars, implementation in Prolog
- coverage for the ATIS corpus, 
  [Spoken Language Translator (2001) http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=0521770777]
- grammar specialization via explanation-based learning


#NEW

===Slightly less related work===

[LinGO Grammar Matrix http://www.delph-in.net/matrix/]
- English, German, Japanese, Spanish, ...
- uses HPSG, implementation in LKB
- a check list for parallel grammar implementations


[Pargram http://www2.parc.com/istl/groups/nltt/pargram/]
- Aimed: Arabic, Chinese, English, French, German, Hungarian, Japanese, 
Malagasy, Norwegian, Turkish, Urdu, Vietnamese, and Welsh
- uses LFG
- one set of big grammars, transfer rules


Rosetta Machine Translation ([Book 1994 http://citeseer.ist.psu.edu/181924.html])
- Dutch, English, French
- uses M-grammars, compositional translation inspired by Montague
- compositional transfer rules


#NEW

==Coverage==

===Languages===

The current GF Resource Project covers ten languages:
- ``Dan``ish
- ``Eng``lish
- ``Fin``nish
- ``Fre``nch
- ``Ger``man
- ``Ita``lian
- ``Nor``wegian (bokmål)
- ``Rus``sian
- ``Spa``nish
- ``Swe``dish


In addition, parts (morphology) of Arabic, Estonian, Latin, and Urdu

API 1.0 not yet implemented for Danish and Russian


#NEW

===Morphology and lexicon===

Complete inflection engine
- all word classes
- all forms
- all inflectional paradigms


Basic lexicon
- 100 structural words
- 340 content words, mainly for testing
- these include the 207 [Swadesh words http://en.wiktionary.org/wiki/Swadesh_List]


It is more important to enable lexicon extensions than to 
provide a huge lexicon.
- technical lexica can have very special words, which tend to be regular


#NEW

===Syntactic structures===

Texts: 
sequences of phrases with punctuation

Phrases: 
declaratives, questions, imperatives, vocatives

Tense, mood, and polarity: 
present, past, future, conditional ; simultaneous, anterior ; positive, negative

Questions: 
yes-no, "wh" ; direct, indirect

Clauses: 
main, relative, embedded (subject, object, adverbial)

Verb phrases: 
intransitive, transitive, ditransitive, prepositional

Noun phrases: 
proper names, pronouns, determiners, possessives, cardinals and ordinals


#NEW

===Quantitative measures===

67 categories

150 abstract syntax combination rules

100 structural words

340 content words in a test lexicon

35 kLines of source code (4/3/2006):
```
  abstract     1131
  english      2344
  german       2386
  finnish      3396
  norwegian    1257
  swedish      1465
  scandinavian 1023
  french       3246 -- Besch + Irreg + Morpho 2111
  italian      7797 -- Besch 6512
  spanish      7120 -- Besch 5877
  romance      1066
```


#NEW

==Structure of the API==

===Language-independent ground API===

[Lang.png]


#NEW

===The structure of a text sentence===

```
John walks.

TFullStop              : Phr -> Text -> Text              | TQuestMark, TExclMark
  (PhrUtt              : PConj -> Utt -> Voc -> Phr       | PhrYes, PhrNo, ...
    NoPConj                                               | but_PConj, ...
    (UttS              : S -> Utt                         | UttQS, UttImp, UttNP, ...
      (UseCl           : Tense -> Anter -> Pol -> Cl -> S
        TPres              
        ASimul 
        PPos 
        (PredVP        : NP -> VP -> Cl                   | ImpersNP, ExistNP, ...
          (UsePN       : PN -> NP 
            john_PN) 
          (UseV        : V  -> VP                         | ComplV2, UseComp, ...
            walk_V)))) 
    NoVoc)                                                | VocNP, please_Voc, ...
  TEmpty
```
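
Written on one line, the same tree can be given to the linearizer. A sketch, assuming ``langs.gfcm`` has been imported and that ``john_PN`` and ``walk_V`` are in the test lexicon; ``l -multi`` linearizes into all languages of the grammar.
```
  > l -multi TFullStop (PhrUtt NoPConj (UttS (UseCl TPres ASimul PPos (PredVP (UsePN john_PN) (UseV walk_V)))) NoVoc) TEmpty
```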

#NEW

===Structure in syntax editor===

[editor.png]


#NEW

===Language-dependent paradigm modules===

====Regular paradigms====

Every language implements these regular patterns that take
"dictionary forms" as arguments.
```
  regN : Str -> N
  regA : Str -> A 
  regV : Str -> V
```
Their usefulness varies. For instance, they
are all quite good in Finnish and English.
In Swedish, less so:
```
  regN "val" ---> val, valen, valar, valarna
```
Initializing a lexicon with ``regX``s is
usually a good starting point in grammar development.
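
A sketch of such an initial lexicon for English. The module name ``TestLexEng`` and the chosen words are made up, and the sketch assumes that the categories ``N``, ``A``, ``V`` can be opened from ``CatEng`` and the paradigms from ``ParadigmsEng``.
```
resource TestLexEng = open CatEng, ParadigmsEng in {
  oper
    dog_N  : N = regN "dog" ;
    warm_A : A = regA "warm" ;
    walk_V : V = regV "walk" ;
}
```
Entries that turn out to be irregular can later be replaced one by one with more specific paradigms.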


#NEW

====Regular paradigms====

In Swedish, giving the gender of an ``N`` improves the result a lot
```
  regGenN "val" neutrum ---> val, valet, val, valen
```

There are also special constructs taking other forms:
```
  mk2N : (nyckel,nycklar : Str) -> N

  mk1N : (bilarna : Str) -> N

  irregV : (dricka, drack, druckit : Str) -> V
```

Regular verbs are actually implemented the 
[Lexin http://lexin.nada.kth.se/sve-sve.shtml] way
```
  regV : (talar : Str) -> V
```
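
These functions can be tried out with ``cc``. A sketch of a session, assuming they are all exported by ``ParadigmsSwe``; the arguments are taken from the type signatures above, and the output is omitted because it depends on the library version.
```
  gf -retain -path=alltenses:prelude alltenses/ParadigmsSwe.gfr
  > cc regGenN "val" neutrum
  > cc mk2N "nyckel" "nycklar"
  > cc irregV "dricka" "drack" "druckit"
```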


#NEW

====Worst-case paradigms====

To cover all situations, worst-case paradigms are given. E.g. Swedish
```
  mkN : (apa,apan,apor,aporna : Str) -> N

  mkA : (liten, litet, lilla, små, mindre, minst, minsta : Str) -> A

  mkV : (supa,super,sup,söp,supit,supen : Str) -> V
```
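
How a regular paradigm relates to a worst-case one can be sketched with a self-contained toy module. This is not the library's implementation: the names ``ToyNounSwe``, ``Noun``, ``mkNoun`` and ``decl1Noun`` and the parameter types are assumptions made for illustration, and gender is left out.
```
resource ToyNounSwe = open (Predef = Predef) in {
  param
    Number  = Sg | Pl ;
    Species = Indef | Def ;
  oper
    Noun : Type = {s : Number => Species => Str} ;

    -- worst case: all four forms are given explicitly
    mkNoun : (apa,apan,apor,aporna : Str) -> Noun =
      \apa,apan,apor,aporna -> {
        s = table {
          Sg => table {Indef => apa  ; Def => apan} ;
          Pl => table {Indef => apor ; Def => aporna}
        }
      } ;

    -- regular pattern for nouns like "apa": drop the final
    -- character and add the endings to the remaining stem
    decl1Noun : Str -> Noun = \apa ->
      let ap = Predef.tk 1 apa
      in mkNoun apa (apa + "n") (ap + "or") (ap + "orna") ;
}
```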


#NEW

====Irregular words====

Irregular words in ``IrregX``, e.g. Swedish:
```
    draga_V : V = 
      mkV (variants { "dra"; "draga"}) (variants { "drar" ; "drager"}) 
          (variants { "dra" ; "drag" }) "drog" "dragit" "dragen" ;
```
Goal: eliminate the user's need for worst-case functions.


#NEW

===Language-dependent syntax extensions===

Syntactic structures that are not shared by all languages.

Not implemented yet.

Candidates:
- ``Nor`` post-possessives: ``bilen min``
- ``Fre`` question forms: ``est-ce que tu dors ?``


#NEW

===Special-purpose APIs===

Mathematical

Multimodal

Present

Minimal

Shallow


#NEW

==How to use the resource as top-level grammar==

===Compiling===

It is a good idea to compile the library, so that it can be opened faster
```
  GF/lib/resource-1.0% make

  writes GF/lib/alltenses
         GF/lib/present
         GF/lib/resource-1.0/langs.gfcm
```
If you don't intend to change the library, you never need to process the source
files again. Just run one of the following:
```
  gf -nocf langs.gfcm                                    -- all 8 languages
 
  gf -nocf -path=alltenses:prelude alltenses/LangSwe.gfc -- Swedish only

  gf -nocf -path=alltenses:prelude present/LangSwe.gfc   -- Swedish only, present tense only
```


#NEW

===Parsing===

The default parser does not work!

The MCFG parser works for some languages, after a wait of approx. 20 seconds
```
  p -mcfg -lang=LangEng -cat=S "I would see her"

  p -mcfg -lang=LangSwe -cat=S "jag skulle se henne"
```
Parsing in ``present/`` versions is quicker.
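
The parser output can be piped to the linearizer to get translations. A sketch, assuming ``langs.gfcm`` is loaded and the sentence is parsed successfully:
```
  p -mcfg -lang=LangEng -cat=S "I would see her" | l -multi
```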


#NEW

===Treebank generation===

Multilingual treebank entry = tree + linearizations

Some examples of treebank generation, assuming ``langs.gfcm``
```
  gr -cat=S   -number=10 -cf | tb                  -- 10 random S

  gt -cat=Phr -depth=4       | tb -xml | wf ex.xml -- all Phr to depth 4, into file ex.xml
```
Regression testing
```
  rf ex.xml | tb -c      -- read treebank from file and compare to present grammars 
```
Updating a treebank
```
  rf old.xml | tb -trees | tb -xml | wf new.xml    -- read old from file, write new to file
```


#NEW

===Treebank-based parsing===

Brute-force method that helps if real parsing is more expensive.
```
  make treebank                     -- make treebank with all languages

  gf -treebank langs.xml            -- start GF by reading the treebank

  > ut -strings -treebank=LangIta   -- show all Ita strings

  > ut -treebank=LangIta -raw "Quello non si romperebbe" -- look up a string

  > i -nocf langs.gfcm              -- read grammar to be able to linearize

  > ut -treebank=LangIta "Quello non si romperebbe" | l -multi  -- translate to all
```


#NEW

===Morphology===

Use morphological analyser
```
  gf -nocf -retain -path=alltenses:prelude alltenses/LangSwe.gf
  > ma "jag kan inte höra vad du säger"
```

Try out a morphology quiz
```
  > mq -cat=V
```

Try out inflection patterns
```
  gf -retain -path=alltenses:prelude alltenses/ParadigmsSwe.gfr
  > cc regV "lyser"
```


#NEW

===Syntax editing===

We start a demo by
``` gfeditor langs.gfcm

[editor.png]


#NEW

===Efficient parsing via application grammar===

Get rid of discontinuous constituents 

Examples: ``mathematical/Predication``, ``examples/bronzeage``


#NEW

==How to use as library==

===Specialization through parametrized modules===

The application grammar is implemented with reference to
the resource API

Individual languages are instantiations

Example: [tram ../../../examples/tram/TramI.gf]
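
A sketch of the pattern, not the contents of ``TramI.gf``. The application ``Shop`` and its categories are hypothetical, and the sketch assumes that the resource functions ``PredVP``, ``UseComp`` and ``CompAP`` are available through the ``Lang`` module. Each module would go in its own file.
```
abstract Shop = {
  cat Comment ; Item ; Quality ;
  fun Pred : Item -> Quality -> Comment ;  -- "this wine is good"
}

-- the parametrized module: written once, against the resource API
incomplete concrete ShopI of Shop = open Lang in {
  lincat
    Comment = Cl ;
    Item    = NP ;
    Quality = AP ;
  lin
    Pred item quality = PredVP item (UseComp (CompAP quality)) ;
}

-- the instantiations: one line per language
concrete ShopEng of Shop = ShopI with (Lang = LangEng) ;
concrete ShopSwe of Shop = ShopI with (Lang = LangSwe) ;
```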


#NEW

===Compile-time transfer===

Instead of parametrized modules:

select resource functions differently for different languages

Example: imperative vs. infinitive in mathematical exercises


#NEW

===A natural division into modules===

Lexicon in language-dependent modules

Combination rules in a parametrized module

#NEW

===Example-based grammar writing===

Example: [animal ../../../examples/animal/QuestionsI.gfe]

#NEW

==How to implement a new language==

See [Resource-HOWTO Resource-HOWTO.html]

===Ordinary modules===

Write a concrete syntax module for each abstract module in the API

Write a ``Paradigms`` module

Examples: English, Finnish, German, Russian

#NEW

===Parametrized modules===

Examples: Romance (French, Italian, Spanish), Scandinavian (Danish, Norwegian, Swedish)

Write a ``Diff`` interface for a family of languages

Write concrete syntaxes as functors opening the interface

Write separate ``Paradigms`` modules for each language
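
The steps can be sketched with a toy family. All names here (``Coord``, ``DiffToy``, ``CoordScand``, ...) are made up, and the only difference between the languages is the word for "and"; the real ``Diff`` interfaces are much richer.
```
abstract Coord = {
  cat S ;
  fun ConjS : S -> S -> S ;
}

-- what differs within the family
interface DiffToy = {
  oper and_Str : Str ;
}

instance DiffToySwe of DiffToy = {
  oper and_Str = "och" ;
}

instance DiffToyNor of DiffToy = {
  oper and_Str = "og" ;
}

-- the shared concrete syntax, a functor over the interface
incomplete concrete CoordScand of Coord = open DiffToy in {
  lincat S = {s : Str} ;
  lin ConjS x y = {s = x.s ++ and_Str ++ y.s} ;
}

concrete CoordSwe of Coord = CoordScand with (DiffToy = DiffToySwe) ;
concrete CoordNor of Coord = CoordScand with (DiffToy = DiffToyNor) ;
```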

Advantages:
- easier maintenance of library
- insights into language families


Problems:
- more abstract thinking required
- individual grammars may not come out optimal in elegance and efficiency


#NEW

===The core of the API===

Everything else is a variation of this
```
cat
  Cl ;   -- clause
  VP ;   -- verb phrase
  V2 ;   -- two-place verb
  NP ;   -- noun phrase
  CN ;   -- common noun
  Det ;  -- determiner
  AP ;   -- adjectival phrase

fun
  PredVP  : NP  -> VP -> Cl ;   -- predication
  ComplV2 : V2  -> NP -> VP ;   -- complementization
  DetCN   : Det -> CN -> NP ;   -- determination
  ModCN   : AP  -> CN -> CN ;   -- modification
```

#NEW

===The core of the API===

This [toy Latin grammar  latin.gf] shows in a nutshell how the core
can be implemented.

Use this API as a first approximation when designing the parameter system of a new
language. 
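
What such a parameter system can look like is sketched below. This is not the contents of ``latin.gf``: it assumes the core judgements are collected in a module called ``Core`` (a made-up name) and uses a made-up system of two numbers and two cases, with verb-final word order.
```
concrete CoreToy of Core = {
  param
    Number = Sg | Pl ;
    Case   = Nom | Acc ;
  lincat
    Cl  = {s : Str} ;
    VP  = {s : Number => Str} ;              -- agrees with the subject
    V2  = {s : Number => Str} ;
    NP  = {s : Case => Str ; n : Number} ;   -- number is inherent
    CN  = {s : Number => Case => Str} ;
    Det = {s : Case => Str ; n : Number} ;   -- determines the number
    AP  = {s : Number => Case => Str} ;
  lin
    PredVP np vp  = {s = np.s ! Nom ++ vp.s ! np.n} ;
    ComplV2 v2 np = {s = \\n => np.s ! Acc ++ v2.s ! n} ;
    DetCN det cn  = {s = \\c => det.s ! c ++ cn.s ! det.n ! c ; n = det.n} ;
    ModCN ap cn   = {s = \\n,c => ap.s ! n ! c ++ cn.s ! n ! c} ;
}
```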


#NEW

===How to proceed===

+ set up a directory with dummy modules by copying from e.g. English and
commenting out the contents,
so that you have a compilable ``LangX`` all the time

+ start with nouns and their inflection

+ proceed to verbs and their inflection

+ add some noun phrases

+ implement predication
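

A sketch of what a dummy module could look like at the start, for a new language with the made-up code ``X``. It assumes that the English original has the header shown in the comment and that a likewise emptied ``CatX`` exists; the actual headers should be copied from the English files.
```
-- English original, assumed:
--   concrete NounEng of Noun = CatEng ** open ResEng, Prelude in { ... }

concrete NounX of Noun = CatX ** {

-- lin
--   DetCN det cn = ... ;   -- to be uncommented and implemented later

}
```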


#NEW

==How to extend the API==

Extend old modules or add a new one?

Usually better to start a new one: then you don't have to implement it
for all languages at once.

Exception: if you are working with a language-specific API extension,
you can work directly in that module.