GF Resource Grammar Library
Author: Aarne Ranta <aarne (at) cs.chalmers.se>
Last update: %%date(%c)

% NOTE: this is a txt2tags file.
% Create an html file from this file using:
% txt2tags --toc gf-resource.txt

%!target:html

%!postproc(html): #NEW <!-- NEW -->


#NEW
==GF = Grammatical Framework==

GF is a grammar formalism based on functional programming and type theory.


GF was designed to be nice for //ordinary programmers// to use: by this
we mean programmers without training in linguistics.


The mission of GF is to make natural-language applications available for
ordinary programmers, in tasks like

- software documentation
- domain-specific translation
- human-computer interaction
- dialogue systems

Thus GF is //not// primarily another theoretical framework for
linguists.


#NEW
==Multilingual grammars==

A GF grammar consists of an abstract syntax and a set
of concrete syntaxes.


**Abstract syntax**: language-independent representation
```
  cat Prop ; Nat ;
  fun Even : Nat -> Prop ;
  fun NInt : Int -> Nat ;
```
**Concrete syntax**: mapping from abstract syntax trees to strings in a language
(English, French, German, Swedish,...)
```
  lin Even x = {s = x.s ++ "is" ++ "even"} ;

  lin Even x = {s = x.s ++ "est" ++ "pair"} ;

  lin Even x = {s = x.s ++ "ist" ++ "gerade"} ;

  lin Even x = {s = x.s ++ "är" ++ "jämnt"} ;
```
We can **translate** between languages via the abstract syntax:
```
  4 is even                  4 ist gerade
             \              /
               Even (NInt 4)
             /              \
  4 est pair                  4 är jämnt
```
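For reference, the fragments above can be assembled into complete modules.
The following is a minimal sketch: the module names ``Arith`` and ``ArithEng``,
the shared linearization type ``{s : Str}``, and the rule for ``NInt`` are
assumptions added for illustration and are not part of the library.
```
  -- file Arith.gf (hypothetical name)
  abstract Arith = {
    cat Prop ; Nat ;
    fun Even : Nat -> Prop ;
    fun NInt : Int -> Nat ;
  }

  -- file ArithEng.gf (hypothetical name)
  concrete ArithEng of Arith = {
    -- both categories are linearized as plain strings (assumption)
    lincat Prop, Nat = {s : Str} ;
    lin Even x = {s = x.s ++ "is" ++ "even"} ;
    -- an integer literal already carries its string in the field s
    lin NInt i = {s = i.s} ;
  }
```
With such a grammar loaded, linearizing the tree ``Even (NInt 4)`` should
give //4 is even//; the other languages differ only in the ``concrete`` module.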


But is it really so simple?


#NEW
==Difficulties with concrete syntax==

Most languages have rules of **inflection**, **agreement**,
and **word order**, which have to be obeyed when putting together
expressions.


The previous multilingual grammar breaks these rules in many situations:

- //2 and 3 is even//
- //la somme de 3 et de 5 est pair//
- //wenn 2 ist gerade, dann 2+2 ist gerade//
- //om 2 är jämnt, 2+2 är jämnt//

All these sentences are grammatically incorrect.


#NEW
==Solving the difficulties==

GF has tools for expressing the linguistic rules that are needed to
produce correct translations in different languages.


Instead of just strings, we need **parameters**, **tables**,
and **record types**. For instance, French:
```
  param Mod = Ind | Subj ;
  param Gen = Masc | Fem ;

  lincat Nat = {s : Str ; g : Gen} ;
  lincat Prop = {s : Mod => Str} ;

  lin Even x = {s =
      table {
        m => x.s ++
             case m   of {Ind  => "est" ;  Subj => "soit"} ++
             case x.g of {Masc => "pair" ; Fem  => "paire"}
      }
    } ;
```
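The gender field is what lets compound terms agree correctly. As a sketch,
the rules below could be added to the French fragment above so that the
ill-formed //la somme de 3 et de 5 est pair// comes out with the feminine
//paire//. The function ``Sum``, the rule for ``NInt``, and the choice of
treating numerals as masculine are assumptions made for illustration.
```
  -- abstract syntax: a hypothetical additional function
  fun Sum : Nat -> Nat -> Nat ;

  -- French concrete syntax, using the lincat of Nat given above
  lin NInt i = {s = i.s ; g = Masc} ;  -- numerals taken to be masculine

  lin Sum x y = {
    s = "la" ++ "somme" ++ "de" ++ x.s ++ "et" ++ "de" ++ y.s ;
    g = Fem                            -- the head noun "somme" is feminine
    } ;
```
With these rules, ``Even (Sum (NInt 3) (NInt 5))`` selects ``Fem`` in the
``case x.g`` expression, so its indicative form should read
//la somme de 3 et de 5 est paire//.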
To learn more about these constructs, consult GF documentation, e.g. the
[../../../doc/tutorial/gf-tutorial2.html New Grammarian's Tutorial].
However, in what follows we will show how to avoid learning them and
still write linguistically correct grammars.


#NEW
==Language + Libraries==

Writing natural language grammars still requires
theoretical knowledge about the language.


Which kind of programmer is easier to find?

- one who can write a sorting algorithm
- one who can write a grammar for Swedish determiners


In mainstream programming, sorting algorithms are not
written by hand but taken from **libraries**.


In the same way, we want to create grammar libraries that encapsulate
basic linguistic facts.


Cf. the Java success story: the language is only half of the
success; the libraries are the other half.


#NEW
==Example of library-based grammar writing==

To define a Swedish expression of a mathematical predicate from scratch:
```
  Even x =
    let jämn = case <x.n,x.g> of {
      <Sg,Utr>   => "jämn" ;
      <Sg,Neutr> => "jämnt" ;
      <Pl,_>     => "jämna"
      }
    in
    {s = table {
      Main => x.s ! Nom ++ "är" ++ jämn ;
      Inv  => "är" ++ x.s ! Nom ++ jämn ;
      Sub  => x.s ! Nom ++ "är" ++ jämn
      }
    }
```
To use library functions for syntax and morphology:
```
  Even = predA (regA "jämn") ;
```
For the French version, we write
```
  Even = predA (regA "pair") ;
```


#NEW
==Questions in grammar library design==

What should there be in the library?

- morphology, lexicon, syntax, semantics,...


How do we organize and present the library?

- division into modules, level of granularity

- "school grammar" vs. sophisticated linguistic concepts


Where do we get the data from?

- automatic extraction or hand-writing?

- reuse of existing resources?

Extra constraint: we want open-source free software and
hence cannot use existing proprietary resources.


#NEW
==Answers to questions in grammar library design==

The current GF resource grammar library has
made the following decisions:

For each language, the library has

- complete morphology, some lexicon (500 words), a representative fragment of syntax,
and very little semantics.


Organization and presentation:

- division into top-level (API) modules, and internal modules (only
interesting for resource implementors)

- the API is, as much as possible, common in different languages

- we favour "school grammar" concepts rather than innovative linguistic theory


Where do we get the data from?

- morphology and syntax are hand-written

- the 500-word lexicon is hand-written, but a tool is provided
     for automatic lexicon extraction

- we have not reused existing resources

The resource grammar library is entirely
open-source free software (under the GNU GPL).


#NEW
==The scope of a resource grammar library for a language==

All morphological paradigms


Basic lexicon of structural, common, and irregular words


Basic syntactic structures


Currently,
- //no// semantics,
- //no// language-specific structures if not necessary for expressivity.


#NEW
==Success criteria==

Grammatical correctness


Semantic coverage: you can express whatever you want.


Usability as a library for non-linguists.


(Bonus for linguists:) nice generalizations w.r.t. language
families, using the module system of GF.


#NEW
==These are not our success criteria==

Language coverage: to be able to parse all expressions.

Example:
the French //passé simple// tense, although covered by the
morphology, is not used in the language-independent API, but
only the //passé composé// is. However, an application
accessing the French-specific (or Romance-specific)
modules can use the passé simple.


Semantic correctness: only to produce meaningful expressions.

Example: the following sentences can be generated
```
  colourless green ideas sleep furiously

  the time is seventy past forty-two
```
However, an application grammar can use a domain-specific
semantics to guarantee semantic well-formedness.


(Warning for linguists:) theoretical innovation in
syntax is not among the goals
(and it would be hidden from users anyway!).


#NEW
==So where is semantics?==

GF incorporates a **Logical Framework** and is therefore
capable of expressing logical semantics //à la// Montague
or any other flavour, including anaphora and discourse.


But we do //not// try to give semantics once and
for all for the whole language.


Instead, we expect semantics to be given in
**application grammars** built on semantic models
of different domains.


Example application: number theory
```
  fun Even : Nat -> Prop ;         -- a mathematical predicate

  lin Even = predA (regA "even") ; -- English translation
  lin Even = predA (regA "pair") ; -- French translation
  lin Even = predA (regA "jämn") ; -- Swedish translation
```
How could the resource predict that just //these//
translations are correct in this domain?


Application grammars are built by experts of these domains
who, thanks to resource grammars, no longer need to be
experts in linguistics.


#NEW
==Languages==

The current GF Resource Project covers ten languages:

- ``Dan``ish
- ``Eng``lish
- ``Fin``nish
- ``Fre``nch
- ``Ger``man
- ``Ita``lian
- ``Nor``wegian
- ``Rus``sian
- ``Spa``nish
- ``Swe``dish

The first three letters (``Dan`` etc.) are used in grammar module names.


#NEW
==Library structure 1: language-independent API==


- ``Lang`` is the top module collecting all of the following.


- syntactic ``Categories`` (parts of speech, word classes), e.g.
```
  V ; NP ; CN ; Det ;  -- verb, noun phrase, common noun, determiner
```
- ``Rules`` for combining words and phrases, e.g.
```
  DetNP : Det -> CN -> NP ; -- combine Det and CN into NP
```
- the most common ``Structural`` words (determiners,
conjunctions, pronouns) (now 83), e.g.
```
  and_Conj : Conj ;
```
- ``Numerals``, number words from 1 to 999,999 with their
inflections, e.g.
```
  n8 : Digit ;
```
- ``Basic`` lexicon of (now 218) frequent everyday words
```
  man_N : N ;
```


In addition, and not included in ``Lang``, there is
- ``SwadeshLex``, lexicon of (now 206) words from the
[http://en.wiktionary.org/wiki/Swadesh_List Swadesh list], e.g.
```
  squeeze_V : V ;
```
Of course, there is some overlap between ``SwadeshLex`` and the other modules.


#NEW
==Library structure 2: language-dependent modules==

- morphological ``Paradigms``, e.g. Swedish
```
  mkN : Str -> Str -> Str -> Str -> Gender -> N ; -- worst-case nouns
  mkN : Str -> N ;                                -- regular nouns
```
- (in some languages) irregular ``Verbs``, e.g.
```
  angripa_V = irregV "angripa" "angrep" "angripit" ;
```
- (not yet available) ``Ext``ended syntax with language-specific rules
```
  PassBli : V2 -> NP -> VP ;  -- bli överkörd av ngn
```


#NEW
==How much can be language-independent?==

For the ten languages we have considered, it //is// possible
to implement the current API.


Reservations:

- this does not necessarily extend to all other languages
- this does not necessarily cover the most idiomatic expressions
     of each language
- this may not be the easiest API to implement (e.g. negation and
inversion with  //do// in English suggest that some other
structure would be more natural)
- it is not guaranteed that the same structure has the same semantics
in all different languages


#NEW
==Library structure: language-independent API==

%#center
  [Lang.gif]
%#center


#NEW
==API documentation==

[Categories.html Categories]


[Rules.html Rules]


Two alternative views on sentence formation by predication:
[Clause.html Clause],
[Verbphrase.html Verbphrase]


[Structural.html Structural]


[Time.html Time]


[Basic.html Basic]


[Lang.html Lang]


See also [../../resource-1.0/doc/gfdoc resource v 1.0 documentation],
now implemented for English, German, and Swedish.


#NEW
==Paradigms documentation==

[ParadigmsEng.html English paradigms]

[BasicEng.html example use of English paradigms]

[VerbsEng.html English verbs]


[ParadigmsFin.html Finnish paradigms]

[BasicFin.html example use of Finnish paradigms]


[ParadigmsFre.html French paradigms]

[BasicFre.html example use of French paradigms]

[VerbsFre.html French verbs]


[ParadigmsIta.html Italian paradigms]

[BasicIta.html example use of Italian paradigms]

[BeschIta.html Italian verb conjugations]


[ParadigmsNor.html Norwegian paradigms]

[BasicNor.html example use of Norwegian paradigms]

[VerbsNor.html Norwegian verbs]


[ParadigmsSpa.html Spanish paradigms]

[BasicSpa.html example use of Spanish paradigms]

[BeschSpa.html Spanish verb conjugations]


[ParadigmsSwe.html Swedish paradigms]

[BasicSwe.html example use of Swedish paradigms]

[VerbsSwe.html Swedish verbs]


#NEW
==Use as top-level grammar: testing==

Import a set of ``LangX`` grammars:
```
  i english/LangEng.gf
  i swedish/LangSwe.gf
```
Alternatively, you can ``make`` a precompiled package of
all the languages by using ``lib/resource/Makefile``:
```
  make
  gf langs.gfcm
```
Then you can test with translation, random generation, morphological analysis...
```
  > p -lang=LangEng "I have loved her." | l -lang=LangFre
  Je l' ai aimée.

  > gr -cat=NP | l -multi
  The sock
  Strumpan
  Strømpen
  La media
  La calza
  La chaussette
  Sukka
```


#NEW
==Use as top-level grammar: language learning quizzes==

Morpho quiz with words (e.g. French verbs):
```
  i french/VerbsFre.gf
  mq -cat=V
```
Morpho quiz with phrases (e.g. Swedish clauses):
```
  i swedish/LangSwe.gf
  mq -cat=Cl
```
Translation quiz with sentences (e.g. sentences from English to Swedish):
```
  i english/LangEng.gf
  i swedish/LangSwe.gf
  tq -cat=S LangEng LangSwe
```


#NEW
==Use as library==

Import directly by ``open``:
```
  concrete AppNor of App = open LangNor, ParadigmsNor in {...}
```
(Note for the users of GF 2.1 and older:
the dummy ``reuse`` modules and their bulky ``.gfr`` versions
are no longer needed!)


If you need to convert resource records to strings, and don't want to know
the concrete type (as you never should), you can use
```
  Predef.toStr : (L : Type) -> L -> Str ;
```
``L`` must be a linearization type. For instance,
```
  toStr LangNor.CN (ModAP (PositADeg old_ADeg) (UseN car_N))
  ---> "gammel bil"
```
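Put together, a complete application grammar that uses the library in this
way might look as follows. This is a sketch only: the abstract syntax ``App``,
its category ``Thing``, and the function ``OldCar`` are hypothetical, while the
resource functions are those of the ``toStr`` example above.
```
  -- hypothetical application abstract syntax
  abstract App = {
    cat Thing ;
    fun OldCar : Thing ;
  }

  concrete AppNor of App = open LangNor in {
    -- reuse the resource category as linearization type,
    -- without looking at its concrete definition
    lincat Thing = CN ;
    -- build the phrase with API functions only
    lin OldCar = ModAP (PositADeg old_ADeg) (UseN car_N) ;
  }
```
The application never mentions Norwegian inflection tables; it only
combines API functions, which is the intended way of using the resource.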


#NEW
==Use as library through parser==

You can use the parser with a ``LangX`` grammar
when developing a resource.


Using the ``-v`` option shows if the parser fails because
of unknown words.
```
  > p -cat=S -v -lexer=words "jag ska åka till Chalmers"
  unknown tokens [TS "åka",TS "Chalmers"]
```
Then try to select words that ``LangX`` recognizes:
```
  > p -cat=S "jag ska springa till Danmark"
  UseCl (PosTP TFuture ASimul)
    (AdvCl (SPredV i_NP run_V)
    (AdvPP (PrepNP to_Prep (UsePN (PNCountry Denmark)))))
```
Use these API structures and extend the vocabulary to match your needs:
```
  åka_V = lexV "åker" ;
  Chalmers = regPN "Chalmers" neutrum ;
```

#NEW
==Syntax editor as library browser==

You can run the syntax editor on ``LangX`` to
find resource API functions through context-sensitive menus.
For instance, the shell command
```
  gfeditor LangEng.gf LangFre.gf
```
opens the editor with English and French views. The
[http://www.cs.chalmers.se/%7Eaarne/GF2.0/doc/javaGUImanual/javaGUImanual.htm
Editor User Manual] gives more information on the use of the editor.


A restriction of the editor is that it does not give access to
``ParadigmsX`` modules. An IDE environment extending the editor
to a grammar programming tool is work in progress.


#NEW
==Example application: a small translation system==

In this system, you can express questions and answers of
the following forms:
```
  Who chases mice ?
  Whom does the lion chase ?
  The dog chases cats.
```
We build the abstract syntax in two phases:

- [example/Questions.gf Questions] defines question and
  answer forms independently of domain
- [example/Animals.gf Animals] defines a lexicon with
  animals and things that animals do.


The concrete syntax of English is built in three phases:

- [example/HandQuestionsI.gf QuestionsI] is a parametrized module
  using the API module ``Resource``.
- [example/QuestionsEng.gf QuestionsEng] is an instantiation
  of the API with ``ResourceEng``.
- [example/AnimalsEng.gf AnimalsEng] is a concrete syntax
  of ``Animals`` using ``ParadigmsEng`` and ``VerbsEng``.


The concrete syntax of Swedish is built upon ``QuestionsI``
in a similar way, with the modules
[example/QuestionsSwe.gf QuestionsSwe] and
[example/AnimalsSwe.gf AnimalsSwe].


The concrete syntax of French consists similarly of the modules
[example/QuestionsFre.gf QuestionsFre] and
[example/AnimalsFre.gf AnimalsFre].


#NEW
==Compiling the example application==

The resources are bulky, and it therefore takes a lot of
time and memory to load the grammars. However, they can be
compiled into the ``gfcm``
(**GF canonical multilingual**) format,
which is almost one thousand times smaller and faster to load
for this set of grammars.


To produce an end-user multilingual grammar ``animals.gfcm``,
write the sequence of compilation commands in a ``gfs`` (**GF script**)
file, say
[example/mkAnimals.gfs ``mkAnimals.gfs``],
and then call GF with
```
  gf <mkAnimals.gfs
```
To try out the grammar,
```
  > i animals.gfcm

  > gr | l -multi
  vem jagar hundar ?
  qui chasse des chiens ?
  who chases dogs ?
```


#NEW

==Grammar writing by examples==

(New in GF 2.3)


You can use the resource grammar as a parser on a special file format,
``.gfe`` ("GF examples"). For instance, the real source
[example/QuestionsI.gfe QuestionsI.gfe] is what
generates
[example/QuestionsI.gf QuestionsI.gf]
when you execute the GF command
```
  i -ex AnimalsEng.gf
```
Since ``QuestionsI`` is an incomplete module ("functor"),
it need only be built once. This is why only the first
command in ``mkAnimals.gfs`` needs the flag ``-ex``.


Of course, the grammar of any language can be created by
parsing any language, as long as they have a common resource API.
The use of English resource is generally recommended, because it
is smaller and faster to parse than the other languages.


#NEW
==Constants and variables in examples==

The file [example/QuestionsI.gfe QuestionsI.gfe] uses
as resource ``LangEng``, which contains all resource syntax and
a lexicon of ca. 300 words. A linearization rule, such as
```
  lin Who love_V2 man_N = in Phr "who loves men ?" ;
```
uses, as its argument variables, constants for words that can be found in
the lexicon. This is what makes it possible to parse the example.
When the resulting rule,
```
  lin Who love_V2 man_N =
    QuestPhrase (UseQCl (PosTP TPresent ASimul)
      (QPredV2 who8one_IP love_V2 (IndefNumNP NoNum (UseN man_N)))) ;
```
is read by the GF compiler, the identifiers ``love_V2`` and
``man_N`` are not treated as constants, but, following
the normal binding rules of functional languages, as bound variables.
This is what gives the example method the generality that is needed.


To write linearization rules by examples one thus has to know at
least one abstract syntax constant for each category for which
one needs a variable.


#NEW
==Extending the lexicon on the fly==

The greatest limitation of the example method is that the lexicon
may lack many of the words that are needed in examples. If parsing
fails because of this, the compiler gives a list of unknown words
in its error message. An obvious solution is,
of course, to extend the resource lexicon and try again.
A more light-weight solution is to add a **substitution** to
the example. For instance, if you want the example "the pope"
but the lexicon does not have the word "pope", you can write
```
  lin Pope = in NP "the man" {man_N = regN "pope"} ;
```
The resulting linearization rule is initially
```
  lin Pope = DefOneNP (UseN man_N) ;
```
but the substitution changes this to
```
  lin Pope = DefOneNP (UseN (regN "pope")) ;
```
In this way, you do not have to extend the resource lexicon, but you
need to open the Paradigms module to compile the resulting term.


Of course, the substituted expressions may come from another language
than the main language of the example:
```
  lin Pope = in NP "the man" {man_N = regN "pape" masculine} ;
```
If many substitutions are needed, semicolons are used as separators:
```
  {man_N = regN "pope" ; walk_V = regV "pray"} ;
```


#NEW
==Implementation details: low-level files==

**For developers of resource grammars.**
The modules listed in this section should never be imported in application
grammars.


Each of the API implementations uses the following auxiliary resource modules:

- ``Types``, the morphological paradigms and word classes
- ``Morpho``, inflection machinery
- ``Syntax``, complex categories and their combinations

In addition, the following language-independent modules from ``lib/prelude``
are used.

- ``Predef``, operations whose definitions are hard-coded in GF
- ``Prelude``, generic string and boolean operations
- ``Coordination``, coordination structures for arbitrary categories


#NEW
==Implementation details: the structure of low-level files==

%#center
  [Low.gif]
%#center


#NEW
==How to change a resource grammar?==

In many cases, the source of a bug is in one of
the low-level modules. Try to trace it back there
by starting from the high-level module.


(Much more to be written...)


#NEW
==How to write a resource grammar?==

Start with a more limited goal, e.g. to implement
the ``stoneage`` grammar (``examples/stoneage``)
for your language.


For this, you need

- most of ``Types``
- most of ``Morpho``
- some of ``Syntax``
- most of ``Paradigms``


A useful command to test ``oper``s:
```
  i -retain MorphoRot.gf
  cc regNoun "foo"
```


See also [../../resource-1.0/doc/Resource-HOWTO.html Resource-HOWTO]
(under construction).


#NEW
==The use of parametrized modules==

In two language families, a lot of code is shared.
- Romance: French, Italian, Spanish
- Scandinavian: Danish, Norwegian, Swedish


The structure looks like this.

 []
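As a rough, self-contained sketch of the mechanism: the shared code is
written once as an ``incomplete`` module that opens an ``interface``, and each
language supplies an ``instance`` of that interface. All module names below
are hypothetical and far simpler than the actual Romance and Scandinavian
modules; in particular, agreement is ignored.
```
  abstract Arith = {
    cat Prop ; Nat ;
    fun Even : Nat -> Prop ;
    fun NInt : Int -> Nat ;
  }

  -- what differs between the languages of the family
  interface DiffScand = {
    oper copula : Str ;
    oper even   : Str ;
  }

  instance DiffSwe of DiffScand = {
    oper copula = "är" ;
    oper even   = "jämnt" ;
  }

  instance DiffDan of DiffScand = {
    oper copula = "er" ;
    oper even   = "lige" ;
  }

  -- what is shared: written once, against the interface
  incomplete concrete ArithScand of Arith = open DiffScand in {
    lincat Prop, Nat = {s : Str} ;
    lin Even x = {s = x.s ++ copula ++ even} ;
    lin NInt i = {s = i.s} ;
  }

  -- one line per language instantiates the shared code
  concrete ArithSwe of Arith = ArithScand with (DiffScand = DiffSwe) ;
  concrete ArithDan of Arith = ArithScand with (DiffScand = DiffDan) ;
```
Adding a further language of the family then amounts to writing one more
instance and one more instantiation line.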


#NEW
==Current status==

 || Language  | v0.6 | v0.9 | v1.0 | Paradigms | Lexicon | Verbs |
 | Arabic    | -    | -    | +    | X         | X       | -     |
 | Danish    | -    | X    | X    | X         | X       | X     |
 | English   | X    | X    | X    | X         | X       | X     |
 | Finnish   | X    | +    | X    | X         | X       | 0     |
 | French    | X    | X    | X    | X         | X       | X     |
 | German    | X    | -    | X    | X         | X       | X     |
 | Italian   | X    | X    | X    | X         | X       | X     |
 | Norwegian | -    | X    | X    | X         | X       | X     |
 | Russian   | X    | X    | X    | X         | X       | -     |
 | Spanish   | -    | X    | X    | X         | X       | X     |
 | Swedish   | X    | X    | X    | X         | X       | X     |

- ``X`` = implemented (few exceptions may occur)
- ``+`` = implemented for a large part
- ``*`` = linguistic material ready for implementation
- ``-`` = not implemented
- ``0`` = not applicable


#NEW
==Known bugs and limitations==

(//The listed limitations are ones that do not follow from the table on
the previous page//.)

Danish

English

Finnish:
compiling the heuristic regular paradigms is slow;
possessive and interrogative suffixes have no proper lexer.

French:
no inverted questions

German

Italian:
no binding of clitics with infinitive

Norwegian

Russian:
missing rules for ordinal numbers

Spanish

Swedish


#NEW
==Obtaining it==

Get the grammar package from
[http://sourceforge.net/project/showfiles.php?group_id=132285
GF Download Page]. The current libraries are in
``lib/resource-1.0``. Version 0.9 is in
``lib/resource-0.9``. Version 0.6 is in
``lib/resource-0.6``.


The very latest version of GF and its libraries is in the
[Darcs repository http://www.cs.chalmers.se/Cs/Research/Language-technology/darcs/GF/doc/darcs.html].