Grammars as Software Libraries
Author: Aarne Ranta <aarne (at) cs.chalmers.se>
Last update: %%date(%c)

% NOTE: this is a txt2tags file.
% Create an html file from this file using:
% txt2tags --toc gslt-sem-2006.txt

%!target:html

%!postproc(html): #NEW <!-- NEW -->

#NEW

==Setting==

Current funding
- VR: Library-Based Grammar Engineering (2006-2008)
  - Lars Borin (Swedish)
  - Robin Cooper (Computational Linguistics)
  - Sibylle Schupp and Aarne Ranta (Computer Science)


Previous funding
- VR: Record Types and Dialogue Semantics (2003-2005)
- VINNOVA: Interactive Language Technology (2001-2004)


Main applications
- TALK: multilingual and multimodal dialogue systems
- WebALT: multilingual generation of mathematical teaching material
- KeY: multilingual authoring of software specifications


#NEW

==People==

Staff contributions to grammar libraries:
- Björn Bringert
- Markus Forsberg
- Harald Hammarström
- Janna Khegai
- Aarne Ranta


Student projects on grammar libraries:
- Inger Andersson & Therese Söderberg: Spanish morphology
- Ludmilla Bogavac: Russian morphology
- Karin Cavallin: comparison with Svenska Akademins Grammatik
- Ali El Dada: Arabic morphology and syntax
- Muhammad Humayoun: Urdu morphology
- Michael Pellauer: Estonian morphology


Technology, also:
- Håkan Burden
- Hans-Joachim Daniels
- Kristofer Johannisson
- Peter Ljunglöf


Various grammar library contributions from the multilingual Chalmers community:
- Ana Bove, Koen Claessen, Carlos Gonzalía, Patrik Jansson,
Wojciech Mostowski, Karol Ostrovský, David Wahlstedt


Resource library patches and suggestions from the WebALT staff:
- Lauri Carlson, Glòria Casanellas, Anni Laine, Wanjiku Ng'ang'a, Jordi Saludes


#NEW

==Software Libraries==

The main device of **division of labour** in programming.

Instead of writing a sorting algorithm over and over again,
programmers take it from a library. You write (in Haskell),
```
  Data.List.sort xs
```
instead of a lot of code actually implementing sorting.

Practical advantages:
- faster development of new software
- quality guarantee and automatic improvements


#NEW

==Abstraction==

Libraries promote **abstraction**: you abstract away from details.

Using libraries is therefore good programming style.

It is also **scientifically interesting** to create libraries:
you have to think about the abstractions in your domain of expertise.

Notice: libraries can bring abstraction to almost any language,
as long as it supports functions or macros.


#NEW

==Grammars as libraries?==

Example: we want to create a GUI (Graphical User Interface) button
that says //yes//, and **localize** it to different languages:
```
  Yes   Ja   Kyllä   Oui   Ja   Sì
```
Possible ways to do this:
+ Go around dictionaries to find the word in different languages
```
  yesButton english = button "Yes"
  yesButton swedish = button "Ja"
  yesButton finnish = button "Kyllä"
```

+ Hire more programmers to perform localization in different languages


#NEW

3. Use a library ``Text`` such that you can write
```
  yesButton lang = button (Text.render lang Text.Yes)
```
The library has an API (Application Programmer's Interface) with:
+ A repository of text elements such as
```
  Yes    : Text
  No     : Text
```
+ A function rendering text elements in different languages:
```
  render : Language -> Text -> String
```
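A minimal Haskell sketch of such a library, for the three languages of the
earlier example. The types ``Language`` and ``Text`` follow the API above;
the words for //no// and the string-valued ``button`` are assumptions made
for illustration, not part of any real GUI toolkit.

```
import System.IO (hSetEncoding, stdout, utf8)

-- Sketch of the hypothetical Text library: a closed repository of
-- text elements plus one rendering function.
data Language = English | Swedish | Finnish
  deriving (Show, Eq, Enum, Bounded)

data Text = Yes | No
  deriving (Show, Eq, Enum, Bounded)

render :: Language -> Text -> String
render English Yes = "Yes"
render English No  = "No"
render Swedish Yes = "Ja"
render Swedish No  = "Nej"
render Finnish Yes = "Kyllä"
render Finnish No  = "Ei"

-- Stand-in for a GUI call (assumption: a button is just its label).
button :: String -> String
button label = "[" ++ label ++ "]"

yesButton :: Language -> String
yesButton lang = button (render lang Yes)

main :: IO ()
main = do
  hSetEncoding stdout utf8
  mapM_ (putStrLn . yesButton) [minBound .. maxBound]
```
The application code never mentions a word of any language: adding a
language only touches ``Language`` and ``render``.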


#NEW

==A slightly more advanced example==

This is what you often see as feedback from a program:
```
  You have 1 messages.
```
Or perhaps with a little more thought:
```
  You have 1 message(s).
```
The code that should be written is of course
```
  mess n = "You have" +++ show n +++ messages ++ "."
    where
      messages = if n==1 then "message" else "messages"
      a +++ b  = a ++ " " ++ b   -- concatenation with a space in between
```
(E.g. VoiceXML supports this.)


#NEW

==Problems with the more advanced example==

The same as with "Yes": you have to know the words "you",
"have", "message".

//Moreover//, you have to know the inflection of the equivalent
of "message":
```
  if n == 1 then "meddelande" else "meddelanden"
```
//Moreover//, you have to know how the noun agrees with different numbers
(e.g. Arabic):
```
  if n == 1 then "risAlaö" else
  if n == 2 then "risAlatAn" else
  if n < 11 then "rasA'il" else
                 "risAlaö"
```
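A sketch of how these per-language rules could be collected in one place
instead of being scattered over application code. The function name
``messageNoun`` is hypothetical; the forms and thresholds are exactly the
ones listed above.

```
data Language = English | Swedish | Arabic
  deriving (Show, Eq)

-- Form of the noun "message" that agrees with the number n.
messageNoun :: Language -> Int -> String
messageNoun English n = if n == 1 then "message" else "messages"
messageNoun Swedish n = if n == 1 then "meddelande" else "meddelanden"
messageNoun Arabic  n
  | n == 1    = "risAlaö"
  | n == 2    = "risAlatAn"
  | n < 11    = "rasA'il"
  | otherwise = "risAlaö"

main :: IO ()
main = mapM_ (putStrLn . messageNoun Swedish) [1, 2]
```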

#NEW

==More problems with the advanced example==

You also have to know the case required by the verb "have",
e.g. in Finnish:
```
  1 viesti   -- nominative
  4 viestiä  -- partitive
```
//Moreover//, you have to know the proper way to address the user
politely:
```
  Du har 3 meddelanden / Ni har 3 meddelanden
  Vous avez 3 messages / Tu as 3 messages
```
(This can also depend on country and the kind of program.)


#NEW

==A library-based solution==

In analogy with the "Yes" case, you write
```
  mess lang n = render lang (Text.YouHaveMessages n)
```
Hmm, is this so smart? What if you want to say
```
  You have 4 documents.
  You have 5 jewels.
  I have 7 surprises.
```
It is time to move from **canned text** to a **grammar**.


#NEW

==An improved library-based solution==

You may want to write
```
  mess  lang n = render lang (Have PolYou (Num n Message))
  sword lang n = render lang (Have FamYou (Num n Jewel))
  surpr lang n = render lang (Have I      (Num n Surprise))
```
For this purpose, you need a library with the API
```
  Have    : NounPhrase -> NounPhrase -> Sentence

  PolYou  : NounPhrase
  FamYou  : NounPhrase
  I       : NounPhrase

  Num     : Int -> Noun -> NounPhrase

  Message  : Noun
  Jewel    : Noun
  Surprise : Noun
```
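In Haskell, such an API could be sketched with algebraic data types, so
that the type checker rejects ill-formed combinations. Only an English
realization is given, and the helper names ``noun``, ``np`` and
``renderEng`` are assumptions made for this sketch.

```
import Data.Char (toUpper)

-- Construction functions become constructors of data types.
data Noun       = Message | Jewel | Surprise
data NounPhrase = PolYou | FamYou | I | Num Int Noun
data Sentence   = Have NounPhrase NounPhrase

-- (singular, plural) of each noun
noun :: Noun -> (String, String)
noun Message  = ("message",  "messages")
noun Jewel    = ("jewel",    "jewels")
noun Surprise = ("surprise", "surprises")

-- A noun phrase and the form of "have" it requires as subject.
np :: NounPhrase -> (String, String)
np PolYou = ("you", "have")
np FamYou = ("you", "have")
np I      = ("I",   "have")
np (Num n k)
  | n == 1    = (show n ++ " " ++ fst (noun k), "has")
  | otherwise = (show n ++ " " ++ snd (noun k), "have")

renderEng :: Sentence -> String
renderEng (Have subj obj) =
  unwords [capitalize s, verb, fst (np obj)] ++ "."
  where
    (s, verb) = np subj
    capitalize (c:cs) = toUpper c : cs
    capitalize []     = []

main :: IO ()
main = putStrLn (renderEng (Have PolYou (Num 1 Message)))
```
English does not distinguish ``PolYou`` from ``FamYou``; a Swedish or
French realization of the same trees would.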


#NEW

==The ultimate solution?==

The library API for language will certainly grow big and become
difficult to use. Why couldn't I just write
```
  mess lang n = render lang (parse english "you have n messages")
```
To this end, the API should provide the top-level function
```
  parse : Language -> String -> Sentence
```
The library that we will present actually has this as well!


#NEW

The only complication is that ``parse`` does not always return
just one sentence. There may be zero:
```
  "you have n mesaggse"

```
or many:
```
  "you have n messages"

  Have PolYou  (Num n Message)
  Have FamYou  (Num n Message)
  Have PlurYou (Num n Message)
```
Thus some amount of interaction is needed.


#NEW

==The components of a grammar library==

The library has **construction functions** like
```
  Have   : NounPhrase -> NounPhrase -> Sentence
  PolYou : NounPhrase
```
These functions build **grammatical structures**, which
can have different realizations in different languages.

Therefore we also need **realization functions**,
```
  render : Language -> Sentence -> String
  parse  : Language -> String   -> [Sentence]
```
Both of them require linguistic expertise to write. But
once this is done, they can be used with very little linguistic
knowledge by application programmers!


#NEW

==Implementing a grammar library in GF==

GF = Grammatical Framework

Those who know GF will already have recognized the introduction as a
seduction argument leading to GF.

In GF,
- construction functions = **abstract syntax**
- realization functions = **concrete syntax**


#NEW

Simplest possible example:
```
  abstract Text = {
    cat Text ;
    fun Yes : Text ;
    fun No  : Text ;
    }

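  -- ss (a record with a single string) comes from the Prelude module
  concrete TextEng of Text = open Prelude in {
    lin Yes = ss "yes" ;
    lin No  = ss "no" ;
    }

  concrete TextFin of Text = open Prelude in {
    lin Yes = ss "kyllä" ;
    lin No  = ss "ei" ;
    }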
```


#NEW

==Linearization and parsing==

The realization function is, for each language, implemented by
**linearization rules** (``lin``).

The linearization rules directly give the ``render`` method:
```
  render english x = TextEng.lin x
```
The GF formalism moreover has the property of **reversibility**:
- a set of linearization rules automatically generates a parser.


%While reversibility has a minor importance for the applications
%shown above, it is crucial for other applications of GF grammars.
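To give a feeling for reversibility, here is a naive Haskell sketch in
which the parser has no rules of its own: it searches the (finite) set of
trees for those that linearize to the input. This brute-force inversion is
an assumption made for illustration, not the way GF builds its parsers.

```
data Language = Eng | Fin deriving (Show, Eq)
data Text = Yes | No deriving (Show, Eq, Enum, Bounded)

-- Linearization rules, as in TextEng and TextFin.
lin :: Language -> Text -> String
lin Eng Yes = "yes"
lin Eng No  = "no"
lin Fin Yes = "kyllä"
lin Fin No  = "ei"

-- Parsing by inverting lin: keep the trees whose linearization
-- equals the input string.
parse :: Language -> String -> [Text]
parse lang s = [t | t <- [minBound .. maxBound], lin lang t == s]

main :: IO ()
main = print (parse Eng "no", parse Eng "maybe")
```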


#NEW

==Applying GF==

**multilingual grammar** = abstract syntax + concrete syntaxes

Examples of the idea:
- domain-specific translation
- multilingual authoring
- dialogue systems


#NEW

==Domain, ontology, idiom==

An abstract syntax has other names:
- a **semantic model**
- an **ontology**


The concrete syntax defines how the ontology
is represented in a language.

It has to meet the following requirements:
- linguistic correctness (inflection, agreement, word order,...)
- semantic correctness (express the concepts properly)
- conformance to the domain idiom (use proper terms and phrasing)


Benefit: translation via a semantic model of the domain can reach high quality.

Problem: the expertise of both a linguist and a domain expert is required.


#NEW

==Example domain==

Arithmetic of natural numbers: abstract syntax
```
  cat Prop ; Nat ;
  fun Even : Nat -> Prop ;
```
**Concrete syntax**: mapping from abstract syntax trees to strings in a language
(English, French, German, Swedish,...)
```
  lin Even x = {s = x.s ++ "is"  ++ "even"} ;    -- English
  lin Even x = {s = x.s ++ "est" ++ "pair"} ;    -- French
  lin Even x = {s = x.s ++ "ist" ++ "gerade"} ;  -- German
  lin Even x = {s = x.s ++ "är"  ++ "jämnt"} ;   -- Swedish
```

#NEW

==Translation system==

We can translate using the abstract syntax as interlingua:
```
  4 is even                  4 ist gerade
             \              /
               Even (NInt 4)
             /              \
  4 est pair                  4 är jämnt
```
This idea is used e.g. in the WebALT project to generate mathematical
teaching material in 7 languages.
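
The diagram can be sketched in Haskell as parsing into the abstract
syntax followed by linearization. The toy ``parse`` below (read the
number, then check that the tree linearizes back to the input) and the
name ``translate`` are assumptions made for this sketch.

```
data Language = Eng | Fre | Ger | Swe deriving (Show, Eq)
data Nat  = NInt Int deriving (Show, Eq)
data Prop = Even Nat deriving (Show, Eq)

linNat :: Nat -> String
linNat (NInt n) = show n

-- One linearization rule per language, as in the concrete syntax.
lin :: Language -> Prop -> String
lin Eng (Even x) = unwords [linNat x, "is",  "even"]
lin Fre (Even x) = unwords [linNat x, "est", "pair"]
lin Ger (Even x) = unwords [linNat x, "ist", "gerade"]
lin Swe (Even x) = unwords [linNat x, "är",  "jämnt"]

-- Toy parser: build a tree from the leading number and accept it
-- only if it linearizes to exactly the input.
parse :: Language -> String -> [Prop]
parse lang s =
  [ t | (w:_) <- [words s]
      , [(n, "")] <- [reads w]
      , let t = Even (NInt n)
      , lin lang t == s ]

-- Translation = parsing in one language + linearization in another.
translate :: Language -> Language -> String -> [String]
translate from to = map (lin to) . parse from

main :: IO ()
main = mapM_ putStrLn (translate Eng Ger "4 is even")
```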

But is it really so simple?


#NEW
==Difficulties with concrete syntax==

The previous multilingual grammar breaks the linguistic rules of these languages in many situations:
```
  2 and 3 is even
  la somme de 3 et de 5 est pair
  wenn 2 ist gerade, dann 2+2 ist gerade
  om x är jämnt, summan av x och 2 är jämnt
```
All these sentences are grammatically incorrect.


#NEW

==Solving the difficulties==

GF //can// express the linguistic rules that are needed to
produce correct translations.

In addition to strings, we use **parameters**, **tables**,
and **record types**. For instance, French:
```
  param Mod = Ind | Subj ;
  param Gen = Masc | Fem ;

  lincat Nat  = {s : Str ; g : Gen} ;
  lincat Prop = {s : Mod => Str} ;

  lin Even x = {s =
      table {
        m => x.s ++
             case m   of {Ind  => "est" ;  Subj => "soit"} ++
             case x.g of {Masc => "pair" ; Fem  => "paire"}
        }
      } ;
```
Linguistic knowledge accounts for most of the size of this grammar.
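
For readers more at home in Haskell, the same rule can be sketched with
records and functions: a GF table over a parameter type corresponds to a
function from that type. The names ``NatLin``, ``PropLin`` and ``linEven``
and the test value are assumptions made for this sketch.

```
data Mod = Ind | Subj deriving (Show, Eq)
data Gen = Masc | Fem deriving (Show, Eq)

-- lincat Nat = {s : Str ; g : Gen}
data NatLin = NatLin { natS :: String, natG :: Gen }

-- lincat Prop = {s : Mod => Str}; the table becomes a function.
newtype PropLin = PropLin { propS :: Mod -> String }

linEven :: NatLin -> PropLin
linEven x =
  PropLin (\m -> unwords [natS x, copula m, adjective (natG x)])
  where
    copula Ind     = "est"
    copula Subj    = "soit"
    adjective Masc = "pair"
    adjective Fem  = "paire"

-- Test value: "la somme ..." is feminine, so the adjective agrees.
sumOf3And5 :: NatLin
sumOf3And5 = NatLin "la somme de 3 et de 5" Fem

main :: IO ()
main = putStrLn (propS (linEven sumOf3And5) Ind)
```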


#NEW

==Application grammars vs. resource grammars==

Application grammar ("semantic grammar")
- abstract syntax: domain semantics
- concrete syntax: "controlled language"
- author: domain expert


Resource grammar ("syntactic grammar")
- abstract syntax: linguistic structures
- concrete syntax: (approximation of) entire language
- author: linguist


#NEW
==GF as programming language==

The expressive power is between TAG and HPSG.

The language is higher-level: a modern, **typed functional programming language**.

It enables linguistic generalizations and abstractions.

But we don't want to bother application grammarians with these details.

We have built a **module system** that can hide details.


#NEW

==Concrete syntax using library==

Assume the following API:
```
  cat S ; NP ; A ;

  fun predA : A -> NP -> S ;

  oper regA : Str -> A ;
```
Now implement ``Even`` for four languages:
```
  lincat
    Prop = S ;
    Nat  = NP ;
  lin
    Even = predA (regA "even") ;   -- English
    Even = predA (regA "jämn") ;   -- Swedish
    Even = predA (regA "pair") ;   -- French
    Even = predA (regA "gerade") ; -- German
```
Notice: the choice of adjective is domain expert knowledge.
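
The division of labour can be sketched in Haskell as follows. The record
``Resource`` and the toy ``mkResource`` are hypothetical: a real resource
grammar would handle agreement and word order inside ``predA``, which is
exactly what the application programmer does not have to see.

```
-- Linguistic categories, opaque to the application programmer.
newtype A  = A String
newtype NP = NP String
newtype S  = S String deriving (Show, Eq)

-- What a resource grammar exports for one language.
data Resource = Resource
  { regA  :: String -> A
  , predA :: A -> NP -> S
  }

-- Toy resource grammars: here only the copula differs.
mkResource :: String -> Resource
mkResource copula = Resource
  { regA  = A
  , predA = \(A a) (NP np) -> S (unwords [np, copula, a])
  }

english, german :: Resource
english = mkResource "is"
german  = mkResource "ist"

-- Application grammar: the domain expert only picks the adjective.
evenEng, evenGer :: NP -> S
evenEng = predA english (regA english "even")
evenGer = predA german  (regA german  "gerade")

main :: IO ()
main = print (evenGer (NP "4"))
```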


#NEW
==Design questions for the grammar library==

What should there be in the library?
- morphology, lexicon, syntax, semantics,...


How do we organize and present the library?
- division into modules, level of granularity
- "school grammar" vs. sophisticated linguistic concepts


Where to get the data from?
- automatic extraction or hand-writing?
- reuse of existing resources?


Extra constraint: we want open-source free software and
hence cannot use existing proprietary resources.


#NEW
==Design decisions==

Coverage, for each language:
- complete morphology
- lexicon of the most important structural words
- test lexicon of ca. 300 content words
- representative fragment of syntax (cf. CLE (Core Language Engine))
- rather flat semantics (cf. Quasi-Logical Form of CLE)


Organization:
- top-level (API) modules
- Ground API + special-purpose APIs ("macro packages")
- "school grammar" concepts rather than advanced linguistic theory


Presentation:
- tool ``gfdoc`` for generating HTML from grammars
- example collections


#NEW
==Design decisions, cont'd==

Where do we get the data from?
- morphology and syntax are hand-written
- the test lexicon is hand-written
- APIs for manual lexicon extension
- tool for automatic lexicon extraction
- we have not reused existing resources


The resource grammar library is entirely open-source free software
(under the GNU GPL).


#NEW
==Success criteria and evaluation==

Grammatical correctness of everything generated.

Semantic coverage: you can express whatever you want.

Usability as library for non-linguists.

Evaluation: tested in third-party projects.

Tools for regression testing (treebank generation and comparison).


#NEW
==These are not our success criteria==

Language coverage:
- to be able to parse all expressions.
- Example: French //passé simple//, although covered by the
morphology, is not available through the language-independent API.
- But: this is being reconsidered, to improve example-based grammar writing


Semantic correctness:
- only to produce meaningful expressions.
- Example: the following sentences can be generated
```
  colourless green ideas sleep furiously
  the time is seventy past forty-two
```


Linguistic innovation in syntax:
- rather a presentation of "known facts"
- innovation would be hidden from users anyway...


#NEW
==Where is semantics?==

Application grammars use domain-specific
semantics to guarantee semantic well-formedness.

GF incorporates a **Logical Framework** and can express
- logical semantics //à la// Montague
- anaphora and discourse using dependent types


The language-independent API is a rough semantic model.

But we do //not// try to give semantics once and
for all for the whole language.


#NEW
==Representations in different APIs==

**Grammar composition**: any grammar can serve as resource to another one.

No fixed set of representation levels; here are some examples for
```
  2 is even
  2 är jämnt
```
In ``Arithm``
```
  Even 2
```
In ``Predication`` (high-level resource API)
```
  predA (regA "even") (IntNP 2)
  predA (regA "jämn") (IntNP 2)
```
In ``Lang`` (ground-level resource API)
```
  UseCl TPres ASimul PPos (PredVP (UsePN (IntPN 2))
    (UseComp (CompAP (PositA (regA "even")))))
  UseCl TPres ASimul PPos (PredVP (UsePN (IntPN 2))
    (UseComp (CompAP (PositA (regA "jämn")))))
```


#NEW
==Languages==

The current GF Resource Project covers ten languages:
- ``Dan``ish
- ``Eng``lish
- ``Fin``nish
- ``Fre``nch
- ``Ger``man
- ``Ita``lian
- ``Nor``wegian (bokmål)
- ``Rus``sian
- ``Spa``nish
- ``Swe``dish


Implementation of API v 1.0 projected for the end of February.

In addition, we have parts (morphology) of Arabic, Estonian, Latin, and Urdu.


#NEW
==Library structure 1: language-independent API==

[Lang.png]

[Resource index page index.html]

[Examples of each category  gfdoc/Cat.html]

Cf. "matrix" in BLARK, LinGo


#NEW
==Library structure 2: language-dependent APIs==

- morphological paradigms, e.g. ``ParadigmsSwe``
```
  mkN  : (man,mannen,män,männen : Str) -> N ;   -- worst-case nouns
  regV : (leker : Str) -> V ;                   -- regular verbs
```
- irregular words, esp. verbs, e.g. ``IrregSwe``
```
  angripa_V = irregV "angripa" "angrep" "angripit" ;
```
- extended syntax with language-specific rules, e.g. ``ExtNor``
```
  PostPoss : CN -> Pron -> NP ;     -- bilen min
```


#NEW
==Difficulties encountered==

English: negation and auxiliary vs. non-auxiliary verbs

Finnish: object case

German: double infinitives

Romance: clitic pronouns

Scandinavian: determiners

//In particular//: how to make the grammars efficient


#NEW
==How much can be language-independent?==

For the ten languages we have considered, it //is// possible
to implement the current API.

Reservations:

- does not necessarily extend to all other languages
- does not necessarily cover the most idiomatic expressions of each language
- may not be the easiest API to implement
  - e.g. negation and inversion with //do// in English suggest that some other
  structure would be more natural


- the structures may not have the same semantics in all languages


#NEW
==Using the library==

Simplest case: use the API in the same way for all languages.
- **+** grammar localization for free
- **-** not the best idioms for each language


In practice: use the API in different ways for different languages
```
  -- Eng: x's name is y
  Name x y = predNP (GenCN x (regN "name")) (StringNP y)
  -- Swe: x heter y
  Name x y = predV2 x heta_V2 (StringNP y)
```
This amounts to **compile-time transfer**.

Surprisingly, writing an application grammar requires more native-speaker knowledge
than writing a resource grammar!


#NEW
==Parametrized modules==

We can go even further than sharing an abstract API: we can share implementations
among related languages.

Exploited in two families:
- Romance: French, Italian, Spanish
- Scandinavian: Danish, Norwegian, Swedish


[The declarations of Scandinavian syntax differences  ../scandinavian/DiffScand.gf]


#NEW
==Lexicon extension==

We cannot anticipate all vocabulary needed in application grammars.

Therefore we provide high-level paradigms to add new words.

Example heuristic, from [ParadigmsSwe gfdoc/ParadigmsSwe.html]:
```
  regV : (leker : Str) -> V ;

  regV leker = case leker of {
    lek + ("a" | "ar")  => conj1 (lek + "a") ;
    lek + "er"          => conj2 (lek + "a") ;
    bo  + "r"           => conj3 bo
    }
```
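The same heuristic, sketched in Haskell for comparison. The cases are
tried in the same order as in the GF definition. The type ``Conj`` (which
just records the chosen class and its argument), the helper ``dropEnd``
and the ``Nothing`` result for unmatched input are assumptions of this
sketch, as are the test words other than //leker//.

```
import Data.List (isSuffixOf)

data Conj = Conj1 String | Conj2 String | Conj3 String
  deriving (Show, Eq)

-- Drop the last n characters of a string.
dropEnd :: Int -> String -> String
dropEnd n s = take (length s - n) s

-- Pick a conjugation class from the ending of the given form.
regV :: String -> Maybe Conj
regV leker
  | "ar" `isSuffixOf` leker = Just (Conj1 (dropEnd 2 leker ++ "a"))
  | "a"  `isSuffixOf` leker = Just (Conj1 leker)
  | "er" `isSuffixOf` leker = Just (Conj2 (dropEnd 2 leker ++ "a"))
  | "r"  `isSuffixOf` leker = Just (Conj3 (dropEnd 1 leker))
  | otherwise               = Nothing

main :: IO ()
main = mapM_ (print . regV) ["talar", "leker", "bor"]
```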

#NEW
==Example low-level morphological definition==

```
  decl2Noun : Str -> N = \bil ->
    let
      bb : Str * Str = case bil of {
        pojk + "e"                 => <pojk + "ar",    bil  + "n"> ;
        nyck + "e" + l@("l" | "r") => <nyck + l + "ar",bil  + "n"> ;
        sock + "e" + "n"           => <sock + "nar",   sock + "nen"> ;
        _                          => <bil + "ar",     bil  + "en">
        } ;
    in mkN bil bb.p2 bb.p1 (bb.p1 + "na") ;
```
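A Haskell transcription of this definition may help in reading the GF
patterns. The record ``N`` stands in for the result of ``mkN`` (singular
and plural, indefinite and definite); it and the helpers ``dropEnd``,
``stem`` and ``endsIn`` are assumptions of this sketch.

```
import Data.List (isSuffixOf)

-- The four forms given to mkN, in the same order.
data N = N { sgIndef, sgDef, plIndef, plDef :: String }
  deriving (Show, Eq)

-- Drop the last n characters of a string.
dropEnd :: Int -> String -> String
dropEnd n s = take (length s - n) s

decl2Noun :: String -> N
decl2Noun bil = N bil sgD plI (plI ++ "na")
  where
    stem n      = dropEnd n bil
    endsIn sufs = any (`isSuffixOf` bil) sufs
    -- (plural indefinite, singular definite), as the pair bb in GF
    (plI, sgD)
      | endsIn ["e"]        = (stem 1 ++ "ar", bil ++ "n")
      | endsIn ["el", "er"] = (stem 2 ++ [last bil] ++ "ar", bil ++ "n")
      | endsIn ["en"]       = (stem 2 ++ "nar", stem 2 ++ "nen")
      | otherwise           = (bil ++ "ar", bil ++ "en")

main :: IO ()
main = mapM_ (print . decl2Noun) ["bil", "pojke", "nyckel", "socken"]
```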


#NEW
==Some formats that can be generated from GF grammars==

```
-printer=lbnf           BNF Converter, thereby C/Bison, Java/JavaCup
-printer=fullform       full-form lexicon, short format
-printer=xml            XML: DTD for the pg command, object for st
-printer=gsl            Nuance GSL speech recognition grammar
-printer=jsgf           Java Speech Grammar Format
-printer=srgs_xml       SRGS XML format
-printer=srgs_xml_prob  SRGS XML format, with weights
-printer=slf            a finite automaton in the HTK SLF format
-printer=regular        a regular grammar in a simple BNF
-printer=gfc-prolog     gfc in prolog format (also pg)
```


#NEW
==Use as program components==

Haskell, Java, Prolog

Parsing, generation, translation

Push-button creation of spoken language translators (using Nuance)


#NEW
==Grammar library as linguistic resource==

Can we use the libraries outside domain-specific fragments?

We seem to be approaching full coverage from below.

The resource API is not good for heavy-duty parsing (too abstract and
therefore too inefficient).

Two ideas:
- write shallow parsers as application grammars
- generate corpora and use statistical parsing methods


#NEW
==Corpus generation==

The most general format is **multilingual treebank** generation:
```
  > gr -tr | l -multi
  UseCl TCond AAnter PNeg (PredVP (DetCN (DetSg DefSg NoOrd)
    (AdjCN (PositA young_A) (UseN woman_N))) (ComplV2 love_V2 (UsePron he_Pron)))

  The young woman wouldn't have loved him
  Den unga kvinnan skulle inte ha älskat honom
  Den unge kvinna ville ikke ha elska ham
  La joven mujer no lo habría amado
  La giovane donna non lo avrebbe amato
  La jeune femme ne l' aurait pas aimé
  Nuori nainen ei olisi rakastanut häntä
```
Generation is either exhaustive or random, possibly
with probability weights attached to constructors.

A special case is **corpus generation**: just keep one language.
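
Exhaustive generation can be sketched in Haskell for a toy grammar. The
two-pronoun, two-verb grammar and all names below are assumptions made for
this sketch; they are not the resource grammar used in the session above.

```
data Pron = He | She deriving (Show, Eq, Enum, Bounded)
data S = Sleep Pron | See Pron Pron deriving (Show, Eq)

-- Exhaustive generation: every tree of the toy grammar.
trees :: [S]
trees = [Sleep x | x <- prons] ++ [See x y | x <- prons, y <- prons]
  where prons = [minBound .. maxBound]

-- (subject form, object form) of each pronoun
eng, swe :: Pron -> (String, String)
eng He  = ("he",  "him")
eng She = ("she", "her")
swe He  = ("han", "honom")
swe She = ("hon", "henne")

linEng, linSwe :: S -> String
linEng (Sleep x) = unwords [fst (eng x), "sleeps"]
linEng (See x y) = unwords [fst (eng x), "sees", snd (eng y)]
linSwe (Sleep x) = unwords [fst (swe x), "sover"]
linSwe (See x y) = unwords [fst (swe x), "ser", snd (swe y)]

-- Multilingual treebank: each tree with all its linearizations.
treebank :: [(S, [String])]
treebank = [(t, [linEng t, linSwe t]) | t <- trees]

-- Corpus generation: keep just one language.
corpus :: (S -> String) -> [String]
corpus lin = map lin trees

main :: IO ()
main = mapM_ putStrLn (corpus linSwe)
```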

Can this be useful? Cf. Rebecca Jonson this afternoon.


#NEW
==Related work==

CLE = Core Language Engine
- the closest point of comparison in coverage and purpose
- resource API similar to "Quasi-Logical Form"
- parametrized modules instead of grammar porting via macro packages
- grammar specialization via partial evaluation instead of explanation-based learning
  - therefore, transfer at compile time as often as possible


LinGo Matrix project (HPSG)
- methodology rather than formal discipline for multilingual grammars
- not intended as a library, no grammar specialization?
- wider coverage - parsing real texts


Parsing detached from grammar (Nivre) - grammar detached from parsing

#NEW
==Demo==

Stoneage grammar, based on the Swadesh word list.

Implemented as application on top of the resource grammar.

Illustrates generation and spoken-language parsing.


%http://www.boost.org/