
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html><head><title>GF Resource Grammar Library</title></head>
 <body bgcolor="#ffffff" text="#000000">
<center>

<img src="gf-logo.gif">

<h1>GF Resource Grammar Library</h1>

<p>

Second Version, Gothenburg, 18 February 2005
<br>
First Draft, Gothenburg, 7 February 2005

</p><p>

Aarne Ranta

</p><p>

<tt>aarne@cs.chalmers.se</tt>
</p></center>


<!-- NEW -->
<h2>GF = Grammatical Framework</h2>

A grammar formalism based on functional programming and type theory.

<p>

Designed to be nice for <i>ordinary programmers</i> to use.

<p>

Mission: to make natural-language applications available for
ordinary programmers, in tasks like
<ul>
<li> software documentation
<li> domain-specific translation
<li> human-computer interaction
<li> dialogue systems
</ul>
Thus <i>not</i> primarily another theoretical framework for
linguists.


<!-- NEW -->
<h2>Language + Libraries</h2>

Writing natural language grammars still requires
theoretical knowledge about the language.

<p>

Which kind of programmer is easier to find?
<ul>
<li> one who can write a sorting algorithm
<li> one who can write a grammar for Swedish determiners
</ul>

<p>

In main-stream programming, sorting algorithms are not
written by hand but taken from <b>libraries</b>.

<p>

In the same way, we want to create grammar libraries that encapsulate
basic linguistic facts.

<p>

Cf. the Java success story: the language is only half of the
success; the libraries are the other half.


<!-- NEW -->
<h2>Example of library-based grammar writing</h2>

To define Swedish definite phrases from scratch:
<pre>

</pre>
To use a library function for Swedish definite phrases:
<pre>

</pre>
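<p>

As a minimal sketch of this contrast (not the actual resource code): the
module names <tt>DefSwe</tt>, <tt>Cars</tt> and <tt>CarsSwe</tt> and every
operation in them are hypothetical, and each module would go in its own file.
From scratch, someone has to write down the parameters, the agreement of the
free-standing article, and the "double definiteness" of Swedish:
<pre>
  resource DefSwe = {
    param
      Number = Sg | Pl ;
      Gender = Utr | Neutr ;
    oper
      -- definite noun forms only, plus inherent gender
      DefNoun : Type = {s : Number => Str ; g : Gender} ;

      -- the free-standing article agrees in number and gender
      defArt : Number -> Gender -> Str = \n, g -> case n of {
        Sg => case g of {Utr => "den" ; Neutr => "det"} ;
        Pl => "de"
      } ;

      -- article + weak adjective form + suffixed noun
      defPhrase : Number -> Str -> DefNoun -> {s : Str} = \n, adj, noun ->
        {s = defArt n (noun.g) ++ adj ++ noun.s ! n} ;

      bil : DefNoun = {s = table {Sg => "bilen" ; Pl => "bilarna"} ; g = Utr} ;
      hus : DefNoun = {s = table {Sg => "huset" ; Pl => "husen"} ; g = Neutr} ;
  }
</pre>
Once that work sits in a library, the application grammar reduces to
function calls:
<pre>
  abstract Cars = {
    cat Phrase ;
    fun TheOldCar, TheOldCars, TheOldHouse : Phrase ;
  }

  concrete CarsSwe of Cars = open DefSwe in {
    lincat Phrase = {s : Str} ;
    lin
      TheOldCar   = defPhrase Sg "gamla" bil ;  -- den gamla bilen
      TheOldCars  = defPhrase Pl "gamla" bil ;  -- de gamla bilarna
      TheOldHouse = defPhrase Sg "gamla" hus ;  -- det gamla huset
  }
</pre>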


<!-- NEW -->
<h2>Questions in grammar library design</h2>

What should there be in the library?
<br>
<li> morphology, lexicon, syntax, semantics,...

<p>

How do we organize and present the library?
<br>
<li> division into modules, level of granularity
<br>
<li> "school grammar" vs. sophisticated linguistic concepts

<p>

Where do we get the data from?
<br>
<li> automatic extraction or hand-writing?
<br>
<li> reuse of existing resources?

<p>

Extra constraint: we want open-source free software.


<!-- NEW -->
<h2>The scope of the resource grammar library</h2>

All morphological paradigms

<p>

Basic lexicon of structural, common, and irregular words

<p>

Basic syntactic structures

<p>

Currently <i>no</i> semantics, and
<i>no</i> language-specific structures unless they are
necessary for expressivity.


<!-- NEW -->
<h2>Success criteria</h2>

Grammatical correctness

<p>

Semantic coverage: you can express whatever you want.

<p>

Usability as a library for non-linguists.

<p>

(Bonus for linguists:) nice generalizations w.r.t. language
families, using the module system of GF.


<!-- NEW -->
<h2>These are not our success criteria</h2>

Language coverage: you can parse all expressions. Example:
the French <i>passé simple</i> tense, although covered by the
morphology, is not used in the language-independent API;
only the <i>passé composé</i> is.

<p>

Semantic correctness
<pre>
  colourless green ideas sleep furiously

  the time is seventy past forty-two
</pre>

<p>

(Warning for linguists:) theoretical innovation in
syntax


<!-- NEW -->
<h2>So where is semantics?</h2>

GF incorporates a <b>Logical Framework</b> and is therefore
capable of expressing logical semantics <i>à la</i> Montague
or any other flavour, including anaphora and discourse.

<p>

But we do <i>not</i> believe semantics can be given once and
for all for a natural language.

<p>

Instead, we expect semantics to be given in
<b>application grammars</b> built on semantic models
of different domains.

<p>

Example application: number theory
<pre>
  fun Even : Nat -> Prop ;         -- a mathematical predicate

  lin Even = predA (regA "even") ; -- English translation
  lin Even = predA (regA "pair") ; -- French translation
  lin Even = predA (regA "jämn") ; -- Swedish translation
</pre>
How could the resource predict that just <i>these</i>
translations are correct in this domain?

<p>

Application grammars are built by experts in these domains
who, thanks to resource grammars, no longer need to be
experts in linguistics.


<!-- NEW -->
<h2>Languages</h2>

The current GF Resource Project covers ten languages:
<ul>
<li><tt>Dan</tt>ish
<li><tt>Eng</tt>lish
<li><tt>Fin</tt>nish
<li><tt>Fre</tt>nch
<li><tt>Ger</tt>man
<li><tt>Ita</tt>lian
<li><tt>Nor</tt>wegian
<li><tt>Rus</tt>sian
<li><tt>Spa</tt>nish
<li><tt>Swe</tt>dish
</ul>
The first three letters (<tt>Dan</tt> etc.) are used in grammar module names.


<!-- NEW -->
<h2>Library structure 1: language-independent API</h2>

<li> syntactic <tt>Categories</tt> (parts of speech, word classes), e.g.
<pre>
  V ; NP ; CN ; Det ;  -- verb, noun phrase, common noun, determiner
</pre>
<li> <tt>Rules</tt> for combining words and phrases, e.g.
<pre>
  DetNP : Det -> CN -> NP ; -- combine Det and CN into NP
</pre>
<li> the most common <tt>Structural</tt> words (determiners,
conjunctions, pronouns), e.g.
<pre>
  and_Conj : Conj ;
</pre>

<!-- NEW -->
<h2>Library structure 2: language-dependent modules</h2>

<li> morphological <tt>Paradigms</tt>, e.g.
<pre>
  mkN : Str -> Str -> Str -> Str -> Gender -> N ; -- worst-case nouns
  mkN : Str -> N ;                                -- regular nouns
</pre>
<li> irregular <tt>Verbs</tt>, e.g.
<pre>
  angripa_V = irregV "angripa" "angrep" "angripit" ;
</pre>
<li> <tt>Lexicon</tt> of frequent words
<pre>
  man_N = mkN "man" "mannen" "män" "männen" masculine ;
</pre>
<li> <tt>Ext</tt>ended syntax with language-specific rules
<pre>
  PassBli : V2 -> NP -> VP ;  -- bli överkörd av ngn
</pre>
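<p>

How a one-argument paradigm can be derived from the worst case could be
sketched as follows. This is a self-contained toy, not the library's
<tt>Paradigms</tt> code: the names <tt>MiniParadigmsSwe</tt>, <tt>worstN</tt>,
<tt>regN</tt> and the parameter types are assumptions, and <tt>regN</tt>
covers only nouns inflected like <i>bil</i>.
<pre>
  resource MiniParadigmsSwe = {
    param
      Number  = Sg | Pl ;
      Species = Indef | Def ;
      Gender  = Utr | Neutr ;
    oper
      N : Type = {s : Number => Species => Str ; g : Gender} ;

      -- worst case: all four forms and the gender are given
      worstN : (sgIndef, sgDef, plIndef, plDef : Str) -> Gender -> N =
        \sgIndef, sgDef, plIndef, plDef, g -> {
          s = table {
            Sg => table {Indef => sgIndef ; Def => sgDef} ;
            Pl => table {Indef => plIndef ; Def => plDef}
          } ;
          g = g
        } ;

      -- regular case (assumed pattern): bil, bilen, bilar, bilarna
      regN : Str -> N = \bil ->
        worstN bil (bil + "en") (bil + "ar") (bil + "arna") Utr ;

      -- irregular nouns fall back on the worst case
      man_N : N = worstN "man" "mannen" "män" "männen" Utr ;
      bil_N : N = regN "bil" ;
  }
</pre>
The library user mostly needs the one-argument function; the worst case is
there for the words that no regular pattern predicts.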


<!-- NEW -->
<h2>How much can be language-independent?</h2>

For the ten languages we have considered, it <i>is</i> possible
to implement the current API.

<p>

Reservations:
<ul>
<li> does not necessarily extend to all other languages
<li> does not necessarily cover the most idiomatic expressions
     of each language
<li> may not be the easiest API to implement (e.g. negation and
inversion with  <i>do</i> in English suggest that some other
structure would be more natural)
<li> does not guarantee that the same structure has the same semantics
in different languages
</ul>


<!-- NEW -->
<h2>Library structure: language-independent API</h2>

<center>
<img src="Resource.gif">
</center>


<!-- NEW -->
<h2>Library structure: test bed for the language-independent API</h2>

<center>
<img src="Lang.gif">
</center>


<!-- NEW -->
<h2>API documentation</h2>

<a href="Categories.html">Categories</a>

<p>
<a href="Rules.html">Rules</a>

<p>
Alternative views on sentence formation:
<a href="Clause.html">Clause</a>,
<a href="Verbphrase.html">Verbphrase</a>

<p>
<a href="Structural.html">Structural</a>

<p>

<a href="Time.html">Time</a>

<p>
<a href="Basic.html">Basic</a>

<p>

<a href="Lang.html">Lang</a>

<p>


<!-- NEW -->
<h2>Paradigms documentation</h2>

<a href="ParadigmsEng.html">English paradigms</a>
<br>
<a href="BasicEng.html">example use of English paradigms</a>
<br>
<a href="VerbsEng.html">English verbs</a>

<p>

<a href="ParadigmsFre.html">French paradigms</a>
<br>
<a href="BasicFre.html">example use of French paradigms</a>
<br>
<a href="VerbsFre.html">French verbs</a>

<p>

<a href="ParadigmsSwe.html">Swedish paradigms</a>
<br>
<a href="BasicSwe.html">example use of Swedish paradigms</a>
<br>
<a href="VerbsSwe.html">Swedish verbs</a>


<!-- NEW -->
<h2>Use as top-level grammar: testing</h2>

Import a set of <tt>LangX</tt> grammars:
<pre>
  i english/LangEng.gf
  i swedish/LangSwe.gf
</pre>
Test with random generation, translation, morphological analysis...
<pre>


</pre>

<!-- NEW -->
<h2>Use as top-level grammar: language learning quizzes</h2>

Morpho quiz with words:
<pre>


</pre>
Morpho quiz with phrases:
<pre>


</pre>
Translation quiz with sentences:
<pre>


</pre>


<!-- NEW -->
<h2>Use as library</h2>

Import directly by <tt>open</tt>:
<pre>
  concrete AppNor of App = open LangNor, ParadigmsNor in {...}
</pre>
No more dummy <tt>reuse</tt> modules and bulky <tt>.gfr</tt> files!

<p>

If you need to convert resource category records to/from strings, use
<pre>
  Predef.toStr : (L : Type) -> L -> Str ;
</pre>
<tt>L</tt> must be a linearization type. For instance,
<pre>
  toStr LangNor.CN (ModAP (PositADeg old_ADeg) (UseN car_N))
  ---> "gammel bil"
</pre>
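<p>

Put together, a complete (if tiny) application might look like the following
sketch. The abstract syntax <tt>App</tt> with its category <tt>Item</tt> and
function <tt>OldCar</tt> is invented for illustration; the resource names are
the ones used in the <tt>toStr</tt> example, and it is an assumption that
<tt>LangNor</tt> exports them exactly under these names. Each module goes in
its own file.
<pre>
  -- file App.gf (hypothetical)
  abstract App = {
    cat Item ;
    fun OldCar : Item ;
  }

  -- file AppNor.gf (hypothetical)
  concrete AppNor of App = open LangNor, ParadigmsNor in {
    lincat Item = CN ;  -- reuse the resource's linearization type
    lin OldCar = ModAP (PositADeg old_ADeg) (UseN car_N) ;
  }
</pre>
The application author never touches the internals of <tt>CN</tt>: the record
is built and consumed only through resource functions.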


<!-- NEW -->
<h2>Use as library through parser</h2>

Use the parser when developing a resource.
<pre>
  > p -cat=S -v "jag ska åka till Chalmers"
  unknown tokens [TS "åka",TS "Chalmers"]

  > p -cat=S "jag ska gå till Danmark"
  UseCl (PosTP TFuture ASimul)
    (AdvCl (SPredV i_NP go_V)
    (AdvPP (PrepNP to_Prep (UsePN (PNCountry Denmark)))))
</pre>
Extend vocabulary at need.
<pre>
  åka_V = lexV "åker" ;
  Chalmers = regPN "Chalmers" neutrum ;
</pre>


<!-- NEW -->
<h2>Example application: a small translation system</h2>

You can say things like the following:
<pre>
  who chases mice ?
  whom does the lion chase ?
  the dog chases cats
</pre>
Source modules:

<p>

<a href=example/Animals.gf>Animals</a>

<p>

<a href=example/AnimalsEng.gf>AnimalsEng</a>

<p>

<a href=example/AnimalsFre.gf>AnimalsFre</a>

<p>

<a href=example/AnimalsSwe.gf>AnimalsSwe</a>
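<p>

The linked source files are authoritative. As a rough idea of what an
abstract syntax for such phrases could look like, here is a sketch in which
every category and function name is an assumption rather than a quotation
from <tt>Animals.gf</tt>:
<pre>
  abstract Animals = {
    cat
      Phrase ; Animal ; Action ;
    fun
      Who    : Action -> Animal -> Phrase ;            -- who chases mice ?
      Whom   : Animal -> Action -> Phrase ;            -- whom does the lion chase ?
      Answer : Animal -> Action -> Animal -> Phrase ;  -- the dog chases cats
      Dog, Cat, Mouse, Lion : Animal ;
      Chase : Action ;
  }
</pre>
The concrete modules would then linearize each function by calling resource
rules, one module per language.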


<!-- NEW -->
<h2>Compiling the example application</h2>

The resources are bulky, and it therefore takes a lot of
time and memory to load the grammars. However, they can be
compiled into the <tt>gfcm</tt>
(<b>GF canonical multilingual</b>) format,
which is almost one thousand times smaller and faster to load
for this set of grammars.

<p>

Just issue the following GF commands
<pre>
  i -src AnimalsEng.gf ;; s
  i -src AnimalsFre.gf ;; s
  i -src AnimalsSwe.gf ;; s
  pm | wf animals.gfcm
</pre>
and you get an end-user grammar <tt>animals.gfcm</tt>.

<p>

You can also write the commands in a <tt>gfs</tt> (<b>GF script</b>)
file, say
<a href=mkAnimals.gfc><tt>mkAnimals.gfs</tt></a>,
and then call GF with
<pre>
  gf &lt;mkAnimals.gfs
</pre>


<!-- NEW -->
<h2>Further simplifications of the application grammar</h2>

Step 1: use a simplified access to present-tense sentences,
<tt>SentenceX</tt> (to be written...)

<p>

Step 2: factor out the categories and purely combinational
rules into an <tt>incomplete</tt> module (to be shown... but
this does not work for French, which uses different structures:
e.g. <i>Qui aime les lions ?</i> with a definite phrase
where English has <i>Who loves lions?</i>)


<!-- NEW -->
<h2>Implementation details: the structure of low-level files</h2>

<center>
<img src="Low.gif">
</center>


<!-- NEW -->
<h2>The use of parametric modules</h2>

In two language families:
<ul>
<li> Romance: French, Italian, Spanish
<li> Scandinavian: Danish, Norwegian, Swedish
</ul>
<center>
<img src="Scand.gif">
</center>
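<p>

The mechanism can be sketched with a toy example. None of these modules exist
in the library: <tt>Pair</tt>, <tt>DiffScand</tt>, <tt>PairScand</tt> and
their contents are hypothetical, and each module goes in its own file. The
shared code is written once against an <tt>interface</tt>, and each language
supplies an <tt>instance</tt>:
<pre>
  abstract Pair = {
    cat Phr ; Item ;
    fun
      Both : Item -> Item -> Phr ;
      Cat, Dog : Item ;
  }

  -- what the shared code needs to know about each language
  interface DiffScand = {
    oper
      andWord : Str ;
      catWord : Str ;
      dogWord : Str ;
  }

  instance DiffSwe of DiffScand = {
    oper
      andWord = "och" ;
      catWord = "katt" ;
      dogWord = "hund" ;
  }

  instance DiffDan of DiffScand = {
    oper
      andWord = "og" ;
      catWord = "kat" ;
      dogWord = "hund" ;
  }

  -- written once for the whole family
  incomplete concrete PairScand of Pair = open DiffScand in {
    lincat Phr, Item = {s : Str} ;
    lin
      Both x y = {s = x.s ++ andWord ++ y.s} ;
      Cat = {s = catWord} ;
      Dog = {s = dogWord} ;
  }

  -- one line per language
  concrete PairSwe of Pair = PairScand with (DiffScand = DiffSwe) ;
  concrete PairDan of Pair = PairScand with (DiffScand = DiffDan) ;
</pre>
Adding a third language to the family then costs one more instance and one
more instantiation line, while the shared rules stay untouched.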


<!-- NEW -->
<h2>Current status</h2>

<table border=1>
<tr><td>Language</td> <td>v0.6</td> <td>API</td> <td>Paradigms</td> <td>Basic lex</td> <td>Verbs</td></tr>
<tr><td>Danish</td>    <td> </td> <td>X</td> <td> </td> <td> </td> <td> </td></tr>
<tr><td>English</td>   <td>X</td> <td>X</td> <td>X</td> <td>X</td> <td>X</td></tr>
<tr><td>Finnish</td>   <td>X</td> <td> </td> <td> </td> <td> </td> <td> </td></tr>
<tr><td>French</td>    <td>X</td> <td>*</td> <td>X</td> <td>X</td> <td>X</td></tr>
<tr><td>German</td>    <td>X</td> <td> </td> <td>*</td> <td> </td> <td> </td></tr>
<tr><td>Italian</td>   <td>X</td> <td>*</td> <td>*</td> <td> </td> <td>*</td></tr>
<tr><td>Norwegian</td> <td> </td> <td>X</td> <td> </td> <td> </td> <td> </td></tr>
<tr><td>Russian</td>   <td>X</td> <td>*</td> <td>*</td> <td> </td> <td> </td></tr>
<tr><td>Spanish</td>   <td> </td> <td>*</td> <td> </td> <td> </td> <td>*</td></tr>
<tr><td>Swedish</td>   <td>X</td> <td>X</td> <td>X</td> <td>X</td> <td>X</td></tr>
</table>

<!-- NEW -->
<h2>Obtaining it</h2>

Now on CVS at Chalmers:
<pre>
  cvs -d /users/cs/aarne/cvs checkout GF2.0/lib
</pre>

<p>

To appear later on the GF homepage:<p>

<a href="http://www.cs.chalmers.se/%7Eaarne/GF">
<tt>http://www.cs.chalmers.se/~aarne/GF</tt></a>
</p></body></html>