<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html><head><title>GF Resource Grammar Library</title></head>
<body bgcolor="#ffffff" text="#000000">
<center>
<img src="gf-logo.gif">
<h1>GF Resource Grammar Library</h1>
<p>
Third Version, 22 May 2005
<br>
Second Version, 1 March 2005
<br>
First Draft, 7 February 2005
</p><p>
Aarne Ranta
</p><p>
<tt>aarne@cs.chalmers.se</tt>
</p></center>
<!-- NEW -->
<h2>GF = Grammatical Framework</h2>
A grammar formalism based on functional programming and type theory.
<p>
Designed to be nice for <i>ordinary programmers</i> to use: by this
we mean programmers without training in linguistics.
<p>
Mission: to make natural-language applications available for
ordinary programmers, in tasks like
<ul>
<li> software documentation
<li> domain-specific translation
<li> human-computer interaction
<li> dialogue systems
</ul>
Thus <i>not</i> primarily another theoretical framework for
linguists.
<!-- NEW -->
<h2>Multilingual grammars</h2>
<b>Abstract syntax</b>: language-independent representation
<pre>
cat Prop ; Nat ;
fun Even : Nat -> Prop ;
</pre>
<b>Concrete syntax</b>: mapping from abstract syntax trees to strings in a language
(English, French, German, Swedish,...)
<pre>
lin Even x = {s = x.s ++ "is" ++ "even"} ;
lin Even x = {s = x.s ++ "est" ++ "pair"} ;
lin Even x = {s = x.s ++ "ist" ++ "gerade"} ;
lin Even x = {s = x.s ++ "är" ++ "jämnt"} ;
</pre>
We can <b>translate</b> between languages via the abstract syntax.
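<p>
As a minimal sketch, the fragments above can be completed into whole modules,
one per file. The module names <tt>Arith</tt>, <tt>ArithEng</tt>, and
<tt>ArithFre</tt> and the constant <tt>Two</tt> are assumptions made for this
illustration only.
<pre>
-- file Arith.gf: the language-independent abstract syntax
abstract Arith = {
  flags startcat = Prop ;
  cat Prop ; Nat ;
  fun
    Even : Nat -> Prop ;
    Two : Nat ;  -- hypothetical constant, so that Even has an argument
}

-- file ArithEng.gf: English concrete syntax
concrete ArithEng of Arith = {
  lincat Prop, Nat = {s : Str} ;
  lin
    Even x = {s = x.s ++ "is" ++ "even"} ;
    Two = {s = "2"} ;
}

-- file ArithFre.gf: French concrete syntax
concrete ArithFre of Arith = {
  lincat Prop, Nat = {s : Str} ;
  lin
    Even x = {s = x.s ++ "est" ++ "pair"} ;
    Two = {s = "2"} ;
}
</pre>
The tree <tt>Even Two</tt> linearizes to "2 is even" in <tt>ArithEng</tt> and
to "2 est pair" in <tt>ArithFre</tt>, so parsing with one concrete syntax and
linearizing with the other amounts to translation.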
<p>
Is it really so simple?
<!-- NEW -->
<h2>Difficulties with concrete syntax</h2>
Most languages have rules of <b>inflection</b>, <b>agreement</b>,
and <b>word order</b>, which have to be obeyed when putting together
expressions.
<p>
The previous multilingual grammar breaks these rules in many situations:
<p><i>
2 and 3 is even<br>
la somme de 3 et de 5 est pair<br>
wenn 2 ist gerade, dann 2+2 ist gerade<br>
om 2 är jämnt, 2+2 är jämnt<br>
</i>
<!-- NEW -->
<h2>Solving the difficulties</h2>
GF has tools for expressing the linguistic rules that are needed to
produce correct translations in different languages.
<p>
Instead of just strings, we need <b>parameters</b>, <b>tables</b>,
and <b>record types</b>. For instance, French:
<pre>
param Mod = Ind | Subj ;
param Gen = Masc | Fem ;
lincat Nat = {s : Str ; g : Gen} ;
lincat Prop = {s : Mod => Str} ;
lin Even x = {s =
table {
m => x.s ++
case m of {Ind => "est" ; Subj => "soit"} ++
case x.g of {Masc => "pair" ; Fem => "paire"}
}
} ;
</pre>
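<p>
The French fragment can be turned into a self-contained pair of modules, as
sketched here. The module names, the constants <tt>Two</tt> and <tt>Sum</tt>,
and the helper tables <tt>etre</tt> and <tt>pair</tt> are hypothetical and
serve only to make the example complete.
<pre>
-- file Parity.gf
abstract Parity = {
  flags startcat = Prop ;
  cat Prop ; Nat ;
  fun
    Even : Nat -> Prop ;
    Two : Nat ;                -- hypothetical constants for illustration
    Sum : Nat -> Nat -> Nat ;
}

-- file ParityFre.gf
concrete ParityFre of Parity = {
  param Mod = Ind | Subj ;
  param Gen = Masc | Fem ;
  lincat Nat = {s : Str ; g : Gen} ;
  lincat Prop = {s : Mod => Str} ;
  oper
    -- copula forms selected by mood, adjective forms selected by gender
    etre : Mod => Str = table {Ind => "est" ; Subj => "soit"} ;
    pair : Gen => Str = table {Masc => "pair" ; Fem => "paire"} ;
  lin
    Even x = {s = table {m => x.s ++ etre ! m ++ pair ! x.g}} ;
    Two = {s = "2" ; g = Masc} ;
    -- "somme" is feminine, so the whole expression is feminine
    Sum x y = {s = "la" ++ "somme" ++ "de" ++ x.s ++ "et" ++ "de" ++ y.s ;
               g = Fem} ;
}
</pre>
With these definitions, the indicative form of <tt>Even (Sum Two Two)</tt> is
"la somme de 2 et de 2 est paire": the adjective agrees with the gender that
the <tt>Nat</tt> record carries in its <tt>g</tt> field.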
<!-- NEW -->
<h2>Language + Libraries</h2>
Writing natural language grammars still requires
theoretical knowledge about the language.
<p>
Which kind of programmer is easier to find?
<ul>
<li> one who can write a sorting algorithm
<li> one who can write a grammar for Swedish determiners
</ul>
<p>
In mainstream programming, sorting algorithms are not
written by hand but taken from <b>libraries</b>.
<p>
In the same way, we want to create grammar libraries that encapsulate
basic linguistic facts.
<p>
Cf. the Java success story: the language is only half of the
success; the libraries are the other half.
<!-- NEW -->
<h2>Example of library-based grammar writing</h2>
To define a Swedish expression of a mathematical predicate from scratch:
<pre>
Even x =
let jämn = case &lt;x.n,x.g> of {
&lt;Sg,Utr> => "jämn" ;
&lt;Sg,Neutr> => "jämnt" ;
&lt;Pl,_> => "jämna"
}
in
{s = table {
Main => x.s ! Nom ++ "är" ++ jämn ;
Inv => "är" ++ x.s ! Nom ++ jämn ;
Sub => x.s ! Nom ++ "är" ++ jämn
}
}
</pre>
To use library functions for syntax and morphology:
<pre>
Even = predA (regA "jämn") ;
</pre>
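<p>
To show what such library functions encapsulate, here is a toy stand-in for
the resource, written as a GF <tt>resource</tt> module. It is a sketch under
simplifying assumptions, not the library's actual code: the module names are
hypothetical, noun phrase case is omitted, and <tt>regA</tt> and
<tt>predA</tt> are reduced versions that merely share their names with the
library functions.
<pre>
-- file MiniResSwe.gf: a toy resource, NOT the real library
resource MiniResSwe = {
  param Num = Sg | Pl ;
  param Gen = Utr | Neutr ;
  param Order = Main | Inv | Sub ;
  oper
    NP : Type = {s : Str ; n : Num ; g : Gen} ;
    Adj : Type = {s : Num => Gen => Str} ;
    Cl : Type = {s : Order => Str} ;

    -- regular adjectives: stem, stem + "t", stem + "a"
    regA : Str -> Adj = \stem -> {s = table {
      Sg => table {Utr => stem ; Neutr => stem + "t"} ;
      Pl => table {_ => stem + "a"}
      }} ;

    -- predication with agreement and three word orders
    predA : Adj -> NP -> Cl = \a,np ->
      let adj : Str = a.s ! np.n ! np.g in
      {s = table {
        Main => np.s ++ "är" ++ adj ;
        Inv => "är" ++ np.s ++ adj ;
        Sub => np.s ++ "är" ++ adj
      }} ;
}

-- file Evens.gf
abstract Evens = {
  flags startcat = Prop ;
  cat Prop ; Nat ;
  fun Even : Nat -> Prop ; Two : Nat ;
}

-- file EvensSwe.gf: the application grammar only calls the library
concrete EvensSwe of Evens = open MiniResSwe in {
  lincat Nat = NP ; Prop = Cl ;
  lin
    Even x = predA (regA "jämn") x ;
    Two = {s = "2" ; n = Sg ; g = Neutr} ;
}
</pre>
The application grammar contains no inflection tables or word order rules of
its own. In this sketch, <tt>Even Two</tt> yields "2 är jämnt" in main-clause
order and "är 2 jämnt" in inverted order.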
<!-- NEW -->
<h2>Questions in grammar library design</h2>
What should there be in the library?
<br>
<li> morphology, lexicon, syntax, semantics,...
<p>
How do we organize and present the library?
<br>
<li> division into modules, level of granularity
<br>
<li> "school grammar" vs. sophisticated linguistic concepts
<p>
Where do we get the data from?
<br>
<li> automatic extraction or hand-writing?
<br>
<li> reuse of existing resources?
<br>
Extra constraint: we want open-source free software and
hence cannot use existing proprietary resources.
<!-- NEW -->
<h2>The scope of a resource grammar library for a language</h2>
All morphological paradigms
<p>
Basic lexicon of structural, common, and irregular words
<p>
Basic syntactic structures
<p>
Currently,<br>
<li> <i>no</i> semantics,<br>
<li> <i>no</i> language-specific structures if not necessary for expressivity.
<!-- NEW -->
<h2>Success criteria</h2>
Grammatical correctness
<p>
Semantic coverage: you can express whatever you want.
<p>
Usability as library for non-linguists.
<p>
(Bonus for linguists:) nice generalizations w.r.t. language
families, using the module system of GF.
<!-- NEW -->
<h2>These are not our success criteria</h2>
Language coverage: to be able to parse all expressions.
<br>
Example:
the French <i>passé simple</i> tense, although covered by the
morphology, is not used in the language-independent API;
only the <i>passé composé</i> is. However, an application
accessing the French-specific (or Romance-specific)
modules can use the passé simple.
<p>
Semantic correctness: only to produce meaningful expressions.
<br>
Example: the following sentences can be generated
<pre>
colourless green ideas sleep furiously
the time is seventy past forty-two
</pre>
However, an application grammar can use a domain-specific
semantics to guarantee semantic well-formedness.
<p>
(Warning for linguists:) theoretical innovation in
syntax is not among the goals
(and it would be hidden from users anyway!).
<!-- NEW -->
<h2>So where is semantics?</h2>
GF incorporates a <b>Logical Framework</b> and is therefore
capable of expressing logical semantics <i>à la</i> Montague
or any other flavour, including anaphora and discourse.
<p>
But we do <i>not</i> try to give semantics once and
for all for the whole language.
<p>
Instead, we expect semantics to be given in
<b>application grammars</b> built on semantic models
of different domains.
<p>
Example application: number theory
<pre>
fun Even : Nat -> Prop ; -- a mathematical predicate
lin Even = predA (regA "even") ; -- English translation
lin Even = predA (regA "pair") ; -- French translation
lin Even = predA (regA "jämn") ; -- Swedish translation
</pre>
How could the resource predict that just <i>these</i>
translations are correct in this domain?
<p>
Application grammars are built by experts of these domains
who, thanks to resource grammars, no longer need to be
experts in linguistics.
<!-- NEW -->
<h2>Languages</h2>
The current GF Resource Project covers ten languages:
<ul>
<li><tt>Dan</tt>ish
<li><tt>Eng</tt>lish
<li><tt>Fin</tt>nish
<li><tt>Fre</tt>nch
<li><tt>Ger</tt>man
<li><tt>Ita</tt>lian
<li><tt>Nor</tt>wegian
<li><tt>Rus</tt>sian
<li><tt>Spa</tt>nish
<li><tt>Swe</tt>dish
</ul>
The first three letters (<tt>Dan</tt> etc.) are used in grammar module names.
<!-- NEW -->
<h2>Library structure 1: language-independent API</h2>
<li> syntactic <tt>Categories</tt> (parts of speech, word classes), e.g.
<pre>
V ; NP ; CN ; Det ; -- verb, noun phrase, common noun, determiner
</pre>
<li> <tt>Rules</tt> for combining words and phrases, e.g.
<pre>
DetNP : Det -> CN -> NP ; -- combine Det and CN into NP
</pre>
<li> the most common <tt>Structural</tt> words (determiners,
conjunctions, pronouns), e.g.
<pre>
and_Conj : Conj ;
</pre>
<!-- NEW -->
<h2>Library structure 2: language-dependent modules</h2>
<li> morphological <tt>Paradigms</tt>, e.g.
<pre>
mkN : Str -> Str -> Str -> Str -> Gender -> N ; -- worst-case nouns
mkN : Str -> N ; -- regular nouns
</pre>
<li> irregular <tt>Verbs</tt>, e.g.
<pre>
angripa_V = irregV "angripa" "angrep" "angripit" ;
</pre>
<li> <tt>Lexicon</tt> of frequent words
<pre>
man_N = mkN "man" "mannen" "män" "männen" masculine ;
</pre>
<li> <tt>Ext</tt>ended syntax with language-specific rules
<pre>
PassBli : V2 -> NP -> VP ; -- bli överkörd av ngn
</pre>
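<p>
The relation between a worst-case paradigm and a regular one can be sketched
in a small self-contained resource module. The module name, the parameter
types, and the definitions are assumptions made for illustration and are
simpler than the real <tt>Paradigms</tt> modules. The regular paradigm is
called <tt>regN</tt> here to keep the two functions apart.
<pre>
-- file MiniParadigmsSwe.gf: a toy paradigm module, NOT the real library
resource MiniParadigmsSwe = open Predef in {
  param Num = Sg | Pl ;
  param Spec = Indef | Def ;
  param Gender = Utr | Neutr ;
  oper
    N : Type = {s : Num => Spec => Str ; g : Gender} ;

    -- worst case: all four forms and the gender are given explicitly
    mkN : Str -> Str -> Str -> Str -> Gender -> N =
      \sgi,sgd,pli,pld,g -> {
        s = table {
          Sg => table {Indef => sgi ; Def => sgd} ;
          Pl => table {Indef => pli ; Def => pld}
        } ;
        g = g
      } ;

    -- regular case for nouns like "flicka": everything follows from one form
    regN : Str -> N = \flicka ->
      let flick = tk 1 flicka  -- Predef.tk drops the last character
      in mkN flicka (flicka + "n") (flick + "or") (flick + "orna") Utr ;

    -- usage: an irregular noun needs the worst case, a regular one does not
    man_N : N = mkN "man" "mannen" "män" "männen" Utr ;
    flicka_N : N = regN "flicka" ;
}
</pre>
Here <tt>flicka_N</tt> receives the forms "flicka", "flickan", "flickor", and
"flickorna" from a single string, whereas <tt>man_N</tt> has to list its
irregular plural forms.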
<!-- NEW -->
<h2>How much can be language-independent?</h2>
For the ten languages we have considered, it <i>is</i> possible
to implement the current API.
<p>
Reservations:
<ul>
<li> does not necessarily extend to all other languages
<li> does not necessarily cover the most idiomatic expressions
of each language
<li> may not be the easiest API to implement (e.g. negation and
inversion with <i>do</i> in English suggest that some other
structure would be more natural)
<li> does not guarantee that the same structure has the same semantics
in different languages
</ul>
<!-- NEW -->
<h2>Library structure: language-independent API</h2>
<center>
<img src="Resource.gif">
</center>
<!-- NEW -->
<h2>Library structure: test bed for the language-independent API</h2>
<center>
<img src="Lang.gif">
</center>
<!-- NEW -->
<h2>API documentation</h2>
<a href="Categories.html">Categories</a>
<p>
<a href="Rules.html">Rules</a>
<p>
Alternative views on sentence formation:
<a href="Clause.html">Clause</a>,
<a href="Verbphrase.html">Verbphrase</a>
<p>
<a href="Structural.html">Structural</a>
<p>
<a href="Time.html">Time</a>
<p>
<a href="Basic.html">Basic</a>
<p>
<a href="Lang.html">Lang</a>
<p>
<!-- NEW -->
<h2>Paradigms documentation</h2>
<a href="ParadigmsEng.html">English paradigms</a>
<br>
<a href="BasicEng.html">example use of English paradigms</a>
<br>
<a href="VerbsEng.html">English verbs</a>
<p>
<a href="ParadigmsFre.html">French paradigms</a>
<br>
<a href="BasicFre.html">example use of French paradigms</a>
<br>
<a href="VerbsFre.html">French verbs</a>
<p>
<a href="ParadigmsIta.html">Italian paradigms</a>
<br>
<a href="BasicIta.html">example use of Italian paradigms</a>
<br>
<a href="BeschIta.html">Italian verb conjugations</a>
<p>
<a href="ParadigmsNor.html">Norwegian paradigms</a>
<br>
<a href="BasicNor.html">example use of Norwegian paradigms</a>
<br>
<a href="VerbsNor.html">Norwegian verbs</a>
<p>
<a href="ParadigmsSpa.html">Spanish paradigms</a>
<br>
<a href="BasicSpa.html">example use of Spanish paradigms</a>
<br>
<a href="BeschSpa.html">Spanish verb conjugations</a>
<p>
<a href="ParadigmsSwe.html">Swedish paradigms</a>
<br>
<a href="BasicSwe.html">example use of Swedish paradigms</a>
<br>
<a href="VerbsSwe.html">Swedish verbs</a>
<!-- NEW -->
<h2>Use as top-level grammar: testing</h2>
Import a set of <tt>LangX</tt> grammars:
<pre>
i english/LangEng.gf
i swedish/LangSwe.gf
</pre>
Test with random generation, translation, morphological analysis...
<!-- NEW -->
<h2>Use as top-level grammar: language learning quizzes</h2>
Morpho quiz with words (e.g. French verbs):
<pre>
i french/VerbsFre.gf
mq -cat=V
</pre>
Morpho quiz with phrases (e.g. Swedish clauses):
<pre>
i swedish/LangSwe.gf
mq -cat=Cl
</pre>
Translation quiz with sentences (e.g. sentences from English to Swedish):
<pre>
i english/LangEng.gf
i swedish/LangSwe.gf
tq -cat=S LangEng LangSwe
</pre>
<!-- NEW -->
<h2>Use as library</h2>
Import directly by <tt>open</tt>:
<pre>
concrete AppNor of App = open LangNor, ParadigmsNor in {...}
</pre>
(Note for the users of GF 2.1 and older:
the dummy <tt>reuse</tt> modules and their bulky <tt>.gfr</tt> versions
are no longer needed!)
<p>
If you need to convert resource records to strings, and don't want to know
the concrete type (as you never should), you can use
<pre>
Predef.toStr : (L : Type) -> L -> Str ;
</pre>
<tt>L</tt> must be a linearization type. For instance,
<pre>
toStr LangNor.CN (ModAP (PositADeg old_ADeg) (UseN car_N))
---> "gammel bil"
</pre>
<!-- NEW -->
<h2>Use as library through parser</h2>
You can use the parser with a <tt>LangX</tt> grammar
when developing a resource.
<p>
Using the <tt>-v</tt> option shows if the parser fails because
of unknown words.
<pre>
> p -cat=S -v "jag ska åka till Chalmers"
unknown tokens [TS "åka",TS "Chalmers"]
</pre>
Then try to select words that <tt>LangX</tt> recognizes:
<pre>
> p -cat=S "jag ska gå till Danmark"
UseCl (PosTP TFuture ASimul)
(AdvCl (SPredV i_NP go_V)
(AdvPP (PrepNP to_Prep (UsePN (PNCountry Denmark)))))
</pre>
Use these API structures and extend the vocabulary to match your needs.
<pre>
åka_V = lexV "åker" ;
Chalmers = regPN "Chalmers" neutrum ;
</pre>
<!-- NEW -->
<h2>Syntax editor as library browser</h2>
You can run the syntax editor on <tt>LangX</tt> to
find resource API functions through context-sensitive menus.
For instance, the shell command
<pre>
jgf LangEng.gf LangFre.gf
</pre>
opens the editor with English and French views. The
<a href="http://www.cs.chalmers.se/%7Eaarne/GF2.0/doc/javaGUImanual/javaGUImanual.htm">
Editor User Manual</a> gives more information on the use of the editor.
<p>
A restriction of the editor is that it does not give access to
<tt>ParadigmsX</tt> modules. An IDE that extends the editor
into a grammar programming tool is work in progress.
<!-- NEW -->
<h2>Example application: a small translation system</h2>
In this system, you can express questions and answers of
the following forms:
<pre>
Who chases mice ?
Whom does the lion chase ?
The dog chases cats.
</pre>
We build the abstract syntax in two phases:
<ul>
<li> <a href=example/Questions.gf>Questions</a> defines question and
answer forms independently of domain
<li> <a href=example/Animals.gf>Animals</a> defines a lexicon with
animals and things that animals do.
</ul>
<p>
The concrete syntax of English is built in three phases:
<ul>
<li> <a href="example/HandQuestionsI.gf">QuestionsI</a> is a parametrized module
using the API module <tt>Resource</tt>.
<li> <a href="example/QuestionsEng.gf">QuestionsEng</a> is an instantiation
of the API with <tt>ResourceEng</tt>.
<li> <a href="example/AnimalsEng.gf">AnimalsEng</a> is a concrete syntax
of <tt>Animals</tt> using <tt>ParadigmsEng</tt> and <tt>VerbsEng</tt>.
</ul>
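<p>
The division of labour between a parametrized module and its instantiations
can be sketched without the resource library. All module and constant names
below are hypothetical, and a plain string interface stands in for the
<tt>Resource</tt> API. Each module goes in its own file.
<pre>
-- file Ask.gf
abstract Ask = {
  flags startcat = Q ;
  cat Q ; Animal ;
  fun
    WhoChases : Animal -> Q ;
    Cat : Animal ;
}

-- file LexAsk.gf: what the parametrized module needs from each language
interface LexAsk = {
  oper
    who_Str : Str ;
    chases_Str : Str ;
    cats_Str : Str ;
}

-- file LexAskEng.gf
instance LexAskEng of LexAsk = {
  oper
    who_Str = "who" ;
    chases_Str = "chases" ;
    cats_Str = "cats" ;
}

-- file LexAskSwe.gf
instance LexAskSwe of LexAsk = {
  oper
    who_Str = "vem" ;
    chases_Str = "jagar" ;
    cats_Str = "katter" ;
}

-- file AskI.gf: written once, shared by all languages
incomplete concrete AskI of Ask = open LexAsk in {
  lincat Q, Animal = {s : Str} ;
  lin
    WhoChases a = {s = who_Str ++ chases_Str ++ a.s ++ "?"} ;
    Cat = {s = cats_Str} ;
}

-- files AskEng.gf and AskSwe.gf: one-line instantiations
concrete AskEng of Ask = AskI with (LexAsk = LexAskEng) ;
concrete AskSwe of Ask = AskI with (LexAsk = LexAskSwe) ;
</pre>
In this sketch <tt>WhoChases Cat</tt> gives "who chases cats ?" in
<tt>AskEng</tt> and "vem jagar katter ?" in <tt>AskSwe</tt>. Adding a language
requires only a new instance and a one-line instantiation.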
<p>
The concrete syntax of Swedish is built upon <tt>QuestionsI</tt>
in a similar way, with the modules
<a href=example/QuestionsSwe.gf>QuestionsSwe</a> and
<a href=example/AnimalsSwe.gf>AnimalsSwe</a>.
<p>
The concrete syntax of French consists similarly of the modules
<a href=example/QuestionsFre.gf>QuestionsFre</a> and
<a href=example/AnimalsFre.gf>AnimalsFre</a>.
<!-- NEW -->
<h2>Compiling the example application</h2>
The resources are bulky, and it therefore takes a lot of
time and memory to load the grammars. However, they can be
compiled into the <tt>gfcm</tt>
(<b>GF canonical multilingual</b>) format,
which is almost one thousand times smaller and faster to load
for this set of grammars.
<p>
Just issue the following GF commands
<pre>
i -src AnimalsEng.gf ;; s
i -src AnimalsFre.gf ;; s
i -src AnimalsSwe.gf ;; s
pm | wf animals.gfcm
</pre>
and you get an end-user grammar <tt>animals.gfcm</tt>.
<p>
You can also write the commands in a <tt>gfs</tt> (<b>GF script</b>)
file, say
<a href="example/mkAnimals.gfs"><tt>mkAnimals.gfs</tt></a>,
and then call GF with
<pre>
gf &lt;mkAnimals.gfs
</pre>
<!-- NEW -->
<h2>Grammar writing by examples</h2>
(New in GF 3/6/2005)
<p>
You can use the resource grammar as a parser on a special file format,
<tt>.gfe</tt> ("GF examples"). Here is the new source,
<a href="example/QuestionsI.gfe">QuestionsI.gfe</a>, which
generates
<a href="example/QuestionsI.gf">QuestionsI.gf</a>,
when you execute the command
<pre>
gf -examples QuestionsI.gfe
</pre>
Of course, the grammar of any language can be created by
parsing any language, as long as they have a common resource API.
The use of English resource is generally recommended, because it
is smaller and faster to parse than the other languages.
<!-- NEW -->
<h2>Constants and variables in examples</h2>
The file <a href="example/QuestionsI.gfe">QuestionsI.gfe</a> uses
as resource <tt>LangEng</tt>, which contains all resource syntax and
a lexicon of ca. 300 words. A linearization rule, such as
<pre>
lin Who love_V2 man_N = in Phr "who loves men ?" ;
</pre>
uses as its argument variables constants for words that can be found in
the lexicon. This is what allows the example to be parsed.
When the resulting rule,
<pre>
lin Who love_V2 man_N =
QuestPhrase (UseQCl (PosTP TPresent ASimul)
(QPredV2 who8one_IP love_V2 (IndefNumNP NoNum (UseN man_N)))) ;
</pre>
is read by the GF compiler, the identifiers <tt>love_V2</tt> and
<tt>man_N</tt> are not treated as constants, but, following
the normal binding rules of functional languages, as bound variables.
This is what gives the example method the generality that is needed.
<p>
To write linearization rules by examples one thus has to know at
least one abstract syntax constant for each category for which
one needs a variable.
<!-- NEW -->
<h2>Extending the lexicon on the fly</h2>
The greatest limitation of the example method is that the lexicon
may lack many of the words that are needed in examples. If parsing
fails because of this, the compiler gives a list of unknown words
in its error message. An obvious solution is,
of course, to extend the resource lexicon and try again.
A more light-weight solution is to add a <b>substitution</b> to
the example. For instance, if you want the example "the pope"
but the lexicon does not have the word "pope", you can write
<pre>
lin Pope = in NP "the man" {man_N = regN "pope"} ;
</pre>
The resulting linearization rule is initially
<pre>
lin Pope = DefOneNP (UseN man_N) ;
</pre>
but the substitution changes this to
<pre>
lin Pope = DefOneNP (UseN (regN "pope")) ;
</pre>
In this way, you do not have to extend the resource lexicon, but you
need to open the Paradigms module to compile the resulting term.
<p>
Of course, the substituted expressions may come from another language
than the main language of the example:
<pre>
lin Pope = in NP "the man" {man_N = regN "pape" masculine} ;
</pre>
If many substitutions are needed, semicolons are used as separators:
<pre>
{man_N = regN "pope" ; walk_V = regV "pray"} ;
</pre>
<!-- NEW -->
<h2>Implementation details: the structure of low-level files</h2>
<center>
<img src="Low.gif">
</center>
<!-- NEW -->
<h2>The use of parametrized modules</h2>
In two language families:
<ul>
<li> Romance: French, Italian, Spanish
<li> Scandinavian: Danish, Norwegian, Swedish
</ul>
<center>
<img src="Scand.gif">
</center>
<!-- NEW -->
<h2>Current status</h2>
<table border=1>
<tr><td>Language</td> <td>v0.6</td> <td>v0.9 API</td> <td>Paradigms</td> <td>Basic lex</td> <td>Verbs</td></tr>
<tr><td>Danish</td> <td>-</td> <td>X</td> <td>-</td> <td>-</td> <td>-</td></tr>
<tr><td>English</td> <td>X</td> <td>X</td> <td>X</td> <td>X</td> <td>X</td></tr>
<tr><td>Finnish</td> <td>X</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td></tr>
<tr><td>French</td> <td>X</td> <td>X</td> <td>X</td> <td>X</td> <td>X</td></tr>
<tr><td>German</td> <td>X</td> <td>-</td> <td>*</td> <td>-</td> <td>-</td></tr>
<tr><td>Italian</td> <td>X</td> <td>X</td> <td>X</td> <td>X</td> <td>X</td></tr>
<tr><td>Norwegian</td> <td>-</td> <td>X</td> <td>X</td> <td>X</td> <td>X</td></tr>
<tr><td>Russian</td> <td>X</td> <td>*</td> <td>*</td> <td>-</td> <td>-</td></tr>
<tr><td>Spanish</td> <td>-</td> <td>X</td> <td>X</td> <td>X</td> <td>X</td></tr>
<tr><td>Swedish</td> <td>X</td> <td>X</td> <td>X</td> <td>X</td> <td>X</td></tr>
</table>
X = implemented (a few exceptions may occur)
<br>
* = linguistic material ready for implementation
<br>
- = not implemented
<!-- NEW -->
<h2>Known bugs and limitations</h2>
(<i>The listed limitations are ones that do not follow from the table on
the previous page</i>.)
<p>
Danish
<p>
English:
<p>
Finnish
<p>
French:
no inverted questions;
some verbs in Basic should be reflexive
<p>
German
<p>
Italian:
no omission of unstressed subject pronouns;
some verbs in Basic should be reflexive;
bad forms of reflexive infinitives
<p>
Norwegian:
possessives <i>bilen min</i> not included
<p>
Russian
<p>
Spanish:
no omission of unstressed subject pronouns;
no switch to dative case for human objects;
some verbs in Basic should be reflexive;
bad forms of reflexive infinitives;
spurious parameter for verb auxiliary inherited from Romance
<p>
Swedish:
<!-- NEW -->
<h2>Obtaining it</h2>
Get the grammar package from
<a href="http://sourceforge.net/project/showfiles.php?group_id=132285">
GF Download Page</a>. The current libraries are in
<tt>lib/resource</tt>. Version 0.6 is in
<tt>lib/resource-0.6</tt>.
</body></html>