mirror of
https://github.com/GrammaticalFramework/gf-core.git
synced 2026-04-10 05:29:30 -06:00
1066 lines
22 KiB
HTML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
|
|
<HTML>
|
|
<HEAD>
|
|
<META NAME="generator" CONTENT="http://txt2tags.sf.net">
|
|
<TITLE>The GF Resource Grammar Library Version 1.0</TITLE>
|
|
</HEAD><BODY BGCOLOR="white" TEXT="black">
|
|
<P ALIGN="center"><CENTER><H1>The GF Resource Grammar Library Version 1.0</H1>
|
|
<FONT SIZE="4">
|
|
<I>Author: Aarne Ranta &lt;aarne (at) cs.chalmers.se&gt;</I><BR>
|
|
Last update: Sat Jan 13 17:48:13 2007
|
|
</FONT></CENTER>
|
|
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H2>Plan</H2>
|
|
<P>
|
|
Purpose
|
|
</P>
|
|
<P>
|
|
Background
|
|
</P>
|
|
<P>
|
|
Coverage
|
|
</P>
|
|
<P>
|
|
Structure
|
|
</P>
|
|
<P>
|
|
How to use
|
|
</P>
|
|
<P>
|
|
How to implement a new language
|
|
</P>
|
|
<P>
|
|
How to extend the API
|
|
</P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H2>Purpose</H2>
|
|
<H3>Library for applications</H3>
|
|
<P>
|
|
High-level access to grammatical rules
|
|
</P>
|
|
<P>
|
|
E.g. <I>You have k new messages</I> rendered in ten languages <I>X</I>
|
|
</P>
|
|
<PRE>
|
|
render X (Have (You (Number (k (New Message)))))
|
|
</PRE>
|
|
<P></P>
|
|
<P>
|
|
Usability for different purposes
|
|
</P>
|
|
<UL>
|
|
<LI>translation systems
|
|
<LI>software localization
|
|
<LI>dialogue systems
|
|
<LI>language teaching
|
|
</UL>
|
|
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Not primarily code for a parser</H3>
|
|
<P>
|
|
Often in NLP, a grammar is just high-level code for a parser.
|
|
</P>
|
|
<P>
|
|
But a hand-written grammar can be inadequate for parsing:
|
|
</P>
|
|
<UL>
|
|
<LI>too much manual work
|
|
<LI>too inefficient
|
|
<LI>not robust
|
|
<LI>too ambiguous
|
|
</UL>
|
|
|
|
<P>
|
|
Moreover, a grammar fine-tuned for parsing may not be reusable
|
|
</P>
|
|
<UL>
|
|
<LI>for generation
|
|
<LI>for specialized grammars
|
|
<LI>as library
|
|
</UL>
|
|
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Grammar as language definition</H3>
|
|
<P>
|
|
Linguistic ontology: <B>abstract syntax</B>
|
|
</P>
|
|
<P>
|
|
E.g. adjectival modification rule
|
|
</P>
|
|
<PRE>
|
|
AdjCN : AP -> CN -> CN ;
|
|
</PRE>
|
|
<P>
|
|
Rendering in different languages: <B>concrete syntax</B>
|
|
</P>
|
|
<PRE>
|
|
AdjCN (PositA even_A) (UseN number_N)
|
|
|
|
even number, even numbers
|
|
|
|
jämnt tal, jämna tal
|
|
|
|
nombre pair, nombres pairs
|
|
</PRE>
|
|
<P>
|
|
Abstract away from inflection, agreement, word order.
|
|
</P>
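<P>
As a minimal sketch of how one abstract rule can hide agreement and word order,
the following toy grammar (hypothetical module names <CODE>Mini</CODE> and <CODE>MiniFre</CODE>,
not part of the library, with much simpler types than the real resource) gives
a French concrete syntax in which the adjective follows the noun and agrees with it.
</P>
<PRE>
  -- file Mini.gf (toy abstract syntax, for illustration only)
  abstract Mini = {
    cat AP ; CN ;
    fun
      AdjCN     : AP -> CN -> CN ;
      even_AP   : AP ;
      number_CN : CN ;
  }
  
  -- file MiniFre.gf (toy concrete syntax, for illustration only)
  concrete MiniFre of Mini = {
    param
      Number = Sg | Pl ;
      Gender = Masc | Fem ;
    lincat
      AP = {s : Gender => Number => Str} ;     -- adjective inflects for both
      CN = {s : Number => Str ; g : Gender} ;  -- noun has inherent gender
    lin
      -- noun first, then the adjective in the noun's gender and number
      AdjCN ap cn = {
        s = \\n => cn.s ! n ++ ap.s ! cn.g ! n ;
        g = cn.g
        } ;
      even_AP = {s = table {
        Masc => table {Sg => "pair"  ; Pl => "pairs"} ;
        Fem  => table {Sg => "paire" ; Pl => "paires"}
        }} ;
      number_CN = {s = table {Sg => "nombre" ; Pl => "nombres"} ; g = Masc} ;
  }
</PRE>
<P>
A user of <CODE>AdjCN</CODE> only writes the tree; the choice of <I>pair</I> vs. <I>pairs</I>
and the position of the adjective are decided inside the concrete syntax.
</P>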
|
|
<P>
|
|
Resource grammars take a generation perspective rather than a parsing perspective
|
|
</P>
|
|
<UL>
|
|
<LI>abstract syntax serves as a key to renderings in different languages
|
|
</UL>
|
|
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Usability by non-linguists</H3>
|
|
<P>
|
|
Division of labour: resource grammars hide linguistic details
|
|
</P>
|
|
<UL>
|
|
<LI><CODE>AdjCN : AP -> CN -> CN</CODE> hides agreement, word order,...
|
|
</UL>
|
|
|
|
<P>
|
|
Presentation: "school grammar" concepts, dictionary-like conventions
|
|
</P>
|
|
<PRE>
|
|
bird_N = reg2N "Vogel" "Vögel" masculine
|
|
</PRE>
|
|
<P>
|
|
API = Application Programmer's Interface
|
|
</P>
|
|
<P>
|
|
Documentation: <CODE>gfdoc</CODE>
|
|
</P>
|
|
<UL>
|
|
<LI>produces html from gf
|
|
</UL>
|
|
|
|
<P>
|
|
IDE = Interactive Development Environment (forthcoming)
|
|
</P>
|
|
<UL>
|
|
<LI>library browser and syntax editor for grammar writing
|
|
</UL>
|
|
|
|
<P>
|
|
Example-based grammar writing
|
|
</P>
|
|
<PRE>
|
|
render Ita (parse Eng "you have k messages")
|
|
</PRE>
|
|
<P></P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Scientific interest</H3>
|
|
<P>
|
|
Linguistics
|
|
</P>
|
|
<UL>
|
|
<LI>definition of linguistic ontology
|
|
<LI>describing language on this level of abstraction
|
|
<LI>coping with different problems in different languages
|
|
<LI>sharing concrete-syntax code between languages
|
|
<LI>creating a resource for other NLP applications
|
|
</UL>
|
|
|
|
<P>
|
|
Computer science
|
|
</P>
|
|
<UL>
|
|
<LI>datastructures for grammar rules
|
|
<LI>type systems for grammars
|
|
<LI>algorithms: parsing, generation, grammar compilation
|
|
<LI>domain-specific programming language (GF)
|
|
<LI>module system
|
|
</UL>
|
|
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H2>Background</H2>
|
|
<H3>History</H3>
|
|
<P>
|
|
2002: v. 0.2
|
|
</P>
|
|
<UL>
|
|
<LI>English, French, German, Swedish
|
|
</UL>
|
|
|
|
<P>
|
|
2003: v. 0.6
|
|
</P>
|
|
<UL>
|
|
<LI>module system
|
|
<LI>added Finnish, Italian, Russian
|
|
<LI>used in KeY
|
|
</UL>
|
|
|
|
<P>
|
|
2005: v. 0.9
|
|
</P>
|
|
<UL>
|
|
<LI>tenses
|
|
<LI>added Danish, Norwegian, Spanish; no German
|
|
<LI>used in WebALT
|
|
</UL>
|
|
|
|
<P>
|
|
2006: v. 1.0
|
|
</P>
|
|
<UL>
|
|
<LI>approximate CLE coverage
|
|
<LI>reorganized module system and implementation
|
|
<LI>not yet (4/3/2006) for Danish and Russian
|
|
</UL>
|
|
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Authors</H3>
|
|
<P>
|
|
Janna Khegai (Russian modules, forthcoming),
|
|
Björn Bringert (many Swadesh lexica),
|
|
Inger Andersson and Therese Söderberg (Spanish morphology),
|
|
Ludmilla Bogavac (Russian morphology),
|
|
Carlos Gonzalia (Spanish cardinals),
|
|
Harald Hammarström (German morphology),
|
|
Patrik Jansson (Swedish cardinals),
|
|
Aarne Ranta.
|
|
</P>
|
|
<P>
|
|
We are grateful for contributions and
comments from several other people who have used this and
the previous versions of the resource library, including
|
|
Ana Bove,
|
|
David Burke,
|
|
Lauri Carlson,
|
|
Gloria Casanellas,
|
|
Karin Cavallin,
|
|
Hans-Joachim Daniels,
|
|
Kristofer Johannisson,
|
|
Anni Laine,
|
|
Wanjiku Ng'ang'a,
|
|
Jordi Saludes.
|
|
</P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Related work</H3>
|
|
<P>
|
|
CLE (Core Language Engine,
|
|
<A HREF="http://mitpress.mit.edu/catalog/item/default.asp?tid=7739&ttype=2">Book 1992</A>)
|
|
</P>
|
|
<UL>
|
|
<LI>English, Swedish, French, Danish
|
|
<LI>uses Definite Clause Grammars, implementation in Prolog
|
|
<LI>coverage for the ATIS corpus,
|
|
<A HREF="http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=0521770777">Spoken Language Translator (2001)</A>
|
|
<LI>grammar specialization via explanation-based learning
|
|
</UL>
|
|
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Slightly less related work</H3>
|
|
<P>
|
|
<A HREF="http://www.delph-in.net/matrix/">LinGO Grammar Matrix</A>
|
|
</P>
|
|
<UL>
|
|
<LI>English, German, Japanese, Spanish, ...
|
|
<LI>uses HPSG, implementation in LKB
|
|
<LI>a check list for parallel grammar implementations
|
|
</UL>
|
|
|
|
<P>
|
|
<A HREF="http://www2.parc.com/istl/groups/nltt/pargram/">Pargram</A>
|
|
</P>
|
|
<UL>
|
|
<LI>Aimed: Arabic, Chinese, English, French, German, Hungarian, Japanese,
|
|
Malagasy, Norwegian, Turkish, Urdu, Vietnamese, and Welsh
|
|
<LI>uses LFG
|
|
<LI>one set of big grammars, transfer rules
|
|
</UL>
|
|
|
|
<P>
|
|
Rosetta Machine Translation (<A HREF="http://citeseer.ist.psu.edu/181924.html">Book 1994</A>)
|
|
</P>
|
|
<UL>
|
|
<LI>Dutch, English, French
|
|
<LI>uses M-grammars, compositional translation inspired by Montague
|
|
<LI>compositional transfer rules
|
|
</UL>
|
|
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H2>Coverage</H2>
|
|
<H3>Languages</H3>
|
|
<P>
|
|
The current GF Resource Project covers ten languages:
|
|
</P>
|
|
<UL>
|
|
<LI><CODE>Dan</CODE>ish
|
|
<LI><CODE>Eng</CODE>lish
|
|
<LI><CODE>Fin</CODE>nish
|
|
<LI><CODE>Fre</CODE>nch
|
|
<LI><CODE>Ger</CODE>man
|
|
<LI><CODE>Ita</CODE>lian
|
|
<LI><CODE>Nor</CODE>wegian (bokmål)
|
|
<LI><CODE>Rus</CODE>sian
|
|
<LI><CODE>Spa</CODE>nish
|
|
<LI><CODE>Swe</CODE>dish
|
|
</UL>
|
|
|
|
<P>
|
|
In addition, parts of Arabic, Estonian, Latin, and Urdu
|
|
</P>
|
|
<P>
|
|
API 1.0 not yet implemented for Danish and Russian
|
|
</P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Morphology and lexicon</H3>
|
|
<P>
|
|
Complete inflection engine
|
|
</P>
|
|
<UL>
|
|
<LI>all word classes
|
|
<LI>all forms
|
|
<LI>all inflectional paradigms
|
|
</UL>
|
|
|
|
<P>
|
|
Basic lexicon
|
|
</P>
|
|
<UL>
|
|
<LI>100 structural words
|
|
<LI>340 content words, mainly for testing
|
|
<LI>these include the 207 <A HREF="http://en.wiktionary.org/wiki/Swadesh_List">Swadesh words</A>
|
|
</UL>
|
|
|
|
<P>
|
|
It is more important to enable lexicon extensions than to
|
|
provide a huge lexicon.
|
|
</P>
|
|
<UL>
|
|
<LI>technical lexica can have very special words, which tend to be regular
|
|
</UL>
|
|
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Syntactic structures</H3>
|
|
<P>
|
|
Texts:
|
|
sequences of phrases with punctuation
|
|
</P>
|
|
<P>
|
|
Phrases:
|
|
declaratives, questions, imperatives, vocatives
|
|
</P>
|
|
<P>
|
|
Tense, mood, and polarity:
|
|
present, past, future, conditional ; simultaneous, anterior ; positive, negative
|
|
</P>
|
|
<P>
|
|
Questions:
|
|
yes-no, "wh" ; direct, indirect
|
|
</P>
|
|
<P>
|
|
Clauses:
|
|
main, relative, embedded (subject, object, adverbial)
|
|
</P>
|
|
<P>
|
|
Verb phrases:
|
|
intransitive, transitive, ditransitive, prepositional
|
|
</P>
|
|
<P>
|
|
Noun phrases:
|
|
proper names, pronouns, determiners, possessives, cardinals and ordinals
|
|
</P>
|
|
<P>
|
|
Coordination:
|
|
lists of sentences, noun phrases, adverbs, adjectival phrases
|
|
</P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Quantitative measures</H3>
|
|
<P>
|
|
67 categories
|
|
</P>
|
|
<P>
|
|
150 abstract syntax combination rules
|
|
</P>
|
|
<P>
|
|
100 structural words
|
|
</P>
|
|
<P>
|
|
340 content words in a test lexicon
|
|
</P>
|
|
<P>
|
|
35 kLines of source code (4/3/2006):
|
|
</P>
|
|
<PRE>
|
|
abstract 1131
|
|
english 2344
|
|
german 2386
|
|
finnish 3396
|
|
norwegian 1257
|
|
swedish 1465
|
|
scandinavian 1023
|
|
french 3246 -- Besch + Irreg + Morpho 2111
|
|
italian 7797 -- Besch 6512
|
|
spanish 7120 -- Besch 5877
|
|
romance 1066
|
|
</PRE>
|
|
<P></P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H2>Structure of the API</H2>
|
|
<H3>Language-independent ground API</H3>
|
|
<P>
|
|
<IMG ALIGN="middle" SRC="Lang.png" BORDER="0" ALT="">
|
|
</P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>The structure of a text sentence</H3>
|
|
<PRE>
|
|
John walks.
|
|
|
|
TFullStop : Phr -> Text -> Text | TQuestMark, TExclMark
|
|
(PhrUtt : PConj -> Utt -> Voc -> Phr | PhrYes, PhrNo, ...
|
|
NoPConj | but_PConj, ...
|
|
(UttS : S -> Utt | UttQS, UttImp, UttNP, ...
|
|
(UseCl : Tense -> Anter -> Pol -> Cl -> S
|
|
TPres
|
|
ASimul
|
|
PPos
|
|
(PredVP : NP -> VP -> Cl | ImpersNP, ExistNP, ...
|
|
(UsePN : PN -> NP
|
|
john_PN)
|
|
(UseV : V -> VP | ComplV2, UseComp, ...
|
|
walk_V))))
|
|
NoVoc) | VocNP, please_Voc, ...
|
|
TEmpty
|
|
</PRE>
|
|
<P></P>
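<P>
Stripped of the type annotations and alternatives, the tree above is an ordinary
GF term. As a sketch, assuming that <CODE>langs.gfcm</CODE> has been loaded and that
<CODE>john_PN</CODE> and <CODE>walk_V</CODE> are in the lexicon as the annotated tree suggests,
it can be linearized in all languages from the GF shell:
</P>
<PRE>
  > l -multi TFullStop (PhrUtt NoPConj (UttS (UseCl TPres ASimul PPos (PredVP (UsePN john_PN) (UseV walk_V)))) NoVoc) TEmpty
</PRE>
<P></P>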
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>The structure in the syntax editor</H3>
|
|
<P>
|
|
<IMG ALIGN="middle" SRC="editor.png" BORDER="0" ALT="">
|
|
</P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Language-dependent paradigm modules</H3>
|
|
<H4>Regular paradigms</H4>
|
|
<P>
|
|
Every language implements these regular patterns that take
|
|
"dictionary forms" as arguments.
|
|
</P>
|
|
<PRE>
|
|
regN : Str -> N
|
|
regA : Str -> A
|
|
regV : Str -> V
|
|
</PRE>
|
|
<P>
|
|
Their usefulness varies. For instance, they
are all quite good in Finnish and English.
|
|
In Swedish, less so:
|
|
</P>
|
|
<PRE>
|
|
regN "val" ---> val, valen, valar, valarna
|
|
</PRE>
|
|
<P>
|
|
Initializing a lexicon with <CODE>regX</CODE> for every entry is
|
|
usually a good starting point in grammar development.
|
|
</P>
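<P>
A first lexicon of this kind might look as follows. This is only a sketch:
the application grammar <CODE>Zoo</CODE> and its functions are hypothetical, and it assumes
that <CODE>N</CODE> is available from <CODE>CatEng</CODE> and <CODE>regN</CODE> from <CODE>ParadigmsEng</CODE>.
</P>
<PRE>
  -- file Zoo.gf (hypothetical application grammar)
  abstract Zoo = {
    cat Animal ;
    fun Dog, Bird, Horse : Animal ;
  }
  
  -- file ZooEng.gf
  --# -path=.:present:prelude
  concrete ZooEng of Zoo = open CatEng, ParadigmsEng in {
    lincat Animal = N ;
    lin
      -- every entry starts out with the regular paradigm;
      -- entries that turn out to be wrong are corrected later
      Dog   = regN "dog" ;
      Bird  = regN "bird" ;
      Horse = regN "horse" ;
  }
</PRE>
<P>
Entries that the regular pattern gets wrong can then be replaced one by one
with the more specific paradigms.
</P>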
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H4>Regular paradigms</H4>
|
|
<P>
|
|
In Swedish, giving the gender of <CODE>N</CODE> improves the result considerably:
|
|
</P>
|
|
<PRE>
|
|
regGenN "val" neutrum ---> val, valet, val, valen
|
|
</PRE>
|
|
<P></P>
|
|
<P>
|
|
There are also special constructs taking other forms:
|
|
</P>
|
|
<PRE>
|
|
mk2N : (nyckel,nycklar : Str) -> N
|
|
|
|
mk1N : (bilarna : Str) -> N
|
|
|
|
irregV : (dricka, drack, druckit : Str) -> V
|
|
</PRE>
|
|
<P></P>
|
|
<P>
|
|
Regular verbs are actually implemented the
|
|
<A HREF="http://lexin.nada.kth.se/sve-sve.shtml">Lexin</A> way
|
|
</P>
|
|
<PRE>
|
|
regV : (talar : Str) -> V
|
|
</PRE>
|
|
<P></P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H4>Worst-case paradigms</H4>
|
|
<P>
|
|
To cover all situations, worst-case paradigms are given. E.g. Swedish
|
|
</P>
|
|
<PRE>
|
|
mkN : (apa,apan,apor,aporna : Str) -> N
|
|
|
|
mkA : (liten, litet, lilla, små, mindre, minst, minsta : Str) -> A
|
|
|
|
mkV : (supa,super,sup,söp,supit,supen : Str) -> V
|
|
</PRE>
|
|
<P></P>
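<P>
The relation between a worst-case function and a regular paradigm can be
sketched with a toy module. Everything here is an assumption made for illustration:
the module <CODE>ToySwe</CODE>, the type <CODE>Noun</CODE> and the names <CODE>mkNoun</CODE> and
<CODE>decl1Noun</CODE> are hypothetical, and the real Swedish noun type has more features
(e.g. case and gender).
</P>
<PRE>
  -- file ToySwe.gf (toy resource, not the library's ParadigmsSwe)
  resource ToySwe = open Prelude in {
    param
      Number  = Sg | Pl ;
      Species = Indef | Def ;
    oper
      Noun : Type = {s : Number => Species => Str} ;
  
      -- worst case: all four forms are given explicitly
      mkNoun : (apa,apan,apor,aporna : Str) -> Noun =
        \apa,apan,apor,aporna -> {
          s = table {
            Sg => table {Indef => apa  ; Def => apan} ;
            Pl => table {Indef => apor ; Def => aporna}
            }
          } ;
  
      -- regular paradigm for nouns in -a, defined via the worst case;
      -- init (from Prelude) drops the last character of the string
      decl1Noun : Str -> Noun = \apa ->
        let ap = init apa in
        mkNoun apa (apa + "n") (ap + "or") (ap + "orna") ;
  }
</PRE>
<P>
After loading the module with <CODE>-retain</CODE>, a paradigm can be tried out
with e.g. <CODE>cc decl1Noun "flicka"</CODE>.
</P>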
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H4>Irregular words</H4>
|
|
<P>
|
|
Irregular words are collected in <CODE>IrregX</CODE>, e.g. Swedish:
|
|
</P>
|
|
<PRE>
|
|
draga_V : V =
|
|
mkV
|
|
(variants { "dra" ; "draga"})
|
|
(variants { "drar" ; "drager"})
|
|
(variants { "dra" ; "drag" })
|
|
"drog"
|
|
"dragit"
|
|
"dragen" ;
|
|
</PRE>
|
|
<P>
|
|
Goal: eliminate the user's need for worst-case functions.
|
|
</P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Language-dependent syntax extensions</H3>
|
|
<P>
|
|
Syntactic structures that are not shared by all languages.
|
|
</P>
|
|
<P>
|
|
Alternative (and often more idiomatic) ways to say what is already covered by the API.
|
|
</P>
|
|
<P>
|
|
Not implemented yet.
|
|
</P>
|
|
<P>
|
|
Candidates:
|
|
</P>
|
|
<UL>
|
|
<LI>Norwegian post-possessives: <CODE>bilen min</CODE>
|
|
<LI>French question forms: <CODE>est-ce que tu dors ?</CODE>
|
|
<LI>Romance simple past tenses
|
|
</UL>
|
|
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Special-purpose APIs</H3>
|
|
<P>
|
|
Mathematical
|
|
</P>
|
|
<P>
|
|
Multimodal
|
|
</P>
|
|
<P>
|
|
Present
|
|
</P>
|
|
<P>
|
|
Minimal
|
|
</P>
|
|
<P>
|
|
Shallow
|
|
</P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H2>How to use the resource as top-level grammar</H2>
|
|
<H3>Compiling</H3>
|
|
<P>
|
|
It is a good idea to compile the library, so that it can be opened faster
|
|
</P>
|
|
<PRE>
|
|
GF/lib/resource-1.0% make
|
|
|
|
writes GF/lib/alltenses
|
|
GF/lib/present
|
|
GF/lib/resource-1.0/langs.gfcm
|
|
</PRE>
|
|
<P>
|
|
If you don't intend to change the library, you never need to process the source
files again. Just run one of the following:
|
|
</P>
|
|
<PRE>
|
|
gf -nocf langs.gfcm -- all 8 languages
|
|
|
|
gf -nocf -path=alltenses:prelude alltenses/LangSwe.gfc -- Swedish only
|
|
|
|
gf -nocf -path=present:prelude present/LangSwe.gfc -- Swedish in present tense only
|
|
</PRE>
|
|
<P></P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Parsing</H3>
|
|
<P>
|
|
The default parser does not work! (It is obsolete anyway.)
|
|
</P>
|
|
<P>
|
|
The MCFG parser (the new standard) works in theory, but can
|
|
in practice be too slow to build.
|
|
</P>
|
|
<P>
|
|
But it does work in some languages, after waiting approximately 20 seconds
|
|
</P>
|
|
<PRE>
|
|
p -mcfg -lang=LangEng -cat=S "I would see her"
|
|
|
|
p -mcfg -lang=LangSwe -cat=S "jag skulle se henne"
|
|
</PRE>
|
|
<P>
|
|
Parsing in <CODE>present/</CODE> versions is quicker.
|
|
</P>
|
|
<P>
|
|
Remedies:
|
|
</P>
|
|
<UL>
|
|
<LI>write application grammars for parsing
|
|
<LI>use treebank lookup instead
|
|
</UL>
|
|
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Treebank generation</H3>
|
|
<P>
|
|
Multilingual treebank entry = tree + linearizations
|
|
</P>
|
|
<P>
|
|
Some examples of treebank generation, assuming <CODE>langs.gfcm</CODE>
|
|
</P>
|
|
<PRE>
|
|
gr -cat=S -number=10 -cf | tb -- 10 random S
|
|
|
|
gt -cat=Phr -depth=4 | tb -xml | wf ex.xml -- all Phr to depth 4, into file ex.xml
|
|
</PRE>
|
|
<P>
|
|
Regression testing
|
|
</P>
|
|
<PRE>
|
|
rf ex.xml | tb -c -- read treebank from file and compare to present grammars
|
|
</PRE>
|
|
<P>
|
|
Updating a treebank
|
|
</P>
|
|
<PRE>
|
|
rf old.xml | tb -trees | tb -xml | wf new.xml -- read old from file, write new to file
|
|
</PRE>
|
|
<P></P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>The multilingual treebank format</H3>
|
|
<P>
|
|
Tree + linearizations
|
|
</P>
|
|
<PRE>
|
|
> gr -cat=Cl | tb
|
|
PredVP (UsePron they_Pron) (PassV2 seek_V2)
|
|
They are sought
|
|
Elles sont cherchées
|
|
Son buscadas
|
|
Vengono cercate
|
|
De blir sökta
|
|
De blir lette
|
|
Sie werden gesucht
|
|
Heidät etsitään
|
|
</PRE>
|
|
<P>
|
|
These can also be wrapped in XML tags (<CODE>tb -xml</CODE>)
|
|
</P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Treebank-based parsing</H3>
|
|
<P>
|
|
Brute-force method that helps if real parsing is more expensive.
|
|
</P>
|
|
<PRE>
|
|
make treebank -- make treebank with all languages
|
|
|
|
gf -treebank langs.xml -- start GF by reading the treebank
|
|
|
|
> ut -strings -treebank=LangIta -- show all Ita strings
|
|
|
|
> ut -treebank=LangIta -raw "Quello non si romperebbe" -- look up a string
|
|
|
|
> i -nocf langs.gfcm -- read grammar to be able to linearize
|
|
|
|
> ut -treebank=LangIta "Quello non si romperebbe" | l -multi -- translate to all
|
|
</PRE>
|
|
<P></P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Morphology</H3>
|
|
<P>
|
|
Use morphological analyser
|
|
</P>
|
|
<PRE>
|
|
gf -nocf -retain -path=alltenses:prelude alltenses/LangSwe.gf
|
|
> ma "jag kan inte höra vad du säger"
|
|
</PRE>
|
|
<P></P>
|
|
<P>
|
|
Try out a morphology quiz
|
|
</P>
|
|
<PRE>
|
|
> mq -cat=V
|
|
</PRE>
|
|
<P></P>
|
|
<P>
|
|
Try out inflection patterns
|
|
</P>
|
|
<PRE>
|
|
gf -retain -path=alltenses:prelude alltenses/ParadigmsSwe.gfr
|
|
> cc regV "lyser"
|
|
</PRE>
|
|
<P></P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Syntax editing</H3>
|
|
<P>
|
|
The simplest way to start editing with all grammars is
|
|
</P>
|
|
<PRE>
|
|
gfeditor langs.gfcm
|
|
</PRE>
|
|
<P>
|
|
The forthcoming IDE will extend the syntax editor with
|
|
a <CODE>Paradigms</CODE> file browser and a control on what
|
|
parts of an application grammar remain to be implemented.
|
|
</P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Efficient parsing via application grammar</H3>
|
|
<P>
|
|
Get rid of discontinuous constituents (in particular, <CODE>VP</CODE>)
|
|
</P>
|
|
<P>
|
|
Example: <A HREF="gfdoc/Predication.html"><CODE>mathematical/Predication</CODE></A>:
|
|
</P>
|
|
<PRE>
|
|
predV2 : V2 -> NP -> NP -> Cl
|
|
</PRE>
|
|
<P>
|
|
instead of <CODE>PredVP np (ComplV2 v2 np')</CODE>
|
|
</P>
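<P>
One way to obtain such a function is to define it as an <CODE>oper</CODE> on top of
the resource, so that the <CODE>VP</CODE> is eliminated at compile time and never becomes
a category of the application grammar. The following is a sketch under that
assumption, with a hypothetical module name; it is not the definition used
in <CODE>mathematical/Predication</CODE>.
</P>
<PRE>
  -- file ToyPredication.gf (hypothetical; shown for English only)
  resource ToyPredication = open LangEng in {
    oper
      -- verb, subject, object; the intermediate VP is built and consumed here
      predV2 : V2 -> NP -> NP -> Cl = \v,x,y ->
        PredVP x (ComplV2 v y) ;
  }
</PRE>
<P></P>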
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H2>How to use as library</H2>
|
|
<H3>Specialization through parametrized modules</H3>
|
|
<P>
|
|
The application grammar is implemented with reference to
|
|
the resource API
|
|
</P>
|
|
<P>
|
|
Individual languages are instantiations
|
|
</P>
|
|
<P>
|
|
Example: <A HREF="../../../examples/tram/TramI.gf">tram</A>
|
|
</P>
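<P>
The general shape of such a grammar can be sketched as follows. The grammar
<CODE>Food</CODE>, the interface <CODE>LexFood</CODE> and all their contents are hypothetical and
are not taken from the tram example; the sketch assumes the resource functions
<CODE>AdjCN</CODE>, <CODE>PositA</CODE> and <CODE>UseN</CODE> and the paradigms <CODE>regN</CODE> and <CODE>regA</CODE>.
Each module goes in its own file.
</P>
<PRE>
  abstract Food = {
    cat Kind ; Quality ;
    fun
      Mod   : Quality -> Kind -> Kind ;
      Wine  : Kind ;
      Fresh : Quality ;
  }
  
  -- language-dependent words, declared but not defined
  interface LexFood = open Cat in {
    oper
      wine_N  : N ;
      fresh_A : A ;
  }
  
  -- combination rules, written once against the resource API
  incomplete concrete FoodI of Food = open Lang, LexFood in {
    lincat
      Kind    = CN ;
      Quality = A ;
    lin
      Mod q k = AdjCN (PositA q) k ;
      Wine    = UseN wine_N ;
      Fresh   = fresh_A ;
  }
  
  -- the English words
  instance LexFoodEng of LexFood = open CatEng, ParadigmsEng in {
    oper
      wine_N  = regN "wine" ;
      fresh_A = regA "fresh" ;
  }
  
  -- the English grammar is a pure instantiation
  concrete FoodEng of Food = FoodI with
    (Lang = LangEng),
    (LexFood = LexFoodEng) ;
</PRE>
<P>
Adding a language then amounts to writing one more instance of the lexicon
interface and one more instantiation line.
</P>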
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Compile-time transfer</H3>
|
|
<P>
|
|
Instead of parametrized modules:
|
|
</P>
|
|
<P>
|
|
select resource functions differently for different languages
|
|
</P>
|
|
<P>
|
|
Example: imperative vs. infinitive in mathematical exercises
|
|
</P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>A natural division into modules</H3>
|
|
<P>
|
|
Lexicon in language-dependent modules
|
|
</P>
|
|
<P>
|
|
Combination rules in a parametrized module
|
|
</P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Example-based grammar writing</H3>
|
|
<P>
|
|
Example: <A HREF="../../../examples/animal/QuestionsI.gfe">animal</A>
|
|
</P>
|
|
<PRE>
|
|
--# -resource=present/LangEng.gf
|
|
--# -path=.:present:prelude
|
|
|
|
-- to compile: gf -examples QuestionsI.gfe
|
|
|
|
incomplete concrete QuestionsI of Questions = open Lang in {
|
|
lincat
|
|
Phrase = Phr ;
|
|
Entity = N ;
|
|
Action = V2 ;
|
|
lin
|
|
Who love_V2 man_N = in Phr "who loves men" ;
|
|
Whom man_N love_V2 = in Phr "whom does the man love" ;
|
|
Answer woman_N love_V2 man_N = in Phr "the woman loves men" ;
|
|
}
|
|
</PRE>
|
|
<P></P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H2>How to implement a new language</H2>
|
|
<P>
|
|
See <A HREF="Resource-HOWTO.html">Resource-HOWTO</A>
|
|
</P>
|
|
<H3>Ordinary modules</H3>
|
|
<P>
|
|
Write a concrete syntax module for each abstract module in the API
|
|
</P>
|
|
<P>
|
|
Write a <CODE>Paradigms</CODE> module
|
|
</P>
|
|
<P>
|
|
Examples: English, Finnish, German, Russian
|
|
</P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>Parametrized modules</H3>
|
|
<P>
|
|
Examples: Romance (French, Italian, Spanish), Scandinavian (Danish, Norwegian, Swedish)
|
|
</P>
|
|
<P>
|
|
Write a <CODE>Diff</CODE> interface for a family of languages
|
|
</P>
|
|
<P>
|
|
Write concrete syntaxes as functors opening the interface
|
|
</P>
|
|
<P>
|
|
Write separate <CODE>Paradigms</CODE> modules for each language
|
|
</P>
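<P>
The steps above can be sketched on a very small scale. The modules below are
hypothetical toys (they are not the library's <CODE>DiffScand</CODE> or its functors) and
isolate a single difference, the infinitive marker; for brevity the one lexical
item is also put in the interface, whereas the library keeps the lexicon in
separate language-dependent modules. Each module goes in its own file.
</P>
<PRE>
  abstract Toy = {
    cat V ; Inf ;
    fun
      ToInf   : V -> Inf ;
      sleep_V : V ;
  }
  
  -- what differs between the languages of the family
  interface DiffToy = {
    oper
      infMark  : Str ;   -- infinitive marker
      sleepInf : Str ;   -- infinitive form of "sleep"
  }
  
  -- what is shared: a functor opening the interface
  incomplete concrete ToyScand of Toy = open DiffToy in {
    lincat
      V   = {s : Str} ;
      Inf = {s : Str} ;
    lin
      ToInf v = {s = infMark ++ v.s} ;
      sleep_V = {s = sleepInf} ;
  }
  
  instance DiffToySwe of DiffToy = {
    oper
      infMark  = "att" ;
      sleepInf = "sova" ;
  }
  
  instance DiffToyDan of DiffToy = {
    oper
      infMark  = "at" ;
      sleepInf = "sove" ;
  }
  
  concrete ToySwe of Toy = ToyScand with (DiffToy = DiffToySwe) ;
  concrete ToyDan of Toy = ToyScand with (DiffToy = DiffToyDan) ;
</PRE>
<P></P>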
|
|
<P>
|
|
Advantages:
|
|
</P>
|
|
<UL>
|
|
<LI>easier maintenance of library
|
|
<LI>insights into language families
|
|
</UL>
|
|
|
|
<P>
|
|
Problems:
|
|
</P>
|
|
<UL>
|
|
<LI>more abstract thinking required
|
|
<LI>individual grammars may not come out optimal in elegance and efficiency
|
|
</UL>
|
|
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>The core API</H3>
|
|
<P>
|
|
Everything else is a variation on this
|
|
</P>
|
|
<PRE>
|
|
cat
|
|
Cl ; -- clause
|
|
VP ; -- verb phrase
|
|
V2 ; -- two-place verb
|
|
NP ; -- noun phrase
|
|
CN ; -- common noun
|
|
Det ; -- determiner
|
|
AP ; -- adjectival phrase
|
|
|
|
fun
|
|
PredVP : NP -> VP -> Cl ; -- predication
|
|
ComplV2 : V2 -> NP -> VP ; -- complementization
|
|
DetCN : Det -> CN -> NP ; -- determination
|
|
ModCN : AP -> CN -> CN ; -- modification
|
|
</PRE>
|
|
<P></P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>The core API in Latin: parameters</H3>
|
|
<P>
|
|
This <A HREF="latin.gf">toy Latin grammar</A> shows in a nutshell how the core
|
|
can be implemented.
|
|
</P>
|
|
<PRE>
|
|
param
|
|
Number = Sg | Pl ;
|
|
Person = P1 | P2 | P3 ;
|
|
Tense = Pres | Past ;
|
|
Polarity = Pos | Neg ;
|
|
Case = Nom | Acc | Dat ;
|
|
Gender = Masc | Fem | Neutr ;
|
|
oper
|
|
Agr = {g : Gender ; n : Number ; p : Person} ; -- agreement features
|
|
</PRE>
|
|
<P></P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>The core API in Latin: linearization types</H3>
|
|
<PRE>
|
|
lincat
|
|
Cl = {
|
|
s : Tense => Polarity => Str
|
|
} ;
|
|
VP = {
|
|
verb : Tense => Polarity => Agr => Str ; -- finite verb
|
|
neg : Polarity => Str ; -- negation
|
|
compl : Agr => Str -- complement
|
|
} ;
|
|
V2 = {
|
|
s : Tense => Number => Person => Str ;
|
|
c : Case -- complement case
|
|
} ;
|
|
NP = {
|
|
s : Case => Str ;
|
|
a : Agr -- agreement features
|
|
} ;
|
|
CN = {
|
|
s : Number => Case => Str ;
|
|
g : Gender
|
|
} ;
|
|
Det = {
|
|
s : Gender => Case => Str ;
|
|
n : Number
|
|
} ;
|
|
AP = {
|
|
s : Gender => Number => Case => Str
|
|
} ;
|
|
</PRE>
|
|
<P></P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>The core API in Latin: predication and complementization</H3>
|
|
<PRE>
|
|
lin
|
|
PredVP np vp = {
|
|
s = \\t,p =>
|
|
let
|
|
agr = np.a ;
|
|
subject = np.s ! Nom ;
|
|
object = vp.compl ! agr ;
|
|
verb = vp.neg ! p ++ vp.verb ! t ! p ! agr
|
|
in
|
|
subject ++ object ++ verb
|
|
} ;
|
|
|
|
ComplV2 v np = {
|
|
verb = \\t,p,a => v.s ! t ! a.n ! a.p ;
|
|
compl = \\_ => np.s ! v.c ;
|
|
neg = table {Pos => [] ; Neg => "non"}
|
|
} ;
|
|
</PRE>
|
|
<P></P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>The core API in Latin: determination and modification</H3>
|
|
<PRE>
|
|
DetCN det cn =
|
|
let
|
|
g = cn.g ;
|
|
n = det.n
|
|
in {
|
|
s = \\c => det.s ! g ! c ++ cn.s ! n ! c ;
|
|
a = {g = g ; n = n ; p = P3}
|
|
} ;
|
|
|
|
ModCN ap cn =
|
|
let
|
|
g = cn.g
|
|
in {
|
|
s = \\n,c => cn.s ! n ! c ++ ap.s ! g ! n ! c ;
|
|
g = g
|
|
} ;
|
|
</PRE>
|
|
<P></P>
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H3>How to proceed</H3>
|
|
<OL>
|
|
<LI>put up a directory with dummy modules by copying from e.g. English and
|
|
commenting out the contents
|
|
<P></P>
|
|
<LI>so you will have a compilable <CODE>LangX</CODE> all the time
|
|
<P></P>
|
|
<LI>start with nouns and their inflection
|
|
<P></P>
|
|
<LI>proceed to verbs and their inflection
|
|
<P></P>
|
|
<LI>add some noun phrases
|
|
<P></P>
|
|
<LI>implement predication
|
|
</OL>
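<P>
For step 3, a first noun paradigm could look like the sketch below, which reuses
the parameters and the <CODE>CN</CODE> type of the toy Latin grammar. The module name
<CODE>ToyLatinNouns</CODE> and the operations <CODE>CNoun</CODE> and <CODE>noun1</CODE> are hypothetical,
and only the three cases of the toy grammar are covered.
</P>
<PRE>
  -- file ToyLatinNouns.gf (illustration only)
  resource ToyLatinNouns = open Prelude in {
    param
      Number = Sg | Pl ;
      Case   = Nom | Acc | Dat ;
      Gender = Masc | Fem | Neutr ;
    oper
      CNoun : Type = {s : Number => Case => Str ; g : Gender} ;
  
      -- first declension, built from the nominative singular;
      -- init (from Prelude) drops the final "a" for the dative plural
      noun1 : Str -> CNoun = \puella -> {
        s = table {
          Sg => table {
            Nom => puella ;
            Acc => puella + "m" ;
            Dat => puella + "e"
            } ;
          Pl => table {
            Nom => puella + "e" ;
            Acc => puella + "s" ;
            Dat => init puella + "is"
            }
          } ;
        g = Fem
        } ;
  }
</PRE>
<P>
Such a paradigm can be tested in isolation with <CODE>cc noun1 "puella"</CODE>
before any syntax rules are written.
</P>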
|
|
|
|
<P>
|
|
<!-- NEW -->
|
|
</P>
|
|
<H2>How to extend the API</H2>
|
|
<P>
|
|
Extend old modules or add a new one?
|
|
</P>
|
|
<P>
|
|
Usually better to start a new one: then you don't have to implement it
|
|
for all languages at once.
|
|
</P>
|
|
<P>
|
|
Exception: if you are working with a language-specific API extension,
|
|
you can work directly in that module.
|
|
</P>
|
|
|
|
<!-- html code generated by txt2tags 2.0 (http://txt2tags.sf.net) -->
|
|
<!-- cmdline: txt2tags clt2006.txt -->
|
|
</BODY></HTML>
|