merging Lexicon with Swadesh

aarne
2006-03-07 18:26:47 +00:00
parent c168b6c489
commit 6c46034c09
35 changed files with 3837 additions and 1947 deletions


@@ -7,7 +7,7 @@
<P ALIGN="center"><CENTER><H1>The GF Resource Grammar Library Version 1.0</H1>
<FONT SIZE="4">
<I>Author: Aarne Ranta &lt;aarne (at) cs.chalmers.se&gt;</I><BR>
Last update: Sat Mar 4 14:20:07 2006
Last update: Tue Mar 7 16:01:46 2006
</FONT></CENTER>
<P>
@@ -274,9 +274,7 @@ Rosetta Machine Translation (<A HREF="http://citeseer.ist.psu.edu/181924.html">B
<!-- NEW -->
</P>
<H2>Coverage</H2>
<P>
===Languages====
</P>
<H3>Languages</H3>
<P>
The current GF Resource Project covers ten languages:
</P>
@@ -302,9 +300,7 @@ API 1.0 not yet implemented for Danish and Russian
<P>
<!-- NEW -->
</P>
<P>
===Morphology====
</P>
<H3>Morphology and lexicon</H3>
<P>
Complete inflection engine
</P>
@@ -315,24 +311,20 @@ Complete inflection engine
</UL>
<P>
High-level access via <CODE>ParadigmsX</CODE>; e.g. Swedish:
Basic lexicon
</P>
<UL>
<LI>worst-case functions
<PRE>
mkV : (supa,super,sup,söp,supit,supen : Str) -&gt; V ;
</PRE>
<LI>common patterns
<PRE>
regV : (talar : Str) -&gt; V ;
irregV : (dricka, drack, druckit : Str) -&gt; V ;
</PRE>
<LI>irregular words in <CODE>IrregX</CODE>:
<PRE>
draga_V : V =
mkV (variants { "dra"; "draga"}) (variants { "drar" ; "drager"})
(variants { "dra" ; "drag" }) "drog" "dragit" "dragen" ;
</PRE>
<LI>100 structural words
<LI>350 content words, mainly for testing
<LI>these include the 207 <A HREF="http://en.wiktionary.org/wiki/Swadesh_List">Swadesh words</A>
</UL>
<P>
It is more important to enable lexicon extensions than to
provide a huge lexicon.
</P>
<UL>
<LI>technical lexica can have very special words, which tend to be regular
</UL>
<P>
@@ -340,7 +332,32 @@ High-level access via <CODE>ParadigmsX</CODE>; e.g. Swedish:
</P>
<H3>Syntactic structures</H3>
<P>
<IMG ALIGN="middle" SRC="Lang.png" BORDER="0" ALT="">
Texts:
sequences of phrases with punctuation
</P>
<P>
Phrases:
declaratives, questions, imperatives, vocatives
</P>
<P>
Tense, mood, and polarity:
present, past, future, conditional ; simultaneous, anterior ; positive, negative
</P>
<P>
Questions:
yes-no, "wh" ; direct, indirect
</P>
<P>
Clauses:
main, relative, embedded (subject, object, adverbial)
</P>
<P>
Verb phrases:
intransitive, transitive, ditransitive, prepositional
</P>
<P>
Noun phrases:
proper names, pronouns, determiners, possessives, cardinals and ordinals
</P>
<P>
<!-- NEW -->
@@ -378,15 +395,125 @@ Lines of source code (4/3/2006):
<P>
<!-- NEW -->
</P>
<H2>Structure</H2>
<H2>Structure of the API</H2>
<H3>Language-independent ground API</H3>
<P>
<IMG ALIGN="middle" SRC="Lang.png" BORDER="0" ALT="">
</P>
<P>
<!-- NEW -->
</P>
<H3>Language-independent ground API</H3>
<H3>The structure of a text sentence</H3>
<PRE>
John walks.
TFullStop : Phr -&gt; Text -&gt; Text
(PhrUtt : PConj -&gt; Utt -&gt; Voc -&gt; Phr
NoPConj
(UttS : S -&gt; Utt
(UseCl : Tense -&gt; Anter -&gt; Pol -&gt; Cl -&gt; S
TPres
ASimul
PPos
(PredVP : NP -&gt; VP -&gt; Cl
(UsePN : PN -&gt; NP
john_PN)
(UseV : V -&gt; VP
walk_V))))
NoVoc)
TEmpty
</PRE>
<P></P>
<P>
<!-- NEW -->
</P>
<H3>Structure in syntax editor</H3>
<P>
<IMG ALIGN="middle" SRC="editor.png" BORDER="0" ALT="">
</P>
<P>
<!-- NEW -->
</P>
<H3>Language-dependent paradigm modules</H3>
<H4>Regular paradigms</H4>
<P>
Every language implements these regular patterns that take
"dictionary forms" as arguments.
</P>
<PRE>
regN : Str -&gt; N
regA : Str -&gt; A
regV : Str -&gt; V
</PRE>
<P>
Their usefulness varies. For instance, they
are all quite good in Finnish and English.
In Swedish, less so:
</P>
<PRE>
regN "val" ---&gt; val, valen, valar, valarna
</PRE>
<P>
Initializing a lexicon with <CODE>regX</CODE>s is
usually a good starting point in grammar development.
</P>
<P>
<!-- NEW -->
</P>
<H4>Regular paradigms</H4>
<P>
In Swedish, giving the gender of <CODE>N</CODE> improves the result considerably:
</P>
<PRE>
regGenN "val" neutrum ---&gt; val, valet, val, valen
</PRE>
<P></P>
<P>
There are also special constructs taking other forms:
</P>
<PRE>
mk2N : (nyckel,nycklar : Str) -&gt; N
mk1N : (bilarna : Str) -&gt; N
irregV : (dricka, drack, druckit : Str) -&gt; V
</PRE>
<P></P>
<P>
Regular verbs are actually implemented the
<A HREF="http://lexin.nada.kth.se/sve-sve.shtml">Lexin</A> way
</P>
<PRE>
regV : (talar : Str) -&gt; V
</PRE>
<P></P>
<P>
<!-- NEW -->
</P>
<H4>Worst-case paradigms</H4>
<P>
To cover all situations, worst-case paradigms are given. E.g. Swedish
</P>
<PRE>
mkN : (apa,apan,apor,aporna : Str) -&gt; N
mkA : (liten, litet, lilla, sma, mindre, minst, minsta : Str) -&gt; A
mkV : (supa,super,sup,söp,supit,supen : Str) -&gt; V
</PRE>
<P></P>
<P>
<!-- NEW -->
</P>
<H4>Irregular words</H4>
<P>
Irregular words in <CODE>IrregX</CODE>, e.g. Swedish:
</P>
<PRE>
draga_V : V =
mkV (variants { "dra"; "draga"}) (variants { "drar" ; "drager"})
(variants { "dra" ; "drag" }) "drog" "dragit" "dragen" ;
</PRE>
<P>
Goal: eliminate the user's need for worst-case functions.
</P>
<P>
<!-- NEW -->
</P>
@@ -395,6 +522,20 @@ Lines of source code (4/3/2006):
<!-- NEW -->
</P>
<H3>Special-purpose APIs</H3>
<P>
Syntactic structures that are not shared by all languages.
</P>
<P>
Not implemented yet.
</P>
<P>
Candidates:
</P>
<UL>
<LI><CODE>Nor</CODE> post-possessives: <CODE>bilen min</CODE>
<LI><CODE>Fre</CODE> question forms: <CODE>est-ce que tu dors ?</CODE>
</UL>
<P>
<!-- NEW -->
</P>
@@ -402,20 +543,127 @@ Lines of source code (4/3/2006):
<P>
<!-- NEW -->
</P>
<H3>Compiling</H3>
<P>
It is a good idea to compile the library, so that it can be opened faster
</P>
<PRE>
GF/lib/resource-1.0% make
writes GF/lib/alltenses
GF/lib/present
GF/lib/resource-1.0/langs.gfcm
</PRE>
<P>
If you don't intend to change the library, you never need to process the source
files again. Just run one of the following:
</P>
<PRE>
gf -nocf langs.gfcm -- all 8 languages
gf -nocf -path=alltenses:prelude alltenses/LangSwe.gfc -- Swedish only
gf -nocf -path=alltenses:prelude present/LangSwe.gfc -- Swedish only, present tense only
</PRE>
<P></P>
<P>
<!-- NEW -->
</P>
<H3>Parsing</H3>
<P>
The default parser does not work!
</P>
<P>
The MCFG parser works for some languages, after a wait of approx. 20 seconds:
</P>
<PRE>
p -mcfg -lang=LangEng -cat=S "I would see her"
p -mcfg -lang=LangSwe -cat=S "jag skulle se henne"
p -mcfg -lang=LangNor -cat=S "jeg ville se henne"
</PRE>
<P>
Parsing in <CODE>present/</CODE> versions is quicker.
</P>
<P>
<!-- NEW -->
</P>
<H3>Treebank generation</H3>
<P>
Multilingual treebank entry = tree + linearizations
</P>
<P>
Some examples of treebank generation, assuming <CODE>langs.gfcm</CODE> is loaded:
</P>
<PRE>
gr -cat=S -number=10 -cf | tb -- 10 random S
gt -cat=Phr -depth=4 | tb -xml | wf ex.xml -- all Phr to depth 4, into file ex.xml
</PRE>
<P>
Regression testing
</P>
<PRE>
rf ex.xml | tb -c -- read treebank from file and compare to present grammars
</PRE>
<P>
Updating a treebank
</P>
<PRE>
rf old.xml | tb -trees | tb -xml | wf new.xml -- read old from file, write new to file
</PRE>
<P></P>
<P>
<!-- NEW -->
</P>
<H3>Treebank-based parsing</H3>
<P>
Brute-force method that helps if real parsing is more expensive.
</P>
<PRE>
make treebank -- make treebank with all languages
gf -treebank langs.xml -- start GF by reading the treebank
&gt; ut -strings -treebank=LangIta -- show all Ita strings
&gt; ut -treebank=LangIta -raw "Quello non si romperebbe" -- look up a string
&gt; i -nocf langs.gfcm -- read grammar to be able to linearize
&gt; ut -treebank=LangIta "Quello non si romperebbe" | l -multi -- translate to all
</PRE>
<P></P>
<P>
<!-- NEW -->
</P>
<H3>Morphology</H3>
<P>
Use morphological analyser
</P>
<PRE>
gf -nocf -retain -path=alltenses:prelude alltenses/LangSwe.gf
&gt; ma "jag kan inte höra vad du säger"
</PRE>
<P></P>
<P>
Try out a morphology quiz
</P>
<PRE>
&gt; mq -cat=V
</PRE>
<P></P>
<P>
Try out inflection patterns
</P>
<PRE>
gf -retain -path=alltenses:prelude alltenses/ParadigmsSwe.gfr
&gt; cc regV "lyser"
</PRE>
<P></P>
<P>
<!-- NEW -->
</P>
<P>
@@ -423,6 +671,16 @@ Lines of source code (4/3/2006):
</P>
<H3>Syntax editing</H3>
<P>
We start a demo by
</P>
<PRE>
gfeditor langs.gfcm
</PRE>
<P></P>
<P>
<IMG ALIGN="middle" SRC="editor.png" BORDER="0" ALT="">
</P>
<P>
<!-- NEW -->
</P>
<H3>Efficient parsing via application grammar</H3>
@@ -469,6 +727,6 @@ Lines of source code (4/3/2006):
</P>
<H3>Extend old modules or add a new one?</H3>
<!-- html code generated by txt2tags 2.3 (http://txt2tags.sf.net) -->
<!-- html code generated by txt2tags 2.0 (http://txt2tags.sf.net) -->
<!-- cmdline: txt2tags clt2006.txt -->
</BODY></HTML>


@@ -217,7 +217,7 @@ Rosetta Machine Translation ([Book 1994 http://citeseer.ist.psu.edu/181924.html]
==Coverage==
===Languages====
===Languages===
The current GF Resource Project covers ten languages:
- ``Dan``ish
@@ -240,7 +240,7 @@ API 1.0 not yet implemented for Danish and Russian
#NEW
===Morphology====
===Morphology and lexicon===
Complete inflection engine
- all word classes
@@ -248,24 +248,16 @@ Complete inflection engine
- all inflectional paradigms
High-level access via ``ParadigmsX``; e.g. Swedish:
- worst-case functions
```
mkV : (supa,super,sup,söp,supit,supen : Str) -> V ;
```
- common patterns
```
regV : (talar : Str) -> V ;
irregV : (dricka, drack, druckit : Str) -> V ;
```
- irregular words in ``IrregX``:
```
draga_V : V =
mkV (variants { "dra"; "draga"}) (variants { "drar" ; "drager"})
(variants { "dra" ; "drag" }) "drog" "dragit" "dragen" ;
```
Basic lexicon
- 100 structural words
- 350 content words, mainly for testing
- these include the 207 [Swadesh words http://en.wiktionary.org/wiki/Swadesh_List]
It is more important to enable lexicon extensions than to
provide a huge lexicon.
- technical lexica can have very special words, which tend to be regular
@@ -274,7 +266,28 @@ High-level access via ``ParadigmsX``; e.g. Swedish:
===Syntactic structures===
[Lang.png]
Texts:
sequences of phrases with punctuation
Phrases:
declaratives, questions, imperatives, vocatives
Tense, mood, and polarity:
present, past, future, conditional ; simultaneous, anterior ; positive, negative
Questions:
yes-no, "wh" ; direct, indirect
Clauses:
main, relative, embedded (subject, object, adverbial)
Verb phrases:
intransitive, transitive, ditransitive, prepositional
Noun phrases:
proper names, pronouns, determiners, possessives, cardinals and ordinals
#NEW
@@ -307,16 +320,117 @@ Lines of source code (4/3/2006):
#NEW
==Structure==
==Structure of the API==
===Language-independent ground API===
[Lang.png]
#NEW
===Language-independent ground API===
===The structure of a text sentence===
```
John walks.
TFullStop : Phr -> Text -> Text
(PhrUtt : PConj -> Utt -> Voc -> Phr
NoPConj
(UttS : S -> Utt
(UseCl : Tense -> Anter -> Pol -> Cl -> S
TPres
ASimul
PPos
(PredVP : NP -> VP -> Cl
(UsePN : PN -> NP
john_PN)
(UseV : V -> VP
walk_V))))
NoVoc)
TEmpty
```
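The annotated tree above can be read as a type-checking exercise: each function consumes arguments of fixed categories and returns one category, and the whole tree must come out as ``Text``. The following is a minimal sketch in Python, not GF code. The signatures are copied from the annotations; the categories of the constants are inferred from the argument positions they fill, and ``infer`` is a hypothetical helper.

```python
# Toy type checker for the "John walks." tree (illustrative, not part of GF).
# Each function maps to (argument categories, result category).
SIGS = {
    "TFullStop": (["Phr", "Text"], "Text"),
    "PhrUtt": (["PConj", "Utt", "Voc"], "Phr"),
    "UttS": (["S"], "Utt"),
    "UseCl": (["Tense", "Anter", "Pol", "Cl"], "S"),
    "PredVP": (["NP", "VP"], "Cl"),
    "UsePN": (["PN"], "NP"),
    "UseV": (["V"], "VP"),
    # Constants: categories inferred from the argument positions they fill.
    "NoPConj": ([], "PConj"),
    "NoVoc": ([], "Voc"),
    "TEmpty": ([], "Text"),
    "TPres": ([], "Tense"),
    "ASimul": ([], "Anter"),
    "PPos": ([], "Pol"),
    "john_PN": ([], "PN"),
    "walk_V": ([], "V"),
}

def infer(tree):
    """Return the category of a tree given as (function, child, ...) or a bare name."""
    if isinstance(tree, str):
        tree = (tree,)
    fun, *children = tree
    arg_cats, result = SIGS[fun]
    found = [infer(child) for child in children]
    if found != arg_cats:
        raise TypeError(f"{fun} expects {arg_cats}, got {found}")
    return result

JOHN_WALKS = (
    "TFullStop",
    ("PhrUtt",
        "NoPConj",
        ("UttS",
            ("UseCl", "TPres", "ASimul", "PPos",
                ("PredVP", ("UsePN", "john_PN"), ("UseV", "walk_V")))),
        "NoVoc"),
    "TEmpty",
)

print(infer(JOHN_WALKS))
```

Swapping two arguments, for example passing ``john_PN`` to ``UseV``, makes ``infer`` raise a ``TypeError``, which mirrors how an ill-typed tree would be rejected.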
#NEW
===Structure in syntax editor===
[editor.png]
#NEW
===Language-dependent paradigm modules===
====Regular paradigms====
Every language implements these regular patterns that take
"dictionary forms" as arguments.
```
regN : Str -> N
regA : Str -> A
regV : Str -> V
```
Their usefulness varies. For instance, they
are all quite good in Finnish and English.
In Swedish, less so:
```
regN "val" ---> val, valen, valar, valarna
```
Initializing a lexicon with ``regX``s is
usually a good starting point in grammar development.
#NEW
====Regular paradigms====
In Swedish, giving the gender of ``N`` improves the result considerably:
```
regGenN "val" neutrum ---> val, valet, val, valen
```
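The effect of the gender argument can be illustrated with a toy model. This is a sketch in Python under a strong assumption: it covers only consonant-final stems such as "val" and reproduces the two sets of forms quoted on these slides. It is not how ``ParadigmsSwe`` is implemented, and ``reg_n`` and ``reg_gen_n`` are hypothetical names.

```python
# Toy model of two Swedish noun paradigms (illustrative only; the real
# ParadigmsSwe functions handle many more stem types).

def reg_n(stem):
    """Default pattern: sg indef, sg def, pl indef, pl def."""
    return (stem, stem + "en", stem + "ar", stem + "arna")

def reg_gen_n(stem, gender):
    """Same four forms, but choosing the pattern by gender."""
    if gender == "neutrum":
        # Neuter, consonant-final: plural indefinite equals the stem.
        return (stem, stem + "et", stem, stem + "en")
    return reg_n(stem)

print(reg_n("val"))
print(reg_gen_n("val", "neutrum"))
```

The one-argument function has to guess a pattern from the string alone, so it cannot tell a neuter noun from a non-neuter one with the same dictionary form.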
There are also special constructs taking other forms:
```
mk2N : (nyckel,nycklar : Str) -> N
mk1N : (bilarna : Str) -> N
irregV : (dricka, drack, druckit : Str) -> V
```
Regular verbs are actually implemented the
[Lexin http://lexin.nada.kth.se/sve-sve.shtml] way
```
regV : (talar : Str) -> V
```
#NEW
====Worst-case paradigms====
To cover all situations, worst-case paradigms are given. E.g. Swedish
```
mkN : (apa,apan,apor,aporna : Str) -> N
mkA : (liten, litet, lilla, sma, mindre, minst, minsta : Str) -> A
mkV : (supa,super,sup,söp,supit,supen : Str) -> V
```
#NEW
====Irregular words====
Irregular words in ``IrregX``, e.g. Swedish:
```
draga_V : V =
mkV (variants { "dra"; "draga"}) (variants { "drar" ; "drager"})
(variants { "dra" ; "drag" }) "drog" "dragit" "dragen" ;
```
Goal: eliminate the user's need for worst-case functions.
#NEW
===Language-dependent syntax extensions===
@@ -325,34 +439,139 @@ Lines of source code (4/3/2006):
===Special-purpose APIs===
Syntactic structures that are not shared by all languages.
Not implemented yet.
Candidates:
- ``Nor`` post-possessives: ``bilen min``
- ``Fre`` question forms: ``est-ce que tu dors ?``
#NEW
===How to use as top-level grammar===
#NEW
===Compiling===
It is a good idea to compile the library, so that it can be opened faster
```
GF/lib/resource-1.0% make
writes GF/lib/alltenses
GF/lib/present
GF/lib/resource-1.0/langs.gfcm
```
If you don't intend to change the library, you never need to process the source
files again. Just run one of the following:
```
gf -nocf langs.gfcm -- all 8 languages
gf -nocf -path=alltenses:prelude alltenses/LangSwe.gfc -- Swedish only
gf -nocf -path=alltenses:prelude present/LangSwe.gfc -- Swedish only, present tense only
```
#NEW
===Parsing===
The default parser does not work!
The MCFG parser works for some languages, after a wait of approx. 20 seconds:
```
p -mcfg -lang=LangEng -cat=S "I would see her"
p -mcfg -lang=LangSwe -cat=S "jag skulle se henne"
p -mcfg -lang=LangNor -cat=S "jeg ville se henne"
```
Parsing in ``present/`` versions is quicker.
#NEW
===Treebank generation===
Multilingual treebank entry = tree + linearizations
Some examples of treebank generation, assuming ``langs.gfcm`` is loaded:
```
gr -cat=S -number=10 -cf | tb -- 10 random S
gt -cat=Phr -depth=4 | tb -xml | wf ex.xml -- all Phr to depth 4, into file ex.xml
```
Regression testing
```
rf ex.xml | tb -c -- read treebank from file and compare to present grammars
```
Updating a treebank
```
rf old.xml | tb -trees | tb -xml | wf new.xml -- read old from file, write new to file
```
#NEW
===Treebank-based parsing===
Brute-force method that helps if real parsing is more expensive.
```
make treebank -- make treebank with all languages
gf -treebank langs.xml -- start GF by reading the treebank
> ut -strings -treebank=LangIta -- show all Ita strings
> ut -treebank=LangIta -raw "Quello non si romperebbe" -- look up a string
> i -nocf langs.gfcm -- read grammar to be able to linearize
> ut -treebank=LangIta "Quello non si romperebbe" | l -multi -- translate to all
```
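The idea behind treebank lookup can be sketched as a table search: find the entry whose linearization in one language equals the input string, then read off the tree or the other linearizations. The Python below is a toy model, not the GF ``ut`` command. The three sentences are the parsing examples from these slides and are assumed to share one tree; ``"tree-1"`` is a placeholder for that tree, and ``lookup`` and ``translate`` are hypothetical helpers.

```python
# Toy treebank: each entry pairs a tree with its linearizations.
# "tree-1" is a placeholder; a real entry would hold the abstract syntax tree.
TREEBANK = [
    {
        "tree": "tree-1",
        "lin": {
            "LangEng": "I would see her",
            "LangSwe": "jag skulle se henne",
            "LangNor": "jeg ville se henne",
        },
    },
]

def lookup(treebank, lang, string):
    """Return the trees whose linearization in `lang` equals `string`."""
    return [e["tree"] for e in treebank if e["lin"].get(lang) == string]

def translate(treebank, lang, string):
    """Look up `string`, then return all linearizations of the matching trees."""
    return [e["lin"] for e in treebank if e["lin"].get(lang) == string]

print(lookup(TREEBANK, "LangSwe", "jag skulle se henne"))
print(translate(TREEBANK, "LangSwe", "jag skulle se henne")[0]["LangEng"])
```

Lookup only succeeds for strings that are already stored, which is why the method is brute-force: coverage is bounded by the size of the generated treebank.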
#NEW
===Morphology===
Use morphological analyser
```
gf -nocf -retain -path=alltenses:prelude alltenses/LangSwe.gf
> ma "jag kan inte höra vad du säger"
```
Try out a morphology quiz
```
> mq -cat=V
```
Try out inflection patterns
```
gf -retain -path=alltenses:prelude alltenses/ParadigmsSwe.gfr
> cc regV "lyser"
```
#NEW
#NEW
===Syntax editing===
We start a demo by
``` gfeditor langs.gfcm
[editor.png]
#NEW
===Efficient parsing via application grammar===

Binary file not shown.
