mirror of
https://github.com/GrammaticalFramework/gf-core.git
synced 2026-04-27 21:42:50 -06:00
merging Lexicon with Swadesh
@@ -7,7 +7,7 @@
<P ALIGN="center"><CENTER><H1>The GF Resource Grammar Library Version 1.0</H1>
<FONT SIZE="4">
<I>Author: Aarne Ranta <aarne (at) cs.chalmers.se></I><BR>
Last update: Sat Mar 4 14:20:07 2006
Last update: Tue Mar 7 16:01:46 2006
</FONT></CENTER>

<P>
@@ -274,9 +274,7 @@ Rosetta Machine Translation (<A HREF="http://citeseer.ist.psu.edu/181924.html">B
<!-- NEW -->
</P>
<H2>Coverage</H2>
<P>
===Languages====
</P>
<H3>Languages</H3>
<P>
The current GF Resource Project covers ten languages:
</P>
@@ -302,9 +300,7 @@ API 1.0 not yet implemented for Danish and Russian
<P>
<!-- NEW -->
</P>
<P>
===Morphology====
</P>
<H3>Morphology and lexicon</H3>
<P>
Complete inflection engine
</P>
@@ -315,24 +311,20 @@ Complete inflection engine
</UL>

<P>
High-level access via <CODE>ParadigmsX</CODE>; e.g. Swedish:
Basic lexicon
</P>
<UL>
<LI>worst-case functions
<PRE>
mkV : (supa,super,sup,söp,supit,supen : Str) -> V ;
</PRE>
<LI>common patterns
<PRE>
regV : (talar : Str) -> V ;
irregV : (dricka, drack, druckit : Str) -> V ;
</PRE>
<LI>irregular words in <CODE>IrregX</CODE>:
<PRE>
draga_V : V =
  mkV (variants { "dra"; "draga"}) (variants { "drar" ; "drager"})
      (variants { "dra" ; "drag" }) "drog" "dragit" "dragen" ;
</PRE>
<LI>100 structural words
<LI>350 content words, mainly for testing
<LI>these include the 207 <A HREF="http://en.wiktionary.org/wiki/Swadesh_List">Swadesh words</A>
</UL>

<P>
It is more important to enable lexicon extensions than to
provide a huge lexicon.
</P>
<UL>
<LI>technical lexica can have very special words, which tend to be regular
</UL>

<P>
@@ -340,7 +332,32 @@ High-level access via <CODE>ParadigmsX</CODE>; e.g. Swedish:
</P>
<H3>Syntactic structures</H3>
<P>
<IMG ALIGN="middle" SRC="Lang.png" BORDER="0" ALT="">
Texts:
sequences of phrases with punctuation
</P>
<P>
Phrases:
declaratives, questions, imperatives, vocatives
</P>
<P>
Tense, mood, and polarity:
present, past, future, conditional ; simultaneous, anterior ; positive, negative
</P>
<P>
Questions:
yes-no, "wh" ; direct, indirect
</P>
<P>
Clauses:
main, relative, embedded (subject, object, adverbial)
</P>
<P>
Verb phrases:
intransitive, transitive, ditransitive, prepositional
</P>
<P>
Noun phrases:
proper names, pronouns, determiners, possessives, cardinals and ordinals
</P>
<P>
<!-- NEW -->
@@ -378,15 +395,125 @@ Lines of source code (4/3/2006):
<P>
<!-- NEW -->
</P>
<H2>Structure</H2>
<H2>Structure of the API</H2>
<H3>Language-independent ground API</H3>
<P>
<IMG ALIGN="middle" SRC="Lang.png" BORDER="0" ALT="">
</P>
<P>
<!-- NEW -->
</P>
<H3>Language-independent ground API</H3>
<H3>The structure of a text sentence</H3>
<PRE>
John walks.

TFullStop : Phr -> Text -> Text
  (PhrUtt : PConj -> Utt -> Voc -> Phr
    NoPConj
    (UttS : S -> Utt
      (UseCl : Tense -> Anter -> Pol -> Cl -> S
        TPres
        ASimul
        PPos
        (PredVP : NP -> VP -> Cl
          (UsePN : PN -> NP
            john_PN)
          (UseV : V -> VP
            walk_V))))
    NoVoc)
  TEmpty
</PRE>
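<P>
The tree above can also be read as a nested term. The following is a minimal Python sketch of that reading. The function names come from the tree itself; the tuple encoding and the helpers <CODE>show</CODE> and <CODE>depth</CODE> are illustrative assumptions, not part of GF or the resource library.
</P>

```python
# Abstract syntax tree for "John walks." as nested (function, *arguments) tuples.
# The function names are taken from the tree above; the tuple encoding is an
# illustrative assumption, not a GF data format.
tree = (
    "TFullStop",
    ("PhrUtt",
        "NoPConj",
        ("UttS",
            ("UseCl", "TPres", "ASimul", "PPos",
                ("PredVP",
                    ("UsePN", "john_PN"),
                    ("UseV", "walk_V")))),
        "NoVoc"),
    "TEmpty",
)


def show(node, top=True):
    """Render a tree as a prefix term, parenthesising non-atomic arguments."""
    if isinstance(node, str):
        return node
    fun, *args = node
    term = " ".join([fun] + [show(arg, top=False) for arg in args])
    return term if top else "(" + term + ")"


def depth(node):
    """Number of function applications on the longest path from the root."""
    if isinstance(node, str):
        return 0
    return 1 + max(depth(arg) for arg in node[1:])


print(show(tree))
print(depth(tree))
```

<P>
Even a two-word sentence needs six nested function applications, which is why the higher-level APIs and the syntax editor matter in practice.
</P>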
<P></P>
<P>
<!-- NEW -->
</P>
<H3>Structure in syntax editor</H3>
<P>
<IMG ALIGN="middle" SRC="editor.png" BORDER="0" ALT="">
</P>
<P>
<!-- NEW -->
</P>
<H3>Language-dependent paradigm modules</H3>
<H4>Regular paradigms</H4>
<P>
Every language implements these regular patterns that take
"dictionary forms" as arguments.
</P>
<PRE>
regN : Str -> N
regA : Str -> A
regV : Str -> V
</PRE>
<P>
Their usefulness varies. For instance, they
all work quite well in Finnish and English.
In Swedish, less so:
</P>
<PRE>
regN "val" ---> val, valen, valar, valarna
</PRE>
<P>
Initializing a lexicon with <CODE>regX</CODE>s is
usually a good starting point in grammar development.
</P>
<P>
<!-- NEW -->
</P>
<H4>Regular paradigms</H4>
<P>
In Swedish, giving the gender of <CODE>N</CODE> improves the result considerably:
</P>
<PRE>
regGenN "val" neutrum ---> val, valet, val, valen
</PRE>
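<P>
To see why the gender argument helps, here is a toy Python sketch that reproduces only the two example outputs shown for <CODE>val</CODE>. It is not the implementation in <CODE>ParadigmsSwe</CODE>: the names <CODE>reg_n</CODE> and <CODE>reg_gen_n</CODE> are hypothetical, the suffix rules are assumed to apply to consonant-final stems like <CODE>val</CODE> only, and the real paradigms handle many more stem types.
</P>

```python
def reg_gen_n(stem, gender):
    """Toy noun paradigm for consonant-final stems (an assumption of this sketch).

    Returns the four forms in the order used in the examples:
    singular indefinite, singular definite, plural indefinite, plural definite.
    """
    if gender == "utrum":
        return [stem, stem + "en", stem + "ar", stem + "arna"]
    if gender == "neutrum":
        return [stem, stem + "et", stem, stem + "en"]
    raise ValueError("unknown gender: " + gender)


def reg_n(stem):
    """Without gender information the toy paradigm falls back to utrum."""
    return reg_gen_n(stem, "utrum")


print(reg_n("val"))
print(reg_gen_n("val", "neutrum"))
```

<P>
The one-argument version has to guess the gender, so it gets the neuter noun wrong; one extra argument removes the guess.
</P>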
<P></P>
<P>
There are also special constructs taking other forms:
</P>
<PRE>
mk2N : (nyckel,nycklar : Str) -> N
mk1N : (bilarna : Str) -> N

irregV : (dricka, drack, druckit : Str) -> V
</PRE>
<P></P>
<P>
Regular verbs are actually implemented the
<A HREF="http://lexin.nada.kth.se/sve-sve.shtml">Lexin</A> way
</P>
<PRE>
regV : (talar : Str) -> V
</PRE>
<P></P>
<P>
<!-- NEW -->
</P>
<H4>Worst-case paradigms</H4>
<P>
To cover all situations, worst-case paradigms are given. E.g. Swedish
</P>
<PRE>
mkN : (apa,apan,apor,aporna : Str) -> N
mkA : (liten, litet, lilla, sma, mindre, minst, minsta : Str) -> A
mkV : (supa,super,sup,söp,supit,supen : Str) -> V
</PRE>
<P></P>
<P>
<!-- NEW -->
</P>
<H4>Irregular words</H4>
<P>
Irregular words in <CODE>IrregX</CODE>, e.g. Swedish:
</P>
<PRE>
draga_V : V =
  mkV (variants { "dra"; "draga"}) (variants { "drar" ; "drager"})
      (variants { "dra" ; "drag" }) "drog" "dragit" "dragen" ;
</PRE>
<P>
Goal: eliminate the user's need for worst-case functions.
</P>
<P>
<!-- NEW -->
</P>
@@ -395,6 +522,20 @@ Lines of source code (4/3/2006):
<!-- NEW -->
</P>
<H3>Special-purpose APIs</H3>
<P>
Syntactic structures that are not shared by all languages.
</P>
<P>
Not implemented yet.
</P>
<P>
Candidates:
</P>
<UL>
<LI><CODE>Nor</CODE> post-possessives: <CODE>bilen min</CODE>
<LI><CODE>Fre</CODE> question forms: <CODE>est-ce que tu dors ?</CODE>
</UL>

<P>
<!-- NEW -->
</P>
@@ -402,20 +543,127 @@ Lines of source code (4/3/2006):
<P>
<!-- NEW -->
</P>
<H3>Compiling</H3>
<P>
It is a good idea to compile the library, so that it can be opened faster
</P>
<PRE>
GF/lib/resource-1.0% make

writes GF/lib/alltenses
       GF/lib/present
       GF/lib/resource-1.0/langs.gfcm
</PRE>
<P>
If you don't intend to change the library, you never need to process the source
files again. Just run one of the following:
</P>
<PRE>
gf -nocf langs.gfcm -- all 8 languages

gf -nocf -path=alltenses:prelude alltenses/LangSwe.gfc -- Swedish only

gf -nocf -path=alltenses:prelude present/LangSwe.gfc -- Swedish only, present tense only
</PRE>
<P></P>
<P>
<!-- NEW -->
</P>
<H3>Parsing</H3>
<P>
The default parser does not work!
</P>
<P>
The MCFG parser works in some languages, after waiting approx. 20 seconds
</P>
<PRE>
p -mcfg -lang=LangEng -cat=S "I would see her"

p -mcfg -lang=LangSwe -cat=S "jag skulle se henne"

p -mcfg -lang=LangNor -cat=S "jeg ville se henne"
</PRE>
<P>
Parsing in <CODE>present/</CODE> versions is quicker.
</P>
<P>
<!-- NEW -->
</P>
<H3>Treebank generation</H3>
<P>
Multilingual treebank entry = tree + linearizations
</P>
<P>
Some examples of treebank generation, assuming <CODE>langs.gfcm</CODE>
</P>
<PRE>
gr -cat=S -number=10 -cf | tb -- 10 random S

gt -cat=Phr -depth=4 | tb -xml | wf ex.xml -- all Phr to depth 4, into file ex.xml
</PRE>
<P>
Regression testing
</P>
<PRE>
rf ex.xml | tb -c -- read treebank from file and compare to present grammars
</PRE>
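<P>
The idea behind the regression test can be sketched in a few lines of Python. This is not how <CODE>tb -c</CODE> is implemented; it only illustrates the comparison of stored and current linearizations. The identifier <CODE>tree-1</CODE> is a placeholder, the three strings are borrowed from the parsing examples and are assumed here to be linearizations of one tree, and the changed Swedish string is invented to simulate a regression.
</P>

```python
def compare(stored, current):
    """Return (tree, language, old, new) for every linearization that changed."""
    diffs = []
    for tree, old_lins in stored.items():
        new_lins = current.get(tree, {})
        for lang, old in old_lins.items():
            new = new_lins.get(lang)
            if new != old:
                diffs.append((tree, lang, old, new))
    return diffs


# Each treebank entry pairs a tree with its linearizations, keyed by language.
# "tree-1" is a placeholder, not a real GF tree.
stored = {
    "tree-1": {
        "LangEng": "I would see her",
        "LangSwe": "jag skulle se henne",
        "LangNor": "jeg ville se henne",
    }
}

# Simulated output of the present grammars; the Swedish string is deliberately
# changed (invented for this sketch) so that the comparison reports something.
current = {
    "tree-1": {
        "LangEng": "I would see her",
        "LangSwe": "jag skulle ser henne",
        "LangNor": "jeg ville se henne",
    }
}

for tree, lang, old, new in compare(stored, current):
    print(tree, lang, repr(old), "->", repr(new))
```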
<P>
Updating a treebank
</P>
<PRE>
rf old.xml | tb -trees | tb -xml | wf new.xml -- read old from file, write new to file
</PRE>
<P></P>
<P>
<!-- NEW -->
</P>
<H3>Treebank-based parsing</H3>
<P>
Brute-force method that helps if real parsing is more expensive.
</P>
<PRE>
make treebank -- make treebank with all languages

gf -treebank langs.xml -- start GF by reading the treebank

> ut -strings -treebank=LangIta -- show all Ita strings

> ut -treebank=LangIta -raw "Quello non si romperebbe" -- look up a string

> i -nocf langs.gfcm -- read grammar to be able to linearize

> ut -treebank=LangIta "Quello non si romperebbe" | l -multi -- translate to all
</PRE>
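<P>
Treebank-based parsing amounts to a table lookup: find the trees whose linearization in one language equals the input string, then read off their other linearizations. A minimal Python sketch under stated assumptions: the helper names <CODE>build_index</CODE> and <CODE>translate</CODE> are hypothetical, <CODE>tree-1</CODE> is a placeholder, and the three strings from the parsing examples are assumed to belong to one tree. It does not reflect how GF stores or searches a treebank.
</P>

```python
def build_index(treebank, lang):
    """Map each string of one language to the trees that linearize to it."""
    index = {}
    for tree, lins in treebank.items():
        if lang in lins:
            index.setdefault(lins[lang], []).append(tree)
    return index


def translate(treebank, lang, sentence):
    """Look up a sentence and return all linearizations of each matching tree."""
    index = build_index(treebank, lang)
    return [treebank[tree] for tree in index.get(sentence, [])]


# "tree-1" is a placeholder, not a real GF tree.
treebank = {
    "tree-1": {
        "LangEng": "I would see her",
        "LangSwe": "jag skulle se henne",
        "LangNor": "jeg ville se henne",
    }
}

for lins in translate(treebank, "LangSwe", "jag skulle se henne"):
    for lang, text in lins.items():
        print(lang, text)
```

<P>
The lookup only succeeds for strings that are already in the treebank, which is the price of the brute-force method.
</P>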
<P></P>
<P>
<!-- NEW -->
</P>
<H3>Morphology</H3>
<P>
Use the morphological analyser
</P>
<PRE>
gf -nocf -retain -path=alltenses:prelude alltenses/LangSwe.gf
> ma "jag kan inte höra vad du säger"
</PRE>
<P></P>
<P>
Try out a morphology quiz
</P>
<PRE>
> mq -cat=V
</PRE>
<P></P>
<P>
Try out inflection patterns
</P>
<PRE>
gf -retain -path=alltenses:prelude alltenses/ParadigmsSwe.gfr
> cc regV "lyser"
</PRE>
<P></P>
<P>
<!-- NEW -->
</P>
<P>
@@ -423,6 +671,16 @@ Lines of source code (4/3/2006):
</P>
<H3>Syntax editing</H3>
<P>
We start a demo by
</P>
<PRE>
gfeditor langs.gfcm
</PRE>
<P></P>
<P>
<IMG ALIGN="middle" SRC="editor.png" BORDER="0" ALT="">
</P>
<P>
<!-- NEW -->
</P>
<H3>Efficient parsing via application grammar</H3>
@@ -469,6 +727,6 @@ Lines of source code (4/3/2006):
</P>
<H3>Extend old modules or add a new one?</H3>

<!-- html code generated by txt2tags 2.3 (http://txt2tags.sf.net) -->
<!-- html code generated by txt2tags 2.0 (http://txt2tags.sf.net) -->
<!-- cmdline: txt2tags clt2006.txt -->
</BODY></HTML>