redocumenting resource

This commit is contained in:
aarne
2006-01-25 13:52:15 +00:00
parent 3a69241209
commit 9dc877cead
73 changed files with 392 additions and 263 deletions


@@ -7,7 +7,7 @@
<P ALIGN="center"><CENTER><H1>Resource grammar writing HOWTO</H1>
<FONT SIZE="4">
<I>Author: Aarne Ranta &lt;aarne (at) cs.chalmers.se&gt;</I><BR>
Last update: Thu Jan 5 23:19:40 2006
Last update: Wed Jan 25 14:52:10 2006
</FONT></CENTER>
<P></P>
@@ -29,21 +29,22 @@ Last update: Thu Jan 5 23:19:40 2006
<LI><A HREF="#toc10">Lock fields</A>
<LI><A HREF="#toc11">Lexicon construction</A>
</UL>
<LI><A HREF="#toc12">Inside phrase category modules</A>
<LI><A HREF="#toc12">Inside grammar modules</A>
<UL>
<LI><A HREF="#toc13">Noun</A>
<LI><A HREF="#toc14">Verb</A>
<LI><A HREF="#toc15">Adjective</A>
<LI><A HREF="#toc13">The category system</A>
<LI><A HREF="#toc14">Phrase category modules</A>
<LI><A HREF="#toc15">Resource modules</A>
<LI><A HREF="#toc16">Lexicon</A>
</UL>
<LI><A HREF="#toc16">Lexicon extension</A>
<LI><A HREF="#toc17">Lexicon extension</A>
<UL>
<LI><A HREF="#toc17">The irregularity lexicon</A>
<LI><A HREF="#toc18">Lexicon extraction from a word list</A>
<LI><A HREF="#toc19">Lexicon extraction from raw text data</A>
<LI><A HREF="#toc20">Extending the resource grammar API</A>
<LI><A HREF="#toc18">The irregularity lexicon</A>
<LI><A HREF="#toc19">Lexicon extraction from a word list</A>
<LI><A HREF="#toc20">Lexicon extraction from raw text data</A>
<LI><A HREF="#toc21">Extending the resource grammar API</A>
</UL>
<LI><A HREF="#toc21">Writing an instance of parametrized resource grammar implementation</A>
<LI><A HREF="#toc22">Parametrizing a resource grammar implementation</A>
<LI><A HREF="#toc22">Writing an instance of a parametrized resource grammar implementation</A>
<LI><A HREF="#toc23">Parametrizing a resource grammar implementation</A>
</UL>
<P></P>
@@ -72,16 +73,8 @@ The following figure gives the dependencies of these modules.
<IMG ALIGN="left" SRC="Lang.png" BORDER="0" ALT="">
</P>
<P>
It is advisable to start with a simpler subset of the API, which
leaves out certain complicated but not always necessary things:
tenses and most of the lexicon.
</P>
<P>
<IMG ALIGN="middle" SRC="Test.png" BORDER="0" ALT="">
</P>
<P>
The module structure is rather flat: almost every module is a direct
parent of the top module (<CODE>Lang</CODE> or <CODE>Test</CODE>). The idea
parent of the top module <CODE>Lang</CODE>. The idea
is that you can concentrate on one linguistic aspect at a time, or
distribute the work among several authors.
</P>
@@ -137,20 +130,6 @@ can skip the <CODE>lincat</CODE> definition of a category and use the default
<CODE>{s : Str}</CODE> until you need to change it to something else. In
English, for instance, most categories do have this linearization type!
</P>
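<P>
As a sketch of what this looks like in practice (the category and function
names here are only illustrative, and a real concrete module has more
context around it):
</P>
<PRE>
  lincat
    Utt = {s : Str} ;            -- the default linearization type

  lin
    -- a rule using it just builds the string field
    UttNP np = {s = np.s} ;
</PRE>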
<P>
As a slight asymmetry in the module diagrams, you find the following
modules:
</P>
<UL>
<LI><CODE>Tense</CODE>: defines the parameters of polarity, anteriority, and tense
<LI><CODE>Tensed</CODE>: defines how sentences use those parameters
<LI><CODE>Untensed</CODE>: makes sentences use the polarity parameter only
</UL>
<P>
The full resource API (<CODE>Lang</CODE>) uses <CODE>Tensed</CODE>, whereas the
restricted <CODE>Test</CODE> API uses <CODE>Untensed</CODE>.
</P>
<A NAME="toc4"></A>
<H3>Lexical modules</H3>
<P>
@@ -165,29 +144,22 @@ API, the latter rule is sometimes violated in some languages.
Another characterization of lexical rules is that lexical units can be added
almost <I>ad libitum</I>, and they cannot be defined in terms of already
given rules. The lexical modules of the resource API are thus more like
samples than complete lists. There are three such modules:
samples than complete lists. There are two such modules:
</P>
<UL>
<LI><CODE>Structural</CODE>: structural words (determiners, conjunctions,...)
<LI><CODE>Basic</CODE>: basic everyday content words (nouns, verbs,...)
<LI><CODE>Lex</CODE>: a very small sample of both structural and content words
<LI><CODE>Lexicon</CODE>: basic everyday content words (nouns, verbs,...)
</UL>
<P>
The module <CODE>Structural</CODE> aims for completeness, and is likely to
be extended in future releases of the resource. The module <CODE>Basic</CODE>
be extended in future releases of the resource. The module <CODE>Lexicon</CODE>
gives a "random" list of words, which enables interesting testing of syntax,
and also provides a check list for morphology, since those words are likely to include
most morphological patterns of the language.
</P>
<P>
The module <CODE>Lex</CODE> is used in <CODE>Test</CODE> instead of the two
larger modules. Its purpose is to provide a quick way to test the
syntactic structures of the phrase category modules without having to implement
the larger lexica.
</P>
<P>
In the case of <CODE>Basic</CODE> it may come out clearer than anywhere else
In the case of <CODE>Lexicon</CODE> it may come out clearer than anywhere else
in the API that it is impossible to give exact translation equivalents in
different languages on the level of a resource grammar. In other words,
application grammars are likely to use the resource in different ways for
@@ -254,9 +226,9 @@ of resource v. 1.0.
lines in the previous step) - but simply uncommenting the first
and the last lines will actually do the job for many of the files.
<P></P>
<LI>Now you can open the grammar <CODE>TestGer</CODE> in GF:
<LI>Now you can open the grammar <CODE>LangGer</CODE> in GF:
<PRE>
gf TestGer.gf
gf LangGer.gf
</PRE>
You will get lots of warnings on missing rules, but the grammar will compile.
<P></P>
@@ -267,7 +239,7 @@ of resource v. 1.0.
</PRE>
tells you what exactly is missing.
<P></P>
Here is the module structure of <CODE>TestGer</CODE>. It has been simplified by leaving out
Here is the module structure of <CODE>LangGer</CODE>. It has been simplified by leaving out
the majority of the phrase category modules. Each of them has the same dependencies
as e.g. <CODE>VerbGer</CODE>.
<P></P>
@@ -296,7 +268,7 @@ only one. So you will find yourself iterating the following steps:
<P></P>
<LI>To be able to test the construction,
define some words you need to instantiate it
in <CODE>LexGer</CODE>. Again, it can be helpful to define some simple-minded
in <CODE>LexiconGer</CODE>. Again, it can be helpful to define some simple-minded
morphological paradigms in <CODE>ResGer</CODE>, in particular worst-case
constructors corresponding to e.g.
<CODE>ResEng.mkNoun</CODE>.
@@ -307,8 +279,8 @@ only one. So you will find yourself iterating the following steps:
cc mkNoun "Brief" "Briefe" Masc
</PRE>
<P></P>
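To give an idea of such a worst-case constructor, here is a simplified sketch
(real German nouns also inflect for case, so the actual <CODE>ResGer</CODE>
definition is richer than this):
<PRE>
  param
    Number = Sg | Pl ;
    Gender = Masc | Fem | Neutr ;

  oper
    -- worst-case constructor: all principal parts given explicitly
    mkNoun : Str -> Str -> Gender -> {s : Number => Str ; g : Gender} =
      \sg,pl,g -> {
        s = table {Sg => sg ; Pl => pl} ;
        g = g
      } ;
</PRE>
<P></P>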
<LI>Uncomment <CODE>NounGer</CODE> and <CODE>LexGer</CODE> in <CODE>TestGer</CODE>,
and compile <CODE>TestGer</CODE> in GF. Then test by parsing, linearization,
<LI>Uncomment <CODE>NounGer</CODE> and <CODE>LexiconGer</CODE> in <CODE>LangGer</CODE>,
and compile <CODE>LangGer</CODE> in GF. Then test by parsing, linearization,
and random generation. In particular, linearization to a table should
be used so that you see all forms produced:
<PRE>
@@ -321,8 +293,9 @@ only one. So you will find yourself iterating the following steps:
<P>
You are likely to run this cycle a few times for each linearization rule
you implement, and some hundreds of times altogether. There are 159
<CODE>funs</CODE> in <CODE>Test</CODE> (at the moment).
you implement, and some hundreds of times altogether. There are 66 <CODE>cat</CODE>s and
458 <CODE>funs</CODE> in <CODE>Lang</CODE> at the moment (149 of the <CODE>funs</CODE> are outside the two
lexicon modules).
</P>
<P>
Of course, you don't need to complete one phrase category module before starting
@@ -335,7 +308,8 @@ Here is a <A HREF="../german/log.txt">live log</A> of the actual process of
building the German implementation of resource API v. 1.0.
It is the basis of the more detailed explanations, which will
follow soon. (You will find out that these explanations involve
a rational reconstruction of the live process!)
a rational reconstruction of the live process! Among other things, the
API was changed during the actual process to make it more intuitive.)
</P>
<A NAME="toc8"></A>
<H3>Resource modules used</H3>
@@ -343,8 +317,9 @@ a rational reconstruction of the live process!)
These modules will be written by you.
</P>
<UL>
<LI><CODE>ResGer</CODE>: parameter types and auxiliary operations
<LI><CODE>MorphoGer</CODE>: complete inflection engine; not needed for <CODE>Test</CODE>.
<LI><CODE>ParamGer</CODE>: parameter types
<LI><CODE>ResGer</CODE>: auxiliary operations (a resource for the resource grammar!)
<LI><CODE>MorphoGer</CODE>: complete inflection engine
</UL>
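<P>
For instance, a parameter module like <CODE>ParamGer</CODE> essentially just
declares parameter types (the following is a simplified sketch):
</P>
<PRE>
  param
    Gender = Masc | Fem | Neutr ;
    Number = Sg | Pl ;
    Case   = Nom | Acc | Dat | Gen ;
</PRE>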
<P>
@@ -439,7 +414,7 @@ the application grammarian may need to use, e.g.
<P>
These constants are defined in terms of parameter types and constructors
in <CODE>ResGer</CODE> and <CODE>MorphoGer</CODE>, which modules are not
accessible to the application grammarian.
visible to the application grammarian.
</P>
<A NAME="toc10"></A>
<H3>Lock fields</H3>
@@ -509,16 +484,54 @@ use of the paradigms in <CODE>BasicGer</CODE> gives a good set of examples for
those who want to build new lexica.
</P>
<A NAME="toc12"></A>
<H2>Inside phrase category modules</H2>
<H2>Inside grammar modules</H2>
<P>
So far we just give links to the implementations of each API.
More explanation is to follow - but many detailed implementation tricks
are only found in the comments of the modules.
</P>
<A NAME="toc13"></A>
<H3>Noun</H3>
<H3>The category system</H3>
<UL>
<LI><A HREF="gfdoc/Cat.html">Cat</A>, <A HREF="gfdoc/CatGer.html">CatGer</A>
</UL>
<A NAME="toc14"></A>
<H3>Verb</H3>
<H3>Phrase category modules</H3>
<UL>
<LI><A HREF="gfdoc/Tense.html">Tense</A>, <A HREF="../german/TenseGer.gf">TenseGer</A>
<LI><A HREF="gfdoc/Noun.html">Noun</A>, <A HREF="../german/NounGer.gf">NounGer</A>
<LI><A HREF="gfdoc/Adjective.html">Adjective</A>, <A HREF="../german/AdjectiveGer.gf">AdjectiveGer</A>
<LI><A HREF="gfdoc/Verb.html">Verb</A>, <A HREF="../german/VerbGer.gf">VerbGer</A>
<LI><A HREF="gfdoc/Adverb.html">Adverb</A>, <A HREF="../german/AdverbGer.gf">AdverbGer</A>
<LI><A HREF="gfdoc/Numeral.html">Numeral</A>, <A HREF="../german/NumeralGer.gf">NumeralGer</A>
<LI><A HREF="gfdoc/Sentence.html">Sentence</A>, <A HREF="../german/SentenceGer.gf">SentenceGer</A>
<LI><A HREF="gfdoc/Question.html">Question</A>, <A HREF="../german/QuestionGer.gf">QuestionGer</A>
<LI><A HREF="gfdoc/Relative.html">Relative</A>, <A HREF="../german/RelativeGer.gf">RelativeGer</A>
<LI><A HREF="gfdoc/Conjunction.html">Conjunction</A>, <A HREF="../german/ConjunctionGer.gf">ConjunctionGer</A>
<LI><A HREF="gfdoc/Phrase.html">Phrase</A>, <A HREF="../german/PhraseGer.gf">PhraseGer</A>
<LI><A HREF="gfdoc/Lang.html">Lang</A>, <A HREF="../german/LangGer.gf">LangGer</A>
</UL>
<A NAME="toc15"></A>
<H3>Adjective</H3>
<H3>Resource modules</H3>
<UL>
<LI><A HREF="../german/ParamGer.gf">ParamGer</A>
<LI><A HREF="../german/ResGer.gf">ResGer</A>
<LI><A HREF="../german/MorphoGer.gf">MorphoGer</A>
<LI><A HREF="gfdoc/ParadigmsGer.html">ParadigmsGer</A>, <A HREF="../german/ParadigmsGer.gf">ParadigmsGer.gf</A>
</UL>
<A NAME="toc16"></A>
<H2>Lexicon extension</H2>
<H3>Lexicon</H3>
<UL>
<LI><A HREF="gfdoc/Structural.html">Structural</A>, <A HREF="../german/StructuralGer.gf">StructuralGer</A>
<LI><A HREF="gfdoc/Lexicon.html">Lexicon</A>, <A HREF="../german/LexiconGer.gf">LexiconGer</A>
</UL>
<A NAME="toc17"></A>
<H2>Lexicon extension</H2>
<A NAME="toc18"></A>
<H3>The irregularity lexicon</H3>
<P>
It may be handy to provide a separate module of irregular
@@ -528,7 +541,7 @@ few hundred perhaps. Building such a lexicon separately also
makes it less important to cover <I>everything</I> by the
worst-case paradigms (<CODE>mkV</CODE> etc).
</P>
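<P>
A sketch of what such a module could look like (the module name and the
four-argument <CODE>mkV</CODE> here are hypothetical examples of the pattern,
not the actual API):
</P>
<PRE>
  -- a separate lexicon listing the principal parts of irregular verbs
  resource IrregGer = open ParadigmsGer in {
    oper
      backen_V = mkV "backen" "bäckt" "backte" "gebacken" ;
      gehen_V  = mkV "gehen" "geht" "ging" "gegangen" ;
  } ;
</PRE>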
<A NAME="toc18"></A>
<A NAME="toc19"></A>
<H3>Lexicon extraction from a word list</H3>
<P>
You can often find resources such as lists of
@@ -538,10 +551,10 @@ page gives a list of verbs in the
traditional tabular format, which begins as follows:
</P>
<PRE>
backen (du bäckst, er bäckt) backte [buk] gebacken
backen (du bäckst, er bäckt) backte [buk] gebacken
befehlen (du befiehlst, er befiehlt; befiehl!) befahl (beföhle; befähle) befohlen
beginnen begann (begönne; begänne) begonnen
beißen biß gebissen
beginnen begann (begönne; begänne) begonnen
beißen biß gebissen
</PRE>
<P>
All you have to do is to write a suitable verb paradigm
@@ -563,7 +576,7 @@ When using ready-made word lists, you should think about
copyright issues. Ideally, all resource grammar material should
be provided under the GNU General Public License.
</P>
<A NAME="toc19"></A>
<A NAME="toc20"></A>
<H3>Lexicon extraction from raw text data</H3>
<P>
This is a cheap technique to build a lexicon of thousands
@@ -571,7 +584,7 @@ of words, if text data is available in digital format.
See the <A HREF="http://www.cs.chalmers.se/~markus/FM/">Functional Morphology</A>
homepage for details.
</P>
<A NAME="toc20"></A>
<A NAME="toc21"></A>
<H3>Extending the resource grammar API</H3>
<P>
Sooner or later it will happen that the resource grammar API
@@ -580,7 +593,7 @@ that it does not include idiomatic expressions in a given language.
The solution then is in the first place to build language-specific
extension modules. This chapter will deal with this issue.
</P>
<A NAME="toc21"></A>
<A NAME="toc22"></A>
<H2>Writing an instance of a parametrized resource grammar implementation</H2>
<P>
Above we have looked at how a resource implementation is built by
@@ -595,10 +608,10 @@ use parametrized modules. The advantages are
</UL>
<P>
In this chapter, we will look at an example: adding Portuguese to
In this chapter, we will look at an example: adding Italian to
the Romance family.
</P>
<A NAME="toc22"></A>
<A NAME="toc23"></A>
<H2>Parametrizing a resource grammar implementation</H2>
<P>
This is the most demanding form of resource grammar writing.
@@ -614,6 +627,6 @@ This chapter will work out an example of how an Estonian grammar
is constructed from the Finnish grammar through parametrization.
</P>
<!-- html code generated by txt2tags 2.3 (http://txt2tags.sf.net) -->
<!-- html code generated by txt2tags 2.0 (http://txt2tags.sf.net) -->
<!-- cmdline: txt2tags -\-toc -thtml Resource-HOWTO.txt -->
</BODY></HTML>