starting German: nouns

This commit is contained in:
aarne
2006-01-04 13:01:05 +00:00
parent 958c754112
commit 5f1cd79b9b
29 changed files with 2437 additions and 168 deletions


@@ -51,7 +51,7 @@
<P>
Resource grammar HOWTO
Author: Aarne Ranta &lt;aarne (at) cs.chalmers.se&gt;
Last update: Wed Jan 4 11:29:41 2006
</P>
<A NAME="toc1"></A>
<H1>HOW TO WRITE A RESOURCE GRAMMAR</H1>
@@ -59,7 +59,7 @@ Last update: Thu Dec 8 14:52:30 2005
<A HREF="http://www.cs.chalmers.se/~aarne/">Aarne Ranta</A>
</P>
<P>
20060104
</P>
<P>
The purpose of this document is to tell how to implement the GF
@@ -89,7 +89,7 @@ leaves out certain complicated but not always necessary things:
tenses and most part of the lexicon.
</P>
<P>
<IMG ALIGN="middle" SRC="Test.png" BORDER="0" ALT="">
</P>
<P>
The module structure is rather flat: almost every module is a direct
@@ -213,48 +213,64 @@ different languages.
Unless you are writing an instance of a parametrized implementation
(Romance or Scandinavian), which will be covered later, the simplest
way is to follow roughly the procedure below. Assume you
are building a grammar for the German language. Here are the first steps,
which we actually followed ourselves when building the German implementation
of resource v. 1.0.
</P>
<OL>
<LI>Create a sister directory for <CODE>GF/lib/resource/english</CODE>, named
<CODE>german</CODE>.
<PRE>
cd GF/lib/resource/
mkdir german
cd german
</PRE>
<P></P>
<LI>Check out the <A HREF="http://www.w3.org/WAI/ER/IG/ert/iso639.htm">ISO 639 3-letter language code</A>
for German: both <CODE>Ger</CODE> and <CODE>Deu</CODE> are given, and we pick <CODE>Ger</CODE>.
<P></P>
<LI>Copy the <CODE>*Eng.gf</CODE> files from <CODE>english</CODE> to <CODE>german</CODE>,
and rename them:
<PRE>
cp ../english/*Eng.gf .
rename 's/Eng/Ger/' *Eng.gf
</PRE>
<P></P>
<LI>Change the <CODE>Eng</CODE> module references to <CODE>Ger</CODE> references
in all files:
<PRE>
sed -i 's/English/German/g' *Ger.gf
sed -i 's/Eng/Ger/g' *Ger.gf
</PRE>
The first line prevents changing the word <CODE>English</CODE>, which appears
here and there in comments, to <CODE>Gerlish</CODE>.
<P></P>
<LI>This may of course have changed unwanted occurrences of the
string <CODE>Eng</CODE>; verify this by
<PRE>
grep Ger *.gf
</PRE>
But you will have to make lots of manual changes in all files anyway!
<P></P>
<LI>Comment out the contents of these files:
<PRE>
sed -i 's/^/--/' *Ger.gf
</PRE>
This will give you a set of templates out of which the grammar
will grow as you uncomment and modify the files rule by rule.
<P></P>
<LI>In the file <CODE>TestGer.gf</CODE>, uncomment all lines except the list
of inherited modules. Now you can open the grammar in GF:
<PRE>
gf TestGer.gf
</PRE>
<P></P>
<LI>From now on, you will at every step have a valid, but incomplete,
GF grammar. The GF command
<PRE>
pg -printer=missing
</PRE>
tells you what exactly is missing.
</OL>
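<P>
The copy-rename-substitute-comment steps above can also be collected into
one small script. Here is a minimal Python sketch of steps 4-6 (the helper
name <CODE>make_template</CODE> is ours, not part of GF; it reproduces the
effect of the <CODE>sed</CODE> pipeline, which is handy on platforms where
<CODE>rename</CODE> and <CODE>sed -i</CODE> behave differently):
</P>

```python
def make_template(text, old="Eng", new="Ger"):
    """Turn the text of an English resource module into a commented-out
    German template: protect the word 'English' first (so it does not
    become 'Gerlish'), substitute the module references, then comment
    out every line to get a template to uncomment rule by rule."""
    text = text.replace("English", "German")
    text = text.replace(old, new)
    return "".join("--" + line for line in text.splitlines(keepends=True))

# For each copied file, e.g.:
#   Path("NounGer.gf").write_text(make_template(Path("NounEng.gf").read_text()))
print(make_template("-- English nouns\nconcrete NounEng of Noun = CatEng ** {\n"))
```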
@@ -266,37 +282,38 @@ were introduced above is a natural order to proceed, even though not the
only one. So you will find yourself iterating the following steps:
</P>
<OL>
<LI>Select a phrase category module, e.g. <CODE>NounGer</CODE>, and uncomment one
linearization rule (for instance, <CODE>DefSg</CODE>, which is
not too complicated).
<P></P>
<LI>Write down some German examples of this rule, for instance translations
of "the dog", "the house", "the big house", etc. Write these in all their
different forms (two numbers and four cases).
<P></P>
<LI>Think about the categories involved (<CODE>CN, NP, N</CODE>) and the
variations they have. Encode this in the lincats of <CODE>CatGer</CODE>.
You may have to define some new parameter types in <CODE>ResGer</CODE>.
<P></P>
<LI>To be able to test the construction,
define some words you need to instantiate it
in <CODE>LexGer</CODE>. Again, it can be helpful to define some simple-minded
morphological paradigms in <CODE>ResGer</CODE>, in particular worst-case
constructors corresponding to e.g.
<CODE>ResEng.mkNoun</CODE>.
<P></P>
<LI>Doing this, you may want to test the resource independently. Do this by
<PRE>
i -retain ResGer
cc mkNoun "Brief" "Briefe" Masc
</PRE>
<P></P>
<LI>Uncomment <CODE>NounGer</CODE> and <CODE>LexGer</CODE> in <CODE>TestGer</CODE>,
and compile <CODE>TestGer</CODE> in GF. Then test by parsing, linearization,
and random generation. In particular, linearization to a table should
be used so that you see all forms produced:
<PRE>
gr -cat=NP -number=20 -tr | l -table
</PRE>
<P></P>
<LI>Spare some tree-linearization pairs for later regression testing.
You can do it this way (!!to be completed)
@@ -319,8 +336,8 @@ very soon, keep you motivated, and reveal errors.
These modules will be written by you.
</P>
<UL>
<LI><CODE>ResGer</CODE>: parameter types and auxiliary operations
<LI><CODE>MorphoGer</CODE>: complete inflection engine; not needed for <CODE>Test</CODE>.
</UL>
<P>
@@ -342,13 +359,13 @@ package.
<P>
When the implementation of <CODE>Test</CODE> is complete, it is time to
work out the lexicon files. The underlying machinery is provided in
<CODE>MorphoGer</CODE>, which is, in effect, your linguistic theory of
German morphology. It can contain very sophisticated and complicated
definitions, which are not necessarily suitable for actually building a
lexicon. For this purpose, you should write the module
</P>
<UL>
<LI><CODE>ParadigmsGer</CODE>: morphological paradigms for the lexicographer.
</UL>
<P>
@@ -364,15 +381,15 @@ the functions
<UL>
<LI><CODE>mkN</CODE>, for worst-case construction of <CODE>N</CODE>. Its type signature
has the form
<PRE>
mkN : Str -&gt; ... -&gt; Str -&gt; P -&gt; ... -&gt; Q -&gt; N
</PRE>
with as many string and parameter arguments as can ever be needed to
construct an <CODE>N</CODE>.
<LI><CODE>regN</CODE>, for the most common cases, with just one string argument:
<PRE>
regN : Str -&gt; N
</PRE>
<LI>A language-dependent (small) set of functions to handle mild irregularities
and common exceptions.
<P></P>
@@ -380,15 +397,15 @@ For the complement-taking variants, such as <CODE>V2</CODE>, we provide
<P></P>
<LI><CODE>mkV2</CODE>, which takes a <CODE>V</CODE> and all necessary arguments, such
as case and preposition:
<PRE>
mkV2 : V -&gt; Case -&gt; Str -&gt; V2 ;
</PRE>
<LI>A language-dependent (small) set of functions to handle common special cases,
such as direct transitive verbs:
<PRE>
dirV2 : V -&gt; V2 ;
-- dirV2 v = mkV2 v accusative []
</PRE>
</UL>
<P>
@@ -403,33 +420,33 @@ The golden rule for the design of paradigms is that
The discipline of data abstraction moreover requires that the user of the resource
is not given access to parameter constructors, but only to constants that denote
them. This gives the resource grammarian the freedom to change the underlying
data representation if needed. It means that the <CODE>ParadigmsGer</CODE> module has
to define constants for those parameter types and constructors that
the application grammarian may need to use, e.g.
</P>
<PRE>
oper
Case : Type ;
nominative, accusative, genitive, dative : Case ;
</PRE>
<P>
These constants are defined in terms of parameter types and constructors
in <CODE>ResGer</CODE> and <CODE>MorphoGer</CODE>, which are not
accessible to the application grammarian.
</P>
<A NAME="toc11"></A>
<H3>Lock fields</H3>
<P>
An important difference between <CODE>MorphoGer</CODE> and
<CODE>ParadigmsGer</CODE> is that the former uses "raw" record types
as lincats, whereas the latter uses category symbols defined in
<CODE>CatGer</CODE>. When these category symbols are used to denote
record types in a resource module, such as <CODE>ParadigmsGer</CODE>,
a <B>lock field</B> is added to the record, so that categories
with the same implementation are not confused with each other.
(This is inspired by the <CODE>newtype</CODE> discipline in Haskell.)
For instance, the lincats of adverbs and conjunctions may be the same
in <CODE>CatGer</CODE>:
</P>
<PRE>
lincat Adv = {s : Str} ;
@@ -467,21 +484,21 @@ in her hidden definitions of constants in <CODE>Paradigms</CODE>. For instance,
<A NAME="toc12"></A>
<H3>Lexicon construction</H3>
<P>
The lexicon belonging to <CODE>LangGer</CODE> consists of two modules:
</P>
<UL>
<LI><CODE>StructuralGer</CODE>, structural words, built by directly using
<CODE>MorphoGer</CODE>.
<LI><CODE>BasicGer</CODE>, content words, built by using <CODE>ParadigmsGer</CODE>.
</UL>
<P>
The reason why <CODE>MorphoGer</CODE> has to be used in <CODE>StructuralGer</CODE>
is that <CODE>ParadigmsGer</CODE> does not contain constructors for closed
word classes such as pronouns and determiners. The reason why we
recommend <CODE>ParadigmsGer</CODE> for building <CODE>BasicGer</CODE> is that
the coverage of the paradigms gets thereby tested and that the
use of the paradigms in <CODE>BasicGer</CODE> gives a good set of examples for
those who want to build new lexica.
</P>
<A NAME="toc13"></A>
@@ -509,34 +526,31 @@ worst-case paradigms (<CODE>mkV</CODE> etc).
<P>
You can often find resources such as lists of
irregular verbs on the internet. For instance, the
<A HREF="http://www.iee.et.tu-dresden.de/~wernerr/grammar/verben_dt.html">Irregular German Verbs</A>
page gives a list of verbs in the
traditional tabular format, which begins as follows:
</P>
<PRE>
backen (du bäckst, er bäckt) backte [buk] gebacken
befehlen (du befiehlst, er befiehlt; befiehl!) befahl (beföhle; befähle) befohlen
beginnen begann (begönne; begänne) begonnen
beißen biß gebissen
</PRE>
<P>
All you have to do is to write a suitable verb paradigm
</P>
<PRE>
irregV : (x1,_,_,_,_,x6 : Str) -&gt; V ;
</PRE>
<P>
and a Perl or Python or Haskell script that transforms
the table to
</P>
<PRE>
backen_V = irregV "backen" "bäckt" "back" "backte" "backte" "gebacken" ;
befehlen_V = irregV "befehlen" "befiehlt" "befiehl" "befahl" "beföhle" "befohlen" ;
</PRE>
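<P>
The transformation script can be sketched in Python. This is a deliberately
simplified illustration: it only handles table rows without parenthesized or
bracketed variant forms, and it derives the present, imperative, and
subjunctive slots naively from the stem and the past form (real German
conjugation needs umlaut and vowel-change rules), so its output is only a
starting point for manual correction:
</P>

```python
def table_row_to_gf(line):
    """Turn a simple row 'infinitive past participle' of the verb table
    into an irregV lexicon entry.  Rows with parenthesized or bracketed
    variants are left for manual treatment (returns None).  The present,
    imperative and subjunctive forms are naive guesses from the stem and
    the past form and usually need hand correction."""
    if "(" in line or "[" in line:
        return None
    inf, past, part = line.split()
    stem = inf[:-2] if inf.endswith("en") else inf[:-1]
    forms = [inf, stem + "t", stem, past, past, part]
    return inf + "_V = irregV " + " ".join('"%s"' % f for f in forms) + " ;"

print(table_row_to_gf("beißen biß gebissen"))
```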
<P></P>
<P>
When using ready-made word lists, you should think about
copyright issues. Ideally, all resource grammar material should
@@ -563,7 +577,7 @@ extension modules. This chapter will deal with this issue.
<H2>Writing an instance of parametrized resource grammar implementation</H2>
<P>
Above we have looked at how a resource implementation is built by
the copy and paste method (from English to German), that is, formally
speaking, from scratch. A more elegant solution available for
families of languages such as Romance and Scandinavian is to
use parametrized modules. The advantages are