diff --git a/lib/resource-1.0/doc/Resource-HOWTO.html b/lib/resource-1.0/doc/Resource-HOWTO.html index ce75e4f59..811df81a5 100644 --- a/lib/resource-1.0/doc/Resource-HOWTO.html +++ b/lib/resource-1.0/doc/Resource-HOWTO.html @@ -1,540 +1,598 @@ - - + + + + + + + -
-

HOW TO WRITE A RESOURCE GRAMMAR

- -

- -Aarne Ranta -

-30 November 2005 -

- -

+

+
+

+ +

+
+

+

+Resource grammar HOWTO +Author: Aarne Ranta <aarne (at) cs.chalmers.se> +Last update: Thu Dec 8 14:52:30 2005 +

+ +

HOW TO WRITE A RESOURCE GRAMMAR

+

+ Aarne Ranta +

+

+ 20051208 +

+

The purpose of this document is to explain how to implement the GF -resource grammar API for a new language. We will not cover how +resource grammar API for a new language. We will not cover how to use the resource grammar, nor how to change the API. But we will give some hints on how to extend the API. - -

- -Notice. This document concerns the API v. 1.0 which has not +

+

+Notice. This document concerns the API v. 1.0 which has not yet been released. You can find the beginnings of it -in GF/lib/resource-1.0/. See the -resource-1.0/README for +in GF/lib/resource-1.0/. See the +resource-1.0/README for details on how this differs from previous versions. - - - -

The resource grammar API

- -The API is divided into a bunch of abstract modules. +

+ +

The resource grammar API

+

+The API is divided into a bunch of abstract modules. The following figure gives the dependencies of these modules. - -

- -
- +

+

+ +

+

It is advisable to start with a simpler subset of the API, which leaves out certain complicated but not always necessary things: tenses and most of the lexicon. - -

- -
- +

+

+ +

+

The module structure is rather flat: almost every module is a direct -parent of the top module (Lang or Test). The idea +parent of the top module (Lang or Test). The idea is that you can concentrate on one linguistic aspect at a time, or also distribute the work among several authors. - - -

Phrase category modules

- -The direct parents of the top could be called phrase category modules, +

+ +

Phrase category modules

+

+The direct parents of the top could be called phrase category modules, since each of them concentrates on a particular phrase category (nouns, verbs, adjectives, sentences,...). A phrase category module tells -how to construct phrases in that category. You will find out that +how to construct phrases in that category. You will find out that all functions in any of these modules have the same value type (or maybe one of a small number of different types). Thus we have -

- - - -

Infrastructure modules

+

+ + +

Infrastructure modules

+

Expressions of each phrase category are constructed in the corresponding -phrase category module. But their use takes mostly place in other modules. -For instance, noun phrases, which are constructed in Noun, are +phrase category module. But their use mostly takes place in other modules. +For instance, noun phrases, which are constructed in Noun, are used as arguments of functions of almost all other phrase category modules. How can we build all these modules independently of each other? - -

- -As usual in typeful programming, the only thing you need to know +

+

+As usual in typeful programming, the only thing you need to know about an object you use is its type. When writing a linearization rule for a GF abstract syntax function, the only things you need to know are the linearization types of its value and argument categories. To achieve the division of the resource grammar into several parallel phrase category modules, what we need is an underlying definition of the linearization types. This definition is given as the implementation of -

+

+ + +

Any resource grammar implementation has first to agree on how to implement -Cat. Luckily enough, even this can be done incrementally: you -can skip the lincat definition of a category and use the default -{s : Str} until you need to change it to something else. In +Cat. Luckily enough, even this can be done incrementally: you +can skip the lincat definition of a category and use the default +{s : Str} until you need to change it to something else. In English, for instance, most categories do have this linearization type! - -

- +

+

As a slight asymmetry in the module diagrams, you find the following modules: -

-The full resource API (Lang) uses Tensed, whereas the -restricted Test API uses Untensed. - - - -

Lexical modules

+

+ +

+The full resource API (Lang) uses Tensed, whereas the +restricted Test API uses Untensed. +

+ +

Lexical modules

+

What is lexical and what is syntactic is not as clearcut in GF as in some other grammar formalisms. Logically, however, lexical means -fun with no arguments. Linguistically, one may add to this -that the lin consists of only one token (or of a table whose values +fun with no arguments. Linguistically, one may add to this +that the lin consists of only one token (or of a table whose values are single tokens). Even in the restricted lexicon included in the resource API, the latter rule is sometimes violated in some languages. - -

- +

+

Another characterization of lexical is that lexical units can be added -almost ad libitum, and they cannot be defined in terms of already +almost ad libitum, and they cannot be defined in terms of already given rules. The lexical modules of the resource API are thus more like samples than complete lists. There are three such modules: -

-The module Structural aims for completeness, and is likely to -be extended in future releases of the resource. The module Basic +

+ + +

+The module Structural aims for completeness, and is likely to +be extended in future releases of the resource. The module Basic gives a "random" list of words, which enable interesting testing of syntax, and also a check list for morphology, since those words are likely to include most morphological patterns of the language. - -

- -The module Lex is used in Test instead of the two +

+

+The module Lex is used in Test instead of the two larger modules. Its purpose is to provide a quick way to test the syntactic structures of the phrase category modules without having to implement the larger lexica. - -

- -In the case of Basic it may come out clearer than anywhere else +

+

+In the case of Basic it may come out clearer than anywhere else in the API that it is impossible to give exact translation equivalents in different languages on the level of a resource grammar. In other words, application grammars are likely to use the resource in different ways for different languages. - - - -

Phases of the work

- -

Putting up a directory

- +

+ +

Phases of the work

+ +

Putting up a directory

+

Unless you are writing an instance of a parametrized implementation (Romance or Scandinavian), which will be covered later, the simplest way is to follow roughly this procedure. Assume you are building a grammar for the Dutch language. Here are the first steps. -

    -
  1. Create a sister directory for GF/lib/resource/english, named - dutch. -
    +

    +
      +
    1. Create a sister directory for GF/lib/resource/english, named + dutch. + ``` cd GF/lib/resource/ mkdir dutch cd dutch -
    - -
  2. Check out the - ISO 639 3-letter language code for Dutch: it is Dut. - -
  3. Copy the *Eng.gf files from english to dutch, + ``` +

    +
  4. Check out the ISO 639 3-letter language code + for Dutch: it is Dut. +

    +
  5. Copy the *Eng.gf files from english to dutch, and rename them: -
    +     ```
            cp ../english/*Eng.gf .
            rename 's/Eng/Dut/' *Eng.gf
    -     
    - -
  6. Change the Eng module references to Dut references + ``` +

    +
  7. Change the Eng module references to Dut references in all files: -
    -       sed -i 's/Eng/Dut/g' *Dut.gf
    -     
    - -
  8. This may of course change unwanted occurrences of the - string Eng - verify this by -
    -       grep Dut *.gf
    -     
    + ``` sed -i 's/Eng/Dut/g' *Dut.gf +

    +
  9. This may of course change unwanted occurrences of the + string Eng - verify this by + ``` grep Dut *.gf But you will have to make lots of manual changes in all files anyway! - -
  10. Comment out the contents of these files: -
    -       sed -i 's/^/--/' *Dut.gf
    -     
    +

    +
  11. Comment out the contents of these files: + ``` sed -i 's/^/--/' *Dut.gf This will give you a set of templates out of which the grammar will grow as you uncomment and modify the files rule by rule. - -
  12. In the file TestDut.gf, uncomment all lines except the list +

    +
  13. In the file TestDut.gf, uncomment all lines except the list of inherited modules. Now you can open the grammar in GF: -
    -       gf TestDut.gf
    -     
    - -
  14. Now you will at all following steps have a valid, but incomplete + ``` gf TestDut.gf +

    +
  15. Now you will at all following steps have a valid, but incomplete GF grammar. The GF command -
    -       pg -printer=missing
    -     
    + ``` pg -printer=missing tells you what exactly is missing. +
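The copy, rename, substitute, and comment-out steps above can also be sketched as one small Python function. This is only an illustration under assumptions: the function name `make_templates` and the sample file content are made up here, and unlike `rename 's/Eng/Dut/'` the sketch replaces only the language suffix of each file name.

```python
import tempfile
from pathlib import Path


def make_templates(src_dir, dst_dir, old="Eng", new="Dut"):
    """Copy *Eng.gf files into dst_dir as commented-out *Dut.gf templates."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    suffix = old + ".gf"
    created = []
    for path in sorted(src.glob("*" + suffix)):
        # NounEng.gf -> NounDut.gf (only the language suffix is changed)
        target = dst / (path.name[: -len(suffix)] + new + ".gf")
        # Same effect as: sed -i 's/Eng/Dut/g'
        text = path.read_text(encoding="utf-8").replace(old, new)
        # Same effect as: sed -i 's/^/--/' (every line gets a GF comment prefix)
        commented = "".join("--" + line for line in text.splitlines(keepends=True))
        target.write_text(commented, encoding="utf-8")
        created.append(target)
    return created


if __name__ == "__main__":
    # Demonstration on a throwaway directory with one made-up module file.
    with tempfile.TemporaryDirectory() as tmp:
        english = Path(tmp) / "english"
        english.mkdir()
        (english / "NounEng.gf").write_text(
            "concrete NounEng of Noun = CatEng ** {\n}\n", encoding="utf-8"
        )
        for path in make_templates(english, Path(tmp) / "dutch"):
            print(path.name)
            print(path.read_text(encoding="utf-8"), end="")
```

As with the `sed` command, the blanket substitution may touch unwanted occurrences of the string `Eng`, so the result still needs a manual check.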
- - - -

The develop-test cycle

- -The real work starts now. The order in which the Phrase modules + +

The develop-test cycle

+

+The real work starts now. The order in which the Phrase modules were introduced above is a natural order to proceed, even though not the only one. So you will find yourself iterating the following steps: - -

    -
  1. Select a phrase category module, e.g. NounDut, and uncomment one - linearization rule (for instance, IndefSg, which is +

    +
      +
    1. Select a phrase category module, e.g. NounDut, and uncomment one + linearization rule (for instance, IndefSg, which is not too complicated). - -
    2. Write down some Dutch examples of this rule, in this case translations +

      +
    3. Write down some Dutch examples of this rule, in this case translations of "a dog", "a house", "a big house", etc. - -
    4. Think about the categories involved (CN, NP, N) and the - variations they have. Encode this in the lincats of CatDut. - You may have to define some new parameter types in ResDut. - -
    5. To be able to test the construction, +

      +
    6. Think about the categories involved (CN, NP, N) and the + variations they have. Encode this in the lincats of CatDut. + You may have to define some new parameter types in ResDut. +

      +
    7. To be able to test the construction, define some words you need to instantiate it - in LexDut. Again, it can be helpful to define some simple-minded - morphological paradigms in ResDut, in particular worst-case + in LexDut. Again, it can be helpful to define some simple-minded + morphological paradigms in ResDut, in particular worst-case constructors corresponding to e.g. - ResEng.mkNoun. - -
    8. Doing this, you may want to test the resource independently. Do this by -
      +     ResEng.mkNoun.
      +

      +
    9. Doing this, you may want to test the resource independently. Do this by + ``` i -retain ResDut cc mkNoun "ei" "eieren" Neutr -
    10. - -
    11. Uncomment NounDut and LexDut in TestDut, - and compile TestDut in GF. Then test by parsing, linearization, + ``` +

      +
    12. Uncomment NounDut and LexDut in TestDut, + and compile TestDut in GF. Then test by parsing, linearization, and random generation. In particular, linearization to a table should be used so that you see all forms produced: -
      +     ```
              gr -cat=NP -number=20 -tr | l -table
      -     
      - -
    13. Spare some tree-linearization pairs for later regression testing. + ``` +

      +
    14. Save some tree-linearization pairs for later regression testing. You can do it this way (!!to be completed) +
    -
+

You are likely to run this cycle a few times for each linearization rule you implement, and some hundreds of times altogether. There are 159 -funs in Test (at the moment). - -

- +funs in Test (at the moment). +

+

Of course, you don't need to complete one phrase category module before starting -with the next one. Actually, a suitable subset of Noun, -Verb, and Adjective will lead to a reasonable coverage +with the next one. Actually, a suitable subset of Noun, +Verb, and Adjective will lead to a reasonable coverage very soon, keep you motivated, and reveal errors. - - -

Resource modules used

- +

+ +

Resource modules used

+

These modules will be written by you. -

+

+ + +

These modules are language-independent and provided by the existing resource package. -

- - -

Morphology and lexicon

- -When the implementation of Test is complete, it is time to + +

Morphology and lexicon

+

+When the implementation of Test is complete, it is time to work out the lexicon files. The underlying machinery is provided in -MorphoDut, which is, in effect, your linguistic theory of +MorphoDut, which is, in effect, your linguistic theory of Dutch morphology. It can contain very sophisticated and complicated definitions, which are not necessarily suitable for actually building a lexicon. For this purpose, you should write the module -

+

+ + +

This module provides high-level ways to define the linearization of -lexical items, of categories N, A, V and their complement-taking +lexical items, of categories N, A, V and their complement-taking variants. - -

- -For ease of use, the Paradigms modules follow a certain -naming convention. Thus they for each lexical category, such as N, +

+

+For ease of use, the Paradigms modules follow a certain -naming convention. Thus they for each lexical category, such as N, +naming convention. Thus they provide, for each lexical category, such as N, the functions -

+ +

The golden rule for the design of paradigms is that -

+ +

The discipline of data abstraction moreover requires that the user of the resource is not given access to parameter constructors, but only to constants that denote them. This gives the resource grammarian the freedom to change the underlying -data representation if needed. It means that the ParadigmsDut module has +data representation if needed. It means that the ParadigmsDut module has to define constants for those parameter types and constructors that the application grammarian may need to use, e.g. -

-  oper 
-    Case : Type ;
-    nominative, accusative, genitive : Case ;
-
+

+
+    oper 
+      Case : Type ;
+      nominative, accusative, genitive : Case ;
+
+

These constants are defined in terms of parameter types and constructors -in ResDut and MorphoDut, which modules are are not +in ResDut and MorphoDut, which modules are not accessible to the application grammarian. - - -

Lock fields

- -An important difference between MorphoDut and -ParadigmsDut is that the former uses "raw" record types +

+ +

Lock fields

+

+An important difference between MorphoDut and -ParadigmsDut is that the former uses "raw" record types +ParadigmsDut is that the former uses "raw" record types as lincats, whereas the latter uses category symbols defined in -CatDut. When these category symbols are used to denote -record types in a resource modules, such as ParadigmsDut, -a lock field is added to the record, so that categories +CatDut. When these category symbols are used to denote +record types in a resource module, such as ParadigmsDut, +a lock field is added to the record, so that categories with the same implementation are not confused with each other. -(This is inspired by the newtype discipline in Haskell.) +(This is inspired by the newtype discipline in Haskell.) For instance, the lincats of adverbs and conjunctions may be the same -in CatDut: -

-  lincat Adv  = {s : Str} ;
-  lincat Conj = {s : Str} ;
-
+in CatDut: +

+
+    lincat Adv  = {s : Str} ;
+    lincat Conj = {s : Str} ;
+
+

But when these category symbols are used to denote their linearization types in a resource module, these definitions are translated to -

-  oper Adv  : Type = {s : Str  ; lock_Adv  : {}} ;
-  oper Conj : Type = {s : Str} ; lock_Conj : {}} ;
-
+

+
+    oper Adv  : Type = {s : Str  ; lock_Adv  : {}} ;
+    oper Conj : Type = {s : Str  ; lock_Conj : {}} ;
+
+

In this way, the user of a resource grammar cannot confuse adverbs with conjunctions. In other words, the lock fields force the type checker to function as a grammaticality checker. - -

- -When the resource grammar is opened in an application grammar, the +

+

+When the resource grammar is opened in an application grammar, the lock fields are never seen (except possibly in type error messages), and the application grammarian should never write them herself. If she has to do this, it is a sign that the resource grammar is incomplete, and the proper way to proceed is to fix the resource grammar. - -

- +

+

The resource grammarian has to provide the dummy lock field values -in her hidden definitions of constants in Paradigms. For instance, -

-  mkAdv : Str -> Adv ;
-  -- mkAdv s = {s = s ; lock_Adv = <>} ;
-
+in her hidden definitions of constants in Paradigms. For instance, +

+
+    mkAdv : Str -> Adv ;
+    -- mkAdv s = {s = s ; lock_Adv = <>} ;
+
+

+ +

Lexicon construction

+

+The lexicon belonging to LangDut consists of two modules: +

+ - -

Lexicon construction

- -The lexicon belonging to LangDut consists of two modules: - -The reason why MorphoDut has to be used in StructuralDut -is that ParadigmsDut does not contain constructors for closed +

+The reason why MorphoDut has to be used in StructuralDut +is that ParadigmsDut does not contain constructors for closed word classes such as pronouns and determiners. The reason why we -recommend ParadigmsDut for building BasicDut is that +recommend ParadigmsDut for building BasicDut is that the coverage of the paradigms gets thereby tested and that the -use of the paradigms in BasicDut gives a good set of examples for +use of the paradigms in BasicDut gives a good set of examples for those who want to build new lexica. - - - - -

Inside phrase category modules

- -

Noun

- -

Verb

- -

Adjective

- - -

Lexicon extension

- -

The irregularity lexicon

- +

+ +

Inside phrase category modules

+ +

Noun

+ +

Verb

+ +

Adjective

+ +

Lexicon extension

+ +

The irregularity lexicon

+

It may be handy to provide a separate module of irregular verbs and other words which are difficult for a lexicographer to handle. There are usually a limited number of such words - a few hundred perhaps. Building such a lexicon separately also -makes it less important to cover everything by the -worst-case paradigms (mkV etc). - - - -

Lexicon extraction from a word list

- +makes it less important to cover everything by the +worst-case paradigms (mkV etc). +

+ +

Lexicon extraction from a word list

+

You can often find resources such as lists of irregular verbs on the internet. For instance, the - -Dutch for Travelers page gives a list of verbs in the +Dutch for Travelers +page gives a list of verbs in the traditional tabular format, which begins as follows: -

-  begrijpen begrijp begreep begrepen 	to understand
-  bijten    bijt    beet    gebeten     to bite
-  binden    bind    bond    gebonden 	to tie
-  breken    breek   brak    gebroken 	to break
-
+

+
+    begrijpen begrijp begreep begrepen 	to understand
+    bijten    bijt    beet    gebeten     to bite
+    binden    bind    bond    gebonden 	to tie
+    breken    breek   brak    gebroken 	to break
+
+

All you have to do is to write a suitable verb paradigm -

-  irregV : Str -> Str -> Str -> Str -> V ;
-
+

+
+    irregV : Str -> Str -> Str -> Str -> V ;
+
+

and a Perl or Python or Haskell script that transforms the table to -

-  begrijpen_V = irregV "begrijpen" "begrijp" "begreep" "begrepen" ;
-  bijten_V    = irregV "bijten"    "bijt"    "beet"    "gebeten" ;
-  binden_V    = irregV "binden"    "bind"    "bond"    "gebonden" ;
-
+

+
+    begrijpen_V = irregV "begrijpen" "begrijp" "begreep" "begrepen" ;
+    bijten_V    = irregV "bijten"    "bijt"    "beet"    "gebeten" ;
+    binden_V    = irregV "binden"    "bind"    "bond"    "gebonden" ;
+
+

(You may want to use the English translation for some purpose, as well.) - -
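A minimal sketch of such a script in Python follows. It assumes that every row holds four whitespace-separated verb forms followed by an English gloss, and that `irregV` takes the forms in the order shown above; the function name `table_to_gf` is made up for this illustration. The sketch does not align the columns of its output.

```python
def table_to_gf(table, paradigm="irregV"):
    """Turn rows of four verb forms (plus a gloss) into GF lexicon rules."""
    rules = []
    for line in table.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete rows
        forms = fields[:4]  # the remaining fields are the English gloss
        name = forms[0] + "_V"  # the infinitive names the lexical entry
        args = " ".join('"%s"' % form for form in forms)
        rules.append("%s = %s %s ;" % (name, paradigm, args))
    return rules


SAMPLE = """\
begrijpen begrijp begreep begrepen \tto understand
bijten    bijt    beet    gebeten     to bite
binden    bind    bond    gebonden \tto tie
"""

if __name__ == "__main__":
    for rule in table_to_gf(SAMPLE):
        print(rule)
```

Rows whose forms themselves contain spaces would need a different field separator, so the output is best checked by hand before it is added to the lexicon.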

- +

+

When using ready-made word lists, you should think about copyright issues. Ideally, all resource grammar material should be provided under the GNU General Public License. - - - -

Lexicon extraction from raw text data

- +

+ +

Lexicon extraction from raw text data

+

This is a cheap technique to build a lexicon of thousands of words, if text data is available in digital format. -See the -Functional Morphology homepage for details. - - - -

Extending the resource grammar API

- +See the Functional Morphology +homepage for details. +

+ +

Extending the resource grammar API

+

Sooner or later it will happen that the resource grammar API does not suffice for all applications. A common reason is that it does not include idiomatic expressions in a given language. The solution then is in the first place to build language-specific extension modules. This chapter will deal with this issue. - - -

Writing an instance of parametrized resource grammar implementation

- +

+ +

Writing an instance of parametrized resource grammar implementation

+

Above we have looked at how a resource implementation is built by the copy and paste method (from English to Dutch), that is, formally speaking, from scratch. A more elegant solution available for families of languages such as Romance and Scandinavian is to use parametrized modules. The advantages are -

+

+ + +

In this chapter, we will look at an example: adding Portuguese to the Romance family. - - - -

Parametrizing a resource grammar implementation

- +

+ +

Parametrizing a resource grammar implementation

+

This is the most demanding form of resource grammar writing. -We do not recommend the method of parametrizing from the +We do not recommend the method of parametrizing from the beginning: it is easier to have one language first implemented in the conventional way and then add another language of the same family by parametrization. This means that the copy and paste method is still used, but this time the differences -are put into an interface module. - -

- +are put into an interface module. +

+

This chapter will work out an example of how an Estonian grammar is constructed from the Finnish grammar through parametrization. +

- - - - + + + diff --git a/lib/resource-1.0/doc/Resource-HOWTO.txt b/lib/resource-1.0/doc/Resource-HOWTO.txt new file mode 100644 index 000000000..3910beabe --- /dev/null +++ b/lib/resource-1.0/doc/Resource-HOWTO.txt @@ -0,0 +1,542 @@ + +Resource grammar HOWTO +Author: Aarne Ranta +Last update: %%date(%c) + +% NOTE: this is a txt2tags file. +% Create an html file from this file using: +% txt2tags Resource-HOWTO.txt + +%!target:html + + + =HOW TO WRITE A RESOURCE GRAMMAR= + + + + [Aarne Ranta http://www.cs.chalmers.se/~aarne/] + + %%Date + + + +The purpose of this document is to tell how to implement the GF +resource grammar API for a new language. We will //not// cover how +to use the resource grammar, nor how to change the API. But we +will give some hints how to extend the API. + + + +**Notice**. This document concerns the API v. 1.0 which has not +yet been released. You can find the beginnings of it +in [``GF/lib/resource-1.0/`` ..]. See the +[``resource-1.0/README`` ../README] for +details on how this differs from previous versions. + + + +==The resource grammar API== + +The API is divided into a bunch of ``abstract`` modules. +The following figure gives the dependencies of these modules. + +[Lang.png] + + +It is advisable to start with a simpler subset of the API, which +leaves out certain complicated but not always necessary things: +tenses and most part of the lexicon. + + +[Test.png] + + + +The module structure is rather flat: almost every module is a direct +parent of the top module (``Lang`` or ``Test``). The idea +is that you can concentrate on one linguistic aspect at a time, or +also distribute the work among several authors. + + +===Phrase category modules=== + +The direct parents of the top could be called **phrase category modules**, +since each of them concentrates on a particular phrase category (nouns, verbs, +adjectives, sentences,...). A phrase category module tells +//how to construct phrases in that category//. 
You will find out that +all functions in any of these modules have the same value type (or maybe +one of a small number of different types). Thus we have + + +- ``Noun``: construction of nouns and noun phrases +- ``Adjective``: construction of adjectival phrases +- ``Verb``: construction of verb phrases +- ``Adverb``: construction of adverbial phrases +- ``Numeral``: construction of cardinal and ordinal numerals +- ``Sentence``: construction of sentences and imperatives +- ``Question``: construction of questions +- ``Relative``: construction of relative clauses +- ``Conjunction``: coordination of phrases +- ``Phrase``: construction of the major units of text and speech + + + + +===Infrastructure modules=== + +Expressions of each phrase category are constructed in the corresponding +phrase category module. But their //use// takes mostly place in other modules. +For instance, noun phrases, which are constructed in ``Noun``, are +used as arguments of functions of almost all other phrase category modules. +How can we build all these modules independently of each other? + + + +As usual in typeful programming, the //only// thing you need to know +about an object you use is its type. When writing a linearization rule +for a GF abstract syntax function, the only thing you need to know is +the linearization types of its value and argument categories. To achieve +the division of the resource grammar to several parallel phrase category modules, +what we need is an underlying definition of the linearization types. This +definition is given as the implementation of + +- ``Cat``: syntactic categories of the resource grammar + + +Any resource grammar implementation has first to agree on how to implement +``Cat``. Luckily enough, even this can be done incrementally: you +can skip the ``lincat`` definition of a category and use the default +``{s : Str}`` until you need to change it to something else. In +English, for instance, most categories do have this linearization type! 
+ + + +As a slight asymmetry in the module diagrams, you find the following +modules: + +- ``Tense``: defines the parameters of polarity, anteriority, and tense +- ``Tensed``: defines how sentences use those parameters +- ``Untensed``: makes sentences use the polarity parameter only + + +The full resource API (``Lang``) uses ``Tensed``, whereas the +restricted ``Test`` API uses ``Untensed``. + + + +===Lexical modules=== + +What is lexical and what is syntactic is not as clearcut in GF as in +some other grammar formalisms. Logically, however, lexical means +``fun`` with no arguments. Linguistically, one may add to this +that the ``lin`` consists of only one token (or of a table whose values +are single tokens). Even in the restricted lexicon included in the resource +API, the latter rule is sometimes violated in some languages. + + + +Another characterization of lexical is that lexical units can be added +almost //ad libitum//, and they cannot be defined in terms of already +given rules. The lexical modules of the resource API are thus more like +samples than complete lists. There are three such modules: + +- ``Structural``: structural words (determiners, conjunctions,...) +- ``Basic``: basic everyday content words (nouns, verbs,...) +- ``Lex``: a very small sample of both structural and content words + + +The module ``Structural`` aims for completeness, and is likely to +be extended in future releases of the resource. The module ``Basic`` +gives a "random" list of words, which enable interesting testing of syntax, +and also a check list for morphology, since those words are likely to include +most morphological patterns of the language. + + + +The module ``Lex`` is used in ``Test`` instead of the two +larger modules. Its purpose is to provide a quick way to test the +syntactic structures of the phrase category modules without having to implement +the larger lexica. 
+ + + +In the case of ``Basic`` it may come out clearer than anywhere else +in the API that it is impossible to give exact translation equivalents in +different languages on the level of a resource grammar. In other words, +application grammars are likely to use the resource in different ways for +different languages. + + + +==Phases of the work== + +===Putting up a directory=== + +Unless you are writing an instance of a parametrized implementation +(Romance or Scandinavian), which will be covered later, the most +simple way is to follow roughly the following procedure. Assume you +are building a grammar for the Dutch language. Here are the first steps. + ++ Create a sister directory for ``GF/lib/resource/english``, named + ``dutch``. + ``` + cd GF/lib/resource/ + mkdir dutch + cd dutch + ``` + ++ Check out the [ISO 639 3-letter language code http://www.w3.org/WAI/ER/IG/ert/iso639.htm] + for Dutch: it is ``Dut``. + ++ Copy the ``*Eng.gf`` files from ``english`` ``dutch``, + and rename them: + ``` + cp ../english/*Eng.gf . + rename 's/Eng/Dut/' *Eng.gf + ``` + ++ Change the ``Eng`` module references to ``Dut`` references + in all files: + ``` sed -i 's/Eng/Dut/g' *Dut.gf + ++ This may of course change unwanted occurrences of the + string ``Eng`` - verify this by + ``` grep Dut *.gf + But you will have to make lots of manual changes in all files anyway! + ++ Comment out the contents of these files: + ``` sed -i 's/^/--/' *Dut.gf + This will give you a set of templates out of which the grammar + will grow as you uncomment and modify the files rule by rule. + ++ In the file ``TestDut.gf``, uncomment all lines except the list + of inherited modules. Now you can open the grammar in GF: + ``` gf TestDut.gf + ++ Now you will at all following steps have a valid, but incomplete + GF grammar. The GF command + ``` pg -printer=missing + tells you what exactly is missing. + + + +===The develop-test cycle=== + +The real work starts now. 
The order in which the ``Phrase`` modules +were introduced above is a natural order to proceed, even though not the +only one. So you will find yourself iterating the following steps: + ++ Select a phrase category module, e.g. ``NounDut``, and uncomment one + linearization rule (for instance, ``IndefSg``, which is + not too complicated). + ++ Write down some Dutch examples of this rule, in this case translations + of "a dog", "a house", "a big house", etc. + ++ Think about the categories involved (``CN, NP, N``) and the + variations they have. Encode this in the lincats of ``CatDut``. + You may have to define some new parameter types in ``ResDut``. + ++ To be able to test the construction, + define some words you need to instantiate it + in ``LexDut``. Again, it can be helpful to define some simple-minded + morphological paradigms in ``ResDut``, in particular worst-case + constructors corresponding to e.g. + ``ResEng.mkNoun``. + ++ Doing this, you may want to test the resource independently. Do this by + ``` + i -retain ResDut + cc mkNoun "ei" "eieren" Neutr + ``` + ++ Uncomment ``NounDut`` and ``LexDut`` in ``TestDut``, + and compile ``TestDut`` in GF. Then test by parsing, linearization, + and random generation. In particular, linearization to a table should + be used so that you see all forms produced: + ``` + gr -cat=NP -number=20 -tr | l -table + ``` + ++ Spare some tree-linearization pairs for later regression testing. + You can do this way (!!to be completed) + + +You are likely to run this cycle a few times for each linearization rule +you implement, and some hundreds of times altogether. There are 159 +``funs`` in ``Test`` (at the moment). + + + +Of course, you don't need to complete one phrase category module before starting +with the next one. Actually, a suitable subset of ``Noun``, +``Verb``, and ``Adjective`` will lead to a reasonable coverage +very soon, keep you motivated, and reveal errors. 
+ + +===Resource modules used=== + +These modules will be written by you. + +- ``ResDut``: parameter types and auxiliary operations +- ``MorphoDut``: complete inflection engine; not needed for ``Test``. + + +These modules are language-independent and provided by the existing resource +package. + +- ``ParamX``: parameter types used in many languages +- ``TenseX``: implementation of the logical tense, anteriority, + and polarity parameters +- ``Coordination``: operations to deal with lists and coordination +- ``Prelude``: general-purpose operations on strings, records, + truth values, etc. +- ``Predefined``: general-purpose operations with hard-coded definitions + + + + +===Morphology and lexicon=== + +When the implementation of ``Test`` is complete, it is time to +work out the lexicon files. The underlying machinery is provided in +``MorphoDut``, which is, in effect, your linguistic theory of +Dutch morphology. It can contain very sophisticated and complicated +definitions, which are not necessarily suitable for actually building a +lexicon. For this purpose, you should write the module + +- ``ParadigmsDut``: morphological paradigms for the lexicographer. + + +This module provides high-level ways to define the linearization of +lexical items, of categories ``N, A, V`` and their complement-taking +variants. + + + +For ease of use, the ``Paradigms`` modules follow a certain +naming convention. Thus they provide, for each lexical category, such as ``N``, +the functions + +- ``mkN``, for worst-case construction of ``N``. Its type signature + has the form + ``` + mkN : Str -> ... -> Str -> P -> ... -> Q -> N + ``` + with as many string and parameter arguments as can ever be needed to + construct an ``N``. +- ``regN``, for the most common cases, with just one string argument: + ``` + regN : Str -> N + ``` +- A language-dependent (small) set of functions to handle mild irregularities + and common exceptions. 
+ +For the complement-taking variants, such as ``V2``, we provide + +- ``mkV2``, which takes a ``V`` and all necessary arguments, such + as case and preposition: + ``` + mkV2 : V -> Case -> Str -> V2 ; + ``` +- A language-dependent (small) set of functions to handle common special cases, + such as direct transitive verbs: + ``` + dirV2 : V -> V2 ; + -- dirV2 v = mkV2 v accusative [] + ``` + + +The golden rule for the design of paradigms is that + +- The user will only need function applications with constants and strings, + never any records or tables. + + +The discipline of data abstraction moreover requires that the user of the resource +is not given access to parameter constructors, but only to constants that denote +them. This gives the resource grammarian the freedom to change the underlying +data representation if needed. It means that the ``ParadigmsDut`` module has +to define constants for those parameter types and constructors that +the application grammarian may need to use, e.g. +``` + oper + Case : Type ; + nominative, accusative, genitive : Case ; +``` +These constants are defined in terms of parameter types and constructors +in ``ResDut`` and ``MorphoDut``, which are not +accessible to the application grammarian. + + +===Lock fields=== + +An important difference between ``MorphoDut`` and +``ParadigmsDut`` is that the former uses "raw" record types +as lincats, whereas the latter uses category symbols defined in +``CatDut``. When these category symbols are used to denote +record types in a resource module, such as ``ParadigmsDut``, +a **lock field** is added to the record, so that categories +with the same implementation are not confused with each other. +(This is inspired by the ``newtype`` discipline in Haskell.) 
+For instance, the lincats of adverbs and conjunctions may be the same +in ``CatDut``: +``` + lincat Adv = {s : Str} ; + lincat Conj = {s : Str} ; +``` +But when these category symbols are used to denote their linearization +types in a resource module, these definitions are translated to +``` + oper Adv : Type = {s : Str ; lock_Adv : {}} ; + oper Conj : Type = {s : Str ; lock_Conj : {}} ; +``` +In this way, the user of a resource grammar cannot confuse adverbs with +conjunctions. In other words, the lock fields force the type checker +to function as a grammaticality checker. + + + +When the resource grammar is ``open``ed in an application grammar, the +lock fields are never seen (except possibly in type error messages), +and the application grammarian should never write them herself. If she +has to do this, it is a sign that the resource grammar is incomplete, and +the proper way to proceed is to fix the resource grammar. + + + +The resource grammarian has to provide the dummy lock field values +in her hidden definitions of constants in ``Paradigms``. For instance, +``` + mkAdv : Str -> Adv ; + -- mkAdv s = {s = s ; lock_Adv = <>} ; +``` + + +===Lexicon construction=== + +The lexicon belonging to ``LangDut`` consists of two modules: + +- ``StructuralDut``, structural words, built by directly using + ``MorphoDut``. +- ``BasicDut``, content words, built by using ``ParadigmsDut``. + + +The reason why ``MorphoDut`` has to be used in ``StructuralDut`` +is that ``ParadigmsDut`` does not contain constructors for closed +word classes such as pronouns and determiners. The reason why we +recommend ``ParadigmsDut`` for building ``BasicDut`` is that +the coverage of the paradigms gets thereby tested and that the +use of the paradigms in ``BasicDut`` gives a good set of examples for +those who want to build new lexica. 
+ + + + +==Inside phrase category modules== + +===Noun=== + +===Verb=== + +===Adjective=== + + +==Lexicon extension== + +===The irregularity lexicon=== + +It may be handy to provide a separate module of irregular +verbs and other words which are difficult for a lexicographer +to handle. There are usually a limited number of such words - a +few hundred perhaps. Building such a lexicon separately also +makes it less important to cover //everything// by the +worst-case paradigms (``mkV`` etc). + + + +===Lexicon extraction from a word list=== + +You can often find resources such as lists of +irregular verbs on the internet. For instance, the +[Dutch for Travelers http://www.dutchtrav.com/gram/irrverbs.html] +page gives a list of verbs in the +traditional tabular format, which begins as follows: +``` + begrijpen begrijp begreep begrepen to understand + bijten bijt beet gebeten to bite + binden bind bond gebonden to tie + breken breek brak gebroken to break +``` +All you have to do is to write a suitable verb paradigm +``` + irregV : Str -> Str -> Str -> Str -> V ; +``` +and a Perl or Python or Haskell script that transforms +the table to +``` + begrijpen_V = irregV "begrijpen" "begrijp" "begreep" "begrepen" ; + bijten_V = irregV "bijten" "bijt" "beet" "gebeten" ; + binden_V = irregV "binden" "bind" "bond" "gebonden" ; +``` +(You may want to use the English translation for some purpose, as well.) + + + +When using ready-made word lists, you should think about +copyright issues. Ideally, all resource grammar material should +be provided under the GNU General Public License. + + + +===Lexicon extraction from raw text data=== + +This is a cheap technique to build a lexicon of thousands +of words, if text data is available in digital format. +See the [Functional Morphology http://www.cs.chalmers.se/~markus/FM/] +homepage for details. 
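Returning to the word-list method described earlier in this chapter: the conversion script can be very short. The following Python sketch is only an illustration, not part of the resource library. The function name ``table_to_gf`` is hypothetical, and the code assumes that every row consists of four whitespace-separated verb forms followed by an optional English gloss, as in the sample table. The gloss is kept as a GF comment.

```python
def table_to_gf(lines):
    """Convert rows of a verb table into GF lexicon entries using irregV.

    Assumption: the first four whitespace-separated fields of each row are
    the verb forms expected by irregV; anything after them is the gloss.
    """
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        forms = fields[:4]
        gloss = " ".join(fields[4:])
        quoted = " ".join('"{}"'.format(form) for form in forms)
        entry = "{}_V = irregV {} ;".format(forms[0], quoted)
        if gloss:
            entry += " -- " + gloss  # keep the translation as a GF comment
        entries.append(entry)
    return entries


# Sample rows copied from the table shown in this chapter.
sample = [
    "  begrijpen  begrijp  begreep  begrepen  to understand",
    "  bijten     bijt     beet     gebeten   to bite",
]
for entry in table_to_gf(sample):
    print(entry)
```

A real word list may need extra cleaning first (alternative forms, multi-word entries, character encoding), which this sketch does not attempt.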
+ + + +===Extending the resource grammar API=== + +Sooner or later it will happen that the resource grammar API +does not suffice for all applications. A common reason is +that it does not include idiomatic expressions in a given language. +The solution then is in the first place to build language-specific +extension modules. This chapter will deal with this issue. + + +==Writing an instance of parametrized resource grammar implementation== + +Above we have looked at how a resource implementation is built by +the copy and paste method (from English to Dutch), that is, formally +speaking, from scratch. A more elegant solution available for +families of languages such as Romance and Scandinavian is to +use parametrized modules. The advantages are + +- theoretical: linguistic generalizations and insights +- practical: maintainability improves with fewer components + + +In this chapter, we will look at an example: adding Portuguese to +the Romance family. + + + +==Parametrizing a resource grammar implementation== + +This is the most demanding form of resource grammar writing. +We do //not// recommend the method of parametrizing from the +beginning: it is easier to have one language first implemented +in the conventional way and then add another language of the +same family by parametrization. This means that the copy and +paste method is still used, but this time the differences +are put into an ``interface`` module. + + + +This chapter will work out an example of how an Estonian grammar +is constructed from the Finnish grammar through parametrization. + +