moved resource-howto to txt2tags format

aarne
2005-12-08 13:54:33 +00:00
parent a8a5080693
commit 89423c7ad3
2 changed files with 974 additions and 374 deletions

@@ -1,540 +1,598 @@
<html> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<body> <HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.sf.net">
</HEAD><BODY BGCOLOR="white" TEXT="black">
<FONT SIZE="4">
</FONT></CENTER>
<center> <P></P>
<h1>HOW TO WRITE A RESOURCE GRAMMAR</h1> <HR NOSHADE SIZE=1>
<P></P>
<p> <UL>
<LI><A HREF="#toc1">HOW TO WRITE A RESOURCE GRAMMAR</A>
<a href="http://www.cs.chalmers.se/~aarne/">Aarne Ranta</a> <UL>
<p> <LI><A HREF="#toc2">The resource grammar API</A>
30 November 2005 <UL>
</center> <LI><A HREF="#toc3">Phrase category modules</A>
<LI><A HREF="#toc4">Infrastructure modules</A>
<p> <LI><A HREF="#toc5">Lexical modules</A>
</UL>
<LI><A HREF="#toc6">Phases of the work</A>
<UL>
<LI><A HREF="#toc7">Putting up a directory</A>
<LI><A HREF="#toc8">The develop-test cycle</A>
<LI><A HREF="#toc9">Resource modules used</A>
<LI><A HREF="#toc10">Morphology and lexicon</A>
<LI><A HREF="#toc11">Lock fields</A>
<LI><A HREF="#toc12">Lexicon construction</A>
</UL>
<LI><A HREF="#toc13">Inside phrase category modules</A>
<UL>
<LI><A HREF="#toc14">Noun</A>
<LI><A HREF="#toc15">Verb</A>
<LI><A HREF="#toc16">Adjective</A>
</UL>
<LI><A HREF="#toc17">Lexicon extension</A>
<UL>
<LI><A HREF="#toc18">The irregularity lexicon</A>
<LI><A HREF="#toc19">Lexicon extraction from a word list</A>
<LI><A HREF="#toc20">Lexicon extraction from raw text data</A>
<LI><A HREF="#toc21">Extending the resource grammar API</A>
</UL>
<LI><A HREF="#toc22">Writing an instance of parametrized resource grammar implementation</A>
<LI><A HREF="#toc23">Parametrizing a resource grammar implementation</A>
</UL>
</UL>
<P></P>
<HR NOSHADE SIZE=1>
<P></P>
<P>
Resource grammar HOWTO
Author: Aarne Ranta &lt;aarne (at) cs.chalmers.se&gt;
Last update: Thu Dec 8 14:52:30 2005
</P>
<A NAME="toc1"></A>
<H1>HOW TO WRITE A RESOURCE GRAMMAR</H1>
<P>
<A HREF="http://www.cs.chalmers.se/~aarne/">Aarne Ranta</A>
</P>
<P>
20051208
</P>
<P>
The purpose of this document is to tell how to implement the GF The purpose of this document is to tell how to implement the GF
resource grammar API for a new language. We will <i>not</i> cover how resource grammar API for a new language. We will <I>not</I> cover how
to use the resource grammar, nor how to change the API. But we to use the resource grammar, nor how to change the API. But we
will give some hints on how to extend the API. will give some hints on how to extend the API.
</P>
<p> <P>
<B>Notice</B>. This document concerns the API v. 1.0 which has not
<b>Notice</b>. This document concerns the API v. 1.0 which has not
yet been released. You can find the beginnings of it yet been released. You can find the beginnings of it
in <a href=".."><tt>GF/lib/resource-1.0/</tt></a>. See the in <A HREF=".."><CODE>GF/lib/resource-1.0/</CODE></A>. See the
<a href="../README"><tt>resource-1.0/README</tt></a> for <A HREF="../README"><CODE>resource-1.0/README</CODE></A> for
details on how this differs from previous versions. details on how this differs from previous versions.
</P>
<A NAME="toc2"></A>
<H2>The resource grammar API</H2>
<h2>The resource grammar API</h2> <P>
The API is divided into a bunch of <CODE>abstract</CODE> modules.
The API is divided into a bunch of <tt>abstract</tt> modules.
The following figure gives the dependencies of these modules. The following figure gives the dependencies of these modules.
</P>
<center> <P>
<img src="Lang.png"> <IMG ALIGN="left" SRC="Lang.png" BORDER="0" ALT="">
</center> </P>
<P>
It is advisable to start with a simpler subset of the API, which It is advisable to start with a simpler subset of the API, which
leaves out certain complicated but not always necessary things: leaves out certain complicated but not always necessary things:
tenses and most of the lexicon. tenses and most of the lexicon.
</P>
<center> <P>
<img src="Test.png"> <IMG ALIGN="left" SRC="Test.png" BORDER="0" ALT="">
</center> </P>
<P>
The module structure is rather flat: almost every module is a direct The module structure is rather flat: almost every module is a direct
parent of the top module (<tt>Lang</tt> or <tt>Test</tt>). The idea parent of the top module (<CODE>Lang</CODE> or <CODE>Test</CODE>). The idea
is that you can concentrate on one linguistic aspect at a time, or is that you can concentrate on one linguistic aspect at a time, or
also distribute the work among several authors. also distribute the work among several authors.
</P>
<A NAME="toc3"></A>
<h3>Phrase category modules</h3> <H3>Phrase category modules</H3>
<P>
The direct parents of the top could be called <b>phrase category modules</b>, The direct parents of the top could be called <B>phrase category modules</B>,
since each of them concentrates on a particular phrase category (nouns, verbs, since each of them concentrates on a particular phrase category (nouns, verbs,
adjectives, sentences,...). A phrase category module tells adjectives, sentences,...). A phrase category module tells
<i>how to construct phrases in that category</i>. You will find out that <I>how to construct phrases in that category</I>. You will find out that
all functions in any of these modules have the same value type (or maybe all functions in any of these modules have the same value type (or maybe
one of a small number of different types). Thus we have one of a small number of different types). Thus we have
<ul> </P>
<li> <tt>Noun</tt>: construction of nouns and noun phrases <UL>
<li> <tt>Adjective</tt>: construction of adjectival phrases <LI><CODE>Noun</CODE>: construction of nouns and noun phrases
<li> <tt>Verb</tt>: construction of verb phrases <LI><CODE>Adjective</CODE>: construction of adjectival phrases
<li> <tt>Adverb</tt>: construction of adverbial phrases <LI><CODE>Verb</CODE>: construction of verb phrases
<li> <tt>Numeral</tt>: construction of cardinal and ordinal numerals <LI><CODE>Adverb</CODE>: construction of adverbial phrases
<li> <tt>Sentence</tt>: construction of sentences and imperatives <LI><CODE>Numeral</CODE>: construction of cardinal and ordinal numerals
<li> <tt>Question</tt>: construction of questions <LI><CODE>Sentence</CODE>: construction of sentences and imperatives
<li> <tt>Relative</tt>: construction of relative clauses <LI><CODE>Question</CODE>: construction of questions
<li> <tt>Conjunction</tt>: coordination of phrases <LI><CODE>Relative</CODE>: construction of relative clauses
<li> <tt>Phrase</tt>: construction of the major units of text and speech <LI><CODE>Conjunction</CODE>: coordination of phrases
</ul> <LI><CODE>Phrase</CODE>: construction of the major units of text and speech
</UL>
<h3>Infrastructure modules</h3>
<A NAME="toc4"></A>
<H3>Infrastructure modules</H3>
<P>
Expressions of each phrase category are constructed in the corresponding Expressions of each phrase category are constructed in the corresponding
phrase category module. But their <i>use</i> mostly takes place in other modules. phrase category module. But their <I>use</I> mostly takes place in other modules.
For instance, noun phrases, which are constructed in <tt>Noun</tt>, are For instance, noun phrases, which are constructed in <CODE>Noun</CODE>, are
used as arguments of functions of almost all other phrase category modules. used as arguments of functions of almost all other phrase category modules.
How can we build all these modules independently of each other? How can we build all these modules independently of each other?
</P>
<p> <P>
As usual in typeful programming, the <I>only</I> thing you need to know
As usual in typeful programming, the <i>only</i> thing you need to know
about an object you use is its type. When writing a linearization rule about an object you use is its type. When writing a linearization rule
for a GF abstract syntax function, the only thing you need to know is for a GF abstract syntax function, the only thing you need to know is
the linearization types of its value and argument categories. To achieve the linearization types of its value and argument categories. To achieve
the division of the resource grammar into several parallel phrase category modules, the division of the resource grammar into several parallel phrase category modules,
what we need is an underlying definition of the linearization types. This what we need is an underlying definition of the linearization types. This
definition is given as the implementation of definition is given as the implementation of
<ul> </P>
<li> <tt>Cat</tt>: syntactic categories of the resource grammar <UL>
</ul> <LI><CODE>Cat</CODE>: syntactic categories of the resource grammar
</UL>
<P>
Any resource grammar implementation has first to agree on how to implement Any resource grammar implementation has first to agree on how to implement
<tt>Cat</tt>. Luckily enough, even this can be done incrementally: you <CODE>Cat</CODE>. Luckily enough, even this can be done incrementally: you
can skip the <tt>lincat</tt> definition of a category and use the default can skip the <CODE>lincat</CODE> definition of a category and use the default
<tt>{s : Str}</tt> until you need to change it to something else. In <CODE>{s : Str}</CODE> until you need to change it to something else. In
English, for instance, most categories do have this linearization type! English, for instance, most categories do have this linearization type!
</P>
<p> <P>
As a slight asymmetry in the module diagrams, you find the following As a slight asymmetry in the module diagrams, you find the following
modules: modules:
<ul> </P>
<li> <tt>Tense</tt>: defines the parameters of polarity, anteriority, and tense <UL>
<li> <tt>Tensed</tt>: defines how sentences use those parameters <LI><CODE>Tense</CODE>: defines the parameters of polarity, anteriority, and tense
<li> <tt>Untensed</tt>: makes sentences use the polarity parameter only <LI><CODE>Tensed</CODE>: defines how sentences use those parameters
</ul> <LI><CODE>Untensed</CODE>: makes sentences use the polarity parameter only
The full resource API (<tt>Lang</tt>) uses <tt>Tensed</tt>, whereas the </UL>
restricted <tt>Test</tt> API uses <tt>Untensed</tt>.
<h3>Lexical modules</h3>
<P>
The full resource API (<CODE>Lang</CODE>) uses <CODE>Tensed</CODE>, whereas the
restricted <CODE>Test</CODE> API uses <CODE>Untensed</CODE>.
</P>
<A NAME="toc5"></A>
<H3>Lexical modules</H3>
<P>
What is lexical and what is syntactic is not as clear-cut in GF as in What is lexical and what is syntactic is not as clear-cut in GF as in
some other grammar formalisms. Logically, however, lexical means some other grammar formalisms. Logically, however, lexical means
<tt>fun</tt> with no arguments. Linguistically, one may add to this <CODE>fun</CODE> with no arguments. Linguistically, one may add to this
that the <tt>lin</tt> consists of only one token (or of a table whose values that the <CODE>lin</CODE> consists of only one token (or of a table whose values
are single tokens). Even in the restricted lexicon included in the resource are single tokens). Even in the restricted lexicon included in the resource
API, the latter rule is sometimes violated in some languages. API, the latter rule is sometimes violated in some languages.
</P>
<p> <P>
Another characterization of lexical is that lexical units can be added Another characterization of lexical is that lexical units can be added
almost <i>ad libitum</i>, and they cannot be defined in terms of already almost <I>ad libitum</I>, and they cannot be defined in terms of already
given rules. The lexical modules of the resource API are thus more like given rules. The lexical modules of the resource API are thus more like
samples than complete lists. There are three such modules: samples than complete lists. There are three such modules:
<ul> </P>
<li> <tt>Structural</tt>: structural words (determiners, conjunctions,...) <UL>
<li> <tt>Basic</tt>: basic everyday content words (nouns, verbs,...) <LI><CODE>Structural</CODE>: structural words (determiners, conjunctions,...)
<li> <tt>Lex</tt>: a very small sample of both structural and content words <LI><CODE>Basic</CODE>: basic everyday content words (nouns, verbs,...)
</ul> <LI><CODE>Lex</CODE>: a very small sample of both structural and content words
The module <tt>Structural</tt> aims for completeness, and is likely to </UL>
be extended in future releases of the resource. The module <tt>Basic</tt>
<P>
The module <CODE>Structural</CODE> aims for completeness, and is likely to
be extended in future releases of the resource. The module <CODE>Basic</CODE>
gives a "random" list of words, which enables interesting testing of syntax, gives a "random" list of words, which enables interesting testing of syntax,
and also a check list for morphology, since those words are likely to include and also a check list for morphology, since those words are likely to include
most morphological patterns of the language. most morphological patterns of the language.
</P>
<p> <P>
The module <CODE>Lex</CODE> is used in <CODE>Test</CODE> instead of the two
The module <tt>Lex</tt> is used in <tt>Test</tt> instead of the two
larger modules. Its purpose is to provide a quick way to test the larger modules. Its purpose is to provide a quick way to test the
syntactic structures of the phrase category modules without having to implement syntactic structures of the phrase category modules without having to implement
the larger lexica. the larger lexica.
</P>
<p> <P>
In the case of <CODE>Basic</CODE> it may come out clearer than anywhere else
In the case of <tt>Basic</tt> it may come out clearer than anywhere else
in the API that it is impossible to give exact translation equivalents in in the API that it is impossible to give exact translation equivalents in
different languages on the level of a resource grammar. In other words, different languages on the level of a resource grammar. In other words,
application grammars are likely to use the resource in different ways for application grammars are likely to use the resource in different ways for
different languages. different languages.
</P>
<A NAME="toc6"></A>
<H2>Phases of the work</H2>
<h2>Phases of the work</h2> <A NAME="toc7"></A>
<H3>Putting up a directory</H3>
<h3>Putting up a directory</h3> <P>
Unless you are writing an instance of a parametrized implementation Unless you are writing an instance of a parametrized implementation
(Romance or Scandinavian), which will be covered later, the (Romance or Scandinavian), which will be covered later, the
simplest way is to follow roughly the following procedure. Assume you simplest way is to follow roughly the following procedure. Assume you
are building a grammar for the Dutch language. Here are the first steps. are building a grammar for the Dutch language. Here are the first steps.
<ol> </P>
<li> Create a sister directory for <tt>GF/lib/resource/english</tt>, named <OL>
<tt>dutch</tt>. <LI>Create a sister directory for <CODE>GF/lib/resource/english</CODE>, named
<pre> <CODE>dutch</CODE>.
```
cd GF/lib/resource/ cd GF/lib/resource/
mkdir dutch mkdir dutch
cd dutch cd dutch
</pre> ```
<P></P>
<li> Check out the <a href="http://www.w3.org/WAI/ER/IG/ert/iso639.htm"> <LI>Check out the <A HREF="http://www.w3.org/WAI/ER/IG/ert/iso639.htm">ISO 639 3-letter language code</A>
ISO 639 3-letter language code</a> for Dutch: it is <tt>Dut</tt>. for Dutch: it is <CODE>Dut</CODE>.
<P></P>
<li> Copy the <tt>*Eng.gf</tt> files from <tt>english</tt> to <tt>dutch</tt>, <LI>Copy the <CODE>*Eng.gf</CODE> files from <CODE>english</CODE> to <CODE>dutch</CODE>,
and rename them: and rename them:
<pre> ```
cp ../english/*Eng.gf . cp ../english/*Eng.gf .
rename 's/Eng/Dut/' *Eng.gf rename 's/Eng/Dut/' *Eng.gf
</pre> ```
<P></P>
<li> Change the <tt>Eng</tt> module references to <tt>Dut</tt> references <LI>Change the <CODE>Eng</CODE> module references to <CODE>Dut</CODE> references
in all files: in all files:
<pre> ``` sed -i 's/Eng/Dut/g' *Dut.gf
sed -i 's/Eng/Dut/g' *Dut.gf <P></P>
</pre> <LI>This may of course change unwanted occurrences of the
string <CODE>Eng</CODE> - verify this by
<li> This may of course change unwanted occurrences of the ``` grep Dut *.gf
string <tt>Eng</tt> - verify this by
<pre>
grep Dut *.gf
</pre>
But you will have to make lots of manual changes in all files anyway! But you will have to make lots of manual changes in all files anyway!
<P></P>
<li> Comment out the contents of these files: <LI>Comment out the contents of these files:
<pre> ``` sed -i 's/^/--/' *Dut.gf
sed -i 's/^/--/' *Dut.gf
</pre>
This will give you a set of templates out of which the grammar This will give you a set of templates out of which the grammar
will grow as you uncomment and modify the files rule by rule. will grow as you uncomment and modify the files rule by rule.
<P></P>
<li> In the file <tt>TestDut.gf</tt>, uncomment all lines except the list <LI>In the file <CODE>TestDut.gf</CODE>, uncomment all lines except the list
of inherited modules. Now you can open the grammar in GF: of inherited modules. Now you can open the grammar in GF:
<pre> ``` gf TestDut.gf
gf TestDut.gf <P></P>
</pre> <LI>From now on, you will at every step have a valid but incomplete
<li> From now on, you will at every step have a valid but incomplete
GF grammar. The GF command GF grammar. The GF command
<pre> ``` pg -printer=missing
pg -printer=missing
</pre>
tells you what exactly is missing. tells you what exactly is missing.
</OL>
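<P>
The copying, renaming, reference-changing, and commenting-out steps above can be collected into one shell function. This is only a sketch under stated assumptions: <CODE>bootstrap_lang</CODE> is a hypothetical helper, not part of GF, and it replaces <CODE>rename</CODE> and <CODE>sed -i</CODE> with a plain loop and a redirect so that it does not depend on those tools being installed. It does not uncomment anything in <CODE>TestDut.gf</CODE>; that step remains manual.
</P>

```shell
#!/bin/sh
# Sketch: clone the *Eng.gf files as commented-out templates for a new language.
# bootstrap_lang is a hypothetical helper, not a GF command.
# Usage: bootstrap_lang SRC_DIR SRC_CODE NEW_DIR NEW_CODE
bootstrap_lang() {
    src_dir=$1; src_code=$2; new_dir=$3; new_code=$4
    mkdir -p "$new_dir" || return 1
    for src in "$src_dir"/*"$src_code".gf; do
        [ -e "$src" ] || continue                    # glob matched nothing
        base=$(basename "$src" "$src_code.gf")       # NounEng.gf -> Noun
        dst="$new_dir/$base$new_code.gf"             # -> NounDut.gf
        # Replace every occurrence of the old code, then comment out each line.
        sed -e "s/$src_code/$new_code/g" -e 's/^/--/' "$src" > "$dst"
    done
}
```

<P>
From inside the new <CODE>dutch</CODE> directory, the call would be <CODE>bootstrap_lang ../english Eng . Dut</CODE>. As with the manual steps, the blanket replacement can hit unwanted occurrences of <CODE>Eng</CODE>, so checking the result with <CODE>grep Dut *.gf</CODE> is still advisable.
</P>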
</ol> <A NAME="toc8"></A>
<H3>The develop-test cycle</H3>
<P>
<h3>The develop-test cycle</h3> The real work starts now. The order in which the <CODE>Phrase</CODE> modules
The real work starts now. The order in which the <tt>Phrase</tt> modules
were introduced above is a natural order to proceed, even though not the were introduced above is a natural order to proceed, even though not the
only one. So you will find yourself iterating the following steps: only one. So you will find yourself iterating the following steps:
</P>
<ol> <OL>
<li> Select a phrase category module, e.g. <tt>NounDut</tt>, and uncomment one <LI>Select a phrase category module, e.g. <CODE>NounDut</CODE>, and uncomment one
linearization rule (for instance, <tt>IndefSg</tt>, which is linearization rule (for instance, <CODE>IndefSg</CODE>, which is
not too complicated). not too complicated).
<P></P>
<li> Write down some Dutch examples of this rule, in this case translations <LI>Write down some Dutch examples of this rule, in this case translations
of "a dog", "a house", "a big house", etc. of "a dog", "a house", "a big house", etc.
<P></P>
<li> Think about the categories involved (<tt>CN, NP, N</tt>) and the <LI>Think about the categories involved (<CODE>CN, NP, N</CODE>) and the
variations they have. Encode this in the lincats of <tt>CatDut</tt>. variations they have. Encode this in the lincats of <CODE>CatDut</CODE>.
You may have to define some new parameter types in <tt>ResDut</tt>. You may have to define some new parameter types in <CODE>ResDut</CODE>.
<P></P>
<li> To be able to test the construction, <LI>To be able to test the construction,
define some words you need to instantiate it define some words you need to instantiate it
in <tt>LexDut</tt>. Again, it can be helpful to define some simple-minded in <CODE>LexDut</CODE>. Again, it can be helpful to define some simple-minded
morphological paradigms in <tt>ResDut</tt>, in particular worst-case morphological paradigms in <CODE>ResDut</CODE>, in particular worst-case
constructors corresponding to e.g. constructors corresponding to e.g.
<tt>ResEng.mkNoun</tt>. <CODE>ResEng.mkNoun</CODE>.
<P></P>
<li> Doing this, you may want to test the resource independently. Do this by <LI>Doing this, you may want to test the resource independently. Do this by
<pre> ```
i -retain ResDut i -retain ResDut
cc mkNoun "ei" "eieren" Neutr cc mkNoun "ei" "eieren" Neutr
</pre> ```
<P></P>
<li> Uncomment <tt>NounDut</tt> and <tt>LexDut</tt> in <tt>TestDut</tt>, <LI>Uncomment <CODE>NounDut</CODE> and <CODE>LexDut</CODE> in <CODE>TestDut</CODE>,
and compile <tt>TestDut</tt> in GF. Then test by parsing, linearization, and compile <CODE>TestDut</CODE> in GF. Then test by parsing, linearization,
and random generation. In particular, linearization to a table should and random generation. In particular, linearization to a table should
be used so that you see all forms produced: be used so that you see all forms produced:
<pre> ```
gr -cat=NP -number=20 -tr | l -table gr -cat=NP -number=20 -tr | l -table
</pre> ```
<P></P>
<li> Save some tree-linearization pairs for later regression testing. <LI>Save some tree-linearization pairs for later regression testing.
You can do this way (!!to be completed) You can do this way (!!to be completed)
</OL>
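<P>
The document leaves the regression-testing procedure open. One possible approach, given here only as a sketch, is to keep the saved tree-linearization pairs in a plain text file and compare it with freshly produced output. <CODE>check_regression</CODE> is a hypothetical helper, not a GF command, and how the two files are produced from GF is left as an assumption.
</P>

```shell
#!/bin/sh
# Sketch: compare current output against a saved "gold" file.
# check_regression is a hypothetical helper; both arguments are plain
# text files with one tree or linearization per line.
check_regression() {
    gold=$1; current=$2
    if diff -u "$gold" "$current"; then
        echo "regression check passed"
    else
        echo "regression check FAILED: output differs from $gold" >&2
        return 1
    fi
}
```

<P>
A failing check prints the differing lines, which points directly at the linearization rules whose output has changed since the pairs were saved.
</P>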
</ol> <P>
You are likely to run this cycle a few times for each linearization rule You are likely to run this cycle a few times for each linearization rule
you implement, and some hundreds of times altogether. There are 159 you implement, and some hundreds of times altogether. There are 159
<tt>funs</tt> in <tt>Test</tt> (at the moment). <CODE>funs</CODE> in <CODE>Test</CODE> (at the moment).
</P>
<p> <P>
Of course, you don't need to complete one phrase category module before starting Of course, you don't need to complete one phrase category module before starting
with the next one. Actually, a suitable subset of <tt>Noun</tt>, with the next one. Actually, a suitable subset of <CODE>Noun</CODE>,
<tt>Verb</tt>, and <tt>Adjective</tt> will lead to a reasonable coverage <CODE>Verb</CODE>, and <CODE>Adjective</CODE> will lead to a reasonable coverage
very soon, keep you motivated, and reveal errors. very soon, keep you motivated, and reveal errors.
</P>
<A NAME="toc9"></A>
<h3>Resource modules used</h3> <H3>Resource modules used</H3>
<P>
These modules will be written by you. These modules will be written by you.
<ul> </P>
<li> <tt>ResDut</tt>: parameter types and auxiliary operations <UL>
<li> <tt>MorphoDut</tt>: complete inflection engine; not needed for <tt>Test</tt>. <LI><CODE>ResDut</CODE>: parameter types and auxiliary operations
</ul> <LI><CODE>MorphoDut</CODE>: complete inflection engine; not needed for <CODE>Test</CODE>.
</UL>
<P>
These modules are language-independent and provided by the existing resource These modules are language-independent and provided by the existing resource
package. package.
<ul> </P>
<li> <tt>ParamX</tt>: parameter types used in many languages <UL>
<li> <tt>TenseX</tt>: implementation of the logical tense, anteriority, <LI><CODE>ParamX</CODE>: parameter types used in many languages
<LI><CODE>TenseX</CODE>: implementation of the logical tense, anteriority,
and polarity parameters and polarity parameters
<li> <tt>Coordination</tt>: operations to deal with lists and coordination <LI><CODE>Coordination</CODE>: operations to deal with lists and coordination
<li> <tt>Prelude</tt>: general-purpose operations on strings, records, <LI><CODE>Prelude</CODE>: general-purpose operations on strings, records,
truth values, etc. truth values, etc.
<li> <tt>Predefined</tt>: general-purpose operations with hard-coded definitions <LI><CODE>Predefined</CODE>: general-purpose operations with hard-coded definitions
</ul> </UL>
<A NAME="toc10"></A>
<H3>Morphology and lexicon</H3>
<h3>Morphology and lexicon</h3> <P>
When the implementation of <CODE>Test</CODE> is complete, it is time to
When the implementation of <tt>Test</tt> is complete, it is time to
work out the lexicon files. The underlying machinery is provided in work out the lexicon files. The underlying machinery is provided in
<tt>MorphoDut</tt>, which is, in effect, your linguistic theory of <CODE>MorphoDut</CODE>, which is, in effect, your linguistic theory of
Dutch morphology. It can contain very sophisticated and complicated Dutch morphology. It can contain very sophisticated and complicated
definitions, which are not necessarily suitable for actually building a definitions, which are not necessarily suitable for actually building a
lexicon. For this purpose, you should write the module lexicon. For this purpose, you should write the module
<ul> </P>
<li> <tt>ParadigmsDut</tt>: morphological paradigms for the lexicographer. <UL>
</ul> <LI><CODE>ParadigmsDut</CODE>: morphological paradigms for the lexicographer.
</UL>
<P>
This module provides high-level ways to define the linearization of This module provides high-level ways to define the linearization of
lexical items, of categories <tt>N, A, V</tt> and their complement-taking lexical items, of categories <CODE>N, A, V</CODE> and their complement-taking
variants. variants.
</P>
<p> <P>
For ease of use, the <CODE>Paradigms</CODE> modules follow a certain
For ease of use, the <tt>Paradigms</tt> modules follow a certain naming convention. Thus they provide, for each lexical category, such as <CODE>N</CODE>,
naming convention. Thus they provide, for each lexical category, such as <tt>N</tt>,
the functions the functions
<ul> </P>
<li> <tt>mkN</tt>, for worst-case construction of <tt>N</tt>. Its type signature <UL>
<LI><CODE>mkN</CODE>, for worst-case construction of <CODE>N</CODE>. Its type signature
has the form has the form
<pre> ```
mkN : Str -> ... -> Str -> P -> ... -> Q -> N mkN : Str -&gt; ... -&gt; Str -&gt; P -&gt; ... -&gt; Q -&gt; N
</pre> ```
with as many string and parameter arguments as can ever be needed to with as many string and parameter arguments as can ever be needed to
construct an <tt>N</tt>. construct an <CODE>N</CODE>.
<li> <tt>regN</tt>, for the most common cases, with just one string argument: <LI><CODE>regN</CODE>, for the most common cases, with just one string argument:
<pre> ```
regN : Str -> N regN : Str -&gt; N
</pre> ```
<li> A language-dependent (small) set of functions to handle mild irregularities <LI>A language-dependent (small) set of functions to handle mild irregularities
and common exceptions. and common exceptions.
</ul> <P></P>
For the complement-taking variants, such as <tt>V2</tt>, we provide For the complement-taking variants, such as <CODE>V2</CODE>, we provide
<ul> <P></P>
<li> <tt>mkV2</tt>, which takes a <tt>V</tt> and all necessary arguments, such <LI><CODE>mkV2</CODE>, which takes a <CODE>V</CODE> and all necessary arguments, such
as case and preposition: as case and preposition:
<pre> ```
mkV2 : V -> Case -> Str -> V2 ; mkV2 : V -&gt; Case -&gt; Str -&gt; V2 ;
</pre> ```
<li> A language-dependent (small) set of functions to handle common special cases, <LI>A language-dependent (small) set of functions to handle common special cases,
such as direct transitive verbs: such as direct transitive verbs:
<pre> ```
dirV2 : V -> V2 ; dirV2 : V -&gt; V2 ;
-- dirV2 v = mkV2 v accusative [] -- dirV2 v = mkV2 v accusative []
</pre> ```
</ul> </UL>
<P>
The golden rule for the design of paradigms is that The golden rule for the design of paradigms is that
<ul> </P>
<li> The user will only need function applications with constants and strings, <UL>
<LI>The user will only need function applications with constants and strings,
never any records or tables. never any records or tables.
</ul> </UL>
<P>
The discipline of data abstraction moreover requires that the user of the resource The discipline of data abstraction moreover requires that the user of the resource
is not given access to parameter constructors, but only to constants that denote is not given access to parameter constructors, but only to constants that denote
them. This gives the resource grammarian the freedom to change the underlying them. This gives the resource grammarian the freedom to change the underlying
data representation if needed. It means that the <tt>ParadigmsDut</tt> module has data representation if needed. It means that the <CODE>ParadigmsDut</CODE> module has
to define constants for those parameter types and constructors that to define constants for those parameter types and constructors that
the application grammarian may need to use, e.g. the application grammarian may need to use, e.g.
<pre> </P>
<PRE>
oper oper
Case : Type ; Case : Type ;
nominative, accusative, genitive : Case ; nominative, accusative, genitive : Case ;
</pre> </PRE>
<P>
These constants are defined in terms of parameter types and constructors These constants are defined in terms of parameter types and constructors
in <tt>ResDut</tt> and <tt>MorphoDut</tt>, modules that are not in <CODE>ResDut</CODE> and <CODE>MorphoDut</CODE>, modules that are not
accessible to the application grammarian. accessible to the application grammarian.
</P>
<A NAME="toc11"></A>
<h3>Lock fields</h3> <H3>Lock fields</H3>
<P>
An important difference between <tt>MorphoDut</tt> and An important difference between <CODE>MorphoDut</CODE> and
<tt>ParadigmsDut</tt> is that the former uses "raw" record types <CODE>ParadigmsDut</CODE> is that the former uses "raw" record types
as lincats, whereas the latter uses category symbols defined in as lincats, whereas the latter uses category symbols defined in
<tt>CatDut</tt>. When these category symbols are used to denote <CODE>CatDut</CODE>. When these category symbols are used to denote
record types in a resource module, such as <tt>ParadigmsDut</tt>, record types in a resource module, such as <CODE>ParadigmsDut</CODE>,
a <b>lock field</b> is added to the record, so that categories a <B>lock field</B> is added to the record, so that categories
with the same implementation are not confused with each other. with the same implementation are not confused with each other.
(This is inspired by the <tt>newtype</tt> discipline in Haskell.) (This is inspired by the <CODE>newtype</CODE> discipline in Haskell.)
For instance, the lincats of adverbs and conjunctions may be the same For instance, the lincats of adverbs and conjunctions may be the same
in <tt>CatDut</tt>: in <CODE>CatDut</CODE>:
<pre> </P>
<PRE>
lincat Adv = {s : Str} ; lincat Adv = {s : Str} ;
lincat Conj = {s : Str} ; lincat Conj = {s : Str} ;
</pre> </PRE>
<P>
But when these category symbols are used to denote their linearization But when these category symbols are used to denote their linearization
types in a resource module, these definitions are translated to types in a resource module, these definitions are translated to
<pre> </P>
<PRE>
oper Adv : Type = {s : Str ; lock_Adv : {}} ; oper Adv : Type = {s : Str ; lock_Adv : {}} ;
oper Conj : Type = {s : Str ; lock_Conj : {}} ; oper Conj : Type = {s : Str ; lock_Conj : {}} ;
</pre> </PRE>
<P>
In this way, the user of a resource grammar cannot confuse adverbs with In this way, the user of a resource grammar cannot confuse adverbs with
conjunctions. In other words, the lock fields force the type checker conjunctions. In other words, the lock fields force the type checker
to function as a grammaticality checker. to function as a grammaticality checker.
</P>
<p> <P>
When the resource grammar is <CODE>open</CODE>ed in an application grammar, the
When the resource grammar is <tt>open</tt>ed in an application grammar, the
lock fields are never seen (except possibly in type error messages), lock fields are never seen (except possibly in type error messages),
and the application grammarian should never write them herself. If she and the application grammarian should never write them herself. If she
has to do this, it is a sign that the resource grammar is incomplete, and has to do this, it is a sign that the resource grammar is incomplete, and
the proper way to proceed is to fix the resource grammar. the proper way to proceed is to fix the resource grammar.
</P>
<p> <P>
The resource grammarian has to provide the dummy lock field values The resource grammarian has to provide the dummy lock field values
in her hidden definitions of constants in <tt>Paradigms</tt>. For instance, in her hidden definitions of constants in <CODE>Paradigms</CODE>. For instance,
<pre> </P>
mkAdv : Str -> Adv ; <PRE>
-- mkAdv s = {s = s ; lock_Adv = <>} ; mkAdv : Str -&gt; Adv ;
</pre> -- mkAdv s = {s = s ; lock_Adv = &lt;&gt;} ;
</PRE>
<P></P>
<A NAME="toc12"></A>
<H3>Lexicon construction</H3>
<P>
The lexicon belonging to <CODE>LangDut</CODE> consists of two modules:
</P>
<UL>
<LI><CODE>StructuralDut</CODE>, structural words, built by directly using
<CODE>MorphoDut</CODE>.
<LI><CODE>BasicDut</CODE>, content words, built by using <CODE>ParadigmsDut</CODE>.
</UL>
<P>
<h3>Lexicon construction</h3> The reason why <CODE>MorphoDut</CODE> has to be used in <CODE>StructuralDut</CODE>
is that <CODE>ParadigmsDut</CODE> does not contain constructors for closed
The lexicon belonging to <tt>LangDut</tt> consists of two modules:
<ul>
<li> <tt>StructuralDut</tt>, structural words, built by directly using
<tt>MorphoDut</tt>.
<li> <tt>BasicDut</tt>, content words, built by using <tt>ParadigmsDut</tt>.
</ul>
The reason why <tt>MorphoDut</tt> has to be used in <tt>StructuralDut</tt>
is that <tt>ParadigmsDut</tt> does not contain constructors for closed
word classes such as pronouns and determiners. The reason why we word classes such as pronouns and determiners. The reason why we
recommend <tt>ParadigmsDut</tt> for building <tt>BasicDut</tt> is that recommend <CODE>ParadigmsDut</CODE> for building <CODE>BasicDut</CODE> is that
the coverage of the paradigms gets thereby tested and that the the coverage of the paradigms gets thereby tested and that the
use of the paradigms in <tt>BasicDut</tt> gives a good set of examples for use of the paradigms in <CODE>BasicDut</CODE> gives a good set of examples for
those who want to build new lexica. those who want to build new lexica.
</P>
<A NAME="toc13"></A>
<H2>Inside phrase category modules</H2>
<A NAME="toc14"></A>
<h2>Inside phrase category modules</h2> <H3>Noun</H3>
<A NAME="toc15"></A>
<h3>Noun</h3> <H3>Verb</H3>
<A NAME="toc16"></A>
<h3>Verb</h3> <H3>Adjective</H3>
<A NAME="toc17"></A>
<h3>Adjective</h3> <H2>Lexicon extension</H2>
<A NAME="toc18"></A>
<H3>The irregularity lexicon</H3>
<h2>Lexicon extension</h2> <P>
<h3>The irregularity lexicon</h3>
It may be handy to provide a separate module of irregular It may be handy to provide a separate module of irregular
verbs and other words which are difficult for a lexicographer verbs and other words which are difficult for a lexicographer
to handle. There are usually a limited number of such words - a to handle. There are usually a limited number of such words - a
few hundred perhaps. Building such a lexicon separately also few hundred perhaps. Building such a lexicon separately also
makes it less important to cover <i>everything</i> by the makes it less important to cover <I>everything</I> by the
worst-case paradigms (<tt>mkV</tt> etc). worst-case paradigms (<CODE>mkV</CODE> etc).
</P>
<A NAME="toc19"></A>
<H3>Lexicon extraction from a word list</H3>
<h3>Lexicon extraction from a word list</h3> <P>
You can often find resources such as lists of You can often find resources such as lists of
irregular verbs on the internet. For instance, the irregular verbs on the internet. For instance, the
<a href="http://www.dutchtrav.com/gram/irrverbs.html"> <A HREF="http://www.dutchtrav.com/gram/irrverbs.html">Dutch for Travelers</A>
Dutch for Travelers</a> page gives a list of verbs in the page gives a list of verbs in the
traditional tabular format, which begins as follows: traditional tabular format, which begins as follows:
<pre> </P>
<PRE>
begrijpen begrijp begreep begrepen to understand begrijpen begrijp begreep begrepen to understand
bijten bijt beet gebeten to bite bijten bijt beet gebeten to bite
binden bind bond gebonden to tie binden bind bond gebonden to tie
breken breek brak gebroken to break breken breek brak gebroken to break
</pre> </PRE>
<P>
All you have to do is to write a suitable verb paradigm All you have to do is to write a suitable verb paradigm
<pre> </P>
irregV : Str -> Str -> Str -> Str -> V ; <PRE>
</pre> irregV : Str -&gt; Str -&gt; Str -&gt; Str -&gt; V ;
</PRE>
<P>
and a Perl or Python or Haskell script that transforms and a Perl or Python or Haskell script that transforms
the table to the table to
<pre> </P>
<PRE>
begrijpen_V = irregV "begrijpen" "begrijp" "begreep" "begrepen" ; begrijpen_V = irregV "begrijpen" "begrijp" "begreep" "begrepen" ;
bijten_V = irregV "bijten" "bijt" "beet" "gebeten" ; bijten_V = irregV "bijten" "bijt" "beet" "gebeten" ;
binden_V = irregV "binden" "bind" "bond" "gebonden" ; binden_V = irregV "binden" "bind" "bond" "gebonden" ;
</pre> </PRE>
<P>
(You may want to use the English translation for some purpose, as well.) (You may want to use the English translation for some purpose, as well.)
</P>
<p> <P>
When using ready-made word lists, you should think about When using ready-made word lists, you should think about
copyright issues. Ideally, all resource grammar material should copyright issues. Ideally, all resource grammar material should
be provided under GNU General Public License. be provided under GNU General Public License.
</P>
<A NAME="toc20"></A>
<H3>Lexicon extraction from raw text data</H3>
<h3>Lexicon extraction from raw text data</h3> <P>
This is a cheap technique to build a lexicon of thousands This is a cheap technique to build a lexicon of thousands
of words, if text data is available in digital format. of words, if text data is available in digital format.
See the <a href="http://www.cs.chalmers.se/~markus/FM/"> See the <A HREF="http://www.cs.chalmers.se/~markus/FM/">Functional Morphology</A>
Functional Morphology</a> homepage for details. homepage for details.
</P>
<A NAME="toc21"></A>
<H3>Extending the resource grammar API</H3>
<h3>Extending the resource grammar API</h3> <P>
Sooner or later it will happen that the resource grammar API Sooner or later it will happen that the resource grammar API
does not suffice for all applications. A common reason is does not suffice for all applications. A common reason is
that it does not include idiomatic expressions in a given language. that it does not include idiomatic expressions in a given language.
The solution then is in the first place to build language-specific The solution then is in the first place to build language-specific
extension modules. This chapter will deal with this issue. extension modules. This chapter will deal with this issue.
</P>
<A NAME="toc22"></A>
<h2>Writing an instance of parametrized resource grammar implementation</h2> <H2>Writing an instance of parametrized resource grammar implementation</H2>
<P>
Above we have looked at how a resource implementation is built by Above we have looked at how a resource implementation is built by
the copy and paste method (from English to Dutch), that is, formally the copy and paste method (from English to Dutch), that is, formally
speaking, from scratch. A more elegant solution available for speaking, from scratch. A more elegant solution available for
families of languages such as Romance and Scandinavian is to families of languages such as Romance and Scandinavian is to
use parametrized modules. The advantages are use parametrized modules. The advantages are
<ul> </P>
<li> theoretical: linguistic generalizations and insights <UL>
<li> practical: maintainability improves with fewer components <LI>theoretical: linguistic generalizations and insights
</ul> <LI>practical: maintainability improves with fewer components
</UL>
<P>
In this chapter, we will look at an example: adding Portuguese to In this chapter, we will look at an example: adding Portuguese to
the Romance family. the Romance family.
</P>
<A NAME="toc23"></A>
<H2>Parametrizing a resource grammar implementation</H2>
<h2>Parametrizing a resource grammar implementation</h2> <P>
This is the most demanding form of resource grammar writing. This is the most demanding form of resource grammar writing.
We do <i>not</i> recommend the method of parametrizing from the We do <I>not</I> recommend the method of parametrizing from the
beginning: it is easier to have one language first implemented beginning: it is easier to have one language first implemented
in the conventional way and then add another language of the in the conventional way and then add another language of the
same family by parametrization. This means that the copy and same family by parametrization. This means that the copy and
paste method is still used, but at this time the differences paste method is still used, but at this time the differences
are put into an <tt>interface</tt> module. are put into an <CODE>interface</CODE> module.
</P>
<p> <P>
This chapter will work out an example of how an Estonian grammar This chapter will work out an example of how an Estonian grammar
is constructed from the Finnish grammar through parametrization. is constructed from the Finnish grammar through parametrization.
</P>
<!-- html code generated by txt2tags 2.0 (http://txt2tags.sf.net) -->
<!-- cmdline: txt2tags -\-toc -thtml Resource-HOWTO.txt -->
</body> </BODY></HTML>
</html>

View File

@@ -0,0 +1,542 @@
Resource grammar HOWTO
Author: Aarne Ranta <aarne (at) cs.chalmers.se>
Last update: %%date(%c)
% NOTE: this is a txt2tags file.
% Create an html file from this file using:
% txt2tags Resource-HOWTO.txt
%!target:html
=HOW TO WRITE A RESOURCE GRAMMAR=
[Aarne Ranta http://www.cs.chalmers.se/~aarne/]
%%Date
The purpose of this document is to tell how to implement the GF
resource grammar API for a new language. We will //not// cover how
to use the resource grammar, nor how to change the API. But we
will give some hints on how to extend the API.
**Notice**. This document concerns the API v. 1.0 which has not
yet been released. You can find the beginnings of it
in [``GF/lib/resource-1.0/`` ..]. See the
[``resource-1.0/README`` ../README] for
details on how this differs from previous versions.
==The resource grammar API==
The API is divided into a bunch of ``abstract`` modules.
The following figure gives the dependencies of these modules.
[Lang.png]
It is advisable to start with a simpler subset of the API, which
leaves out certain complicated but not always necessary things:
tenses and most of the lexicon.
[Test.png]
The module structure is rather flat: almost every module is a direct
parent of the top module (``Lang`` or ``Test``). The idea
is that you can concentrate on one linguistic aspect at a time, or
also distribute the work among several authors.
===Phrase category modules===
The direct parents of the top could be called **phrase category modules**,
since each of them concentrates on a particular phrase category (nouns, verbs,
adjectives, sentences,...). A phrase category module tells
//how to construct phrases in that category//. You will find out that
all functions in any of these modules have the same value type (or maybe
one of a small number of different types). Thus we have
- ``Noun``: construction of nouns and noun phrases
- ``Adjective``: construction of adjectival phrases
- ``Verb``: construction of verb phrases
- ``Adverb``: construction of adverbial phrases
- ``Numeral``: construction of cardinal and ordinal numerals
- ``Sentence``: construction of sentences and imperatives
- ``Question``: construction of questions
- ``Relative``: construction of relative clauses
- ``Conjunction``: coordination of phrases
- ``Phrase``: construction of the major units of text and speech
===Infrastructure modules===
Expressions of each phrase category are constructed in the corresponding
phrase category module. But their //use// mostly takes place in other modules.
For instance, noun phrases, which are constructed in ``Noun``, are
used as arguments of functions of almost all other phrase category modules.
How can we build all these modules independently of each other?
As usual in typeful programming, the //only// thing you need to know
about an object you use is its type. When writing a linearization rule
for a GF abstract syntax function, the only thing you need to know is
the linearization types of its value and argument categories. To achieve
the division of the resource grammar into several parallel phrase category modules,
what we need is an underlying definition of the linearization types. This
definition is given as the implementation of
- ``Cat``: syntactic categories of the resource grammar
Any resource grammar implementation has first to agree on how to implement
``Cat``. Luckily enough, even this can be done incrementally: you
can skip the ``lincat`` definition of a category and use the default
``{s : Str}`` until you need to change it to something else. In
English, for instance, most categories do have this linearization type!
As a slight asymmetry in the module diagrams, you find the following
modules:
- ``Tense``: defines the parameters of polarity, anteriority, and tense
- ``Tensed``: defines how sentences use those parameters
- ``Untensed``: makes sentences use the polarity parameter only
The full resource API (``Lang``) uses ``Tensed``, whereas the
restricted ``Test`` API uses ``Untensed``.
===Lexical modules===
What is lexical and what is syntactic is not as clearcut in GF as in
some other grammar formalisms. Logically, however, lexical means
``fun`` with no arguments. Linguistically, one may add to this
that the ``lin`` consists of only one token (or of a table whose values
are single tokens). Even in the restricted lexicon included in the resource
API, the latter rule is sometimes violated in some languages.
Another characterization of lexical is that lexical units can be added
almost //ad libitum//, and they cannot be defined in terms of already
given rules. The lexical modules of the resource API are thus more like
samples than complete lists. There are three such modules:
- ``Structural``: structural words (determiners, conjunctions,...)
- ``Basic``: basic everyday content words (nouns, verbs,...)
- ``Lex``: a very small sample of both structural and content words
The module ``Structural`` aims for completeness, and is likely to
be extended in future releases of the resource. The module ``Basic``
gives a "random" list of words, which enable interesting testing of syntax,
and also a check list for morphology, since those words are likely to include
most morphological patterns of the language.
The module ``Lex`` is used in ``Test`` instead of the two
larger modules. Its purpose is to provide a quick way to test the
syntactic structures of the phrase category modules without having to implement
the larger lexica.
In the case of ``Basic`` it may come out clearer than anywhere else
in the API that it is impossible to give exact translation equivalents in
different languages on the level of a resource grammar. In other words,
application grammars are likely to use the resource in different ways for
different languages.
==Phases of the work==
===Putting up a directory===
Unless you are writing an instance of a parametrized implementation
(Romance or Scandinavian), which will be covered later, the
simplest way is to follow roughly the following procedure. Assume you
are building a grammar for the Dutch language. Here are the first steps.
+ Create a sister directory for ``GF/lib/resource/english``, named
``dutch``.
```
cd GF/lib/resource/
mkdir dutch
cd dutch
```
+ Check out the [ISO 639 3-letter language code http://www.w3.org/WAI/ER/IG/ert/iso639.htm]
for Dutch: it is ``Dut``.
+ Copy the ``*Eng.gf`` files from ``english`` to ``dutch``,
and rename them:
```
cp ../english/*Eng.gf .
rename 's/Eng/Dut/' *Eng.gf
```
+ Change the ``Eng`` module references to ``Dut`` references
in all files:
``` sed -i 's/Eng/Dut/g' *Dut.gf
+ This may of course change unwanted occurrences of the
string ``Eng`` - verify this by
``` grep Dut *.gf
But you will have to make lots of manual changes in all files anyway!
+ Comment out the contents of these files:
``` sed -i 's/^/--/' *Dut.gf
This will give you a set of templates out of which the grammar
will grow as you uncomment and modify the files rule by rule.
+ In the file ``TestDut.gf``, uncomment all lines except the list
of inherited modules. Now you can open the grammar in GF:
``` gf TestDut.gf
+ From now on you will have a valid but incomplete GF grammar at
every step. The GF command
``` pg -printer=missing
tells you what exactly is missing.
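
The shell commands above assume a Perl-style ``rename`` and a ``sed``
that supports ``-i``. Where these are not available, the copy, rename,
replace, and comment-out steps could be sketched in Python roughly as
follows. The function ``make_templates`` and its arguments are our own
illustration and not part of GF; like the ``sed`` command, it replaces
the string ``Eng`` blindly, so the result still has to be checked by hand.

```
from pathlib import Path


def make_templates(src_dir, dst_dir, old="Eng", new="Dut"):
    """Copy every *<old>.gf file from src_dir to dst_dir as *<new>.gf,
    replace the language code, and comment out every line.
    Hypothetical helper, not part of GF."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    created = []
    for path in sorted(src.glob(f"*{old}.gf")):
        # Rename the file: NounEng.gf -> NounDut.gf
        stem = path.name[: -len(old + ".gf")]
        target = dst / (stem + new + ".gf")
        # Work on bytes so that no character encoding has to be assumed.
        # Blind replacement, like sed 's/Eng/Dut/g'.
        data = path.read_bytes().replace(old.encode(), new.encode())
        # Prefix every line with the GF comment marker, like sed 's/^/--/'.
        commented = b"".join(
            b"--" + line for line in data.splitlines(keepends=True)
        )
        target.write_bytes(commented)
        created.append(target.name)
    return created
```

Called from ``GF/lib/resource/`` as ``make_templates("english", "dutch")``,
this would create the ``dutch`` directory if needed and return the list of
template files it wrote.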
===The develop-test cycle===
The real work starts now. The order in which the phrase category modules
were introduced above is a natural order in which to proceed, even though not the
only one. So you will find yourself iterating the following steps:
+ Select a phrase category module, e.g. ``NounDut``, and uncomment one
linearization rule (for instance, ``IndefSg``, which is
not too complicated).
+ Write down some Dutch examples of this rule, in this case translations
of "a dog", "a house", "a big house", etc.
+ Think about the categories involved (``CN, NP, N``) and the
variations they have. Encode this in the lincats of ``CatDut``.
You may have to define some new parameter types in ``ResDut``.
+ To be able to test the construction,
define some words you need to instantiate it
in ``LexDut``. Again, it can be helpful to define some simple-minded
morphological paradigms in ``ResDut``, in particular worst-case
constructors corresponding to e.g.
``ResEng.mkNoun``.
+ Doing this, you may want to test the resource independently. Do this by
```
i -retain ResDut
cc mkNoun "ei" "eieren" Neutr
```
+ Uncomment ``NounDut`` and ``LexDut`` in ``TestDut``,
and compile ``TestDut`` in GF. Then test by parsing, linearization,
and random generation. In particular, linearization to a table should
be used so that you see all forms produced:
```
gr -cat=NP -number=20 -tr | l -table
```
+ Spare some tree-linearization pairs for later regression testing.
You can do it this way (!!to be completed)
You are likely to run this cycle a few times for each linearization rule
you implement, and some hundreds of times altogether. There are 159
``funs`` in ``Test`` (at the moment).
Of course, you don't need to complete one phrase category module before starting
with the next one. Actually, a suitable subset of ``Noun``,
``Verb``, and ``Adjective`` will lead to a reasonable coverage
very soon, keep you motivated, and reveal errors.
===Resource modules used===
These modules will be written by you.
- ``ResDut``: parameter types and auxiliary operations
- ``MorphoDut``: complete inflection engine; not needed for ``Test``.
These modules are language-independent and provided by the existing resource
package.
- ``ParamX``: parameter types used in many languages
- ``TenseX``: implementation of the logical tense, anteriority,
and polarity parameters
- ``Coordination``: operations to deal with lists and coordination
- ``Prelude``: general-purpose operations on strings, records,
truth values, etc.
- ``Predefined``: general-purpose operations with hard-coded definitions
===Morphology and lexicon===
When the implementation of ``Test`` is complete, it is time to
work out the lexicon files. The underlying machinery is provided in
``MorphoDut``, which is, in effect, your linguistic theory of
Dutch morphology. It can contain very sophisticated and complicated
definitions, which are not necessarily suitable for actually building a
lexicon. For this purpose, you should write the module
- ``ParadigmsDut``: morphological paradigms for the lexicographer.
This module provides high-level ways to define the linearization of
lexical items, of categories ``N, A, V`` and their complement-taking
variants.
For ease of use, the ``Paradigms`` modules follow a certain
naming convention. Thus they provide, for each lexical category such as ``N``,
the functions
- ``mkN``, for worst-case construction of ``N``. Its type signature
has the form
```
mkN : Str -> ... -> Str -> P -> ... -> Q -> N
```
with as many string and parameter arguments as can ever be needed to
construct an ``N``.
- ``regN``, for the most common cases, with just one string argument:
```
regN : Str -> N
```
- A language-dependent (small) set of functions to handle mild irregularities
and common exceptions.
For the complement-taking variants, such as ``V2``, we provide
- ``mkV2``, which takes a ``V`` and all necessary arguments, such
as case and preposition:
```
mkV2 : V -> Case -> Str -> V2 ;
```
- A language-dependent (small) set of functions to handle common special cases,
such as direct transitive verbs:
```
dirV2 : V -> V2 ;
-- dirV2 v = mkV2 v accusative []
```
The golden rule for the design of paradigms is that
- The user will only need function applications with constants and strings,
never any records or tables.
The discipline of data abstraction moreover requires that the user of the resource
is not given access to parameter constructors, but only to constants that denote
them. This gives the resource grammarian the freedom to change the underlying
data representation if needed. It means that the ``ParadigmsDut`` module has
to define constants for those parameter types and constructors that
the application grammarian may need to use, e.g.
```
oper
Case : Type ;
nominative, accusative, genitive : Case ;
```
These constants are defined in terms of parameter types and constructors
in ``ResDut`` and ``MorphoDut``, modules which are not
accessible to the application grammarian.
===Lock fields===
An important difference between ``MorphoDut`` and
``ParadigmsDut`` is that the former uses "raw" record types
as lincats, whereas the latter uses category symbols defined in
``CatDut``. When these category symbols are used to denote
record types in a resource module, such as ``ParadigmsDut``,
a **lock field** is added to the record, so that categories
with the same implementation are not confused with each other.
(This is inspired by the ``newtype`` discipline in Haskell.)
For instance, the lincats of adverbs and conjunctions may be the same
in ``CatDut``:
```
lincat Adv = {s : Str} ;
lincat Conj = {s : Str} ;
```
But when these category symbols are used to denote their linearization
types in a resource module, these definitions are translated to
```
oper Adv : Type = {s : Str ; lock_Adv : {}} ;
oper Conj : Type = {s : Str ; lock_Conj : {}} ;
```
In this way, the user of a resource grammar cannot confuse adverbs with
conjunctions. In other words, the lock fields force the type checker
to function as a grammaticality checker.
When the resource grammar is ``open``ed in an application grammar, the
lock fields are never seen (except possibly in type error messages),
and the application grammarian should never write them herself. If she
has to do this, it is a sign that the resource grammar is incomplete, and
the proper way to proceed is to fix the resource grammar.
The resource grammarian has to provide the dummy lock field values
in her hidden definitions of constants in ``Paradigms``. For instance,
```
mkAdv : Str -> Adv ;
-- mkAdv s = {s = s ; lock_Adv = <>} ;
```
===Lexicon construction===
The lexicon belonging to ``LangDut`` consists of two modules:
- ``StructuralDut``, structural words, built by directly using
``MorphoDut``.
- ``BasicDut``, content words, built by using ``ParadigmsDut``.
The reason why ``MorphoDut`` has to be used in ``StructuralDut``
is that ``ParadigmsDut`` does not contain constructors for closed
word classes such as pronouns and determiners. The reason why we
recommend ``ParadigmsDut`` for building ``BasicDut`` is that
the coverage of the paradigms gets thereby tested and that the
use of the paradigms in ``BasicDut`` gives a good set of examples for
those who want to build new lexica.
==Inside phrase category modules==
===Noun===
===Verb===
===Adjective===
==Lexicon extension==
===The irregularity lexicon===
It may be handy to provide a separate module of irregular
verbs and other words which are difficult for a lexicographer
to handle. There are usually a limited number of such words - a
few hundred perhaps. Building such a lexicon separately also
makes it less important to cover //everything// by the
worst-case paradigms (``mkV`` etc).
===Lexicon extraction from a word list===
You can often find resources such as lists of
irregular verbs on the internet. For instance, the
[Dutch for Travelers http://www.dutchtrav.com/gram/irrverbs.html]
page gives a list of verbs in the
traditional tabular format, which begins as follows:
```
begrijpen begrijp begreep begrepen to understand
bijten bijt beet gebeten to bite
binden bind bond gebonden to tie
breken breek brak gebroken to break
```
All you have to do is to write a suitable verb paradigm
```
irregV : Str -> Str -> Str -> Str -> V ;
```
and a Perl or Python or Haskell script that transforms
the table to
```
begrijpen_V = irregV "begrijpen" "begrijp" "begreep" "begrepen" ;
bijten_V = irregV "bijten" "bijt" "beet" "gebeten" ;
binden_V = irregV "binden" "bind" "bond" "gebonden" ;
```
(You may want to use the English translation for some purpose, as well.)
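
The conversion script can be very small. The following Python sketch
assumes that the four verb forms are the first four whitespace-separated
fields of each line and that everything after them is the English gloss.
The function name ``table_to_gf`` and the ``keep_gloss`` option are our
own, and a real word list will probably need more cleaning than this.

```
def table_to_gf(lines, keep_gloss=False):
    """Turn lines such as 'bijten bijt beet gebeten to bite' into
    GF lexicon entries that use the irregV paradigm."""
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        forms = fields[:4]
        gloss = " ".join(fields[4:])
        # Quote each form as a GF string literal.
        args = " ".join('"%s"' % form for form in forms)
        # The infinitive (first form) gives the name of the constant.
        entry = "%s_V = irregV %s ;" % (forms[0], args)
        if keep_gloss and gloss:
            # Keep the English translation as a GF comment.
            entry += " -- " + gloss
        entries.append(entry)
    return entries


table = """\
begrijpen begrijp begreep begrepen to understand
bijten bijt beet gebeten to bite
binden bind bond gebonden to tie
"""

for entry in table_to_gf(table.splitlines()):
    print(entry)
```

With ``keep_gloss=True`` the English translation is kept as a GF comment
at the end of each entry, which is one simple way of not losing it.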
When using ready-made word lists, you should think about
copyright issues. Ideally, all resource grammar material should
be provided under the GNU General Public License.
===Lexicon extraction from raw text data===
This is a cheap technique to build a lexicon of thousands
of words, if text data is available in digital format.
See the [Functional Morphology http://www.cs.chalmers.se/~markus/FM/]
homepage for details.
===Extending the resource grammar API===
Sooner or later it will happen that the resource grammar API
does not suffice for all applications. A common reason is
that it does not include idiomatic expressions in a given language.
The solution then is in the first place to build language-specific
extension modules. This chapter will deal with this issue.
==Writing an instance of parametrized resource grammar implementation==
Above we have looked at how a resource implementation is built by
the copy and paste method (from English to Dutch), that is, formally
speaking, from scratch. A more elegant solution available for
families of languages such as Romance and Scandinavian is to
use parametrized modules. The advantages are
- theoretical: linguistic generalizations and insights
- practical: maintainability improves with fewer components
In this chapter, we will look at an example: adding Portuguese to
the Romance family.
==Parametrizing a resource grammar implementation==
This is the most demanding form of resource grammar writing.
We do //not// recommend the method of parametrizing from the
beginning: it is easier to have one language first implemented
in the conventional way and then add another language of the
same family by parametrization. This means that the copy and
paste method is still used, but this time the differences
are put into an ``interface`` module.
This chapter will work out an example of how an Estonian grammar
is constructed from the Finnish grammar through parametrization.