forked from GitHub/gf-rgl

updated tutorial and resource howto

aarne
2006-06-15 23:05:42 +00:00
parent 6065e73738
commit 17c3861c55
2 changed files with 190 additions and 16 deletions


@@ -7,9 +7,56 @@
<P ALIGN="center"><CENTER><H1>Resource grammar writing HOWTO</H1> <P ALIGN="center"><CENTER><H1>Resource grammar writing HOWTO</H1>
<FONT SIZE="4"> <FONT SIZE="4">
<I>Author: Aarne Ranta &lt;aarne (at) cs.chalmers.se&gt;</I><BR> <I>Author: Aarne Ranta &lt;aarne (at) cs.chalmers.se&gt;</I><BR>
Last update: Fri May 26 17:36:48 2006 Last update: Fri Jun 16 00:59:52 2006
</FONT></CENTER> </FONT></CENTER>
<P></P>
<HR NOSHADE SIZE=1>
<P></P>
<UL>
<LI><A HREF="#toc1">The resource grammar API</A>
<UL>
<LI><A HREF="#toc2">Phrase category modules</A>
<LI><A HREF="#toc3">Infrastructure modules</A>
<LI><A HREF="#toc4">Lexical modules</A>
</UL>
<LI><A HREF="#toc5">Language-dependent syntax modules</A>
<LI><A HREF="#toc6">The core of the syntax</A>
<UL>
<LI><A HREF="#toc7">Another reduced API</A>
<LI><A HREF="#toc8">The present-tense fragment</A>
</UL>
<LI><A HREF="#toc9">Phases of the work</A>
<UL>
<LI><A HREF="#toc10">Putting up a directory</A>
<LI><A HREF="#toc11">Direction of work</A>
<LI><A HREF="#toc12">The develop-test cycle</A>
<LI><A HREF="#toc13">Resource modules used</A>
<LI><A HREF="#toc14">Morphology and lexicon</A>
<LI><A HREF="#toc15">Lock fields</A>
<LI><A HREF="#toc16">Lexicon construction</A>
</UL>
<LI><A HREF="#toc17">Inside grammar modules</A>
<UL>
<LI><A HREF="#toc18">The category system</A>
<LI><A HREF="#toc19">Phrase category modules</A>
<LI><A HREF="#toc20">Resource modules</A>
<LI><A HREF="#toc21">Lexicon</A>
</UL>
<LI><A HREF="#toc22">Lexicon extension</A>
<UL>
<LI><A HREF="#toc23">The irregularity lexicon</A>
<LI><A HREF="#toc24">Lexicon extraction from a word list</A>
<LI><A HREF="#toc25">Lexicon extraction from raw text data</A>
<LI><A HREF="#toc26">Extending the resource grammar API</A>
</UL>
<LI><A HREF="#toc27">Writing an instance of parametrized resource grammar implementation</A>
<LI><A HREF="#toc28">Parametrizing a resource grammar implementation</A>
</UL>
<P></P>
<HR NOSHADE SIZE=1>
<P></P>
<P> <P>
The purpose of this document is to tell how to implement the GF The purpose of this document is to tell how to implement the GF
resource grammar API for a new language. We will <I>not</I> cover how resource grammar API for a new language. We will <I>not</I> cover how
@@ -17,23 +64,43 @@ to use the resource grammar, nor how to change the API. But we
will give some hints how to extend the API. will give some hints how to extend the API.
</P> </P>
<P> <P>
<B>Notice</B>. This document concerns the API v. 1.0 which has not A manual for using the resource grammar is found in
yet been released. You can find the current code </P>
in <A HREF=".."><CODE>GF/lib/resource-1.0/</CODE></A>. See the <P>
<A HREF="../README"><CODE>resource-1.0/README</CODE></A> for <A HREF="../../../doc/resource.pdf"><CODE>http://www.cs.chalmers.se/~aarne/GF/doc/resource.pdf</CODE></A>.
</P>
<P>
A tutorial on GF, also introducing the idea of resource grammars, is found in
</P>
<P>
<A HREF="../../../doc/tutorial/gf-tutorial2.html"><CODE>http://www.cs.chalmers.se/~aarne/GF/doc/tutorial/gf-tutorial2.html</CODE></A>.
</P>
<P>
This document concerns the API v. 1.0. You can find the current code in
</P>
<P>
<A HREF=".."><CODE>http://www.cs.chalmers.se/~aarne/GF/lib/resource-1.0/</CODE></A>
</P>
<P>
See the <A HREF="../README"><CODE>README</CODE></A> for
details on how this differs from previous versions. details on how this differs from previous versions.
</P> </P>
<A NAME="toc1"></A>
<H2>The resource grammar API</H2> <H2>The resource grammar API</H2>
<P> <P>
The API is divided into a bunch of <CODE>abstract</CODE> modules. The API is divided into a bunch of <CODE>abstract</CODE> modules.
The following figure gives the dependencies of these modules. The following figure gives the dependencies of these modules.
</P> </P>
<P> <P>
<IMG ALIGN="left" SRC="Lang.png" BORDER="0" ALT=""> <IMG ALIGN="left" SRC="Grammar.png" BORDER="0" ALT="">
</P> </P>
<P> <P>
The module structure is rather flat: almost every module is a direct Thus the API consists of a grammar and a lexicon, which is
parent of the top module <CODE>Lang</CODE>. The idea provided for test purposes.
</P>
<P>
The module structure is rather flat: most modules are direct
parents of <CODE>Grammar</CODE>. The idea
is that you can concentrate on one linguistic aspect at a time, or is that you can concentrate on one linguistic aspect at a time, or
also distribute the work among several authors. The module <CODE>Cat</CODE> also distribute the work among several authors. The module <CODE>Cat</CODE>
defines the "glue" that ties the aspects together - a type system defines the "glue" that ties the aspects together - a type system
@@ -41,6 +108,7 @@ to which all the other modules conform, so that e.g. <CODE>NP</CODE> means
the same thing in those modules that use <CODE>NP</CODE>s and those that the same thing in those modules that use <CODE>NP</CODE>s and those that
construct them. construct them.
</P> </P>
<A NAME="toc2"></A>
<H3>Phrase category modules</H3> <H3>Phrase category modules</H3>
<P> <P>
The direct parents of the top will be called <B>phrase category modules</B>, The direct parents of the top will be called <B>phrase category modules</B>,
@@ -65,6 +133,7 @@ one of a small number of different types). Thus we have
<LI><CODE>Idiom</CODE>: idiomatic phrases such as existentials <LI><CODE>Idiom</CODE>: idiomatic phrases such as existentials
</UL> </UL>
<A NAME="toc3"></A>
<H3>Infrastructure modules</H3> <H3>Infrastructure modules</H3>
<P> <P>
Expressions of each phrase category are constructed in the corresponding Expressions of each phrase category are constructed in the corresponding
@@ -93,6 +162,7 @@ can skip the <CODE>lincat</CODE> definition of a category and use the default
<CODE>{s : Str}</CODE> until you need to change it to something else. In <CODE>{s : Str}</CODE> until you need to change it to something else. In
English, for instance, many categories do have this linearization type. English, for instance, many categories do have this linearization type.
</P> </P>
<A NAME="toc4"></A>
<H3>Lexical modules</H3> <H3>Lexical modules</H3>
<P> <P>
What is lexical and what is syntactic is not as clearcut in GF as in What is lexical and what is syntactic is not as clearcut in GF as in
@@ -129,6 +199,45 @@ different languages on the level of a resource grammar. In other words,
application grammars are likely to use the resource in different ways for application grammars are likely to use the resource in different ways for
different languages. different languages.
</P> </P>
<A NAME="toc5"></A>
<H2>Language-dependent syntax modules</H2>
<P>
In addition to the common API, there is room for language-dependent extensions
of the resource. The top level of each language looks as follows (with English as an example):
</P>
<PRE>
abstract English = Grammar, ExtraEngAbs, DictEngAbs
</PRE>
<P>
where <CODE>ExtraEngAbs</CODE> is a collection of syntactic structures specific to English,
and <CODE>DictEngAbs</CODE> is an English dictionary
(at the moment, it consists of <CODE>IrregEngAbs</CODE>,
the irregular verbs of English). Each of these language-specific grammars has
the potential to grow into a full-scale grammar of the language. These grammars
can also be used as libraries, but the possibility of using functors is lost.
</P>
<P>
To give a better overview of language-specific structures,
modules like <CODE>ExtraEngAbs</CODE>
are built from a language-independent module <CODE>ExtraAbs</CODE>
by restricted inheritance:
</P>
<PRE>
abstract ExtraEngAbs = Extra [f,g,...]
</PRE>
<P>
Thus any category and function in <CODE>Extra</CODE> may be shared by a subset of all
languages. One can see this set-up as a matrix, which tells
what <CODE>Extra</CODE> structures
are implemented in what languages. For the common API in <CODE>Grammar</CODE>, the matrix
is filled with 1's (everything is implemented in every language).
</P>
<P>
In a minimal resource grammar implementation, the language-dependent
extensions are just empty modules, but it is good to provide them for
the sake of uniformity.
</P>
<A NAME="toc6"></A>
<H2>The core of the syntax</H2> <H2>The core of the syntax</H2>
<P> <P>
Among all categories and functions, a handful are Among all categories and functions, a handful are
@@ -153,6 +262,7 @@ rules relate the categories to each other. It is intended to be a
first approximation when designing the parameter system of a new first approximation when designing the parameter system of a new
language. language.
</P> </P>
<A NAME="toc7"></A>
<H3>Another reduced API</H3> <H3>Another reduced API</H3>
<P> <P>
If you want to experiment with a small subset of the resource API first, If you want to experiment with a small subset of the resource API first,
@@ -161,6 +271,7 @@ try out the module
explained in the explained in the
<A HREF="http://www.cs.chalmers.se/~aarne/GF/doc/tutorial/gf-tutorial2.html">GF Tutorial</A>. <A HREF="http://www.cs.chalmers.se/~aarne/GF/doc/tutorial/gf-tutorial2.html">GF Tutorial</A>.
</P> </P>
<A NAME="toc8"></A>
<H3>The present-tense fragment</H3> <H3>The present-tense fragment</H3>
<P> <P>
Some lines in the resource library are suffixed with the comment Some lines in the resource library are suffixed with the comment
@@ -176,7 +287,9 @@ implementation. To compile a grammar with present-tense-only, use
i -preproc=GF/lib/resource-1.0/mkPresent LangGer.gf i -preproc=GF/lib/resource-1.0/mkPresent LangGer.gf
</PRE> </PRE>
<P></P> <P></P>
<A NAME="toc9"></A>
<H2>Phases of the work</H2> <H2>Phases of the work</H2>
<A NAME="toc10"></A>
<H3>Putting up a directory</H3> <H3>Putting up a directory</H3>
<P> <P>
Unless you are writing an instance of a parametrized implementation Unless you are writing an instance of a parametrized implementation
@@ -262,6 +375,7 @@ as e.g. <CODE>VerbGer</CODE>.
<P> <P>
<IMG ALIGN="middle" SRC="German.png" BORDER="0" ALT=""> <IMG ALIGN="middle" SRC="German.png" BORDER="0" ALT="">
</P> </P>
<A NAME="toc11"></A>
<H3>Direction of work</H3> <H3>Direction of work</H3>
<P> <P>
The real work starts now. There are many ways to proceed, the main ones being The real work starts now. There are many ways to proceed, the main ones being
@@ -360,6 +474,7 @@ and dependences there are in your language, and you can now produce very
much in the order you please. much in the order you please.
</OL> </OL>
<A NAME="toc12"></A>
<H3>The develop-test cycle</H3> <H3>The develop-test cycle</H3>
<P> <P>
The following develop-test cycle will The following develop-test cycle will
@@ -416,6 +531,7 @@ follow soon. (You will find out that these explanations involve
a rational reconstruction of the live process! Among other things, the a rational reconstruction of the live process! Among other things, the
API was changed during the actual process to make it more intuitive.) API was changed during the actual process to make it more intuitive.)
</P> </P>
<A NAME="toc13"></A>
<H3>Resource modules used</H3> <H3>Resource modules used</H3>
<P> <P>
These modules will be written by you. These modules will be written by you.
@@ -472,6 +588,7 @@ almost everything. This led in practice to the duplication of almost
all code on the <CODE>lin</CODE> and <CODE>oper</CODE> levels, and made the code all code on the <CODE>lin</CODE> and <CODE>oper</CODE> levels, and made the code
hard to understand and maintain. hard to understand and maintain.
</P> </P>
<A NAME="toc14"></A>
<H3>Morphology and lexicon</H3> <H3>Morphology and lexicon</H3>
<P> <P>
The paradigms needed to implement The paradigms needed to implement
@@ -542,6 +659,7 @@ These constants are defined in terms of parameter types and constructors
in <CODE>ResGer</CODE> and <CODE>MorphoGer</CODE>, which modules are not in <CODE>ResGer</CODE> and <CODE>MorphoGer</CODE>, which modules are not
visible to the application grammarian. visible to the application grammarian.
</P> </P>
<A NAME="toc15"></A>
<H3>Lock fields</H3> <H3>Lock fields</H3>
<P> <P>
An important difference between <CODE>MorphoGer</CODE> and An important difference between <CODE>MorphoGer</CODE> and
@@ -588,6 +706,7 @@ in her hidden definitions of constants in <CODE>Paradigms</CODE>. For instance,
-- mkAdv s = {s = s ; lock_Adv = &lt;&gt;} ; -- mkAdv s = {s = s ; lock_Adv = &lt;&gt;} ;
</PRE> </PRE>
<P></P> <P></P>
<A NAME="toc16"></A>
<H3>Lexicon construction</H3> <H3>Lexicon construction</H3>
<P> <P>
The lexicon belonging to <CODE>LangGer</CODE> consists of two modules: The lexicon belonging to <CODE>LangGer</CODE> consists of two modules:
@@ -607,17 +726,20 @@ the coverage of the paradigms gets thereby tested and that the
use of the paradigms in <CODE>LexiconGer</CODE> gives a good set of examples for use of the paradigms in <CODE>LexiconGer</CODE> gives a good set of examples for
those who want to build new lexica. those who want to build new lexica.
</P> </P>
<A NAME="toc17"></A>
<H2>Inside grammar modules</H2> <H2>Inside grammar modules</H2>
<P> <P>
Detailed implementation tricks Detailed implementation tricks
are found in the comments of each module. are found in the comments of each module.
</P> </P>
<A NAME="toc18"></A>
<H3>The category system</H3> <H3>The category system</H3>
<UL> <UL>
<LI><A HREF="gfdoc/Common.html">Common</A>, <A HREF="../common/CommonX.gf">CommonX</A> <LI><A HREF="gfdoc/Common.html">Common</A>, <A HREF="../common/CommonX.gf">CommonX</A>
<LI><A HREF="gfdoc/Cat.html">Cat</A>, <A HREF="gfdoc/CatGer.gf">CatGer</A> <LI><A HREF="gfdoc/Cat.html">Cat</A>, <A HREF="gfdoc/CatGer.gf">CatGer</A>
</UL> </UL>
<A NAME="toc19"></A>
<H3>Phrase category modules</H3> <H3>Phrase category modules</H3>
<UL> <UL>
<LI><A HREF="gfdoc/Noun.html">Noun</A>, <A HREF="../german/NounGer.gf">NounGer</A> <LI><A HREF="gfdoc/Noun.html">Noun</A>, <A HREF="../german/NounGer.gf">NounGer</A>
@@ -635,6 +757,7 @@ are found in the comments of each module.
<LI><A HREF="gfdoc/Lang.html">Lang</A>, <A HREF="../german/LangGer.gf">LangGer</A> <LI><A HREF="gfdoc/Lang.html">Lang</A>, <A HREF="../german/LangGer.gf">LangGer</A>
</UL> </UL>
<A NAME="toc20"></A>
<H3>Resource modules</H3> <H3>Resource modules</H3>
<UL> <UL>
<LI><A HREF="../german/ResGer.gf">ResGer</A> <LI><A HREF="../german/ResGer.gf">ResGer</A>
@@ -642,13 +765,16 @@ are found in the comments of each module.
<LI><A HREF="gfdoc/ParadigmsGer.html">ParadigmsGer</A>, <A HREF="../german/ParadigmsGer.gf">ParadigmsGer.gf</A> <LI><A HREF="gfdoc/ParadigmsGer.html">ParadigmsGer</A>, <A HREF="../german/ParadigmsGer.gf">ParadigmsGer.gf</A>
</UL> </UL>
<A NAME="toc21"></A>
<H3>Lexicon</H3> <H3>Lexicon</H3>
<UL> <UL>
<LI><A HREF="gfdoc/Structural.html">Structural</A>, <A HREF="../german/StructuralGer.gf">StructuralGer</A> <LI><A HREF="gfdoc/Structural.html">Structural</A>, <A HREF="../german/StructuralGer.gf">StructuralGer</A>
<LI><A HREF="gfdoc/Lexicon.html">Lexicon</A>, <A HREF="../german/LexiconGer.gf">LexiconGer</A> <LI><A HREF="gfdoc/Lexicon.html">Lexicon</A>, <A HREF="../german/LexiconGer.gf">LexiconGer</A>
</UL> </UL>
<A NAME="toc22"></A>
<H2>Lexicon extension</H2> <H2>Lexicon extension</H2>
<A NAME="toc23"></A>
<H3>The irregularity lexicon</H3> <H3>The irregularity lexicon</H3>
<P> <P>
It may be handy to provide a separate module of irregular It may be handy to provide a separate module of irregular
@@ -658,6 +784,7 @@ few hundred perhaps. Building such a lexicon separately also
makes it less important to cover <I>everything</I> by the makes it less important to cover <I>everything</I> by the
worst-case paradigms (<CODE>mkV</CODE> etc). worst-case paradigms (<CODE>mkV</CODE> etc).
</P> </P>
<A NAME="toc24"></A>
<H3>Lexicon extraction from a word list</H3> <H3>Lexicon extraction from a word list</H3>
<P> <P>
You can often find resources such as lists of You can often find resources such as lists of
@@ -692,6 +819,7 @@ When using ready-made word lists, you should think about
copyright issues. Ideally, all resource grammar material should copyright issues. Ideally, all resource grammar material should
be provided under GNU General Public License. be provided under GNU General Public License.
</P> </P>
<A NAME="toc25"></A>
<H3>Lexicon extraction from raw text data</H3> <H3>Lexicon extraction from raw text data</H3>
<P> <P>
This is a cheap technique to build a lexicon of thousands This is a cheap technique to build a lexicon of thousands
@@ -699,6 +827,7 @@ of words, if text data is available in digital format.
See the <A HREF="http://www.cs.chalmers.se/~markus/FM/">Functional Morphology</A> See the <A HREF="http://www.cs.chalmers.se/~markus/FM/">Functional Morphology</A>
homepage for details. homepage for details.
</P> </P>
<A NAME="toc26"></A>
<H3>Extending the resource grammar API</H3> <H3>Extending the resource grammar API</H3>
<P> <P>
Sooner or later it will happen that the resource grammar API Sooner or later it will happen that the resource grammar API
@@ -707,6 +836,7 @@ that it does not include idiomatic expressions in a given language.
The solution then is in the first place to build language-specific The solution then is in the first place to build language-specific
extension modules. This chapter will deal with this issue (to be completed). extension modules. This chapter will deal with this issue (to be completed).
</P> </P>
<A NAME="toc27"></A>
<H2>Writing an instance of parametrized resource grammar implementation</H2> <H2>Writing an instance of parametrized resource grammar implementation</H2>
<P> <P>
Above we have looked at how a resource implementation is built by Above we have looked at how a resource implementation is built by
@@ -726,6 +856,7 @@ the Romance family (to be completed). Here is a set of
<A HREF="http://www.cs.chalmers.se/~aarne/geocal2006.pdf">slides</A> <A HREF="http://www.cs.chalmers.se/~aarne/geocal2006.pdf">slides</A>
on the topic. on the topic.
</P> </P>
<A NAME="toc28"></A>
<H2>Parametrizing a resource grammar implementation</H2> <H2>Parametrizing a resource grammar implementation</H2>
<P> <P>
This is the most demanding form of resource grammar writing. This is the most demanding form of resource grammar writing.
@@ -742,5 +873,5 @@ is constructed from the Finnish grammar through parametrization.
</P> </P>
<!-- html code generated by txt2tags 2.3 (http://txt2tags.sf.net) --> <!-- html code generated by txt2tags 2.3 (http://txt2tags.sf.net) -->
<!-- cmdline: txt2tags Resource-HOWTO.txt --> <!-- cmdline: txt2tags -\-toc -thtml Resource-HOWTO.txt -->
</BODY></HTML> </BODY></HTML>


@@ -14,11 +14,19 @@ resource grammar API for a new language. We will //not// cover how
to use the resource grammar, nor how to change the API. But we to use the resource grammar, nor how to change the API. But we
will give some hints how to extend the API. will give some hints how to extend the API.
A manual for using the resource grammar is found in
**Notice**. This document concerns the API v. 1.0 which has not [``http://www.cs.chalmers.se/~aarne/GF/doc/resource.pdf`` http://www.cs.chalmers.se/~aarne/GF/doc/resource.pdf].
yet been released. You can find the current code
in [``GF/lib/resource-1.0/`` ..]. See the A tutorial on GF, also introducing the idea of resource grammars, is found in
[``resource-1.0/README`` ../README] for
[``http://www.cs.chalmers.se/~aarne/GF/doc/tutorial/gf-tutorial2.html`` ../../../doc/tutorial/gf-tutorial2.html].
This document concerns the API v. 1.0. You can find the current code in
[``http://www.cs.chalmers.se/~aarne/GF/lib/resource-1.0/`` ..]
See the [``README`` ../README] for
details on how this differs from previous versions. details on how this differs from previous versions.
@@ -28,10 +36,13 @@ details on how this differs from previous versions.
The API is divided into a bunch of ``abstract`` modules. The API is divided into a bunch of ``abstract`` modules.
The following figure gives the dependencies of these modules. The following figure gives the dependencies of these modules.
[Lang.png] [Grammar.png]
The module structure is rather flat: almost every module is a direct Thus the API consists of a grammar and a lexicon, which is
parent of the top module ``Lang``. The idea provided for test purposes.
The module structure is rather flat: most modules are direct
parents of ``Grammar``. The idea
is that you can concentrate on one linguistic aspect at a time, or is that you can concentrate on one linguistic aspect at a time, or
also distribute the work among several authors. The module ``Cat`` also distribute the work among several authors. The module ``Cat``
defines the "glue" that ties the aspects together - a type system defines the "glue" that ties the aspects together - a type system
@@ -127,6 +138,38 @@ application grammars are likely to use the resource in different ways for
different languages. different languages.
==Language-dependent syntax modules==
In addition to the common API, there is room for language-dependent extensions
of the resource. The top level of each language looks as follows (with English as an example):
```
abstract English = Grammar, ExtraEngAbs, DictEngAbs
```
where ``ExtraEngAbs`` is a collection of syntactic structures specific to English,
and ``DictEngAbs`` is an English dictionary
(at the moment, it consists of ``IrregEngAbs``,
the irregular verbs of English). Each of these language-specific grammars has
the potential to grow into a full-scale grammar of the language. These grammars
can also be used as libraries, but the possibility of using functors is lost.
To give a better overview of language-specific structures,
modules like ``ExtraEngAbs``
are built from a language-independent module ``ExtraAbs``
by restricted inheritance:
```
abstract ExtraEngAbs = Extra [f,g,...]
```
Thus any category and function in ``Extra`` may be shared by a subset of all
languages. One can see this set-up as a matrix, which tells
what ``Extra`` structures
are implemented in what languages. For the common API in ``Grammar``, the matrix
is filled with 1's (everything is implemented in every language).
In a minimal resource grammar implementation, the language-dependent
extensions are just empty modules, but it is good to provide them for
the sake of uniformity.
==The core of the syntax== ==The core of the syntax==
Among all categories and functions, a handful are Among all categories and functions, a handful are