mirror of
https://github.com/GrammaticalFramework/gf-core.git
synced 2026-05-27 19:58:55 -06:00
started next version of tutorial
This commit is contained in:
506
doc/intro-resource.txt
Normal file
==Coverage==

The GF Resource Grammar Library contains grammar rules for
10 languages (in addition, 2 languages are available as incomplete
implementations, and a few more are under construction). Its purpose
is to make these rules available to application programmers,
who can thereby concentrate on the semantic and stylistic
aspects of their grammars, without having to think about
grammaticality. The targeted application grammarian
is a skilled programmer with
a practical knowledge of the target languages, but without
theoretical knowledge about their grammars.
Such a combination of
skills is typical of programmers who, for instance, want to localize
software to new languages.

The current resource languages are
- ``Ara``bic (incomplete)
- ``Cat``alan (incomplete)
- ``Dan``ish
- ``Eng``lish
- ``Fin``nish
- ``Fre``nch
- ``Ger``man
- ``Ita``lian
- ``Nor``wegian
- ``Rus``sian
- ``Spa``nish
- ``Swe``dish


The first three letters (``Eng`` etc.) are used in grammar module names.
The incomplete Arabic and Catalan implementations are
sufficient for many applications; they both contain, among other
things, complete inflectional morphology.

==A first example==

To give an example application, consider a system for steering
music playing devices by voice commands. In the application,
we may have a semantic category ``Kind``, examples
of ``Kind``s being ``Song`` and ``Artist``. In German, for instance, ``Song``
is linearized into the noun "Lied", but knowing this is not
enough to make the application work, because the noun must be
produced in both singular and plural, and in four different
cases. By using the resource grammar library, it is enough to
write
```
lin Song = mkN "Lied" "Lieder" neuter
```
and the eight forms are correctly generated. The resource grammar
library contains a complete set of inflectional paradigms (such as
``mkN`` here), enabling the definition of any lexical item.

The resource grammar library is not only about inflectional paradigms - it
also has syntax rules. The music player application
might also want to modify songs with properties, such as "American",
"old", "good". The German grammar for adjectival modification is
particularly complex, because adjectives have to agree in gender,
number, and case, and also depend on what determiner is used
("ein amerikanisches Lied" vs. "das amerikanische Lied"). All this
variation is taken care of by the resource grammar function
```
mkCN : AP -> CN -> CN
```
(see the table at the end of this document for the list of all resource grammar
functions). With the resource grammar, the rule that adds properties
to kinds is implemented as
```
lin PropKind kind prop = mkCN prop kind
```
given that
```
lincat Prop = AP
lincat Kind = CN
```
The resource library API is divided into language-specific
and language-independent parts. To put it roughly,
- the lexicon API is language-specific
- the syntax API is language-independent


Thus, to render the above example in French instead of German, we need to
pick a different linearization of ``Song``,
```
lin Song = mkN "chanson" feminine
```
But to linearize ``PropKind``, we can use the very same rule as in German.
The resource function ``mkCN`` has different implementations in the two
languages (e.g. a different word order in French),
but the application programmer need not care about the difference.

==Note on APIs==

From version 1.1 onwards, the resource library is available via two
APIs:
- original ``fun`` and ``oper`` definitions
- overloaded ``oper`` definitions


The introduction of overloading in GF version 2.7 has succeeded in improving
the accessibility of libraries. It has also created a layer of abstraction
between the writers and users of libraries, and thereby makes the library
easier to modify. We shall therefore use the overloaded API
in this document. The original function names are mainly interesting
for those who want to write or modify libraries.

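
To make the relation between the two APIs concrete, here is a simplified
sketch of how an overloaded ``oper`` can be layered on top of original
constructors. The names ``UseN`` and ``AdjCN`` are original constructors of the
kind the parser returns; the module name ``MiniConstructors`` is hypothetical,
and the library's own definition of ``mkCN`` is assumed to have more instances
than the two shown here.
```
incomplete resource MiniConstructors = open Grammar in {

  oper
    -- One overloaded name; the argument types select the instance.
    mkCN = overload {
      mkCN : N  -> CN       = UseN ;   -- a noun used as a common noun
      mkCN : AP -> CN -> CN = AdjCN    -- adjectival modification
    } ;
}
```
With a definition along these lines, application code only mentions ``mkCN``,
so the original constructors can be renamed or restructured without
affecting applications.
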
==A complete example==

To summarize the example, and also give a template for a programmer to work on,
here is the complete implementation of a small system with songs and properties.
The abstract syntax defines a "domain ontology":
```
abstract Music = {

  cat
    Kind ;
    Property ;
  fun
    PropKind : Kind -> Property -> Kind ;
    Song : Kind ;
    American : Property ;
}
```
The concrete syntax is defined by a functor (parametrized module),
independently of language, by opening
two interfaces: the resource ``Syntax`` and an application lexicon.
```
incomplete concrete MusicI of Music =
  open Syntax, MusicLex in {
  lincat
    Kind = CN ;
    Property = AP ;
  lin
    PropKind k p = mkCN p k ;
    Song = mkCN song_N ;
    American = mkAP american_A ;
}
```
The application lexicon ``MusicLex`` has an abstract syntax that extends
the resource category system ``Cat``.
```
abstract MusicLex = Cat ** {

  fun
    song_N : N ;
    american_A : A ;
}
```
Each language has its own concrete syntax, which opens the
inflectional paradigms module for that language:
```
concrete MusicLexGer of MusicLex =
  CatGer ** open ParadigmsGer in {
  lin
    song_N = mkN "Lied" "Lieder" neuter ;
    american_A = mkA "amerikanisch" ;
}

concrete MusicLexFre of MusicLex =
  CatFre ** open ParadigmsFre in {
  lin
    song_N = mkN "chanson" feminine ;
    american_A = mkA "américain" ;
}
```
The top-level ``Music`` grammars are obtained by
instantiating the two interfaces of ``MusicI``:
```
concrete MusicGer of Music = MusicI with
  (Syntax = SyntaxGer),
  (MusicLex = MusicLexGer) ;

concrete MusicFre of Music = MusicI with
  (Syntax = SyntaxFre),
  (MusicLex = MusicLexFre) ;
```
Both of these files can use the same ``path``, defined as
```
--# -path=.:present:prelude
```
The ``present`` directory contains the compiled resources, restricted to
present tense; ``alltenses`` has the full resources.

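
As a quick check that the modules fit together, the grammars can be loaded
and a tree linearized in the GF shell. This is only a sketch: it assumes that
the files are named after their modules (``MusicGer.gf``, ``MusicFre.gf``)
and that each starts with the ``path`` line shown above. The output is omitted,
since it depends on the library version.
```
> i MusicGer.gf

> i MusicFre.gf

> l PropKind Song American
```
Linearizing the same tree with both grammars loaded is a simple way to compare
how ``mkCN`` behaves in the two languages.
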
To localize the music player system to a new language,
only two modules are needed,
one implementing ``MusicLex`` and the other
instantiating ``Music``. The latter is
completely trivial, whereas the former involves the choice of correct
vocabulary and inflectional paradigms. For instance, Finnish is added as follows:
```
concrete MusicLexFin of MusicLex =
  CatFin ** open ParadigmsFin in {
  lin
    song_N = mkN "kappale" ;
    american_A = mkA "amerikkalainen" ;
}

concrete MusicFin of Music = MusicI with
  (Syntax = SyntaxFin),
  (MusicLex = MusicLexFin) ;
```
More work is of course needed if the language-independent linearizations in
``MusicI`` are not satisfactory for some language. The resource grammar guarantees
that the linearizations are possible in all languages, in the sense of being grammatical,
but they might still be inadequate for stylistic reasons. Assume,
for the sake of argument, that adjectival modification does not sound good in
English, but that a relative clause would be preferable. One can then use
restricted inheritance of the functor:
```
concrete MusicEng of Music =
  MusicI - [PropKind]
  with
    (Syntax = SyntaxEng),
    (MusicLex = MusicLexEng) **
  open SyntaxEng in {
  lin
    PropKind k p = mkCN k (mkRS (mkRCl which_RP (mkVP p))) ;
}
```
The lexicon is as expected:
```
concrete MusicLexEng of MusicLex =
  CatEng ** open ParadigmsEng in {
  lin
    song_N = mkN "song" ;
    american_A = mkA "American" ;
}
```

==Lock fields==

//This section is only relevant as a guide to error messages that have to do with lock fields, and can be skipped otherwise.//

FIXME: this section may become obsolete.

When the categories of the resource grammar are used
in applications, a **lock field** is added to their linearization types.
The lock field for a category ``C`` is a record field
```
lock_C : {}
```
with the only possible value
```
lock_C = <>
```
The lock field carries no information, but its presence
makes the linearization type of ``C``
unique, so that categories
with the same implementation are not confused with each other.
(This is inspired by the ``newtype`` discipline in Haskell.)

For example, the lincats of adverbs and conjunctions are the same
in ``CatEng`` (and therefore in ``GrammarEng``, which inherits it):
```
lincat Adv = {s : Str} ;
lincat Conj = {s : Str} ;
```
But when these category symbols are used to denote their linearization
types in an application, these definitions are translated to
```
oper Adv : Type = {s : Str ; lock_Adv : {}} ;
oper Conj : Type = {s : Str ; lock_Conj : {}} ;
```
In this way, the user of a resource grammar cannot confuse adverbs with
conjunctions. In other words, the lock fields force the type checker
to function as a grammaticality checker.

When the resource grammar is ``open``ed in an application grammar,
and only functions from the resource are used in a type-correct way, the
lock fields are never seen (except possibly in type error messages).
If an application grammarian has to write lock fields herself,
it is a sign that the guarantees given by the resource grammar
no longer hold. But since the resource may be incomplete, the
application grammarian may occasionally have to provide the dummy
values of lock fields (always ``<>``, the empty record).
Here is an example:
```
mkUtt : Str -> Utt ;
mkUtt s = {s = s ; lock_Utt = <>} ;
```
Currently, missing lock fields produce warnings rather than errors,
but this behaviour of GF may change in the future.

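
The same pattern applies to other resource categories. The sketch below wraps
a plain string as an ``Adv``, assuming the English lincat ``{s : Str}`` shown
above; the module name ``LockDemo`` and the operation name ``strAdv`` are
hypothetical and not part of the library.
```
resource LockDemo = open CatEng in {

  oper
    -- The dummy value <> fills the lock field, so the record has type Adv.
    strAdv : Str -> Adv = \s -> {s = s ; lock_Adv = <>} ;

    -- A record locked as Conj is not an Adv, although the s fields agree.
    -- Uncommenting the next line should be flagged for the missing lock_Adv:
    -- wrong : Adv = {s = "and" ; lock_Conj = <>} ;
}
```
Writing such a helper by hand is only justified when the resource offers
no function that builds the value.
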
==Parsing with resource grammars?==

The intended use of the resource grammar is as a library for writing
application grammars. It is not designed for parsing e.g. newspaper text. There
are several reasons why this is not practical:
- Efficiency: the resource grammar uses complex data structures, in
  particular, discontinuous constituents, which make parsing slow and the
  parser size huge.
- Completeness: the resource grammar does not necessarily cover all rules
  of the language - only enough of them to be able to express everything
  in one way or another.
- Lexicon: the resource grammar has a very small lexicon, only meant for test
  purposes.
- Semantics: the resource grammar has very little semantic control, and may
  accept strange input or deliver strange interpretations.
- Ambiguity: parsing in the resource grammar may return lots of results, many
  of which are implausible.


All of these problems should be solved in application grammars.
The task of resource grammars is just to take care of low-level linguistic
details such as inflection, agreement, and word order.

It is for the same reasons that resource grammars are not adequate for translation.
That the syntax API is implemented for different languages of course makes
it possible to translate via it - but there is no guarantee of translation
equivalence. Of course, the use of functor implementations such as ``MusicI``
above only extends to those cases where the syntax API does give translation
equivalence - but this must be seen as a limiting case, and bigger applications
will often use only restricted inheritance of ``MusicI``.

=To find rules in the resource grammar library=

==Inflection paradigms==

Inflection paradigms are defined separately for each language //L//
in the module ``Paradigms``//L//. To test them, the command
``cc`` (= ``compute_concrete``)
can be used:
```
> i -retain german/ParadigmsGer.gf

> cc mkN "Schlange"
{
  s : Number => Case => Str = table Number {
    Sg => table Case {
      Nom => "Schlange" ;
      Acc => "Schlange" ;
      Dat => "Schlange" ;
      Gen => "Schlange"
      } ;
    Pl => table Case {
      Nom => "Schlangen" ;
      Acc => "Schlangen" ;
      Dat => "Schlangen" ;
      Gen => "Schlangen"
      }
    } ;
  g : Gender = Fem
}
```
For the sake of convenience, every language implements these five paradigms:
```
oper
  mkN  : Str -> N ;   -- regular nouns
  mkA  : Str -> A ;   -- regular adjectives
  mkV  : Str -> V ;   -- regular verbs
  mkPN : Str -> PN ;  -- regular proper names
  mkV2 : V -> V2 ;    -- direct transitive verbs
```
It is often possible to initialize a lexicon by just using these functions,
and later revise it by using the more involved paradigms. For instance, in
German we cannot use ``mkN "Lied"`` for ``Song``, because the result would be a
masculine noun with the plural form ``"Liede"``.
The individual ``Paradigms`` modules
tell what cases are covered by the regular heuristics.

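
The two-step workflow can be sketched for the German lexicon used earlier.
The first draft relies on the one-argument paradigm; the revision switches to
the three-argument ``mkN`` from the earlier example once the regular guess
turns out to be wrong. Both entries are taken from this document; only their
arrangement as draft and revision is illustrative.
```
-- first draft, later replaced: the regular paradigm guesses a
-- masculine noun with the plural "Liede"
-- lin song_N = mkN "Lied" ;

-- revised entry: plural form and gender given explicitly
lin song_N = mkN "Lied" "Lieder" neuter ;
```
The two paradigms can be compared with ``cc`` before the lexicon is edited
(output omitted):
```
> i -retain german/ParadigmsGer.gf

> cc mkN "Lied"

> cc mkN "Lied" "Lieder" neuter
```
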
As a limiting case, one could even initialize the lexicon for a new language
by copying the English (or some other already existing) lexicon. This would
produce a language with correct grammar but with content words directly borrowed from
English - maybe not so strange in certain technical domains.

==Syntax rules==

Syntax rules should be looked for in the module ``Constructors``.
Below this top-level module exposing overloaded constructors,
there are around 10 abstract modules, each defining constructors for
a group of one or more related categories. For instance, the module
``Noun`` defines how to construct common nouns, noun phrases, and determiners.
But these special modules are seldom or never needed by the users of the library.

TODO: when are they needed?

Browsing the libraries is helped by the gfdoc-generated HTML pages,
whose LaTeX versions are included in the present document.

==Special-purpose APIs==

To give an analogy with the well-known typesetting software, GF can be compared
with TeX and the resource grammar library with LaTeX.
Just as TeX frees the author
from thinking about low-level problems of page layout, so GF frees the grammarian
from writing parsing and generation algorithms. But quite a lot of knowledge of
//how// to write grammars is still needed, and the resource grammar library helps
GF grammarians in a way similar to how the LaTeX macro package helps TeX authors.

But even LaTeX is often too detailed and low-level, and users are encouraged to
develop their own macro packages. The same applies to GF resource grammars:
the application grammarian might not need all the choices that the resource
provides, but would prefer less writing and higher-level programming.
To this end, application grammarians may want to write their own views on the
resource grammar.

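
As an illustration, a project-specific view for the music grammar might bundle
the two ways of attaching a property to a kind that were used earlier. This is
a sketch, not part of the library: the module name ``MusicViewEng`` and the
operation names ``modKind`` and ``relKind`` are hypothetical, while ``mkCN``,
``mkRS``, ``mkRCl``, ``which_RP`` and ``mkVP`` are used in the same way as in
the ``MusicI`` and ``MusicEng`` modules.
```
resource MusicViewEng = open SyntaxEng in {

  oper
    -- adjectival modification, as in MusicI
    modKind : CN -> AP -> CN = \kind, prop ->
      mkCN prop kind ;

    -- relative-clause modification, as in MusicEng
    relKind : CN -> AP -> CN = \kind, prop ->
      mkCN kind (mkRS (mkRCl which_RP (mkVP prop))) ;
}
```
An application that opens such a module could then write
``lin PropKind k p = relKind k p``, which keeps the choice of construction in
one place.
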
==Browsing by the parser==

An alternative to browsing the library documentation is
to use the parser.
Even though parsing is not an intended end-user application
of resource grammars, it is a useful technique for application grammarians
to browse the library. To find out which resource function implements
a particular structure, one can just parse a string that exemplifies this
structure. For instance, to find out how sentences are built using
transitive verbs, write
```
> i english/LangEng.gf

> p -cat=Cl "she loves him"
PredVP (UsePron she_Pron) (ComplV2 love_V2 (UsePron he_Pron))
```
The parser returns original constructors, not overloaded ones. Overloaded
constructors can be returned, so far with experimental heuristics, by using
the grammar ``api/toplevel/OverLangEng.gf`` and a special flag:
```
> i api/toplevel/OverLangEng.gf

> p -cat=Cl -overload "she loves him"
mkCl (mkNP she_Pron) love_V2 (mkNP he_Pron)
```
Parsing with the English resource grammar has an acceptable speed, but
with most languages it takes too many resources even to build the
parser. However, examples parsed in one language can always be linearized into
other languages:
```
> i italian/LangIta.gf

> l PredVP (UsePron she_Pron) (ComplV2 love_V2 (UsePron he_Pron))
lo ama
```
Therefore, one can use the English parser to write an Italian grammar, and also
to write a language-independent (incomplete) grammar. One can also parse strings
that are bizarre in English but match the intended way of expression in another language.
For instance, the phrase for "I am hungry" in Italian is literally "I have hunger".
This can be built by parsing "I have beer" in ``OverLangEng`` and then writing
```
lin IamHungry =
  let beer_N = mkN "fame" feminine
  in
  mkCl (mkNP i_Pron) have_V2 (mkNP massQuant beer_N)
```
which uses ``ParadigmsIta.mkN``.

==Example-based grammar writing==

The technique of parsing with the resource grammar can be used in GF source files
with the suffix ``.gfe`` ("GF examples"). The suffix tells GF to preprocess
the file by replacing all expressions of the form
```
in Module.Cat "example string"
```
by the syntax trees obtained by parsing "example string" in ``Cat`` in ``Module``.
For instance,
```
lin IamHungry =
  let beer_N = mkN "fame" feminine
  in
  (in LangEng.Cl "I have beer") ;
```
will result in the rule displayed in the previous section. The normal binding rules
of functional programming (and GF) guarantee that local bindings of identifiers
take precedence over constants with the same names. Thus it is also possible to
linearize functions taking arguments in this way:
```
lin
  PropKind car_N old_A = in LangEng.CN "old car" ;
```
However, the technique of example-based grammar writing has some limitations:
- Ambiguity. If a string has several parses, the first one is returned, and
  it may not be the intended one. The other parses are shown in a comment, from
  where they can be picked manually.
- Lexicality. The arguments of a function must be atomic identifiers, and are thus
  not available for categories that have no lexical items.
  For instance, the ``PropKind`` rule above gives the result
```
lin
  PropKind car_N old_A = AdjCN (PositA old_A) (UseN car_N) ;
```
However, it is possible to write a special lexicon that gives atomic rules for
all those categories that can be used as arguments, for instance,
```
fun
  cat_CN : CN ;
  old_AP : AP ;
```
and then use this lexicon instead of the standard one included in ``Lang``.