forked from GitHub/gf-core
more work on resource.txt
260
doc/resource.txt
@@ -1,12 +1,25 @@
|
|||||||
The GF Resource Grammar Library
|
The GF Resource Grammar Library
|
||||||
|
|
||||||
|
This document is about the use of the
|
||||||
|
GF Resource Grammar Library. It presupposes knowledge of GF and its
|
||||||
|
module system, knowledge that can be acquired e.g. from the GF
|
||||||
|
tutorial. Starting with an introduction to the library, we will
|
||||||
|
later cover all aspects of it that one needs to know in order
|
||||||
|
to use it.
|
||||||
|
|
||||||
|
|
||||||
|
==Motivation==
|
||||||
|
|
||||||
The GF Resource Grammar Library contains grammar rules for
|
The GF Resource Grammar Library contains grammar rules for
|
||||||
10 languages (some more are under construction). Its purpose
|
10 languages (some more are under construction). Its purpose
|
||||||
is to make these rules available for application programmers,
|
is to make these rules available for application programmers,
|
||||||
who can thereby concentrate on the semantic and stylistic
|
who can thereby concentrate on the semantic and stylistic
|
||||||
aspects of their grammars, without having to think about
|
aspects of their grammars, without having to think about
|
||||||
grammaticality.
|
grammaticality. The typical application grammarian
|
||||||
|
is a skilled programmer, without knowledge of linguistics, but with
|
||||||
|
a good knowledge of the target languages. Such a combination of
|
||||||
|
skills is typical of a programmer who wants to localize a piece
|
||||||
|
of software to a new language.
|
||||||
|
|
||||||
To give an example, an application dealing with
|
To give an example, an application dealing with
|
||||||
music players may have a semantical category ``Kind``, examples
|
music players may have a semantical category ``Kind``, examples
|
||||||
@@ -19,12 +32,16 @@ write
|
|||||||
|
|
||||||
lin Song = reg2N "Lied" "Lieder" neuter
|
lin Song = reg2N "Lied" "Lieder" neuter
|
||||||
|
|
||||||
and the eight forms are correctly generated. The use of the resource
|
and the eight forms are correctly generated. The resource grammar
|
||||||
grammar extends from lexical items to syntax rules. The application
|
library contains a complete set of inflectional paradigms (such as
|
||||||
mught also want to modify songs with properties, such as "American",
|
reg2N here), enabling the definition of any lexical item.
|
||||||
|
|
||||||
|
The resource grammar library is not only about inflectional paradigms - it
|
||||||
|
also has syntax rules. The music player application
|
||||||
|
might also want to modify songs with properties, such as "American",
|
||||||
"old", "good". The German grammar for adjectival modifications is
|
"old", "good". The German grammar for adjectival modifications is
|
||||||
particularly complex, because the adjectives have to agree in gender,
|
particularly complex, because the adjectives have to agree in gender,
|
||||||
number, and case, also depending on what determiner is used
|
number, and case, and also depend on what determiner is used
|
||||||
("ein amerikanisches Lied" vs. "das amerikanische Lied"). All this
|
("ein amerikanisches Lied" vs. "das amerikanische Lied"). All this
|
||||||
variation is taken care of by the resource grammar function
|
variation is taken care of by the resource grammar function
|
||||||
|
|
||||||
@@ -42,8 +59,8 @@ given that
|
|||||||
|
|
||||||
The resource library API is divided into language-specific and language-independent
|
The resource library API is divided into language-specific and language-independent
|
||||||
parts. To put it roughly,
|
parts. To put it roughly,
|
||||||
- syntax is language-independent
|
|
||||||
- lexicon is language-specific
|
- lexicon is language-specific
|
||||||
|
- syntax is language-independent
|
||||||
|
|
||||||
|
|
||||||
Thus, to render the above example in French instead of German, we need to
|
Thus, to render the above example in French instead of German, we need to
|
||||||
@@ -55,38 +72,152 @@ But to linearize PropKind, we can use the very same rule as in German.
|
|||||||
The resource function AdjCN has different implementations in the two
|
The resource function AdjCN has different implementations in the two
|
||||||
languages, but the application programmer need not care about the difference.
|
languages, but the application programmer need not care about the difference.
|
||||||
|
|
||||||
|
To summarize the example, and also give a template for a programmer to work on,
|
||||||
|
here is the complete implementation of a small system with songs and properties.
|
||||||
|
The abstract syntax defines a "domain ontology":
|
||||||
|
|
||||||
|
abstract Music = {
|
||||||
|
cat
|
||||||
|
Kind,
|
||||||
|
Property ;
|
||||||
|
fun
|
||||||
|
PropKind : Kind -> Property -> Kind ;
|
||||||
|
Song : Kind ;
|
||||||
|
American : Property ;
|
||||||
|
}
|
||||||
|
|
||||||
|
The concrete syntax is defined independently of language, by opening
|
||||||
|
two interfaces: the resource Grammar and an application lexicon.
|
||||||
|
|
||||||
|
incomplete concrete MusicI of Music = open Grammar, MusicLex in {
|
||||||
|
lincat
|
||||||
|
Kind = CN ;
|
||||||
|
Property = AP ;
|
||||||
|
lin
|
||||||
|
PropKind k p = AdjCN p k ;
|
||||||
|
Song = UseN song_N ;
|
||||||
|
American = PositA american_A ;
|
||||||
|
}
|
||||||
|
|
||||||
|
The application lexicon MusicLex has an abstract syntax that extends
|
||||||
|
the resource category system Cat.
|
||||||
|
|
||||||
|
abstract MusicLex = Cat ** {
|
||||||
|
fun
|
||||||
|
song_N : N ;
|
||||||
|
american_A : A ;
|
||||||
|
}
|
||||||
|
|
||||||
|
Each language has its own concrete syntax, which opens the inflectional paradigms
|
||||||
|
module for that language:
|
||||||
|
|
||||||
|
concrete MusicLexGer of MusicLex = CatGer ** open ParadigmsGer in {
|
||||||
|
lin
|
||||||
|
song_N = reg2N "Lied" "Lieder" neuter ;
|
||||||
|
american_A = regA "amerikanisch" ;
|
||||||
|
}
|
||||||
|
|
||||||
|
concrete MusicLexFre of MusicLex = CatFre ** open ParadigmsFre in {
|
||||||
|
lin
|
||||||
|
song_N = regGenN "chanson" feminine ;
|
||||||
|
american_A = regA "américain" ;
|
||||||
|
}
|
||||||
|
|
||||||
|
The top-level Music grammars are obtained by instantiating the two interfaces
|
||||||
|
of MusicI:
|
||||||
|
|
||||||
|
concrete MusicGer of Music = MusicI with
|
||||||
|
(Grammar = GrammarGer),
|
||||||
|
(MusicLex = MusicLexGer) ;
|
||||||
|
|
||||||
|
concrete MusicFre of Music = MusicI with
|
||||||
|
(Grammar = GrammarFre),
|
||||||
|
(MusicLex = MusicLexFre) ;
|
||||||
|
|
||||||
|
To localize the system to a new language, all that is needed is two modules,
|
||||||
|
one implementing MusicLex and the other instantiating Music. The latter is
|
||||||
|
completely trivial, whereas the former one involves the choice of correct
|
||||||
|
vocabulary and inflectional paradigms. For instance, Finnish is added as follows:
|
||||||
|
|
||||||
|
concrete MusicLexFin of MusicLex = CatFin ** open ParadigmsFin in {
|
||||||
|
lin
|
||||||
|
song_N = regN "kappale" ;
|
||||||
|
american_A = regA "amerikkalainen" ;
|
||||||
|
}
|
||||||
|
|
||||||
|
concrete MusicFin of Music = MusicI with
|
||||||
|
(Grammar = GrammarFin),
|
||||||
|
(MusicLex = MusicLexFin) ;
|
||||||
|
|
||||||
|
More work is of course involved if the language-independent linearizations in
|
||||||
|
MusicI are not satisfactory for some language. The resource grammar guarantees
|
||||||
|
that the linearizations are possible in all languages, in the sense of being grammatical,
|
||||||
|
but they might of course be inadequate for stylistic reasons. Assume,
|
||||||
|
for the sake of argument, that adjectival modification does not sound good in
|
||||||
|
English, but that a relative clause would be preferable. One can then start as
|
||||||
|
before,
|
||||||
|
|
||||||
|
concrete MusicLexEng of MusicLex = CatEng ** open ParadigmsEng in {
|
||||||
|
lin
|
||||||
|
song_N = regN "song" ;
|
||||||
|
american_A = regA "American" ;
|
||||||
|
}
|
||||||
|
|
||||||
|
concrete MusicEng0 of Music = MusicI with
|
||||||
|
(Grammar = GrammarEng),
|
||||||
|
(MusicLex = MusicLexEng) ;
|
||||||
|
|
||||||
|
The module MusicEng0 would not be used on the top level, however, but
|
||||||
|
another module would be built on top of it, with a restricted import from
|
||||||
|
MusicEng0. MusicEng inherits everything from MusicEng0 except PropKind, and
|
||||||
|
gives its own definition of this function:
|
||||||
|
|
||||||
|
concrete MusicEng of Music = MusicEng0 - [PropKind] ** open GrammarEng in {
|
||||||
|
lin
|
||||||
|
PropKind k p =
|
||||||
|
RelCN k (UseRCl TPres ASimul PPos (RelVP IdRP (UseComp (CompAP p)))) ;
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
==To use a resouce grammar==
|
|
||||||
|
|
||||||
===Parsing===
|
===Parsing with resource grammars?===
|
||||||
|
|
||||||
The intended use of the resource grammar is as a library for writing
|
The intended use of the resource grammar is as a library for writing
|
||||||
application grammars. It is not designed for e.g. parsing text. There
|
application grammars. It is not designed for e.g. parsing newspaper text. There
|
||||||
are several reasons why this is not so practical:
|
are several reasons why this is not so practical:
|
||||||
- efficiency: the resource grammar uses complex data structures, in
|
- Efficiency: the resource grammar uses complex data structures, in
|
||||||
particular, discontinuous constituents, which make parsing slow and the
|
particular, discontinuous constituents, which make parsing slow and the
|
||||||
parser size huge
|
parser size huge.
|
||||||
- completeness: the resource grammar does not necessarily cover all rules
|
- Completeness: the resource grammar does not necessarily cover all rules
|
||||||
of the language - only enough many so that it is possible to express everything
|
of the language - only enough of them to be able to express everything
|
||||||
in one way or another
|
in one way or another.
|
||||||
- lexicon: the resource grammar has a very small lexicon, only meant for test
|
- Lexicon: the resource grammar has a very small lexicon, only meant for test
|
||||||
purposes
|
purposes.
|
||||||
- semantics: the resource grammar has very little semantic control, and may
|
- Semantics: the resource grammar has very little semantic control, and may
|
||||||
accept strange input or deliver strange interpretations
|
accept strange input or deliver strange interpretations.
|
||||||
- ambiguity: parsing in the resource grammar may return lots of results many
|
- Ambiguity: parsing in the resource grammar may return lots of results, many
|
||||||
of which are implausible
|
of which are implausible.
|
||||||
|
|
||||||
|
|
||||||
All of these problems should be settled in application grammars - the very point
|
All of these problems should be solved in application grammars.
|
||||||
of resource grammars is to isolate the low-level linguistic details such as
|
The task of resource grammars is just to take care of low-level linguistic
|
||||||
inflection, agreement, and word order, from semantic questions, which is what
|
details such as inflection, agreement, and word order.
|
||||||
the application grammarians should solve.
|
|
||||||
|
|
||||||
|
For the same reasons, resource grammars are not adequate for translation.
|
||||||
|
That the syntax API is implemented for different languages of course makes
|
||||||
|
it possible to translate via it - but there is no guarantee of translation
|
||||||
|
equivalence. Of course, the use of parametrized implementations such as MusicI
|
||||||
|
above only extends to those cases where the syntax API does give translation
|
||||||
|
equivalence - but this must be seen as a limiting case, and real applications
|
||||||
|
will often use only restricted inheritance of MusicI.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
==To find rules in the resource grammar library==
|
||||||
|
|
||||||
===Inflection paradigms===
|
===Inflection paradigms===
|
||||||
|
|
||||||
The inflection paradigms are defined separately for each language L
|
Inflection paradigms are defined separately for each language L
|
||||||
in the module ParadigmsL. To test them, the command cc (= compute_concrete)
|
in the module ParadigmsL. To test them, the command cc (= compute_concrete)
|
||||||
can be used:
|
can be used:
|
||||||
|
|
||||||
@@ -111,6 +242,25 @@ can be used:
|
|||||||
g : Gender = Fem
|
g : Gender = Fem
|
||||||
}
|
}
|
||||||
|
|
||||||
|
For the sake of convenience, every language implements these four paradigms:
|
||||||
|
|
||||||
|
oper
|
||||||
|
regN : Str -> N ; -- regular nouns
|
||||||
|
regA : Str -> A ; -- regular adjectives
|
||||||
|
regV : Str -> V ; -- regular verbs
|
||||||
|
dirV : V -> V2 ; -- direct transitive verbs
|
||||||
|
|
||||||
|
It is often possible to initialize a lexicon by just using these functions,
|
||||||
|
and later revise it by using the more involved paradigms. For instance, in
|
||||||
|
German we cannot use regN "Lied" for Song, because the result would be a
|
||||||
|
masculine noun with the plural form "Liede". The individual Paradigms modules
|
||||||
|
tell what cases are covered by the regular heuristics.
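
The draft-then-revise workflow described here can be sketched for the German lexicon. This is only an illustration that reuses the paradigms already named in this document (regN, reg2N, regA); the commented-out line stands for a hypothetical first draft and is not part of the library.

```
concrete MusicLexGer of MusicLex = CatGer ** open ParadigmsGer in {
  lin
    -- First draft with the regular heuristic (assumed starting point):
    --   song_N = regN "Lied" ;
    -- As noted above, this would give a masculine noun with plural "Liede".
    -- Revised entry: singular and plural forms and the gender given explicitly.
    song_N = reg2N "Lied" "Lieder" neuter ;
    -- The regular adjective paradigm is kept as it is.
    american_A = regA "amerikanisch" ;
}
```

Only the entry that the heuristic gets wrong needs to be revised; the rest of the lexicon can stay on the regular paradigms.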
|
||||||
|
|
||||||
|
As a limiting case, one could even initialize the lexicon for a new language
|
||||||
|
by copying the English (or some other already existing) lexicon. This will
|
||||||
|
produce language with correct grammar but content words directly borrowed from
|
||||||
|
English.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
===Syntax rules===
|
===Syntax rules===
|
||||||
@@ -139,7 +289,7 @@ instance, to find out how sentences are built using transitive verbs, write
|
|||||||
|
|
||||||
Parsing with the English resource grammar has an acceptable speed, but
|
Parsing with the English resource grammar has an acceptable speed, but
|
||||||
with most languages it takes just too many resources even to build the
|
with most languages it takes just too many resources even to build the
|
||||||
parser. However, examples parsed in one language can always be linearized in
|
parser. However, examples parsed in one language can always be linearized into
|
||||||
other languages:
|
other languages:
|
||||||
|
|
||||||
> i italian/LangIta.gf
|
> i italian/LangIta.gf
|
||||||
@@ -148,6 +298,64 @@ other languages:
|
|||||||
|
|
||||||
lo ama
|
lo ama
|
||||||
|
|
||||||
|
Therefore, one can use the English parser to write an Italian grammar, and also
|
||||||
|
to write a language-independent (incomplete) grammar. One can also parse strings
|
||||||
|
that are bizarre in English but the intended way of expression in another language.
|
||||||
|
For instance, the phrase for "I am hungry" in Italian is literally "I have hunger".
|
||||||
|
This can be built by parsing "I have beer" in LangEng and then writing
|
||||||
|
|
||||||
|
lin IamHungry =
|
||||||
|
let beer_N = regGenN "fame" feminine
|
||||||
|
in
|
||||||
|
PredVP (UsePron i_Pron) (ComplV2 have_V2
|
||||||
|
(DetCN (DetSg MassDet NoOrd) (UseN beer_N))) ;
|
||||||
|
|
||||||
|
which uses ParadigmsIta.regGenN.
|
||||||
|
|
||||||
|
|
||||||
|
===Example-based grammar writing===
|
||||||
|
|
||||||
|
The technique of parsing with the resource grammar can be used in GF source files,
|
||||||
|
endowed with the suffix .gfe ("GF examples"). The suffix tells GF to preprocess
|
||||||
|
the file by replacing all expressions of the form
|
||||||
|
|
||||||
|
in Module.Cat "example string"
|
||||||
|
|
||||||
|
by the syntax trees obtained by parsing "example string" in Cat in Module.
|
||||||
|
For instance,
|
||||||
|
|
||||||
|
lin IamHungry =
|
||||||
|
let beer_N = regGenN "fame" feminine
|
||||||
|
in
|
||||||
|
(in LangEng.Cl "I have beer") ;
|
||||||
|
|
||||||
|
will result in the rule displayed in the previous section. The normal binding rules
|
||||||
|
of functional programming (and GF) guarantee that local bindings of identifiers
|
||||||
|
take precedence over constants of the same forms. Thus it is also possible to
|
||||||
|
linearize functions taking arguments in this way:
|
||||||
|
|
||||||
|
lin
|
||||||
|
PropKind car_N old_A = in LangEng.CN "old car" ;
|
||||||
|
|
||||||
|
However, the technique of example-based grammar writing has some limitations:
|
||||||
|
- Ambiguity. If a string has several parses, the first one is returned, and
|
||||||
|
it may not be the intended one. The other parses are shown in a comment, from
|
||||||
|
where they must/can be picked manually.
|
||||||
|
- Lexicality. The arguments of a function must be atomic identifiers, and are thus
|
||||||
|
not available for categories that have no lexical items. For instance, the PropKind
|
||||||
|
rule above gives the result
|
||||||
|
|
||||||
|
lin
|
||||||
|
PropKind car_N old_A = AdjCN (PositA old_A) (UseN car_N) ;
|
||||||
|
|
||||||
|
However, it is possible to write a special lexicon that gives atomic rules for
|
||||||
|
all those categories that can be used as arguments, for instance,
|
||||||
|
|
||||||
|
fun
|
||||||
|
car_CN : CN ;
|
||||||
|
old_AP : AP ;
|
||||||
|
|
||||||
|
and then use this lexicon instead of the standard one included in Lang.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|||||||