From bfbe2e3d47e5f1904846609c80058f0561d76ede Mon Sep 17 00:00:00 2001
From: aarne Grammatical Framework Tutorial
Author: Aarne Ranta <aarne (at) cs.chalmers.se>
-Last update: Sun Dec 18 22:29:50 2005
+Last update: Mon Dec 19 17:31:35 2005
-
@@ -126,7 +131,9 @@ Last update: Sun Dec 18 22:29:50 2005
-
-
-
-
-
-
-
-
-
-
The term GF is used for different things:
@@ -147,7 +154,161 @@ It will guide you
+A grammar is a definition of a language.
+From this definition, different language processing components
+can be derived:
+
+
+A GF grammar can be seen as a declarative program from which these
+processing tasks can be automatically derived. In addition, many
+other tasks are readily available for GF grammars:
+
+
+A typical GF application is based on a multilingual grammar involving
+translation on a special domain. Existing applications of this idea include
+
+
+The specialization of a grammar to a domain makes it possible to
+obtain much better translations than in an unlimited machine translation
+system. This is due to the well-defined semantics of such domains.
+Grammars having this character are called application grammars.
+They differ from most grammars written by linguists precisely
+because they are multilingual and domain-specific.
+
+
+However, there is another kind of grammar, which we call resource grammars.
+These are large, comprehensive grammars that can be used in any domain.
+The GF Resource Grammar Library has resource grammars for 10 languages.
+These grammars can be used as libraries to define application grammars.
+In this way, it is possible to write a high-quality grammar without
+knowing about linguistics: in general, writing an application grammar
+with the resource library requires only practical knowledge of
+the target language.
+
+
+
+This tutorial is mainly for programmers who want to learn to write
+application grammars. It will go through GF's programming concepts
+without going too deep into linguistics. Thus it should
+be accessible to anyone who has some previous programming experience.
+
+
+A separate document is being written on how to write resource grammars.
+This includes the ways in which linguistic problems posed by different
+languages are solved in GF.
+
+
+
+The tutorial gives a hands-on introduction to grammar writing.
+We start by building a small grammar for the domain of food:
+in this grammar, you can say things like
+
+
+  this Italian cheese is delicious
+
+
+in English and Italian.
+
+
+The first English grammar
+food.cf
+is written in a context-free
+notation (also known as BNF). The BNF format is often a good
+starting point for GF grammar development, because it is
+simple and widely used. However, the BNF format is not
+good for multilingual grammars. While it is possible to
+translate the words contained in a BNF grammar to another
+language, proper translation usually involves more, e.g.
+changing the word order in
+
+  Italian cheese ===> formaggio italiano
+
+
+The full GF grammar format is designed to support such
+changes, by separating the abstract syntax
+(the logical structure) from the concrete syntax (the
+sequence of words) of expressions.
+
+
+There is more than words and word order that makes languages
+different. Words can have different forms, and which forms
+they have varies from language to language. For instance,
+Italian adjectives usually have four forms where English
+has just one:
+
+
+  delicious (wine | wines | pizza | pizzas)
+  vino delizioso, vini deliziosi, pizza deliziosa, pizze deliziose
+
+
+The morphology of a language describes the
+forms of its words. While the complete description of morphology
+belongs to resource grammars, the tutorial will explain the
+main programming concepts involved. This will moreover
+make it possible to grow the fragment covered by the food example.
+The tutorial will in fact build a toy resource grammar in order
+to illustrate the module structure of library-based application
+grammar writing.
+
+
+Thus it is by elaborating the initial food.cf example that
+the tutorial makes a guided tour through all concepts of GF.
+While the constructs of the GF language are the main focus,
+the commands of the GF system are also introduced as they
+are needed.
+
+Learning how to write GF grammars is not the only goal of
+this tutorial. Learning the commands of the GF system means
+that simple applications of grammars, such as translation and
+quiz systems, can be built simply by writing scripts for the
+system. More complicated applications, such as natural-language
+interfaces and dialogue systems, also require programming in
+some general-purpose language. We will briefly explain how
+GF grammars are used as components of Haskell, Java, and
+Prolog programs. The tutorial concludes with a couple of
+case studies showing how such complete systems can be built.
+
+The program is open-source free software, which you can download via the
@@ -196,8 +357,8 @@ As a common convention in this Tutorial, we will use
Thus you should not type these prompts, but only the lines that follow them.
- -
Now you are ready to try out your first grammar.
We start with one that is not written in the GF language, but
@@ -220,7 +381,7 @@ It builds sentences (S) by assigning Qualities
they are small grammars describing some more or less well-defined
domain, such as in this case food.
The first GF command when using a grammar is to import it.
@@ -269,7 +430,7 @@ you imported. Try parsing something else, and you fail
  no tree found
-
+
You can also use GF for linearizing
@@ -300,7 +461,7 @@ a pipe.
  this fresh cheese is delicious
-
+
The gibberish code with parentheses returned by the parser does not
@@ -318,7 +479,7 @@ parsing (and any other tree-producing command) can be piped:
Random generation can be quite amusing. So you may want to
@@ -338,7 +499,7 @@ generate ten strings with one and the same command:
  this fish is boring
-
+
To generate all sentences that a grammar
@@ -369,7 +530,7 @@ You get quite a few trees but not all of them: only up to a given
Quiz. If the command gt generated all
trees in your grammar, it would never terminate. Why?
A pipe of GF commands can have any length, but the "output type"
@@ -393,7 +554,7 @@ This facility is good for test purposes: for instance, you
may want to see if a grammar is ambiguous, i.e. contains strings that can be
parsed in more than one way.
-
+
To save the outputs of GF commands into a file, you can
@@ -416,7 +577,7 @@ the file separately. Without the flag, the grammar could not
recognize the string in the file, because it is not a sentence but a sequence
of ten sentences.
-
+
The syntax trees returned by GF's parser in the previous examples
@@ -459,7 +620,7 @@ and so on. These labels are formed automatically when the grammar
is compiled by GF, in a way that guarantees that different rules get
different labels.
-
+
The labelled context-free grammar format permits user-defined
@@ -496,7 +657,7 @@ With this grammar, the trees look as follows:
To see what there is in GF's shell state when a grammar
@@ -521,7 +682,7 @@ one more way of defining the same grammar as in
Then we will show how the full GF grammar format enables you to do
things that are not possible in the weaker formats.
-
+
A GF grammar consists of two main parts:
@@ -556,7 +717,7 @@ The latter rule, with the keyword lin, belongs to the concrete synt
It defines the linearization function for
syntax trees of form (Is item quality).
Rules in a GF grammar are called judgements, and the keywords
@@ -612,7 +773,7 @@ First we will look at how judgements are grouped into
modules, and show how the paleolithic grammar is expressed by using modules
and judgements.
-
+
A GF grammar consists of modules,
@@ -626,7 +787,7 @@ module forms are
abstract syntax A, with judgements in the module body M.
-
+
The linearization type of a category is a record type, with
@@ -686,7 +847,7 @@ can be used for lists of tokens. The expression
denotes the empty token list.
- +
To express the abstract syntax of food.cf in
@@ -718,7 +879,7 @@ Notice the use of shorthands permitting the sharing of
the keyword in subsequent judgements, and of the type
in subsequent fun judgements.
Each category introduced in Food.gf is
@@ -750,7 +911,7 @@ apply as in abstract modules.
}
Module name + .gf = file name
@@ -777,7 +938,7 @@ GF source files. When reading a module, GF decides whether
to use an existing .gfc file or to generate
a new one, by looking at modification times.
The main advantage of separating abstract from concrete syntax is that
@@ -790,7 +951,7 @@ translation. Let us build an Italian concrete syntax for
Food and then test the resulting
multilingual grammar.
concrete FoodIta of Food = {
@@ -818,7 +979,7 @@ multilingual grammar.
-
+
Import the two grammars in the same GF session.
@@ -857,7 +1018,7 @@ To see what grammars are in scope and which is the main one,
use the command
  actual concretes : FoodIta FoodEng
-
+
If translation is what you want to do with a set of grammars, a convenient
@@ -880,7 +1041,7 @@ A dot . terminates the translation session.
>
This is a simple language exercise that can be automatically
@@ -920,9 +1081,9 @@ file for later use, by the command translation_list = tl
The number flag gives the number of sentences generated.
The module system of GF makes it possible to extend a
@@ -957,7 +1118,7 @@ be built for concrete syntaxes:
The effect of extension is that all of the contents of the extended
and extending module are put together.
-
+
Specialized vocabularies can be represented as small grammars that
@@ -992,7 +1153,7 @@ At this point, you would perhaps like to go back to
Food and take apart Wine to build a special
Drink module.
When you have created all the abstract syntaxes and
@@ -1019,7 +1180,7 @@ The graph uses
-
+
To document your grammar, you may want to print the
@@ -1047,9 +1208,9 @@ are available:
  > help -printer
-
+
In comparison to the .cf format, the .gf format still looks rather
@@ -1071,7 +1232,7 @@ changing parts, parameters. In functional programming languages, such as
Haskell, it is possible to share much more than in
languages such as C and Java.
GF is a functional programming language, not only in the sense that
@@ -1101,7 +1262,7 @@ its type, and an expression defining it. As for the syntax of the defining
expression, notice the lambda abstraction form \x -> t of
the function.
Operator definitions can be included in a concrete syntax.
@@ -1132,7 +1293,7 @@ Resource modules can extend other resource modules, in the
same way as modules of other types can extend modules of the same type.
Thus it is possible to build resource hierarchies.
-
+
Any number of resource modules can be
@@ -1170,7 +1331,7 @@ opened in a new version of FoodEng.
The same string operations could be used to write FoodIta
more concisely.
Using operations defined in resource modules is a
@@ -1182,7 +1343,7 @@ available through resource grammar modules, whose users only need
to pick the right operations and not to know their implementation
details.
-
+
Suppose we want to say, with the vocabulary included in
@@ -1217,7 +1378,7 @@ and many new expression forms.
We also need to generalize linearization types
from strings to more complex types.
-
+
We define the parameter type of number in English by
@@ -1258,7 +1419,7 @@ operator !. For instance,
is a selection, whose value is "cheeses".
All English common nouns are inflected in number, most of them in the
@@ -1292,7 +1453,7 @@ are written together to form one token. Thus, for instance,
  (regNoun "cheese").s ! Pl ---> "cheese" + "s" ---> "cheeses"
-
+
Some English nouns, such as mouse, are so irregular that
@@ -1333,7 +1494,7 @@ interface (i.e. the system of type signatures) that makes it
correct to use these functions in concrete modules. In programming
terms, Noun is then treated as an abstract datatype.
In addition to the completely regular noun paradigm regNoun,
@@ -1365,7 +1526,7 @@ The operator init belongs to a set of operations in the
resource module Prelude, which therefore has to be
opened so that init can be used.
It may be hard for the user of a resource morphology to pick the right
@@ -1395,7 +1556,7 @@ this, either use mkNoun or modify
regNoun so that the "y" case does not
apply if the second-last character is a vowel.
Expressions of the table form are built from lists of
@@ -1431,14 +1592,14 @@ programming languages are syntactic sugar for table selections:
case e of {...} === table {...} ! e
A common idiom is to
gather the oper and param definitions
needed for inflecting words in
a language into a morphology module. Here is a simple
-example, MorphoEng.
+example, MorphoEng.
--# -path=.:prelude
@@ -1482,7 +1643,7 @@ module depends on. The directory prelude is a subdirectory of
set the environment variable GF_LIB_PATH to point to this
directory.
-
+
Testing resource modules
To test a resource module independently, you can import it
@@ -1525,7 +1686,7 @@ Why does the command also show the operations that form
Verb is first computed, and its value happens to be
the same as the value of Noun.
-
+
Using morphology in concrete syntax
We can now enrich the concrete syntax definitions to
@@ -1536,7 +1697,7 @@ parameters and linearization types are different in
different languages - but this does not prevent the
use of a common abstract syntax.
-
+
Parametric vs. inherent features, agreement
The rule of subject-verb agreement in English says that the verb
@@ -1566,14 +1727,22 @@ the predication structure:
The following section will present
FoodsEng, assuming the abstract syntax Foods
that is similar to Food but also has the
-plural determiners All and Most.
+plural determiners These and Those.
The reader is invited to inspect the way in which agreement works in
the formation of sentences.
-
+
English concrete syntax with parameters
+
+The grammar uses both
+Prelude and
+MorphoEng.
+We will later see how to make the grammar even
+more high-level by using a resource grammar library
+and parametrized modules.
+
- --# -path=.:prelude
+ --# -path=.:resource:prelude
concrete FoodsEng of Foods = open Prelude, MorphoEng in {
@@ -1584,10 +1753,10 @@ the formation of sentences.
lin
Is item quality = ss (item.s ++ (mkVerb "are" "is").s ! item.n ++ quality.s) ;
- This = det Sg "this" ;
- That = det Sg "that" ;
- All = det Pl "all" ;
- Most = det Pl "most" ;
+ This = det Sg "this" ;
+ That = det Sg "that" ;
+ These = det Pl "these" ;
+ Those = det Pl "those" ;
QKind quality kind = {s = \\n => quality.s ++ kind.s ! n} ;
Wine = regNoun "wine" ;
Cheese = regNoun "cheese" ;
@@ -1609,7 +1778,7 @@ the formation of sentences.
}
-
+
Hierarchic parameter types
The reader familiar with a functional programming language such as
@@ -1638,20 +1807,31 @@ yields an accurate system of three adjectival forms.
param AdjForm = ASg Gender | APl ;
- param Gender = Uter | Neuter ;
+ param Gender = Utr | Neutr ;
-In pattern matching, a constructor can have patterns as arguments. For instance,
-the adjectival paradigm in which the two singular forms are the same, can be defined
+Here is an example of pattern matching, the paradigm of regular adjectives.
- oper plattAdj : Str -> AdjForm => Str = \x -> table {
- ASg _ => x ;
- APl => x + "a" ;
+ oper regAdj : Str -> AdjForm => Str = \fin -> table {
+ ASg Utr => fin ;
+ ASg Neutr => fin + "t" ;
+ APl => fin + "a" ;
+ }
+
+
+A constructor can have patterns as arguments. For instance,
+the adjectival paradigm in which the two singular forms are the same,
+can be defined
+
+
+ oper plattAdj : Str -> AdjForm => Str = \platt -> table {
+ ASg _ => platt ;
+ APl => platt + "a" ;
}
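As a sketch of how the two paradigms above could be tried out together, the definitions can be gathered into one resource module. The module name AdjSketch is hypothetical and not one of the tutorial's files, and the trailing semicolon before the closing brace of each table is left out here as a cautious choice.

```gf
-- AdjSketch.gf: a hypothetical module collecting the adjective
-- paradigms discussed above, so that they can be tested in isolation.
resource AdjSketch = {

  param Gender  = Utr | Neutr ;
  param AdjForm = ASg Gender | APl ;   -- three forms in all

  -- regular paradigm: one form per constructor pattern
  oper regAdj : Str -> AdjForm => Str = \fin -> table {
    ASg Utr   => fin ;
    ASg Neutr => fin + "t" ;
    APl       => fin + "a"
    } ;

  -- paradigm where both singular forms coincide:
  -- the wildcard _ matches any Gender argument of ASg
  oper plattAdj : Str -> AdjForm => Str = \platt -> table {
    ASg _ => platt ;
    APl   => platt + "a"
    } ;
}
```

Assuming the file is saved as AdjSketch.gf, it should be possible to import it with the -retain flag and then inspect single forms with the cc command, for instance with the expression `(regAdj "fin") ! (ASg Neutr)`.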
-
+
Morphological analysis and morphology quiz
Even though in GF morphology
@@ -1691,7 +1871,7 @@ file for later use, by the command morpho_list = ml
The number flag gives the number of exercises generated.
-
+
Discontinuous constituents
A linearization type may contain more strings than one.
@@ -1707,8 +1887,8 @@ type with two strings and not just one. The second judgement
shows how the constituents are separated by the object in complementization.
- lincat TV = {s : Number => Str ; s2 : Str} ;
- lin ComplTV tv obj = {s = \\n => tv.s ! n ++ obj.s ++ tv.s2} ;
+ lincat TV = {s : Number => Str ; part : Str} ;
+ lin PredTV tv obj = {s = \\n => tv.s ! n ++ obj.s ++ tv.part} ;
There is no restriction on the number of discontinuous constituents
@@ -1721,9 +1901,31 @@ the parsing and linearization commands only give reliable results
for categories whose linearization type has a unique Str valued
field labelled s.
-
+
More constructs for concrete syntax
-
+
+Local definitions
+
+Local definitions ("let expressions") are used in functional
+programming for two reasons: to structure the code into smaller
+expressions, and to avoid repeated computation of one and
+the same expression. Here is an example, from
+MorphoIta:
+
+
+ oper regNoun : Str -> Noun = \vino ->
+ let
+ vin = init vino ;
+ o = last vino
+ in
+ case o of {
+ "a" => mkNoun Fem vino (vin + "e") ;
+ "o" | "e" => mkNoun Masc vino (vin + "i") ;
+ _ => mkNoun Masc vino vino
+ } ;
+
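The regNoun paradigm above relies on definitions that are not shown in this excerpt. The following is a minimal sketch of a self-contained module around it, under several assumptions: that Noun is a record with an inflection table and an inherent gender, that mkNoun takes the gender followed by the singular and plural forms, and that both init and last are available from Prelude. The module name MorphoItaSketch and the supporting definitions are illustrative guesses, not the actual MorphoIta.

```gf
--# -path=.:prelude

-- MorphoItaSketch.gf: hypothetical supporting definitions for regNoun.
resource MorphoItaSketch = open Prelude in {

  param
    Number = Sg | Pl ;
    Gender = Masc | Fem ;

  oper
    -- assumed noun type: forms depend on number, gender is inherent
    Noun : Type = {s : Number => Str ; g : Gender} ;

    -- assumed worst-case constructor: gender, singular, plural
    mkNoun : Gender -> Str -> Str -> Noun = \g,sg,pl -> {
      s = table {Sg => sg ; Pl => pl} ;
      g = g
      } ;

    -- the paradigm from the text: vin and o are each computed once
    -- by the local definitions and then reused in the branches
    regNoun : Str -> Noun = \vino ->
      let
        vin = init vino ;   -- all but the last character
        o   = last vino     -- the last character
      in
      case o of {
        "a"       => mkNoun Fem  vino (vin + "e") ;
        "o" | "e" => mkNoun Masc vino (vin + "i") ;
        _         => mkNoun Masc vino vino
        } ;
}
```

Without the let expression, init vino would have to be written out in each branch where the stem is needed, which is the repetition that local definitions are meant to avoid.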
+
+
Free variation
Sometimes there are many alternative ways to define a concrete syntax.
@@ -1733,7 +1935,7 @@ are in free variation. The variants construct of GF can
be used to give a list of strings in free variation. For example,
- NegVerb verb = {s = variants {["does not"] ; "doesn't"} ++ verb.s} ;
+ NegVerb verb = {s = variants {["does not"] ; "doesn't"} ++ verb.s ! Pl} ;
An empty variant list
@@ -1751,7 +1953,7 @@ user of the library has no way to choose among the variants.
Moreover, even though variants admits lists of any type,
its semantics for complex types can cause surprises.
-
+
Record extension and subtyping
Record types and records can be extended with new fields. For instance,
@@ -1781,7 +1983,7 @@ be used whenever a verb is required.
Contravariance means that a function taking an R as argument
can also be applied to any object of a subtype T.
-
+
Tuples and product types
Product types and tuples are syntactic sugar for record types and records:
@@ -1793,7 +1995,7 @@ Product types and tuples are syntactic sugar for record types and records:
Thus the labels p1, p2,... are hard-coded.
-
+
Prefix-dependent choices
The construct exemplified in
@@ -1822,15 +2024,15 @@ This very example does not work in all situations: the prefix
} ;
-
+
GF has the following predefined categories in abstract syntax:
cat Int ; -- integers, e.g. 0, 5, 743145151019
- cat Float ; -- floats, e.g. 0.0, 3.1415926
- cat String ; -- strings, e.g. "", "foo", "123"
+ cat Float ; -- floats, e.g. 0.0, 3.1415926
+ cat String ; -- strings, e.g. "", "foo", "123"
The objects of each of these categories are literals
@@ -1845,31 +2047,31 @@ they can be used as arguments. For example:
  -- e.g. (StreetAddress 10 "Downing Street") : Address
-
+
See resource library documentation
-
+
See an example built this way
-
-
Transfer means noncompositional tree-transforming operations.
@@ -1888,9 +2090,9 @@ See the transfer language documentation
for more information.
-
+
Lexers and unlexers can be chosen from
@@ -1926,7 +2128,7 @@ Given by help -lexer, help -unlexer:
Issues:
@@ -1937,7 +2139,7 @@ Issues:
-mcfg vs. others
-
+
The speak_aloud = sa command sends a string to the speech
@@ -1967,7 +2169,7 @@ The method works only for grammars of English.
Both Flite and ATK are freely available through the links
above, but they are not distributed together with GF.
The
@@ -1984,18 +2186,18 @@ Here is a snapshot of the editor:
The grammars of the snapshot are from the Letter grammar package.
-
+
Forthcoming.
-
+
Other processes can communicate with the GF command interpreter,
and also with the GF syntax editor.
-
+
GF grammars can be used as parts of programs written in the
@@ -2007,15 +2209,15 @@ following languages. The links give more documentation.
A summary is given in the following chart of GF grammar compiler phases:
Formal and Informal Software Specifications,
@@ -2028,6 +2230,6 @@ English and German.
A simpler example will be explained here.
-
+