<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.sf.net">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>Grammatical Framework Tutorial</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<P ALIGN="center"><CENTER><H1>Grammatical Framework Tutorial</H1>
<FONT SIZE="4">
<I>Aarne Ranta</I><BR>
Draft, November 2007
</FONT></CENTER>
<P></P>
<HR NOSHADE SIZE=1>
<P></P>
<UL>
<LI><A HREF="#toc1">Getting started with GF</A>
<UL>
<LI><A HREF="#toc2">What GF is</A>
<LI><A HREF="#toc3">Getting the GF system</A>
<LI><A HREF="#toc4">Running the GF system</A>
<LI><A HREF="#toc5">A "Hello World" grammar</A>
<UL>
<LI><A HREF="#toc6">The program: abstract syntax and concrete syntaxes</A>
<LI><A HREF="#toc7">Using the grammar in the GF system</A>
</UL>
<LI><A HREF="#toc8">Using grammars from outside GF</A>
<LI><A HREF="#toc9">What else can be done with the grammar</A>
<LI><A HREF="#toc10">Summary of GF language features</A>
<UL>
<LI><A HREF="#toc11">Modules</A>
<LI><A HREF="#toc12">Judgements</A>
<LI><A HREF="#toc13">Types and terms</A>
<LI><A HREF="#toc14">Type checking</A>
</UL>
</UL>
<LI><A HREF="#toc15">Designing a grammar for complex phrases</A>
<UL>
<LI><A HREF="#toc16">The abstract syntax Food</A>
<LI><A HREF="#toc17">The concrete syntax FoodEng</A>
<LI><A HREF="#toc18">Commands for testing grammars</A>
<UL>
<LI><A HREF="#toc19">Generating trees and strings</A>
<LI><A HREF="#toc20">More on pipes; tracing</A>
<LI><A HREF="#toc21">Writing and reading files</A>
<LI><A HREF="#toc22">Visualizing trees</A>
<LI><A HREF="#toc23">System commands</A>
</UL>
<LI><A HREF="#toc24">An Italian concrete syntax</A>
<LI><A HREF="#toc25">Free variation</A>
<LI><A HREF="#toc26">More application of multilingual grammars</A>
<UL>
<LI><A HREF="#toc27">Multilingual treebanks</A>
<LI><A HREF="#toc28">Translation session</A>
<LI><A HREF="#toc29">Translation quiz</A>
<LI><A HREF="#toc30">Multilingual syntax editing</A>
</UL>
<LI><A HREF="#toc31">Context-free grammars and GF</A>
<UL>
<LI><A HREF="#toc32">The "cf" grammar format</A>
<LI><A HREF="#toc33">Restrictions of context-free grammars</A>
</UL>
<LI><A HREF="#toc34">Modules and files</A>
<LI><A HREF="#toc35">Using operations and resource modules</A>
<UL>
<LI><A HREF="#toc36">The golden rule of functional programming</A>
<LI><A HREF="#toc37">Operation definitions</A>
<LI><A HREF="#toc38">The <CODE>resource</CODE> module type</A>
<LI><A HREF="#toc39">Opening a resource</A>
<LI><A HREF="#toc40">Partial application</A>
<LI><A HREF="#toc41">Testing resource modules</A>
</UL>
<LI><A HREF="#toc42">Grammar architecture</A>
<UL>
<LI><A HREF="#toc43">Extending a grammar</A>
<LI><A HREF="#toc44">Multiple inheritance</A>
<LI><A HREF="#toc45">Visualizing module structure</A>
</UL>
<LI><A HREF="#toc46">Summary of GF language features</A>
<UL>
<LI><A HREF="#toc47">Modules</A>
<LI><A HREF="#toc48">Judgements</A>
<LI><A HREF="#toc49">Free variation</A>
<LI><A HREF="#toc50">The context-free grammar format</A>
<LI><A HREF="#toc51">Character encoding</A>
</UL>
</UL>
<LI><A HREF="#toc52">Grammars with parameters</A>
<UL>
<LI><A HREF="#toc53">The problem: words have to be inflected</A>
<LI><A HREF="#toc54">Parameters and tables</A>
<LI><A HREF="#toc55">Inflection tables and paradigms</A>
<LI><A HREF="#toc56">Using parameters in concrete syntax</A>
<UL>
<LI><A HREF="#toc57">Agreement</A>
<LI><A HREF="#toc58">Determiners</A>
<LI><A HREF="#toc59">Parametric vs. inherent features</A>
</UL>
<LI><A HREF="#toc60">An English concrete syntax for Foods with parameters</A>
<LI><A HREF="#toc61">More on inflection paradigms</A>
<UL>
<LI><A HREF="#toc62">Worst-case functions</A>
<LI><A HREF="#toc63">Intelligent paradigms</A>
<LI><A HREF="#toc64">Function types with variables</A>
<LI><A HREF="#toc65">Separating operation types and definitions</A>
<LI><A HREF="#toc66">Overloading of operations</A>
<LI><A HREF="#toc67">Morphological analysis and morphology quiz</A>
</UL>
<LI><A HREF="#toc68">The Italian Foods grammar</A>
<LI><A HREF="#toc69">Discontinuous constituents</A>
<LI><A HREF="#toc70">Strings at compile time vs. run time</A>
<LI><A HREF="#toc71">Summary of GF language features</A>
<UL>
<LI><A HREF="#toc72">Parameter and table types</A>
<LI><A HREF="#toc73">Pattern matching</A>
<LI><A HREF="#toc74">Overloading</A>
<LI><A HREF="#toc75">Local definitions</A>
<LI><A HREF="#toc76">Supplementary constructs</A>
</UL>
</UL>
<LI><A HREF="#toc77">Using the resource grammar library</A>
<UL>
<LI><A HREF="#toc78">The coverage of the library</A>
<LI><A HREF="#toc79">The structure of the library</A>
<UL>
<LI><A HREF="#toc80">Lexical vs. phrasal rules</A>
<LI><A HREF="#toc81">Lexical categories</A>
<LI><A HREF="#toc82">Lexical rules</A>
<LI><A HREF="#toc83">Phrasal categories</A>
</UL>
<LI><A HREF="#toc84">The resource API</A>
<LI><A HREF="#toc85">Example: English</A>
<LI><A HREF="#toc86">Functor implementation of multilingual grammars</A>
<LI><A HREF="#toc87">Interfaces and instances</A>
<LI><A HREF="#toc88">Adding languages to a functor implementation</A>
<LI><A HREF="#toc89">Division of labour revisited</A>
<LI><A HREF="#toc90">Restricted inheritance</A>
<LI><A HREF="#toc91">Grammar reuse</A>
<LI><A HREF="#toc92">Browsing the resource with GF commands</A>
<LI><A HREF="#toc93">An extended Foods grammar</A>
<UL>
<LI><A HREF="#toc94">Abstract syntax</A>
<LI><A HREF="#toc95">Linearization types</A>
<LI><A HREF="#toc96">Linearization rules</A>
</UL>
<LI><A HREF="#toc97">Tenses</A>
<LI><A HREF="#toc98">Summary of GF language features</A>
<UL>
<LI><A HREF="#toc99">Interfaces and instances</A>
<LI><A HREF="#toc100">Grammar reuse</A>
<LI><A HREF="#toc101">Functors</A>
<LI><A HREF="#toc102">Restricted inheritance</A>
</UL>
</UL>
<LI><A HREF="#toc103">Refining semantics in abstract syntax</A>
<UL>
<LI><A HREF="#toc104">GF as a logical framework</A>
<LI><A HREF="#toc105">Dependent types</A>
<LI><A HREF="#toc106">Polymorphism</A>
<UL>
<LI><A HREF="#toc107">Digression: dependent types in concrete syntax</A>
</UL>
<LI><A HREF="#toc108">Proof objects</A>
<UL>
<LI><A HREF="#toc109">Proof-carrying documents</A>
</UL>
<LI><A HREF="#toc110">Restricted polymorphism</A>
<LI><A HREF="#toc111">Variable bindings</A>
<LI><A HREF="#toc112">Semantic definitions</A>
<LI><A HREF="#toc113">Summary of GF language features</A>
<UL>
<LI><A HREF="#toc114">Judgements</A>
<LI><A HREF="#toc115">Dependent function types</A>
</UL>
</UL>
<LI><A HREF="#toc116">Grammars of formal languages</A>
<UL>
<LI><A HREF="#toc117">Arithmetic expressions</A>
<UL>
<LI><A HREF="#toc118">Abstract syntax</A>
<LI><A HREF="#toc119">Concrete syntax: a simple approach</A>
</UL>
<LI><A HREF="#toc120">Lexing and unlexing</A>
<LI><A HREF="#toc121">Precedence and fixity</A>
<LI><A HREF="#toc122">Code generation as linearization</A>
<LI><A HREF="#toc123">Speaking aloud arithmetic expressions</A>
<LI><A HREF="#toc124">Programs with variables</A>
<UL>
<LI><A HREF="#toc125">The concrete syntax of assignments</A>
<LI><A HREF="#toc126">A liberal syntax of variables</A>
</UL>
<LI><A HREF="#toc127">Conclusion</A>
<LI><A HREF="#toc128">Summary of GF language constructs</A>
<UL>
<LI><A HREF="#toc129">Lexers and unlexers</A>
<LI><A HREF="#toc130">Built-in abstract syntax types</A>
</UL>
</UL>
<LI><A HREF="#toc131">Embedded grammars</A>
<UL>
<LI><A HREF="#toc132">The portable grammar format</A>
<LI><A HREF="#toc133">The embedded interpreter and its API</A>
<LI><A HREF="#toc134">Embedded GF applications in Haskell</A>
<UL>
<LI><A HREF="#toc135">The EmbedAPI module</A>
<LI><A HREF="#toc136">First application: a translator</A>
<LI><A HREF="#toc137">A looping translator</A>
<LI><A HREF="#toc138">A question-answer system</A>
<LI><A HREF="#toc139">Exporting GF datatypes</A>
<LI><A HREF="#toc140">Putting it all together</A>
</UL>
<LI><A HREF="#toc141">Embedded GF applications in Java</A>
<UL>
<LI><A HREF="#toc142">Translets</A>
<LI><A HREF="#toc143">Dialogue systems</A>
</UL>
<LI><A HREF="#toc144">Language models for speech recognition</A>
<LI><A HREF="#toc145">Dependent types and spoken language models</A>
<UL>
<LI><A HREF="#toc146">Statistical language models</A>
</UL>
</UL>
</UL>
<P></P>
<HR NOSHADE SIZE=1>
<P></P>
<P>
<h2>Overview</h2>
</P>
<P>
This tutorial gives a hands-on introduction to grammar writing in GF.
It has been written for all programmers
who want to learn to write grammars in GF.
It will go through the programming concepts of GF, and also
explain, without presupposing them, the main ingredients of GF:
linguistics, functional programming, and type theory.
This knowledge will be introduced as a part of grammar writing
practice.
Thus the tutorial should be accessible to anyone who has some
previous experience with any programming language; the basics
of using computers are also presupposed, e.g. the use of
text editors and the management of files.
</P>
<P>
We start in <a href="#chaptwo">the second chapter</a>
by building a "Hello World" grammar, which covers greetings
in three languages: English (<I>hello world</I>),
Finnish (<I>terve maailma</I>), and Italian (<I>ciao mondo</I>).
This <B>multilingual grammar</B> is based on the most central idea of GF:
the distinction between <B>abstract syntax</B>
(the logical structure) and <B>concrete syntax</B> (the
sequence of words).
</P>
<P>
From the "Hello World" example, we proceed
in <a href="#chapthree">the third chapter</a>
to a larger grammar for the domain of food.
In this grammar, you can say things like
<center>
<I>this Italian cheese is delicious</I>
</center>
in English and Italian. This grammar illustrates how translation is
more than just replacement of words. For instance, the order of
words may have to be changed:
<center>
<I>Italian cheese</I>
</P>
<P>
<I>formaggio italiano</I>
</center>
Moreover, words can have different forms, and which forms
they have varies from language to language. For instance,
Italian adjectives usually have four forms where English
has just one:
<center>
<I>delicious</I> (<I>wine, wines, pizza, pizzas</I>)
</P>
<P>
<I>vino delizioso, vini deliziosi, pizza deliziosa, pizze deliziose</I>
</center>
The <B>morphology</B> of a language describes the
forms of its words, and the basics of implementing morphology and
integrating it with syntax are covered in <a href="#chapfour">the fourth chapter</a>.
</P>
<P>
The complete description of morphology and syntax in natural
languages is in GF preferably left to the <B>resource grammar library</B>.
Its use is therefore an important part of GF programming, and
it is covered in <a href="#chapfive">the fifth chapter</a>. How to contribute to resource
grammars as an author will only be covered in Part III;
however, the tutorial does go through all the
programming concepts of GF, including those involved in
resource grammars.
</P>
<P>
In addition to multilinguality, <B>semantics</B> is an important aspect of GF
grammars. The "purely linguistic" aspects (morphology and syntax) belong to
the concrete syntax part of GF, whereas semantics is expressed in the abstract
syntax. After the presentation of concrete syntax constructs, we proceed
in <a href="#chapsix">the sixth chapter</a> to the enrichment of abstract syntax with <B>dependent types</B>,
<B>variable bindings</B>, and <B>semantic definitions</B>.
<a href="#chapseven">The seventh chapter</a> concludes the tutorial with technical tips for implementing formal
languages. It will also illustrate the close relation between GF grammars
and compilers by actually implementing a small compiler from C-like statements
and expressions to machine code similar to that of the Java Virtual Machine.
</P>
<P>
English and Italian are used as example languages in many grammars.
Of course, we will not presuppose that the reader knows any Italian.
We have chosen Italian because it has a rich structure
that illustrates very well the capacities of GF.
Moreover, even those readers who don't know Italian will find many of
its words familiar, due to the Latin heritage.
The exercises will encourage the reader to
port the examples to other languages as well; in particular,
it should be instructive for the reader to look at her
own native language from the point of view of writing a grammar
implementation.
</P>
<P>
To learn how to write GF grammars is not the only goal of
this tutorial. We will also explain the most important
commands of the GF system, mostly in passing. With these commands,
simple application programs, such as translation and
quiz systems, can be built just by writing scripts for the
GF system. More complicated applications, such as natural-language
interfaces and dialogue systems, moreover require programming in
some general-purpose language; such applications are covered in <a href="#chapeight">the eighth chapter</a>.
</P>
<A NAME="toc1"></A>
<H1>Getting started with GF</H1>
<P>
<a name="chaptwo"></a>
</P>
<P>
In this chapter, we will introduce the GF system and write the first GF grammar,
a "Hello World" grammar. While extremely small, this grammar already illustrates
how GF can be used for the tasks of translation and multilingual
generation.
</P>
<A NAME="toc2"></A>
<H2>What GF is</H2>
<P>
We use the term GF for three different things:
</P>
<UL>
<LI>a <B>system</B> (computer program) used for working with grammars
<LI>a <B>programming language</B> in which grammars can be written
<LI>a <B>theory</B> about grammars and languages
</UL>
<P>
The relation between these things is obvious: the GF system is an implementation
of the GF programming language, which in turn is built on the ideas of the
GF theory. The main focus of this book is on the GF programming language.
We learn how grammars are written in this language. At the same time, we learn
the way of thinking in the GF theory. To make this all useful and fun, and
to encourage experimenting, we make the grammars run on a computer by
using the GF system.
</P>
<P>
A GF program is called a <B>grammar</B>. A grammar is, traditionally, a
definition of a language. From this definition, different language
processing components can be derived:
</P>
<UL>
<LI><B>parsing</B>: to analyse the language
<LI><B>linearization</B>: to generate the language
<LI><B>translation</B>: to analyse one language and generate another
</UL>
<P>
A GF grammar is thus a declarative program from which these
procedures can be automatically derived. In general, a GF grammar
is <B>multilingual</B>: it defines many languages and translations between them.
</P>
<A NAME="toc3"></A>
<H2>Getting the GF system</H2>
<P>
The GF system is open-source free software, which can be downloaded via the
GF Homepage:
<center>
<CODE>gf.digitalgrammars.com</CODE>
</center>
There you can download
</P>
<UL>
<LI>binaries for Linux, Mac OS X, and Windows
<LI>source code and documentation
<LI>grammar libraries and examples
</UL>
<P>
In particular, many of the examples in this book are included in the
subdirectory <CODE>examples/tutorial</CODE> of the source distribution package.
This directory is also available
<A HREF="http://digitalgrammars.com/gf/examples/tutorial">online</A>.
</P>
<P>
If you want to compile GF from source, you need a Haskell compiler.
To compile the interactive editor, you also need a Java compiler.
But normally you don't have to compile anything yourself, and you definitely
don't need to know Haskell or Java to use GF.
</P>
<P>
We are assuming the availability of a Unix shell. Linux and Mac OS X users
have it automatically, the latter under the name "terminal".
Windows users are recommended to install Cygwin, the free Unix shell for Windows.
</P>
<A NAME="toc4"></A>
<H2>Running the GF system</H2>
<P>
To start the GF system, assuming you have installed it, just type
<CODE>gf</CODE> in the Unix (or Cygwin) shell:
</P>
<PRE>
% gf
</PRE>
<P>
You will see GF's welcome message and the prompt <CODE>&gt;</CODE>.
The command
</P>
<PRE>
&gt; help
</PRE>
<P>
will give you a list of available commands.
</P>
<P>
As a common convention in this book, we will use
</P>
<UL>
<LI><CODE>%</CODE> as a prompt that marks system commands
<LI><CODE>&gt;</CODE> as a prompt that marks GF commands
</UL>
<P>
Thus you should not type these prompts, but only the characters that
follow them.
</P>
<A NAME="toc5"></A>
<H2>A "Hello World" grammar</H2>
<P>
The tradition in programming language tutorials is to start with a
program that prints "Hello World" on the terminal. GF should be no
exception. But our program has features that distinguish it from
most "Hello World" programs:
</P>
<UL>
<LI><B>Multilinguality</B>: the message is printed in many languages.
<LI><B>Reversibility</B>: in addition to printing, you can <B>parse</B> the
message and <B>translate</B> it to other languages.
</UL>
<A NAME="toc6"></A>
<H3>The program: abstract syntax and concrete syntaxes</H3>
<P>
A GF program, in general, is a <B>multilingual grammar</B>. Its main parts
are
</P>
<UL>
<LI>an <B>abstract syntax</B>
<LI>one or more <B>concrete syntaxes</B>
</UL>
<P>
The abstract syntax defines, in a language-independent way, what <B>meanings</B>
can be expressed in the grammar. In the "Hello World" grammar we want
to express <I>Greetings</I>, where we greet a <I>Recipient</I>, which can be
<I>World</I> or <I>Mum</I> or <I>Friends</I>. Here is the entire
GF code for the abstract syntax:
</P>
<PRE>
-- a "Hello World" grammar
abstract Hello = {
flags startcat = Greeting ;
cat Greeting ; Recipient ;
fun
Hello : Recipient -&gt; Greeting ;
World, Mum, Friends : Recipient ;
}
</PRE>
<P>
The code has the following parts:
</P>
<UL>
<LI>a <B>comment</B> (optional), saying what the module is doing
<LI>a <B>module header</B> indicating that it is an abstract syntax
module named <CODE>Hello</CODE>
<LI>a <B>module body</B> in braces, consisting of
<UL>
<LI>a <B>startcat flag declaration</B> stating that <CODE>Greeting</CODE> is the
main category, i.e. the one in which parsing and generation are
performed by default
<LI><B>category declarations</B> stating that <CODE>Greeting</CODE> and <CODE>Recipient</CODE>
are categories, i.e. types of meanings
<LI><B>function declarations</B> stating what meaning-building functions there
are; these are the function <CODE>Hello</CODE> constructing a greeting from a recipient,
as well as the three possible recipients
</UL>
</UL>
<P>
A concrete syntax defines a mapping from the abstract meanings to their
expressions in a language. We first give an English concrete syntax:
</P>
<PRE>
concrete HelloEng of Hello = {
lincat Greeting, Recipient = {s : Str} ;
lin
Hello recip = {s = "hello" ++ recip.s} ;
World = {s = "world"} ;
Mum = {s = "mum"} ;
Friends = {s = "friends"} ;
}
</PRE>
<P>
The major parts of this code are:
</P>
<UL>
<LI>a module header indicating that it is a concrete syntax of the abstract syntax
<CODE>Hello</CODE>, itself named <CODE>HelloEng</CODE>
<LI>a module body in curly brackets, consisting of
<UL>
<LI><B>linearization type definitions</B> stating that
<CODE>Greeting</CODE> and <CODE>Recipient</CODE> are <B>records</B> with a <B>string</B> <CODE>s</CODE>
<LI><B>linearization definitions</B> telling what records are assigned to
each of the meanings defined in the abstract syntax; the recipients are
linearized to records containing single words, whereas the <CODE>Hello</CODE> greeting
has a function telling that the word <CODE>hello</CODE> is prefixed to the string
<CODE>s</CODE> contained in the record <CODE>recip</CODE>
</UL>
</UL>
<P>
To make the grammar truly multilingual, we add a Finnish and an Italian concrete
syntax:
</P>
<PRE>
concrete HelloFin of Hello = {
lincat Greeting, Recipient = {s : Str} ;
lin
Hello recip = {s = "terve" ++ recip.s} ;
World = {s = "maailma"} ;
Mum = {s = "&auml;iti"} ;
Friends = {s = "yst&auml;v&auml;t"} ;
}
concrete HelloIta of Hello = {
lincat Greeting, Recipient = {s : Str} ;
lin
Hello recip = {s = "ciao" ++ recip.s} ;
World = {s = "mondo"} ;
Mum = {s = "mamma"} ;
Friends = {s = "amici"} ;
}
</PRE>
<P>
Now we have a trilingual grammar usable for translation and
many other tasks, which we will now start experimenting with.
</P>
<A NAME="toc7"></A>
<H3>Using the grammar in the GF system</H3>
<P>
In order to compile the grammar in GF, each of the four modules
has to be put into a file named <I>Modulename</I><CODE>.gf</CODE>:
</P>
<PRE>
Hello.gf HelloEng.gf HelloFin.gf HelloIta.gf
</PRE>
<P>
The first GF command needed when using a grammar is to <B>import</B> it.
The command has a long name, <CODE>import</CODE>, and a short name, <CODE>i</CODE>.
When you have started GF (by the shell command <CODE>gf</CODE>), you can thus type either
</P>
<PRE>
&gt; import HelloEng.gf
</PRE>
<P>
or
</P>
<PRE>
&gt; i HelloEng.gf
</PRE>
<P>
to get the same effect. In general, all GF commands have a long and a short name;
short names are convenient when typing commands by hand, whereas long command
names are more readable in scripts, i.e. files that include sequences of commands.
</P>
<P>
The effect of <CODE>import</CODE> is that the GF system <B>compiles</B> your grammar
into an internal representation, and shows a new prompt when it is ready.
It will also show how much CPU time was consumed:
</P>
<PRE>
&gt; i HelloEng.gf
- compiling Hello.gf... wrote file Hello.gfc 8 msec
- compiling HelloEng.gf... wrote file HelloEng.gfc 12 msec
12 msec
&gt;
</PRE>
<P>
You can now use GF for <B>parsing</B>:
</P>
<PRE>
&gt; parse "hello world"
Hello World
</PRE>
<P>
The <CODE>parse</CODE> (= <CODE>p</CODE>) command takes a <B>string</B>
(in double quotes) and returns an <B>abstract syntax tree</B> --- the meaning
of the string as defined in the abstract syntax.
A tree is, in general, something easier than a string
for a machine to understand and to process further, although this
is not so obvious in this simple grammar. The syntax for trees is that
of <B>function application</B>, which in GF is written
</P>
<PRE>
function argument1 ... argumentn
</PRE>
<P>
Parentheses are only needed for grouping. For instance, <CODE>f (a b)</CODE> is
<CODE>f</CODE> applied to the application of <CODE>a</CODE> to <CODE>b</CODE>. This is different
from <CODE>f a b</CODE>, which is <CODE>f</CODE> applied to <CODE>a</CODE> and <CODE>b</CODE>.
</P>
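<P>
To see the difference in a concrete case, assume that the <CODE>Hello</CODE> abstract syntax
were extended with a function <CODE>Dear</CODE>. This function is hypothetical and serves
only as an illustration; it is not a part of the grammar above. The following sketch shows
where parentheses would then be needed:
</P>
<PRE>
-- hypothetical addition to Hello.gf
fun Dear : Recipient -&gt; Recipient ;

-- trees (not GF source code), with explanations as comments
Hello (Dear Mum)   -- Hello applied to the tree Dear Mum
Hello Dear Mum     -- Hello applied to two arguments, Dear and Mum; not well-typed
</PRE>
<P>
Only the first tree is a greeting, since <CODE>Hello</CODE> takes exactly one argument,
of type <CODE>Recipient</CODE>.
</P>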
<P>
Strings that return a tree when parsed do so in virtue of the grammar
you imported. If you try to parse something that is not in the grammar, parsing fails:
</P>
<PRE>
&gt; parse "hello dad"
Unknown words: dad
&gt; parse "world hello"
no tree found
</PRE>
<P>
In the first example, the failure is caused by an unknown word.
In the second example, the combination of words is ungrammatical.
</P>
<P>
In addition to parsing, you can also use GF for <B>linearization</B>
(<CODE>linearize = l</CODE>). This is the inverse of
parsing, taking trees into strings:
</P>
<PRE>
&gt; linearize Hello World
hello world
</PRE>
<P>
What is the use of this? Typically not that you type in a tree at
the GF prompt. The utility of linearization comes from the fact that
you can obtain a tree from somewhere else --- for instance, from
a parser. A prime example of this is <B>translation</B>: you parse
with one concrete syntax and linearize with another. Let us
now do this by first importing the Italian grammar:
</P>
<PRE>
&gt; import HelloIta.gf
</PRE>
<P>
We can now parse with <CODE>HelloEng</CODE> and <B>pipe</B> the result
into linearizing with <CODE>HelloIta</CODE>:
</P>
<PRE>
&gt; parse -lang=HelloEng "hello mum" | linearize -lang=HelloIta
ciao mamma
</PRE>
<P>
Notice that, since there are now two concrete syntaxes read into the
system, the commands use a <B>language flag</B> to indicate
which concrete syntax is used in each operation. If no language flag is
given, the last-imported language is applied.
</P>
<P>
To conclude the translation exercise, we import the Finnish grammar
and pipe English parsing into <B>multilingual generation</B>:
</P>
<PRE>
&gt; parse -lang=HelloEng "hello friends" | linearize -multi
terve yst&auml;v&auml;t
ciao amici
hello friends
</PRE>
<P></P>
<P>
<B>Exercise</B>. Test the parsing and translation examples shown above, as well as
some other examples, in different combinations of languages.
</P>
<P>
<B>Exercise</B>. Extend the grammar <CODE>Hello.gf</CODE> and some of the
concrete syntaxes by five new recipients and one new greeting
form.
</P>
<P>
<B>Exercise</B>. Add a concrete syntax for some other
language you know.
</P>
<P>
<B>Exercise</B>. Add a pair of greetings that are expressed in one and the same way in
one language and in two different ways in another. For instance, <I>good morning</I>
and <I>good afternoon</I> in English are both expressed as <I>buongiorno</I> in Italian.
Test what happens when you translate <I>buongiorno</I> to English in GF.
</P>
<P>
<B>Exercise</B>. Inject errors in the <CODE>Hello</CODE> grammars, for example, leave out
some line, omit a variable in a <CODE>lin</CODE> rule, or change the name in one occurrence
of a variable. Inspect the error messages generated by GF.
</P>
<A NAME="toc8"></A>
<H2>Using grammars from outside GF</H2>
<P>
A normal "hello world" program written in C is executable from the
Unix shell and prints its output on the terminal. This is possible in GF
as well, by using the <CODE>gf</CODE> program in a Unix pipe. The <CODE>gf</CODE> program
can be invoked with grammar names as arguments,
</P>
<PRE>
% gf HelloEng.gf HelloFin.gf HelloIta.gf
</PRE>
<P>
which has the same effect as opening <CODE>gf</CODE> and then importing the
grammars. A command can be sent to this <CODE>gf</CODE> state by piping it from
Unix's <CODE>echo</CODE> command:
</P>
<PRE>
% echo "l -multi Hello World" | gf HelloEng.gf HelloFin.gf HelloIta.gf
</PRE>
<P>
which will execute the command and then quit. Alternatively, one can write
a <B>script</B>, a file containing the lines
</P>
<PRE>
import HelloEng.gf
import HelloFin.gf
import HelloIta.gf
linearize -multi Hello World
</PRE>
<P>
If we name this script <CODE>hello.gfs</CODE>, we can do
</P>
<PRE>
% gf -batch -s &lt;hello.gfs
ciao mondo
terve maailma
hello world
</PRE>
<P>
The options <CODE>-batch</CODE> and <CODE>-s</CODE> ("silent") remove prompts, CPU time,
and other messages. Writing GF scripts and Unix shell scripts that call
GF is the simplest way to build application programs that use GF grammars.
In <a href="#chapeight">the eighth chapter</a>, we will see how to build stand-alone programs that don't need
the GF system to run.
</P>
<P>
<B>Exercise</B>. (For Unix hackers.) Write a GF application that reads
an English string from the standard input and writes an Italian
translation to the output.
</P>
<A NAME="toc9"></A>
<H2>What else can be done with the grammar</H2>
<P>
Now we have built our first multilingual grammar and seen the basic
functionalities of GF: parsing and linearization. We have tested
these functionalities inside the GF program. In the forthcoming
chapters, we will build larger grammars and can then get more out of
these functionalities. But we will also introduce new ones:
</P>
<UL>
<LI><B>morphological analysis</B>: find out the possible inflection forms of words
<LI><B>morphological synthesis</B>: generate all inflection forms of words
<LI><B>random generation</B>: generate random expressions
<LI><B>corpus generation</B>: generate all expressions
<LI><B>treebank generation</B>: generate a list of trees with their linearizations
<LI><B>teaching quizzes</B>: train morphology and translation
<LI><B>multilingual authoring</B>: create a document in many languages simultaneously
<LI><B>speech input</B>: optimize a speech recognition system for a grammar
</UL>
<P>
The usefulness of GF would be quite limited if grammars were
usable only inside the GF system. In <a href="#chapeight">the eighth chapter</a>,
we will see other ways of using grammars:
</P>
<UL>
<LI>compile them to new formats, such as speech recognition grammars
<LI>embed them in Java and Haskell programs
<LI>build applications using compilation and embedding:
<UL>
<LI>voice commands
<LI>spoken language translators
<LI>dialogue systems
<LI>user interfaces
<LI>localization: parametrize the messages printed by a program
to support different languages
</UL>
</UL>
<P>
All GF functionalities, both those inside the GF program and those
ported to other environments,
are of course already applicable to the simplest of grammars,
such as the <CODE>Hello</CODE> grammars presented above. But the main focus
of this tutorial will be on grammar writing. Thus we will show
how larger and more expressive grammars can be built by using
the constructs of the GF programming language, before entering the
applications.
</P>
<A NAME="toc10"></A>
<H2>Summary of GF language features</H2>
<P>
As the last section of each chapter, we will give a summary of the GF language
features covered in the chapter. The presentation is rather technical and intended
as a reference for later use, rather than to be read at once. The summaries
may cover some new features, which complement the discussion in the main chapter.
</P>
<A NAME="toc11"></A>
<H3>Modules</H3>
<P>
A GF grammar consists of <B>modules</B>,
into which judgements are grouped. The most important
module forms are
</P>
<UL>
<LI><CODE>abstract</CODE> A <CODE>= {...}</CODE> , abstract syntax A with judgements in
the <B>module body</B> <CODE>{...}</CODE>.
<LI><CODE>concrete</CODE> C <CODE>of</CODE> A <CODE>= {...}</CODE>, concrete syntax C of the
abstract syntax A, with judgements in the module body <CODE>{...}</CODE>.
</UL>
<P>
Each module is written in a file named <I>Modulename</I><CODE>.gf</CODE>.
</P>
<A NAME="toc12"></A>
<H3>Judgements</H3>
<P>
<a name="secjment"></a>
</P>
<P>
Rules in a module body are called <B>judgements</B>. Keywords such as
<CODE>fun</CODE> and <CODE>lin</CODE> are used for distinguishing between
<B>judgement forms</B>. Here is a summary of the most important
judgement forms, which we have considered by now:
</P>
<TABLE ALIGN="center" CELLPADDING="4" BORDER="1">
<TR>
<TH>form</TH>
<TH>reading</TH>
<TH>module type</TH>
</TR>
<TR>
<TD><CODE>cat</CODE> <I>C</I></TD>
<TD><I>C</I> is a category</TD>
<TD>abstract</TD>
</TR>
<TR>
<TD><CODE>fun</CODE> <I>f</I> <CODE>:</CODE> <I>A</I></TD>
<TD><I>f</I> is a function of type <I>A</I></TD>
<TD>abstract</TD>
</TR>
<TR>
<TD><CODE>lincat</CODE> <I>C</I> <CODE>=</CODE> <I>T</I></TD>
<TD>category <I>C</I> has linearization type <I>T</I></TD>
<TD>concrete</TD>
</TR>
<TR>
<TD><CODE>lin</CODE> <I>f <i>x</i><sub>1</sub> ... <i>x</i><sub>n</sub></I> <CODE>=</CODE> <I>t</I></TD>
<TD>function <I>f</I> has linearization <I>t</I></TD>
<TD>concrete</TD>
</TR>
<TR>
<TD><CODE>flags</CODE> <I>p</I> <CODE>=</CODE> <I>v</I></TD>
<TD>flag <I>p</I> has value <I>v</I></TD>
<TD>any</TD>
</TR>
</TABLE>
<P></P>
<P>
Both abstract and concrete modules may moreover contain <B>comments</B> of the forms
</P>
<UL>
<LI><CODE>--</CODE> <I>anything until a newline</I>
<LI><CODE>{-</CODE> <I>anything except hyphen followed by closing brace</I> <CODE>-}</CODE>
</UL>
<P>
Judgements are terminated by semicolons. Shorthands permit the sharing of
the keyword in subsequent judgements,
</P>
<PRE>
cat C ; D ; === cat C ; cat D ;
</PRE>
<P>
and of the right-hand-side in subsequent judgements of the same form
</P>
<PRE>
fun f, g : A ; === fun f : A ; g : A ;
</PRE>
<P>
We will use the symbol <CODE>===</CODE> to indicate <B>syntactic sugar</B> when
speaking about GF. Thus it is not a symbol of the GF language.
</P>
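<P>
As an illustration, the abstract syntax <CODE>Hello</CODE> of this chapter uses both
shorthands. Written out without them, it would read as follows; this is only a sketch
of the expansion, and it is meant to define the same grammar as the original module:
</P>
<PRE>
abstract Hello = {
flags startcat = Greeting ;
cat Greeting ;                        -- from: cat Greeting ; Recipient ;
cat Recipient ;
fun Hello : Recipient -&gt; Greeting ;
fun World : Recipient ;               -- from: World, Mum, Friends : Recipient ;
fun Mum : Recipient ;
fun Friends : Recipient ;
}
</PRE>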
<P>
Each judgement declares a <B>name</B>, which is an <B>identifier</B>.
An identifier is a letter followed by a sequence of letters, digits, and
characters <CODE>'</CODE> or <CODE>_</CODE>. Each identifier can only be
defined once in the same module (that is, in the position next to the judgement keyword;
local variables such as those in <CODE>lin</CODE> judgements can be
reused in other judgements).
</P>
<P>
Names are in <B>scope</B> in the rest of the module, i.e. usable in the other
judgements of the module (subject to type restrictions, of course). Also
the name of the module is an identifier in scope.
</P>
<P>
The order of judgements in a module is free. In particular, an identifier
need not be declared before it is used.
</P>
<A NAME="toc13"></A>
<H3>Types and terms</H3>
<P>
A <B>type</B> in an abstract syntax is either a <B>basic type</B>,
i.e. one introduced in a <CODE>cat</CODE> judgement, or a
<B>function type</B> of the form
</P>
<PRE>
A1 -&gt; ... -&gt; An -&gt; A
</PRE>
<P>
where each of <CODE>A1, ..., An, A</CODE> is a basic type.
The last type in the arrow-separated sequence
is the <B>value type</B> of the function type, and the earlier types are
its <B>argument types</B>.
</P>
<P>
In a concrete syntax, the available types include
</P>
<UL>
<LI>the type of <B>token lists</B>, <CODE>Str</CODE>
<LI><B>record types</B> of form <CODE>{</CODE> r1 : T1 ; ... ; rn : Tn <CODE>}</CODE>
</UL>
<P>
Token lists are often briefly called <B>strings</B>.
</P>
<P>
Each semi-colon separated part in a record type is called a
<B>field</B>. The identifier introduced by the left-hand-side of a field
is called a <B>label</B>.
</P>
<P>
A <B>term</B> in abstract syntax is a <B>function application</B> of form
</P>
<PRE>
f a1 ... an
</PRE>
<P>
where <CODE>f</CODE> is a function declared in a <CODE>fun</CODE> judgement and <CODE>a1 ... an</CODE>
are terms. These terms are also called <B>abstract syntax trees</B>, or just
<B>trees</B>.
The tree above is well-typed and has the type <CODE>A</CODE>, if
</P>
<PRE>
f : A1 -&gt; ... -&gt; An -&gt; A
</PRE>
<P>
and each <CODE>ai</CODE> has type <CODE>Ai</CODE>.
</P>
<P>
A term used in concrete syntax has one of the forms
</P>
<UL>
<LI>quoted string: <CODE>"foo"</CODE>, of type <CODE>Str</CODE>
<LI>concatenation of strings: <CODE>"foo" ++ "bar"</CODE>, of type <CODE>Str</CODE>
<LI>record: <CODE>{</CODE> r1 = t1 ; ... ; rn = tn <CODE>}</CODE>,
of type <CODE>{</CODE> r1 : T1 ; ... ; rn : Tn <CODE>}</CODE>
<LI>projection <CODE>t.r</CODE> of a term <CODE>t</CODE> that has a record type,
with the record label <CODE>r</CODE>; the projection has the corresponding record
field type
<LI>argument variable <CODE>x</CODE> bound by the left-hand-side of a <CODE>lin</CODE> rule,
of the corresponding linearization type
</UL>
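<P>
As a minimal sketch of how these forms combine, the following rule uses all of them at once.
The function <CODE>Pred</CODE> and its variables <CODE>subj</CODE> and <CODE>obj</CODE> are hypothetical names;
both argument categories and the value category are assumed to have the linearization type <CODE>{s : Str}</CODE>.
</P>
<PRE>
-- hypothetical rule: Pred, subj and obj are illustrative names only
lin Pred subj obj =
  {s = subj.s ++ "loves" ++ obj.s} ;  -- a record whose field s concatenates
                                      -- a projection, a quoted string, and a projection
</PRE>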
<P>
Each quoted string is treated as one <B>token</B>, and strings concatenated by
<CODE>++</CODE> are treated as separate tokens. Tokens are, by default, written with
a space in between. This behaviour can be changed by <CODE>lexer</CODE> and <CODE>unlexer</CODE>
flags, as will be explained later. Therefore it is usually
not correct to have a space in a token. Writing
</P>
<PRE>
"hello world"
</PRE>
<P>
in a grammar would give the parser the task of finding a token with a space
in it, rather than two tokens <CODE>"hello"</CODE> and <CODE>"world"</CODE>. If the latter is
what is meant, it is possible to use the shorthand
</P>
<PRE>
["hello world"] === "hello" ++ "world"
</PRE>
<P>
The <B>empty string</B> is denoted by <CODE>[]</CODE> or, equivalently, <CODE>""</CODE> or <CODE>[""]</CODE>.
</P>
<A NAME="toc14"></A>
<H3>Type checking</H3>
<P>
An important functionality of the GF system is <B>static type checking</B>.
This means that grammars are checked for well-formedness before they are used, so that all
run-time errors are eliminated. The main type checking principles are the
following:
</P>
<UL>
<LI>a concrete syntax must define the <CODE>lincat</CODE> of each <CODE>cat</CODE> and a <CODE>lin</CODE>
for each <CODE>fun</CODE> in the abstract syntax that it is "<CODE>of</CODE>"
<LI><CODE>lin</CODE> rules are type checked with respect to the <CODE>lincat</CODE> and <CODE>fun</CODE>
rules
<LI>terms have types as defined in the previous section
</UL>
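<P>
The following sketch illustrates these principles on a deliberately tiny grammar.
The module names <CODE>Greet</CODE> and <CODE>GreetEng</CODE> and all identifiers in them are hypothetical,
chosen only for this illustration; the commented-out alternatives show rules that the type checker
would be expected to reject.
</P>
<PRE>
-- hypothetical abstract syntax
abstract Greet = {
  cat Greeting ; Recipient ;
  fun Hello : Recipient -&gt; Greeting ;
  fun World : Recipient ;
}

-- hypothetical concrete syntax: a lincat for each cat, a lin for each fun
concrete GreetEng of Greet = {
  lincat Greeting, Recipient = {s : Str} ;
  lin Hello recip = {s = "hello" ++ recip.s} ;  -- accepted: a record of type {s : Str}
  lin World = {s = "world"} ;
  -- lin Hello recip = "hello" ++ recip.s ;      -- rejected: a Str where {s : Str} is required
  -- lin Hello recip = {s = "hello" ++ recip} ;  -- rejected: recip is a record, not a Str
}
</PRE>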
<A NAME="toc15"></A>
<H1>Designing a grammar for complex phrases</H1>
<P>
<a name="chapthree"></a>
</P>
<P>
In this chapter, we will write a grammar that has much more structure than
the <CODE>Hello</CODE> grammar. We will look at how the abstract syntax
is divided into suitable categories, and how infinitely many
phrases can be generated by using recursive rules. We will also
introduce modularity by showing how a grammar can be
divided into modules, and how functional programming
can be used to share code in and among modules.
</P>
<A NAME="toc16"></A>
<H2>The abstract syntax Food</H2>
<P>
We will write a grammar that
defines a set of phrases usable for speaking about food:
</P>
<UL>
<LI>the start category is <CODE>Phrase</CODE>
<LI>a <CODE>Phrase</CODE> can be built by assigning a <CODE>Quality</CODE> to an <CODE>Item</CODE>
(e.g. <I>this cheese is Italian</I>)
<LI>an <CODE>Item</CODE> is built from a <CODE>Kind</CODE> by prefixing <I>this</I> or <I>that</I>
(e.g. <I>this wine</I>)
<LI>a <CODE>Kind</CODE> is either <B>atomic</B> (e.g. <I>cheese</I>), or formed by
qualifying a given <CODE>Kind</CODE> with a <CODE>Quality</CODE> (e.g. <I>Italian cheese</I>)
<LI>a <CODE>Quality</CODE> is either atomic (e.g. <I>Italian</I>),
or built by modifying a given <CODE>Quality</CODE> with the word <I>very</I> (e.g. <I>very warm</I>)
</UL>
<P>
These verbal descriptions can be expressed as the following abstract syntax:
</P>
<PRE>
abstract Food = {
flags startcat = Phrase ;
cat
Phrase ; Item ; Kind ; Quality ;
fun
Is : Item -&gt; Quality -&gt; Phrase ;
This, That : Kind -&gt; Item ;
QKind : Quality -&gt; Kind -&gt; Kind ;
Wine, Cheese, Fish : Kind ;
Very : Quality -&gt; Quality ;
Fresh, Warm, Italian, Expensive, Delicious, Boring : Quality ;
}
</PRE>
<P>
In this abstract syntax, we can build <CODE>Phrase</CODE>s such as
</P>
<PRE>
Is (This (QKind Delicious (QKind Italian Wine))) (Very (Very Expensive))
</PRE>
<P>
In the English concrete syntax, we will want to linearize this into
</P>
<PRE>
this delicious Italian wine is very very expensive
</PRE>
<P></P>
<A NAME="toc17"></A>
<H2>The concrete syntax FoodEng</H2>
<P>
The English concrete syntax holds no surprises:
</P>
<PRE>
concrete FoodEng of Food = {
lincat
Phrase, Item, Kind, Quality = {s : Str} ;
lin
Is item quality = {s = item.s ++ "is" ++ quality.s} ;
This kind = {s = "this" ++ kind.s} ;
That kind = {s = "that" ++ kind.s} ;
QKind quality kind = {s = quality.s ++ kind.s} ;
Wine = {s = "wine"} ;
Cheese = {s = "cheese"} ;
Fish = {s = "fish"} ;
Very quality = {s = "very" ++ quality.s} ;
Fresh = {s = "fresh"} ;
Warm = {s = "warm"} ;
Italian = {s = "Italian"} ;
Expensive = {s = "expensive"} ;
Delicious = {s = "delicious"} ;
Boring = {s = "boring"} ;
}
</PRE>
<P>
Let us test how the grammar works in parsing:
</P>
<PRE>
&gt; import FoodEng.gf
&gt; parse "this delicious wine is very very Italian"
Is (This (QKind Delicious Wine)) (Very (Very Italian))
</PRE>
<P>
We can also try parsing in other categories than the <CODE>startcat</CODE>,
by setting the command-line <CODE>cat</CODE> flag:
</P>
<PRE>
&gt; p -cat=Kind "very Italian wine"
QKind (Very Italian) Wine
</PRE>
<P></P>
<P>
<B>Exercise</B>. Extend the <CODE>Food</CODE> grammar by ten new food kinds and
qualities, and run the parser with new kinds of examples.
</P>
<P>
<B>Exercise</B>. Add a rule that enables question phrases of the form
<I>is this cheese Italian</I>.
</P>
<P>
<B>Exercise</B>. Enable the optional prefixing of
phrases with the words "excuse me but". Do this in such a way that
the prefix can occur at most once.
</P>
<A NAME="toc18"></A>
<H2>Commands for testing grammars</H2>
<A NAME="toc19"></A>
<H3>Generating trees and strings</H3>
<P>
When we have a grammar above a trivial size, especially a recursive
one, we need more efficient ways of testing it than just by parsing
sentences that happen to come to our minds. One way to do this is
based on automatic generation, which can be either
<B>random generation</B> or <B>exhaustive generation</B>.
</P>
<P>
Random generation (<CODE>generate_random = gr</CODE>) is an operation that
builds a random tree in accordance with an abstract syntax:
</P>
<PRE>
&gt; generate_random
Is (This (QKind Italian Fish)) Fresh
</PRE>
<P>
By using a pipe, random generation can be fed into linearization:
</P>
<PRE>
&gt; generate_random | linearize
this Italian fish is fresh
</PRE>
<P>
Random generation is a good way to test a grammar. It can also give results
that are surprising, which shows how fast we lose intuition
when we write complex grammars.
</P>
<P>
By using the <CODE>number</CODE> flag, several trees can be generated
in one command:
</P>
<PRE>
&gt; gr -number=10 | l
that wine is boring
that fresh cheese is fresh
that cheese is very boring
this cheese is Italian
that expensive cheese is expensive
that fish is fresh
that wine is very Italian
this wine is Italian
this cheese is boring
this fish is boring
</PRE>
<P>
To generate <I>all</I> phrases that a grammar can produce,
GF provides the command <CODE>generate_trees = gt</CODE>.
</P>
<PRE>
&gt; generate_trees | l
that cheese is very Italian
that cheese is very boring
that cheese is very delicious
that cheese is very expensive
that cheese is very fresh
...
this wine is expensive
this wine is fresh
this wine is warm
</PRE>
<P>
We get quite a few trees but not all of them: only up to a given
<B>depth</B> of trees. The default depth is 3; the depth can be
set by using the <CODE>depth</CODE> flag:
</P>
<PRE>
&gt; generate_trees -depth=5 | l
</PRE>
<P>
Other options to the generation commands (like all commands) can be seen
with GF's <CODE>help = h</CODE> command:
</P>
<PRE>
&gt; help gr
&gt; help gt
</PRE>
<P></P>
<P>
<B>Exercise</B>. If the command <CODE>gt</CODE> generated all
trees in your grammar, it would never terminate. Why?
</P>
<P>
<B>Exercise</B>. Measure how many trees the grammar gives with depths 4 and 5,
respectively. <B>Hint</B>. You can
use the Unix <B>word count</B> command <CODE>wc</CODE> to count lines.
</P>
<A NAME="toc20"></A>
<H3>More on pipes; tracing</H3>
<P>
A pipe of GF commands can have any length, but the "output type"
(either string or tree) of one command must always match the "input type"
of the next command, in order for the result to make sense.
</P>
<P>
The intermediate results in a pipe can be observed by putting the
<B>tracing</B> option <CODE>-tr</CODE> to each command whose output you
want to see:
</P>
<PRE>
&gt; gr -tr | l -tr | p
Is (This Cheese) Boring
this cheese is boring
Is (This Cheese) Boring
</PRE>
<P>
This facility is useful for test purposes: the pipe above can show
if a grammar is <B>ambiguous</B>, i.e.
contains strings that can be parsed in more than one way.
</P>
<P>
<B>Exercise</B>. Extend the <CODE>Food</CODE> grammar so that it produces ambiguous
strings, and try out the ambiguity test.
</P>
<A NAME="toc21"></A>
<H3>Writing and reading files</H3>
<P>
To save the output of GF commands into a file, you can
pipe it to the <CODE>write_file = wf</CODE> command,
</P>
<PRE>
&gt; gr -number=10 | linearize | write_file exx.tmp
</PRE>
<P>
You can read the file back to GF with the
<CODE>read_file = rf</CODE> command,
</P>
<PRE>
&gt; read_file exx.tmp | parse -lines
</PRE>
<P>
Notice the flag <CODE>-lines</CODE> given to the parsing
command. This flag tells GF to parse each line of
the file separately. Without the flag, the grammar could
not recognize the string in the file, because it is not
a sentence but a sequence of ten sentences.
</P>
<P>
Files with examples can be used for <B>regression testing</B>
of grammars. The most systematic way to do this is by
generating treebanks; see <a href="#sectreebank">here</a>.
</P>
<A NAME="toc22"></A>
<H3>Visualizing trees</H3>
<P>
The gibberish code with parentheses returned by the parser does not
look like trees. Why is it called so? From the abstract mathematical
point of view, trees are a data structure that
represents <B>nesting</B>: trees are branching entities, and the branches
are themselves trees. Parentheses give a linear representation of trees,
useful for the computer. But the human eye may prefer to see a visualization;
for this purpose, GF provides the command <CODE>visualize_tree = vt</CODE>, to which
parsing (and any other tree-producing command) can be piped:
</P>
<PRE>
&gt; parse "this delicious cheese is very Italian" | visualize_tree
</PRE>
<P></P>
<P>
<IMG ALIGN="middle" SRC="mytree.png" BORDER="0" ALT="">
</P>
<P>
This command uses the programs Graphviz and Ghostview, which you
might not have, but which are freely available on the web.
</P>
<P>
Alternatively, you can print the tree into a file,
e.g. a <CODE>.png</CODE> file that
can be viewed with e.g. an HTML browser and also included in an
HTML document. You can do this
by saving the file <CODE>grphtmp.dot</CODE>, which the command <CODE>vt</CODE>
produces. Then you can process this file with the <CODE>dot</CODE>
program (from the Graphviz package).
</P>
<PRE>
% dot -Tpng grphtmp.dot &gt; mytree.png
</PRE>
<P></P>
<A NAME="toc23"></A>
<H3>System commands</H3>
<P>
If you don't have Ghostview, or want to view graphs in some other way,
you can call <CODE>dot</CODE> and a suitable
viewer (e.g. <CODE>open</CODE> in Mac) without leaving GF, by using
a <B>system command</B>: <CODE>!</CODE> followed by a Unix command,
</P>
<PRE>
&gt; ! dot -Tpng grphtmp.dot &gt; mytree.png
&gt; ! open mytree.png
</PRE>
<P>
Another form of system command receives its input from
a GF pipe. The escape symbol
is then <CODE>?</CODE>.
</P>
<PRE>
&gt; generate_trees | ? wc
</PRE>
<P></P>
<P>
<B>Exercise</B>. (Exercise from 3.3.1 revisited.)
Measure how many trees the grammar <CODE>FoodEng</CODE> gives with depths 4 and 5,
respectively. Use the Unix <B>word count</B> command <CODE>wc</CODE> to count lines, and
a pipe from a GF command into a Unix command.
</P>
<A NAME="toc24"></A>
<H2>An Italian concrete syntax</H2>
<P>
<a name="secanitalian"></a>
</P>
<P>
We write the Italian grammar in a straightforward way, by replacing
English words with their dictionary equivalents:
</P>
<PRE>
concrete FoodIta of Food = {
lincat
Phrase, Item, Kind, Quality = {s : Str} ;
lin
Is item quality = {s = item.s ++ "&egrave;" ++ quality.s} ;
This kind = {s = "questo" ++ kind.s} ;
That kind = {s = "quello" ++ kind.s} ;
QKind quality kind = {s = kind.s ++ quality.s} ;
Wine = {s = "vino"} ;
Cheese = {s = "formaggio"} ;
Fish = {s = "pesce"} ;
Very quality = {s = "molto" ++ quality.s} ;
Fresh = {s = "fresco"} ;
Warm = {s = "caldo"} ;
Italian = {s = "italiano"} ;
Expensive = {s = "caro"} ;
Delicious = {s = "delizioso"} ;
Boring = {s = "noioso"} ;
}
</PRE>
<P>
An alert reader, or one who already knows Italian, may notice one point in
which the change is more substantial than just replacement of words: the order of
a quality and the kind it modifies in
</P>
<PRE>
QKind quality kind = {s = kind.s ++ quality.s} ;
</PRE>
<P>
Thus Italian says <CODE>vino italiano</CODE> for <CODE>Italian wine</CODE>. (Some Italian adjectives
are put before the noun. This distinction can be controlled by parameters, which
are introduced in <a href="#chaptwo">the fourth chapter</a>.)
</P>
<P>
<B>Exercise</B>. Write a concrete syntax of <CODE>Food</CODE> for some other language.
You will probably end up with grammatically incorrect linearizations --- but don't
worry about this yet.
</P>
<P>
<B>Exercise</B>. If you have written <CODE>Food</CODE> for German, Swedish, or some
other language, test with random or exhaustive generation what constructs
come out incorrect, and prepare a list of those that cannot be helped
with the currently available fragment of GF. You can return to your list
after having worked out <a href="#chaptwo">the fourth chapter</a>.
</P>
<A NAME="toc25"></A>
<H2>Free variation</H2>
<P>
Sometimes there are alternative ways to define a concrete syntax.
For instance, if we use the <CODE>Food</CODE> grammars in a restaurant phrase
book, we may want to accept different words for expressing the quality
"delicious" --- and different languages can differ in how many
such words they have. Then we don't want to put the distinctions into
the abstract syntax, but into concrete syntaxes. Such semantically
neutral distinctions are known as <B>free variation</B> in linguistics.
</P>
<P>
The <CODE>variants</CODE> construct of GF expresses free variation. For example,
</P>
<PRE>
lin Delicious = {s = variants {"delicious" ; "exquisite" ; "tasty"}} ;
</PRE>
<P>
says that <CODE>Delicious</CODE> can be linearized to any of <I>delicious</I>,
<I>exquisite</I>, and <I>tasty</I>. As a consequence, all three words result in the
tree <CODE>Delicious</CODE> when parsed. By default, the <CODE>linearize</CODE> command
shows only the first variant from each <CODE>variants</CODE> list; to see them
all, the option <CODE>-all</CODE> can be used:
</P>
<PRE>
&gt; p "this exquisite wine is delicious" | l -all
this delicious wine is delicious
this delicious wine is exquisite
...
</PRE>
<P>
In linguistics, it is well known that free variation is almost
non-existent, if all aspects of expressions are taken into account, including style.
Therefore, free variation should not be used in grammars that are meant as
libraries for other grammars, as in <a href="#chapfive">the fifth chapter</a>. However, in a specific
application, free variation is an excellent way to scale up the parser to
variations in user input that make no difference in the semantic
treatment.
</P>
<P>
An example that clearly illustrates these points is the
English negation. If we added to the <CODE>Food</CODE> grammar the negation
of a quality, we could accept both contracted and uncontracted <I>not</I>:
</P>
<PRE>
fun IsNot : Item -&gt; Quality -&gt; Phrase ;
lin IsNot item qual =
{s = item.s ++ variants {"isn't" ; ["is not"]} ++ qual.s} ;
</PRE>
<P>
Both forms are likely to occur in user input. Since there is no
corresponding contrast in Italian, we do not want to put the distinction
in the abstract syntax. Yet there is a stylistic difference between
these two forms. In particular, if we are doing generation rather
than parsing, we will want to choose one or
the other depending on the kind of language we want to generate.
</P>
<P>
A limiting case of free variation is an empty variant list
</P>
<PRE>
variants {}
</PRE>
<P>
It can be used e.g. if a word lacks a certain inflection form.
</P>
<P>
Free variation works for all types in concrete syntax; all terms in
a <CODE>variants</CODE> list must be of the same type.
</P>
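<P>
As a sketch of what this means in practice, the variation can be placed around whole records
instead of strings. The rule below is an assumed variation on the <CODE>Delicious</CODE> example,
not a rule taken from the <CODE>Food</CODE> grammars; the commented-out line shows a list that
would be expected to fail type checking because its members have different types.
</P>
<PRE>
-- assumed alternative formulation: both variants are records of type {s : Str}
lin Delicious = variants {{s = "delicious"} ; {s = "tasty"}} ;

-- ill-typed: one member is a Str, the other a record
-- lin Delicious = {s = variants {"delicious" ; {s = "tasty"}}} ;
</PRE>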
<P>
<B>Exercise</B>. Modify <CODE>FoodIta</CODE> in such a way that a quality can
be assigned to an item by using two different word orders, exemplified
by <I>questo vino &egrave; delizioso</I> and <I>&egrave; delizioso questo vino</I>
(a real variation in Italian),
and that it is impossible to say that something is boring
(a rather contrived example).
</P>
<A NAME="toc26"></A>
<H2>More application of multilingual grammars</H2>
<A NAME="toc27"></A>
<H3>Multilingual treebanks</H3>
<P>
<a name="sectreebank"></a>
</P>
<P>
A <B>multilingual treebank</B> is a set of trees with their
translations in different languages:
</P>
<PRE>
&gt; gr -number=2 | tree_bank
Is (That Cheese) (Very Boring)
quello formaggio &egrave; molto noioso
that cheese is very boring
Is (That Cheese) Fresh
quello formaggio &egrave; fresco
that cheese is fresh
</PRE>
<P>
There is also an XML format for treebanks and a set of commands
suitable for regression testing; see <CODE>help tb</CODE> for more details.
</P>
<A NAME="toc28"></A>
<H3>Translation session</H3>
<P>
If translation is what you want to do with a set of grammars, a convenient
way to do it is to open a <CODE>translation_session = ts</CODE>. In this session,
you can translate between all the languages that are in scope.
A dot <CODE>.</CODE> terminates the translation session.
</P>
<PRE>
&gt; ts
trans&gt; that very warm cheese is boring
quello formaggio molto caldo &egrave; noioso
that very warm cheese is boring
trans&gt; questo vino molto italiano &egrave; molto delizioso
questo vino molto italiano &egrave; molto delizioso
this very Italian wine is very delicious
trans&gt; .
&gt;
</PRE>
<P></P>
<A NAME="toc29"></A>
<H3>Translation quiz</H3>
<P>
This is a simple language exercise that can be automatically
generated from a multilingual grammar. The system generates a set of
random sentences, displays them in one language, and checks the user's
answer given in another language. The command <CODE>translation_quiz = tq</CODE>
does this in a subshell of GF.
</P>
<PRE>
&gt; translation_quiz FoodEng FoodIta
Welcome to GF Translation Quiz.
The quiz is over when you have done at least 10 examples
with at least 75 % success.
You can interrupt the quiz by entering a line consisting of a dot ('.').
this fish is warm
questo pesce &egrave; caldo
&gt; Yes.
Score 1/1
this cheese is Italian
questo formaggio &egrave; noioso
&gt; No, not questo formaggio &egrave; noioso, but
questo formaggio &egrave; italiano
Score 1/2
this fish is expensive
</PRE>
<P>
You can also generate a list of translation exercises and save it in a
file for later use, by the command <CODE>translation_list = tl</CODE>:
</P>
<PRE>
&gt; translation_list -number=25 FoodEng FoodIta | write_file transl.txt
</PRE>
<P>
The <CODE>number</CODE> flag gives the number of sentences generated.
</P>
<A NAME="toc30"></A>
<H3>Multilingual syntax editing</H3>
<P>
<a name="secediting"></a>
</P>
<P>
Any multilingual grammar can be used in the graphical syntax editor, which is
opened by the shell
command <CODE>gfeditor</CODE> followed by the names of the grammar files.
Thus
</P>
<PRE>
% gfeditor FoodEng.gf FoodIta.gf
</PRE>
<P>
opens the editor for the two <CODE>Food</CODE> grammars.
</P>
<P>
The editor supports commands for manipulating an abstract syntax tree.
The process is started by choosing a category from the "New" menu.
Choosing <CODE>Phrase</CODE> creates a new tree of type <CODE>Phrase</CODE>. A new tree
is in general completely unknown: it consists of a <B>metavariable</B>
<CODE>?1</CODE>. However, since the category <CODE>Phrase</CODE> in <CODE>Food</CODE> has
only one possible constructor, <CODE>Is</CODE>, the tree is readily
given the form <CODE>Is ?1 ?2</CODE>. Here is what the editor looks like at
this stage:
</P>
<P>
<IMG ALIGN="right" SRC="food1.png" BORDER="0" ALT="">
</P>
<P>
Editing goes on by <B>refinements</B>, i.e. choices of constructors from
the menu, until no metavariables remain. Here is a tree resulting from the
current editing session:
</P>
<P>
<IMG ALIGN="right" SRC="food2.png" BORDER="0" ALT="">
</P>
<P>
Editing can be continued even when the tree is finished. The user can shift
the <B>focus</B> to any subtree by clicking on it or on the corresponding
part of a linearization. In the picture, the focus is on "fish".
Since there are no metavariables,
the menu shows no refinements, but some other possible actions:
</P>
<UL>
<LI>to <B>change</B> "fish" to "cheese" or "wine"
<LI>to <B>delete</B> "fish", i.e. change it to a metavariable
<LI>to <B>wrap</B> "fish" in a qualification, i.e. change it to
<CODE>QKind ? Fish</CODE>, where the quality can be given in a later refinement
</UL>
<P>
In addition to menu-based editing, the tool supports refinement by parsing,
which is accessible by middle-clicking in the tree or in the linearization field.
</P>
<P>
<B>Exercise</B>. Construct the sentence
<I>this very expensive cheese is very very delicious</I>
and its Italian translation by using <CODE>gfeditor</CODE>.
</P>
<A NAME="toc31"></A>
<H2>Context-free grammars and GF</H2>
<P>
Readers not familiar with context-free grammars, also known as BNF grammars, can
skip this section. Those who are familiar with them will find here the exact
relation between GF and context-free grammars. We will moreover show how
the BNF format can be used as input to the GF program; it is often more
concise than GF proper, but also more restricted in expressive power.
</P>
<A NAME="toc32"></A>
<H3>The "cf" grammar format</H3>
<P>
The grammar <CODE>FoodEng</CODE> could be written in a BNF format as follows:
</P>
<PRE>
Is. Phrase ::= Item "is" Quality ;
That. Item ::= "that" Kind ;
This. Item ::= "this" Kind ;
QKind. Kind ::= Quality Kind ;
Cheese. Kind ::= "cheese" ;
Fish. Kind ::= "fish" ;
Wine. Kind ::= "wine" ;
Italian. Quality ::= "Italian" ;
Boring. Quality ::= "boring" ;
Delicious. Quality ::= "delicious" ;
Expensive. Quality ::= "expensive" ;
Fresh. Quality ::= "fresh" ;
Very. Quality ::= "very" Quality ;
Warm. Quality ::= "warm" ;
</PRE>
<P>
In this format, each rule is prefixed by a <B>label</B> that names
the constructor function, which GF proper declares in a <CODE>fun</CODE> rule. In fact,
each context-free rule is a fusion of a <CODE>fun</CODE> and a <CODE>lin</CODE> rule:
it states simultaneously that
</P>
<UL>
<LI>the label is a function from the nonterminal categories
on the right-hand side to the category on the left-hand side;
the first rule gives
<PRE>
fun Is : Item -&gt; Quality -&gt; Phrase
</PRE>
<LI>trees built by the label are linearized in the way indicated
by the right-hand side;
the first rule gives
<PRE>
lin Is item quality = {s = item.s ++ "is" ++ quality.s}
</PRE>
</UL>
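<P>
The same reading applies to every rule in the listing. As a sketch, the rule
<CODE>QKind. Kind ::= Quality Kind ;</CODE> would correspond to the following pair of judgements.
The variable names <CODE>quality</CODE> and <CODE>kind</CODE> are chosen here for readability, and
each category is assumed to have the linearization type <CODE>{s : Str}</CODE>.
</P>
<PRE>
-- nonterminals on the right-hand side become argument types,
-- the left-hand side category becomes the value type
fun QKind : Quality -&gt; Kind -&gt; Kind ;

-- the right-hand side, read left to right, becomes the linearization
lin QKind quality kind = {s = quality.s ++ kind.s} ;
</PRE>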
<P>
The translation from BNF to GF described above is in fact used in
the GF system to convert BNF grammars into GF. BNF files are recognized
by the file name suffix <CODE>.cf</CODE>; thus the grammar above can be
put into a file named <CODE>food.cf</CODE> and read into GF by
</P>
<PRE>
&gt; import food.cf
</PRE>
<P></P>
<A NAME="toc33"></A>
<H3>Restrictions of context-free grammars</H3>
<P>
Even though we managed to write <CODE>FoodEng</CODE> in the context-free format,
we cannot do this for GF grammars in general. It is enough to try this
with <CODE>FoodIta</CODE> at the same time as <CODE>FoodEng</CODE>:
we lose an important aspect of multilinguality,
namely that the order of constituents is defined only in concrete syntax.
Thus we could not use context-free <CODE>FoodEng</CODE> and <CODE>FoodIta</CODE> in a multilingual
grammar that supports translation via common abstract syntax: the
qualification function <CODE>QKind</CODE> has different types in the two
grammars.
</P>
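<P>
The clash can be made concrete with a small sketch. The Italian rule mentioned in the second
comment is hypothetical: it is what a context-free version of <CODE>FoodIta</CODE> would presumably
contain, given that Italian puts the quality after the kind. The two judgements result from two
separate conversions and could not coexist in one abstract syntax.
</P>
<PRE>
-- derived from  QKind. Kind ::= Quality Kind ;   (English order)
fun QKind : Quality -&gt; Kind -&gt; Kind ;

-- derived from  QKind. Kind ::= Kind Quality ;   (Italian order, hypothetical rule)
fun QKind : Kind -&gt; Quality -&gt; Kind ;
</PRE>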
<P>
In general terms, the separation of concrete and abstract syntax allows
three deviations from context-free grammar:
</P>
<UL>
<LI><B>permutation</B>: changing the order of constituents
<LI><B>suppression</B>: omitting constituents
<LI><B>reduplication</B>: repeating constituents
</UL>
<P>
The third property is the one that definitely shows that GF is
stronger than context-free: GF can define the <B>copy language</B>
<CODE>{x x | x &lt;- (a|b)*}</CODE>, which is known not to be context-free.
The other properties have more to do with the kind of trees that
the grammar can associate with strings: permutation is important
in multilingual grammars, and suppression is exploited in grammars
where trees carry some hidden semantic information (see <a href="#chapsix">the sixth chapter</a>
below).
</P>
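<P>
A minimal sketch of the three deviations follows. The modules <CODE>Deviate</CODE> and
<CODE>DeviateCnc</CODE> and every name in them are hypothetical, invented only to show the shape of
the linearization rules; the example does not define the copy language itself.
</P>
<PRE>
-- hypothetical abstract syntax
abstract Deviate = {
  cat S ; A ;
  fun Swap : A -&gt; A -&gt; S ;
  fun Drop : A -&gt; A -&gt; S ;
  fun Twice : A -&gt; S ;
  fun a, b : A ;
}

-- hypothetical concrete syntax
concrete DeviateCnc of Deviate = {
  lincat S, A = {s : Str} ;
  lin Swap x y = {s = y.s ++ x.s} ;   -- permutation: arguments in reverse order
  lin Drop x y = {s = x.s} ;          -- suppression: y does not appear in the string
  lin Twice x = {s = x.s ++ x.s} ;    -- reduplication: x appears twice
  lin a = {s = "a"} ;
  lin b = {s = "b"} ;
}
</PRE>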
<P>
Of course, context-free grammars are also restricted from the
grammar engineering point of view. They give no support to
modules, functions, and parameters, which are so central
for the productivity of GF. Despite the initial conciseness
of context-free grammars, it is easy to write GF grammars in which
30 lines of GF code do the work of hundreds of lines of
context-free grammar code; see exercises
<a href="#secitalian">here</a> and <a href="#sectense">here</a>.
</P>
<P>
<B>Exercise</B>. GF can also interpret unlabelled BNF grammars, by
creating labels automatically. The right-hand sides of BNF rules
can moreover be disjunctions, e.g.
</P>
<PRE>
Quality ::= "fresh" | "Italian" | "very" Quality ;
</PRE>
<P>
Experiment with this format in GF, possibly with a grammar that
you import from some other source, such as a programming language
document.
</P>
<P>
<B>Exercise</B>. Define the copy language <CODE>{x x | x &lt;- (a|b)*}</CODE> in GF.
</P>
<A NAME="toc34"></A>
<H2>Modules and files</H2>
<P>
GF uses suffixes to recognize different file formats. The most
important ones are:
</P>
<UL>
<LI>Source files: <I>Modulename</I><CODE>.gf</CODE>
<LI>Target files: <I>Modulename</I><CODE>.gfc</CODE>
</UL>
<P>
When you import <CODE>FoodEng.gf</CODE>, you see the target files being
generated:
</P>
<PRE>
&gt; i FoodEng.gf
- compiling Food.gf... wrote file Food.gfc 16 msec
- compiling FoodEng.gf... wrote file FoodEng.gfc 20 msec
</PRE>
<P>
You also see that the GF program does not only read the file
<CODE>FoodEng.gf</CODE>, but also all other files that it
depends on --- in this case, <CODE>Food.gf</CODE>.
</P>
<P>
For each file that is compiled, a <CODE>.gfc</CODE> file
is generated. The GFC format (="GF Canonical") is the
"machine code" of GF, which is faster to process than
GF source files. When reading a module, GF decides whether
to use an existing <CODE>.gfc</CODE> file or to generate
a new one, by looking at modification times.
</P>
<P>
<I>In GF version 3, the</I> <CODE>gfc</CODE> <I>format is replaced by the format suffixed</I>
<CODE>gfo</CODE>, <I>"GF object"</I>.
</P>
<P>
<B>Exercise</B>. What happens when you import <CODE>FoodEng.gf</CODE> for
a second time? Try this in different situations:
</P>
<UL>
<LI>Right after importing it the first time (the modules are kept in
the memory of GF and need no reloading).
<LI>After issuing the command <CODE>empty</CODE> (<CODE>e</CODE>), which clears the memory
of GF.
<LI>After making a small change in <CODE>FoodEng.gf</CODE>, be it only an added space.
<LI>After making a change in <CODE>Food.gf</CODE>.
</UL>
<A NAME="toc35"></A>
<H2>Using operations and resource modules</H2>
<A NAME="toc36"></A>
<H3>The golden rule of functional programming</H3>
<P>
When writing a grammar, you have to type lots of
characters. You have probably
done this by the copy-and-paste method, which is a universally
available way to avoid repeating work.
</P>
<P>
However, there is a more elegant way to avoid repeating work than
the copy-and-paste
method. The <B>golden rule of functional programming</B> says that
</P>
<UL>
<LI>whenever you find yourself programming by copy-and-paste,
write a function instead.
</UL>
<P>
A function separates the shared parts of different computations from the
changing parts, its <B>arguments</B>, or <B>parameters</B>.
In functional programming languages, such as
Haskell, it is possible to share much more
code with functions than in languages such as C and Java, because
of higher-order functions (functions that take functions as arguments).
</P>
<A NAME="toc37"></A>
<H3>Operation definitions</H3>
<P>
GF is a functional programming language, not only in the sense that
the abstract syntax is a system of functions (<CODE>fun</CODE>), but also because
functional programming can be used when defining concrete syntax. This is
done by using a new form of judgement, with the keyword <CODE>oper</CODE> (for
<B>operation</B>), distinct from <CODE>fun</CODE> for the sake of clarity.
Here is a simple example of an operation:
</P>
<PRE>
oper ss : Str -&gt; {s : Str} = \x -&gt; {s = x} ;
</PRE>
<P>
The operation can be <B>applied</B> to an argument, and GF will
<B>compute</B> the application into a value. For instance,
</P>
<PRE>
ss "boy" ===&gt; {s = "boy"}
</PRE>
<P>
We use the symbol <CODE>===&gt;</CODE> to indicate how an expression is
computed into a value; this symbol is not a part of GF.
</P>
<P>
Thus an <CODE>oper</CODE> judgement includes the name of the defined operation,
its type, and an expression defining it. As for the syntax of the defining
expression, notice the <B>lambda abstraction</B> form <CODE>\</CODE><I>x</I> <CODE>-&gt;</CODE> <I>t</I> of
the function. It reads: function with variable <I>x</I> and <B>function body</B>
<I>t</I>. Any occurrence of <I>x</I> in <I>t</I> is said to be <B>bound</B> in <I>t</I>.
</P>
<P>
For lambda abstraction with multiple arguments, we have the shorthand
</P>
<PRE>
\x,y -&gt; t === \x -&gt; \y -&gt; t
</PRE>
<P>
The notation we have used for linearization rules, where
variables are bound on the left-hand side, is actually syntactic
sugar for abstraction:
</P>
<PRE>
lin f x = t === lin f = \x -&gt; t
</PRE>
<P></P>
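<P>
As an illustration, here is how two rules of <CODE>FoodEng</CODE> would look with the
sugar removed, written with the <CODE>===</CODE> notation used above (which is not part of GF).
The two-argument case combines this sugar with the shorthand for multiple arguments.
</P>
<PRE>
lin This kind = {s = "this" ++ kind.s}
  ===  lin This = \kind -&gt; {s = "this" ++ kind.s}

lin Is item quality = {s = item.s ++ "is" ++ quality.s}
  ===  lin Is = \item,quality -&gt; {s = item.s ++ "is" ++ quality.s}
</PRE>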
<A NAME="toc38"></A>
<H3>The <CODE>resource</CODE> module type</H3>
<P>
Operation definitions can be included in a concrete syntax.
But they are usually not really tied to a particular
set of linearization rules.
They should rather be seen as <B>resources</B>
usable in many concrete syntaxes.
</P>
<P>
The <CODE>resource</CODE> module type is used to package
<CODE>oper</CODE> definitions into reusable resources. Here is
an example, with a handful of operations to manipulate
strings and records.
</P>
<PRE>
resource StringOper = {
oper
SS : Type = {s : Str} ;
ss : Str -&gt; SS = \x -&gt; {s = x} ;
cc : SS -&gt; SS -&gt; SS = \x,y -&gt; ss (x.s ++ y.s) ;
prefix : Str -&gt; SS -&gt; SS = \p,x -&gt; ss (p ++ x.s) ;
}
</PRE>
<P></P>
<A NAME="toc39"></A>
<H3>Opening a resource</H3>
<P>
Any number of <CODE>resource</CODE> modules can be
<B>open</B>ed in a <CODE>concrete</CODE> syntax, which
makes definitions contained
in the resource usable in the concrete syntax. Here is
an example, where the resource <CODE>StringOper</CODE> is
opened in a new version of <CODE>FoodEng</CODE>.
</P>
<PRE>
concrete FoodEng of Food = open StringOper in {
lincat
Phrase, Item, Kind, Quality = SS ;
lin
Is item quality = cc item (prefix "is" quality) ;
This k = prefix "this" k ;
That k = prefix "that" k ;
QKind quality kind = cc quality kind ;
Wine = ss "wine" ;
Cheese = ss "cheese" ;
Fish = ss "fish" ;
Very = prefix "very" ;
Fresh = ss "fresh" ;
Warm = ss "warm" ;
Italian = ss "Italian" ;
Expensive = ss "expensive" ;
Delicious = ss "delicious" ;
Boring = ss "boring" ;
}
</PRE>
<P></P>
<P>
<B>Exercise</B>. Use the same string operations to write <CODE>FoodIta</CODE>
more concisely.
</P>
<A NAME="toc40"></A>
<H3>Partial application</H3>
<P>
<a name="secpartapp"></a>
</P>
<P>
GF, like Haskell, permits <B>partial application</B> of
functions. An example of this is the rule
</P>
<PRE>
lin This k = prefix "this" k ;
</PRE>
<P>
which can be written more concisely
</P>
<PRE>
lin This = prefix "this" ;
</PRE>
<P>
The first form is perhaps more intuitive to write
but, once you get used to partial application, you will appreciate its
conciseness and elegance. The logic of partial application
is known as <B>currying</B>, with a reference to Haskell B. Curry.
The idea is that any <I>n</I>-place function can be seen as a 1-place
function whose value is an (<I>n</I>-1)-place function. Thus
</P>
<PRE>
oper prefix : Str -&gt; SS -&gt; SS ;
</PRE>
<P>
can be used as a 1-place function that takes a <CODE>Str</CODE> into a
function <CODE>SS -&gt; SS</CODE>. The expected linearization of <CODE>This</CODE> is exactly
a function of such a type, operating on an argument of type <CODE>Kind</CODE>
whose linearization is of type <CODE>SS</CODE>. Thus we can define the
linearization directly as <CODE>prefix "this"</CODE>.
</P>
<P>
An important part of the art of functional programming is to decide the order
of arguments in a function, so that partial application can be used as much
as possible. For instance, we know that the operation <CODE>prefix</CODE> will
typically be applied to a constant string, with the linearization variable left as the remaining argument.
This is the reason to put the <CODE>Str</CODE> argument before the <CODE>SS</CODE> argument; the reason is
not that the string is placed first in the result. A <CODE>postfix</CODE> function would have exactly the same order of arguments.
</P>
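<P>
As a minimal sketch of this point, a <CODE>postfix</CODE> operation could be added to
<CODE>StringOper</CODE> as follows. The definition is our assumption, modelled on
<CODE>prefix</CODE>; only the name is mentioned in the text above.
</P>
<PRE>
oper
  -- the constant string is still the first argument,
  -- although it ends up last in the resulting string
  postfix : Str -&gt; SS -&gt; SS = \p,x -&gt; ss (x.s ++ p) ;
</PRE>
<P>
With this order, <CODE>postfix "indeed"</CODE> is a function of type <CODE>SS -&gt; SS</CODE>
and can be used directly as the linearization of a one-place function.
</P>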
<P>
<B>Exercise</B>. Define an operation <CODE>infix</CODE> analogous to <CODE>prefix</CODE>,
such that it allows you to write
</P>
<PRE>
lin Is = infix "is" ;
</PRE>
<P></P>
<A NAME="toc41"></A>
<H3>Testing resource modules</H3>
<P>
To test a <CODE>resource</CODE> module independently, you must import it
with the flag <CODE>-retain</CODE>, which tells GF to retain <CODE>oper</CODE> definitions
in memory; the usual behaviour is that <CODE>oper</CODE> definitions
are just applied to compile linearization rules
(this is called <B>inlining</B>) and then thrown away.
</P>
<PRE>
&gt; import -retain StringOper.gf
</PRE>
<P>
The command <CODE>compute_concrete = cc</CODE> computes any expression
formed by operations and other GF constructs. For example,
</P>
<PRE>
&gt; compute_concrete prefix "in" (ss "addition")
{
s : Str = "in" ++ "addition"
}
</PRE>
<P></P>
<A NAME="toc42"></A>
<H2>Grammar architecture</H2>
<P>
<a name="secarchitecture"></a>
</P>
<A NAME="toc43"></A>
<H3>Extending a grammar</H3>
<P>
The module system of GF makes it possible to write a new module that <B>extend</B>s
an old one. The syntax of extension is
shown by the following example. We extend <CODE>Food</CODE> into <CODE>Morefood</CODE> by
adding a category of questions and two new functions.
</P>
<PRE>
abstract Morefood = Food ** {
cat
Question ;
fun
QIs : Item -&gt; Quality -&gt; Question ;
Pizza : Kind ;
}
</PRE>
<P>
Parallel to the abstract syntax, extensions can
be built for concrete syntaxes:
</P>
<PRE>
concrete MorefoodEng of Morefood = FoodEng ** {
lincat
Question = {s : Str} ;
lin
QIs item quality = {s = "is" ++ item.s ++ quality.s} ;
Pizza = {s = "pizza"} ;
}
</PRE>
<P>
The effect of extension is that all of the contents of the extended
and extending module are put together. We also say that the new
module <B>inherits</B> the contents of the old module.
</P>
<P>
At the same time as extending a module of the same type, a concrete
syntax module may open resources. Since <CODE>open</CODE> takes effect in
the module body and not in the extended module, its logical place
in the module header is after the extend part:
</P>
<PRE>
concrete MorefoodIta of Morefood = FoodIta ** open StringOper in {
lincat
Question = SS ;
lin
QIs item quality = ss (item.s ++ "&egrave;" ++ quality.s) ;
Pizza = ss "pizza" ;
}
</PRE>
<P>
Resource modules can extend other resource modules, in the
same way as modules of other types can extend modules of the
same type. Thus it is possible to build resource hierarchies.
</P>
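<P>
A small resource hierarchy might look as follows. The module name
<CODE>MoreStringOper</CODE> and the operation <CODE>paren</CODE> are hypothetical,
chosen only for illustration.
</P>
<PRE>
resource MoreStringOper = StringOper ** {
  oper
    -- SS and ss are inherited from StringOper
    paren : SS -&gt; SS = \x -&gt; ss ("(" ++ x.s ++ ")") ;
}
</PRE>
<P>
A concrete syntax that opens <CODE>MoreStringOper</CODE> can then use both <CODE>paren</CODE>
and the inherited operations <CODE>ss</CODE>, <CODE>cc</CODE>, and <CODE>prefix</CODE>.
</P>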
<A NAME="toc44"></A>
<H3>Multiple inheritance</H3>
<P>
Specialized vocabularies can be represented as small grammars that
only do "one thing" each. For instance, the following are grammars
for fruit and mushrooms
</P>
<PRE>
abstract Fruit = {
cat Fruit ;
fun Apple, Peach : Fruit ;
}
abstract Mushroom = {
cat Mushroom ;
fun Cep, Agaric : Mushroom ;
}
</PRE>
<P>
They can afterwards be combined into bigger grammars by using
<B>multiple inheritance</B>, i.e. extension of several grammars at the
same time:
</P>
<PRE>
abstract Foodmarket = Food, Fruit, Mushroom ** {
fun
FruitKind : Fruit -&gt; Kind ;
MushroomKind : Mushroom -&gt; Kind ;
}
</PRE>
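<P>
The concrete syntaxes can be combined in the same way. The following is only a sketch:
it assumes that <CODE>Kind</CODE> has the linearization type <CODE>{s : Str}</CODE> in
<CODE>FoodEng</CODE>, and the module names <CODE>FruitEng</CODE>, <CODE>MushroomEng</CODE>,
and <CODE>FoodmarketEng</CODE> are our own choices. Each module would go into a file of its own.
</P>
<PRE>
concrete FruitEng of Fruit = {
  lincat Fruit = {s : Str} ;
  lin
    Apple = {s = "apple"} ;
    Peach = {s = "peach"} ;
}

concrete MushroomEng of Mushroom = {
  lincat Mushroom = {s : Str} ;
  lin
    Cep    = {s = "cep"} ;
    Agaric = {s = "agaric"} ;
}

concrete FoodmarketEng of Foodmarket = FoodEng, FruitEng, MushroomEng ** {
  lin
    -- Fruit, Mushroom, and Kind all have the type {s : Str},
    -- so the argument can be returned unchanged
    FruitKind f    = f ;
    MushroomKind m = m ;
}
</PRE>
<P>
Explicit records are used here instead of <CODE>ss</CODE>, because opened resources
are not exported by <CODE>FoodEng</CODE> to the modules that extend it.
</P>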
<P>
The main advantages of splitting a grammar into modules are
<B>reusability</B>, <B>separate compilation</B>, and <B>division of labour</B>.
Reusability means
that one and the same module can be put to different uses; for instance,
a module with mushroom names might be used in a mycological information system
as well as in a restaurant phrasebook. Separate compilation means that a module
once compiled into <CODE>.gfc</CODE> need not be compiled again unless changes have
taken place.
Division of labour means simply that programmers who are experts in
special areas can work on modules belonging to those areas.
</P>
<P>
<B>Exercise</B>. Refactor <CODE>Food</CODE> by taking apart <CODE>Wine</CODE> into a special
<CODE>Drink</CODE> module.
</P>
<A NAME="toc45"></A>
<H3>Visualizing module structure</H3>
<P>
When you have created all the abstract syntaxes and
one set of concrete syntaxes needed for <CODE>Foodmarket</CODE>,
your grammar consists of eight GF modules. To see what their
dependencies look like, you can use the command
<CODE>visualize_graph = vg</CODE>,
</P>
<PRE>
&gt; visualize_graph
</PRE>
<P>
and the graph will pop up in a separate window:
</P>
<P>
<IMG ALIGN="middle" SRC="foodmarket.png" BORDER="0" ALT="">
</P>
<P>
The graph uses
</P>
<UL>
<LI>oval boxes for abstract modules
<LI>square boxes for concrete modules
<LI>black-headed arrows for inheritance
<LI>white-headed arrows for the concrete-of-abstract relation
</UL>
<P>
Just as with the <CODE>visualize_tree = vt</CODE> command, the freely available tools
Ghostview and Graphviz are needed. As an alternative, you can again print
the graph into a <CODE>.dot</CODE> file by using the command <CODE>print_multi = pm</CODE>:
</P>
<PRE>
&gt; print_multi -printer=graph | write_file Foodmarket.dot
&gt; ! dot -Tpng Foodmarket.dot &gt; Foodmarket.png
</PRE>
<P></P>
<A NAME="toc46"></A>
<H2>Summary of GF language features</H2>
<A NAME="toc47"></A>
<H3>Modules</H3>
<P>
The general form of a module is
<center>
<I>Moduletype</I> <I>M</I> <I>Of</I> <CODE>=</CODE> (<I>Extends</I> <CODE>**</CODE>)? (<CODE>open</CODE> <I>Opens</I> <CODE>in</CODE>)? <I>Body</I>
</center>
where <I>Moduletype</I> is one of <CODE>abstract</CODE>, <CODE>concrete</CODE>, and <CODE>resource</CODE>.
</P>
<P>
If <I>Moduletype</I> is <CODE>concrete</CODE>, the <I>Of</I>-part has the form <CODE>of</CODE> <I>A</I>,
where <I>A</I> is the name of an abstract module. Otherwise it is empty.
</P>
<P>
The name of the module is given by the identifier <I>M</I>.
</P>
<P>
The optional <I>Extends</I> part is a comma-separated
list of module names, which have to be modules of
the same <I>Moduletype</I>. The contents of these modules are <B>inherited</B> by
<I>M</I>. This means that they are both usable in <I>Body</I> and exported by <I>M</I>,
i.e. inherited when <I>M</I> is inherited and available when <I>M</I> is opened.
(Exception: <CODE>oper</CODE> and <CODE>param</CODE> judgements are not inherited from
<CODE>concrete</CODE> modules.)
</P>
<P>
The optional <I>Opens</I> part is a comma-separated
list of resource module names. The contents of these
modules are usable in the <I>Body</I>, but they are not exported.
</P>
<P>
Opening can be <B>qualified</B>, e.g.
</P>
<PRE>
concrete C of A = open (P = Prelude) in ...
</PRE>
<P>
This means that the names from <CODE>Prelude</CODE> are only available in the form
<CODE>P.name</CODE>. This form of qualifying a name is always possible, and it can
be used to resolve <B>name conflicts</B>, which result when the same name is
declared in more than one module that is in scope.
</P>
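<P>
As an illustration, here is a sketch of a resource that opens <CODE>StringOper</CODE>
with the qualifier <CODE>S</CODE>. The module name <CODE>Greetings</CODE> and its two
operations are hypothetical.
</P>
<PRE>
resource Greetings = open (S = StringOper) in {
  oper
    -- names from StringOper must be written with the prefix S.
    hello      : S.SS = S.ss "hello" ;
    helloWorld : S.SS = S.cc hello (S.ss "world") ;
}
</PRE>
<P>
The operations defined in <CODE>Greetings</CODE> itself, such as <CODE>hello</CODE>,
need no qualifier inside the module.
</P>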
<A NAME="toc48"></A>
<H3>Judgements</H3>
<P>
The <I>Body</I> part consists of judgements. The table of judgement forms
given <a href="#toc12">earlier</a> is extended with the following forms:
</P>
<TABLE ALIGN="center" CELLPADDING="4" BORDER="1">
<TR>
<TH>form</TH>
<TH>reading</TH>
<TH>module type</TH>
</TR>
<TR>
<TD ALIGN="center"><CODE>oper</CODE> <I>h</I> <CODE>:</CODE> <I>T</I> <CODE>=</CODE> <I>t</I></TD>
<TD>operation <I>h</I> of type <I>T</I> is defined as <I>t</I></TD>
<TD>resource, concrete</TD>
</TR>
<TR>
<TD ALIGN="right"><CODE>param</CODE> <I>P</I> <CODE>=</CODE> <I>C1</I> <CODE>|</CODE> ... <CODE>|</CODE> <I>Cn</I></TD>
<TD>parameter type <I>P</I> has constructors <I>C1...Cn</I></TD>
<TD>resource, concrete</TD>
</TR>
</TABLE>
<P></P>
<P>
The <CODE>param</CODE> judgement will be explained in the next chapter.
</P>
<P>
The type part of an <CODE>oper</CODE> judgement can be omitted, if the type can be inferred
by the GF compiler.
</P>
<PRE>
oper hello = "hello" ++ "world" ;
</PRE>
<P>
As a rule, type inference works for all terms except lambda abstracts.
</P>
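<P>
The following sketch contrasts the two situations. The module name <CODE>Inference</CODE>
and the operation <CODE>greet</CODE> are hypothetical.
</P>
<PRE>
resource Inference = {
  oper
    -- type omitted: Str can be inferred from the string expression
    hello = "hello" ++ "world" ;
    -- type given: the definition is a lambda abstract
    greet : Str -&gt; Str = \x -&gt; "hello" ++ x ;
}
</PRE>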
<P>
<B>Lambda abstracts</B> are expressions of the form <CODE>\</CODE><I>x</I> <CODE>-&gt;</CODE> <I>t</I>,
where <I>x</I> is a variable <B>bound</B> in the expression <I>t</I>, which is the
<B>body</B> of the lambda abstract. The type of the lambda abstract is
<I>A</I> <CODE>-&gt;</CODE> <I>B</I>, where <I>A</I> is the type of the variable <I>x</I> and
<I>B</I> the type of the body <I>t</I>.
</P>
<P>
For multiple lambda abstractions, there is a shorthand
</P>
<PRE>
\x,y -&gt; t === \x -&gt; \y -&gt; t
</PRE>
<P>
For <CODE>lin</CODE> judgements, there is the shorthand
</P>
<PRE>
lin f x = t === lin f = \x -&gt; t
</PRE>
<P></P>
<A NAME="toc49"></A>
<H3>Free variation</H3>
<P>
The <CODE>variants</CODE> construct of GF can be used to give a list of
concrete syntax terms, of the same type, in free variation. For example,
</P>
<PRE>
variants {["does not"] ; "doesn't"}
</PRE>
<P>
A limiting case is the empty variant list <CODE>variants {}</CODE>.
</P>
<A NAME="toc50"></A>
<H3>The context-free grammar format</H3>
<P>
The <CODE>.cf</CODE> file format is used for <B>context-free grammars</B>, which are
always interpretable as GF grammars. Files of this format consist of
rules of the form
<center>
(<I>Label</I> <CODE>.</CODE>)? <I>Cat</I> <CODE>::=</CODE> <I>RHS</I> <CODE>;</CODE>
</center>
where the <I>RHS</I> is a sequence of terminals (quoted strings) and
nonterminals (identifiers). The optional <I>Label</I> gives the abstract
syntax function created. If it is omitted, a function name is generated
automatically. In that case it is also possible to have more than one <I>RHS</I>,
separated by <CODE>|</CODE>. An empty <I>RHS</I> is interpreted as an empty sequence
of terminals, not as an empty disjunction.
</P>
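<P>
As an illustration, a fragment of the Food grammar might be written in this format
as follows. This is a sketch based on the rule form above; the choice of rules is ours.
The last rule has no label and can therefore list several alternatives.
</P>
<PRE>
Is.    Phrase  ::= Item "is" Quality ;
This.  Item    ::= "this" Kind ;
QKind. Kind    ::= Quality Kind ;
Very.  Quality ::= "very" Quality ;
Warm.  Quality ::= "warm" ;
Kind ::= "wine" | "cheese" | "fish" ;
</PRE>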
<P>
The <B>Extended BNF</B> format (<B>EBNF</B>) can also be used, in files suffixed <CODE>.ebnf</CODE>.
This format does not allow user-written labels. The right-hand-side of a rule
can contain everything that is possible in the <CODE>.cf</CODE> format, but also
optional parts (<CODE>p ?</CODE>), sequences (<CODE>p *</CODE>) and non-empty sequences (<CODE>p +</CODE>).
For example, the phrases in <CODE>FoodEng</CODE> could be recognized with the following
EBNF grammar:
</P>
<PRE>
Phrase ::=
("this" | "that") Quality* ("wine" | "cheese" | "fish") "is" Quality ;
Quality ::=
("very"* ("fresh" | "warm" | "boring" | "Italian" | "expensive" | "delicious")) ;
</PRE>
<P></P>
<A NAME="toc51"></A>
<H3>Character encoding</H3>
<P>
The default encoding is iso-latin-1. UTF-8 can be set by the flag <CODE>coding=utf8</CODE>
in the grammar. The resource grammar libraries are in iso-latin-1, except Russian
and Arabic, which are in UTF-8. The resources may be changed to UTF-8 in future.
Letters in identifiers must currently be iso-latin-1.
</P>
<A NAME="toc52"></A>
<H1>Grammars with parameters</H1>
<P>
<a name="chapfour"></a>
</P>
<P>
In this chapter, we will introduce the techniques needed for
describing the inflection of words, as well as the rules by
which correct word forms are selected in syntactic combinations.
These techniques are already needed in a very slight extension
of the Food grammar of the previous chapter. While explaining
how the linguistic problems are solved for English and Italian,
we also cover all the language constructs GF has for
defining concrete syntax.
</P>
<P>
It is in principle possible to skip this chapter and go directly
to the next, since the use of the GF Resource Grammar library
makes it unnecessary to use any more constructs of GF than we
have already covered: parameters could be left to library implementors.
</P>
<A NAME="toc53"></A>
<H2>The problem: words have to be inflected</H2>
<P>
Suppose we want to say, with the vocabulary included in
<CODE>Food.gf</CODE>, things like
<center>
<I>these Italian wines are delicious</I>
</center>
The new grammatical facilities we need are the plural forms
of nouns and verbs (<I>wines, are</I>), as opposed to their
singular forms.
</P>
<P>
The introduction of plural forms requires two things:
</P>
<UL>
<LI>the <B>inflection</B> of nouns and verbs in singular and plural
<LI>the <B>agreement</B> of the verb to subject:
the verb must have the same number as the subject
</UL>
<P>
Different languages have different types of inflection and agreement.
For instance, Italian also has agreement in gender (masculine vs. feminine).
In a multilingual grammar,
we want to express such differences between languages in the
concrete syntax while ignoring them in the abstract syntax.
</P>
<P>
To be able to do all this, we need one new judgement form
and some new expression forms.
We also need to generalize linearization types
from strings to more complex types.
</P>
<P>
<B>Exercise</B>. Make a list of the possible forms that nouns,
adjectives, and verbs can have in some languages that you know.
</P>
<A NAME="toc54"></A>
<H2>Parameters and tables</H2>
<P>
We define the <B>parameter type</B> of number in English by
using a new form of judgement:
</P>
<PRE>
param Number = Sg | Pl ;
</PRE>
<P>
This judgement defines the parameter type <CODE>Number</CODE> by listing
its two <B>constructors</B>, <CODE>Sg</CODE> and <CODE>Pl</CODE> (common shorthands for
singular and plural).
</P>
<P>
To state that <CODE>Kind</CODE> expressions in English have a linearization
depending on number, we replace the linearization type <CODE>{s : Str}</CODE>
with a type where the <CODE>s</CODE> field is a <B>table</B> depending on number:
</P>
<PRE>
lincat Kind = {s : Number =&gt; Str} ;
</PRE>
<P>
The <B>table type</B> <CODE>Number =&gt; Str</CODE> is in many respects similar to
a function type (<CODE>Number -&gt; Str</CODE>). The main difference is that the
argument type of a table type must always be a parameter type. This means
that the argument-value pairs can be listed in a finite table. The following
example shows such a table:
</P>
<PRE>
lin Cheese = {
s = table {
Sg =&gt; "cheese" ;
Pl =&gt; "cheeses"
}
} ;
</PRE>
<P>
The table consists of <B>branches</B>, where a <B>pattern</B> on the
left of the arrow <CODE>=&gt;</CODE> is assigned a <B>value</B> on the right.
</P>
<P>
The application of a table to a parameter is done by the <B>selection</B>
operator <CODE>!</CODE>, which is computed by <B>pattern matching</B>: it returns
the value from the first branch whose pattern matches the
selection argument. For instance,
</P>
<PRE>
table {Sg =&gt; "cheese" ; Pl =&gt; "cheeses"} ! Pl
===&gt; "cheeses"
</PRE>
<P>
As syntactic sugar for table selections, we can define the
<B>case expressions</B>, which are common in functional programming and also
handy to use in GF.
</P>
<PRE>
case e of {...} === table {...} ! e
</PRE>
<P></P>
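<P>
The equivalence can be seen by writing the same operation in both ways.
The module name <CODE>CaseDemo</CODE> and the operation names are hypothetical.
</P>
<PRE>
resource CaseDemo = {
  param Number = Sg | Pl ;
  oper
    -- written with a table and an explicit selection
    cheeseT : Number -&gt; Str = \n -&gt;
      table {Sg =&gt; "cheese" ; Pl =&gt; "cheeses"} ! n ;
    -- the same operation written with a case expression
    cheeseC : Number -&gt; Str = \n -&gt;
      case n of {Sg =&gt; "cheese" ; Pl =&gt; "cheeses"} ;
}
</PRE>
<P>
Both <CODE>cheeseT Pl</CODE> and <CODE>cheeseC Pl</CODE> compute to <CODE>"cheeses"</CODE>.
</P>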
<P>
A parameter type can have any number of constructors, and these can
also take arguments from other parameter types. For instance, an accurate
type system for English verbs (except <I>be</I>) is
</P>
<PRE>
param VerbForm = VPresent Number | VPast | VPastPart | VPresPart ;
</PRE>
<P>
This system expresses accurately the fact that only the present tense has
number variation. (Agreement also requires variation in person, but
this can be handled in syntax rules, by picking the singular form for third-person
singular subjects and the plural form for all others.) As an example of
a table, here are the forms of the verb <I>drink</I>:
</P>
<PRE>
table {
VPresent Sg =&gt; "drinks" ;
VPresent Pl =&gt; "drink" ;
VPast =&gt; "drank" ;
VPastPart =&gt; "drunk" ;
VPresPart =&gt; "drinking"
}
</PRE>
<P></P>
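<P>
To try this out, the parameter types and the table can be packaged in a resource module.
The following is a sketch; the module name <CODE>VerbDemo</CODE> and the operation
name <CODE>drink</CODE> are our assumptions.
</P>
<PRE>
resource VerbDemo = {
  param
    Number   = Sg | Pl ;
    VerbForm = VPresent Number | VPast | VPastPart | VPresPart ;
  oper
    -- the inflection table of "drink", stored in the field s
    drink : {s : VerbForm =&gt; Str} = {
      s = table {
        VPresent Sg =&gt; "drinks" ;
        VPresent Pl =&gt; "drink" ;
        VPast       =&gt; "drank" ;
        VPastPart   =&gt; "drunk" ;
        VPresPart   =&gt; "drinking"
      }
    } ;
}
</PRE>
<P>
After <CODE>import -retain VerbDemo.gf</CODE>, the command
<CODE>compute_concrete drink.s ! VPast</CODE> should select the form <I>drank</I>.
</P>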
<P>
<B>Exercise</B>. In an earlier exercise (previous section),
you made a list of the possible
forms that nouns, adjectives, and verbs can have in some languages that
you know. Now take some of the results and implement them by
using parameter type definitions and tables. Write them into a <CODE>resource</CODE>
module, which you can test by using the command <CODE>compute_concrete</CODE>.
</P>
<A NAME="toc55"></A>
<H2>Inflection tables and paradigms</H2>
<P>
All English common nouns are inflected for number, most of them in the
same way: the plural form is obtained from the singular by adding the
ending <I>s</I>. This rule is an example of
a <B>paradigm</B> --- a formula telling how a class of words is inflected.
</P>
<P>
From the GF point of view, a paradigm is a function that takes
a <B>lemma</B> --- also known as a <B>dictionary form</B> or a <B>citation form</B> --- and
returns an inflection
table of desired type. Paradigms are not functions in the sense of the
<CODE>fun</CODE> judgements of abstract syntax (which operate on trees and not
on strings), but operations defined in <CODE>oper</CODE> judgements.
The following operation defines the regular noun paradigm of English:
</P>
<PRE>
oper regNoun : Str -&gt; {s : Number =&gt; Str} = \dog -&gt; {
s = table {
Sg =&gt; dog ;
Pl =&gt; dog + "s"
}
} ;
</PRE>
<P>
The <B>gluing</B> operator <CODE>+</CODE> says that
the string held in the variable <CODE>dog</CODE> and the ending <CODE>"s"</CODE>
are written together to form one <B>token</B>. Thus, for instance,
</P>
<PRE>
(regNoun "cheese").s ! Pl ===&gt; "cheese" + "s" ===&gt; "cheeses"
</PRE>
<P>
A more complex example is that of regular verbs:
</P>
<PRE>
oper regVerb : Str -&gt; {s : VerbForm =&gt; Str} = \talk -&gt; {
s = table {
VPresent Sg =&gt; talk + "s" ;
VPresent Pl =&gt; talk ;
VPresPart =&gt; talk + "ing" ;
_ =&gt; talk + "ed"
}
} ;
</PRE>
<P>
Notice how a catch-all case for the past tense and the past participle
is expressed by using a <B>wild card</B> pattern <CODE>_</CODE>. Here again, pattern matching
tries all patterns in order until it finds a matching pattern;
and it is the wild card that is the first match for both <CODE>VPast</CODE> and
<CODE>VPastPart</CODE>.
</P>
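<P>
The following computations, written in the <CODE>===&gt;</CODE> notation used above,
show how the branches of <CODE>regVerb</CODE> are selected for the verb <I>talk</I>:
</P>
<PRE>
(regVerb "talk").s ! (VPresent Sg) ===&gt; "talk" + "s"   ===&gt; "talks"
(regVerb "talk").s ! VPresPart     ===&gt; "talk" + "ing" ===&gt; "talking"
(regVerb "talk").s ! VPast         ===&gt; "talk" + "ed"  ===&gt; "talked"
(regVerb "talk").s ! VPastPart     ===&gt; "talk" + "ed"  ===&gt; "talked"
</PRE>
<P>
Because the first matching branch is chosen, the wild card must come last:
if it were the first branch, every form would be computed as <I>talked</I>.
</P>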
<P>
<B>Exercise</B>. Identify cases in which the <CODE>regNoun</CODE> paradigm does not
apply in English, and implement some alternative paradigms.
</P>
<P>
<B>Exercise</B>. Implement some regular paradigms for other languages you have
considered in earlier exercises.
</P>
<A NAME="toc56"></A>
<H2>Using parameters in concrete syntax</H2>
<P>
We can now enrich the concrete syntax definitions to
comprise morphology. This will permit a more radical
variation between languages (e.g. English and Italian)
than just the use of different words. In general,
parameters and linearization types are different in
different languages --- but this does not prevent using a
common abstract syntax.
</P>
<P>
We consider a grammar <CODE>Foods</CODE>, which is similar to
<CODE>Food</CODE>, with the addition of two rules for forming plural items:
</P>
<PRE>
fun These, Those : Kind -&gt; Item ;
</PRE>
<P>
We also add a noun which in Italian has the feminine gender; all nouns in
<CODE>Food</CODE> were carefully chosen to be masculine!
</P>
<PRE>
fun Pizza : Kind ;
</PRE>
<P>
This noun will force us to deal with gender in the Italian grammar,
which is what we need for the grammar to scale up for larger applications.
</P>
<A NAME="toc57"></A>
<H3>Agreement</H3>
<P>
In the English <CODE>Foods</CODE> grammar, we need just one type of parameters:
<CODE>Number</CODE> as defined above. The phrase-forming rule
</P>
<PRE>
fun Is : Item -&gt; Quality -&gt; Phrase ;
</PRE>
<P>
is affected by the number because of <B>subject-verb agreement</B>.
In English, agreement says that the verb of a sentence
must be inflected in the number of the subject. Thus we will linearize
</P>
<PRE>
Is (This Pizza) Warm ===&gt; "this pizza is warm"
Is (These Pizza) Warm ===&gt; "these pizzas are warm"
</PRE>
<P>
Here it is the <B>copula</B>, i.e. the verb <I>be</I> that is affected. We define
the copula as the operation
</P>
<PRE>
oper copula : Number -&gt; Str = \n -&gt;
case n of {
Sg =&gt; "is" ;
Pl =&gt; "are"
} ;
</PRE>
<P>
We don't need to inflect the copula for person and tense in this grammar.
</P>
<P>
The form of the copula in a sentence depends on the
<B>subject</B> of the sentence, i.e. the item
that is qualified. This means that an <CODE>Item</CODE> must have such a number to provide.
The obvious way to guarantee this is by including a number field in
the linearization type:
</P>
<PRE>
lincat Item = {s : Str ; n : Number} ;
</PRE>
<P>
Now we can write precisely the <CODE>Is</CODE> rule that expresses agreement:
</P>
<PRE>
lin Is item qual = {s = item.s ++ copula item.n ++ qual.s} ;
</PRE>
<P>
The copula receives the number that it needs from the subject item.
</P>
<A NAME="toc58"></A>
<H3>Determiners</H3>
<P>
Let us turn to <CODE>Item</CODE> subjects and see how they receive their
numbers. The two rules
</P>
<PRE>
fun This, These : Kind -&gt; Item ;
</PRE>
<P>
form <CODE>Item</CODE>s from <CODE>Kind</CODE>s by adding <B>determiners</B>, either
<I>this</I> or <I>these</I>. The determiners
require different numbers of their <CODE>Kind</CODE> arguments: <CODE>This</CODE>
requires the singular (<I>this pizza</I>) and <CODE>These</CODE> the plural
(<I>these pizzas</I>). The <CODE>Kind</CODE> is the same in both cases: <CODE>Pizza</CODE>.
Thus a <CODE>Kind</CODE> must have both singular and plural forms.
The obvious way to express this is by using a table:
</P>
<PRE>
lincat Kind = {s : Number =&gt; Str} ;
</PRE>
<P>
The linearization rules for <CODE>This</CODE> and <CODE>These</CODE> can now be written
</P>
<PRE>
lin This kind = {
s = "this" ++ kind.s ! Sg ;
n = Sg
} ;
lin These kind = {
s = "these" ++ kind.s ! Pl ;
n = Pl
} ;
</PRE>
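<P>
To see how the number travels from the determiner to the copula, here is a worked
computation of the linearization of <CODE>Is (These Pizza) Warm</CODE>. It assumes
that <CODE>Pizza</CODE> is linearized as <CODE>regNoun "pizza"</CODE> and
<CODE>Warm</CODE> as <CODE>{s = "warm"}</CODE>.
</P>
<PRE>
-- linearization of These Pizza
{s = "these" ++ (regNoun "pizza").s ! Pl ; n = Pl}
  ===&gt; {s = "these" ++ "pizzas" ; n = Pl}

-- linearization of Is (These Pizza) Warm
{s = "these" ++ "pizzas" ++ copula Pl ++ "warm"}
  ===&gt; {s = "these" ++ "pizzas" ++ "are" ++ "warm"}
</PRE>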
<P>
The grammatical relation between the determiner and the noun is similar to
agreement but, because of some differences that we do not go into here,
it is often called <B>government</B>.
</P>
<P>
Since the same pattern for determination is used four times in
the <CODE>FoodsEng</CODE> grammar, we codify it as an operation,
</P>
<PRE>
oper det :
Number -&gt; Str -&gt; {s : Number =&gt; Str} -&gt; {s : Str ; n : Number} =
\n,d,kind -&gt; {
s = d ++ kind.s ! n ;
n = n
} ;
<P>
Now we can write, for instance,
</P>
<PRE>
lin This = det Sg "this" ;
lin These = det Pl "these" ;
</PRE>
<P>
Notice the order of arguments that permits partial
application (<a href="#secpartapp">here</a>).
</P>
<P>
In a more <B>lexicalized</B> grammar, determiners would be made into a
category of their own and given an inherent number:
</P>
<PRE>
lincat Det = {s : Str ; n : Number} ;
fun Det : Det -&gt; Kind -&gt; Item ;
lin Det det kind = {
s = det.s ++ kind.s ! det.n ;
n = det.n
} ;
</PRE>
<P>
Linguistically motivated grammars, such as the GF resource grammars,
usually favour lexicalized treatments of words; see <a href="#seclexical">here</a> below.
Notice that the fields of the record in <CODE>Det</CODE> are precisely the two
arguments needed in the <CODE>det</CODE> operation.
</P>
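<P>
In such a lexicalized treatment, the individual determiners become lexical entries
of the category <CODE>Det</CODE>. The following fragment is a sketch; the function names
<CODE>this_Det</CODE> and <CODE>these_Det</CODE> are hypothetical.
</P>
<PRE>
cat Det ;
fun this_Det, these_Det : Det ;
-- each determiner carries its string and its inherent number
lin this_Det  = {s = "this"  ; n = Sg} ;
lin these_Det = {s = "these" ; n = Pl} ;
</PRE>
<P>
Adding a new determiner then means adding one entry of this kind, with no change to the combination rule.
</P>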
<A NAME="toc59"></A>
<H3>Parametric vs. inherent features</H3>
<P>
<CODE>Kind</CODE>s, like <B>common nouns</B> in English in general, have both singular
and plural forms; what form is chosen is determined by the construction
in which the noun is used. We say that the number is a
<B>parametric feature</B> of nouns. In GF, parametric features
appear as argument types of tables in linearization types.
</P>
<PRE>
lincat Kind = {s : Number =&gt; Str} ;
</PRE>
<P>
<CODE>Item</CODE>s, like <B>noun phrases</B> in English in general, don't
have variation in number. The number is instead an <B>inherent feature</B>,
which the noun phrase passes to the verb. In GF, inherent features
appear as record fields in linearization types.
</P>
<PRE>
lincat Item = {s : Str ; n : Number} ;
</PRE>
<P>
A category can have both parametric and inherent features. As we will see
in the Italian <CODE>Foods</CODE> grammar, nouns have parametric number and
inherent gender:
</P>
<PRE>
lincat Kind = {s : Number =&gt; Str ; g : Gender} ;
</PRE>
<P>
Nothing prevents the same parameter type from appearing both
as parametric and inherent feature, or the appearance of several inherent
features of the same type, etc. Determining the linearization types
of categories is one of the most crucial steps in the design of a GF
grammar. These two conditions must be in balance:
</P>
<UL>
<LI>existence: what forms are possible to build by morphological and
other means?
<LI>need: what features are expected via agreement or government?
</UL>
<P>
Grammar books and dictionaries give good advice on existence; for instance,
an Italian dictionary has entries such as
<center>
<B>uomo</B>, pl. <I>uomini</I>, n.m. "man"
</center>
which tells us that <I>uomo</I> is a masculine noun with the plural form <I>uomini</I>.
From this alone, or with a couple more examples, we can generalize to the type
for all nouns in Italian: they have both singular and plural forms and thus
a parametric number, and they have an inherent gender.
</P>
<P>
The distinction between parametric and inherent features can be stated in
object-oriented programming terms: a linearization type is like a <B>class</B>,
which has a <B>method</B> for linearization and also some <B>attributes</B>.
In this class, the parametric features appear as arguments to the
linearization method, whereas the inherent features appear as attributes.
</P>
<P>
For words, inherent features are usually given <I>ad hoc</I> as lexical information.
For combinations, they are typically <I>inherited</I> from some part of the construction.
For instance, qualified noun constructs in Italian inherit their gender from the noun part
(called the <B>head</B> of the construction in linguistics):
</P>
<PRE>
lin QKind qual kind =
let gen = kind.g in {
s = table {n =&gt; kind.s ! n ++ qual.s ! gen ! n} ;
g = gen
} ;
</PRE>
<P>
This rule uses a <B>local definition</B> (also known as a <B>let expression</B>) to
avoid computing <CODE>kind.g</CODE> twice, and also to express the linguistic
generalization that it is the same gender that is both passed to
the adjective and inherited by the construct.
The parametric number feature is in this rule passed to both the noun and
the adjective. In the table, a <B>variable pattern</B> is used to match
any possible number. Variables introduced in patterns are in scope in
the right-hand sides of corresponding branches. Again, it is good to
use a variable to express the linguistic generalization that the number
is passed to the parts, rather than expand the table into <CODE>Sg</CODE> and <CODE>Pl</CODE>
branches.
</P>
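<P>
The following fragment sketches the types and two lexical entries that the
<CODE>QKind</CODE> rule above presupposes. The linearization type of <CODE>Quality</CODE>
is our assumption, read off from the selection <CODE>qual.s ! gen ! n</CODE>.
</P>
<PRE>
param
  Number = Sg | Pl ;
  Gender = Masc | Fem ;
lincat
  Kind    = {s : Number =&gt; Str ; g : Gender} ;
  Quality = {s : Gender =&gt; Number =&gt; Str} ;
lin
  -- a feminine noun: the gender is an inherent feature
  Pizza = {s = table {Sg =&gt; "pizza" ; Pl =&gt; "pizze"} ; g = Fem} ;
  -- an adjective: both gender and number are parametric
  Warm = {s = table {
    Masc =&gt; table {Sg =&gt; "caldo" ; Pl =&gt; "caldi"} ;
    Fem  =&gt; table {Sg =&gt; "calda" ; Pl =&gt; "calde"}
    }
  } ;
</PRE>
<P>
With these entries, the plural form of <CODE>QKind Warm Pizza</CODE> is computed as
<CODE>"pizze" ++ "calde"</CODE>, and its inherited gender is <CODE>Fem</CODE>.
</P>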
<P>
Sometimes the puzzle of making agreement and government work in a grammar has
several solutions. For instance, <B>precedence</B> in programming languages can
be equivalently described by a parametric or an inherent feature
(see <a href="#secprecedence">here</a> below).
</P>
<P>
In natural language applications that use the resource grammar library,
all parameters are hidden from the user, who thereby does not need to bother
about them. The only thing that she has to think about is what linguistic
categories are given as linearization types to each semantic category.
</P>
<P>
For instance, the GF resource grammar library has a category <CODE>NP</CODE> of
noun phrases, <CODE>AP</CODE> of adjectival phrases, and <CODE>Cl</CODE> of sentence-like clauses.
In the implementation of <CODE>Foods</CODE> <a href="#secenglish">here</a>, we will define
</P>
<PRE>
lincat Phrase = Cl ; Item = NP ; Quality = AP ;
</PRE>
<P>
To express that an item has a quality, we will use a resource function
</P>
<PRE>
mkCl : NP -&gt; AP -&gt; Cl ;
</PRE>
<P>
in the linearization rule:
</P>
<PRE>
lin Is = mkCl ;
</PRE>
<P>
In this way, we have no need to think about parameters and agreement.
<a href="#chapfive">The fifth chapter</a> will show a complete implementation of <CODE>Foods</CODE> by the
resource grammar, port it to many new languages, and extend it with
many new constructs.
</P>
<A NAME="toc60"></A>
<H2>An English concrete syntax for Foods with parameters</H2>
<P>
We repeat some of the rules above by showing the entire
module <CODE>FoodsEng</CODE>, equipped with parameters. The parameters and
operations are, for the sake of brevity, included in the same module
and not in a separate <CODE>resource</CODE>. However, some string operations
from the library <CODE>Prelude</CODE> are used.
</P>
<PRE>
--# -path=.:prelude
concrete FoodsEng of Foods = open Prelude in {
lincat
Phrase, Quality = SS ;
Kind = {s : Number =&gt; Str} ;
Item = {s : Str ; n : Number} ;
lin
Is item quality = ss (item.s ++ copula item.n ++ quality.s) ;
This = det Sg "this" ;
That = det Sg "that" ;
These = det Pl "these" ;
Those = det Pl "those" ;
QKind quality kind = {s = table {n =&gt; quality.s ++ kind.s ! n}} ;
Wine = regNoun "wine" ;
Cheese = regNoun "cheese" ;
Fish = noun "fish" "fish" ;
Pizza = regNoun "pizza" ;
Very = prefixSS "very" ;
Fresh = ss "fresh" ;
Warm = ss "warm" ;
Italian = ss "Italian" ;
Expensive = ss "expensive" ;
Delicious = ss "delicious" ;
Boring = ss "boring" ;
param
Number = Sg | Pl ;
oper
det : Number -&gt; Str -&gt; {s : Number =&gt; Str} -&gt; {s : Str ; n : Number} =
\n,d,cn -&gt; {
s = d ++ cn.s ! n ;
n = n
} ;
noun : Str -&gt; Str -&gt; {s : Number =&gt; Str} =
\man,men -&gt; {s = table {
Sg =&gt; man ;
Pl =&gt; men
}
} ;
regNoun : Str -&gt; {s : Number =&gt; Str} =
\car -&gt; noun car (car + "s") ;
copula : Number -&gt; Str =
\n -&gt; case n of {
Sg =&gt; "is" ;
Pl =&gt; "are"
} ;
}
</PRE>
<P>
To find the Prelude library (or, in general,
GF files located in other directories), a <B>path directive</B> is needed,
either on the command line or as the first line of
the topmost file compiled.
The paths in the path list are separated by colons (<CODE>:</CODE>), and every item
is interpreted primarily relative to the current directory and, secondarily,
to the value of <CODE>GF_LIB_PATH</CODE> (<B>GF library path</B>). Hence it is a
good idea to make <CODE>GF_LIB_PATH</CODE> point to your <CODE>GF/lib/</CODE> directory whenever
you start working in GF. For instance, in the Bash shell this is done by
</P>
<PRE>
% export GF_LIB_PATH=&lt;the location of GF/lib in your file system&gt;
</PRE>
<P></P>
<A NAME="toc61"></A>
<H2>More on inflection paradigms</H2>
<P>
<a name="secinflection"></a>
</P>
<P>
Let us try to extend the English noun paradigms so that we can
deal with all nouns, not just the regular ones. The goal is to
provide a morphology module that is maximally easy to use when
words are added to the lexicon. In fact, we can think of a
division of labour where a linguistically trained grammarian
writes a morphology and hands it over to the lexicon writer
who knows much less about the rules of inflection.
</P>
<P>
In passing, we will introduce some new GF constructs: local definitions,
regular expression patterns, and operation overloading.
</P>
<A NAME="toc62"></A>
<H3>Worst-case functions</H3>
<P>
To start with, it is useful to perform <B>data abstraction</B> from the type
of nouns by writing a constructor operation, a <B>worst-case function</B>:
</P>
<PRE>
oper mkNoun : Str -&gt; Str -&gt; Noun = \x,y -&gt; {
s = table {
Sg =&gt; x ;
Pl =&gt; y
}
} ;
</PRE>
<P>
This presupposes that we have defined
</P>
<PRE>
oper Noun : Type = {s : Number =&gt; Str} ;
</PRE>
<P>
Using <CODE>mkNoun</CODE>, we can define
</P>
<PRE>
lin Mouse = mkNoun "mouse" "mice" ;
</PRE>
<P>
and
</P>
<PRE>
oper regNoun : Str -&gt; Noun = \x -&gt; mkNoun x (x + "s") ;
</PRE>
<P>
instead of writing the inflection tables explicitly.
</P>
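<P>
As a rough illustration of what the worst-case function saves, the following
sketch shows the explicit tables that such definitions stand for. The entry
<CODE>Dog</CODE> is a hypothetical example that is not part of the grammars in this
tutorial, and the sketch assumes the one-dimensional <CODE>Noun</CODE> type given above.
The two groups are alternatives and would not appear in the same module.
</P>
<PRE>
  -- with the constructor operations
  lin Mouse = mkNoun "mouse" "mice" ;
  lin Dog   = regNoun "dog" ;          -- Dog is a hypothetical entry

  -- the same entries with the inflection tables written out by hand
  lin Mouse = {s = table {Sg =&gt; "mouse" ; Pl =&gt; "mice"}} ;
  lin Dog   = {s = table {Sg =&gt; "dog" ; Pl =&gt; "dogs"}} ;
</PRE>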
<P>
Nouns like <I>mouse</I>-<I>mice</I> are so irregular that
it hardly makes sense to see them as instances of a
paradigm that forms the plural from the singular form.
But in general, as we will see, there can be different
regular patterns in a language.
</P>
<P>
The grammar engineering advantage of worst-case functions is that
the author of the resource module may change the definitions of
<CODE>Noun</CODE> and <CODE>mkNoun</CODE>, and still retain the
interface (i.e. the system of type signatures) that makes it
correct to use these functions in concrete modules. In programming
terms, <CODE>Noun</CODE> is then treated as an <B>abstract datatype</B>:
its definition is not available, but only an indirect way of constructing
its objects.
</P>
<P>
A case where a change of the <CODE>Noun</CODE> type could
actually happen is if we introduce <B>case</B> (nominative or
genitive) in the noun inflection:
</P>
<PRE>
param Case = Nom | Gen ;
oper Noun : Type = {s : Number =&gt; Case =&gt; Str} ;
</PRE>
<P>
Now we have to redefine the worst-case function
</P>
<PRE>
oper mkNoun : Str -&gt; Str -&gt; Noun = \x,y -&gt; {
  s = table {
    Sg =&gt; table {
      Nom =&gt; x ;
      Gen =&gt; x + "'s"
      } ;
    Pl =&gt; table {
      Nom =&gt; y ;
      Gen =&gt; y + case last y of {
        "s" =&gt; "'" ;
        _ =&gt; "'s"
        }
      }
    }
  } ;
</PRE>
<P>
From this level up, however, we can retain the old definitions
</P>
<PRE>
lin Mouse = mkNoun "mouse" "mice" ;
oper regNoun : Str -&gt; Noun = \x -&gt; mkNoun x (x + "s") ;
</PRE>
<P>
which will just compute to different values now.
</P>
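<P>
To make the change concrete, here is a sketch of the values that the two
definitions should now compute to. The values are worked out by hand from the
redefined <CODE>mkNoun</CODE>, and <CODE>===</CODE> is used informally for "computes to".
</P>
<PRE>
  mkNoun "mouse" "mice"
    ===
  {s = table {
     Sg =&gt; table {Nom =&gt; "mouse" ; Gen =&gt; "mouse's"} ;
     Pl =&gt; table {Nom =&gt; "mice"  ; Gen =&gt; "mice's"}
     }
  }

  regNoun "dog"
    ===
  {s = table {
     Sg =&gt; table {Nom =&gt; "dog"  ; Gen =&gt; "dog's"} ;
     Pl =&gt; table {Nom =&gt; "dogs" ; Gen =&gt; "dogs'"}
     }
  }
</PRE>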
<P>
In the last definition of <CODE>mkNoun</CODE>, we used a case expression
on the last character of the plural form to decide if the genitive
should be formed with an <CODE>'</CODE> (as in <I>dogs</I>-<I>dogs'</I>) or with
<CODE>'s</CODE> (as in <I>mice</I>-<I>mice's</I>). The expression <CODE>last y</CODE>
uses the <CODE>Prelude</CODE> operation
</P>
<PRE>
last : Str -&gt; Str ;
</PRE>
<P>
The case expression uses <B>pattern matching over strings</B>, which
is supported in GF alongside pattern matching over
parameters.
</P>
<A NAME="toc63"></A>
<H3>Intelligent paradigms</H3>
<P>
Between the completely regular <I>dog</I>-<I>dogs</I> and the completely
irregular <I>mouse</I>-<I>mice</I>, there are some
predictable variations:
</P>
<UL>
<LI>nouns ending with a <I>y</I>: <I>fly</I>-<I>flies</I>, except if
a vowel precedes the <I>y</I>: <I>boy</I>-<I>boys</I>
<LI>nouns ending with <I>s</I>, <I>ch</I>, and a number of
other endings: <I>bus</I>-<I>buses</I>, <I>leech</I>-<I>leeches</I>
</UL>
<P>
One way to deal with them would be to provide alternative paradigms:
</P>
<PRE>
noun_y : Str -&gt; Noun = \fly -&gt; mkNoun fly (init fly + "ies") ;
noun_s : Str -&gt; Noun = \bus -&gt; mkNoun bus (bus + "es") ;
</PRE>
<P>
The Prelude function <CODE>init</CODE> drops the last character of a token.
But this solution has some drawbacks:
</P>
<UL>
<LI>it can be difficult to select the correct paradigm
<LI>it can be difficult to remember the names of the different paradigms
</UL>
<P>
To help the lexicon builder in this task, the morphology programmer
can put some intelligence in the regular noun paradigm. The easiest
way to express this in GF is by the use of <B>regular expression patterns</B>:
</P>
<PRE>
regNoun : Str -&gt; Noun = \w -&gt;
  let
    ws : Str = case w of {
      _ + ("a" | "e" | "i" | "o") + "o" =&gt; w + "s" ;     -- bamboo
      _ + ("s" | "x" | "sh" | "ch" | "o") =&gt; w + "es" ;  -- bus, leech, hero
      _ + "z" =&gt; w + "zes" ;                             -- quiz
      _ + ("a" | "e" | "o" | "u") + "y" =&gt; w + "s" ;     -- boy
      x + "y" =&gt; x + "ies" ;                             -- fly
      _ =&gt; w + "s"                                       -- car
      }
  in
    mkNoun w ws
</PRE>
<P>
In this definition, we have used a local definition just in order to
structure the code, even though there is no multiple evaluation to eliminate.
In the case expression itself, we have used
</P>
<UL>
<LI><B>disjunctive patterns</B> <I>P</I> <CODE>|</CODE> <I>Q</I>
<LI><B>concatenation patterns</B> <I>P</I> <CODE>+</CODE> <I>Q</I>
</UL>
<P>
The patterns are ordered in such a way that, for instance,
the suffix <CODE>"oo"</CODE> prevents <I>bamboo</I> from matching the suffix
<CODE>"o"</CODE>.
</P>
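<P>
The effect of the branch order can be traced on a few sample words. The
following sketch uses hypothetical operation names (<CODE>bamboo_N</CODE> and so on)
that are not part of any library; the comments give the first branch that
matches and the forms obtained by following the patterns by hand.
</P>
<PRE>
oper
  bamboo_N : Noun = regNoun "bamboo" ;  -- vowel + "o"  : bamboo, bamboos
  hero_N   : Noun = regNoun "hero" ;    -- final "o"    : hero, heroes
  bus_N    : Noun = regNoun "bus" ;     -- final "s"    : bus, buses
  quiz_N   : Noun = regNoun "quiz" ;    -- final "z"    : quiz, quizzes
  boy_N    : Noun = regNoun "boy" ;     -- vowel + "y"  : boy, boys
  fly_N    : Noun = regNoun "fly" ;     -- final "y"    : fly, flies
  car_N    : Noun = regNoun "car" ;     -- default      : car, cars
</PRE>
<P>
If the first two branches were swapped, <I>bamboo</I> would already match the
final <CODE>"o"</CODE> and receive the plural <I>bambooes</I>.
</P>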
<P>
<B>Exercise</B>. The same rules that form plural nouns in English also
apply in the formation of third-person singular verbs.
Write a regular verb paradigm that uses this idea, but first
rewrite <CODE>regNoun</CODE> so that the analysis needed to build <I>s</I>-forms
is factored out as a separate <CODE>oper</CODE>, which is shared with
<CODE>regVerb</CODE>.
</P>
<P>
<B>Exercise</B>. Extend the verb paradigms to cover all verb forms
in English, with special care taken of variations with the suffix
<I>ed</I> (e.g. <I>try</I>-<I>tried</I>, <I>use</I>-<I>used</I>).
</P>
<P>
<B>Exercise</B>. Implement the German <B>Umlaut</B> operation on word stems.
The operation changes the vowel of the stressed stem syllable as follows:
<I>a</I> to <I>&auml;</I>, <I>au</I> to <I>&auml;u</I>, <I>o</I> to <I>&ouml;</I>, and <I>u</I> to <I>&uuml;</I>. You
can assume that the operation only takes syllables as arguments. Test the
operation to see whether it correctly changes <I>Arzt</I> to <I>&Auml;rzt</I>,
<I>Baum</I> to <I>B&auml;um</I>, <I>Topf</I> to <I>T&ouml;pf</I>, and <I>Kuh</I> to <I>K&uuml;h</I>.
</P>
<A NAME="toc64"></A>
<H3>Function types with variables</H3>
<P>
In <a href="#chapsix">the sixth chapter</a>, we will introduce <B>dependent function types</B>, where
the value type depends on the argument. To this end, we need a notation
that binds a variable to the argument type, as in
</P>
<PRE>
switchOff : (k : Kind) -&gt; Action k
</PRE>
<P>
Function types <I>without</I>
variables are actually a shorthand notation: writing
</P>
<PRE>
PredVP : NP -&gt; VP -&gt; S
</PRE>
<P>
is shorthand for
</P>
<PRE>
PredVP : (x : NP) -&gt; (y : VP) -&gt; S
</PRE>
<P>
or any other naming of the variables. Actually the use of variables
sometimes shortens the code, since they can share a type:
</P>
<PRE>
octuple : (x,y,z,u,v,w,s,t : Str) -&gt; Str
</PRE>
<P>
If a bound variable is not used, it can here, as elsewhere in GF, be replaced by
a wildcard:
</P>
<PRE>
octuple : (_,_,_,_,_,_,_,_ : Str) -&gt; Str
</PRE>
<P>
A good practice for functions with many arguments of the same type
is to indicate the number of arguments:
</P>
<PRE>
octuple : (x1,_,_,_,_,_,_,x8 : Str) -&gt; Str
</PRE>
<P>
One can also use heuristic variable names to document what
information each argument is expected to provide.
This is very handy in the types of inflection paradigms:
</P>
<PRE>
mkNoun : (mouse,mice : Str) -&gt; Noun
</PRE>
<P></P>
<A NAME="toc65"></A>
<H3>Separating operation types and definitions</H3>
<P>
In grammars intended as libraries, it is useful to separate operation
definitions from their type signatures. The user is only interested
in the type, whereas the definition is kept for the implementor and
the maintainer. This is possible by using separate <CODE>oper</CODE> fragments
for the two parts:
</P>
<PRE>
oper regNoun : Str -&gt; Noun ;
oper regNoun s = mkNoun s (s + "s") ;
</PRE>
<P>
The type checker combines the two into one <CODE>oper</CODE> judgement to see
if the definition matches the type. Notice that, in this syntax, it
is moreover possible to bind the argument variables on the left-hand side
instead of using lambda abstraction.
</P>
<P>
In the library module, the type signatures are typically placed at
the beginning and the definitions at the end. A more radical separation
can be achieved by using the <CODE>interface</CODE> and <CODE>instance</CODE> module types
(see <a href="#secinterface">here</a>): the type signatures are placed in the interface
and the definitions in the instance.
</P>
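<P>
A library module organized in this way might look roughly as follows. This is
only a sketch: the module name <CODE>NounSketch</CODE> is hypothetical, and the
definitions are the simple one-dimensional ones used earlier in this section.
</P>
<PRE>
resource NounSketch = {      -- hypothetical module name
  param
    Number = Sg | Pl ;
  oper
    Noun : Type = {s : Number =&gt; Str} ;

    -- type signatures: what the library user reads
    mkNoun  : (mouse,mice : Str) -&gt; Noun ;
    regNoun : (dog : Str) -&gt; Noun ;

    -- definitions: for the implementor and the maintainer
    mkNoun x y = {s = table {Sg =&gt; x ; Pl =&gt; y}} ;
    regNoun s  = mkNoun s (s + "s") ;
}
</PRE>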
<A NAME="toc66"></A>
<H3>Overloading of operations</H3>
<P>
Large libraries, such as the GF Resource Grammar Library, may define
hundreds of names. This can be impractical
for both the library author and the user: the author has to invent longer
and longer names which are not always intuitive,
and the user has to learn or at least be able to find all these names.
A solution to this problem, adopted by languages such as C++,
is <B>overloading</B>: one and the same name can be used for several functions.
When such a name is used, the
compiler performs <B>overload resolution</B> to find out which of
the possible functions is meant. Overload resolution is based on
the types of the functions: all functions that
have the same name must have different types.
</P>
<P>
In C++, functions with the same name can be scattered everywhere in the program.
In GF, they must be grouped together in <CODE>overload</CODE> groups. Here is an example
of an overload group, giving the different ways to define nouns in English:
</P>
<PRE>
oper mkN : overload {
mkN : (dog : Str) -&gt; Noun ; -- regular nouns
mkN : (mouse,mice : Str) -&gt; Noun ; -- irregular nouns
}
</PRE>
<P>
Intuitively, the function comes very close to the way in which
regular and irregular words are given in most dictionaries. If the
word is regular, just one form is needed. If it is irregular,
more forms are given. There is no need to use explicit paradigm
names.
</P>
<P>
The <CODE>mkN</CODE> example gives only the possible types of the overloaded
operation. Their definitions can be given separately, possibly in another module.
Here is a definition of the above overload group:
</P>
<PRE>
oper mkN = overload {
mkN : (dog : Str) -&gt; Noun = regNoun ;
mkN : (mouse,mice : Str) -&gt; Noun = mkNoun ;
}
</PRE>
<P>
Notice that the types of the branches must be repeated so that they can be
associated with proper definitions; the order of the branches has no
significance.
</P>
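<P>
From the lexicon writer's point of view, the overload group could then be used
as in the following sketch, where <CODE>Dog</CODE> is a hypothetical entry. The
number of string arguments is what lets overload resolution pick the branch.
</P>
<PRE>
lin
  Dog   = mkN "dog" ;            -- one Str argument:  resolved to regNoun
  Mouse = mkN "mouse" "mice" ;   -- two Str arguments: resolved to mkNoun
</PRE>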
<P>
<B>Exercise</B>. Design a system of English verb paradigms presented by
an overload group.
</P>
<A NAME="toc67"></A>
<H3>Morphological analysis and morphology quiz</H3>
<P>
Even though morphology in GF is
mostly used as an auxiliary for syntax, it
can also be useful in its own right. The command <CODE>morpho_analyse = ma</CODE>
can be used to read a text and return for each word the analyses that
it has in the current concrete syntax.
</P>
<PRE>
&gt; read_file bible.txt | morpho_analyse
</PRE>
<P>
In the same way as translation exercises, morphological exercises can
be generated, by the command <CODE>morpho_quiz = mq</CODE>. Usually,
the category is then set to some lexical category. For instance,
French irregular verbs in the resource grammar library can be trained as
follows:
</P>
<PRE>
% gf -path=alltenses:prelude $GF_LIB_PATH/alltenses/IrregFre.gfc
&gt; morpho_quiz -cat=V
Welcome to GF Morphology Quiz.
...
r&eacute;appara&icirc;tre : VFin VCondit Pl P2
r&eacute;apparaitriez
&gt; No, not r&eacute;apparaitriez, but
r&eacute;appara&icirc;triez
Score 0/1
</PRE>
<P>
Just like translation exercises, a list of morphological exercises can be generated
off-line and saved in a
file for later use, by the command <CODE>morpho_list = ml</CODE>.
</P>
<PRE>
&gt; morpho_list -number=25 -cat=V | write_file exx.txt
</PRE>
<P>
The <CODE>number</CODE> flag gives the number of exercises generated.
</P>
<A NAME="toc68"></A>
<H2>The Italian Foods grammar</H2>
<P>
<a name="secitalian"></a>
</P>
<P>
We conclude the parametrization of the Food grammar by presenting an
Italian variant, now complete with parameters, inflection, and
agreement.
</P>
<P>
The header part is similar to English:
</P>
<PRE>
--# -path=.:prelude
concrete FoodsIta of Foods = open Prelude in {
</PRE>
<P>
Parameters include not only number but also gender.
</P>
<PRE>
param
Number = Sg | Pl ;
Gender = Masc | Fem ;
</PRE>
<P>
Qualities are inflected for gender and number, whereas kinds
have a parametric number (as in English) and an inherent gender.
Items have an inherent number (as in English) but also gender.
</P>
<PRE>
lincat
S = SS ;
Quality = {s : Gender =&gt; Number =&gt; Str} ;
Kind = {s : Number =&gt; Str ; g : Gender} ;
Item = {s : Str ; g : Gender ; n : Number} ;
</PRE>
<P>
A Quality is expressed by an adjective, which in Italian has one form for each
gender-number combination.
</P>
<PRE>
oper
adjective : (_,_,_,_ : Str) -&gt; {s : Gender =&gt; Number =&gt; Str} =
\nero,nera,neri,nere -&gt; {
s = table {
Masc =&gt; table {
Sg =&gt; nero ;
Pl =&gt; neri
} ;
Fem =&gt; table {
Sg =&gt; nera ;
Pl =&gt; nere
}
}
} ;
</PRE>
<P>
The very common case of regular adjectives works by adding
endings to the stem.
</P>
<PRE>
regAdj : Str -&gt; {s : Gender =&gt; Number =&gt; Str} = \nero -&gt;
let ner = init nero
in adjective nero (ner + "a") (ner + "i") (ner + "e") ;
</PRE>
<P></P>
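<P>
As a small sketch of what <CODE>regAdj</CODE> computes, and of where it stops being
adequate, consider the two expansions below. They are worked out by hand from
the definition above, with <CODE>===</CODE> used informally for "computes to".
</P>
<PRE>
  regAdj "caldo"
    ===
  adjective "caldo" "calda" "caldi" "calde"

  regAdj "fresco"
    ===
  adjective "fresco" "fresca" "fresci" "fresce"   -- not the correct plurals
</PRE>
<P>
The correct plural forms are <I>freschi</I> and <I>fresche</I>, which is why an
adjective like <I>fresco</I> has to be given with the worst-case function
<CODE>adjective</CODE> and all four forms.
</P>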
<P>
For noun inflection, there are several paradigms; since only two forms
are ever needed, we will just give them explicitly (the resource grammar
library also has a paradigm that takes the singular form and infers the
plural and the gender from it).
</P>
<PRE>
noun : Str -&gt; Str -&gt; Gender -&gt; {s : Number =&gt; Str ; g : Gender} =
\vino,vini,g -&gt; {
s = table {
Sg =&gt; vino ;
Pl =&gt; vini
} ;
g = g
} ;
</PRE>
<P>
As in <CODE>FoodsEng</CODE>, we need only number variation for the copula.
</P>
<PRE>
copula : Number -&gt; Str =
\n -&gt; case n of {
Sg =&gt; "&egrave;" ;
Pl =&gt; "sono"
} ;
</PRE>
<P>
Determination is more complex than in English, because of gender:
it uses separate determiner forms for the two genders, and selects
one of them according to the gender of the noun determined.
</P>
<PRE>
det : Number -&gt; Str -&gt; Str -&gt; {s : Number =&gt; Str ; g : Gender} -&gt;
{s : Str ; g : Gender ; n : Number} =
\n,m,f,cn -&gt; {
s = case cn.g of {Masc =&gt; m ; Fem =&gt; f} ++ cn.s ! n ;
g = cn.g ;
n = n
} ;
</PRE>
<P>
Here is, finally, the complete set of linearization rules.
</P>
<PRE>
lin
Is item quality =
ss (item.s ++ copula item.n ++ quality.s ! item.g ! item.n) ;
This = det Sg "questo" "questa" ;
That = det Sg "quello" "quella" ;
These = det Pl "questi" "queste" ;
Those = det Pl "quelli" "quelle" ;
QKind quality kind = {
s = \\n =&gt; kind.s ! n ++ quality.s ! kind.g ! n ;
g = kind.g
} ;
Wine = noun "vino" "vini" Masc ;
Cheese = noun "formaggio" "formaggi" Masc ;
Fish = noun "pesce" "pesci" Masc ;
Pizza = noun "pizza" "pizze" Fem ;
Very qual = {s = \\g,n =&gt; "molto" ++ qual.s ! g ! n} ;
Fresh = adjective "fresco" "fresca" "freschi" "fresche" ;
Warm = regAdj "caldo" ;
Italian = regAdj "italiano" ;
Expensive = regAdj "caro" ;
Delicious = regAdj "delizioso" ;
Boring = regAdj "noioso" ;
}
</PRE>
<P></P>
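<P>
To see how agreement is propagated through these rules, the linearization of
one tree can be followed step by step. The trace below is worked out by hand
from the rules above and is meant as an illustration only; <CODE>===</CODE> is used
informally for "computes to".
</P>
<PRE>
  -- the tree
  Is (These (QKind Fresh Pizza)) Italian

  -- QKind inherits the gender of the kind and stays parametric in number
  QKind Fresh Pizza
    === {s = \\n =&gt; Pizza.s ! n ++ Fresh.s ! Fem ! n ; g = Fem}

  -- the determiner fixes the number and selects its feminine form
  These (QKind Fresh Pizza)
    === {s = "queste" ++ "pizze" ++ "fresche" ; g = Fem ; n = Pl}

  -- the copula agrees in number, the predicative adjective in gender and number
  Is (These (QKind Fresh Pizza)) Italian
    === {s = "queste" ++ "pizze" ++ "fresche" ++ "sono" ++ "italiane"}
</PRE>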
<P>
<B>Exercise</B>. Experiment with multilingual generation and translation in the
<CODE>Foods</CODE> grammars.
</P>
<P>
<B>Exercise</B>. Add items, qualities, and determiners to the grammar, and try to get
their inflection and inherent features right.
</P>
<P>
<B>Exercise</B>. Write a concrete syntax of <CODE>Food</CODE> for a language of your choice,
now aiming for complete grammatical correctness by the use of parameters.
</P>
<P>
<B>Exercise</B>. Measure the size of the context-free grammar corresponding to
<CODE>FoodsIta</CODE>. You can do this by printing the grammar in the context-free format
(<CODE>print_grammar -printer=cfg</CODE>) and counting the lines.
</P>
<A NAME="toc69"></A>
<H2>Discontinuous constituents</H2>
<P>
A linearization type may contain more than one string.
One example where this is useful is English particle
verbs, such as <I>switch off</I>. The linearization of
a sentence may place the object between the verb and the particle:
<I>he switched it off</I>.
</P>
<P>
The following judgement defines transitive verbs as
<B>discontinuous constituents</B>, i.e. as having a linearization
type with two strings and not just one.
</P>
<PRE>
lincat TV = {s : Number =&gt; Str ; part : Str} ;
</PRE>
<P>
In the abstract syntax, we can now have a rule that combines a subject and an
object item with a transitive verb to form a sentence:
</P>
<PRE>
fun AppTV : Item -&gt; TV -&gt; Item -&gt; Phrase ;
</PRE>
<P>
The linearization rule places the object between the two parts of the verb:
</P>
<PRE>
lin AppTV subj tv obj =
{s = subj.s ++ tv.s ! subj.n ++ obj.s ++ tv.part} ;
</PRE>
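<P>
A lexical entry for such a verb could be sketched as follows. The function
<CODE>SwitchOff</CODE> is a hypothetical example, and the sketch assumes that
<CODE>Item</CODE> has the linearization type <CODE>{s : Str ; n : Number}</CODE> as in
<CODE>FoodsEng</CODE>.
</P>
<PRE>
fun SwitchOff : TV ;          -- hypothetical lexical entry
lin SwitchOff = {
  s = table {Sg =&gt; "switches" ; Pl =&gt; "switch"} ;
  part = "off"
  } ;

-- for a singular subject, the rule for AppTV then gives
--   AppTV subj SwitchOff obj
--     === {s = subj.s ++ "switches" ++ obj.s ++ "off"}
</PRE>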
<P>
There is no restriction on the number of discontinuous constituents
(or other fields) a <CODE>lincat</CODE> may contain. The only condition is that
the fields must be built from records, tables,
parameters, and <CODE>Str</CODE>, but not functions.
</P>
<P>
Notice that the parsing and linearization commands only give accurate
results for categories whose linearization type has a unique <CODE>Str</CODE>
valued field labelled <CODE>s</CODE>. Therefore, discontinuous constituents
are not a good idea in top-level categories accessed by the users
of a grammar application.
</P>
<P>
<B>Exercise</B>. Define the language <CODE>a^n b^n c^n</CODE> in GF, i.e.
any number of <I>a</I>'s followed by the same number of <I>b</I>'s and
the same number of <I>c</I>'s. This language is not context-free,
but can be defined in GF by using discontinuous constituents.
</P>
<A NAME="toc70"></A>
<H2>Strings at compile time vs. run time</H2>
<P>
A common source of difficulty in GF is the set of conditions under which tokens
can be created. Tokens are created in the following ways:
</P>
<UL>
<LI>quoted string: <CODE>"foo"</CODE>
<LI>gluing : <CODE>t + s</CODE>
<LI>predefined operations <CODE>init, tail, tk, dp</CODE>
<LI>pattern matching over strings
</UL>
<P>
The general principle is that
<I>tokens must be known at compile time</I>. This means that the above operations
may not have <B>run-time variables</B> in their arguments. Run-time variables, in
turn, are the variables that stand for function arguments in linearization rules.
</P>
<P>
Hence it is not legal to write
</P>
<PRE>
cat Noun ;
fun Plural : Noun -&gt; Noun ;
lin Plural n = {s = n.s + "s"} ;
</PRE>
<P>
because <CODE>n</CODE> is a run-time variable. Also
</P>
<PRE>
lin Plural n = {s = (regNoun n).s ! Pl} ;
</PRE>
<P>
is incorrect with <CODE>regNoun</CODE> as defined <a href="#secinflection">here</a>, because the run-time
variable is eventually sent to string pattern matching and gluing.
</P>
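<P>
A legal way to get the intended effect is to build all the needed tokens in the
lexical entries, where the arguments are string literals, and to let the
function with a run-time argument only select among them. The following is a
sketch under the assumption that <CODE>Noun</CODE> has the one-dimensional type
<CODE>{s : Number =&gt; Str}</CODE> and that <CODE>regNoun</CODE> is defined as above;
<CODE>Dog</CODE> is a hypothetical entry.
</P>
<PRE>
cat Noun ;
fun
  Dog    : Noun ;                     -- hypothetical lexical entry
  Plural : Noun -&gt; Noun ;
lincat Noun = {s : Number =&gt; Str} ;
lin
  Dog = regNoun "dog" ;               -- literal argument: tokens known at compile time
  Plural n = {s = \\_ =&gt; n.s ! Pl} ;  -- selection only: no new tokens are created
</PRE>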
<P>
Writing tokens together without a space is an often-wanted behaviour, for instance,
with punctuation marks. Thus one might try to write
</P>
<PRE>
lin Question p = {s = p + "?"} ;
</PRE>
<P>
which is incorrect. The way to go is to use an <B>unlexer</B> that creates correct spacing
after linearization. Correspondingly, a <B>lexer</B> that e.g. analyses <CODE>"warm?"</CODE> into
two tokens is needed before parsing. Both are selected by flags. For instance, the setting
</P>
<PRE>
flags lexer=text ; unlexer=text ;
</PRE>
<P>
works in the desired way for English text. Lexers and unlexers are discussed
further <a href="#seclexing">here</a>.
</P>
<A NAME="toc71"></A>
<H2>Summary of GF language features</H2>
<A NAME="toc72"></A>
<H3>Parameter and table types</H3>
<P>
A judgement of the form
<center>
<CODE>param</CODE> <I>P</I> <CODE>=</CODE> <I>C1</I> <I>X1</I> <CODE>|</CODE> ... <CODE>|</CODE> <I>Cn</I> <I>Xn</I>
</center>
defines a <B>parameter type</B> <I>P</I> with <B>constructors</B> <I>C1</I> ... <I>Cn</I>.
Each constructor has a <B>context</B> <I>X</I>, which is a (possibly empty)
sequence of parameter types. A <B>parameter value</B> is an application
of a constructor to a sequence of parameter values from each type in
its context.
</P>
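<P>
As an illustration of constructors with non-empty contexts, consider the
following sketch. The type <CODE>VForm</CODE> and its constructors are hypothetical
and serve only as an example.
</P>
<PRE>
param
  Number = Sg | Pl ;
  Person = P1 | P2 | P3 ;
  VForm  = VInf | VPres Number Person ;   -- hypothetical verb form type
</PRE>
<P>
Here <CODE>VInf</CODE> has an empty context, whereas <CODE>VPres</CODE> has the context
<CODE>Number Person</CODE>. The values of <CODE>VForm</CODE> are thus <CODE>VInf</CODE>,
<CODE>VPres Sg P1</CODE>, <CODE>VPres Sg P2</CODE>, and so on: 1 + 2 * 3 = 7 values in all.
</P>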
<P>
In addition to types defined in <CODE>param</CODE> judgements, also
records of parameter types are parameter types. Their values are records
of corresponding field values.
</P>
<P>
Moreover, the type <CODE>Ints</CODE> <I>n</I> is a parameter type for any positive
integer <I>n</I>, and its values are <CODE>0</CODE>, ..., <I>n-1</I>.
</P>
<P>
A <B>table type</B> <I>P</I> <CODE>=&gt;</CODE> <I>T</I> must have a parameter type <I>P</I> as
its argument type. The normal form of an object of this type is a <B>table</B>
<center>
<CODE>table {</CODE> <I>V1</I> <CODE>=&gt;</CODE> <I>t1</I> <CODE>;</CODE> ... <CODE>;</CODE> <I>Vm</I> <CODE>=&gt;</CODE> <I>tm</I> <CODE>}</CODE>
</center>
which has a <B>branch</B> for every parameter value <I>Vi</I> of type <I>P</I>.
A table can be given in many other ways by using pattern matching.
</P>
<P>
Tables with only one branch are a common special case.
GF provides syntactic sugar for writing one-branch tables concisely:
</P>
<PRE>
\\P,...,Q =&gt; t === table {P =&gt; ... table {Q =&gt; t} ...}
</PRE>
<P></P>
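<P>
For instance, the linearization of <CODE>Very</CODE> in <CODE>FoodsIta</CODE> uses this
notation with two variables. Unfolded according to the equation above, it
stands for two nested one-branch tables:
</P>
<PRE>
  \\g,n =&gt; "molto" ++ qual.s ! g ! n
    ===
  table {g =&gt; table {n =&gt; "molto" ++ qual.s ! g ! n}}
</PRE>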
<A NAME="toc73"></A>
<H3>Pattern matching</H3>
<P>
<a name="secmatching"></a>
</P>
<P>
We will list all forms of patterns that can be used in table branches.
The following are available for any parameter types, as well
as for the types <CODE>Int</CODE> and <CODE>Str</CODE>:
</P>
<UL>
<LI>a constructor pattern <I>C P1 ... Pn</I> matches any value <I>C V1 ... Vn</I> where
each <I>Vi</I> matches <I>Pi</I>,
and binds the union of all variables bound in the subpatterns <I>Pi</I>
<LI>a record pattern
<CODE>{</CODE> <I>r1</I> <CODE>=</CODE> <I>P1</I> <CODE>;</CODE> ... <CODE>;</CODE> <I>rn</I> <CODE>=</CODE> <I>Pn</I> <CODE>}</CODE>
matches any record whose corresponding fields have values matching the patterns <I>Pi</I>,
and binds the union of all variables bound in the subpatterns <I>Pi</I>
<LI>a variable pattern <I>x</I>
(identifier other than constant parameter) matches any value, and
binds <I>x</I> to this value
<LI>the wild card <CODE>_</CODE> matches any value
<LI>a disjunctive pattern <I>P</I> <CODE>|</CODE> <I>Q</I> matches anything that
either <I>P</I> or <I>Q</I> matches; bindings must be the same in both
<LI>a negative pattern <CODE>-</CODE><I>P</I> matches anything that <I>P</I> does not match;
no bindings are returned
<LI>an alias pattern <I>x</I> <CODE>@</CODE> <I>P</I> matches whatever value <I>P</I> matches and
binds <I>x</I> to this value; also the bindings in <I>P</I> are returned
</UL>
<P>
The following patterns are only available for the type <CODE>Str</CODE>:
</P>
<UL>
<LI>a string literal pattern, e.g. <CODE>"s"</CODE>, matches the same string
<LI>a concatenation pattern <I>P</I> <CODE>+</CODE> <I>Q</I> matches any string that consists
of a prefix matching <I>P</I> and a suffix matching <I>Q</I>;
the union of bindings is returned
<LI>a repetition pattern <I>P</I><CODE>*</CODE> matches any string that can be decomposed
into strings that match <I>P</I>; no bindings are returned
</UL>
<P>
The following pattern is only available for the types <CODE>Int</CODE> and <CODE>Ints</CODE> <I>n</I>:
</P>
<UL>
<LI>an integer literal pattern, e.g. <CODE>214</CODE>, matches the same integer
</UL>
<P>
Pattern matching is performed in the order in which the branches
appear in the table: the branch of the first matching pattern is followed.
The type checker rejects sets of patterns that are not exhaustive, and
warns about completely overshadowed patterns.
To guarantee exhaustivity when the infinite types <CODE>Int</CODE> and <CODE>Str</CODE> are
used as argument types, the last pattern must be a "catch-all" variable
or wild card.
</P>
<P>
It follows from the definition of record pattern matching
that it can utilize partial records: the branch
</P>
<PRE>
{g = Fem} =&gt; t
</PRE>
<P>
in a table of type <CODE>{g : Gender ; n : Number} =&gt; T</CODE> means the same as
</P>
<PRE>
{g = Fem ; n = _} =&gt; t
</PRE>
<P>
Variables in regular expression patterns
are always bound to the <B>first match</B>, which is the first
in the sequence of binding lists. For example:
</P>
<UL>
<LI><CODE>x + "e" + y</CODE> matches <CODE>"peter"</CODE> with <CODE>x = "p", y = "ter"</CODE>
<LI><CODE>x + "er"*</CODE> matches <CODE>"burgerer"</CODE> with <CODE>x = "burg"</CODE>
</UL>
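<P>
The first-match rule matters as soon as the bound variables are used on the
right-hand side of a branch. The following sketch defines a hypothetical
operation <CODE>dropFirstE</CODE>, which is not part of any library, to show this.
</P>
<PRE>
oper dropFirstE : Str -&gt; Str = \s -&gt;   -- hypothetical operation
  case s of {
    x + "e" + y =&gt; x + y ;   -- x and y are bound by the first match
    _           =&gt; s         -- catch-all, needed because Str is infinite
    } ;
</PRE>
<P>
With the bindings given above for <CODE>"peter"</CODE>, the expression
<CODE>dropFirstE "peter"</CODE> would compute to <CODE>"pter"</CODE>: only the first
<I>e</I> is removed.
</P>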
<A NAME="toc74"></A>
<H3>Overloading</H3>
<P>
Judgements of the <CODE>oper</CODE> form can introduce overloaded functions.
The syntax is record-like, but all fields must have the same
name and different types.
</P>
<PRE>
oper mkN = overload {
mkN : (dog : Str) -&gt; Noun = regNoun ;
mkN : (mouse,mice : Str) -&gt; Noun = mkNoun ;
}
</PRE>
<P>
To give just the type of an overloaded operation, the record type
syntax is used.
</P>
<PRE>
oper mkN : overload {
mkN : (dog : Str) -&gt; Noun ; -- regular nouns
mkN : (mouse,mice : Str) -&gt; Noun ; -- irregular nouns
}
</PRE>
<P>
Overloading is not possible in other forms of judgement.
</P>
<A NAME="toc75"></A>
<H3>Local definitions</H3>
<P>
Local definitions ("<CODE>let</CODE> expressions") can appear in groups:
</P>
<PRE>
oper regNoun : Str -&gt; Noun = \vino -&gt;
let
vin : Str = init vino ;
o = last vino
in
...
</PRE>
<P>
The type can be omitted if it can be inferred. Later definitions may
refer to earlier ones.
</P>
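<P>
A complete operation using a group of local definitions could look like the
following sketch. The operation <CODE>pluralIta</CODE> is hypothetical and greatly
simplified: it treats every Italian noun that does not end in <I>a</I> as
forming its plural in <I>i</I>.
</P>
<PRE>
oper pluralIta : Str -&gt; Str = \vino -&gt;   -- hypothetical, simplified
  let
    vin : Str = init vino ;       -- the word without its last character
    o = last vino ;               -- type omitted, since it can be inferred
    ending : Str = case o of {    -- a later definition referring to an earlier one
      "a" =&gt; "e" ;                -- pizza, pizze
      _   =&gt; "i"                  -- vino, vini ; pesce, pesci
      }
  in
    vin + ending ;
</PRE>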
<A NAME="toc76"></A>
<H3>Supplementary constructs</H3>
<P>
The rest of the GF language constructs are presented for the sake
of completeness. They will not be used in the rest of this tutorial.
</P>
<H4>Record extension and subtyping</H4>
<P>
Record types and records can be <B>extended</B> with new fields. For instance,
in German it is natural to see transitive verbs as verbs with a case, which
is usually accusative or dative, and is passed to the object of the verb.
The symbol <CODE>**</CODE> is used for both record types and record objects.
</P>
<PRE>
lincat TV = Verb ** {c : Case} ;
lin Follow = regVerb "folgen" ** {c = Dative} ;
</PRE>
<P>
To extend a record type or a record with a field whose label it
already has is a type error. It is also an error to extend a type or
object that is not a record.
</P>
<P>
A record type <I>T</I> is a <B>subtype</B> of another one <I>R</I>, if <I>T</I> has
all the fields of <I>R</I> and possibly other fields. For instance,
an extension of a record type is always a subtype of it.
If <I>T</I> is a subtype of <I>R</I>, then <I>R</I> is a <B>supertype</B> of <I>T</I>.
</P>
<P>
If <I>T</I> is a subtype of <I>R</I>, an object of <I>T</I> can be used whenever
an object of <I>R</I> is required.
For instance, a transitive verb can be used whenever a verb is required.
</P>
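<P>
The following sketch shows subtyping at work. All names in it are
hypothetical, the two-valued <CODE>Case</CODE> type is a simplification, and the verb
is given by an explicit table instead of a paradigm.
</P>
<PRE>
param
  Number = Sg | Pl ;
  Case   = Accusative | Dative ;        -- simplified
oper
  Verb : Type = {s : Number =&gt; Str} ;
  TV   : Type = Verb ** {c : Case} ;    -- === {s : Number =&gt; Str ; c : Case}

  folgen_TV : TV =
    {s = table {Sg =&gt; "folgt" ; Pl =&gt; "folgen"}} ** {c = Dative} ;

  thirdSg : Verb -&gt; Str = \v -&gt; v.s ! Sg ;

  test : Str = thirdSg folgen_TV ;      -- a TV used where a Verb is required
</PRE>
<P>
The application <CODE>thirdSg folgen_TV</CODE> is accepted because <CODE>TV</CODE> is a
subtype of <CODE>Verb</CODE>; the extra field <CODE>c</CODE> is simply ignored by
<CODE>thirdSg</CODE>.
</P>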
<P>
<B>Covariance</B> means that a function returning a record <I>T</I> as value can
also be used to return a value of a supertype <I>R</I>.
<B>Contravariance</B> means that a function taking an <I>R</I> as argument
can also be applied to any object of a subtype <I>T</I>.
</P>
<H4>Tuples and product types</H4>
<P>
Product types and tuples are syntactic sugar for record types and records:
</P>
<PRE>
T1 * ... * Tn === {p1 : T1 ; ... ; pn : Tn}
&lt;t1, ..., tn&gt; === {p1 = t1 ; ... ; pn = tn}
</PRE>
<P>
Thus the labels <CODE>p1, p2,...</CODE> are hard-coded.
As patterns, tuples are translated to record patterns in the
same way as tuples to records; partial patterns make it
possible to write, slightly surprisingly,
</P>
<PRE>
case &lt;g,n,p&gt; of {
&lt;Fem&gt; =&gt; t
...
}
</PRE>
<P></P>
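<P>
A more common use of tuple patterns is to match on several parameters at once.
The operation <CODE>ending</CODE> below is a hypothetical example, assuming the
<CODE>Gender</CODE> and <CODE>Number</CODE> types of <CODE>FoodsIta</CODE>.
</P>
<PRE>
oper ending : Gender -&gt; Number -&gt; Str = \g,n -&gt;   -- hypothetical operation
  case &lt;g,n&gt; of {
    &lt;Masc,Sg&gt; =&gt; "o" ;
    &lt;Masc,Pl&gt; =&gt; "i" ;
    &lt;Fem,Sg&gt;  =&gt; "a" ;
    &lt;Fem,Pl&gt;  =&gt; "e"
    } ;
</PRE>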
<H4>Prefix-dependent choices</H4>
<P>
Sometimes a token has different forms depending on the token
that follows. An example is the English indefinite article,
which is <I>an</I> if a vowel follows, <I>a</I> otherwise.
Which form is chosen can only be decided at run time, i.e.
when a string is actually built. GF has a special construct for
such tokens, the <CODE>pre</CODE> construct exemplified in
</P>
<PRE>
oper artIndef : Str =
pre {"a" ; "an" / strs {"a" ; "e" ; "i" ; "o"}} ;
</PRE>
<P>
Thus
</P>
<PRE>
artIndef ++ "cheese" ---&gt; "a" ++ "cheese"
artIndef ++ "apple" ---&gt; "an" ++ "apple"
</PRE>
<P>
This very example does not work in all situations: the prefix
<I>u</I> has no general rules, and some problematic words are
<I>euphemism, one-eyed, n-gram</I>. Since the branches are matched in
order, it is possible to write
</P>
<PRE>
oper artIndef : Str =
pre {"a" ;
"a" / strs {"eu" ; "one"} ;
"an" / strs {"a" ; "e" ; "i" ; "o" ; "n-"}
} ;
</PRE>
<P>
Somewhat illogically, the default value is given as the first element in the list.
</P>
<P>
<I>Prefix-dependent choice may be deprecated in GF version 3.</I>
</P>
<A NAME="toc77"></A>
<H1>Using the resource grammar library</H1>
<P>
<a name="chapfive"></a>
</P>
<P>
In this chapter, we will take a look at the GF resource grammar library.
We will use the library to implement the <CODE>Foods</CODE> grammar of the
previous chapter
and port it to some new languages. Some new concepts of GF's module system
are also introduced, most notably the technique of <B>parametrized modules</B>,
which has become an important "design pattern" for multilingual grammars.
</P>
<A NAME="toc78"></A>
<H2>The coverage of the library</H2>
<P>
The GF Resource Grammar Library contains grammar rules for
10 languages. In addition, 2 languages are available as yet incomplete
implementations, and a few more are under construction. The purpose
of the library is to define the low-level morphological and syntactic
rules of languages, and thereby enable application programmers
to concentrate on the semantic and stylistic
aspects of their grammars. The guiding principle is that
<center>
grammar checking becomes type checking
</center>
that is, whatever is type-correct in the resource grammar is also
grammatically correct.
</P>
<P>
The intended level of application grammarians
is that of a skilled programmer with
a practical knowledge of the target languages, but without
theoretical knowledge about their grammars.
Such a combination of
skills is typical of programmers who, for instance, want to localize
language software to new languages.
</P>
<P>
The current resource languages are
</P>
<UL>
<LI><CODE>Ara</CODE>bic (incomplete)
<LI><CODE>Cat</CODE>alan (incomplete)
<LI><CODE>Dan</CODE>ish
<LI><CODE>Eng</CODE>lish
<LI><CODE>Fin</CODE>nish
<LI><CODE>Fre</CODE>nch
<LI><CODE>Ger</CODE>man
<LI><CODE>Ita</CODE>lian
<LI><CODE>Nor</CODE>wegian
<LI><CODE>Rus</CODE>sian
<LI><CODE>Spa</CODE>nish
<LI><CODE>Swe</CODE>dish
</UL>
<P>
The first three letters (<CODE>Eng</CODE> etc) are used in grammar module names.
We use the three-letter codes for languages from the ISO 639 standard.
</P>
<P>
The incomplete Arabic and Catalan implementations are
sufficient for use in some applications; they both contain, among other
things, complete inflectional morphology.
</P>
<A NAME="toc79"></A>
<H2>The structure of the library</H2>
<P>
<a name="seclexical"></a>
</P>
<A NAME="toc80"></A>
<H3>Lexical vs. phrasal rules</H3>
<P>
So far we have looked at grammars from a semantic point of view:
a grammar defines a system of meanings (specified in the abstract syntax) and
tells how they are expressed in some language (as specified in the concrete syntax).
In resource grammars, as often in the linguistic tradition, the goal is more modest:
to specify the <B>grammatically correct combinations of words</B>, whatever their
meanings are. With this more modest goal, it is possible to achieve a much
wider coverage than with semantic grammars.
</P>
<P>
Given the focus on <I>words</I> and their combinations,
the resource grammar has two kinds of categories and two kinds of rules:
</P>
<UL>
<LI>lexical:
<UL>
<LI>lexical categories, to classify words
<LI>lexical rules, to define words and their properties
</UL>
</UL>
<UL>
<LI>phrasal (combinatorial, syntactic):
<UL>
<LI>phrasal categories, to classify phrases of arbitrary size
<LI>phrasal rules, to combine phrases into larger phrases
</UL>
</UL>
<P>
Some grammar formalisms make a formal distinction between
the lexical and syntactic
components; sometimes it is necessary to use separate formalisms for these
two kinds of rules. GF has no such restrictions.
Nevertheless, it has turned out
to be a good discipline to maintain a distinction between
the lexical and syntactic components in the resource grammar. This also fits
well with what is needed in applications: while syntactic structures
are more or less the same across applications, vocabularies can be
very different.
</P>
<A NAME="toc81"></A>
<H3>Lexical categories</H3>
<P>
Within lexical categories, there is a further classification
into <B>closed</B> and <B>open</B> categories. The defining property
of closed categories is that the
words in them can easily be enumerated; it is very seldom that any
new words are introduced in them. In general, closed categories
contain <B>structural words</B>, also known as <B>function words</B>.
Examples of closed categories are
</P>
<PRE>
QuantSg ; -- singular quantifier e.g. "this"
QuantPl ; -- plural quantifier e.g. "those"
AdA ; -- adadjective e.g. "very"
</PRE>
<P>
We have already used words of all these categories in the <CODE>Food</CODE>
examples; they have just not been assigned a category, but
treated as <B>syncategorematic</B>. In GF, a syncategorematic
word is one that is introduced in a linearization rule of
some construction alongside some other expressions that
are combined; there is no abstract syntax tree for that word
alone. Thus in the rules
</P>
<PRE>
fun That : Kind -&gt; Item ;
lin That k = {s = "that" ++ k.s} ;
</PRE>
<P>
the word <I>that</I> is syncategorematic. In linguistically motivated
grammars, syncategorematic words are avoided, whereas in
semantically motivated grammars, structural words are typically treated
as syncategorematic. This is partly so because the function expressed
by a structural word in one language is often expressed by some other
means than an individual word in another. For instance, the definite
article <I>the</I> is a determiner word in English, whereas Swedish expresses
determination by inflecting the determined noun: <I>the wine</I> is <I>vinet</I>
in Swedish.
</P>
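<P>
To make the contrast concrete, the following sketch sets the syncategorematic
rule next to a rule that takes the same word from the resource library, using
the functions <CODE>mkNP</CODE> and <CODE>that_QuantSg</CODE> that are introduced
later in this chapter. The two rules are alternatives belonging to different
concrete syntaxes; they cannot appear together in one module.
</P>
<PRE>
  -- syncategorematic: "that" exists only as a string inside the rule
  lin That k = {s = "that" ++ k.s} ;

  -- categorematic: "that" is a constant of category QuantSg,
  -- combined with the kind by a syntactic function
  lin That k = mkNP that_QuantSg k ;
</PRE>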
<P>
As for open categories, we will start with these two:
</P>
<PRE>
N ; -- noun e.g. "pizza"
A ; -- adjective e.g. "good"
</PRE>
<P>
Later in this chapter we will also need verbs of different kinds.
</P>
<P>
<I>Note</I>. Having adadjectives as a closed category is not quite right, because
one can form adadjectives from adjectives: <I>incredibly warm</I>.
</P>
<A NAME="toc82"></A>
<H3>Lexical rules</H3>
<P>
The words of closed categories can be listed once and for all in a
library. In the first example, the <CODE>Foods</CODE> grammar of the previous section,
we will use the following structural words from the <CODE>Syntax</CODE> module:
</P>
<PRE>
this_QuantSg, that_QuantSg : QuantSg ;
these_QuantPl, those_QuantPl : QuantPl ;
very_AdA : AdA ;
</PRE>
<P>
The naming convention for lexical rules is that we use a word followed by
the category. In this way we can for instance distinguish the quantifier
<I>that</I> from the conjunction <I>that</I>.
</P>
<P>
Open lexical categories have no objects in <CODE>Syntax</CODE>. Such objects
will be built as they are needed in applications. The abstract
syntax of words in applications is already familiar, e.g.
</P>
<PRE>
fun Wine : Kind ;
</PRE>
<P>
The concrete syntax can be given directly, e.g.
</P>
<PRE>
lin Wine = mkN "wine" ;
</PRE>
<P>
by using the morphological paradigm library <CODE>ParadigmsEng</CODE>.
However, there are some advantages in giving the concrete syntax
indirectly, via the creation of a <B>resource lexicon</B>. In this lexicon,
there will be entries such as
</P>
<PRE>
oper wine_N : N = mkN "wine" ;
</PRE>
<P>
which can then be used in the linearization rules,
</P>
<PRE>
lin Wine = wine_N ;
</PRE>
<P>
One advantage of this indirect method is that each new word gives
an addition to a reusable resource lexicon, instead of just doing
the job of implementing the application. Another advantage will
be shown <a href="#secfunctor">here</a>: the possibility to write functors over
lexicon interfaces.
</P>
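<P>
A resource lexicon of this kind could be collected into a module of its own.
The following is only a sketch: the module name <CODE>FoodLexEng</CODE> is
hypothetical, and the path directive assumes the library layout used in the
English example later in this chapter.
</P>
<PRE>
  --# -path=.:present:prelude
  resource FoodLexEng = open SyntaxEng, ParadigmsEng in {
    oper
      wine_N  : N = mkN "wine" ;
      pizza_N : N = mkN "pizza" ;
      fish_N  : N = mkN "fish" "fish" ;  -- two-place paradigm: singular and plural given
      warm_A  : A = mkA "warm" ;
  }
</PRE>
<P>
An application grammar that opens such a module can then write
<CODE>lin Wine = wine_N</CODE> without mentioning strings or paradigms.
</P>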
<A NAME="toc83"></A>
<H3>Phrasal categories</H3>
<P>
There are just four phrasal categories needed in the first application:
</P>
<PRE>
Cl ; -- clause e.g. "this pizza is good"
NP ; -- noun phrase e.g. "this pizza"
CN ; -- common noun e.g. "warm pizza"
AP ; -- adjectival phrase e.g. "very warm"
</PRE>
<P>
Clauses are, roughly, the same as declarative sentences; we will
define a sentence <CODE>S</CODE> as a clause that has a fixed tense (see <a href="#secextended">here</a>).
The distinction between common nouns and noun phrases is that common nouns
cannot generally be used alone as subjects (?<I>dog sleeps</I>),
whereas noun phrases can (<I>the dog sleeps</I>).
Noun phrases can be built from common nouns by adding determiners,
such as quantifiers; but there are also other kinds of noun phrases, e.g.
pronouns.
</P>
<P>
The syntactic combinations we need are the following:
</P>
<PRE>
mkCl : NP -&gt; AP -&gt; Cl ; -- e.g. "this pizza is very warm"
mkNP : QuantSg -&gt; CN -&gt; NP ; -- e.g. "this pizza"
mkNP : QuantPl -&gt; CN -&gt; NP ; -- e.g. "these pizzas"
mkCN : AP -&gt; CN -&gt; CN ; -- e.g. "warm pizza"
mkAP : AdA -&gt; AP -&gt; AP ; -- e.g. "very warm"
</PRE>
<P>
To start building phrases, we need rules of <B>lexical insertion</B>, which
form phrases from single words:
</P>
<PRE>
mkCN : N -&gt; CN ;
mkAP : A -&gt; AP ;
</PRE>
<P>
Notice that all (or, as many as possible) operations in the resource library
have the name <CODE>mk</CODE><I>C</I>, where <I>C</I> is the value category of the operation.
This means of course heavy overloading. For instance, the current library
(version 1.2) has no fewer than 23 operations named <CODE>mkNP</CODE>!
</P>
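<P>
The overloading is resolved by the types of the arguments. As a small sketch
(the module name <CODE>OverloadSketch</CODE> and the constant names are
hypothetical), the same name <CODE>mkCN</CODE> is used below with two different
types from the list above:
</P>
<PRE>
  resource OverloadSketch = open SyntaxEng, ParadigmsEng in {
    oper
      -- mkCN : N -&gt; CN  (lexical insertion)
      pizza_CN : CN = mkCN (mkN "pizza") ;

      -- mkCN : AP -&gt; CN -&gt; CN  (modification), with mkAP : A -&gt; AP inside
      warmPizza_CN : CN = mkCN (mkAP (mkA "warm")) pizza_CN ;

      -- mkNP : QuantPl -&gt; CN -&gt; NP
      thesePizzas_NP : NP = mkNP these_QuantPl warmPizza_CN ;
  }
</PRE>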
<P>
Now the sentence
<center>
<I>these very warm pizzas are Italian</I>
</center>
can be built as follows:
</P>
<PRE>
mkCl
(mkNP these_QuantPl
(mkCN (mkAP very_AdA (mkAP warm_A)) (mkCN pizza_N)))
(mkAP italian_A)
</PRE>
<P>
The task we are facing now is to define the concrete syntax of <CODE>Foods</CODE> so that
this syntactic tree gives the value of linearizing the semantic tree
</P>
<PRE>
Is (These (QKind (Very Warm) Pizza)) Italian
</PRE>
<P></P>
<A NAME="toc84"></A>
<H2>The resource API</H2>
<P>
The resource library API is divided into language-specific
and language-independent parts. To put it roughly,
</P>
<UL>
<LI>the syntax API is language-independent, i.e. has the same types and
functions for all languages.
Its name is <CODE>Syntax</CODE><I>L</I> for each language <I>L</I>
<LI>the morphology API is language-specific, i.e. has partly
different types and functions
for different languages.
Its name is <CODE>Paradigms</CODE><I>L</I> for each language <I>L</I>
</UL>
<P>
A full documentation of the API is available on-line in the
<B>resource synopsis</B>.
For the examples of this chapter, we will only need a
fragment of the full API. The fragment needed for <CODE>Foods</CODE> has
already been introduced, but let us summarize the descriptions
by giving tables of the same form as used in the resource synopsis.
</P>
<P>
Thus we will make use of the following categories from the module <CODE>Syntax</CODE>.
</P>
<TABLE CELLPADDING="4" BORDER="1">
<TR>
<TH>Category</TH>
<TH>Explanation</TH>
<TH COLSPAN="2">Example</TH>
</TR>
<TR>
<TD><CODE>Cl</CODE></TD>
<TD>clause (sentence), with all tenses</TD>
<TD><I>she looks at this</I></TD>
</TR>
<TR>
<TD><CODE>AP</CODE></TD>
<TD>adjectival phrase</TD>
<TD><I>very warm</I></TD>
</TR>
<TR>
<TD><CODE>CN</CODE></TD>
<TD>common noun (without determiner)</TD>
<TD><I>red house</I></TD>
</TR>
<TR>
<TD><CODE>NP</CODE></TD>
<TD>noun phrase (subject or object)</TD>
<TD><I>the red house</I></TD>
</TR>
<TR>
<TD><CODE>AdA</CODE></TD>
<TD>adjective-modifying adverb</TD>
<TD><I>very</I></TD>
</TR>
<TR>
<TD><CODE>QuantSg</CODE></TD>
<TD>singular quantifier</TD>
<TD><I>this</I></TD>
</TR>
<TR>
<TD><CODE>QuantPl</CODE></TD>
<TD>plural quantifier</TD>
<TD><I>these</I></TD>
</TR>
<TR>
<TD><CODE>A</CODE></TD>
<TD>one-place adjective</TD>
<TD><I>warm</I></TD>
</TR>
<TR>
<TD><CODE>N</CODE></TD>
<TD>common noun</TD>
<TD><I>house</I></TD>
</TR>
</TABLE>
<P></P>
<P>
We will use the following syntax rules from <CODE>Syntax</CODE>.
</P>
<TABLE CELLPADDING="4" BORDER="1">
<TR>
<TH>Function</TH>
<TH>Type</TH>
<TH COLSPAN="2">Example</TH>
</TR>
<TR>
<TD><CODE>mkCl</CODE></TD>
<TD><CODE>NP -&gt; AP -&gt; Cl</CODE></TD>
<TD><I>John is very old</I></TD>
</TR>
<TR>
<TD><CODE>mkNP</CODE></TD>
<TD><CODE>QuantSg -&gt; CN -&gt; NP</CODE></TD>
<TD><I>this old man</I></TD>
</TR>
<TR>
<TD><CODE>mkNP</CODE></TD>
<TD><CODE>QuantPl -&gt; CN -&gt; NP</CODE></TD>
<TD><I>these old men</I></TD>
</TR>
<TR>
<TD><CODE>mkCN</CODE></TD>
<TD><CODE>N -&gt; CN</CODE></TD>
<TD><I>house</I></TD>
</TR>
<TR>
<TD><CODE>mkCN</CODE></TD>
<TD><CODE>AP -&gt; CN -&gt; CN</CODE></TD>
<TD><I>very big blue house</I></TD>
</TR>
<TR>
<TD><CODE>mkAP</CODE></TD>
<TD><CODE>A -&gt; AP</CODE></TD>
<TD><I>old</I></TD>
</TR>
<TR>
<TD><CODE>mkAP</CODE></TD>
<TD><CODE>AdA -&gt; AP -&gt; AP</CODE></TD>
<TD><I>very very old</I></TD>
</TR>
</TABLE>
<P></P>
<P>
We will use the following structural words from <CODE>Syntax</CODE>.
</P>
<TABLE CELLPADDING="4" BORDER="1">
<TR>
<TH>Function</TH>
<TH>Type</TH>
<TH COLSPAN="2">In English</TH>
</TR>
<TR>
<TD><CODE>this_QuantSg</CODE></TD>
<TD><CODE>QuantSg</CODE></TD>
<TD><I>this</I></TD>
</TR>
<TR>
<TD><CODE>that_QuantSg</CODE></TD>
<TD><CODE>QuantSg</CODE></TD>
<TD><I>that</I></TD>
</TR>
<TR>
<TD><CODE>these_QuantPl</CODE></TD>
<TD><CODE>QuantPl</CODE></TD>
<TD><I>these</I></TD>
</TR>
<TR>
<TD><CODE>those_QuantPl</CODE></TD>
<TD><CODE>QuantPl</CODE></TD>
<TD><I>those</I></TD>
</TR>
<TR>
<TD><CODE>very_AdA</CODE></TD>
<TD><CODE>AdA</CODE></TD>
<TD><I>very</I></TD>
</TR>
</TABLE>
<P></P>
<P>
For English, we will use the following part of <CODE>ParadigmsEng</CODE>.
</P>
<TABLE CELLPADDING="4" BORDER="1">
<TR>
<TH>Function</TH>
<TH COLSPAN="2">Type</TH>
</TR>
<TR>
<TD><CODE>mkN</CODE></TD>
<TD><CODE>(dog : Str) -&gt; N</CODE></TD>
</TR>
<TR>
<TD><CODE>mkN</CODE></TD>
<TD><CODE>(man,men : Str) -&gt; N</CODE></TD>
</TR>
<TR>
<TD><CODE>mkA</CODE></TD>
<TD><CODE>(cold : Str) -&gt; A</CODE></TD>
</TR>
</TABLE>
<P></P>
<P>
For Italian, we need just the following part of <CODE>ParadigmsIta</CODE>
(Exercise). The "smart" paradigms will take care of variations
such as <I>formaggio</I>-<I>formaggi</I>, and also infer the genders
correctly.
</P>
<TABLE CELLPADDING="4" BORDER="1">
<TR>
<TH>Function</TH>
<TH COLSPAN="2">Type</TH>
</TR>
<TR>
<TD><CODE>mkN</CODE></TD>
<TD><CODE>(vino : Str) -&gt; N</CODE></TD>
</TR>
<TR>
<TD><CODE>mkA</CODE></TD>
<TD><CODE>(caro : Str) -&gt; A</CODE></TD>
</TR>
</TABLE>
<P></P>
<P>
For German, we will use the following part of <CODE>ParadigmsGer</CODE>.
</P>
<TABLE CELLPADDING="4" BORDER="1">
<TR>
<TH>Function</TH>
<TH COLSPAN="2">Type</TH>
</TR>
<TR>
<TD><CODE>Gender</CODE></TD>
<TD><CODE>Type</CODE></TD>
</TR>
<TR>
<TD><CODE>masculine</CODE></TD>
<TD><CODE>Gender</CODE></TD>
</TR>
<TR>
<TD><CODE>feminine</CODE></TD>
<TD><CODE>Gender</CODE></TD>
</TR>
<TR>
<TD><CODE>neuter</CODE></TD>
<TD><CODE>Gender</CODE></TD>
</TR>
<TR>
<TD><CODE>mkN</CODE></TD>
<TD><CODE>(Stufe : Str) -&gt; N</CODE></TD>
</TR>
<TR>
<TD><CODE>mkN</CODE></TD>
<TD><CODE>(Bild,Bilder : Str) -&gt; Gender -&gt; N</CODE></TD>
</TR>
<TR>
<TD><CODE>mkA</CODE></TD>
<TD><CODE>(klein : Str) -&gt; A</CODE></TD>
</TR>
<TR>
<TD><CODE>mkA</CODE></TD>
<TD><CODE>(gut,besser,beste : Str) -&gt; A</CODE></TD>
</TR>
</TABLE>
<P></P>
<P>
For Finnish, we only need the smart regular paradigms:
</P>
<TABLE CELLPADDING="4" BORDER="1">
<TR>
<TH>Function</TH>
<TH COLSPAN="2">Type</TH>
</TR>
<TR>
<TD><CODE>mkN</CODE></TD>
<TD><CODE>(talo : Str) -&gt; N</CODE></TD>
</TR>
<TR>
<TD><CODE>mkA</CODE></TD>
<TD><CODE>(hieno : Str) -&gt; A</CODE></TD>
</TR>
</TABLE>
<P></P>
<P>
<B>Exercise</B>. Try out the morphological paradigms in different languages. Do
as follows:
</P>
<PRE>
&gt; i -path=alltenses:prelude -retain alltenses/ParadigmsGer.gfr
&gt; cc mkN "Farbe"
&gt; cc mkA "gut" "besser" "beste"
</PRE>
<P></P>
<A NAME="toc85"></A>
<H2>Example: English</H2>
<P>
<a name="secenglish"></a>
</P>
<P>
We work with the abstract syntax <CODE>Foods</CODE> from <a href="#chaptwo">the fourth chapter</a>, and
first build an English implementation. Now we can do it without
thinking about inflection and agreement, by just picking appropriate
functions from the resource grammar library.
</P>
<P>
The concrete syntax opens <CODE>SyntaxEng</CODE> and <CODE>ParadigmsEng</CODE>
to get access to the resource libraries needed. In order to find
the libraries, a <CODE>path</CODE> directive is prepended. It contains
two resource subdirectories --- <CODE>present</CODE> and <CODE>prelude</CODE> ---
which are found relative to the environment variable <CODE>GF_LIB_PATH</CODE>.
It also contains the current directory <CODE>.</CODE> and the directory <CODE>../foods</CODE>,
in which <CODE>Foods.gf</CODE> resides.
</P>
<PRE>
--# -path=.:../foods:present:prelude
concrete FoodsEng of Foods = open SyntaxEng,ParadigmsEng in {
</PRE>
<P>
As linearization types, we will use clauses for <CODE>Phrase</CODE>, noun phrases
for <CODE>Item</CODE>, common nouns for <CODE>Kind</CODE>, and adjectival phrases for <CODE>Quality</CODE>.
</P>
<PRE>
lincat
Phrase = Cl ;
Item = NP ;
Kind = CN ;
Quality = AP ;
</PRE>
<P>
These types fit perfectly with the way we have used the categories
in the application; hence
the combination rules we need almost write themselves:
</P>
<PRE>
lin
Is item quality = mkCl item quality ;
This kind = mkNP this_QuantSg kind ;
That kind = mkNP that_QuantSg kind ;
These kind = mkNP these_QuantPl kind ;
Those kind = mkNP those_QuantPl kind ;
QKind quality kind = mkCN quality kind ;
Very quality = mkAP very_AdA quality ;
</PRE>
<P>
We write the lexical part of the grammar by using resource paradigms directly.
Notice that we have to apply the lexical insertion rules to get type-correct
linearizations. Notice also that we need to use the two-place noun paradigm for
<I>fish</I>, but everything else is regular.
</P>
<PRE>
Wine = mkCN (mkN "wine") ;
Pizza = mkCN (mkN "pizza") ;
Cheese = mkCN (mkN "cheese") ;
Fish = mkCN (mkN "fish" "fish") ;
Fresh = mkAP (mkA "fresh") ;
Warm = mkAP (mkA "warm") ;
Italian = mkAP (mkA "Italian") ;
Expensive = mkAP (mkA "expensive") ;
Delicious = mkAP (mkA "delicious") ;
Boring = mkAP (mkA "boring") ;
}
</PRE>
<P></P>
<P>
<B>Exercise</B>. Compile the grammar <CODE>FoodsEng</CODE> and generate
and parse some sentences.
</P>
<P>
<B>Exercise</B>. Write a concrete syntax of <CODE>Foods</CODE> for Italian
or some other language included in the resource library. You can
compare the results with the hand-written
grammars presented earlier in this tutorial.
</P>
<A NAME="toc86"></A>
<H2>Functor implementation of multilingual grammars</H2>
<P>
<a name="secfunctor"></a>
</P>
<P>
If you did the exercise of writing a concrete syntax of <CODE>Foods</CODE> for some other
language, you probably noticed that much of the code looks exactly the same
as for English. The reason for this is that the <CODE>Syntax</CODE> API is the
same for all languages. This is in turn possible because
all languages (at least those in the resource package)
implement the same syntactic structures. Moreover, languages tend to use the
syntactic structures in similar ways, although there are exceptions.
But usually, it is only the lexical parts of a concrete syntax that
we need to write anew for a new language. Thus, to port a grammar to
a new language, you
</P>
<OL>
<LI>copy the concrete syntax of a given language
<LI>change the words (strings and inflection paradigms)
</OL>
<P>
Now, programming by copy-and-paste is not worthy of a functional programmer!
So, can we write a <I>function</I> that takes care of the shared parts of grammar modules?
Yes, we can. It is not a function in the <CODE>fun</CODE> or <CODE>oper</CODE> sense, but
a function operating on modules, called a <B>functor</B>. This construct
is familiar from the functional programming
languages ML and OCaml, but it does not
exist in Haskell. It also bears some resemblance to templates in C++.
Functors are also known as <B>parametrized modules</B>.
</P>
<P>
In GF, a functor is a module that <CODE>open</CODE>s one or more <B>interfaces</B>.
An <CODE>interface</CODE> is a module similar to a <CODE>resource</CODE>, but it only
contains the <I>types</I> of <CODE>oper</CODE>s, not their definitions. You can think
of an interface as a kind of a record type. The <CODE>oper</CODE> names are the
labels of this record type. The corresponding <I>record</I> is called an
<B>instance</B> of the interface.
Thus a functor is a module-level function taking instances as
arguments and producing modules as values.
</P>
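<P>
The record analogy can be spelled out with ordinary GF records. This is only
an illustration of the idea, not the module system itself; all names in the
sketch (<CODE>FunctorAnalogy</CODE>, <CODE>LexSig</CODE>, <CODE>lexEng</CODE>,
<CODE>menu</CODE>) are hypothetical.
</P>
<PRE>
  resource FunctorAnalogy = {
    oper
      -- the "interface": labels with their types, but no definitions
      LexSig : Type = {wine : Str ; fish : Str} ;

      -- an "instance": a record of that type
      lexEng : LexSig = {wine = "wine" ; fish = "fish"} ;

      -- the "functor": a function that works for any instance
      menu : LexSig -&gt; Str = \lex -&gt; lex.wine ++ "and" ++ lex.fish ;
  }
</PRE>
<P>
Applying <CODE>menu</CODE> to <CODE>lexEng</CODE> corresponds to instantiating
a functor with an instance; a record for another language could be passed to
the same function.
</P>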
<P>
Let us now write a functor implementation of the <CODE>Foods</CODE> grammar.
Consider its module header first:
</P>
<PRE>
incomplete concrete FoodsI of Foods = open Syntax, LexFoods in
</PRE>
<P>
A functor is distinguished from an ordinary module by the leading
keyword <CODE>incomplete</CODE>.
</P>
<P>
In the functor-function analogy, <CODE>FoodsI</CODE> would be presented as a function
with the following type signature:
</P>
<PRE>
FoodsI :
instance of Syntax -&gt; instance of LexFoods -&gt; concrete of Foods
</PRE>
<P>
It takes as arguments instances of two interfaces:
</P>
<UL>
<LI><CODE>Syntax</CODE>, the resource grammar interface
<LI><CODE>LexFoods</CODE>, the domain-specific lexicon interface
</UL>
<P>
Functors opening <CODE>Syntax</CODE> and a domain lexicon interface are in fact
so typical in GF applications that this structure could be called
a <B>design pattern</B>
for GF grammars. What makes this pattern so useful is, again, that
languages tend to use the same syntactic structures and only differ in words.
</P>
<P>
We will show the exact syntax of interfaces and instances in the next section.
Here it is enough to know that we have
</P>
<UL>
<LI><CODE>SyntaxGer</CODE>, an instance of <CODE>Syntax</CODE>
<LI><CODE>LexFoodsGer</CODE>, an instance of <CODE>LexFoods</CODE>
</UL>
<P>
Then we can complete the German implementation by "applying" the functor:
</P>
<PRE>
FoodsI SyntaxGer LexFoodsGer : concrete of Foods
</PRE>
<P>
The GF syntax for doing so is
</P>
<PRE>
concrete FoodsGer of Foods = FoodsI with
(Syntax = SyntaxGer),
(LexFoods = LexFoodsGer) ;
</PRE>
<P>
Notice that this is the <I>whole</I> module, not just a header of it.
The module body is received from <CODE>FoodsI</CODE>, by instantiating the
interface constants with their definitions given in the German
instances. A module of this form, characterized by the keyword <CODE>with</CODE>, is
called a <B>functor instantiation</B>.
</P>
<P>
Here is the complete code for the functor <CODE>FoodsI</CODE>:
</P>
<PRE>
--# -path=.:../foods:present:prelude
incomplete concrete FoodsI of Foods = open Syntax, LexFoods in {
lincat
Phrase = Cl ;
Item = NP ;
Kind = CN ;
Quality = AP ;
lin
Is item quality = mkCl item quality ;
This kind = mkNP this_QuantSg kind ;
That kind = mkNP that_QuantSg kind ;
These kind = mkNP these_QuantPl kind ;
Those kind = mkNP those_QuantPl kind ;
QKind quality kind = mkCN quality kind ;
Very quality = mkAP very_AdA quality ;
Wine = mkCN wine_N ;
Pizza = mkCN pizza_N ;
Cheese = mkCN cheese_N ;
Fish = mkCN fish_N ;
Fresh = mkAP fresh_A ;
Warm = mkAP warm_A ;
Italian = mkAP italian_A ;
Expensive = mkAP expensive_A ;
Delicious = mkAP delicious_A ;
Boring = mkAP boring_A ;
}
</PRE>
<P></P>
<A NAME="toc87"></A>
<H2>Interfaces and instances</H2>
<P>
<a name="secinterface"></a>
</P>
<P>
Let us now define the <CODE>LexFoods</CODE> interface:
</P>
<PRE>
interface LexFoods = open Syntax in {
oper
wine_N : N ;
pizza_N : N ;
cheese_N : N ;
fish_N : N ;
fresh_A : A ;
warm_A : A ;
italian_A : A ;
expensive_A : A ;
delicious_A : A ;
boring_A : A ;
}
</PRE>
<P>
In this interface, only lexical items are declared. In general, an
interface can declare any functions and also types. The <CODE>Syntax</CODE>
interface does so.
</P>
<P>
Here is a German instance of the interface.
</P>
<PRE>
instance LexFoodsGer of LexFoods = open SyntaxGer, ParadigmsGer in {
oper
wine_N = mkN "Wein" ;
pizza_N = mkN "Pizza" "Pizzen" feminine ;
cheese_N = mkN "Käse" "Käsen" masculine ;
fish_N = mkN "Fisch" ;
fresh_A = mkA "frisch" ;
warm_A = mkA "warm" "wärmer" "wärmste" ;
italian_A = mkA "italienisch" ;
expensive_A = mkA "teuer" ;
delicious_A = mkA "köstlich" ;
boring_A = mkA "langweilig" ;
}
</PRE>
<P>
Notice that when an interface opens another interface, such as <CODE>Syntax</CODE>
here, its instance has to open an instance of it. But the instance
may also open some other resources --- very typically, like here,
a domain lexicon instance opens a <CODE>Paradigms</CODE> module.
</P>
<P>
Just to complete the picture, we repeat the German functor instantiation
for <CODE>FoodsI</CODE>, this time with a path directive that makes it compilable.
</P>
<PRE>
--# -path=.:../foods:present:prelude
concrete FoodsGer of Foods = FoodsI with
(Syntax = SyntaxGer),
(LexFoods = LexFoodsGer) ;
</PRE>
<P></P>
<P>
<B>Exercise</B>. Compile and test <CODE>FoodsGer</CODE>.
</P>
<P>
<B>Exercise</B>. Refactor <CODE>FoodsEng</CODE> into a functor instantiation.
</P>
<A NAME="toc88"></A>
<H2>Adding languages to a functor implementation</H2>
<P>
Once we have an application grammar defined by using a functor,
adding a new language is simple. Just two modules need to be written:
</P>
<UL>
<LI>a domain lexicon instance
<LI>a functor instantiation
</UL>
<P>
The functor instantiation is completely mechanical to write.
Here is one for Finnish:
</P>
<PRE>
--# -path=.:../foods:present:prelude
concrete FoodsFin of Foods = FoodsI with
(Syntax = SyntaxFin),
(LexFoods = LexFoodsFin) ;
</PRE>
<P>
The domain lexicon instance requires some knowledge of the words of the
language: what words are used for which concepts, how the words are
inflected, plus features such as genders. Here is a lexicon instance for
Finnish:
</P>
<PRE>
instance LexFoodsFin of LexFoods = open SyntaxFin, ParadigmsFin in {
oper
wine_N = mkN "viini" ;
pizza_N = mkN "pizza" ;
cheese_N = mkN "juusto" ;
fish_N = mkN "kala" ;
fresh_A = mkA "tuore" ;
warm_A = mkA "lämmin" ;
italian_A = mkA "italialainen" ;
expensive_A = mkA "kallis" ;
delicious_A = mkA "herkullinen" ;
boring_A = mkA "tylsä" ;
}
</PRE>
<P></P>
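<P>
As a further illustration, here is a sketch of the two modules for Italian.
It assumes that the one-argument smart paradigms <CODE>mkN</CODE> and
<CODE>mkA</CODE> of <CODE>ParadigmsIta</CODE> listed in the API section give
the right inflection and gender for these words; the module names follow the
pattern of the German and Finnish modules and are our own choice.
</P>
<PRE>
  instance LexFoodsIta of LexFoods = open SyntaxIta, ParadigmsIta in {
    oper
      wine_N = mkN "vino" ;
      pizza_N = mkN "pizza" ;
      cheese_N = mkN "formaggio" ;
      fish_N = mkN "pesce" ;
      fresh_A = mkA "fresco" ;
      warm_A = mkA "caldo" ;
      italian_A = mkA "italiano" ;
      expensive_A = mkA "caro" ;
      delicious_A = mkA "delizioso" ;
      boring_A = mkA "noioso" ;
  }

  -- in a separate file FoodsIta.gf:
  --# -path=.:../foods:present:prelude
  concrete FoodsIta of Foods = FoodsI with
    (Syntax = SyntaxIta),
    (LexFoods = LexFoodsIta) ;
</PRE>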
<P>
<B>Exercise</B>. Instantiate the functor <CODE>FoodsI</CODE> to some language of
your choice.
</P>
<A NAME="toc89"></A>
<H2>Division of labour revisited</H2>
<P>
One purpose of the resource grammars was stated to be a division
of labour between linguists and application grammarians. We can now
reflect on what this means more precisely, by asking ourselves what
skills are required of grammarians working on different components.
</P>
<P>
Building a GF application starts from the abstract syntax. Writing
an abstract syntax requires
</P>
<UL>
<LI>understanding of the semantic structure of the application domain
<LI>knowledge of the GF fragment with categories and functions
</UL>
<P>
If the concrete syntax is written by using a functor, the programmer
has to decide what parts of the implementation are put in the interface
and what parts are shared in the functor. This requires
</P>
<UL>
<LI>knowing how the domain concepts are expressed in natural language
<LI>knowledge of the resource grammar library --- the categories and combinators
<LI>understanding what parts are likely to be expressed in language-dependent
ways, so that they are put in an interface and not in the functor
<LI>knowledge of the GF fragment with function applications and strings
</UL>
<P>
Instantiating a ready-made functor to a new language is less demanding.
It requires essentially
</P>
<UL>
<LI>knowing how the domain words are expressed in the language
<LI>knowing, roughly, how these words are inflected
<LI>knowledge of the paradigms available in the library
<LI>knowledge of the GF fragment with function applications and strings
</UL>
<P>
Notice that none of these tasks requires the use of GF records, tables,
or parameters. Thus only a small fragment of GF is needed; the rest of
GF is only relevant for those who write the libraries. Essentially,
all the machinery introduced in <a href="#chaptwo">the fourth chapter</a> is unnecessary!
</P>
<P>
Of course, grammar writing is not always just straightforward usage of libraries.
For example, GF can be used for other languages than just those in the
libraries --- for both natural and formal languages. A knowledge of records
and tables can, unfortunately, also be needed for understanding GF's error
messages.
</P>
<P>
<B>Exercise</B>. Design a small grammar that can be used for controlling
an MP3 player. The grammar should be able to recognize commands such
as <I>play this song</I>, with the following variations:
</P>
<UL>
<LI>verbs: <I>play</I>, <I>remove</I>
<LI>objects: <I>song</I>, <I>artist</I>
<LI>determiners: <I>this</I>, <I>the previous</I>
<LI>verbs without arguments: <I>stop</I>, <I>pause</I>
</UL>
<P>
The implementation goes in the following phases:
</P>
<OL>
<LI>abstract syntax
<LI>functor and lexicon interface
<LI>lexicon instance for the first language
<LI>functor instantiation for the first language
<LI>lexicon instance for the second language
<LI>functor instantiation for the second language
<LI>...
</OL>
<A NAME="toc90"></A>
<H2>Restricted inheritance</H2>
<P>
A functor implementation using the resource <CODE>Syntax</CODE> interface
works well as long as all concepts are expressed by using the same structures
in all languages. If this is not the case, the deviant linearization can
be made into a parameter and moved to the domain lexicon interface.
</P>
<P>
The <CODE>Foods</CODE> grammar works so well that we have to
take a contrived example: assume that English has
no word for <CODE>Pizza</CODE>, but has to use the paraphrase <I>Italian pie</I>.
This paraphrase is no longer a noun <CODE>N</CODE>, but a complex phrase
in the category <CODE>CN</CODE>. An obvious way to solve this problem is
to change interface <CODE>LexFoods</CODE> so that the constant declared for
<CODE>Pizza</CODE> gets a new type:
</P>
<PRE>
oper pizza_CN : CN ;
</PRE>
<P>
But this solution is unstable: we may end up changing the interface
and the functor with each new language, and we must every time also
change the interface instances for the old languages to maintain
type correctness.
</P>
<P>
A better solution is to use <B>restricted inheritance</B>: the English
instantiation inherits the functor implementation except for the
constant <CODE>Pizza</CODE>. This is written as follows:
</P>
<PRE>
--# -path=.:../foods:present:prelude
concrete FoodsEng of Foods = FoodsI - [Pizza] with
(Syntax = SyntaxEng),
(LexFoods = LexFoodsEng) **
open SyntaxEng, ParadigmsEng in {
lin Pizza = mkCN (mkA "Italian") (mkN "pie") ;
}
</PRE>
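<P>
The instantiation refers to an instance <CODE>LexFoodsEng</CODE>, which is not
spelled out in this chapter. A minimal sketch, reusing the words of the
<CODE>FoodsEng</CODE> grammar given earlier, could look as follows.
The constant <CODE>pizza_N</CODE> is kept for completeness with respect to the
interface, even though the overriding rule for <CODE>Pizza</CODE> does not use it.
</P>
<PRE>
  instance LexFoodsEng of LexFoods = open SyntaxEng, ParadigmsEng in {
    oper
      wine_N = mkN "wine" ;
      pizza_N = mkN "pizza" ;        -- declared in LexFoods, unused after - [Pizza]
      cheese_N = mkN "cheese" ;
      fish_N = mkN "fish" "fish" ;   -- two-place paradigm: plural same as singular
      fresh_A = mkA "fresh" ;
      warm_A = mkA "warm" ;
      italian_A = mkA "Italian" ;
      expensive_A = mkA "expensive" ;
      delicious_A = mkA "delicious" ;
      boring_A = mkA "boring" ;
  }
</PRE>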
<P>
Restricted inheritance is available for all inherited modules. One can for
instance exclude some mushrooms and pick up just some fruit in
the <CODE>Foodmarket</CODE> example:
</P>
<PRE>
abstract Foodmarket = Food, Fruit [Peach], Mushroom - [Agaric]
</PRE>
<P>
A concrete syntax of <CODE>Foodmarket</CODE> must then have the same inheritance
restrictions, in order to be well-typed with respect to the abstract syntax.
</P>
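<P>
For instance, an English concrete syntax could have a header of the following
shape. The module names <CODE>FoodEng</CODE>, <CODE>FruitEng</CODE>, and
<CODE>MushroomEng</CODE> are assumed here to be the English concrete syntaxes
of the three inherited abstract modules.
</P>
<PRE>
  concrete FoodmarketEng of Foodmarket =
    FoodEng, FruitEng [Peach], MushroomEng - [Agaric]
</PRE>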
<A NAME="toc91"></A>
<H2>Grammar reuse</H2>
<P>
The alert reader has certainly noticed an analogy between <CODE>abstract</CODE>
and <CODE>concrete</CODE>, on the one hand, and <CODE>interface</CODE> and <CODE>instance</CODE>,
on the other. Why are these two pairs of module types kept separate
at all? There is, in fact, a very close correspondence between
judgements in the two kinds of modules:
</P>
<PRE>
cat C &lt;---&gt; oper C : Type
fun f : A &lt;---&gt; oper f : A
lincat C = T &lt;---&gt; oper C : Type = T
lin f = t &lt;---&gt; oper f : A = t
</PRE>
<P>
But there are also some differences:
</P>
<UL>
<LI><CODE>abstract</CODE> and <CODE>concrete</CODE> modules define <B>top-level grammars</B>, i.e.
grammars that can be used for parsing and linearization; this is possible because
the types and terms in <CODE>concrete</CODE> modules are restricted to a subset
of those available in <CODE>interface</CODE>, <CODE>instance</CODE>, and <CODE>resource</CODE>
<LI><CODE>param</CODE> judgements have no counterparts in top-level grammars
</UL>
<P>
The term that can be used for interfaces, instances, and resources is
<B>resource-level grammars</B>.
From these explanations and the above translations it follows that top-level
grammars are, in a sense, a special case of resource-level grammars.
</P>
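<P>
The correspondence can be seen by writing the same content in both ways.
The following sketch uses hypothetical module names
(<CODE>Drinks</CODE>, <CODE>DrinksEng</CODE>, <CODE>DrinksSig</CODE>,
<CODE>DrinksSigEng</CODE>); each module would go into a file of its own.
</P>
<PRE>
  -- top-level pair
  abstract Drinks = {
    cat Kind ;
    fun Wine : Kind ;
  }

  concrete DrinksEng of Drinks = {
    lincat Kind = {s : Str} ;
    lin Wine = {s = "wine"} ;
  }

  -- resource-level pair with the same content
  interface DrinksSig = {
    oper Kind : Type ;
    oper Wine : Kind ;
  }

  instance DrinksSigEng of DrinksSig = {
    oper Kind : Type = {s : Str} ;
    oper Wine : Kind = {s = "wine"} ;
  }
</PRE>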
<P>
Thus, indeed, abstract syntax modules can be used like interfaces, and concrete syntaxes
as their instances. The use of top-level grammars as resources
is called <B>grammar reuse</B>. Whether a library module is a top-level or a
resource-level module is mostly invisible to application programmers
(see the Summary <a href="#seclock">here</a>
for an exception to this). The GF resource grammar
library itself is in fact built in two layers:
</P>
<UL>
<LI>the <B>ground resource</B>: a set of top-level grammars for syntactic structures
<LI>the <B>surface resource</B>: a resource-level grammar with overloaded operations
defined in terms of the ground resource
</UL>
<P>
Both the ground
resource and the surface resource can be used by application programmers,
but it is the surface resource that we use in this book. Because of overloading,
it has far fewer function names and also flatter trees. For instance, the clause
<center>
<I>these very warm pizzas are Italian</I>
</center>
which in the surface resource can be built as
</P>
<PRE>
mkCl
(mkNP these_QuantPl
(mkCN (mkAP very_AdA (mkAP warm_A)) (mkCN pizza_N)))
(mkAP italian_A)
</PRE>
<P>
has in the ground resource the much more complex tree
</P>
<PRE>
PredVP
(DetCN (DetPl (PlQuant this_Quant) NoNum NoOrd)
(AdjCN (AdAP very_AdA (PositA warm_A)) (UseN pizza_N)))
(UseComp (CompAP (PositA italian_A)))
</PRE>
<P>
The main advantage of using the ground resource is that the trees can then be found
by using the parser, as shown in the next section. Otherwise, the overloaded surface
resource constants are much easier to use.
</P>
<P>
Needless to say, once a library has been defined in some way, it is easy to
build layers of <B>derived libraries</B> on top of it, by using grammar reuse
and, in the case of multilingual libraries, functors. This is indeed how
the surface resource has been implemented: as a functor parametrized on
the abstract syntax of the ground resource.
</P>
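<P>
A fragment of such a derived library might look like the following sketch.
It is not the library's actual source: the module name
<CODE>SurfaceSketch</CODE> is hypothetical, and we assume that the abstract
syntax of the ground resource is available under the name <CODE>Grammar</CODE>.
The ground functions used are those appearing in the ground resource tree
shown in this section.
</P>
<PRE>
  incomplete resource SurfaceSketch = open Grammar in {
    oper
      mkCN = overload {
        mkCN : N -&gt; CN        = UseN ;    -- lexical insertion
        mkCN : AP -&gt; CN -&gt; CN = AdjCN ;   -- adjectival modification
      } ;
      mkAP = overload {
        mkAP : A -&gt; AP         = PositA ;  -- lexical insertion
        mkAP : AdA -&gt; AP -&gt; AP = AdAP ;    -- modification by an adadjective
      } ;
  }
</PRE>
<P>
Because the module only opens the abstract syntax, it can be instantiated
for each language by supplying the corresponding concrete syntax.
</P>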
<A NAME="toc92"></A>
<H2>Browsing the resource with GF commands</H2>
<P>
<a name="secbrowsing"></a>
</P>
<P>
In addition to reading the
<A HREF="../../lib/resource-1.0/synopsis.html">resource synopsis</A>, you
can find resource function combinations by using the parser. This
is so because the resource library is in the end implemented as
a top-level <CODE>abstract-concrete</CODE> grammar, on which parsing
and linearization work.
</P>
<P>
Unfortunately, currently (GF 2.8)
only English and the Scandinavian languages can be
parsed within acceptable computer resource limits when the full
resource is used.
</P>
<P>
To look for a syntax tree in the overload API by parsing, proceed as follows:
</P>
<PRE>
% gf -path=alltenses:prelude $GF_LIB_PATH/alltenses/OverLangEng.gfc
&gt; p -cat=S -overload "this grammar is too big"
mkS (mkCl (mkNP this_QuantSg grammar_N) (mkAP too_AdA big_A))
</PRE>
<P>
The <CODE>-overload</CODE> option given to the parser is a directive to find the
shallowest overloaded term that matches the parse tree.
</P>
<P>
To view linearizations in all languages by parsing from English:
</P>
<PRE>
% gf $GF_LIB_PATH/alltenses/langs.gfcm
&gt; p -cat=S -lang=LangEng "this grammar is too big" | tb
UseCl TPres ASimul PPos (PredVP (DetCN (DetSg (SgQuant this_Quant)
NoOrd) (UseN grammar_N)) (UseComp (CompAP (AdAP too_AdA (PositA big_A)))))
Den h&auml;r grammatiken &auml;r f&ouml;r stor
Esta gram&aacute;tica es demasiado grande
(Cyrillic: eta grammatika govorit des'at' jazykov)
Denne grammatikken er for stor
Questa grammatica &egrave; troppo grande
Diese Grammatik ist zu gro&szlig;
Cette grammaire est trop grande
T&auml;m&auml; kielioppi on liian suuri
This grammar is too big
Denne grammatik er for stor
</PRE>
<P>
This method shows the unambiguous ground resource functions and not
the overloaded ones. It uses a precompiled grammar package of the GFCM or GFCC
format; see <a href="#chapeight">the eighth chapter</a> for more information on this.
</P>
<P>
Unfortunately, the Russian grammar uses at the moment a different
character encoding than the rest and is therefore not displayed correctly
in a terminal window. However, the GF syntax editor does display all
examples correctly --- again, using the ground resource:
</P>
<PRE>
% gfeditor $GF_LIB_PATH/alltenses/langs.gfcm
</PRE>
<P>
When you have constructed the tree, you will see the following screen:
</P>
<P>
<center>
</P>
<P>
<IMG ALIGN="right" SRC="10lang-small.png" BORDER="0" ALT="">
</P>
<P>
</center>
</P>
<P>
<B>Exercise</B>. Find the resource grammar translations for the following
English phrases (parse in the category <CODE>Phr</CODE>). You can first try to
build the terms manually.
</P>
<P>
<I>every man loves a woman</I>
</P>
<P>
<I>this grammar speaks more than ten languages</I>
</P>
<P>
<I>which languages aren't in the grammar</I>
</P>
<P>
<I>which languages did you want to speak</I>
</P>
<A NAME="toc93"></A>
<H2>An extended Foods grammar</H2>
<P>
<a name="secextended"></a>
</P>
<P>
Now that we know how to find information in the resource grammar,
we can easily extend the <CODE>Foods</CODE> fragment considerably. We shall enable
the following new expressions:
</P>
<UL>
<LI>questions: <I>Is this pizza Italian?</I> <I>Which pizza do you want to eat?</I>
<LI>imperatives: <I>Eat that pizza please!</I>
<LI>denials: <I>These pizzas are not Italian.</I>
<LI>verbs: <I>eat</I>, <I>pay</I>
<LI>guests, in addition to food items: <I>I, you, this lady</I>
</UL>
<A NAME="toc94"></A>
<H3>Abstract syntax</H3>
<P>
Since we don't want to change the already existing <CODE>Foods</CODE> module,
we build an extension of it, <CODE>ExtFoods</CODE>:
</P>
<PRE>
abstract ExtFoods = Foods ** {
flags startcat=Move ;
cat
Move ; -- dialogue move: declarative, question, or imperative
Verb ; -- transitive verb
Guest ; -- guest in restaurant
GuestKind ; -- type of guest
fun
MAssert : Phrase -&gt; Move ; -- This pizza is warm.
MDeny : Phrase -&gt; Move ; -- This pizza isn't warm.
MAsk : Phrase -&gt; Move ; -- Is this pizza warm?
PVerb : Guest -&gt; Verb -&gt; Item -&gt; Phrase ; -- we eat this pizza
PVerbWant : Guest -&gt; Verb -&gt; Item -&gt; Phrase ; -- we want to eat this pizza
WhichVerb :
Kind -&gt; Guest -&gt; Verb -&gt; Move ; -- Which pizza do you eat?
WhichVerbWant :
Kind -&gt; Guest -&gt; Verb -&gt; Move ; -- Which pizza do you want to eat?
WhichIs : Kind -&gt; Quality -&gt; Move ; -- Which wine is Italian?
Do : Verb -&gt; Item -&gt; Move ; -- Pay this wine!
DoPlease : Verb -&gt; Item -&gt; Move ; -- Pay this wine please!
I, You, We : Guest ;
GThis, GThat, GThese, GThose : GuestKind -&gt; Guest ;
Eat, Drink, Pay : Verb ;
Lady, Gentleman : GuestKind ;
}
</PRE>
<P>
The concrete syntax is implemented by a functor that extends the
already defined functor <CODE>FoodsI</CODE>.
</P>
<PRE>
incomplete concrete ExtFoodsI of ExtFoods =
FoodsI ** open Syntax, LexFoods in {
flags lexer=text ; unlexer=text ;
</PRE>
<P>
The flags set up a lexer and unlexer that can deal with sentence-initial
capital letters and proper spacing with punctuation (see <a href="#seclexing">here</a>
for more information on lexers and unlexers).
</P>
<A NAME="toc95"></A>
<H3>Linearization types</H3>
<P>
If we look at the resource documentation, we find several categories
that are above the clause level and can thus host different kinds
of dialogue moves:
</P>
<TABLE ALIGN="center" CELLPADDING="4" BORDER="1">
<TR>
<TH>Category</TH>
<TH>Explanation</TH>
<TH>Example</TH>
</TR>
<TR>
<TD><CODE>Text</CODE></TD>
<TD>text consisting of phrases</TD>
<TD><I>He is here. Why?</I></TD>
</TR>
<TR>
<TD><CODE>Phr</CODE></TD>
<TD>phrase in a text</TD>
<TD><I>but be quiet please</I></TD>
</TR>
<TR>
<TD><CODE>Utt</CODE></TD>
<TD>sentence, question, word...</TD>
<TD><I>be quiet</I></TD>
</TR>
<TR>
<TD><CODE>S</CODE></TD>
<TD>declarative sentence</TD>
<TD><I>she lived here</I></TD>
</TR>
<TR>
<TD><CODE>QS</CODE></TD>
<TD>question</TD>
<TD><I>where did she live</I></TD>
</TR>
<TR>
<TD><CODE>Imp</CODE></TD>
<TD>imperative</TD>
<TD><I>look at this</I></TD>
</TR>
<TR>
<TD><CODE>QCl</CODE></TD>
<TD>question clause, with all tenses</TD>
<TD><I>why does she walk</I></TD>
</TR>
</TABLE>
<P></P>
<P>
We also find that only the category <CODE>Text</CODE> contains punctuation marks.
So we choose this as the linearization type of <CODE>Move</CODE>. The other types
are quite obvious.
</P>
<PRE>
lincat
Move = Text ;
Verb = V2 ;
Guest = NP ;
GuestKind = CN ;
</PRE>
<P>
The category <CODE>V2</CODE> of <B>two-place verbs</B> includes both
<B>transitive verbs</B> that take <B>direct objects</B> (e.g. <I>we watch him</I>)
and verbs that take other kinds of <B>complements</B>, often with
prepositions (<I>we look at him</I>). In a multilingual grammar, it is
not guaranteed that transitive verbs are transitive in all languages,
so the more general notion of two-place verb is more appropriate.
</P>
<A NAME="toc96"></A>
<H3>Linearization rules</H3>
<P>
Now we need to find constructors that combine the new categories in
appropriate ways. To form a text from a clause, we first make it into
a sentence with <CODE>mkS</CODE>, and then apply <CODE>mkText</CODE>:
</P>
<PRE>
lin MAssert p = mkText (mkS p) ;
</PRE>
<P>
The function <CODE>mkS</CODE> has in the resource synopsis been given the type
</P>
<PRE>
mkS : (Tense) -&gt; (Ant) -&gt; (Pol) -&gt; Cl -&gt; S
</PRE>
<P>
Parentheses around type names make no difference to the GF compiler,
but in the synopsis notation they indicate <B>optionality</B>: any of the
optional arguments can be omitted, because an instance of <CODE>mkS</CODE>
is available for each combination. For each omitted argument, the instance
uses the <B>default value</B> of that type, which for the <B>polarity</B>
<CODE>Pol</CODE> is positive, i.e. unnegated.
To build a negative sentence, we use an explicit polarity constructor:
</P>
<PRE>
MDeny p = mkText (mkS negativePol p) ;
</PRE>
<P>
Of course, we could have used <CODE>positivePol</CODE> in the first rule, instead of
relying on the default. (The types <CODE>Tense</CODE> and <CODE>Ant</CODE> will be explained
<a href="#sectense">here</a>.)
</P>
<P>
Phrases can be made into <B>question sentences</B>, which in turn can be
made into texts in a similar way as sentences; the default
punctuation mark is not the full stop but the question mark.
</P>
<PRE>
MAsk p = mkText (mkQS p) ;
</PRE>
<P>
There is an <CODE>mkCl</CODE> instance that directly builds a clause from a noun phrase,
a two-place verb, and another noun phrase.
</P>
<PRE>
PVerb = mkCl ;
</PRE>
<P>
The auxiliary verb <I>want</I> requires a <B>verb phrase</B> (<CODE>VP</CODE>) as its complement. It
can be built from a two-place verb and its noun phrase complement.
</P>
<PRE>
PVerbWant guest verb item = mkCl guest want_VV (mkVP verb item) ;
</PRE>
<P>
The <B>interrogative determiner</B> (<CODE>IDet</CODE>) <I>which</I> can be combined with
a common noun to form an <B>interrogative phrase</B> (<CODE>IP</CODE>). This <CODE>IP</CODE> can then
be used as a subject in a <B>question clause</B> (<CODE>QCl</CODE>), which in turn is
made into a <CODE>QS</CODE> and finally into a <CODE>Text</CODE>.
</P>
<PRE>
WhichIs kind quality =
mkText (mkQS (mkQCl (mkIP whichSg_IDet kind) (mkVP quality))) ;
</PRE>
<P>
When interrogative phrases are used as <I>objects</I>, the resource library
uses a category named <CODE>Slash</CODE> of
objectless sentences. The name comes from the <B>slash categories</B> of the
GPSG grammar formalism
(Gazdar &amp; al. 1985). Slashes can be formed from subjects and two-place verbs,
also with an intervening auxiliary verb.
</P>
<PRE>
WhichVerb kind guest verb =
mkText (mkQS (mkQCl (mkIP whichSg_IDet kind)
(mkSlash guest verb))) ;
WhichVerbWant kind guest verb =
mkText (mkQS (mkQCl (mkIP whichSg_IDet kind)
(mkSlash guest want_VV verb))) ;
</PRE>
<P>
Finally, we form the <B>imperative</B> (<CODE>Imp</CODE>) of a transitive verb
and its object. We make it into a <B>polite</B> form utterance, and finally
into a <CODE>Text</CODE> with an exclamation mark.
</P>
<PRE>
Do verb item =
mkText
(mkPhr (mkUtt politeImpForm (mkImp verb item))) exclMarkPunct ;
DoPlease verb item =
mkText
(mkPhr (mkUtt politeImpForm (mkImp verb item)) please_Voc)
exclMarkPunct ;
</PRE>
<P>
The rest of the concrete syntax is straightforward use of structural words,
</P>
<PRE>
I = mkNP i_Pron ;
You = mkNP youPol_Pron ;
We = mkNP we_Pron ;
GThis = mkNP this_QuantSg ;
GThat = mkNP that_QuantSg ;
GThese = mkNP these_QuantPl ;
GThose = mkNP those_QuantPl ;
</PRE>
<P>
and of the food lexicon,
</P>
<PRE>
Eat = eat_V2 ;
Drink = drink_V2 ;
Pay = pay_V2 ;
Lady = mkCN lady_N ;
Gentleman = mkCN gentleman_N ;
}
</PRE>
<P>
Notice that we have no reason to build an extension of <CODE>LexFoods</CODE>, but we just
add words to the old one. Since <CODE>LexFoods</CODE> instances are resource modules,
the superfluous definitions that they contain have no effect on the
modules that just <CODE>open</CODE> them, and thus the smaller <CODE>Foods</CODE> grammars
don't suffer from the additions we make.
</P>
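<P>
As a minimal sketch of what adding words to the old lexicon might look like,
the fragments below show new entries for the interface <CODE>LexFoods</CODE> and for
an English instance. The paradigm calls (<CODE>mkV</CODE>, <CODE>mkV2</CODE>, and <CODE>mkN</CODE>,
assumed to come from <CODE>ParadigmsEng</CODE>) are illustrative assumptions; the
entries already present for <CODE>Foods</CODE> stay as they are.
</P>
<PRE>
  -- in the interface LexFoods (new entries only)
  oper
    eat_V2 : V2 ;
    drink_V2 : V2 ;
    pay_V2 : V2 ;
    lady_N : N ;
    gentleman_N : N ;

  -- in the instance LexFoodsEng (new entries only; paradigms assumed)
  oper
    eat_V2 = mkV2 (mkV "eat" "ate" "eaten") ;
    drink_V2 = mkV2 (mkV "drink" "drank" "drunk") ;
    pay_V2 = mkV2 (mkV "pay" "paid" "paid") ;
    lady_N = mkN "lady" ;
    gentleman_N = mkN "gentleman" "gentlemen" ;
</PRE>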
<P>
<B>Exercise</B>. Port the <CODE>ExtFoods</CODE> grammars to some new languages, building
on the <CODE>Foods</CODE> implementations from previous sections, and using the functor
defined in this section.
</P>
<A NAME="toc97"></A>
<H2>Tenses</H2>
<P>
<a name="sectense"></a>
</P>
<P>
When compiling the <CODE>ExtFoods</CODE> grammars, we have used the path
</P>
<PRE>
--# -path=.:../foods:present:prelude
</PRE>
<P>
where the library subdirectory <CODE>present</CODE> refers to a restricted version
of the resource that covers only the present tense of verbs and sentences.
Having this version available is motivated by efficiency: tenses
produce in many languages a manifold of forms and combinations, which
multiply the size of the grammar; at the same time, many applications,
both technical ones and spoken dialogues, only need the present tense.
</P>
<P>
But it is easy to change the grammars so that they admit the full set
of tenses. It is enough to change the path to
</P>
<PRE>
--# -path=.:../foods:alltenses:prelude
</PRE>
<P>
and recompile the grammars from source (flag <CODE>-src</CODE>); the libraries are
not recompiled, because their sources cannot be found on the path list.
Then it is possible to see all the tenses of
phrases, by using the <CODE>-all</CODE> flag in linearization:
</P>
<PRE>
&gt; gr -cat=Phrase | l -all
This wine is delicious
Is this wine delicious
This wine isn't delicious
Isn't this wine delicious
This wine is not delicious
Is this wine not delicious
This wine has been delicious
Has this wine been delicious
This wine hasn't been delicious
Hasn't this wine been delicious
This wine has not been delicious
Has this wine not been delicious
This wine was delicious
Was this wine delicious
This wine wasn't delicious
Wasn't this wine delicious
This wine was not delicious
Was this wine not delicious
This wine had been delicious
Had this wine been delicious
This wine hadn't been delicious
Hadn't this wine been delicious
This wine had not been delicious
Had this wine not been delicious
This wine will be delicious
Will this wine be delicious
This wine won't be delicious
Won't this wine be delicious
This wine will not be delicious
Will this wine not be delicious
This wine will have been delicious
Will this wine have been delicious
This wine won't have been delicious
Won't this wine have been delicious
This wine will not have been delicious
Will this wine not have been delicious
This wine would be delicious
Would this wine be delicious
This wine wouldn't be delicious
Wouldn't this wine be delicious
This wine would not be delicious
Would this wine not be delicious
This wine would have been delicious
Would this wine have been delicious
This wine wouldn't have been delicious
Wouldn't this wine have been delicious
This wine would not have been delicious
Would this wine not have been delicious
</PRE>
<P>
In addition to tenses, the linearization writes all parametric
variations --- polarity and word order (direct vs. inverted) --- as
well as the variation between contracted and full negation words.
Of course, the list is even longer in languages that have more
tenses and moods, e.g. the Romance languages.
</P>
<P>
In the <CODE>ExtFoods</CODE> grammar, tenses never find their way to the
top level of <CODE>Move</CODE>s. Therefore it is useless to carry around
the clause and verb tenses given in the <CODE>alltenses</CODE> set of libraries.
But with the library, it is easy to add tenses to <CODE>Move</CODE>s. For
instance, one can add the rules
</P>
<PRE>
fun MAssertFut : Phrase -&gt; Move ; -- I will pay this wine
fun MAssertPastPerf : Phrase -&gt; Move ; -- I had paid that wine
lin MAssertFut p = mkText (mkS futureTense p) ;
lin MAssertPastPerf p = mkText (mkS pastTense anteriorAnt p) ;
</PRE>
<P>
Comparison with <CODE>MAssert</CODE> above shows that omitted tense
and anteriority arguments default to the present tense and simultaneous anteriority.
</P>
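<P>
To make the defaults concrete, here is a hedged sketch of <CODE>MAssert</CODE> and
<CODE>MDeny</CODE> with every optional argument of <CODE>mkS</CODE> written out. The constant
names <CODE>presentTense</CODE> and <CODE>simultaneousAnt</CODE> are assumed by analogy with
<CODE>futureTense</CODE> and <CODE>anteriorAnt</CODE>; the resource synopsis is the place to
check the names used in a given library version.
</P>
<PRE>
  -- intended to be equivalent to the shorter rules given earlier
  lin MAssert p = mkText (mkS presentTense simultaneousAnt positivePol p) ;
  lin MDeny p = mkText (mkS presentTense simultaneousAnt negativePol p) ;
</PRE>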
<P>
<B>Exercise</B>. Measure the size of the context-free grammar corresponding to
some concrete syntax of <CODE>ExtFoods</CODE> with all tenses.
You can do this by printing the grammar in the context-free format
(<CODE>print_grammar -printer=cfg</CODE>) and counting the lines.
</P>
<A NAME="toc98"></A>
<H2>Summary of GF language features</H2>
<A NAME="toc99"></A>
<H3>Interfaces and instances</H3>
<P>
An <B>interface module</B> (<CODE>interface</CODE> <I>I</I>) is like a <CODE>resource</CODE> module,
the difference being that it does not need to give definitions in
its <CODE>oper</CODE> and <CODE>param</CODE> judgements. Definitions are, however,
allowed, and they may use constants that appear undefined in the
module. For example, here is an interface for predication, which
is parametrized on NP case and agreement features, and on the constituent
order:
</P>
<PRE>
interface Predication = {
param
Case ;
Agreement ;
oper
subject : Case ;
object : Case ;
order : (verb,subj,obj : Str) -&gt; Str ;
NP : Type = {s : Case =&gt; Str ; a : Agreement} ;
TV : Type = {s : Agreement =&gt; Str} ;
-- select the verb form by the subject's agreement, and the NP forms by case
sentence : TV -&gt; NP -&gt; NP -&gt; {s : Str} = \verb,subj,obj -&gt; {
s = order (verb.s ! subj.a) (subj.s ! subject) (obj.s ! object)
} ;
}
</PRE>
<P>
An <B>instance module</B> (<CODE>instance</CODE> <I>J</I> <CODE>of</CODE> <I>I</I>) is also like a
<CODE>resource</CODE>, but it is compiled in union with the interface that it
is an instance <CODE>of</CODE>. This means that the definitions given in the
instance are type-checked with respect to the types given in the
interface. Moreover, overriding types or definitions given in the interface
is not allowed. But it is legal for an instance to contain definitions
not included in the corresponding interface. Here is an instance of
<CODE>Predication</CODE>, suitable for languages like English.
</P>
<PRE>
instance PredicationSimpleSVO of Predication = {
param
Case = Nom | Acc | Gen ;
Agreement = Agr Number Person ;
-- two new types
Number = Sg | Pl ;
Person = P1 | P2 | P3 ;
oper
subject = Nom ;
object = Acc ;
order = \verb,subj,obj -&gt; subj ++ verb ++ obj ;
-- the rest of the definitions don't need repetition
}
</PRE>
<P></P>
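<P>
The same interface can be instantiated for a different constituent order. The
following sketch is a hypothetical instance for verb-final languages (the name
<CODE>PredicationSimpleSOV</CODE> is made up for this illustration); only the case
inventory and the <CODE>order</CODE> operation differ from the SVO instance.
</P>
<PRE>
  instance PredicationSimpleSOV of Predication = {
    param
      Case = Nom | Acc ;
      Agreement = Agr Number Person ;
      Number = Sg | Pl ;
      Person = P1 | P2 | P3 ;
    oper
      subject = Nom ;
      object = Acc ;
      -- the verb comes last
      order = \verb,subj,obj -&gt; subj ++ obj ++ verb ;
  }
</PRE>
<P>
Because <CODE>sentence</CODE> is defined in the interface in terms of <CODE>order</CODE>, it
does not have to be touched when the word order changes.
</P>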
<A NAME="toc100"></A>
<H3>Grammar reuse</H3>
<P>
<a name="seclock"></a>
</P>
<P>
Abstract syntax modules can be used like interfaces, and concrete syntaxes
as their instances. The following translations then take place:
</P>
<PRE>
cat C ---&gt; oper C : Type
fun f : A ---&gt; oper f : A*
lincat C = T ---&gt; oper C : Type = T'
lin f = t ---&gt; oper f : A* = t'
</PRE>
<P>
This translation is called <B>grammar reuse</B>. It uses a homomorphism
from abstract types and terms to the concrete types and terms. For the
sake of more type safety, the types are not exactly the same. Currently
(GF 2.8), the type <I>T'</I> formed from the linearization type <I>T</I> of
a category <I>C</I> is <I>T</I> extended with a dummy <B>lock field</B>. Thus
</P>
<PRE>
lincat C = T ---&gt; oper C = T ** {lock_C : {}}
</PRE>
<P>
and the linearization terms are lifted correspondingly. The user of
a GF library should never see any lock fields; when they appear in
the compiler's warnings, they indicate that some library category is
constructed improperly by a user program.
</P>
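<P>
As an illustration of the scheme, consider a category <CODE>Kind</CODE> with a made-up
linearization type. The judgements to the left of each arrow are what the
grammar writer gives; the ones to the right are a sketch of what a module
reusing the grammar sees under the lock field convention (<CODE>&lt;&gt;</CODE> denotes
the empty record).
</P>
<PRE>
  lincat Kind = {s : Str}   ---&gt;  oper Kind : Type = {s : Str ; lock_Kind : {}}
  lin Wine = {s = "wine"}   ---&gt;  oper Wine : Kind = {s = "wine" ; lock_Kind = &lt;&gt;}
</PRE>
<P>
A hand-written record such as <CODE>{s = "beer"}</CODE> lacks the field <CODE>lock_Kind</CODE>
and therefore does not have the type <CODE>Kind</CODE>; this is how an improperly
constructed library category is detected.
</P>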
<A NAME="toc101"></A>
<H3>Functors</H3>
<P>
A <B>parametrized module</B>, also known as an <B>incomplete module</B> or a
<B>functor</B>, is any module that <CODE>open</CODE>s an <CODE>interface</CODE> (or
an <CODE>abstract</CODE>). Several interfaces may be opened by one
functor. The module header must be prefixed by the word <CODE>incomplete</CODE>.
Here is a typical example, using the resource <CODE>Syntax</CODE> and
a domain specific lexicon:
</P>
<PRE>
incomplete concrete DomainI of Domain = open Syntax, Lex in {...} ;
</PRE>
<P>
A <B>functor instantiation</B> is a module that inherits a functor and
provides an instance to each of its open interfaces. Here is an example:
</P>
<PRE>
concrete DomainSwe of Domain = DomainI with
(Syntax = SyntaxSwe),
(Lex = LexSwe) ;
</PRE>
<P></P>
<A NAME="toc102"></A>
<H3>Restricted inheritance</H3>
<P>
A module of any type can use <B>restricted inheritance</B>, which is
either exclusion or inclusion:
</P>
<PRE>
module M = A[f,g], B-[k] ** ...
</PRE>
<P>
A concrete syntax given to an abstract syntax that uses restricted inheritance
must make the corresponding restrictions. In addition, the concrete syntax can
make its own restrictions in order to redefine inherited linearization types and
rules.
</P>
<P>
Overriding old definitions without explicit restrictions is not allowed.
</P>
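<P>
A hedged example of exclusion: a vegetarian variant of <CODE>Foods</CODE> that leaves
out <CODE>Fish</CODE> and adds one new kind. The module names and the constant <CODE>Tofu</CODE>
are hypothetical, and the concrete module assumes that <CODE>FoodsEng</CODE> linearizes
<CODE>Kind</CODE> as <CODE>CN</CODE>.
</P>
<PRE>
  abstract VegFoods = Foods - [Fish] ** {
    fun Tofu : Kind ;
  }

  -- the concrete syntax makes the corresponding restriction
  concrete VegFoodsEng of VegFoods = FoodsEng - [Fish] **
    open SyntaxEng, ParadigmsEng in {
      lin Tofu = mkCN (mkN "tofu") ;
  }
</PRE>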
<A NAME="toc103"></A>
<H1>Refining semantics in abstract syntax</H1>
<P>
<a name="chapsix"></a>
</P>
<P>
While the concrete syntax constructs of GF have already been
covered, there is much more that can be done in the abstract
syntax. The techniques of <B>dependent types</B> and
<B>higher order abstract syntax</B> are introduced in this chapter,
which thereby concludes the presentation of the GF language.
</P>
<P>
Many of the examples in this chapter are somewhat less close to
applications than the ones shown before. Moreover, the tools for
embedded grammars in <a href="#chapeight">the eighth chapter</a> do not yet fully support dependent
types and higher order abstract syntax.
</P>
<A NAME="toc104"></A>
<H2>GF as a logical framework</H2>
<P>
In this chapter, we will show how
to encode advanced semantic concepts in an abstract syntax.
We use concepts inherited from <B>type theory</B>. Type theory
is the basis of many systems known as <B>logical frameworks</B>, which are
used for representing mathematical theorems and their proofs on a computer.
In fact, GF has a logical framework as its proper part:
this part is the abstract syntax.
</P>
<P>
In a logical framework, the formalization of a mathematical theory
is a set of type and function declarations. The following is an example
of such a theory, represented as an <CODE>abstract</CODE> module in GF.
</P>
<PRE>
abstract Arithm = {
cat
Prop ; -- proposition
Nat ; -- natural number
fun
Zero : Nat ; -- 0
Succ : Nat -&gt; Nat ; -- the successor of x
Even : Nat -&gt; Prop ; -- x is even
And : Prop -&gt; Prop -&gt; Prop ; -- A and B
}
</PRE>
<P>
This example does not show any new type-theoretical constructs yet, but
it could nevertheless be used as a part of a proof system for arithmetic.
</P>
<P>
<B>Exercise</B>. Give a concrete syntax of <CODE>Arithm</CODE>, preferably
by using the resource library.
</P>
<A NAME="toc105"></A>
<H2>Dependent types</H2>
<P>
<a name="secsmarthouse"></a>
</P>
<P>
<B>Dependent types</B> are a characteristic feature of GF,
inherited from the <B>constructive type theory</B> of Martin-L&ouml;f and
distinguishing GF from most other grammar formalisms and
functional programming languages.
</P>
<P>
Dependent types can be used for stating stronger
<B>conditions of well-formedness</B> than ordinary types.
A simple example is a "smart house" system, which
defines voice commands for household appliances. This example
is borrowed from the
Regulus Book
(Rayner &amp; al. 2006).
</P>
<P>
One who enters a smart house can use a spoken <CODE>Command</CODE> to dim lights, switch
on the fan, etc. For <CODE>Device</CODE>s of each <CODE>Kind</CODE>, there is a set of
<CODE>Action</CODE>s that can be performed on them; thus one can dim the lights but
not the fan, for example. These dependencies can be expressed
by making the type <CODE>Action</CODE> dependent on <CODE>Kind</CODE>. We express these
dependencies in <CODE>cat</CODE> declarations by attaching argument types to
categories:
</P>
<PRE>
cat
Command ;
Kind ;
Device Kind ; -- argument type Kind
Action Kind ;
</PRE>
<P>
The crucial use of the dependencies is made in the rule for forming commands:
</P>
<PRE>
fun CAction : (k : Kind) -&gt; Action k -&gt; Device k -&gt; Command ;
</PRE>
<P>
In other words: an action and a device can be combined into a command only
if they are of the same <CODE>Kind</CODE> <CODE>k</CODE>. If we have the functions
</P>
<PRE>
DKindOne : (k : Kind) -&gt; Device k ; -- the light
light, fan : Kind ;
dim : Action light ;
</PRE>
<P>
we can form the syntax tree
</P>
<PRE>
CAction light dim (DKindOne light)
</PRE>
<P>
but we cannot form the trees
</P>
<PRE>
CAction light dim (DKindOne fan)
CAction fan dim (DKindOne light)
CAction fan dim (DKindOne fan)
</PRE>
<P>
Linearization rules are written as usual: the concrete syntax does not
know if a category is a dependent type. In English, one could write as follows:
</P>
<PRE>
lincat Action = {s : Str} ;
lin CAction _ act dev = {s = act.s ++ dev.s} ;
</PRE>
<P>
Notice that the argument for <CODE>Kind</CODE> does not appear in the linearization;
therefore it is good practice to make this clear by
using a wild card for it, rather than a real
variable.
As we will show,
the type checker can reconstruct the kind from the <CODE>dev</CODE> argument.
</P>
<P>
Parsing with dependent types is performed in two phases:
</P>
<OL>
<LI>context-free parsing
<LI>filtering through type checker
</OL>
<P>
If you just parse in the usual way, you don't enter the second phase, and
the <CODE>Kind</CODE> argument is not found:
</P>
<PRE>
&gt; parse "dim the light"
CAction ? dim (DKindOne light)
</PRE>
<P>
Moreover, type-incorrect commands are not rejected:
</P>
<PRE>
&gt; parse "dim the fan"
CAction ? dim (DKindOne fan)
</PRE>
<P>
The question mark <CODE>?</CODE> is a <B>metavariable</B>, and is returned by the parser
for any subtree that is suppressed by a linearization rule.
These are exactly the same kind of metavariables as were used <a href="#secediting">here</a>
to mark incomplete parts of trees in the syntax editor.
</P>
<P>
To get rid of metavariables, we must feed the parse result into the
second phase of <B>solving</B> them. The <CODE>solve</CODE> process uses the dependent
type checker to restore the values of the metavariables. It is invoked by
the command <CODE>put_tree = pt</CODE> with the flag <CODE>-transform=solve</CODE>:
</P>
<PRE>
&gt; parse "dim the light" | put_tree -transform=solve
CAction light dim (DKindOne light)
</PRE>
<P>
The <CODE>solve</CODE> process may fail, in which case no tree is returned:
</P>
<PRE>
&gt; parse "dim the fan" | put_tree -transform=solve
no tree found
</PRE>
<P></P>
<P>
<B>Exercise</B>. Write an abstract syntax module with the above contents
and an appropriate English concrete syntax. Try to parse the commands
<I>dim the light</I> and <I>dim the fan</I>, with and without <CODE>solve</CODE> filtering.
</P>
<P>
<B>Exercise</B>. Perform random and exhaustive generation, with and without
<CODE>solve</CODE> filtering.
</P>
<P>
<B>Exercise</B>. Add some device kinds and actions to the grammar.
</P>
<A NAME="toc106"></A>
<H2>Polymorphism</H2>
<P>
<a name="secpolymorphic"></a>
</P>
<P>
Sometimes an action can be performed on all kinds of devices. It would be
possible to introduce separate <CODE>fun</CODE> constants for each kind-action pair,
but this would be tedious. Instead, one can use <B>polymorphic</B> actions,
i.e. actions that take a <CODE>Kind</CODE> as an argument and produce an <CODE>Action</CODE>
for that <CODE>Kind</CODE>:
</P>
<PRE>
fun switchOn, switchOff : (k : Kind) -&gt; Action k ;
</PRE>
<P>
Functions that are not polymorphic are <B>monomorphic</B>. However, the
dichotomy into monomorphism and full polymorphism is not always sufficient
for good semantic modelling: very typically, some actions are defined
for a proper subset of devices, but not just one. For instance, both doors and
windows can be opened, whereas lights cannot.
We will return to this problem by introducing the
concept of <B>restricted polymorphism</B> later,
after a section on proof objects.
</P>
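<P>
With the functions introduced so far, a polymorphic action is first applied to
a kind and then combined with a device of that kind. Here is a sketch of two
well-typed trees, followed by possible English linearization rules that assume
the linearization type <CODE>{s : Str}</CODE> for <CODE>Action</CODE>:
</P>
<PRE>
  CAction light (switchOn light) (DKindOne light)
  CAction fan (switchOff fan) (DKindOne fan)

  -- the kind argument is suppressed in the linearization
  lin switchOn _ = {s = "switch" ++ "on"} ;
  lin switchOff _ = {s = "switch" ++ "off"} ;
</PRE>
<P>
A tree such as <CODE>CAction light (switchOn fan) (DKindOne light)</CODE> is still
rejected, since <CODE>switchOn fan</CODE> has the type <CODE>Action fan</CODE>.
</P>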
<P>
<B>Exercise</B>. The grammar <CODE>ExtFoods</CODE> <a href="#secextended">here</a> permits the
formation of phrases such as <I>we drink this fish</I> and <I>we eat this wine</I>.
A way to prevent them is to distinguish between eatable and drinkable food items.
Another, related problem is that there is some duplicated code
due to a category distinction between guests and food items, for instance,
two constructors for the determiner <I>this</I>. This problem can also
be solved by dependent types. Rewrite the abstract syntax in <CODE>Foods</CODE> and
<CODE>ExtFoods</CODE> by using such a type system, and also update the concrete syntaxes.
If you do this right, you only have to change the functor modules
<CODE>FoodsI</CODE> and <CODE>ExtFoodsI</CODE> in the concrete syntax.
</P>
<A NAME="toc107"></A>
<H3>Digression: dependent types in concrete syntax</H3>
<P>
The <B>functional fragment</B> of GF
terms and types comprises function types, applications, lambda
abstracts, constants, and variables. This fragment is the same in
abstract and concrete syntax. In particular,
dependent types are also available in concrete syntax.
We have not made use of them yet,
but we will now look at one example of how they
can be used.
</P>
<P>
Readers who are familiar with functional programming languages
like ML and Haskell may already have missed <B>polymorphic</B>
functions. For instance, Haskell programmers have access to
the functions
</P>
<PRE>
const :: a -&gt; b -&gt; a
const c _ = c
flip :: (a -&gt; b -&gt; c) -&gt; b -&gt; a -&gt; c
flip f y x = f x y
</PRE>
<P>
which can be used for any given types <CODE>a</CODE>, <CODE>b</CODE>, and <CODE>c</CODE>.
</P>
<P>
The GF counterparts of polymorphic functions are <B>monomorphic</B>
functions with explicit <B>type variables</B> --- a technique that we already
used in abstract syntax for modelling actions that can be performed
on all kinds of devices. Thus the above definitions can be written
</P>
<PRE>
oper const : (a,b : Type) -&gt; a -&gt; b -&gt; a =
\_,_,c,_ -&gt; c ;
oper flip : (a,b,c : Type) -&gt; (a -&gt; b -&gt; c) -&gt; b -&gt; a -&gt; c =
\_,_,_,f,x,y -&gt; f y x ;
</PRE>
<P>
When these operations are used, the type checker requires all their
arguments, including the type arguments, to be given explicitly; this may be a nuisance
for a Haskell or ML programmer. Such operations have not been used very much,
except in the <CODE>Coordination</CODE> module of the resource library.
</P>
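<P>
The following sketch shows what explicit type arguments look like at the use
site. The module name <CODE>PolyDemo</CODE> and the helper <CODE>glue</CODE> are made up for
this illustration; each application passes the types <CODE>Str</CODE> before the
ordinary arguments.
</P>
<PRE>
  resource PolyDemo = {
    oper
      const : (a,b : Type) -&gt; a -&gt; b -&gt; a =
        \_,_,c,_ -&gt; c ;
      flip : (a,b,c : Type) -&gt; (a -&gt; b -&gt; c) -&gt; b -&gt; a -&gt; c =
        \_,_,_,f,x,y -&gt; f y x ;

      -- hypothetical helper: concatenation of two token lists
      glue : Str -&gt; Str -&gt; Str = \x,y -&gt; x ++ y ;

      -- computes to "hello": the second value argument is discarded
      first : Str = const Str Str "hello" "world" ;

      -- computes to "hello" ++ "world": the two value arguments are swapped
      swapped : Str = flip Str Str Str glue "world" "hello" ;
  }
</PRE>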
<A NAME="toc108"></A>
<H2>Proof objects</H2>
<P>
Perhaps the most well-known idea in constructive type theory is
the <B>Curry-Howard isomorphism</B>, also known as the
<B>propositions as types principle</B>. Its earliest formulations
were attempts to give semantics to the logical systems of
propositional and predicate calculus. In this section, we will consider
a more elementary example, showing how the notion of proof is useful
outside mathematics, as well.
</P>
<P>
We use the already shown category of unary (also known as Peano-style)
natural numbers:
</P>
<PRE>
cat Nat ;
fun Zero : Nat ;
fun Succ : Nat -&gt; Nat ;
</PRE>
<P>
The <B>successor function</B> <CODE>Succ</CODE> generates an infinite
sequence of natural numbers, beginning from <CODE>Zero</CODE>.
</P>
<P>
We then define what it means for a number <I>x</I> to be <I>less than</I>
a number <I>y</I>. Our definition is based on two axioms:
</P>
<UL>
<LI><CODE>Zero</CODE> is less than <CODE>Succ</CODE> <I>y</I> for any <I>y</I>.
<LI>If <I>x</I> is less than <I>y</I>, then <CODE>Succ</CODE> <I>x</I> is less than <CODE>Succ</CODE> <I>y</I>.
</UL>
<P>
The most straightforward way of expressing these axioms in type theory
is with a dependent type <CODE>Less</CODE> <I>x y</I>, and two functions constructing
its objects:
</P>
<PRE>
cat Less Nat Nat ;
fun lessZ : (y : Nat) -&gt; Less Zero (Succ y) ;
fun lessS : (x,y : Nat) -&gt; Less x y -&gt; Less (Succ x) (Succ y) ;
</PRE>
<P>
Objects formed by <CODE>lessZ</CODE> and <CODE>lessS</CODE> are
called <B>proof objects</B>: they establish the truth of certain
mathematical propositions.
For instance, the fact that 2 is less than
4 has the proof object
</P>
<PRE>
lessS (Succ Zero) (Succ (Succ (Succ Zero)))
(lessS Zero (Succ (Succ Zero)) (lessZ (Succ Zero)))
</PRE>
<P>
whose type is
</P>
<PRE>
Less (Succ (Succ Zero)) (Succ (Succ (Succ (Succ Zero))))
</PRE>
<P>
which is the formalization of the proposition that 2 is less than 4.
</P>
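<P>
For comparison, here is a sketch of a smaller proof object built from the same
two functions; it establishes that 1 is less than 2.
</P>
<PRE>
  lessS Zero (Succ Zero) (lessZ Zero)
  -- its type: Less (Succ Zero) (Succ (Succ Zero))
</PRE>
<P>
No proof object can be built for a false proposition such as
<CODE>Less (Succ Zero) Zero</CODE>: the value type of <CODE>lessZ</CODE> has <CODE>Zero</CODE> as its
first argument, and the value type of <CODE>lessS</CODE> has a <CODE>Succ</CODE> as its second
argument, so neither function can produce it.
</P>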
<P>
GF grammars can be used to provide a <B>semantic control</B> of
well-formedness of expressions. We have already seen examples of this:
the grammar of well-formed actions on household devices. By introducing proof objects
we have now added an even more powerful technique of expressing semantic conditions.
</P>
<P>
A simple example of the use of proof objects is the definition of
well-formed <I>time spans</I>: a time span is expected to be from an earlier to
a later time:
</P>
<PRE>
from 3 to 8
</PRE>
<P>
is thus well-formed, whereas
</P>
<PRE>
from 8 to 3
</PRE>
<P>
is not. The following rules for spans impose this condition
by using the <CODE>Less</CODE> predicate:
</P>
<PRE>
cat Span ;
fun span : (m,n : Nat) -&gt; Less m n -&gt; Span ;
</PRE>
<P></P>
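<P>
As a sketch, the span <I>from 0 to 1</I> corresponds to the tree below, whose
third argument is a proof of <CODE>Less Zero (Succ Zero)</CODE>. The English rules
that follow are one possible choice and assume that <CODE>Nat</CODE> has the
linearization type <CODE>{s : Str}</CODE>.
</P>
<PRE>
  span Zero (Succ Zero) (lessZ Zero)

  -- the proof argument is not linearized
  lincat Span = {s : Str} ;
  lin span m n _ = {s = "from" ++ m.s ++ "to" ++ n.s} ;
</PRE>
<P>
Since the proof is suppressed by the linearization rule, the parser returns a
metavariable in its place, just as it does for the <CODE>Kind</CODE> argument of <CODE>CAction</CODE>.
</P>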
<P>
<B>Exercise</B>. Write an abstract and concrete syntax with the
concepts of this section, and experiment with it in GF.
</P>
<P>
<B>Exercise</B>. Define the notions of "even" and "odd" in terms
of proof objects. <B>Hint</B>. You need one function for proving
that 0 is even, and two other functions for propagating the
properties.
</P>
<A NAME="toc109"></A>
<H3>Proof-carrying documents</H3>
<P>
Another possible application of proof objects is <B>proof-carrying documents</B>:
to be semantically well-formed, the abstract syntax of a document must contain a proof
of some property, although the proof is not shown in the concrete document.
Think, for instance, of small documents describing flight connections:
</P>
<P>
<I>To fly from Gothenburg to Prague, first take LH3043 to Frankfurt, then OK0537 to Prague.</I>
</P>
<P>
The well-formedness of this text is partly expressible by dependent typing:
</P>
<PRE>
cat
City ;
Flight City City ;
fun
Gothenburg, Frankfurt, Prague : City ;
LH3043 : Flight Gothenburg Frankfurt ;
OK0537 : Flight Frankfurt Prague ;
</PRE>
<P>
This rules out texts saying <I>take OK0537 from Gothenburg to Prague</I>.
However, there is a
further condition saying that it must be possible to
change from LH3043 to OK0537 in Frankfurt.
This can be modelled as a proof object of a suitable type,
which is required by the constructor
that connects flights.
</P>
<PRE>
cat
IsPossible (x,y,z : City)(Flight x y)(Flight y z) ;
fun
Connect : (x,y,z : City) -&gt;
(u : Flight x y) -&gt; (v : Flight y z) -&gt;
IsPossible x y z u v -&gt; Flight x z ;
</PRE>
<P></P>
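<P>
As a minimal sketch of how such a document is built, suppose the grammar contains a constant stating that the change in Frankfurt is possible. The name <CODE>changeInFrankfurt</CODE> is hypothetical and only serves as an illustration.
</P>
<PRE>
  fun
    changeInFrankfurt : IsPossible Gothenburg Frankfurt Prague LH3043 OK0537 ;

  -- the connection from Gothenburg to Prague, a tree of type Flight Gothenburg Prague:
  --   Connect Gothenburg Frankfurt Prague LH3043 OK0537 changeInFrankfurt
</PRE>
<P>
Without a proof object of the right <CODE>IsPossible</CODE> type, no tree of the form <CODE>Connect ...</CODE> can be built for a given pair of flights.
</P>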
<A NAME="toc110"></A>
<H2>Restricted polymorphism</H2>
<P>
In the first version of the smart house grammar <CODE>Smart</CODE>,
all Actions were either
</P>
<UL>
<LI><B>monomorphic</B>: defined for one Kind
<LI><B>polymorphic</B>: defined for all Kinds
</UL>
<P>
To make this scale up to new Kinds, we can refine this to
<B>restricted polymorphism</B>: defined for Kinds of a certain <B>class</B>.
</P>
<P>
The notion of class can be expressed in abstract syntax
by using the Curry-Howard isomorphism as follows:
</P>
<UL>
<LI>a class is a <B>predicate</B> of Kinds --- i.e. a type depending on Kinds
<LI>a Kind is in a class if there is a proof object of this type
</UL>
<P>
Here is an example with switching and dimming. The classes are called
<CODE>Switchable</CODE> and <CODE>Dimmable</CODE>.
</P>
<PRE>
cat
Switchable Kind ;
Dimmable Kind ;
fun
switchable_light : Switchable light ;
switchable_fan : Switchable fan ;
dimmable_light : Dimmable light ;
switchOn : (k : Kind) -&gt; Switchable k -&gt; Action k ;
dim : (k : Kind) -&gt; Dimmable k -&gt; Action k ;
</PRE>
<P>
One advantage of this formalization is that classes for new
actions can be added incrementally.
</P>
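<P>
For example, a class of lockable Kinds could be added in a separate step, without touching the rules above. The names <CODE>Lockable</CODE>, <CODE>door</CODE>, <CODE>lockable_door</CODE>, and <CODE>lock</CODE> are hypothetical, and the sketch assumes that <CODE>door</CODE> has not been declared already.
</P>
<PRE>
  cat
    Lockable Kind ;
  fun
    door          : Kind ;
    lockable_door : Lockable door ;
    lock          : (k : Kind) -&gt; Lockable k -&gt; Action k ;

  -- well-typed:  lock door lockable_door : Action door
  -- there is no tree of the form  dim fan ...  since Dimmable fan has no proof object
</PRE>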
<P>
<B>Exercise</B>. Write a new version of the <CODE>Smart</CODE> grammar with
classes, and test it in GF.
</P>
<P>
<B>Exercise</B>. Add some actions, kinds, and classes to the grammar.
Try to port the grammar to a new language. You will probably find
out that restricted polymorphism works differently in different languages.
For instance, in Finnish not only doors but also TVs and radios
can be "opened", which means switching them on.
</P>
<A NAME="toc111"></A>
<H2>Variable bindings</H2>
<P>
<a name="secbinding"></a>
</P>
<P>
Mathematical notation and programming languages have
expressions that <B>bind</B> variables. For instance,
a universally quantified proposition
</P>
<PRE>
(All x)B(x)
</PRE>
<P>
consists of the <B>binding</B> <CODE>(All x)</CODE> of the variable <CODE>x</CODE>,
and the <B>body</B> <CODE>B(x)</CODE>, where the variable <CODE>x</CODE> can have
<B>bound occurrences</B>.
</P>
<P>
Variable bindings appear in informal mathematical language as well, for
instance,
</P>
<PRE>
for all x, x is equal to x
the function that for any numbers x and y returns the maximum of x+y
and x*y
Let x be a natural number. Assume that x is even. Then x + 3 is odd.
</PRE>
<P>
In type theory, variable-binding expression forms can be formalized
as functions that take functions as arguments. The universal
quantifier is defined
</P>
<PRE>
fun All : (Ind -&gt; Prop) -&gt; Prop
</PRE>
<P>
where <CODE>Ind</CODE> is the type of individuals and <CODE>Prop</CODE>,
the type of propositions. If we have, for instance, the equality predicate
</P>
<PRE>
fun Eq : Ind -&gt; Ind -&gt; Prop
</PRE>
<P>
we may form the tree
</P>
<PRE>
All (\x -&gt; Eq x x)
</PRE>
<P>
which corresponds to the ordinary notation
</P>
<PRE>
(All x)(x = x).
</PRE>
<P>
An abstract syntax where trees have functions as arguments, as in
the two examples above, has turned out to be precisely the right
thing for the semantics and computer implementation of
variable-binding expressions. The advantage lies in the fact that
only one variable-binding expression form is needed, the lambda abstract
<CODE>\x -&gt; b</CODE>, and all other bindings can be reduced to it.
This makes it easier to implement mathematical theories and reason
about them, since variable binding is tricky to implement and
to reason about. The idea of using functions as arguments of
syntactic constructors is known as <B>higher-order abstract syntax</B>.
</P>
<P>
The question now arises: how to define linearization rules
for variable-binding expressions?
Let us first consider universal quantification,
</P>
<PRE>
fun All : (Ind -&gt; Prop) -&gt; Prop
</PRE>
<P>
In GF, we write
</P>
<PRE>
lin All B = {s = "(" ++ "All" ++ B.$0 ++ ")" ++ B.s}
</PRE>
<P>
to obtain the form shown above.
This linearization rule brings in a new GF concept --- the <CODE>$0</CODE>
field of <CODE>B</CODE> containing a bound variable symbol.
The general rule is that, if an argument type of a function is
itself a function type <CODE>A -&gt; C</CODE>, the linearization type of
this argument is the linearization type of <CODE>C</CODE>
together with a new field <CODE>$0 : Str</CODE>. In the linearization rule
for <CODE>All</CODE>, the argument <CODE>B</CODE> thus has the linearization
type
</P>
<PRE>
{$0 : Str ; s : Str},
</PRE>
<P>
since the linearization type of <CODE>Prop</CODE> is
</P>
<PRE>
{s : Str}
</PRE>
<P>
In other words, the linearization of a function
consists of a linearization of the body together with a
field for a linearization of the bound variable.
Those familiar with type theory or lambda calculus
should notice that GF requires trees to be in
<B>eta-expanded</B> form in order for this to make sense:
for any function of type
</P>
<PRE>
A -&gt; B
</PRE>
<P>
an eta-expanded syntax tree has the form
</P>
<PRE>
\x -&gt; b
</PRE>
<P>
where <CODE>b : B</CODE> under the assumption <CODE>x : A</CODE>.
It is in this form that an expression can be analysed
as having a bound variable and a body, which can be put into
a linearization record.
</P>
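<P>
To illustrate the difference, here is a small sketch with a one-place predicate. The name <CODE>Even</CODE> is hypothetical and used only for this example.
</P>
<PRE>
  fun Even : Ind -&gt; Prop ;   -- hypothetical predicate

  -- All Even              not eta-expanded: no variable symbol for the $0 field
  -- All (\x -&gt; Even x)    eta-expanded: bound variable x, body Even x
</PRE>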
<P>
Given the linearization rule
</P>
<PRE>
lin Eq a b = {s = "(" ++ a.s ++ "=" ++ b.s ++ ")"}
</PRE>
<P>
the linearization of
</P>
<PRE>
\x -&gt; Eq x x
</PRE>
<P>
is the record
</P>
<PRE>
{$0 = "x" ; s = ["( x = x )"]}
</PRE>
<P>
Thus we can compute the linearization of the formula,
</P>
<PRE>
All (\x -&gt; Eq x x) --&gt; {s = ["( All x ) ( x = x )"]}.
</PRE>
<P>
But how did we get the linearization of the variable <CODE>x</CODE>
into the string <CODE>"x"</CODE>? GF grammars have no rules for
this: it is just hard-wired in GF that variable symbols are
linearized into the same strings that represent them in
the print-out of the abstract syntax.
</P>
<P>
To be able to <I>parse</I> variable symbols, however, GF needs to know what
to look for (instead of e.g. trying to parse <I>any</I>
string as a variable). What strings are parsed as variable symbols
is defined in the lexical analysis part of GF parsing:
</P>
<PRE>
&gt; p -cat=Prop -lexer=codevars "(All x)(x = x)"
All (\x -&gt; Eq x x)
</PRE>
<P>
(see more details on lexers <a href="#seclexing">here</a>). If several variables are bound in the
same argument, the labels are <CODE>$0, $1, $2</CODE>, etc.
</P>
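<P>
As a sketch of the multi-variable case, consider a quantifier that binds two variables at once. The function <CODE>AllPair</CODE> is hypothetical; the rule simply follows the labelling convention just described.
</P>
<PRE>
  fun AllPair : (Ind -&gt; Ind -&gt; Prop) -&gt; Prop ;
  lin AllPair B = {s = "(" ++ "All" ++ B.$0 ++ "," ++ B.$1 ++ ")" ++ B.s} ;

  -- AllPair (\x,y -&gt; Eq x y)  would be linearized as  ( All x , y ) ( x = y )
</PRE>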
<P>
<B>Exercise</B>. Write an abstract syntax of the whole
<B>predicate calculus</B>, with the
<B>connectives</B> "and", "or", "implies", and "not", and the
<B>quantifiers</B> "exists" and "for all". Use higher-order functions
to guarantee that unbound variables do not occur.
</P>
<P>
<B>Exercise</B>. Write a concrete syntax for your favourite
notation of predicate calculus. Use LaTeX as target language
if you want nice output. You can also try producing boolean
expressions of some programming language. Use as many parentheses as you need to
guarantee non-ambiguity.
</P>
<A NAME="toc112"></A>
<H2>Semantic definitions</H2>
<P>
<a name="secdefdef"></a>
</P>
<P>
Just like any functional programming language, abstract syntax in
GF has declarations of functions, telling what the type of a function is.
But we have not yet shown how to <B>compute</B>
these functions: all we can do is provide them with arguments
and linearize the resulting terms.
Since our main interest is the well-formedness of expressions,
this has not yet bothered
us very much. As we will see, however, computation does play a role
even in the well-formedness of expressions when dependent types are
present.
</P>
<P>
GF has a form of judgement for <B>semantic definitions</B>,
marked by the key word <CODE>def</CODE>. At its simplest, it is just
the definition of one constant, e.g.
</P>
<PRE>
fun one : Nat ;
def one = Succ Zero ;
</PRE>
<P>
Notice that a <CODE>def</CODE> definition can only be given to names declared by
<CODE>fun</CODE> judgements in the same module; it is not possible to define
an inherited name.
</P>
<P>
We can also define a function with arguments,
</P>
<PRE>
fun twice : Nat -&gt; Nat ;
def twice x = plus x x ;
</PRE>
<P>
which is still a special case of the most general notion of
definition, that of a group of <B>pattern equations</B>:
</P>
<PRE>
fun plus : Nat -&gt; Nat -&gt; Nat ;
def
plus x Zero = x ;
plus x (Succ y) = Succ (plus x y) ;
</PRE>
<P>
To compute a term is, as in functional programming languages,
simply to follow a chain of reductions until no definition
can be applied. For instance, we compute
</P>
<PRE>
plus one one --&gt;
plus (Succ Zero) (Succ Zero) --&gt;
Succ (plus (Succ Zero) Zero) --&gt;
Succ (Succ Zero)
</PRE>
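<P>
The same technique extends to other recursive functions. The following sketch defines multiplication in terms of <CODE>plus</CODE>; the name <CODE>times</CODE> is an assumption made for this example, and the pattern equations presuppose that <CODE>Zero</CODE> and <CODE>Succ</CODE> are treated as constructors.
</P>
<PRE>
  fun times : Nat -&gt; Nat -&gt; Nat ;
  def
    times x Zero     = Zero ;
    times x (Succ y) = plus (times x y) x ;

  -- times one one  --&gt;
  -- times (Succ Zero) (Succ Zero)  --&gt;
  -- plus (times (Succ Zero) Zero) (Succ Zero)  --&gt;
  -- plus Zero (Succ Zero)  --&gt;
  -- Succ (plus Zero Zero)  --&gt;
  -- Succ Zero
</PRE>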
<P>
Computation in GF is performed with the <CODE>pt</CODE> command and the
<CODE>compute</CODE> transformation, e.g.
</P>
<PRE>
&gt; p -tr "1 + 1" | pt -transform=compute -tr | l
plus one one
Succ (Succ Zero)
s(s(0))
</PRE>
<P></P>
<P>
The <CODE>def</CODE> definitions of a grammar induce a notion of
<B>definitional equality</B> among trees: two trees are
definitionally equal if they compute into the same tree.
Thus, trivially, all trees in a chain of computation
(such as the one above) are definitionally equal to each other.
In general, there can be infinitely many definitionally equal trees.
</P>
<P>
An important property of definitional equality is that it is
<B>extensional</B>, i.e. has to do with the sameness of semantic value.
Linearization, on the other hand, is an <B>intensional</B> operation,
i.e. has to do with the sameness of expression. This means that
<CODE>def</CODE> definitions are <I>not</I> evaluated as linearization steps.
Intensionality is a crucial property of linearization, since we want
to use it for things like tracing a chain of evaluation.
For instance, each of the steps of the computation above
has a different linearization into standard arithmetic notation:
</P>
<PRE>
1 + 1
s(0) + s(0)
s(s(0) + 0)
s(s(0))
</PRE>
<P>
In most programming languages, the operations that can be performed on
expressions are extensional, i.e. give equal values to equal arguments.
But GF has both extensional and intensional operations.
Type checking is extensional:
in the type theory with dependent types, types may depend on terms,
and types depending on definitionally equal terms are
equal types. For instance,
</P>
<PRE>
Less Zero one
Less Zero (Succ Zero)
</PRE>
<P>
are equal types. Hence, any tree that type checks as a proof that
0 is less than 1 also type checks as a proof that 0 is less than the successor of 0.
(Recall, in this connection, that the
arguments a category depends on never play any role
in the linearization of trees of that category,
nor in the definition of the linearization type.)
</P>
<P>
When pattern matching is performed with <CODE>def</CODE> equations, it is
crucial to distinguish between <B>constructors</B> and other functions
(cf. <a href="#secmatching">here</a> on pattern matching in concrete syntax).
GF has a judgement form <CODE>data</CODE> to tell that a category has
certain functions as constructors:
</P>
<PRE>
data Nat = Succ | Zero ;
</PRE>
<P>
Unlike in Haskell and ML, new constructors can be added to
a type with new <CODE>data</CODE> judgements. The type signatures of constructors
are given separately, in ordinary <CODE>fun</CODE> judgements.
One can also write directly
</P>
<PRE>
data Succ : Nat -&gt; Nat ;
</PRE>
<P>
which is syntactic sugar for the pair of judgements
</P>
<PRE>
fun Succ : Nat -&gt; Nat ;
data Nat = Succ ;
</PRE>
<P>
If we did not mark <CODE>Zero</CODE> as <CODE>data</CODE>, the definition
</P>
<PRE>
fun isZero : Nat -&gt; Bool ;
def isZero Zero = True ;
def isZero _ = False ;
</PRE>
<P>
would return <CODE>True</CODE> for all arguments, because the pattern <CODE>Zero</CODE>
would be treated as a variable and it would hence match all values!
This is a common pitfall in GF.
</P>
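<P>
A version that avoids the pitfall could look as follows. This is a sketch that assumes the category <CODE>Nat</CODE> is already declared and that <CODE>Bool</CODE> is introduced in the abstract syntax for this purpose; all constructors are declared with <CODE>data</CODE> so that they are matched as constructors rather than read as variables.
</P>
<PRE>
  cat Bool ;
  data
    True  : Bool ;
    False : Bool ;
    Zero  : Nat ;
    Succ  : Nat -&gt; Nat ;
  fun isZero : Nat -&gt; Bool ;
  def
    isZero Zero = True ;    -- Zero is now a constructor pattern
    isZero _    = False ;   -- every other value
</PRE>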
<P>
<B>Exercise</B>. Implement an interpreter of a small functional programming
language with natural numbers, lists, pairs, lambdas, etc. Use higher-order
abstract syntax with semantic definitions. As object language, use
your favourite programming language.
</P>
<A NAME="toc113"></A>
<H2>Summary of GF language features</H2>
<A NAME="toc114"></A>
<H3>Judgements</H3>
<P>
We have generalized the <CODE>cat</CODE> judgement form and introduced two new forms
for abstract syntax:
</P>
<TABLE ALIGN="center" CELLPADDING="4" BORDER="1">
<TR>
<TH>form</TH>
<TH COLSPAN="2">reading</TH>
</TR>
<TR>
<TD><CODE>cat</CODE> <I>C</I> <I>G</I></TD>
<TD><I>C</I> is a category in context <I>G</I></TD>
</TR>
<TR>
<TD><CODE>def</CODE> <I>f</I> <I>P1</I> ... <I>Pn</I> <CODE>=</CODE> <I>t</I></TD>
<TD>function <I>f</I> applied to <I>P1</I>...<I>Pn</I> has value <I>t</I></TD>
</TR>
<TR>
<TD><CODE>data</CODE> <I>C</I> <CODE>=</CODE> <I>C1</I> <CODE>|</CODE> ... <CODE>|</CODE> <I>Cn</I></TD>
<TD>category <I>C</I> has constructors <I>C1</I>...<I>Cn</I></TD>
</TR>
</TABLE>
<P></P>
<P>
The <B>context</B> in the <CODE>cat</CODE> judgement has the form
</P>
<PRE>
(x1 : T1) ... (xn : Tn)
</PRE>
<P>
where the types <I>T1 ... Tn</I> may be increasingly dependent. To form a
type, <I>C</I> must be equipped with arguments of each type in the
context, satisfying the dependencies. As syntactic sugar, we have
</P>
<PRE>
T G === (x : T) G
</PRE>
<P>
if <I>x</I> does not occur in <I>G</I>. The linearization type definition of a
category does not mention the context.
</P>
<P>
In <CODE>def</CODE> judgements, the arguments <I>P1</I>...<I>Pn</I> can be constructor and
variable patterns as well as wild cards, and the binding and
evaluation rules are the same as <a href="#secmatching">here</a>.
</P>
<P>
A <CODE>data</CODE> judgement states that the names on the right-hand side are constructors
of the category on the left-hand side. The precise types of the constructors are
given in the <CODE>fun</CODE> judgements introducing them; the value type of a constructor
of <I>C</I> must be of the form <I>C a1 ... am</I>. As syntactic sugar,
</P>
<PRE>
data f : A1 ... An -&gt; C a1 ... am ===
fun f : A1 ... An -&gt; C a1 ... am ; data C = f ;
</PRE>
<P></P>
<A NAME="toc115"></A>
<H3>Dependent function types</H3>
<P>
A <B>dependent function type</B> has the form
</P>
<PRE>
(x : A) -&gt; B
</PRE>
<P>
where <I>B</I> depends on a variable <I>x</I> of type <I>A</I>. We have the
following syntactic sugar:
</P>
<PRE>
(x,y : A) -&gt; B === (x : A) -&gt; (y : A) -&gt; B
(_ : A) -&gt; B === (x : A) -&gt; B if B does not depend on x
A -&gt; B === (_ : A) -&gt; B
</PRE>
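<P>
As an illustration of these abbreviations, the type of the time span constructor from the section on proof objects can be written with and without sugar:
</P>
<PRE>
  -- sugared form
  fun span : (m,n : Nat) -&gt; Less m n -&gt; Span ;

  -- the same type with all sugar removed:
  --   (m : Nat) -&gt; (n : Nat) -&gt; (_ : Less m n) -&gt; Span
</PRE>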
<P>
A <CODE>fun</CODE> function in abstract syntax may have function types as
argument types. This is called <B>higher-order abstract syntax</B>.
The linearization of an argument
</P>
<PRE>
\z0, ..., zn -&gt; b : (x0 : A0) -&gt; ... -&gt; (xn : An) -&gt; B
</PRE>
<P>
is formed from the linearization <I>b*</I> of <I>b</I> by adding
fields that hold the variable symbols:
</P>
<PRE>
b* ** {$0 = "z0" ; ... ; $n = "zn"}
</PRE>
<P>
If an argument function is itself a higher-order function, its
bound variables cannot be reached in linearization. Thus, in a sense,
the higher-order abstract syntax of GF is just <B>second-order abstract syntax</B>.
</P>
<P>
A <B>syntax tree</B> is a well-typed term in <B>beta-eta normal form</B>, which
means that
</P>
<UL>
<LI>its type is a basic type, i.e. it is not a partial application;
<LI>its arguments are in eta normal form, i.e. either full applications or
lambda abstractions with bodies that are full applications;
<LI>it has no beta redexes, i.e. applications of abstractions.
</UL>
<P>
Terms that are not in this form may occur as arguments of dependent types
and in <CODE>def</CODE> judgements, but they cannot be linearized.
</P>
<A NAME="toc116"></A>
<H1>Grammars of formal languages</H1>
<P>
<a name="chapseven"></a>
</P>
<P>
In this chapter, we will write a grammar for arithmetic expressions as known
from school mathematics and many programming languages. We will see how to
define precedences in GF, how to include built-in integers in grammars, and
how to deal with spaces between tokens in desired ways. As an alternative concrete
syntax, we will generate code for a JVM-like stack machine. We will conclude
by extending the language with variable declarations and assignments, which
are handled in a type-safe way by using higher-order abstract syntax.
</P>
<P>
To write grammars for formal languages is usually less challenging than for
natural languages. There are standard tools for this task, such as the YACC
family of parser generators. Using GF would be overkill for many projects,
and would come with a penalty in efficiency. However, it is still worthwhile to
look at this task. Typical applications of GF are natural-language interfaces
to formal systems: in such applications, the translation between natural and
formal language can be defined as a multilingual grammar. The use of higher-order
abstract syntax, together with dependent types, provides a way to define a
complete compiler in GF.
</P>
<A NAME="toc117"></A>
<H2>Arithmetic expressions</H2>
<A NAME="toc118"></A>
<H3>Abstract syntax</H3>
<P>
We want to write a grammar for what is usually called <B>expressions</B>
in programming languages. The expressions are built from integers by
the binary operations of addition, subtraction, multiplication, and
division. The abstract syntax is easy to write. We call it <CODE>Calculator</CODE>,
since it can be used as the basis of a calculator.
</P>
<PRE>
abstract Calculator = {
cat Exp ;
fun
EPlus, EMinus, ETimes, EDiv : Exp -&gt; Exp -&gt; Exp ;
EInt : Int -&gt; Exp ;
}
</PRE>
<P>
Notice the use of the category <CODE>Int</CODE>. It is a built-in category of
integers. Its syntax trees are denoted by <B>integer literals</B>, which are
sequences of digits. For instance,
</P>
<PRE>
5457455814608954681 : Int
</PRE>
<P>
These are the only objects of type <CODE>Int</CODE>:
grammars are not allowed to declare functions with <CODE>Int</CODE> as value type.
</P>
<A NAME="toc119"></A>
<H3>Concrete syntax: a simple approach</H3>
<P>
Arithmetic expressions should be unambiguous. If we write
</P>
<PRE>
2 + 3 * 4
</PRE>
<P>
it should be parsed as one, but not both, of
</P>
<PRE>
EPlus (EInt 2) (ETimes (EInt 3) (EInt 4))
ETimes (EPlus (EInt 2) (EInt 3)) (EInt 4)
</PRE>
<P>
Under normal conventions, the former is chosen, because
multiplication has <B>higher precedence</B> than addition.
If we want to express the latter tree, we have to use
parentheses:
</P>
<PRE>
(2 + 3) * 4
</PRE>
<P>
However, it is not completely trivial to decide when to use
parentheses and when not. We will therefore begin with a
concrete syntax that always uses parentheses around binary
operator applications.
</P>
<PRE>
concrete CalculatorP of Calculator = open Prelude in {
lincat
Exp = SS ;
lin
EPlus = infix "+" ;
EMinus = infix "-" ;
ETimes = infix "*" ;
EDiv = infix "/" ;
EInt i = i ;
oper
infix : Str -&gt; SS -&gt; SS -&gt; SS = \f,x,y -&gt;
ss ("(" ++ x.s ++ f ++ y.s ++ ")") ;
}
</PRE>
<P>
Now we will obtain
</P>
<PRE>
&gt; linearize EPlus (EInt 2) (ETimes (EInt 3) (EInt 4))
( 2 + ( 3 * 4 ) )
</PRE>
<P>
The first problem, even more urgent than superfluous parentheses, is
to get rid of superfluous spaces and to recognize integer literals
in the parser.
</P>
<A NAME="toc120"></A>
<H2>Lexing and unlexing</H2>
<P>
<a name="seclexing"></a>
</P>
<P>
The input of parsing in GF is not just a string, but a list of
<B>tokens</B>. By default, a list of tokens is obtained from a string
by analysing it into <B>words</B>, which means chunks separated by
spaces. Thus for instance
</P>
<PRE>
"(12 + (3 * 4))"
</PRE>
<P>
is split into the tokens
</P>
<PRE>
"(12", "+", "(3". "*". "4))"
</PRE>
<P>
The parser then tries to find each of these tokens among the terminals
of the grammar, i.e. among the strings that can appear in linearizations.
In our example, only the tokens <CODE>"+"</CODE> and <CODE>"*"</CODE> can be found, and
parsing therefore fails.
</P>
<P>
The proper way to split the above string into tokens would be
</P>
<PRE>
"(", "12", "+", "(", "3", "*", "4", ")", ")"
</PRE>
<P>
Moreover, the tokens <CODE>"12"</CODE>, <CODE>"3"</CODE>, and <CODE>"4"</CODE> should not be sought
among the terminals in the grammar, but treated as integer tokens, which
are defined outside the grammar. Since GF aims to be fully general, such
conventions are not built in: it must be possible for a grammar to have
tokens such as <CODE>"12"</CODE> and <CODE>"12)"</CODE>. Therefore, GF has a way to select
a <B>lexer</B>, a function that splits strings into tokens and classifies
them into terminals, literals, etc.
</P>
<P>
A lexer can be given as a flag to the parsing command:
</P>
<PRE>
&gt; parse -cat=Exp -lexer=codelit "(2 + (3 * 4))"
EPlus (EInt 2) (ETimes (EInt 3) (EInt 4))
</PRE>
<P>
Since the lexer is usually a part of the language specification, it
makes sense to put it in the concrete syntax by using the judgement
</P>
<PRE>
flags lexer = codelit ;
</PRE>
<P>
The problem of getting correct spacing after linearization is likewise solved
by an <B>unlexer</B>:
</P>
<PRE>
&gt; l -unlexer=code EPlus (EInt 2) (ETimes (EInt 3) (EInt 4))
(2 + (3 * 4))
</PRE>
<P>
This flag, too, is usually put into the concrete syntax file.
</P>
<P>
The lexers and unlexers that are available in the GF system can be
seen by
</P>
<PRE>
&gt; help -lexer
&gt; help -unlexer
</PRE>
<P>
A table of the most common lexers and unlexers is given in the Summary
section 7.8.
</P>
<A NAME="toc121"></A>
<H2>Precedence and fixity</H2>
<P>
<a name="secprecedence"></a>
</P>
<P>
Here is a summary of the usual
precedence rules in mathematics and programming languages:
</P>
<UL>
<LI>Integer constants and expressions in parentheses have the highest precedence.
<LI>Multiplication and division have equal precedence, lower than the highest
but higher than addition and subtraction, which are again equal.
<LI>All the four binary operations are <B>left-associative</B>, which means that
e.g. <CODE>1 + 2 + 3</CODE> means the same as <CODE>(1 + 2) + 3</CODE>.
</UL>
<P>
One way of dealing with precedences in compiler books is by dividing expressions
into three categories:
</P>
<UL>
<LI>expressions: addition and subtraction
<LI>terms: multiplication and division
<LI>factors: constants and expressions in parentheses
</UL>
<P>
The context-free grammar, also taking care of associativity, is the following:
</P>
<PRE>
Exp ::= Exp "+" Term | Exp "-" Term | Term ;
Term ::= Term "*" Fact | Term "/" Fact | Fact ;
Fact ::= Int | "(" Exp ")" ;
</PRE>
<P>
A compiler, however, does not want to make a semantic distinction between the
three categories. Nor does it want to build syntax trees with the
<B>coercions</B> that enable the use of higher-level expressions on a lower level, and
encode the use of parentheses. In compiler tools such as YACC, building abstract
syntax trees is performed as a <B>semantic action</B>. For instance, if the parser
recognizes an expression in parentheses, the action is to return only the
expression, without encoding the parentheses.
</P>
<P>
In GF, semantic actions could be encoded by using <CODE>def</CODE> definitions introduced
<a href="#secdefdef">here</a>. But there is a more straightforward way of thinking about
precedences: we introduce a parameter for precedence, and treat it as
an inherent feature of expressions:
</P>
<PRE>
oper
Prec : PType = Ints 2 ;
TermPrec : Type = {s : Str ; p : Prec} ;
mkPrec : Prec -&gt; Str -&gt; TermPrec = \p,s -&gt; {s = s ; p = p} ;
lincat
Exp = TermPrec ;
</PRE>
<P>
This example shows another way to use built-in integers in GF:
the type <CODE>Ints 2</CODE> is a parameter type, whose values are the integers
<CODE>0,1,2</CODE>. These are the three precedence levels we need. The main idea
is to compare the inherent precedence of an expression with the context
in which it is used. If the precedence is higher than or equal to
the expected, then
no parentheses are needed. Otherwise they are. We encode this rule in
the operation
</P>
<PRE>
oper usePrec : TermPrec -&gt; Prec -&gt; Str = \x,p -&gt;
case lessPrec x.p p of {
True =&gt; "(" ++ x.s ++ ")" ;
False =&gt; x.s
} ;
</PRE>
<P>
With this operation, we can build another one, that can be used for
defining left-associative infix expressions:
</P>
<PRE>
infixl : Prec -&gt; Str -&gt; (_,_ : TermPrec) -&gt; TermPrec = \p,f,x,y -&gt;
mkPrec p (usePrec x p ++ f ++ usePrec y (nextPrec p)) ;
</PRE>
<P>
Constant-like expressions (the highest level) can be built simply by
</P>
<PRE>
constant : Str -&gt; TermPrec = mkPrec 2 ;
</PRE>
<P>
All these operations can be found in the library module <CODE>lib/prelude/Formal</CODE>,
so we don't have to define them in our own code. Also the auxiliary operations
<CODE>nextPrec</CODE> and <CODE>lessPrec</CODE> used in their definitions are defined there.
The library has 5 levels instead of 3.
</P>
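<P>
To make the mechanism concrete, here is one possible way to define the two auxiliary operations for the three levels used in this section. This is a sketch under the assumption that <CODE>Prec</CODE> is <CODE>Ints 2</CODE> and that <CODE>Bool</CODE> comes from <CODE>Prelude</CODE>; it is not the library's actual code, which covers five levels.
</P>
<PRE>
  oper
    -- True if the first precedence is strictly lower than the second
    lessPrec : Prec -&gt; Prec -&gt; Bool = \p,q -&gt;
      case &lt;&lt;p,q&gt; : Prec * Prec&gt; of {
        &lt;0,1&gt; | &lt;0,2&gt; | &lt;1,2&gt; =&gt; True ;
        _ =&gt; False
      } ;
    -- the next higher level; the highest level stays where it is
    nextPrec : Prec -&gt; Prec = \p -&gt;
      case &lt;p : Prec&gt; of {
        0 =&gt; 1 ;
        _ =&gt; 2
      } ;

  -- traces (writing n for EInt n):
  -- ETimes (EPlus 2 3) 4:  left argument has precedence 0, expected 1;
  --   lessPrec 0 1 = True, so it is parenthesized:   ( 2 + 3 ) * 4
  -- EPlus 2 (ETimes 3 4):  right argument has precedence 1, expected nextPrec 0 = 1;
  --   lessPrec 1 1 = False, so no parentheses:       2 + 3 * 4
  -- EMinus 1 (EMinus 2 3): right argument has precedence 0, expected 1;
  --   lessPrec 0 1 = True, so it is parenthesized:   1 - ( 2 - 3 )
</PRE>
<P>
The last trace shows how passing <CODE>nextPrec p</CODE> for the right argument gives left associativity: a right operand of the same level must be parenthesized, whereas a left operand need not be.
</P>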
<P>
Now we can express the whole concrete syntax of <CODE>Calculator</CODE> compactly:
</P>
<PRE>
concrete CalculatorC of Calculator = open Formal, Prelude in {
flags lexer = codelit ; unlexer = code ; startcat = Exp ;
lincat Exp = TermPrec ;
lin
EPlus = infixl 0 "+" ;
EMinus = infixl 0 "-" ;
ETimes = infixl 1 "*" ;
EDiv = infixl 1 "/" ;
EInt i = constant i.s ;
}
</PRE>
<P>
Let us just take one more look at the operation <CODE>usePrec</CODE>, which decides whether
to put parentheses around a term or not. The case where parentheses are not needed
around a string was defined as the string itself.
However, this would imply that superfluous parentheses
are never correct. A more liberal grammar is obtained by using the operation
</P>
<PRE>
parenthOpt : Str -&gt; Str = \s -&gt; variants {s ; "(" ++ s ++ ")"} ;
</PRE>
<P>
which is actually used in the <CODE>Formal</CODE> library.
But even in this way, we can only allow one pair of superfluous parentheses.
Thus the parameter-based grammar has not quite reached the goal
of implementing the same language as the expression-term-factor grammar.
But it has the advantage of eliminating precedence distinctions from the
abstract syntax.
</P>
<P>
<B>Exercise</B>. Define non-associative and right-associative infix operations
analogous to <CODE>infixl</CODE>.
</P>
<P>
<B>Exercise</B>. Add a constructor that puts parentheses around expressions
to raise their precedence, but that is eliminated by a <CODE>def</CODE> definition.
Test parsing with and without a pipe to <CODE>pt -transform=compute</CODE>.
</P>
<A NAME="toc122"></A>
<H2>Code generation as linearization</H2>
<P>
The classical use of grammars of programming languages is in <B>compilers</B>,
which translate one language into another. Typically the source language of
a compiler is a high-level language and the target language is a machine
language. The hub of a compiler is abstract syntax: the <B>front end</B> of
the compiler parses source language strings into abstract syntax trees, and
the <B>back end</B> linearizes these trees into the target language. This processing
model is of course what GF uses for natural language translation; the main
difference is that, in GF, the compiler could run in the opposite direction as
well, that is, function as a <B>decompiler</B>. (In full-size compilers, the
abstract syntax is also transformed by several layers of semantic analysis
and optimizations, before the target code is generated; this can destroy
reversibility and hence decompilation.)
</P>
<P>
More for the sake of illustration
than as a serious compiler, let us write a concrete
syntax of <CODE>Calculator</CODE> that generates machine code similar to JVM (Java Virtual
Machine). JVM is a so-called <B>stack machine</B>, whose code follows the
<B>postfix</B> notation, also known as <B>reverse Polish</B> notation. Thus the
expression
</P>
<PRE>
2 + 3 * 4
</PRE>
<P>
is translated to
</P>
<PRE>
iconst 2 ; iconst 3 ; iconst 4 ; imul ; iadd
</PRE>
<P>
The linearization rules are not difficult to give:
</P>
<PRE>
lin
EPlus = postfix "iadd" ;
EMinus = postfix "isub" ;
ETimes = postfix "imul" ;
EDiv = postfix "idiv" ;
EInt i = ss ("iconst" ++ i.s) ;
oper
postfix : Str -&gt; SS -&gt; SS -&gt; SS = \op,x,y -&gt;
ss (x.s ++ ";" ++ y.s ++ ";" ++ op) ;
</PRE>
<P></P>
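<P>
Wrapped into a complete module, the rules above could look as follows. The module name <CODE>CalculatorJ</CODE> is hypothetical, and the sketch assumes the <CODE>Prelude</CODE> operations <CODE>SS</CODE> and <CODE>ss</CODE>; the comments trace how the example expression is assembled bottom-up.
</P>
<PRE>
  concrete CalculatorJ of Calculator = open Prelude in {
    lincat Exp = SS ;
    lin
      EPlus  = postfix "iadd" ;
      EMinus = postfix "isub" ;
      ETimes = postfix "imul" ;
      EDiv   = postfix "idiv" ;
      EInt i = ss ("iconst" ++ i.s) ;
    oper
      -- code for both operands first, then the instruction itself
      postfix : Str -&gt; SS -&gt; SS -&gt; SS = \op,x,y -&gt;
        ss (x.s ++ ";" ++ y.s ++ ";" ++ op) ;
  }

  -- EInt 3                                     iconst 3
  -- ETimes (EInt 3) (EInt 4)                   iconst 3 ; iconst 4 ; imul
  -- EPlus (EInt 2) (ETimes (EInt 3) (EInt 4))  iconst 2 ; iconst 3 ; iconst 4 ; imul ; iadd
</PRE>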
<A NAME="toc123"></A>
<H2>Speaking aloud arithmetic expressions</H2>
<P>
Natural languages sometimes have difficulties in expressing mathematical
formulas unambiguously, because they have no universal device of parentheses.
For arithmetic formulas, a solution exists:
</P>
<PRE>
2 + 3 * 4
</PRE>
<P>
can be expressed
</P>
<PRE>
the sum of 2 and the product of 3 and 4
</PRE>
<P>
However, this format is very verbose and unnatural, and becomes
impossible to understand when the complexity of expressions grows.
Fortunately, spoken language
has a very nice way of using <B>pauses</B> for disambiguation. This device was
introduced by Torbj&ouml;rn Lager (personal communication, 2003)
as an input mechanism to a calculator dialogue
system; it seems to correspond very closely to how we actually speak when we
want to communicate arithmetic expressions. Another application would be as
a part of a programming assistant that reads aloud code.
</P>
<P>
The idea is that, after every completed operation, there is a pause. Try this
by speaking aloud the following lines, making a pause instead of pronouncing the
word <CODE>PAUSE</CODE>:
</P>
<PRE>
2 plus 3 times 4 PAUSE
2 plus 3 PAUSE times 4 PAUSE
</PRE>
<P>
A grammar implementing this convention is again simple to write:
</P>
<PRE>
lin
EPlus = infix "plus" ;
EMinus = infix "minus" ;
ETimes = infix "times" ;
EDiv = infix ["divided by"] ;
EInt i = i ;
oper
infix : Str -&gt; SS -&gt; SS -&gt; SS = \op,x,y -&gt;
ss (x.s ++ op ++ y.s ++ "PAUSE") ;
</PRE>
<P>
Intuitively, a pause is taken to give the hearer time to compute an
intermediate result.
</P>
<P>
<B>Exercise</B>. Is the pause-based grammar unambiguous? Test with random examples!
</P>
<A NAME="toc124"></A>
<H2>Programs with variables</H2>
<P>
A useful extension of arithmetic expressions is a <B>straight code</B> programming
language. The programs of this language are <B>assignments</B> of the form <CODE>x = exp</CODE>,
which assign expressions to variables. Expressions can moreover contain variables
that have been given values in previous assignments.
</P>
<P>
In this language, we use two new categories: programs and variables.
A program is a sequence of assignments, where a variable is given a value.
Logically, we want to distinguish <B>initializations</B> from other assignments:
these are the assignments where a variable is given a value for the first time.
Just like in C-like languages,
we prefix an initializing assignment with the type of the variable.
Here is an example of a piece of code written in the language:
</P>
<PRE>
int x = 2 + 3 ;
int y = x + 1 ;
x = x + 9 * y ;
</PRE>
<P>
We define programs by the following constructors:
</P>
<PRE>
fun
PEmpty : Prog ;
PInit : Exp -&gt; (Var -&gt; Prog) -&gt; Prog ;
PAss : Var -&gt; Exp -&gt; Prog -&gt; Prog ;
</PRE>
<P>
The interesting constructor is <CODE>PInit</CODE>, which uses
higher-order abstract syntax for making the initialized variable available in
the <B>continuation</B> of the program. The abstract syntax tree for the above code
is
</P>
<PRE>
PInit (EPlus (EInt 2) (EInt 3)) (\x -&gt;
PInit (EPlus (EVar x) (EInt 1)) (\y -&gt;
PAss x (EPlus (EVar x) (ETimes (EInt 9) (EVar y)))
PEmpty))
</PRE>
<P>
Since we want to prevent the use of uninitialized variables in programs, we
don't give any constructors for <CODE>Var</CODE>! We just have a rule for using variables
as expressions:
</P>
<PRE>
fun EVar : Var -&gt; Exp ;
</PRE>
<P>
The rest of the grammar is just the same as for arithmetic expressions
<a href="#secprecedence">here</a>. The best way to implement it is perhaps by writing a
module that extends the expression module. The most natural start category
of the extension is <CODE>Prog</CODE>.
</P>
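<P>
The idea of higher-order abstract syntax can also be tried out in Haskell, where the
continuation of <CODE>PInit</CODE> becomes an ordinary function. The following is a minimal
sketch under our own assumptions: the types, the function <CODE>render</CODE>, and the
generated names <CODE>v0</CODE>, <CODE>v1</CODE> are hypothetical, and only the constructors needed
for the example are modelled.
</P>

```haskell
-- Hypothetical Haskell rendering of the abstract syntax. In a real module
-- the constructor Var would not be exported, so that variables can only
-- come from PInit.
newtype Var = Var String

data Exp
  = EInt Integer
  | EVar Var
  | EPlus Exp Exp
  | ETimes Exp Exp

data Prog
  = PEmpty
  | PInit Exp (Var -> Prog)   -- the continuation receives the new variable
  | PAss Var Exp Prog

-- Print a program in C-like syntax, inventing the names v0, v1, ...
-- for the bound variables.
render :: Prog -> [String]
render = go (0 :: Int)
  where
    go _ PEmpty = []
    go n (PInit e k) =
      let v = Var ("v" ++ show n)
      in ("int " ++ name v ++ " = " ++ expr e ++ " ;") : go (n + 1) (k v)
    go n (PAss v e p) = (name v ++ " = " ++ expr e ++ " ;") : go n p

name :: Var -> String
name (Var s) = s

-- No parentheses are inserted; this is enough for the example below.
expr :: Exp -> String
expr e = case e of
  EInt i     -> show i
  EVar v     -> name v
  EPlus x y  -> expr x ++ " + " ++ expr y
  ETimes x y -> expr x ++ " * " ++ expr y

-- The example program from the text.
example :: Prog
example =
  PInit (EPlus (EInt 2) (EInt 3)) (\x ->
  PInit (EPlus (EVar x) (EInt 1)) (\y ->
  PAss x (EPlus (EVar x) (ETimes (EInt 9) (EVar y)))
  PEmpty))
```

<P>
Here <CODE>render example</CODE> gives the three lines <CODE>int v0 = 2 + 3 ;</CODE>,
<CODE>int v1 = v0 + 1 ;</CODE> and <CODE>v0 = v0 + 9 * v1 ;</CODE>. Since a <CODE>Var</CODE> is only
ever handed out by <CODE>PInit</CODE>, a program built this way cannot mention a variable before
its initialization.
</P>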
<P>
<B>Exercise</B>. Extend the straight-code language to expressions of type <CODE>float</CODE>.
To guarantee type safety, you can define a category <CODE>Typ</CODE> of types, and
make <CODE>Exp</CODE> and <CODE>Var</CODE> dependent on <CODE>Typ</CODE>. Basic floating point expressions
can be formed from literals of the built-in GF type <CODE>Float</CODE>. The arithmetic
operations should be made polymorphic (as <a href="#secpolymorphic">here</a>).
</P>
<A NAME="toc125"></A>
<H3>The concrete syntax of assignments</H3>
<P>
We can define a C-like concrete syntax by using GF's <CODE>$</CODE> variables, as explained
<a href="#secbinding">here</a>.
</P>
<P>
In a JVM-like syntax, we need two more instructions: <CODE>iload</CODE> <I>x</I>, which
loads (pushes on the stack) the value of the variable <I>x</I>, and <CODE>istore</CODE> <I>x</I>,
which stores the value of the currently topmost expression in the variable <I>x</I>.
Thus the code for the example in the previous section is
</P>
<PRE>
iconst 2 ; iconst 3 ; iadd ; istore x ;
iload x ; iconst 1 ; iadd ; istore y ;
iload x ; iconst 9 ; iload y ; imul ; iadd ; istore x ;
</PRE>
<P>
Those familiar with JVM will notice that we are using <B>symbolic addresses</B>, i.e.
variable names instead of integer offsets in the memory. Neither real JVM nor
our variant makes any distinction between the initialization and reassignment
of a variable.
</P>
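<P>
To check that the generated code computes what the source program says, the instructions can
be run on a small stack machine. The sketch below is written in Haskell and rests on our own
assumptions: the datatype <CODE>Instr</CODE> and the functions <CODE>step</CODE> and <CODE>run</CODE>
are hypothetical, and only the instructions used in the example are covered.
</P>

```haskell
import Control.Monad (foldM)
import qualified Data.Map as Map

-- Hypothetical instruction set, following the JVM-like syntax in the text.
data Instr
  = IConst Integer
  | ILoad String
  | IStore String
  | IAdd
  | IMul
  deriving (Eq, Show)

type Stack = [Integer]
type Env = Map.Map String Integer

-- Execute one instruction on a stack and a variable environment.
-- Ill-formed code (stack underflow, unknown variable) yields Nothing.
step :: (Stack, Env) -> Instr -> Maybe (Stack, Env)
step (st, env) i = case (i, st) of
  (IConst n, _)     -> Just (n : st, env)
  (ILoad x, _)      -> fmap (\v -> (v : st, env)) (Map.lookup x env)
  (IStore x, v : r) -> Just (r, Map.insert x v env)
  (IAdd, b : a : r) -> Just (a + b : r, env)
  (IMul, b : a : r) -> Just (a * b : r, env)
  _                 -> Nothing

-- Run a whole instruction sequence and return the final environment.
run :: [Instr] -> Maybe Env
run code = fmap snd (foldM step ([], Map.empty) code)

-- The code for the example program.
example :: [Instr]
example =
  [ IConst 2, IConst 3, IAdd, IStore "x"
  , ILoad "x", IConst 1, IAdd, IStore "y"
  , ILoad "x", IConst 9, ILoad "y", IMul, IAdd, IStore "x"
  ]
```

<P>
Running <CODE>run example</CODE> leaves <CODE>x</CODE> with the value 59 and <CODE>y</CODE> with
the value 6, as expected from the C-like source. Loading a variable that has never been stored
makes <CODE>run</CODE> return <CODE>Nothing</CODE>, which is the situation the abstract syntax
with <CODE>PInit</CODE> rules out in advance.
</P>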
<P>
<B>Exercise</B>. Finish the implementation of the
C-to-JVM compiler by extending the expression modules
to straight code programs.
</P>
<P>
<B>Exercise</B>. If you did the exercise of adding floating point numbers to
the language, you can now cash in on the main advantage of type checking
for code generation: selecting type-correct JVM instructions. The floating
point instructions are precisely the same as the integer ones, except that
the prefix is <CODE>f</CODE> instead of <CODE>i</CODE>, and that <CODE>fconst</CODE> takes floating
point literals as arguments.
</P>
<A NAME="toc126"></A>
<H3>A liberal syntax of variables</H3>
<P>
In many applications, the task of GF is just linearization and parsing;
keeping track of bound variables and other semantic constraints is
the task of other parts of the program. For instance, if we want to
write a natural language interface that reads aloud C code, we can
just as well use a context-free grammar of C, and leave it to the C
compiler to check that variables make sense. In such a program, we may
want to treat variables as <I>Strings</I>, i.e. to have a constructor
</P>
<PRE>
fun VString : String -&gt; Var ;
</PRE>
<P>
The built-in category <CODE>String</CODE> has as its values <B>string literals</B>,
which are strings in double quotes. The lexer and unlexer <CODE>codelit</CODE>
restore and remove the quotes; when the lexer finds a token that is
neither a terminal in the grammar nor an integer literal, it sends
it to the parser as a string literal.
</P>
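<P>
The classification rule can be sketched in a few lines of Haskell. This is a simplified
model under our own assumptions, not the actual <CODE>codelit</CODE> lexer: tokens are split at
whitespace only, and the names <CODE>Token</CODE>, <CODE>classify</CODE> and
<CODE>lexLiberal</CODE> are hypothetical.
</P>

```haskell
import Data.Char (isDigit)

-- Hypothetical token type for a lexer in the style of codelit.
data Token
  = Terminal String   -- a word that occurs in the grammar
  | IntLit Integer    -- an integer literal
  | StrLit String     -- anything else, passed on as a string literal
  deriving (Eq, Show)

-- Classify one word, given the list of terminals of the grammar.
classify :: [String] -> String -> Token
classify terminals w
  | w `elem` terminals            = Terminal w
  | not (null w) && all isDigit w = IntLit (read w)
  | otherwise                     = StrLit w

-- Split the input at whitespace and classify every word.
lexLiberal :: [String] -> String -> [Token]
lexLiberal terminals = map (classify terminals) . words
```

<P>
For instance, with the terminals <CODE>int</CODE>, <CODE>=</CODE>, <CODE>+</CODE> and <CODE>;</CODE>,
the input <CODE>int foo = 2 + bar ;</CODE> is lexed so that <CODE>foo</CODE> and <CODE>bar</CODE>
become string literals, which the parser can then accept as arguments of <CODE>VString</CODE>.
</P>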
<P>
<B>Exercise</B>. Write a grammar for straight code without higher-order
abstract syntax.
</P>
<P>
<B>Exercise</B>. Extend the liberal straight code grammar to <CODE>while</CODE> loops and
some other program constructs, and investigate if you can build a reasonable spoken
language generator for this fragment.
</P>
<A NAME="toc127"></A>
<H2>Conclusion</H2>
<P>
Since formal languages are syntactically simpler than natural languages, it
is no wonder that their grammars can be defined in GF. Some thought is needed
for dealing with precedences and spacing, but much of it is encoded in GF's
libraries and built-in lexers and unlexers. If the sole purpose of a grammar
is to implement a programming language, then the <B>BNF Converter</B> tool
(BNFC) is more appropriate than GF:
<center>
<CODE>www.cs.chalmers.se/~markus/BNFC/</CODE>
</center>
BNFC uses standard YACC-like parser tools. GF has flags for printing
grammars in the BNFC format.
</P>
<P>
The most common applications of GF grammars of formal languages
are in natural-language interfaces of various kinds.
These systems don't usually need semantic control in GF abstract
syntax. However, the situation can be different if the interface also comprises
an interactive syntax editor, as in the GF-Key system
(Beckert &amp; al. 2006, Burke &amp; Johannisson 2005).
In that system, the editor guides programmers so that they can write only
type-correct code.
</P>
<P>
The technique of continuations in modelling programming languages has recently
been applied to natural language, for processing <B>anaphoric reference</B>,
e.g. pronouns. It may be good to know that GF has the machinery available;
for the time being, however (GF 2.8), dependent types and
higher-order abstract syntax are not supported by the embedded GF implementations
in Haskell and Java.
</P>
<P>
<B>Exercise</B>. The book <I>The C Programming Language</I> by Kernighan and Ritchie
(p. 123, 2nd edition, 1988) describes an English-like syntax for pointer and
array declarations, and a C program for translating between English and C.
The following example pair shows all the expression forms needed:
</P>
<PRE>
char (*(*x[3])())[5]
x: array[3] of pointer to function returning
pointer to array[5] of char
</PRE>
<P>
Implement these translations by a GF grammar.
</P>
<P>
<B>Exercise</B>. Design a natural-language interface to Unix command lines.
It should be able to express verbally commands such as
<CODE>cat, cd, grep, ls, mv, rm, wc</CODE> and also
pipes built from them.
</P>
<A NAME="toc128"></A>
<H2>Summary of GF language constructs</H2>
<A NAME="toc129"></A>
<H3>Lexers and unlexers</H3>
<P>
Lexers are set by the flag <CODE>lexer</CODE> and unlexers by the flag <CODE>unlexer</CODE>.
</P>
<TABLE ALIGN="center" CELLPADDING="4" BORDER="1">
<TR>
<TH>lexer</TH>
<TH COLSPAN="2">description</TH>
</TR>
<TR>
<TD><CODE>words</CODE></TD>
<TD>(default) tokens are separated by spaces or newlines</TD>
</TR>
<TR>
<TD><CODE>literals</CODE></TD>
<TD>like words, but integer and string literals recognized</TD>
</TR>
<TR>
<TD><CODE>chars</CODE></TD>
<TD>each character is a token</TD>
</TR>
<TR>
<TD><CODE>code</CODE></TD>
<TD>program code conventions (uses Haskell's lex)</TD>
</TR>
<TR>
<TD><CODE>text</CODE></TD>
<TD>with conventions on punctuation and capital letters</TD>
</TR>
<TR>
<TD><CODE>codelit</CODE></TD>
<TD>like code, but recognize literals (unknown words as strings)</TD>
</TR>
<TR>
<TD><CODE>textlit</CODE></TD>
<TD>like text, but recognize literals (unknown words as strings)</TD>
</TR>
</TABLE>
<P></P>
<TABLE ALIGN="center" CELLPADDING="4" BORDER="1">
<TR>
<TH>unlexer</TH>
<TH COLSPAN="2">description</TH>
</TR>
<TR>
<TD><CODE>unwords</CODE></TD>
<TD>(default) space-separated token list</TD>
</TR>
<TR>
<TD><CODE>text</CODE></TD>
<TD>format as text: punctuation, capitals, paragraph &lt;p&gt;</TD>
</TR>
<TR>
<TD><CODE>code</CODE></TD>
<TD>format as code (spacing, indentation)</TD>
</TR>
<TR>
<TD><CODE>textlit</CODE></TD>
<TD>like text, but remove string literal quotes</TD>
</TR>
<TR>
<TD><CODE>codelit</CODE></TD>
<TD>like code, but remove string literal quotes</TD>
</TR>
<TR>
<TD><CODE>concat</CODE></TD>
<TD>remove all spaces</TD>
</TR>
</TABLE>
<P></P>
<A NAME="toc130"></A>
<H3>Built-in abstract syntax types</H3>
<P>
There are three built-in types. Their syntax trees are literals of corresponding kinds:
</P>
<UL>
<LI><CODE>Int</CODE>, with nonnegative integer literals e.g. <CODE>987031434</CODE>
<LI><CODE>Float</CODE>, with nonnegative floating-point literals e.g. <CODE>907.219807</CODE>
<LI><CODE>String</CODE>, with string literals e.g. <CODE>"foo"</CODE>
</UL>
<P>
Their linearization type is uniformly <CODE>{s : Str}</CODE>.
</P>
<A NAME="toc131"></A>
<H1>Embedded grammars</H1>
<P>
<a name="chapeight"></a>
</P>
<P>
GF grammars can be used as parts of programs written in other programming
languages, such as Haskell and Java.
This facility is based on several components:
</P>
<UL>
<LI>a portable format for multilingual GF grammars
<LI>an interpreter for this format written in the host language
<LI>an API that enables reading grammar files and calling the interpreter
<LI>a way to manipulate abstract syntax trees in the host language
</UL>
<P>
In this chapter, we will show the basic ways of producing such
<B>embedded grammars</B> and using them in Haskell, Java, and JavaScript programs.
We will build a simple example application in each language:
</P>
<UL>
<LI>a question-answering system in Haskell
<LI>a translator GUI in Java
<LI>a multilingual syntax editor in JavaScript
</UL>
<P>
Moreover, we will show how grammar applications can be extended to
spoken language by generating <B>language models</B> for speech recognition
in various standard formats.
</P>
<A NAME="toc132"></A>
<H2>The portable grammar format</H2>
<P>
The portable format is called GFCC, "GF Canonical Compiled". A file
of this form can be produced from GF by the command
</P>
<PRE>
&gt; print_multi -printer=gfcc | write_file FILE.gfcc
</PRE>
<P>
Files written in this format can also be imported in the GF system,
which recognizes the suffix <CODE>.gfcc</CODE> and builds the multilingual
grammar in memory.
</P>
<P>
<I>This applies to GF version 3 and upwards. Older GF used a format suffixed</I>
<CODE>.gfcm</CODE>.
<I>At the time of writing, the Java interpreter also still uses the GFCM format.</I>
</P>
<P>
GFCC is, in fact, the recommended format for distributing
final grammar products, because GFCC files
are stripped of superfluous information and can be started and applied
faster than sets of separate modules.
</P>
<P>
Application programmers never need to read or modify GFCC files.
Also in this sense, they play the same role as machine code in
general-purpose programming.
</P>
<A NAME="toc133"></A>
<H2>The embedded interpreter and its API</H2>
<P>
The interpreter is a kind of a miniature GF system, which can parse and
linearize with grammars. But it can only perform a subset of the commands of
the GF system. For instance, it
cannot compile source grammars into the GFCC format; the compiler is the most
heavy-weight component of the GF system, and should not be carried around
in end-user applications.
Since GFCC is much
simpler than source GF, building an interpreter is relatively easy.
Full-scale interpreters currently exist in Haskell and Java, and partial
ones in C++, JavaScript, and Prolog. We will in this chapter focus
on Haskell, Java, and JavaScript.
</P>
<P>
Application programmers never need to read or modify the interpreter.
They only need to access it via its API.
</P>
<A NAME="toc134"></A>
<H2>Embedded GF applications in Haskell</H2>
<P>
Readers unfamiliar with Haskell, or who just want to program in Java, can safely
skip this section. Everything will be repeated in the corresponding Java
section. However, seeing the Haskell code may still be helpful because
Haskell is in many ways closer to GF than Java is. In particular, recursive
types of syntax trees and pattern matching over them are very similar in
Haskell and GF,
but require a complex encoding with classes and visitors in Java.
</P>
<A NAME="toc135"></A>
<H3>The EmbedAPI module</H3>
<P>
The Haskell API contains (among other things) the following types and functions:
</P>
<PRE>
module EmbedAPI where
type MultiGrammar
type Language
type Category
type Tree
file2grammar :: FilePath -&gt; IO MultiGrammar
linearize :: MultiGrammar -&gt; Language -&gt; Tree -&gt; String
parse :: MultiGrammar -&gt; Language -&gt; Category -&gt; String -&gt; [Tree]
linearizeAll :: MultiGrammar -&gt; Tree -&gt; [String]
linearizeAllLang :: MultiGrammar -&gt; Tree -&gt; [(Language,String)]
parseAll :: MultiGrammar -&gt; Category -&gt; String -&gt; [[Tree]]
parseAllLang :: MultiGrammar -&gt; Category -&gt; String -&gt; [(Language,[Tree])]
languages :: MultiGrammar -&gt; [Language]
categories :: MultiGrammar -&gt; [Category]
startCat :: MultiGrammar -&gt; Category
</PRE>
<P>
This is the only module that needs to be imported in the Haskell application.
It is available as a part of the GF distribution, in the file
<CODE>src/GF/GFCC/API.hs</CODE>.
</P>
<A NAME="toc136"></A>
<H3>First application: a translator</H3>
<P>
Let us first build a stand-alone translator, which can translate
in any multilingual grammar between any languages in the grammar.
The whole code for this translator is here:
</P>
<PRE>
module Main where

import GF.GFCC.API
import System.Environment (getArgs)

main :: IO ()
main = do
  file:_ &lt;- getArgs
  gr &lt;- file2grammar file
  interact (translate gr)

-- Parse the input, take the first tree returned, and linearize it
-- in all languages other than the one it was parsed in.
translate :: MultiGrammar -&gt; String -&gt; String
translate gr s = case parseAllLang gr (startCat gr) s of
  (lg,t:_):_ -&gt; unlines [linearize gr l t | l &lt;- languages gr, l /= lg]
  _ -&gt; "NO PARSE"
</PRE>
<P>
To run the translator, first compile it by
</P>
<PRE>
% ghc --make -o trans Translator.hs
</PRE>
<P>
Then produce a GFCC file. For instance, the <CODE>Food</CODE> grammar set can be
compiled as follows:
</P>
<PRE>
% gfc --make FoodEng.gf FoodIta.gf
</PRE>
<P>
This produces the file <CODE>Food.gfcc</CODE> (its name comes from the abstract syntax).
</P>
<P>
<I>The gfc batch compiler program is available in GF 3 and upwards.</I>
<I>In earlier versions, the appropriate command can be piped to gf:</I>
</P>
<PRE>
% echo "pm -printer=gfcc | wf Food.gfcc" | gf FoodEng.gf FoodIta.gf
</PRE>
<P>
Equivalently, the grammars could be read into the GF shell and the <CODE>pm</CODE> command
issued from there. But the Unix command has the advantage that it can
be put into a <CODE>Makefile</CODE> to automate the compilation of an application.
</P>
<P>
The Haskell library function <CODE>interact</CODE> makes the <CODE>trans</CODE> program work
like a Unix filter, which reads from standard input and writes to standard
output. Therefore it can be a part of a pipe and read and write files.
The simplest way to translate is to <CODE>echo</CODE> input to the program:
</P>
<PRE>
% echo "this wine is delicious" | ./trans Food.gfcc
questo vino &egrave; delizioso
</PRE>
<P>
The result is given in all languages except the input language.
</P>
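<P>
The control flow of <CODE>translate</CODE> can be tried out without a GF installation by
replacing the grammar with a lookup table. The following Haskell sketch is a toy model only:
<CODE>ToyGrammar</CODE> and all functions prefixed with <CODE>toy</CODE> are hypothetical
stand-ins for the API, and the tree names in <CODE>food</CODE> are invented for the example.
</P>

```haskell
type Language = String
type Tree = String   -- hypothetical: a tree is represented by its name only

-- A toy multilingual grammar: for each language, the string of each tree.
type ToyGrammar = [(Language, [(Tree, String)])]

toyLanguages :: ToyGrammar -> [Language]
toyLanguages = map fst

toyLinearize :: ToyGrammar -> Language -> Tree -> Maybe String
toyLinearize gr lang t = lookup lang gr >>= lookup t

-- All (language, trees) pairs for which the string has at least one parse.
toyParseAllLang :: ToyGrammar -> String -> [(Language, [Tree])]
toyParseAllLang gr s =
  [ (lang, ts)
  | (lang, table) <- gr
  , let ts = [t | (t, str) <- table, str == s]
  , not (null ts)
  ]

-- Same control flow as translate: parse, take the first tree, and
-- linearize it in every language other than the source language.
toyTranslate :: ToyGrammar -> String -> String
toyTranslate gr s = case toyParseAllLang gr s of
  (lg, t : _) : _ ->
    unlines [out | l <- toyLanguages gr, l /= lg, Just out <- [toyLinearize gr l t]]
  _ -> "NO PARSE"

food :: ToyGrammar
food =
  [ ("FoodEng", [("ThisWine", "this wine"), ("ThatCheese", "that cheese")])
  , ("FoodIta", [("ThisWine", "questo vino"), ("ThatCheese", "quel formaggio")])
  ]
```

<P>
With this table, <CODE>toyTranslate food "this wine"</CODE> returns the single line
<CODE>questo vino</CODE>, and <CODE>toyTranslate food "quel formaggio"</CODE> returns
<CODE>that cheese</CODE>: the source language is detected by parsing and is left out of the output.
</P>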
<A NAME="toc137"></A>
<H3>A looping translator</H3>
<P>
If the user wants to translate many expressions in a sequence, it
is cumbersome to have to start the translator over and over again,
because reading the grammar and building the parser always takes
time. The translator of the previous section is easy to modify
to enable this: just change <CODE>interact</CODE> in the main function to
<CODE>loop</CODE>. It is not a standard Haskell function, so its definition has
to be included:
</P>
<PRE>
loop :: (String -&gt; String) -&gt; IO ()
loop trans = do
  s &lt;- getLine
  if s == "quit" then putStrLn "bye" else do
    putStrLn $ trans s
    loop trans
</PRE>
<P>
The loop keeps on translating line by line until the input line
is <CODE>quit</CODE>.
</P>
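<P>
One thing to keep in mind is that <CODE>getLine</CODE> raises an exception when the input ends,
so the loop fails if its input comes from a file without a final <CODE>quit</CODE> line.
A possible variant, given here only as a sketch with the hypothetical names
<CODE>translateLines</CODE> and <CODE>loopLazy</CODE>, separates the pure line-by-line
processing from the input and output:
</P>

```haskell
-- Pure core of the loop: translate each line up to, but not including,
-- the first line that reads "quit".
translateLines :: (String -> String) -> [String] -> [String]
translateLines trans = map trans . takeWhile (/= "quit")

-- A filter-style main loop built on interact. It stops at "quit" or at
-- the end of the input, whichever comes first. Unlike loop, it does not
-- print "bye".
loopLazy :: (String -> String) -> IO ()
loopLazy trans = interact (unlines . translateLines trans . lines)
```

<P>
The pure function is easy to test on its own: <CODE>translateLines reverse ["ab","cd","quit","ef"]</CODE>
evaluates to <CODE>["ba","dc"]</CODE>. How promptly the replies appear in interactive use
depends on the buffering of the terminal.
</P>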
<A NAME="toc138"></A>
<H3>A question-answer system</H3>
<P>
<a name="secmathprogram"></a>
</P>
<P>
The next application is also a translator, but it adds a
<B>transfer</B> component to the grammar. Transfer is a function that
takes the input syntax tree into some other syntax tree, which is
then linearized and shown back to the user. The transfer function we
are going to use is one that computes a question into an answer.
The program accepts simple questions about arithmetic and answers
"yes" or "no" in the language in which the question was asked:
</P>
<PRE>
Is 123 prime?
No.
77 est impair ?
Oui.
</PRE>
<P>
The main change that is needed to the pure translator is to give
the type of <CODE>translate</CODE> an extra argument: a transfer function.
</P>
<PRE>
translate :: (Tree -&gt; Tree) -&gt; MultiGrammar -&gt; String -&gt; String
</PRE>
<P>
You can think of ordinary translation as a special case where
transfer is the identity function (<CODE>id</CODE> in Haskell).
</P>
<P>
Also the behaviour of returning the reply in different languages
should be changed so that the reply is returned in the <I>same</I> language.
Here is the complete definition of <CODE>translate</CODE> in the new form.
</P>
<PRE>
translate tr gr s = case parseAllLang gr (startCat gr) s of
  (lg,t:_):_ -&gt; linearize gr lg (tr t)
  _ -&gt; "NO PARSE"
</PRE>
<P>
To complete the system, we have to define the transfer function.
So, how can we write a function from abstract syntax trees
to abstract syntax trees? The embedded API does not make
the constructors of the type <CODE>Tree</CODE> available for users. Even if it did, it would
be quite complicated to use the type, and programs would be likely
to produce trees that are ill-typed in GF and therefore cannot
be linearized.
</P>
<A NAME="toc139"></A>
<H3>Exporting GF datatypes</H3>
<P>
The way to go in defining transfer is to use GF's tree constructors, that
is, the <CODE>fun</CODE> functions, as if they were Haskell's data constructors.
There is enough resemblance between GF and Haskell to make this possible
in most cases. It is even possible in Java, as we shall see later.
</P>
<P>
Thus every category of GF is translated into a Haskell datatype, where the
functions producing a value of that category are treated as constructors.
The translation is obtained by using the batch compiler with the command
</P>
<PRE>
% gfc -haskell Food.gfcc
</PRE>
<P>
It is also possible to produce the Haskell file together with GFCC, by
</P>
<PRE>
% gfc --make -haskell FoodEng.gf FoodIta.gf
</PRE>
<P>
The result is a file named <CODE>GSyntax.hs</CODE>, containing a
module named <CODE>GSyntax</CODE>.
</P>
<P>
<I>In GF before version 3, the same result is obtained from within GF, by the command</I>
</P>
<PRE>
&gt; print_grammar -printer=gfcc_haskell | write_file GSyntax.hs
</PRE>
<P></P>
<P>
As an example, we take
the grammar we are going to use for queries. The abstract syntax is
</P>
<PRE>
abstract Math = {
  flags startcat = Question ;
  cat Answer ; Question ; Object ;
  fun
    Even   : Object -&gt; Question ;
    Odd    : Object -&gt; Question ;
    Prime  : Object -&gt; Question ;
    Number : Int -&gt; Object ;
    Yes : Answer ;
    No  : Answer ;
}
</PRE>
<P>
It is translated to the following system of datatypes:
</P>
<PRE>
newtype GInt = GInt Integer

data GAnswer =
   GYes
 | GNo

data GObject = GNumber GInt

data GQuestion =
   GPrime GObject
 | GOdd GObject
 | GEven GObject
</PRE>
<P>
All type and constructor names are prefixed with a <CODE>G</CODE> to prevent clashes.
</P>
<P>
Now it is possible to define functions from and to these datatypes, in Haskell.
Haskell's type checker guarantees that the functions are well-typed also with
respect to GF. Here is a question-to-answer function for this language:
</P>
<PRE>
answer :: GQuestion -&gt; GAnswer
answer p = case p of
  GOdd x -&gt; test odd x
  GEven x -&gt; test even x
  GPrime x -&gt; test prime x

value :: GObject -&gt; Int
value e = case e of
  GNumber (GInt i) -&gt; fromInteger i

test :: (Int -&gt; Bool) -&gt; GObject -&gt; GAnswer
test f x = if f (value x) then GYes else GNo
</PRE>
<P>
So it is the function <CODE>answer</CODE> that we want to apply as transfer.
The only problem is the <I>type</I> of this function: the parsing and
linearization methods of <CODE>API</CODE> work with <CODE>Tree</CODE>s and not
with <CODE>GQuestion</CODE>s and <CODE>GAnswer</CODE>s.
</P>
<P>
Fortunately the Haskell translation of GF takes care of translating
between trees and the generated datatypes. This is done by using
a class with the required translation methods:
</P>
<PRE>
class Gf a where
  gf :: a -&gt; Tree
  fg :: Tree -&gt; a
</PRE>
<P>
The Haskell code generator also generates instances of this class
for each datatype, for example,
</P>
<PRE>
instance Gf GQuestion where
  gf (GEven x1) = DTr [] (AC (CId "Even")) [gf x1]
  gf (GOdd x1) = DTr [] (AC (CId "Odd")) [gf x1]
  gf (GPrime x1) = DTr [] (AC (CId "Prime")) [gf x1]
  fg t =
    case t of
      DTr [] (AC (CId "Even")) [x1] -&gt; GEven (fg x1)
      DTr [] (AC (CId "Odd")) [x1] -&gt; GOdd (fg x1)
      DTr [] (AC (CId "Prime")) [x1] -&gt; GPrime (fg x1)
      _ -&gt; error ("no Question " ++ show t)
</PRE>
<P>
Needless to say, <CODE>GSyntax</CODE> is a module that a programmer
never needs to look into, let alone change: it is enough to know that it
contains a systematic encoding and decoding between an abstract syntax
and Haskell datatypes, where
</P>
<UL>
<LI>all GF names are in Haskell prefixed with <CODE>G</CODE>
<LI><CODE>gf</CODE> translates from Haskell to GF
<LI><CODE>fg</CODE> translates from GF to Haskell
</UL>
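<P>
The round trip between trees and datatypes can be tried out in isolation. The following Haskell
sketch is self-contained and does not use the GF libraries: the type <CODE>Tree</CODE> is a
hypothetical stand-in for the API's tree type, the instances are written by hand rather than
generated, and the grammar is reduced to the <CODE>Odd</CODE> and <CODE>Even</CODE> questions.
</P>

```haskell
-- Hypothetical stand-in for the API's Tree type: a function name applied
-- to subtrees, or an integer literal.
data Tree = App String [Tree] | Lit Integer
  deriving (Eq, Show)

class Gf a where
  gf :: a -> Tree
  fg :: Tree -> a

newtype GInt = GInt Integer
data GObject = GNumber GInt
data GQuestion = GOdd GObject | GEven GObject
data GAnswer = GYes | GNo

instance Gf GInt where
  gf (GInt i) = Lit i
  fg (Lit i) = GInt i
  fg t = error ("no Int " ++ show t)

instance Gf GObject where
  gf (GNumber i) = App "Number" [gf i]
  fg (App "Number" [i]) = GNumber (fg i)
  fg t = error ("no Object " ++ show t)

instance Gf GQuestion where
  gf (GOdd x) = App "Odd" [gf x]
  gf (GEven x) = App "Even" [gf x]
  fg (App "Odd" [x]) = GOdd (fg x)
  fg (App "Even" [x]) = GEven (fg x)
  fg t = error ("no Question " ++ show t)

instance Gf GAnswer where
  gf GYes = App "Yes" []
  gf GNo = App "No" []
  fg (App "Yes" []) = GYes
  fg (App "No" []) = GNo
  fg t = error ("no Answer " ++ show t)

-- A typed function on the datatypes ...
answerParity :: GQuestion -> GAnswer
answerParity q = case q of
  GOdd (GNumber (GInt i))  -> yesNo (odd i)
  GEven (GNumber (GInt i)) -> yesNo (even i)
  where
    yesNo b = if b then GYes else GNo

-- ... wrapped into a function on trees: decode, compute, encode.
transferParity :: Tree -> Tree
transferParity = gf . answerParity . fg
```

<P>
For example, <CODE>transferParity (App "Odd" [App "Number" [Lit 77]])</CODE> evaluates to
<CODE>App "Yes" []</CODE>. The types of <CODE>fg</CODE> and <CODE>gf</CODE> in the composition are
determined by the type of <CODE>answerParity</CODE>, so no annotations are needed.
</P>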
<A NAME="toc140"></A>
<H3>Putting it all together</H3>
<P>
Here is the complete code for the Haskell module <CODE>TransferLoop.hs</CODE>.
</P>
<PRE>
module Main where

import GF.GFCC.API
import TransferDef (transfer)

main :: IO ()
main = do
  gr &lt;- file2grammar "Math.gfcc"
  loop (translate transfer gr)

loop :: (String -&gt; String) -&gt; IO ()
loop trans = do
  s &lt;- getLine
  if s == "quit" then putStrLn "bye" else do
    putStrLn $ trans s
    loop trans

translate :: (Tree -&gt; Tree) -&gt; MultiGrammar -&gt; String -&gt; String
translate tr gr s = case parseAllLang gr (startCat gr) s of
  (lg,t:_):_ -&gt; linearize gr lg (tr t)
  _ -&gt; "NO PARSE"
</PRE>
<P>
This is the <CODE>Main</CODE> module, which just needs a function <CODE>transfer</CODE> from
<CODE>TransferDef</CODE> in order to compile. In the current application, this module
looks as follows:
</P>
<PRE>
module TransferDef where

import GF.GFCC.API (Tree)
import GSyntax

transfer :: Tree -&gt; Tree
transfer = gf . answer . fg

answer :: GQuestion -&gt; GAnswer
answer p = case p of
  GOdd x -&gt; test odd x
  GEven x -&gt; test even x
  GPrime x -&gt; test prime x

value :: GObject -&gt; Int
value e = case e of
  GNumber (GInt i) -&gt; fromInteger i

test :: (Int -&gt; Bool) -&gt; GObject -&gt; GAnswer
test f x = if f (value x) then GYes else GNo

prime :: Int -&gt; Bool
prime x = elem x primes where
  primes = sieve [2 .. x]
  sieve (p:xs) = p : sieve [ n | n &lt;- xs, n `mod` p &gt; 0 ]
  sieve [] = []
</PRE>
<P>
This module, in turn, needs <CODE>GSyntax</CODE> to compile, and the main module
needs <CODE>Math.gfcc</CODE> to run. To automate the production of the system,
we write a <CODE>Makefile</CODE> as follows:
</P>
<PRE>
all:
	gfc --make -haskell MathEng.gf MathFre.gf
	ghc --make -o ./math TransferLoop.hs
	strip math
</PRE>
<P>
(Notice that the empty segments starting the command lines in a Makefile must be tabs.)
Now we can compile the whole system by just typing
</P>
<PRE>
make
</PRE>
<P>
Then you can run it by typing
</P>
<PRE>
./math
</PRE>
<P>
Well --- you will of course need some concrete syntaxes of <CODE>Math</CODE> in order
to succeed. We have defined ours by using the resource functor design pattern,
as explained <a href="#secfunctor">here</a>.
</P>
<P>
Just to summarize, the source of the application consists of the following files:
</P>
<PRE>
Makefile -- a makefile
Math.gf -- abstract syntax
Math???.gf -- concrete syntaxes
TransferDef.hs -- definition of question-to-answer function
TransferLoop.hs -- Haskell Main module
</PRE>
<P></P>
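<P>
The function <CODE>prime</CODE> in <CODE>TransferDef</CODE> builds the list of all primes up
to its argument for every question, which becomes slow for large numbers. If this matters,
a test by trial division could be used instead. The function <CODE>primeTD</CODE> below is a
hypothetical replacement, not part of the application as described:
</P>

```haskell
-- Trial division: n is prime if it is at least 2 and has no divisor d
-- with 2 <= d and d * d <= n.
primeTD :: Int -> Bool
primeTD n =
  n >= 2 && all (\d -> n `mod` d /= 0) (takeWhile (\d -> d * d <= n) [2 ..])
```

<P>
It agrees with the example dialogue: <CODE>primeTD 123</CODE> is <CODE>False</CODE>, since
123 = 3 * 41.
</P>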
<A NAME="toc141"></A>
<H2>Embedded GF applications in Java</H2>
<P>
When an API for GFCC in Java is available,
we will write the same applications in Java as
were written in Haskell above. Until then, we will
build another kind of application, which does not require
modification of generated Java code.
</P>
<P>
More information on embedded GF grammars in Java can be found in the document
</P>
<PRE>
www.cs.chalmers.se/~bringert/gf/gf-java.html
</PRE>
<P>
by Bj&ouml;rn Bringert.
</P>
<A NAME="toc142"></A>
<H3>Translets</H3>
<P>
A Java system needs many more files than a Haskell system.
To get started, one can fetch the package <CODE>gfc2java</CODE> from
</P>
<PRE>
www.cs.chalmers.se/~bringert/darcs/gfc2java/
</PRE>
<P>
by using the Darcs version control system as described in the <CODE>gf-java</CODE> home page.
</P>
<P>
The <CODE>gfc2java</CODE> package contains a script <CODE>build-translet</CODE>, which can be applied
to any <CODE>.gfcm</CODE> file to create a <B>translet</B>, a small translation GUI. For the <CODE>Food</CODE>
grammars of <a href="#chapthree">the third chapter</a>, we first create a file <CODE>food.gfcm</CODE> by
</P>
<PRE>
% echo "pm | wf food.gfcm" | gf FoodEng.gf FoodIta.gf
</PRE>
<P>
and then run
</P>
<PRE>
% build_translet food.gfcm
</PRE>
<P>
The resulting file <CODE>translate-food.jar</CODE> can be run with
</P>
<PRE>
% java -jar translate-food.jar
</PRE>
<P>
The translet looks like this:
</P>
<P>
<IMG ALIGN="right" SRC="food-translet.png" BORDER="0" ALT="">
</P>
<A NAME="toc143"></A>
<H3>Dialogue systems</H3>
<P>
A question-answer system is a special case of a <B>dialogue system</B>, where the user and
the computer communicate by writing or, even more properly, by speech. The <CODE>gf-java</CODE>
homepage provides an example of the simplest dialogue system imaginable, where
the conversation has just two rules:
</P>
<UL>
<LI>if the user says <I>here you go</I>, the system says <I>thanks</I>
<LI>if the user says <I>thanks</I>, the system says <I>you are welcome</I>
</UL>
<P>
The conversation can be made in both English and Swedish; the user's initiative
decides which language the system replies in. Thus the structure is very similar
to the <CODE>math</CODE> program <a href="#secmathprogram">here</a>. The GF and
Java sources of the program can be
found in
</P>
<PRE>
www.cs.chalmers.se/~bringert/darcs/simpledemo
</PRE>
<P>
again accessible with the Darcs version control system.
</P>
<A NAME="toc144"></A>
<H2>Language models for speech recognition</H2>
<P>
The standard way of using GF in speech recognition is by building
<B>grammar-based language models</B>. To this end, GF comes with compilers
into several formats that are used in speech recognition systems.
One such format is GSL, used in the <A HREF="http://www.nuance.com">Nuance speech recognizer</A>.
It is produced from GF simply by printing a grammar with the flag
<CODE>-printer=gsl</CODE>. The following example uses the smart house grammar defined
<a href="#secsmarthouse">here</a>.
</P>
<PRE>
&gt; import -conversion=finite SmartEng.gf
&gt; print_grammar -printer=gsl
;GSL2.0
; Nuance speech recognition grammar for SmartEng
; Generated by GF
.MAIN SmartEng_2
SmartEng_0 [("switch" "off") ("switch" "on")]
SmartEng_1 ["dim" ("switch" "off")
("switch" "on")]
SmartEng_2 [(SmartEng_0 SmartEng_3)
(SmartEng_1 SmartEng_4)]
SmartEng_3 ("the" SmartEng_5)
SmartEng_4 ("the" SmartEng_6)
SmartEng_5 "fan"
SmartEng_6 "light"
</PRE>
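<P>
To see which utterances the grammar printed above accepts, its rules can be transcribed by
hand and expanded. The Haskell sketch below does this under our own assumptions: the datatype
<CODE>Rhs</CODE> and the functions <CODE>expand</CODE> and <CODE>sentences</CODE> are hypothetical,
and the reading of GSL parentheses as sequences and square brackets as alternatives is taken
from the shape of the output.
</P>

```haskell
-- Hypothetical encoding of the GSL rules: a right-hand side is a
-- terminal word, a nonterminal, a sequence (...), or an alternative [...].
data Rhs
  = Word String
  | NT String
  | Seq [Rhs]
  | Alt [Rhs]

type Grammar = [(String, Rhs)]

smartEng :: Grammar
smartEng =
  [ ("SmartEng_0", Alt [Seq [Word "switch", Word "off"], Seq [Word "switch", Word "on"]])
  , ("SmartEng_1", Alt [Word "dim", Seq [Word "switch", Word "off"], Seq [Word "switch", Word "on"]])
  , ("SmartEng_2", Alt [Seq [NT "SmartEng_0", NT "SmartEng_3"], Seq [NT "SmartEng_1", NT "SmartEng_4"]])
  , ("SmartEng_3", Seq [Word "the", NT "SmartEng_5"])
  , ("SmartEng_4", Seq [Word "the", NT "SmartEng_6"])
  , ("SmartEng_5", Word "fan")
  , ("SmartEng_6", Word "light")
  ]

-- All word sequences derivable from a right-hand side. This terminates
-- only because the grammar above is not recursive.
expand :: Grammar -> Rhs -> [[String]]
expand g rhs = case rhs of
  Word w -> [[w]]
  NT n   -> maybe [] (expand g) (lookup n g)
  Seq rs -> map concat (mapM (expand g) rs)
  Alt rs -> concatMap (expand g) rs

sentences :: Grammar -> String -> [String]
sentences g start = map unwords (expand g (NT start))
```

<P>
Expanding the main category with <CODE>sentences smartEng "SmartEng_2"</CODE> gives five
utterances: <I>switch off the fan</I>, <I>switch on the fan</I>, <I>dim the light</I>,
<I>switch off the light</I> and <I>switch on the light</I>. The fan can be switched but not
dimmed, because the fan and the light are reached through different nonterminals.
</P>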
<P>
Other formats available via the <CODE>-printer</CODE> flag include:
</P>
<TABLE ALIGN="center" CELLPADDING="4" BORDER="1">
<TR>
<TH>Format</TH>
<TH COLSPAN="2">Description</TH>
</TR>
<TR>
<TD><CODE>gsl</CODE></TD>
<TD>Nuance GSL speech recognition grammar</TD>
</TR>
<TR>
<TD><CODE>jsgf</CODE></TD>
<TD>Java Speech Grammar Format (JSGF)</TD>
</TR>
<TR>
<TD><CODE>jsgf_sisr_old</CODE></TD>
<TD>JSGF with semantic tags in SISR WD 20030401 format</TD>
</TR>
<TR>
<TD><CODE>srgs_abnf</CODE></TD>
<TD>SRGS ABNF format</TD>
</TR>
<TR>
<TD><CODE>srgs_xml</CODE></TD>
<TD>SRGS XML format</TD>
</TR>
<TR>
<TD><CODE>srgs_xml_prob</CODE></TD>
<TD>SRGS XML format, with weights</TD>
</TR>
<TR>
<TD><CODE>slf</CODE></TD>
<TD>finite automaton in the HTK SLF format</TD>
</TR>
<TR>
<TD><CODE>slf_sub</CODE></TD>
<TD>finite automaton with sub-automata in HTK SLF</TD>
</TR>
</TABLE>
<P></P>
<P>
All currently available formats can be seen in gf with <CODE>help -printer</CODE>.
</P>
<A NAME="toc145"></A>
<H2>Dependent types and spoken language models</H2>
<P>
We have used dependent types to control semantic well-formedness
in grammars. This is important in traditional type theory
applications such as proof assistants, where only mathematically
meaningful formulas should be constructed. But semantic filtering has
also proved important in speech recognition, because it reduces the
ambiguity of the results.
</P>
<P>
Now, GSL is a context-free format, so how does it cope with dependent types?
In general, dependent types can give rise to infinitely many basic types
(exercise!), whereas a context-free grammar can by definition only have
finitely many nonterminals.
</P>
<P>
This is where the flag <CODE>-conversion=finite</CODE> is needed in the <CODE>import</CODE>
command. Its effect is to convert a GF grammar with dependent types to
one without, so that each instance of a dependent type is replaced by
an atomic type. This can then be used as a nonterminal in a context-free
grammar. The <CODE>finite</CODE> conversion presupposes that every
dependent type has only finitely many instances, which is in fact
the case in the <CODE>Smart</CODE> grammar.
</P>
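<P>
The principle of the conversion can be sketched in Haskell as follows. This is an illustration
of the idea only, not the algorithm used in GF: the function names, the naming scheme with
underscores, and the category names in the example are assumptions made here.
</P>

```haskell
-- One atomic nonterminal name per instance of a dependent category,
-- e.g. "Action" applied to "fan" becomes "Action_fan".
atomic :: String -> String -> String
atomic cat arg = cat ++ "_" ++ arg

-- A rule schema: a left-hand side and right-hand side categories that
-- all depend on the same argument.
data Schema = Schema String [String]

-- Copy the schema once for every value of the argument. The result is
-- a finite list of rules exactly when the list of values is finite.
instantiate :: [String] -> Schema -> [(String, [String])]
instantiate values (Schema lhs rhs) =
  [(lhs, [atomic c v | c <- rhs]) | v <- values]
```

<P>
For instance, <CODE>instantiate ["fan","light"] (Schema "Command" ["Action","Device"])</CODE>
produces one context-free rule for fans and one for lights. If the argument type had infinitely
many values, the list of rules would never end, which is why the conversion requires finiteness.
</P>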
<P>
<B>Exercise</B>. If you have access to the Nuance speech recognizer,
test it with GF-generated language models for <CODE>SmartEng</CODE>. Do this
both with and without <CODE>-conversion=finite</CODE>.
</P>
<P>
<B>Exercise</B>. Construct an abstract syntax with infinitely many instances
of dependent types.
</P>
<A NAME="toc146"></A>
<H3>Statistical language models</H3>
<P>
An alternative to grammar-based language models is
<B>statistical language models</B> (<B>SLM</B>s). An SLM is
built from a <B>corpus</B>, i.e. a set of utterances. It specifies the
probability of each <B>n-gram</B>, i.e. sequence of <I>n</I> words. The
typical value of <I>n</I> is 2 (bigrams) or 3 (trigrams).
</P>
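<P>
As a concrete illustration, bigram probabilities can be estimated from a corpus by counting.
The Haskell sketch below uses plain relative frequencies; the function names are hypothetical,
and practical SLMs additionally smooth the estimates so that unseen n-grams receive a nonzero
probability.
</P>

```haskell
import qualified Data.Map as Map

type Bigram = (String, String)

-- All adjacent word pairs of one utterance.
bigrams :: String -> [Bigram]
bigrams s = zip ws (drop 1 ws)
  where ws = words s

-- Count the bigrams over a corpus of utterances.
bigramCounts :: [String] -> Map.Map Bigram Int
bigramCounts corpus =
  Map.fromListWith (+) [(b, 1) | u <- corpus, b <- bigrams u]

-- Relative-frequency estimate of P(w2 | w1): the count of (w1, w2)
-- divided by the count of all bigrams that start with w1.
bigramProb :: [String] -> Bigram -> Double
bigramProb corpus (w1, w2)
  | total == 0 = 0
  | otherwise  = fromIntegral (Map.findWithDefault 0 (w1, w2) counts) / fromIntegral total
  where
    counts = bigramCounts corpus
    total = sum [c | ((v, _), c) <- Map.toList counts, v == w1]
```

<P>
With the corpus <CODE>["switch on the light", "switch off the light", "dim the light"]</CODE>,
the word <I>switch</I> is followed by <I>on</I> in one of its two occurrences, so
<CODE>bigramProb corpus ("switch","on")</CODE> is 0.5, while
<CODE>bigramProb corpus ("the","light")</CODE> is 1.
</P>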
<P>
One advantage of SLMs over grammar-based models is that they are
<B>robust</B>, i.e. they can be used to recognize sequences that would
be out of the grammar or the corpus. Another advantage is that
an SLM can be built "for free" if a corpus is available.
</P>
<P>
However, collecting a corpus can require a lot of work, and writing
a grammar can be less demanding, especially with tools such as GF or
Regulus. This advantage of grammars can be combined with robustness
by creating a back-up SLM from a <B>synthesized corpus</B>. This means
simply that the grammar is used for generating such a corpus.
In GF, this can be done with the <CODE>generate_trees</CODE> command.
As with grammar-based models, the quality of the SLM is better
if meaningless utterances are excluded from the corpus. Thus
a good way to generate an SLM from a GF grammar is by using
dependent types and filtering the results through the type checker:
</P>
<PRE>
&gt; generate_trees | put_trees -transform=solve | linearize
</PRE>
<P>
The method of creating statistical language models from corpora synthesized
from GF grammars is applied and evaluated in (Jonson 2006).
</P>
<P>
<B>Exercise</B>. Measure the size of the corpus generated from
<CODE>SmartEng</CODE> (defined <a href="#secsmarthouse">here</a>), with and without type checker filtering.
</P>
<!-- html code generated by txt2tags 2.3 (http://txt2tags.sf.net) -->
<!-- cmdline: txt2tags -thtml -\-toc gf-tutorial.txt -->
</BODY></HTML>