forked from GitHub/gf-core
7953 lines
259 KiB
HTML
7953 lines
259 KiB
HTML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
|
||
<HTML>
|
||
<HEAD>
|
||
<META NAME="generator" CONTENT="http://txt2tags.sf.net">
|
||
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
|
||
<TITLE>Grammatical Framework Tutorial</TITLE>
|
||
</HEAD><BODY BGCOLOR="white" TEXT="black">
|
||
<P ALIGN="center"><CENTER><H1>Grammatical Framework Tutorial</H1>
|
||
<FONT SIZE="4">
|
||
<I>Aarne Ranta</I><BR>
|
||
Draft, November 2007
|
||
</FONT></CENTER>
|
||
|
||
<P></P>
|
||
<HR NOSHADE SIZE=1>
|
||
<P></P>
|
||
<UL>
|
||
<LI><A HREF="#toc1">Getting started with GF</A>
|
||
<UL>
|
||
<LI><A HREF="#toc2">What GF is</A>
|
||
<LI><A HREF="#toc3">Getting the GF system</A>
|
||
<LI><A HREF="#toc4">Running the GF system</A>
|
||
<LI><A HREF="#toc5">A "Hello World" grammar</A>
|
||
<UL>
|
||
<LI><A HREF="#toc6">The program: abstract syntax and concrete syntaxes</A>
|
||
<LI><A HREF="#toc7">Using the grammar in the GF system</A>
|
||
</UL>
|
||
<LI><A HREF="#toc8">Using grammars from outside GF</A>
|
||
<LI><A HREF="#toc9">What else can be done with the grammar</A>
|
||
<LI><A HREF="#toc10">Summary of GF language features</A>
|
||
<UL>
|
||
<LI><A HREF="#toc11">Modules</A>
|
||
<LI><A HREF="#toc12">Judgements</A>
|
||
<LI><A HREF="#toc13">Types and terms</A>
|
||
<LI><A HREF="#toc14">Type checking</A>
|
||
</UL>
|
||
</UL>
|
||
<LI><A HREF="#toc15">Designing a grammar for complex phrases</A>
|
||
<UL>
|
||
<LI><A HREF="#toc16">The abstract syntax Food</A>
|
||
<LI><A HREF="#toc17">The concrete syntax FoodEng</A>
|
||
<LI><A HREF="#toc18">Commands for testing grammars</A>
|
||
<UL>
|
||
<LI><A HREF="#toc19">Generating trees and strings</A>
|
||
<LI><A HREF="#toc20">More on pipes; tracing</A>
|
||
<LI><A HREF="#toc21">Writing and reading files</A>
|
||
<LI><A HREF="#toc22">Visualizing trees</A>
|
||
<LI><A HREF="#toc23">System commands</A>
|
||
</UL>
|
||
<LI><A HREF="#toc24">An Italian concrete syntax</A>
|
||
<LI><A HREF="#toc25">Free variation</A>
|
||
<LI><A HREF="#toc26">More application of multilingual grammars</A>
|
||
<UL>
|
||
<LI><A HREF="#toc27">Multilingual treebanks</A>
|
||
<LI><A HREF="#toc28">Translation session</A>
|
||
<LI><A HREF="#toc29">Translation quiz</A>
|
||
<LI><A HREF="#toc30">Multilingual syntax editing</A>
|
||
</UL>
|
||
<LI><A HREF="#toc31">Context-free grammars and GF</A>
|
||
<UL>
|
||
<LI><A HREF="#toc32">The "cf" grammar format</A>
|
||
<LI><A HREF="#toc33">Restrictions of context-free grammars</A>
|
||
</UL>
|
||
<LI><A HREF="#toc34">Modules and files</A>
|
||
<LI><A HREF="#toc35">Using operations and resource modules</A>
|
||
<UL>
|
||
<LI><A HREF="#toc36">The golden rule of functional programming</A>
|
||
<LI><A HREF="#toc37">Operation definitions</A>
|
||
<LI><A HREF="#toc38">The ``resource`` module type</A>
|
||
<LI><A HREF="#toc39">Opening a resource</A>
|
||
<LI><A HREF="#toc40">Partial application</A>
|
||
<LI><A HREF="#toc41">Testing resource modules</A>
|
||
</UL>
|
||
<LI><A HREF="#toc42">Grammar architecture</A>
|
||
<UL>
|
||
<LI><A HREF="#toc43">Extending a grammar</A>
|
||
<LI><A HREF="#toc44">Multiple inheritance</A>
|
||
<LI><A HREF="#toc45">Visualizing module structure</A>
|
||
</UL>
|
||
<LI><A HREF="#toc46">Summary of GF language features</A>
|
||
<UL>
|
||
<LI><A HREF="#toc47">Modules</A>
|
||
<LI><A HREF="#toc48">Judgements</A>
|
||
<LI><A HREF="#toc49">Free variation</A>
|
||
<LI><A HREF="#toc50">The context-free grammar format</A>
|
||
<LI><A HREF="#toc51">Character encoding</A>
|
||
</UL>
|
||
</UL>
|
||
<LI><A HREF="#toc52">Grammars with parameters</A>
|
||
<UL>
|
||
<LI><A HREF="#toc53">The problem: words have to be inflected</A>
|
||
<LI><A HREF="#toc54">Parameters and tables</A>
|
||
<LI><A HREF="#toc55">Inflection tables and paradigms</A>
|
||
<LI><A HREF="#toc56">Using parameters in concrete syntax</A>
|
||
<UL>
|
||
<LI><A HREF="#toc57">Agreement</A>
|
||
<LI><A HREF="#toc58">Determiners</A>
|
||
<LI><A HREF="#toc59">Parametric vs. inherent features</A>
|
||
</UL>
|
||
<LI><A HREF="#toc60">An English concrete syntax for Foods with parameters</A>
|
||
<LI><A HREF="#toc61">More on inflection paradigms</A>
|
||
<UL>
|
||
<LI><A HREF="#toc62">Worst-case functions</A>
|
||
<LI><A HREF="#toc63">Intelligent paradigms</A>
|
||
<LI><A HREF="#toc64">Function types with variables</A>
|
||
<LI><A HREF="#toc65">Separating operation types and definitions</A>
|
||
<LI><A HREF="#toc66">Overloading of operations</A>
|
||
<LI><A HREF="#toc67">Morphological analysis and morphology quiz</A>
|
||
</UL>
|
||
<LI><A HREF="#toc68">The Italian Foods grammar</A>
|
||
<LI><A HREF="#toc69">Discontinuous constituents</A>
|
||
<LI><A HREF="#toc70">Strings at compile time vs. run time</A>
|
||
<LI><A HREF="#toc71">Summary of GF language features</A>
|
||
<UL>
|
||
<LI><A HREF="#toc72">Parameter and table types</A>
|
||
<LI><A HREF="#toc73">Pattern matching</A>
|
||
<LI><A HREF="#toc74">Overloading</A>
|
||
<LI><A HREF="#toc75">Local definitions</A>
|
||
<LI><A HREF="#toc76">Supplementary constructs</A>
|
||
</UL>
|
||
</UL>
|
||
<LI><A HREF="#toc77">Using the resource grammar library</A>
|
||
<UL>
|
||
<LI><A HREF="#toc78">The coverage of the library</A>
|
||
<LI><A HREF="#toc79">The structure of the library</A>
|
||
<UL>
|
||
<LI><A HREF="#toc80">Lexical vs. phrasal rules</A>
|
||
<LI><A HREF="#toc81">Lexical categories</A>
|
||
<LI><A HREF="#toc82">Lexical rules</A>
|
||
<LI><A HREF="#toc83">Phrasal categories</A>
|
||
</UL>
|
||
<LI><A HREF="#toc84">The resource API</A>
|
||
<LI><A HREF="#toc85">Example: English</A>
|
||
<LI><A HREF="#toc86">Functor implementation of multilingual grammars</A>
|
||
<LI><A HREF="#toc87">Interfaces and instances</A>
|
||
<LI><A HREF="#toc88">Adding languages to a functor implementation</A>
|
||
<LI><A HREF="#toc89">Division of labour revisited</A>
|
||
<LI><A HREF="#toc90">Restricted inheritance</A>
|
||
<LI><A HREF="#toc91">Grammar reuse</A>
|
||
<LI><A HREF="#toc92">Browsing the resource with GF commands</A>
|
||
<LI><A HREF="#toc93">An extended Foods grammar</A>
|
||
<UL>
|
||
<LI><A HREF="#toc94">Abstract syntax</A>
|
||
<LI><A HREF="#toc95">Linearization types</A>
|
||
<LI><A HREF="#toc96">Linearization rules</A>
|
||
</UL>
|
||
<LI><A HREF="#toc97">Tenses</A>
|
||
<LI><A HREF="#toc98">Summary of GF language features</A>
|
||
<UL>
|
||
<LI><A HREF="#toc99">Interfaces and instances</A>
|
||
<LI><A HREF="#toc100">Grammar reuse</A>
|
||
<LI><A HREF="#toc101">Functors</A>
|
||
<LI><A HREF="#toc102">Restricted inheritance</A>
|
||
</UL>
|
||
</UL>
|
||
<LI><A HREF="#toc103">Refining semantics in abstract syntax</A>
|
||
<UL>
|
||
<LI><A HREF="#toc104">GF as a logical framework</A>
|
||
<LI><A HREF="#toc105">Dependent types</A>
|
||
<LI><A HREF="#toc106">Polymorphism</A>
|
||
<UL>
|
||
<LI><A HREF="#toc107">Digression: dependent types in concrete syntax</A>
|
||
</UL>
|
||
<LI><A HREF="#toc108">Proof objects</A>
|
||
<UL>
|
||
<LI><A HREF="#toc109">Proof-carrying documents</A>
|
||
</UL>
|
||
<LI><A HREF="#toc110">Restricted polymorphism</A>
|
||
<LI><A HREF="#toc111">Variable bindings</A>
|
||
<LI><A HREF="#toc112">Semantic definitions</A>
|
||
<LI><A HREF="#toc113">Summary of GF language features</A>
|
||
<UL>
|
||
<LI><A HREF="#toc114">Judgements</A>
|
||
<LI><A HREF="#toc115">Dependent function types</A>
|
||
</UL>
|
||
</UL>
|
||
<LI><A HREF="#toc116">Grammars of formal languages</A>
|
||
<UL>
|
||
<LI><A HREF="#toc117">Arithmetic expressions</A>
|
||
<UL>
|
||
<LI><A HREF="#toc118">Abstract syntax</A>
|
||
<LI><A HREF="#toc119">Concrete syntax: a simple approach</A>
|
||
</UL>
|
||
<LI><A HREF="#toc120">Lexing and unlexing</A>
|
||
<LI><A HREF="#toc121">Precedence and fixity</A>
|
||
<LI><A HREF="#toc122">Code generation as linearization</A>
|
||
<LI><A HREF="#toc123">Speaking aloud arithmetic expressions</A>
|
||
<LI><A HREF="#toc124">Programs with variables</A>
|
||
<UL>
|
||
<LI><A HREF="#toc125">The concrete syntax of assignments</A>
|
||
<LI><A HREF="#toc126">A liberal syntax of variables</A>
|
||
</UL>
|
||
<LI><A HREF="#toc127">Conclusion</A>
|
||
<LI><A HREF="#toc128">Summary of GF language constructs</A>
|
||
<UL>
|
||
<LI><A HREF="#toc129">Lexers and unlexers</A>
|
||
<LI><A HREF="#toc130">Built-in abstract syntax types</A>
|
||
</UL>
|
||
</UL>
|
||
<LI><A HREF="#toc131">Embedded grammars</A>
|
||
<UL>
|
||
<LI><A HREF="#toc132">The portable grammar format</A>
|
||
<LI><A HREF="#toc133">The embedded interpreter and its API</A>
|
||
<LI><A HREF="#toc134">Embedded GF applications in Haskell</A>
|
||
<UL>
|
||
<LI><A HREF="#toc135">The EmbedAPI module</A>
|
||
<LI><A HREF="#toc136">First application: a translator</A>
|
||
<LI><A HREF="#toc137">A looping translator</A>
|
||
<LI><A HREF="#toc138">A question-answer system</A>
|
||
<LI><A HREF="#toc139">Exporting GF datatypes</A>
|
||
<LI><A HREF="#toc140">Putting it all together</A>
|
||
</UL>
|
||
<LI><A HREF="#toc141">Embedded GF applications in Java</A>
|
||
<UL>
|
||
<LI><A HREF="#toc142">Translets</A>
|
||
<LI><A HREF="#toc143">Dialogue systems</A>
|
||
</UL>
|
||
<LI><A HREF="#toc144">Language models for speech recognition</A>
|
||
<LI><A HREF="#toc145">Dependent types and spoken language models</A>
|
||
<UL>
|
||
<LI><A HREF="#toc146">Statistical language models</A>
|
||
</UL>
|
||
</UL>
|
||
</UL>
|
||
|
||
<P></P>
|
||
<HR NOSHADE SIZE=1>
|
||
<P></P>
|
||
<P>
|
||
<h2>Overview</h2>
|
||
</P>
|
||
<P>
|
||
This tutorial gives a hands-on introduction to grammar writing in GF.
|
||
It has been written for all programmers
|
||
who want to learn to write grammars in GF.
|
||
It will go through the programming concepts of GF, and also
|
||
explain, without presupposing them, the main ingredients of GF:
|
||
linguistics, functional programming, and type theory.
|
||
This knowledge will be introduced as a part of grammar writing
|
||
practice.
|
||
Thus the tutorial should be accessible to anyone who has some
|
||
previous experience from any programming language; the basics
|
||
of using computers are also presupposed, e.g. the use of
|
||
text editors and the management of files.
|
||
</P>
|
||
<P>
|
||
We start in <a href="#chaptwo">the second chapter</a>
|
||
by building a "Hello World" grammar, which covers greetings
|
||
in three languages: English (<I>hello world</I>),
|
||
Finnish (<I>terve maailma</I>), and Italian (<I>ciao mondo</I>).
|
||
This <B>multilingual grammar</B> is based on the most central idea of GF:
|
||
the distinction between <B>abstract syntax</B>
|
||
(the logical structure) and <B>concrete syntax</B> (the
|
||
sequence of words).
|
||
</P>
|
||
<P>
|
||
From the "Hello World" example, we proceed
|
||
in <a href="#chapthree">the third chapter</a>
|
||
to a larger grammar for the domain of food.
|
||
In this grammar, you can say things like
|
||
<center>
|
||
<I>this Italian cheese is delicious</I>
|
||
</center>
|
||
in English and Italian. This grammar illustrates how translation is
|
||
more than just replacement of words. For instance, the order of
|
||
words may have to be changed:
|
||
<center>
|
||
<I>Italian cheese</I>
|
||
</P>
|
||
<P>
|
||
<I>formaggio italiano</I>
|
||
</center>
|
||
Moreover, words can have different forms, and which forms
|
||
they have vary from language to language. For instance,
|
||
Italian adjectives usually have four forms where English
|
||
has just one:
|
||
<center>
|
||
<I>delicious</I> (<I>wine, wines, pizza, pizzas</I>)
|
||
</P>
|
||
<P>
|
||
<I>vino delizioso, vini deliziosi, pizza deliziosa, pizze deliziose</I>
|
||
</center>
|
||
The <B>morphology</B> of a language describes the
|
||
forms of its words, and the basics of implementing morphology and
|
||
integrating it with syntax are covered in <a href="#chaptwo">the fourth chapter</a>.
|
||
</P>
|
||
<P>
|
||
The complete description of morphology and syntax in natural
|
||
languages is in GF preferably left to the <B>resource grammar library</B>.
|
||
Its use is therefore an important part of GF programming, and
|
||
it is covered in <a href="#chapfive">the fifth chapter</a>. How to contribute to resource
|
||
grammars as an author will only be covered in Part III;
|
||
however, the tutorial does go through all the
|
||
programming concepts of GF, including those involved in
|
||
resource grammars.
|
||
</P>
|
||
<P>
|
||
In addition to multilinguality, <B>semantics</B> is an important aspect of GF
|
||
grammars. The "purely linguistic" aspects (morphology and syntax) belong to
|
||
the concrete syntax part of GF, whereas semantics is expressed in the abstract
|
||
syntax. After the presentation of concrete syntax constructs, we proceed
|
||
in <a href="#chapsix">the sixth chapter</a> to the enrichment of abstract syntax with <B>dependent types</B>,
|
||
<B>variable bindings</B>, and <B>semantic definitions</B>.
|
||
<a href="#chapseven">the seventh chapter</a> concludes the tutorial by technical tips for implementing formal
|
||
languages. It will also illustrate the close relation between GF grammars
|
||
and compilers by actually implementing a small compiler from C-like statements
|
||
and expressions to machine code similar to Java Virtual Machine.
|
||
</P>
|
||
<P>
|
||
English and Italian are used as example languages in many grammars.
|
||
Of course, we will not presuppose that the reader knows any Italian.
|
||
We have chosen Italian because it has a rich structure
|
||
that illustrates very well the capacities of GF.
|
||
Moreover, even those readers who don't know Italian, will find many of
|
||
its words familiar, due to the Latin heritage.
|
||
The exercises will encourage the reader to
|
||
port the examples to other languages as well; in particular,
|
||
it should be instructive for the reader to look at her
|
||
own native language from the point of view of writing a grammar
|
||
implementation.
|
||
</P>
|
||
<P>
|
||
To learn how to write GF grammars is not the only goal of
|
||
this tutorial. We will also explain the most important
|
||
commands of the GF system, mostly in passing. With these commands,
|
||
simple application programs such as translation and
|
||
quiz systems, can be built simply by writing scripts for the
|
||
GF system. More complicated applications, such as natural-language
|
||
interfaces and dialogue systems, moreover require programming in
|
||
some general-purpose language; such applications are covered in <a href="#chapeight">the eighth chapter</a>.
|
||
</P>
|
||
<A NAME="toc1"></A>
|
||
<H1>Getting started with GF</H1>
|
||
<P>
|
||
<a name="chaptwo"></a>
|
||
</P>
|
||
<P>
|
||
In this chapter, we will introduce the GF system and write the first GF grammar,
|
||
a "Hello World" grammar. While extremely small, this grammar already illustrates
|
||
how GF can be used for the tasks of translation and multilingual
|
||
generation.
|
||
</P>
|
||
<A NAME="toc2"></A>
|
||
<H2>What GF is</H2>
|
||
<P>
|
||
We use the term GF for three different things:
|
||
</P>
|
||
<UL>
|
||
<LI>a <B>system</B> (computer program) used for working with grammars
|
||
<LI>a <B>programming language</B> in which grammars can be written
|
||
<LI>a <B>theory</B> about grammars and languages
|
||
</UL>
|
||
|
||
<P>
|
||
The relation between these things is obvious: the GF system is an implementation
|
||
of the GF programming language, which in turn is built on the ideas of the
|
||
GF theory. The main focus of this book is on the GF programming language.
|
||
We learn how grammars are written in this language. At the same time, we learn
|
||
the way of thinking in the GF theory. To make this all useful and fun, and
|
||
to encourage experimenting, we make the grammars run on a computer by
|
||
using the GF system.
|
||
</P>
|
||
<P>
|
||
A GF program is called a <B>grammar</B>. A grammar is, traditionally, a
|
||
definition of a language. From this definition, different language
|
||
processing components can be derived:
|
||
</P>
|
||
<UL>
|
||
<LI><B>parsing</B>: to analyse the language
|
||
<LI><B>linearization</B>: to generate the language
|
||
<LI><B>translation</B>: to analyse one language and generate another
|
||
</UL>
|
||
|
||
<P>
|
||
A GF grammar is thus a declarative program from which these
|
||
procedures can be automatically derived. In general, a GF grammar
|
||
is <B>multilingual</B>: it defines many languages and translations between them.
|
||
</P>
|
||
<A NAME="toc3"></A>
|
||
<H2>Getting the GF system</H2>
|
||
<P>
|
||
The GF system is open-source free software, which can be downloaded via the
|
||
GF Homepage:
|
||
<center>
|
||
<CODE>gf.digitalgrammars.com</CODE>
|
||
</center>
|
||
There you can download
|
||
</P>
|
||
<UL>
|
||
<LI>binaries for Linux, Mac OS X, and Windows
|
||
<LI>source code and documentation
|
||
<LI>grammar libraries and examples
|
||
</UL>
|
||
|
||
<P>
|
||
In particular, many of the examples in this book are included in the
|
||
subdirectory <CODE>examples/tutorial</CODE> of the source distribution package.
|
||
This directory is also available
|
||
<A HREF="http://digitalgrammars.com/gf/examples/tutorial">online</A>.
|
||
</P>
|
||
<P>
|
||
If you want to compile GF from source, you need a Haskell compiler.
|
||
To compile the interactive editor, you also need a Java compilers.
|
||
But normally you don't have to compile anything yourself, and you definitely
|
||
don't need to know Haskell or Java to use GF.
|
||
</P>
|
||
<P>
|
||
We are assuming the availability of a Unix shell. Linux and Mac OS X users
|
||
have it automatically, the latter under the name "terminal".
|
||
Windows users are recommended to install Cywgin, the free Unix shell for Windows.
|
||
</P>
|
||
<A NAME="toc4"></A>
|
||
<H2>Running the GF system</H2>
|
||
<P>
|
||
To start the GF system, assuming you have installed it, just type
|
||
<CODE>gf</CODE> in the Unix (or Cygwin) shell:
|
||
</P>
|
||
<PRE>
|
||
% gf
|
||
</PRE>
|
||
<P>
|
||
You will see GF's welcome message and the prompt <CODE>></CODE>.
|
||
The command
|
||
</P>
|
||
<PRE>
|
||
> help
|
||
</PRE>
|
||
<P>
|
||
will give you a list of available commands.
|
||
</P>
|
||
<P>
|
||
As a common convention in this book, we will use
|
||
</P>
|
||
<UL>
|
||
<LI><CODE>%</CODE> as a prompt that marks system commands
|
||
<LI><CODE>></CODE> as a prompt that marks GF commands
|
||
</UL>
|
||
|
||
<P>
|
||
Thus you should not type these prompts, but only the characters that
|
||
follow them.
|
||
</P>
|
||
<A NAME="toc5"></A>
|
||
<H2>A "Hello World" grammar</H2>
|
||
<P>
|
||
The tradition in programming language tutorials is to start with a
|
||
program that prints "Hello World" on the terminal. GF should be no
|
||
exception. But our program has features that distinguish it from
|
||
most "Hello World" programs:
|
||
</P>
|
||
<UL>
|
||
<LI><B>Multilinguality</B>: the message is printed in many languages.
|
||
<LI><B>Reversibility</B>: in addition to printing, you can <B>parse</B> the
|
||
message and <B>translate</B> it to other languages.
|
||
</UL>
|
||
|
||
<A NAME="toc6"></A>
|
||
<H3>The program: abstract syntax and concrete syntaxes</H3>
|
||
<P>
|
||
A GF program, in general, is a <B>multilingual grammar</B>. Its main parts
|
||
are
|
||
</P>
|
||
<UL>
|
||
<LI>an <B>abstract syntax</B>
|
||
<LI>one or more <B>concrete syntaxes</B>
|
||
</UL>
|
||
|
||
<P>
|
||
The abstract syntax defines, in a language-independent way, what <B>meanings</B>
|
||
can be expressed in the grammar. In the "Hello World" grammar we want
|
||
to express <I>Greetings</I>, where we greet a <I>Recipient</I>, which can be
|
||
<I>World</I> or <I>Mum</I> or <I>Friends</I>. Here is the entire
|
||
GF code for the abstract syntax:
|
||
</P>
|
||
<PRE>
|
||
-- a "Hello World" grammar
|
||
abstract Hello = {
|
||
|
||
flags startcat = Greeting ;
|
||
|
||
cat Greeting ; Recipient ;
|
||
|
||
fun
|
||
Hello : Recipient -> Greeting ;
|
||
World, Mum, Friends : Recipient ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
The code has the following parts:
|
||
</P>
|
||
<UL>
|
||
<LI>a <B>comment</B> (optional), saying what the module is doing
|
||
<LI>a <B>module header</B> indicating that it is an abstract syntax
|
||
module named <CODE>Hello</CODE>
|
||
<LI>a <B>module body</B> in braces, consisting of
|
||
<UL>
|
||
<LI>a <B>startcat flag declaration</B> stating that <CODE>Greeting</CODE> is the
|
||
main category, i.e. the one in which parsing and generation are
|
||
performed by default
|
||
<LI><B>category declarations</B> stating that <CODE>Greeting</CODE> and <CODE>Recipient</CODE>
|
||
are categories, i.e. types of meanings
|
||
<LI><B>function declarations</B> stating what meaning-building functions there
|
||
are; these are the function <CODE>Hello</CODE> constructing a greeting from a recipient,
|
||
as well as the three possible recipients
|
||
</UL>
|
||
</UL>
|
||
|
||
<P>
|
||
A concrete syntax defines a mapping from the abstract meanings to their
|
||
expressions in a language. We first give an English concrete syntax:
|
||
</P>
|
||
<PRE>
|
||
concrete HelloEng of Hello = {
|
||
|
||
lincat Greeting, Recipient = {s : Str} ;
|
||
|
||
lin
|
||
Hello recip = {s = "hello" ++ recip.s} ;
|
||
World = {s = "world"} ;
|
||
Mum = {s = "mum"} ;
|
||
Friends = {s = "friends"} ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
The major parts of this code are:
|
||
</P>
|
||
<UL>
|
||
<LI>a module header indicating that it is a concrete syntax of the abstract syntax
|
||
<CODE>Hello</CODE>, itself named <CODE>HelloEng</CODE>
|
||
<LI>a module body in curly brackets, consisting of
|
||
<UL>
|
||
<LI><B>linearization type definitions</B> stating that
|
||
<CODE>Greeting</CODE> and <CODE>Recipient</CODE> are <B>records</B> with a <B>string</B> <CODE>s</CODE>
|
||
<LI><B>linearization definitions</B> telling what records are assigned to
|
||
each of the meanings defined in the abstract syntax; the recipients are
|
||
linearized to records containing single words, whereas the <CODE>Hello</CODE> greeting
|
||
has a function telling that the word <CODE>hello</CODE> is prefixed to the string
|
||
<CODE>s</CODE> contained in the record <CODE>recip</CODE>
|
||
</UL>
|
||
</UL>
|
||
|
||
<P>
|
||
To make the grammar truly multilingual, we add a Finnish and an Italian concrete
|
||
syntax:
|
||
</P>
|
||
<PRE>
|
||
concrete HelloFin of Hello = {
|
||
lincat Greeting, Recipient = {s : Str} ;
|
||
lin
|
||
Hello recip = {s = "terve" ++ recip.s} ;
|
||
World = {s = "maailma"} ;
|
||
Mum = {s = "<22>iti"} ;
|
||
Friends = {s = "yst<73>v<EFBFBD>t"} ;
|
||
}
|
||
|
||
concrete HelloIta of Hello = {
|
||
lincat Greeting, Recipient = {s : Str} ;
|
||
lin
|
||
Hello recip = {s = "ciao" ++ recip.s} ;
|
||
World = {s = "mondo"} ;
|
||
Mum = {s = "mamma"} ;
|
||
Friends = {s = "amici"} ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
Now we have a trilingual grammar usable for translation and
|
||
many other tasks, which we will now start experimenting with.
|
||
</P>
|
||
<A NAME="toc7"></A>
|
||
<H3>Using the grammar in the GF system</H3>
|
||
<P>
|
||
In order to compile the grammar in GF, each of the four modules
|
||
has to be put into a file named <I>Modulename</I><CODE>.gf</CODE>:
|
||
</P>
|
||
<PRE>
|
||
Hello.gf HelloEng.gf HelloFin.gf HelloIta.gf
|
||
</PRE>
|
||
<P>
|
||
The first GF command needed when using a grammar is to <B>import</B> it.
|
||
The command has a long name, <CODE>import</CODE>, and a short name, <CODE>i</CODE>.
|
||
When you have started GF (by the shell command <CODE>gf</CODE>), you can thus type either
|
||
</P>
|
||
<PRE>
|
||
> import HelloEng.gf
|
||
</PRE>
|
||
<P>
|
||
or
|
||
</P>
|
||
<PRE>
|
||
> i HelloEng.gf
|
||
</PRE>
|
||
<P>
|
||
to get the same effect. In general, all GF commands have a long and a short name;
|
||
short names are convenient when typing commands by hand, whereas long command
|
||
names are more readable in scripts, i.e. files that include sequences of commands.
|
||
</P>
|
||
<P>
|
||
The effect of <CODE>import</CODE> is that the GF system <B>compiles</B> your grammar
|
||
into an internal representation, and shows a new prompt when it is ready.
|
||
It will also show how much CPU time was consumed:
|
||
</P>
|
||
<PRE>
|
||
> i HelloEng.gf
|
||
- compiling Hello.gf... wrote file Hello.gfc 8 msec
|
||
- compiling HelloEng.gf... wrote file HelloEng.gfc 12 msec
|
||
|
||
12 msec
|
||
>
|
||
</PRE>
|
||
<P>
|
||
You can now use GF for <B>parsing</B>:
|
||
</P>
|
||
<PRE>
|
||
> parse "hello world"
|
||
Hello World
|
||
</PRE>
|
||
<P>
|
||
The <CODE>parse</CODE> (= <CODE>p</CODE>) command takes a <B>string</B>
|
||
(in double quotes) and returns an <B>abstract syntax tree</B> --- the meaning
|
||
of the string as defined in the abstract syntax.
|
||
A tree is, in general, something easier than a string
|
||
for a machine to understand and to process further, although this
|
||
is not so obvious in this simple grammar. The syntax for trees is that
|
||
of <B>function application</B>, which in GF is written
|
||
</P>
|
||
<PRE>
|
||
function argument1 ... argumentn
|
||
</PRE>
|
||
<P>
|
||
Parentheses are only needed for grouping. For instance, <CODE>f (a b)</CODE> is
|
||
<CODE>f</CODE> applied to the application of <CODE>a</CODE> to <CODE>b</CODE>. This is different
|
||
from <CODE>f a b</CODE>, which is <CODE>f</CODE> applied to <CODE>a</CODE> and <CODE>b</CODE>.
|
||
</P>
|
||
<P>
|
||
Strings that return a tree when parsed do so in virtue of the grammar
|
||
you imported. Try to parse something that is not in grammar, and you will fail
|
||
</P>
|
||
<PRE>
|
||
> parse "hello dad"
|
||
Unknown words: dad
|
||
|
||
> parse "world hello"
|
||
no tree found
|
||
</PRE>
|
||
<P>
|
||
In the first example, the failure is caused by an unknown word.
|
||
In the second example, the combination of words is ungrammatical.
|
||
</P>
|
||
<P>
|
||
In addition to parsing, you can also use GF for <B>linearization</B>
|
||
(<CODE>linearize = l</CODE>). This is the inverse of
|
||
parsing, taking trees into strings:
|
||
</P>
|
||
<PRE>
|
||
> linearize Hello World
|
||
hello world
|
||
</PRE>
|
||
<P>
|
||
What is the use of this? Typically not that you type in a tree at
|
||
the GF prompt. The utility of linearization comes from the fact that
|
||
you can obtain a tree from somewhere else --- for instance, from
|
||
a parser. A prime example of this is <B>translation</B>: you parse
|
||
with one concrete syntax and linearize with another. Let us
|
||
now do this by first importing the Italian grammar:
|
||
</P>
|
||
<PRE>
|
||
> import HelloIta.gf
|
||
</PRE>
|
||
<P>
|
||
We can now parse with <CODE>HelloEng</CODE> and <B>pipe</B> the result
|
||
into linearizing with <CODE>HelloIta</CODE>:
|
||
</P>
|
||
<PRE>
|
||
> parse -lang=HelloEng "hello mum" | linearize -lang=HelloIta
|
||
ciao mamma
|
||
</PRE>
|
||
<P>
|
||
Notice that, since there are now two concrete syntaxes read into the
|
||
system, the commands use a <B>language flag</B> to indicate
|
||
which concrete syntax is used in each operation. If no language flag is
|
||
given, the last-imported language is applied.
|
||
</P>
|
||
<P>
|
||
To conclude the translation exercise, we import the Finnish grammar
|
||
and pipe English parsing into <B>multilingual generation</B>:
|
||
</P>
|
||
<PRE>
|
||
> parse -lang=HelloEng "hello friends" | linearize -multi
|
||
terve yst<73>v<EFBFBD>t
|
||
ciao amici
|
||
hello friends
|
||
</PRE>
|
||
<P></P>
|
||
<P>
|
||
<B>Exercise</B>. Test the parsing and translation examples shown above, as well as
|
||
some other examples, in different combinations of languages.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Extend the grammar <CODE>Hello.gf</CODE> and some of the
|
||
concrete syntaxes by five new recipients and one new greeting
|
||
form.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Add a concrete syntax for some other
|
||
languages you might know.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Add a pair of greetings that are expressed in one and the same way in
|
||
one language and in two different ways in another. For instance, <I>good morning</I>
|
||
and <I>good afternoon</I> in English are both expressed as <I>buongiorno</I> in Italian.
|
||
Test what happens when you translate <I>buongiorno</I> to English in GF.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Inject errors in the <CODE>Hello</CODE> grammars, for example, leave out
|
||
some line, omit a variable in a <CODE>lin</CODE> rule, or change the name in one occurrence
|
||
of a variable. Inspect the error messages generated by GF.
|
||
</P>
|
||
<A NAME="toc8"></A>
|
||
<H2>Using grammars from outside GF</H2>
|
||
<P>
|
||
A normal "hello world" program written in C is executable from the
|
||
Unix shell and print its output on the terminal. This is possible in GF
|
||
as well, by using the <CODE>gf</CODE> program in a Unix pipe. Invoking <CODE>gf</CODE>
|
||
can be made with grammar names as arguments,
|
||
</P>
|
||
<PRE>
|
||
% gf HelloEng.gf HelloFin.gf HelloIta.gf
|
||
</PRE>
|
||
<P>
|
||
which has the same effect as opening <CODE>gf</CODE> and then importing the
|
||
grammars. A command can be send to this <CODE>gf</CODE> state by piping it from
|
||
Unix's <CODE>echo</CODE> command:
|
||
</P>
|
||
<PRE>
|
||
% echo "l -multi Hello Wordl" | gf HelloEng.gf HelloFin.gf HelloIta.gf
|
||
</PRE>
|
||
<P>
|
||
which will execute the command and then quit. Alternatively, one can write
|
||
a <B>script</B>, a file containing the lines
|
||
</P>
|
||
<PRE>
|
||
import HelloEng.gf
|
||
import HelloFin.gf
|
||
import HelloIta.gf
|
||
linearize -multi Hello World
|
||
</PRE>
|
||
<P>
|
||
If we name this script <CODE>hello.gfs</CODE>, we can do
|
||
</P>
|
||
<PRE>
|
||
$ gf -batch -s <hello.gfs s
|
||
|
||
ciao mondo
|
||
terve maailma
|
||
hello world
|
||
</PRE>
|
||
<P>
|
||
The options <CODE>-batch</CODE> and <CODE>-s</CODE> ("silent") remove prompts, CPU time,
|
||
and other messages. Writing GF scripts and Unix shell scripts that call
|
||
GF is the simplest way to build application programs that use GF grammars.
|
||
In <a href="#chapeight">the eighth chapter</a>, we will see how to build stand-alone programs that don't need
|
||
the GF system to run.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. (For Unix hackers.) Write a GF application that reads
|
||
an English string from the standard input and writes an Italian
|
||
translation to the output.
|
||
</P>
|
||
<A NAME="toc9"></A>
|
||
<H2>What else can be done with the grammar</H2>
|
||
<P>
|
||
Now we have built our first multilingual grammar and seen the basic
|
||
functionalities of GF: parsing and linearization. We have tested
|
||
these functionalities inside the GF program. In the forthcoming
|
||
chapters, we will build larger grammars and can then get more out of
|
||
these functionalities. But we will also introduce new ones:
|
||
</P>
|
||
<UL>
|
||
<LI><B>morphological analysis</B>: find out the possible inflection forms of words
|
||
<LI><B>morphological synthesis</B>: generate all inflection forms of words
|
||
<LI><B>random generation</B>: generate random expressions
|
||
<LI><B>corpus generation</B>: generate all expressions
|
||
<LI><B>treebank generation</B>: generate a list of trees with their linearizations
|
||
<LI><B>teaching quizzes</B>: train morphology and translation
|
||
<LI><B>multilingual authoring</B>: create a document in many languages simultaneously
|
||
<LI><B>speech input</B>: optimize a speech recognition system for a grammar
|
||
</UL>
|
||
|
||
<P>
|
||
The usefulness of GF would be quite limited if grammars were
|
||
usable only inside the GF system. In <a href="#chapeight">the eighth chapter</a>,
|
||
we will see other ways of using grammars:
|
||
</P>
|
||
<UL>
|
||
<LI>compile them to new formats, such as speech recognition grammars
|
||
<LI>embed them in Java and Haskell programs
|
||
<LI>build applications using compilation and embedding:
|
||
<UL>
|
||
<LI>voice commands
|
||
<LI>spoken language translators
|
||
<LI>dialogue systems
|
||
<LI>user interfaces
|
||
<LI>localization: parametrize the messages printed by a program
|
||
to support different languages
|
||
</UL>
|
||
</UL>
|
||
|
||
<P>
|
||
All GF functionalities, both those inside the GF program and those
|
||
ported to other environments,
|
||
are of course already applicable to the simplest of grammars,
|
||
such as the <CODE>Hello</CODE> grammars presented above. But the main focus
|
||
of this tutorial will be on grammar writing. Thus we will show
|
||
how larger and more expressive grammars can be built by using
|
||
the constructs of the GF programming language, before entering the
|
||
applications.
|
||
</P>
|
||
<A NAME="toc10"></A>
|
||
<H2>Summary of GF language features</H2>
|
||
<P>
|
||
As the last section of each chapter, we will give a summary of the GF language
|
||
features covered in the chapter. The presentation is rather technical and intended
|
||
as a reference for later use, rather than to be read at once. The summaries
|
||
may cover some new features, which complement the discussion in the main chapter.
|
||
</P>
|
||
<A NAME="toc11"></A>
|
||
<H3>Modules</H3>
|
||
<P>
|
||
A GF grammar consists of <B>modules</B>,
|
||
into which judgements are grouped. The most important
|
||
module forms are
|
||
</P>
|
||
<UL>
|
||
<LI><CODE>abstract</CODE> A <CODE>= {...}</CODE> , abstract syntax A with judgements in
|
||
the <B>module body</B> <CODE>{...}</CODE>.
|
||
<LI><CODE>concrete</CODE> C <CODE>of</CODE> A <CODE>= {...}</CODE>, concrete syntax C of the
|
||
abstract syntax A, with judgements in the module body <CODE>{...}</CODE>.
|
||
</UL>
|
||
|
||
<P>
|
||
Each module is written in a file named <I>Modulename</I><CODE>.gf</CODE>.
|
||
</P>
|
||
<A NAME="toc12"></A>
|
||
<H3>Judgements</H3>
|
||
<P>
|
||
<a name="secjment"></a>
|
||
</P>
|
||
<P>
|
||
Rules in a module body are called <B>judgements</B>. Keywords such as
|
||
<CODE>fun</CODE> and <CODE>lin</CODE> are used for distinguishing between
|
||
<B>judgement forms</B>. Here is a summary of the most important
|
||
judgement forms, which we have considered by now:
|
||
</P>
|
||
<TABLE ALIGN="center" CELLPADDING="4" BORDER="1">
|
||
<TR>
|
||
<TH>form</TH>
|
||
<TH>reading</TH>
|
||
<TH COLSPAN="2">module type</TH>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>cat</CODE> <I>C</I></TD>
|
||
<TD><I>C</I> is a category</TD>
|
||
<TD>abstract</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>fun</CODE> <I>f</I> <CODE>:</CODE> <I>A</I></TD>
|
||
<TD><I>f</I> is a function of type <I>A</I></TD>
|
||
<TD>abstract</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>lincat</CODE> <I>C</I> <CODE>=</CODE> <I>T</I></TD>
|
||
<TD>category <I>C</I> has linearization type <I>T</I></TD>
|
||
<TD>concrete</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>lin</CODE> <I>f <i>x</i><sub>1</sub> ... <i>x</i><sub>n</sub></I> <CODE>=</CODE> <I>t</I></TD>
|
||
<TD>function <I>f</I> has linearization <I>t</I></TD>
|
||
<TD>concrete</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>flags</CODE> <I>p</I> <CODE>=</CODE> <I>v</I></TD>
|
||
<TD>flag <I>p</I> has value <I>v</I></TD>
|
||
<TD>any</TD>
|
||
</TR>
|
||
</TABLE>
|
||
|
||
<P></P>
|
||
<P>
|
||
Both abstract and concrete modules may moreover contain <B>comments</B> of the forms
|
||
</P>
|
||
<UL>
|
||
<LI><CODE>--</CODE> <I>anything until a newline</I>
|
||
<LI><CODE>{-</CODE> <I>anything except hyphen followed by closing brace</I> <CODE>-}</CODE>
|
||
</UL>
|
||
|
||
<P>
|
||
Judgements are terminated by semicolons. Shorthands permit the sharing of
|
||
the keyword in subsequent judgements,
|
||
</P>
|
||
<PRE>
|
||
cat C ; D ; === cat C ; cat D ;
|
||
</PRE>
|
||
<P>
|
||
and of the right-hand-side in subsequent judgements of the same form
|
||
</P>
|
||
<PRE>
|
||
fun f, g : A ; === fun f : A ; g : A ;
|
||
</PRE>
|
||
<P>
|
||
We will use the symbol <CODE>===</CODE> to indicate <B>syntactic sugar</B> when
|
||
speaking about GF. Thus it is not a symbol of the GF language.
|
||
</P>
|
||
<P>
|
||
Each judgement declares a <B>name</B>, which is an <B>identifier</B>.
|
||
An identifier is a letter followed by a sequence of letters, digits, and
|
||
characters <CODE>'</CODE> or <CODE>_</CODE>. Each identifier can only be
|
||
defined once in the same module (that is, as next to the judgement keyword;
|
||
local variables such as those in <CODE>lin</CODE> judgemenrs can be
|
||
reused in other judgements).
|
||
</P>
|
||
<P>
|
||
Names are in <B>scope</B> in the rest of the module, i.e. usable in the other
|
||
judgements of the module (subject to type restrictions, of course). Also
|
||
the name of the module is an identifier in scope.
|
||
</P>
|
||
<P>
|
||
The order of judgements in a module is free. In particular, an identifier
|
||
need not be declared before it is used.
|
||
</P>
|
||
<A NAME="toc13"></A>
|
||
<H3>Types and terms</H3>
|
||
<P>
|
||
A <B>type</B> in an abstract syntax are either a <B>basic type</B>,
|
||
i.e. one introduced in a <CODE>cat</CODE> judgement, or a
|
||
<B>function type</B> of the form
|
||
</P>
|
||
<PRE>
|
||
A1 -> ... -> An -> A
|
||
</PRE>
|
||
<P>
|
||
where each of <CODE>A1, ..., An, A</CODE> is a basic type.
|
||
The last type in the arrow-separated sequence
|
||
is the <B>value type</B> of the function type, and the earlier types are
|
||
its <B>argument types</B>.
|
||
</P>
|
||
<P>
|
||
In a concrete syntax, the available types include
|
||
</P>
|
||
<UL>
|
||
<LI>the type of <B>token lists</B>, <CODE>Str</CODE>
|
||
<LI><B>record types</B> of form <CODE>{</CODE> r1 : T1 ; ... ; rn : Tn <CODE>}</CODE>
|
||
</UL>
|
||
|
||
<P>
|
||
Token lists are often briefly called <B>strings</B>.
|
||
</P>
|
||
<P>
|
||
Each semi-colon separated part in a record type is called a
|
||
<B>field</B>. The identifier introduced by the left-hand-side of a field
|
||
is called a <B>label</B>.
|
||
</P>
|
||
<P>
|
||
A <B>term</B> in abstract syntax is a <B>function application</B> of form
|
||
</P>
|
||
<PRE>
|
||
f a1 ... an
|
||
</PRE>
|
||
<P>
|
||
where <CODE>f</CODE> is a function declared in a <CODE>fun</CODE> judgement and <CODE>a1 ... an</CODE>
|
||
are terms. These terms are also called <B>abstract syntax trees</B>, or just
|
||
<B>trees</B>.
|
||
The tree above is well-typed and has the type A, if
|
||
</P>
|
||
<PRE>
|
||
f : A1 -> ... -> An -> A
|
||
</PRE>
|
||
<P>
|
||
and each <CODE>ai</CODE> has type <CODE>an</CODE>.
|
||
</P>
|
||
<P>
|
||
A term used in concrete syntax has one the forms
|
||
</P>
|
||
<UL>
|
||
<LI>quoted string: <CODE>"foo"</CODE>, of type <CODE>Str</CODE>
|
||
<LI>concatenation of strings: <CODE>"foo" ++ "bar"</CODE>,
|
||
<LI>record: <CODE>{</CODE> r1 = t1 ; ... ; rn = Tn <CODE>}</CODE>,
|
||
of type <CODE>{</CODE> r1 : R1 ; ... ; rn : Rn <CODE>}</CODE>
|
||
<LI>projection <CODE>t.r</CODE> of a term <CODE>t</CODE> that has a record type,
|
||
with the record label <CODE>r</CODE>; the projection has the corresponding record
|
||
field type
|
||
<LI>argument variable <CODE>x</CODE> bound by the left-hand-side of a <CODE>lin</CODE> rule,
|
||
of the corresponding linearization type
|
||
</UL>
|
||
|
||
<P>
|
||
Each quoted string is treated as one <B>token</B>, and strings concatenated by
|
||
<CODE>++</CODE> are treated as separate tokens. Tokens are, by default, written with
|
||
a space in between. This behaviour can be changed by <CODE>lexer</CODE> and <CODE>unlexer</CODE>
|
||
flags, as will be explained later "Rseclexing. Therefore it is usually
|
||
not correct to have a space in a token. Writing
|
||
</P>
|
||
<PRE>
|
||
"hello world"
|
||
</PRE>
|
||
<P>
|
||
in a grammar would give the parser the task to find a token with a space
|
||
in it, rather than two tokens <CODE>"hello"</CODE> and <CODE>"world"</CODE>. If the latter is
|
||
what is meant, it is possible to use the shorthand
|
||
</P>
|
||
<PRE>
|
||
["hello world"] === "hello" ++ "world"
|
||
</PRE>
|
||
<P>
|
||
The <B>empty string</B> is denoted by <CODE>[]</CODE> or, equivalently, <CODE>`` or ``[]</CODE>.
|
||
</P>
|
||
<A NAME="toc14"></A>
|
||
<H3>Type checking</H3>
|
||
<P>
|
||
An important functionality of the GF system is <B>static type checking</B>.
|
||
This means that the grammars are controlled to be well-formed, so that all
|
||
run-time errors are eliminated. The main type checking principles are the
|
||
following:
|
||
</P>
|
||
<UL>
|
||
<LI>a concrete syntax must define the <CODE>lincat</CODE> of each <CODE>cat</CODE> and a <CODE>lin</CODE>
|
||
for each <CODE>fun</CODE> in the abstract syntax that it is "<CODE>of</CODE>"
|
||
<LI><CODE>lin</CODE> rules are type checked with respect to the <CODE>lincat</CODE> and <CODE>fun</CODE>
|
||
rules
|
||
<LI>terms have types as defined in the previous section
|
||
</UL>
|
||
|
||
<A NAME="toc15"></A>
|
||
<H1>Designing a grammar for complex phrases</H1>
|
||
<P>
|
||
<a name="chapthree"></a>
|
||
</P>
|
||
<P>
|
||
In this chapter, we will write a grammar that has much more structure than
|
||
the <CODE>Hello</CODE> grammar. We will look at how the abstract syntax
|
||
is divided into suitable categories, and how infinitely many
|
||
phrases can be generated by using recursive rules. We will also
|
||
introduce modularity by showing how a grammar can be
|
||
divided into modules, and how functional programming
|
||
can be used to share code in and among modules.
|
||
</P>
|
||
<A NAME="toc16"></A>
|
||
<H2>The abstract syntax Food</H2>
|
||
<P>
|
||
We will write a grammar that
|
||
defines a set of phrases usable for speaking about food:
|
||
</P>
|
||
<UL>
|
||
<LI>the start category is <CODE>Phrase</CODE>
|
||
<LI>a <CODE>Phrase</CODE> can be built by assigning a <CODE>Quality</CODE> to an <CODE>Item</CODE>
|
||
(e.g. <I>this cheese is Italian</I>)
|
||
<LI>an<CODE>Item</CODE> are build from a <CODE>Kind</CODE> by prefixing <I>this</I> or <I>that</I>
|
||
(e.g. <I>this wine</I>)
|
||
<LI>a <CODE>Kind</CODE> is either <B>atomic</B> (e.g. <I>cheese</I>), or formed
|
||
qualifying a given <CODE>Kind</CODE> with a <CODE>Quality</CODE> (e.g. <I>Italian cheese</I>)
|
||
<LI>a <CODE>Quality</CODE> is either atomic (e.g. <I>Italian</I>,
|
||
or built by modifying a given <CODE>Quality</CODE> with the word <I>very</I> (e.g. <I>very warm</I>)
|
||
</UL>
|
||
|
||
<P>
|
||
These verbal descriptions can be expressed as the following abstract syntax:
|
||
</P>
|
||
<PRE>
|
||
abstract Food = {
|
||
|
||
flags startcat = Phrase ;
|
||
|
||
cat
|
||
Phrase ; Item ; Kind ; Quality ;
|
||
|
||
fun
|
||
Is : Item -> Quality -> Phrase ;
|
||
This, That : Kind -> Item ;
|
||
QKind : Quality -> Kind -> Kind ;
|
||
Wine, Cheese, Fish : Kind ;
|
||
Very : Quality -> Quality ;
|
||
Fresh, Warm, Italian, Expensive, Delicious, Boring : Quality ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
In this abstract syntax, we can build <CODE>Phrase</CODE>s such as
|
||
</P>
|
||
<PRE>
|
||
Is (This (QKind Delicious (QKind Italian Wine))) (Very (Very Expensive))
|
||
</PRE>
|
||
<P>
|
||
In the English concrete syntax, we will want to linearize this into
|
||
</P>
|
||
<PRE>
|
||
this delicious Italian wine is very very expensive
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc17"></A>
|
||
<H2>The concrete syntax FoodEng</H2>
|
||
<P>
|
||
The English concrete syntax gives no surprises:
|
||
</P>
|
||
<PRE>
|
||
concrete FoodEng of Food = {
|
||
|
||
lincat
|
||
Phrase, Item, Kind, Quality = {s : Str} ;
|
||
|
||
lin
|
||
Is item quality = {s = item.s ++ "is" ++ quality.s} ;
|
||
This kind = {s = "this" ++ kind.s} ;
|
||
That kind = {s = "that" ++ kind.s} ;
|
||
QKind quality kind = {s = quality.s ++ kind.s} ;
|
||
Wine = {s = "wine"} ;
|
||
Cheese = {s = "cheese"} ;
|
||
Fish = {s = "fish"} ;
|
||
Very quality = {s = "very" ++ quality.s} ;
|
||
Fresh = {s = "fresh"} ;
|
||
Warm = {s = "warm"} ;
|
||
Italian = {s = "Italian"} ;
|
||
Expensive = {s = "expensive"} ;
|
||
Delicious = {s = "delicious"} ;
|
||
Boring = {s = "boring"} ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
Let us test how the grammar works in parsing:
|
||
</P>
|
||
<PRE>
|
||
> import FoodEng.gf
|
||
> parse "this delicious wine is very very Italian"
|
||
Is (This (QKind Delicious Wine)) (Very (Very Italian))
|
||
</PRE>
|
||
<P>
|
||
We can also try parsing in other categories than the <CODE>startcat</CODE>,
|
||
by setting the command-line <CODE>cat</CODE> flag:
|
||
</P>
|
||
<PRE>
|
||
p -cat=Kind "very Italian wine"
|
||
QKind (Very Italian) Wine
|
||
</PRE>
|
||
<P></P>
|
||
<P>
|
||
<B>Exercise</B>. Extend the <CODE>Food</CODE> grammar by ten new food kinds and
|
||
qualities, and run the parser with new kinds of examples.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Add a rule that enables question phrases of the form
|
||
<I>is this cheese Italian</I>.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Enable the optional prefixing of
|
||
phrases with the words "excuse me but". Do this in such a way that
|
||
the prefix can occur at most once.
|
||
</P>
|
||
<A NAME="toc18"></A>
|
||
<H2>Commands for testing grammars</H2>
|
||
<A NAME="toc19"></A>
|
||
<H3>Generating trees and strings</H3>
|
||
<P>
|
||
When we have a grammar above a trivial size, especially a recursive
|
||
one, we need more efficient ways of testing it than just by parsing
|
||
sentences that happen to come to our minds. One way to do this is
|
||
based on automatic generation, which can be either
|
||
<B>random generation</B> or <B>exhaustive generation</B>.
|
||
</P>
|
||
<P>
|
||
Random generation (<CODE>generate_random = gr</CODE>) is an operation that
|
||
builds a random tree in accordance with an abstract syntax:
|
||
</P>
|
||
<PRE>
|
||
> generate_random
|
||
Is (This (QKind Italian Fish)) Fresh
|
||
</PRE>
|
||
<P>
|
||
By using a pipe, random generation can be fed into linearization:
|
||
</P>
|
||
<PRE>
|
||
> generate_random | linearize
|
||
this Italian fish is fresh
|
||
</PRE>
|
||
<P>
|
||
Random generation is a good way to test a grammar. It can also give results
|
||
that are surprising, which shows how fast we lose intuition
|
||
when we write complex grammars.
|
||
</P>
|
||
<P>
|
||
By using the <CODE>number</CODE> flag, several trees can be generated
|
||
in one command:
|
||
</P>
|
||
<PRE>
|
||
> gr -number=10 | l
|
||
that wine is boring
|
||
that fresh cheese is fresh
|
||
that cheese is very boring
|
||
this cheese is Italian
|
||
that expensive cheese is expensive
|
||
that fish is fresh
|
||
that wine is very Italian
|
||
this wine is Italian
|
||
this cheese is boring
|
||
this fish is boring
|
||
</PRE>
|
||
<P>
|
||
To generate <I>all</I> phrases that a grammar can produce,
|
||
GF provides the command <CODE>generate_trees = gt</CODE>.
|
||
</P>
|
||
<PRE>
|
||
> generate_trees | l
|
||
that cheese is very Italian
|
||
that cheese is very boring
|
||
that cheese is very delicious
|
||
that cheese is very expensive
|
||
that cheese is very fresh
|
||
...
|
||
this wine is expensive
|
||
this wine is fresh
|
||
this wine is warm
|
||
|
||
</PRE>
|
||
<P>
|
||
We get quite a few trees but not all of them: only up to a given
|
||
<B>depth</B> of trees. The default depth is 3; the depth can be
|
||
set by using the <CODE>depth</CODE> flag:
|
||
</P>
|
||
<PRE>
|
||
> generate_trees -depth=5 | l
|
||
</PRE>
|
||
<P>
|
||
Other options to the generation commands (like all commands) can be seen
|
||
by GF's <CODE>help = h</CODE> command:
|
||
</P>
|
||
<PRE>
|
||
> help gr
|
||
> help gt
|
||
</PRE>
|
||
<P></P>
|
||
<P>
|
||
<B>Exercise</B>. If the command <CODE>gt</CODE> generated all
|
||
trees in your grammar, it would never terminate. Why?
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Measure how many trees the grammar gives with depths 4 and 5,
|
||
respectively. <B>Hint</B>. You can
|
||
use the Unix <B>word count</B> command <CODE>wc</CODE> to count lines.
|
||
</P>
|
||
<A NAME="toc20"></A>
|
||
<H3>More on pipes; tracing</H3>
|
||
<P>
|
||
A pipe of GF commands can have any length, but the "output type"
|
||
(either string or tree) of one command must always match the "input type"
|
||
of the next command, in order for the result to make sense.
|
||
</P>
|
||
<P>
|
||
The intermediate results in a pipe can be observed by putting the
|
||
<B>tracing</B> option <CODE>-tr</CODE> to each command whose output you
|
||
want to see:
|
||
</P>
|
||
<PRE>
|
||
> gr -tr | l -tr | p
|
||
|
||
Is (This Cheese) Boring
|
||
this cheese is boring
|
||
Is (This Cheese) Boring
|
||
</PRE>
|
||
<P>
|
||
This facility is useful for test purposes: the pipe above can show
|
||
if a grammar is <B>ambiguous</B>, i.e.
|
||
contains strings that can be parsed in more than one way.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Extend the <CODE>Food</CODE> grammar so that it produces ambiguous
|
||
strings, and try out the ambiguity test.
|
||
</P>
|
||
<A NAME="toc21"></A>
|
||
<H3>Writing and reading files</H3>
|
||
<P>
|
||
To save the outputs of GF commands into a file, you can
|
||
pipe it to the <CODE>write_file = wf</CODE> command,
|
||
</P>
|
||
<PRE>
|
||
> gr -number=10 | linearize | write_file exx.tmp
|
||
</PRE>
|
||
<P>
|
||
You can read the file back to GF with the
|
||
<CODE>read_file = rf</CODE> command,
|
||
</P>
|
||
<PRE>
|
||
> read_file exx.tmp | parse -lines
|
||
</PRE>
|
||
<P>
|
||
Notice the flag <CODE>-lines</CODE> given to the parsing
|
||
command. This flag tells GF to parse each line of
|
||
the file separately. Without the flag, the grammar could
|
||
not recognize the string in the file, because it is not
|
||
a sentence but a sequence of ten sentences.
|
||
</P>
|
||
<P>
|
||
Files with examples can be used for <B>regression testing</B>
|
||
of grammars. The most systematic way to do this is by
|
||
generating treebanks; see <a href="#sectreebank">here</a>.
|
||
</P>
|
||
<A NAME="toc22"></A>
|
||
<H3>Visualizing trees</H3>
|
||
<P>
|
||
The gibberish code with parentheses returned by the parser does not
|
||
look like trees. Why is it called so? From the abstract mathematical
|
||
point of view, trees are a data structure that
|
||
represents <B>nesting</B>: trees are branching entities, and the branches
|
||
are themselves trees. Parentheses give a linear representation of trees,
|
||
useful for the computer. But the human eye may prefer to see a visualization;
|
||
for this purpose, GF provides the command <CODE>visualize_tree = vt</CODE>, to which
|
||
parsing (and any other tree-producing command) can be piped:
|
||
</P>
|
||
<PRE>
|
||
> parse "this delicious cheese is very Italian" | visualize_tree
|
||
</PRE>
|
||
<P></P>
|
||
<P>
|
||
<IMG ALIGN="middle" SRC="mytree.png" BORDER="0" ALT="">
|
||
</P>
|
||
<P>
|
||
This command uses the programs Graphviz and Ghostview, which you
|
||
might not have, but which are freely available on the web.
|
||
</P>
|
||
<P>
|
||
Alternatively, you can print the tree into a file
|
||
e.g. a <CODE>.png</CODE> file that
|
||
can be be viewed with e.g. an HTML browser and also included in an
|
||
HTML document. You can do this
|
||
by saving the file <CODE>grphtmp.dot</CODE>, which the command <CODE>vt</CODE>
|
||
produces. Then you can process this file with the <CODE>dot</CODE>
|
||
program (from the Graphviz package).
|
||
</P>
|
||
<PRE>
|
||
% dot -Tpng grphtmp.dot > mytree.png
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc23"></A>
|
||
<H3>System commands</H3>
|
||
<P>
|
||
If you don't have Ghostview, or want to view graphs in some other way,
|
||
you can call <CODE>dot</CODE> and a suitable
|
||
viewer (e.g. <CODE>open</CODE> in Mac) without leaving GF, by using
|
||
a <B>system command</B>: <CODE>!</CODE> followed by a Unix command,
|
||
</P>
|
||
<PRE>
|
||
> ! dot -Tpng grphtmp.dot > mytree.png
|
||
> ! open mytree.png
|
||
</PRE>
|
||
<P>
|
||
Another form of system commands are those that receive arguments from
|
||
GF pipes. The escape symbol
|
||
is then <CODE>?</CODE>.
|
||
</P>
|
||
<PRE>
|
||
> generate_trees | ? wc
|
||
</PRE>
|
||
<P></P>
|
||
<P>
|
||
<B>Exercise</B>. (Exercise drom 3.3.1 revisited.)
|
||
Measure how many trees the grammar <CODE>FoodEng</CODE> gives with depths 4 and 5,
|
||
respectively. Use the Unix <B>word count</B> command <CODE>wc</CODE> to count lines, and
|
||
a pipe from a GF command into a Unix command.
|
||
</P>
|
||
<A NAME="toc24"></A>
|
||
<H2>An Italian concrete syntax</H2>
|
||
<P>
|
||
<a name="secanitalian"></a>
|
||
</P>
|
||
<P>
|
||
We write the Italian grammar in a straightforward way, by replacing
|
||
English words with their dictionary equivalents:
|
||
</P>
|
||
<PRE>
|
||
concrete FoodIta of Food = {
|
||
|
||
lincat
|
||
Phrase, Item, Kind, Quality = {s : Str} ;
|
||
|
||
lin
|
||
Is item quality = {s = item.s ++ "<22>" ++ quality.s} ;
|
||
This kind = {s = "questo" ++ kind.s} ;
|
||
That kind = {s = "quello" ++ kind.s} ;
|
||
QKind quality kind = {s = kind.s ++ quality.s} ;
|
||
Wine = {s = "vino"} ;
|
||
Cheese = {s = "formaggio"} ;
|
||
Fish = {s = "pesce"} ;
|
||
Very quality = {s = "molto" ++ quality.s} ;
|
||
Fresh = {s = "fresco"} ;
|
||
Warm = {s = "caldo"} ;
|
||
Italian = {s = "italiano"} ;
|
||
Expensive = {s = "caro"} ;
|
||
Delicious = {s = "delizioso"} ;
|
||
Boring = {s = "noioso"} ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
An alert reader, or one who already knows Italian, may notice one point in
|
||
which the change is more substantial than just replacement of words: the order of
|
||
a quality and the kind it modifies in
|
||
</P>
|
||
<PRE>
|
||
QKind quality kind = {s = kind.s ++ quality.s} ;
|
||
</PRE>
|
||
<P>
|
||
Thus Italian says <CODE>vino italiano</CODE> for <CODE>Italian wine</CODE>. (Some Italian adjectives
|
||
are put before the noun. This distinction can be controlled by parameters, which
|
||
are introduced in <a href="#chaptwo">the fourth chapter</a>.)
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Write a concrete syntax of <CODE>Food</CODE> for some other language.
|
||
You will probably end up with grammatically incorrect linearizations --- but don't
|
||
worry about this yet.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. If you have written <CODE>Food</CODE> for German, Swedish, or some
|
||
other language, test with random or exhaustive generation what constructs
|
||
come out incorrect, and prepare a list of those ones that cannot be helped
|
||
with the currently available fragment of GF. You can return to your list
|
||
after having worked out <a href="#chaptwo">the fourth chapter</a>.
|
||
</P>
|
||
<A NAME="toc25"></A>
|
||
<H2>Free variation</H2>
|
||
<P>
|
||
Sometimes there are alternative ways to define a concrete syntax.
|
||
For instance, if we use the <CODE>Food</CODE> grammars in a restaurant phrase
|
||
book, we may want to accept different words for expressing the quality
|
||
"delicious" ---- and different languages can differ in how many
|
||
such words they have. Then we don't want to put the distinctions into
|
||
the abstract syntax, but into concrete syntaxes. Such semantically
|
||
neutral distinctions are known as <B>free variation</B> in linguistics.
|
||
</P>
|
||
<P>
|
||
The <CODE>variants</CODE> construct of GF expresses free variation. For example,
|
||
</P>
|
||
<PRE>
|
||
lin Delicious = {s = variants {"delicious" ; "exquisit" ; "tasty"}} ;
|
||
</PRE>
|
||
<P>
|
||
says that <CODE>Delicious</CODE> can be linearized to any of <I>delicious</I>,
|
||
<I>exquisit</I>, and <I>tasty</I>. As a consequence, both these words result in the
|
||
tree <CODE>Delicious</CODE> when parsed. By default, the <CODE>linearize</CODE> command
|
||
shows only the first variant from each <CODE>variants</CODE> list; to see them
|
||
all, the option <CODE>-all</CODE> can be used:
|
||
</P>
|
||
<PRE>
|
||
> p "this exquisit wine is delicious" | l -all
|
||
this delicious wine is delicious
|
||
this delicious wine is exquisit
|
||
...
|
||
</PRE>
|
||
<P>
|
||
In linguistics, it is well known that free variation is almost
|
||
non-existing, if all aspects of expressions are taken into account, including style.
|
||
Therefore, free variation should not be used in grammars that are meant as
|
||
libraries for other grammars, as in <a href="#chapfive">the fifth chapter</a>. However, in a specific
|
||
application, free variation is an excellent way to scale up the parser to
|
||
variations in user input that make no difference in the semantic
|
||
treatment.
|
||
</P>
|
||
<P>
|
||
An example that clearly illustrates these points is the
|
||
English negation. If we added to the <CODE>Food</CODE> grammar the negation
|
||
of a quality, we could accept both contracted and uncontracted <I>not</I>:
|
||
</P>
|
||
<PRE>
|
||
fun IsNot : Item -> Quality -> Phrase ;
|
||
lin IsNot item qual =
|
||
{s = item.s ++ variants {"isn't" ; ["is not"]} ++ qual.s} ;
|
||
</PRE>
|
||
<P>
|
||
Both forms are likely to occur in user input. Since there is no
|
||
corresponding contrast in Italian, we do not want to put the distinction
|
||
in the abstract syntax. Yet there is a stylistic difference between
|
||
these two forms. In particular, if we are doing generation rather
|
||
than parsing, we will want to choose the one or
|
||
the other depending on the kind of language we want to generate.
|
||
</P>
|
||
<P>
|
||
A limiting case of free variation is an empty variant list
|
||
</P>
|
||
<PRE>
|
||
variants {}
|
||
</PRE>
|
||
<P>
|
||
It can be used e.g. if a word lacks a certain inflection form.
|
||
</P>
|
||
<P>
|
||
Free variation works for all types in concrete syntax; all terms in
|
||
a <CODE>variants</CODE> list must be of the same type.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Modify <CODE>FoodIta</CODE> in such a way that a quality can
|
||
be assigned to an item by using two different word orders, exemplified
|
||
by <I>questo vino <20> delizioso</I> and <I><EFBFBD> delizioso questo vino</I>
|
||
(a real variation in Italian),
|
||
and that it is impossible to say that something is boring
|
||
(a rather contrived example).
|
||
</P>
|
||
<A NAME="toc26"></A>
|
||
<H2>More application of multilingual grammars</H2>
|
||
<A NAME="toc27"></A>
|
||
<H3>Multilingual treebanks</H3>
|
||
<P>
|
||
<a name="sectreebank"></a>
|
||
</P>
|
||
<P>
|
||
A <B>multilingual treebank</B> is a set of trees with their
|
||
translations in different languages:
|
||
</P>
|
||
<PRE>
|
||
> gr -number=2 | tree_bank
|
||
|
||
Is (That Cheese) (Very Boring)
|
||
quello formaggio <20> molto noioso
|
||
that cheese is very boring
|
||
|
||
Is (That Cheese) Fresh
|
||
quello formaggio <20> fresco
|
||
that cheese is fresh
|
||
</PRE>
|
||
<P>
|
||
There is also an XML format for treebanks and a set of commands
|
||
suitable for regression testing; see <CODE>help tb</CODE> for more details.
|
||
</P>
|
||
<A NAME="toc28"></A>
|
||
<H3>Translation session</H3>
|
||
<P>
|
||
If translation is what you want to do with a set of grammars, a convenient
|
||
way to do it is to open a <CODE>translation_session = ts</CODE>. In this session,
|
||
you can translate between all the languages that are in scope.
|
||
A dot <CODE>.</CODE> terminates the translation session.
|
||
</P>
|
||
<PRE>
|
||
> ts
|
||
|
||
trans> that very warm cheese is boring
|
||
quello formaggio molto caldo <20> noioso
|
||
that very warm cheese is boring
|
||
|
||
trans> questo vino molto italiano <20> molto delizioso
|
||
questo vino molto italiano <20> molto delizioso
|
||
this very Italian wine is very delicious
|
||
|
||
trans> .
|
||
>
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc29"></A>
|
||
<H3>Translation quiz</H3>
|
||
<P>
|
||
This is a simple language exercise that can be automatically
|
||
generated from a multilingual grammar. The system generates a set of
|
||
random sentences, displays them in one language, and checks the user's
|
||
answer given in another language. The command <CODE>translation_quiz = tq</CODE>
|
||
makes this in a subshell of GF.
|
||
</P>
|
||
<PRE>
|
||
> translation_quiz FoodEng FoodIta
|
||
|
||
Welcome to GF Translation Quiz.
|
||
The quiz is over when you have done at least 10 examples
|
||
with at least 75 % success.
|
||
You can interrupt the quiz by entering a line consisting of a dot ('.').
|
||
|
||
this fish is warm
|
||
questo pesce <20> caldo
|
||
> Yes.
|
||
Score 1/1
|
||
|
||
this cheese is Italian
|
||
questo formaggio <20> noioso
|
||
> No, not questo formaggio <20> noioso, but
|
||
questo formaggio <20> italiano
|
||
|
||
Score 1/2
|
||
this fish is expensive
|
||
</PRE>
|
||
<P>
|
||
You can also generate a list of translation exercises and save it in a
|
||
file for later use, by the command <CODE>translation_list = tl</CODE>
|
||
</P>
|
||
<PRE>
|
||
> translation_list -number=25 FoodEng FoodIta | write_file transl.txt
|
||
</PRE>
|
||
<P>
|
||
The <CODE>number</CODE> flag gives the number of sentences generated.
|
||
</P>
|
||
<A NAME="toc30"></A>
|
||
<H3>Multilingual syntax editing</H3>
|
||
<P>
|
||
<a name="secediting"></a>
|
||
</P>
|
||
<P>
|
||
Any multilingual grammar can be used in the graphical syntax editor, which is
|
||
opened by the shell
|
||
command <CODE>gfeditor</CODE> followed by the names of the grammar files.
|
||
Thus
|
||
</P>
|
||
<PRE>
|
||
% gfeditor FoodEng.gf FoodIta.gf
|
||
</PRE>
|
||
<P>
|
||
opens the editor for the two <CODE>Food</CODE> grammars.
|
||
</P>
|
||
<P>
|
||
The editor supports commands for manipulating an abstract syntax tree.
|
||
The process is started by choosing a category from the "New" menu.
|
||
Choosing <CODE>Phrase</CODE> creates a new tree of type <CODE>Phrase</CODE>. A new tree
|
||
is in general completely unknown: it consists of a <B>metavariable</B>
|
||
<CODE>?1</CODE>. However, since the category <CODE>Phrase</CODE> in <CODE>Food</CODE> has
|
||
only one possible constructor, <CODE>Is</CODE>, the tree is readily
|
||
given the form <CODE>Is ?1 ?2</CODE>. Here is what the editor looks like at
|
||
this stage:
|
||
</P>
|
||
<P>
|
||
<IMG ALIGN="right" SRC="food1.png" BORDER="0" ALT="">
|
||
</P>
|
||
<P>
|
||
Editing goes on by <B>refinements</B>, i.e. choices of constructors from
|
||
the menu, until no metavariables remain. Here is a tree resulting from the
|
||
current editing session:
|
||
</P>
|
||
<P>
|
||
<IMG ALIGN="right" SRC="food2.png" BORDER="0" ALT="">
|
||
</P>
|
||
<P>
|
||
Editing can be continued even when the tree is finished. The user can shift
|
||
the <B>focus</B> to some of the subtrees by clicking at it or the corresponding
|
||
part of a linearization. In the picture, the focus is on "fish".
|
||
Since there are no metavariables,
|
||
the menu shows no refinements, but some other possible actions:
|
||
</P>
|
||
<UL>
|
||
<LI>to <B>change</B> "fish" to "cheese" or "wine"
|
||
<LI>to <B>delete</B> "fish", i.e. change it to a metavariable
|
||
<LI>to <B>wrap</B> "fish" in a qualification, i.e. change it to
|
||
<CODE>QKind ? Fish</CODE>, where the quality can be given in a later refinement
|
||
</UL>
|
||
|
||
<P>
|
||
In addition to menu-based editing, the tool supports refinement by parsing,
|
||
which is accessible by middle-clicking in the tree or in the linearization field.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Construct the sentence
|
||
<I>this very expensive cheese is very very delicious</I>
|
||
and its Italian translation by using <CODE>gfeditor</CODE>.
|
||
</P>
|
||
<A NAME="toc31"></A>
|
||
<H2>Context-free grammars and GF</H2>
|
||
<P>
|
||
Readers not familar with context-free grammars, also known as BNF grammars, can
|
||
skip this section. Those that are familar with them will find here the exact
|
||
relation between GF and context-free grammars. We will moreover show how
|
||
the BNF format can be used as input to the GF program; it is often more
|
||
concise than GF proper, but also more restricted in expressive power.
|
||
</P>
|
||
<A NAME="toc32"></A>
|
||
<H3>The "cf" grammar format</H3>
|
||
<P>
|
||
The grammar <CODE>FoodEng</CODE> could be written in a BNF format as follows:
|
||
</P>
|
||
<PRE>
|
||
Is. Phrase ::= Item "is" Quality ;
|
||
That. Item ::= "that" Kind ;
|
||
This. Item ::= "this" Kind ;
|
||
QKind. Kind ::= Quality Kind ;
|
||
Cheese. Kind ::= "cheese" ;
|
||
Fish. Kind ::= "fish" ;
|
||
Wine. Kind ::= "wine" ;
|
||
Italian. Quality ::= "Italian" ;
|
||
Boring. Quality ::= "boring" ;
|
||
Delicious. Quality ::= "delicious" ;
|
||
Expensive. Quality ::= "expensive" ;
|
||
Fresh. Quality ::= "fresh" ;
|
||
Very. Quality ::= "very" Quality ;
|
||
Warm. Quality ::= "warm" ;
|
||
</PRE>
|
||
<P>
|
||
In this format, each rule is prefixed by a <B>label</B> that gives
|
||
the constructor function GF gives in its <CODE>fun</CODE> rules. In fact,
|
||
each context-free rule is a fusion of a <CODE>fun</CODE> and a <CODE>lin</CODE> rule:
|
||
it states simultaneously that
|
||
</P>
|
||
<UL>
|
||
<LI>the label is a function from the nonterminal categories
|
||
on the right-hand side to the category on the left-hand side;
|
||
the first rule gives
|
||
<PRE>
|
||
fun Is : Item -> Quality -> Phrase
|
||
</PRE>
|
||
<LI>trees built by the label are linearized in the way indicated
|
||
by the right-hand side;
|
||
the first rule gives
|
||
<PRE>
|
||
lin Is item quality = {s = item.s ++ "is" ++ quality.s}
|
||
</PRE>
|
||
</UL>
|
||
|
||
<P>
|
||
The translation from BNF to GF described above is in fact used in
|
||
the GF system to convert BNF grammars into GF. BNF files are recognized
|
||
by the file name suffix <CODE>.cf</CODE>; thus the grammar above can be
|
||
put into a file named <CODE>food.cf</CODE> and read into GF by
|
||
</P>
|
||
<PRE>
|
||
> import food.cf
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc33"></A>
|
||
<H3>Restrictions of context-free grammars</H3>
|
||
<P>
|
||
Even though we managed to write <CODE>FoodEng</CODE> in the context-free format,
|
||
we cannot do this for GF grammars in general. It is enough to try this
|
||
with <CODE>FoodIta</CODE> at the same time as <CODE>FoodEng</CODE>,
|
||
we lose an important aspect of multilinguality:
|
||
that the order of constituents is defined only in concrete syntax.
|
||
Thus we could not use context-free <CODE>FoodEng</CODE> and <CODE>FoodIta</CODE> in a multilingual
|
||
grammar that supports translation via common abstract syntax: the
|
||
qualification function <CODE>QKind</CODE> has different types in the two
|
||
grammars.
|
||
</P>
|
||
<P>
|
||
In general terms, the separation of concrete and abstract syntax allows
|
||
three deviations from context-free grammar:
|
||
</P>
|
||
<UL>
|
||
<LI><B>permutation</B>: changing the order of constituents
|
||
<LI><B>suppression</B>: omitting constituents
|
||
<LI><B>reduplication</B>: repeating constituents
|
||
</UL>
|
||
|
||
<P>
|
||
The third property is the one that definitely shows that GF is
|
||
stronger than context-free: GF can define the <B>copy language</B>
|
||
<CODE>{x x | x <- (a|b)*}</CODE>, which is known not to be context-free.
|
||
The other properties have more to do with the kind of trees that
|
||
the grammar can associate with strings: permutation is important
|
||
in multilingual grammars, and suppression is exploited in grammars
|
||
where trees carry some hidden semantic information (see <a href="#chapsix">the sixth chapter</a>
|
||
below).
|
||
</P>
|
||
<P>
|
||
Of course, context-free grammars are also restricted from the
|
||
grammar engineering point of view. They give no support to
|
||
modules, functions, and parameters, which are so central
|
||
for the productivity of GF. Despite the initial conciseness
|
||
of context-free grammars, GF can easily produce grammars where
|
||
30 lines of GF code would need hundreds of lines of
|
||
context-free grammar code to produce; see exercises
|
||
<a href="#secitalian">here</a> and <a href="#sectense">here</a>.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. GF can also interpret unlabelled BNF grammars, by
|
||
creating labels automatically. The right-hand sides of BNF rules
|
||
can moreover be disjunctions, e.g.
|
||
</P>
|
||
<PRE>
|
||
Quality ::= "fresh" | "Italian" | "very" Quality ;
|
||
</PRE>
|
||
<P>
|
||
Experiment with this format in GF, possibly with a grammar that
|
||
you import from some other source, such as a programming language
|
||
document.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Define the copy language <CODE>{x x | x <- (a|b)*}</CODE> in GF.
|
||
</P>
|
||
<A NAME="toc34"></A>
|
||
<H2>Modules and files</H2>
|
||
<P>
|
||
GF uses suffixes to recognize different file formats. The most
|
||
important ones are:
|
||
</P>
|
||
<UL>
|
||
<LI>Source files: <I>Modulename</I><CODE>.gf</CODE>
|
||
<LI>Target files: <I>Modulename</I><CODE>.gfc</CODE>
|
||
</UL>
|
||
|
||
<P>
|
||
When you import <CODE>FoodEng.gf</CODE>, you see the target files being
|
||
generated:
|
||
</P>
|
||
<PRE>
|
||
> i FoodEng.gf
|
||
- compiling Food.gf... wrote file Food.gfc 16 msec
|
||
- compiling FoodEng.gf... wrote file FoodEng.gfc 20 msec
|
||
</PRE>
|
||
<P>
|
||
You also see that the GF program does not only read the file
|
||
<CODE>FoodEng.gf</CODE>, but also all other files that it
|
||
depends on --- in this case, <CODE>Food.gf</CODE>.
|
||
</P>
|
||
<P>
|
||
For each file that is compiled, a <CODE>.gfc</CODE> file
|
||
is generated. The GFC format (="GF Canonical") is the
|
||
"machine code" of GF, which is faster to process than
|
||
GF source files. When reading a module, GF decides whether
|
||
to use an existing <CODE>.gfc</CODE> file or to generate
|
||
a new one, by looking at modification times.
|
||
</P>
|
||
<P>
|
||
<I>In GF version 3, the</I> <CODE>gfc</CODE> <I>format is replaced by the format suffixed</I>
|
||
<CODE>gfo</CODE>, <I>"GF object"</I>.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. What happens when you import <CODE>FoodEng.gf</CODE> for
|
||
a second time? Try this in different situations:
|
||
</P>
|
||
<UL>
|
||
<LI>Right after importing it the first time (the modules are kept in
|
||
the memory of GF and need no reloading).
|
||
<LI>After issuing the command <CODE>empty</CODE> (<CODE>e</CODE>), which clears the memory
|
||
of GF.
|
||
<LI>After making a small change in <CODE>FoodEng.gf</CODE>, be it only an added space.
|
||
<LI>After making a change in <CODE>Food.gf</CODE>.
|
||
</UL>
|
||
|
||
<A NAME="toc35"></A>
|
||
<H2>Using operations and resource modules</H2>
|
||
<A NAME="toc36"></A>
|
||
<H3>The golden rule of functional programming</H3>
|
||
<P>
|
||
When writing a grammar, you have to type lots of
|
||
characters. You have probably
|
||
done this by the copy-and-paste method, which is a universally
|
||
available way to avoid repeating work.
|
||
</P>
|
||
<P>
|
||
However, there is a more elegant way to avoid repeating work than
|
||
the copy-and-paste
|
||
method. The <B>golden rule of functional programming</B> says that
|
||
</P>
|
||
<UL>
|
||
<LI>whenever you find yourself programming by copy-and-paste,
|
||
write a function instead.
|
||
</UL>
|
||
|
||
<P>
|
||
A function separates the shared parts of different computations from the
|
||
changing parts, its <B>arguments</B>, or <B>parameters</B>.
|
||
In functional programming languages, such as
|
||
Haskell, it is possible to share much more
|
||
code with functions than in languages such as C and Java, because
|
||
of higher-order functions (functions that takes functions as arguments).
|
||
</P>
|
||
<A NAME="toc37"></A>
|
||
<H3>Operation definitions</H3>
|
||
<P>
|
||
GF is a functional programming language, not only in the sense that
|
||
the abstract syntax is a system of functions (<CODE>fun</CODE>), but also because
|
||
functional programming can be used when defining concrete syntax. This is
|
||
done by using a new form of judgement, with the keyword <CODE>oper</CODE> (for
|
||
<B>operation</B>), distinct from <CODE>fun</CODE> for the sake of clarity.
|
||
Here is a simple example of an operation:
|
||
</P>
|
||
<PRE>
|
||
oper ss : Str -> {s : Str} = \x -> {s = x} ;
|
||
</PRE>
|
||
<P>
|
||
The operation can be <B>applied</B> to an argument, and GF will
|
||
<B>compute</B> the application into a value. For instance,
|
||
</P>
|
||
<PRE>
|
||
ss "boy" ===> {s = "boy"}
|
||
</PRE>
|
||
<P>
|
||
We use the symbol <CODE>===</CODE> to indicate how an expression is
|
||
computed into a value; this symbol is not a part of GF.
|
||
</P>
|
||
<P>
|
||
Thus an <CODE>oper</CODE> judgement includes the name of the defined operation,
|
||
its type, and an expression defining it. As for the syntax of the defining
|
||
expression, notice the <B>lambda abstraction</B> form <CODE>\</CODE><I>x</I> <CODE>-></CODE> <I>t</I> of
|
||
the function. It reads: function with variable <I>x</I> and <B>function body</B>
|
||
<I>t</I>. Any occurrence of <I>x</I> in <I>t</I> is said to be <B>bound</B> in <I>t</I>.
|
||
</P>
|
||
<P>
|
||
For lambda abstraction with multiple arguments, we have the shorthand
|
||
</P>
|
||
<PRE>
|
||
\x,y -> t === \x -> \y -> t
|
||
</PRE>
|
||
<P>
|
||
The notation we have used for linearization rules, where
|
||
variables are bound on the left-hand side, is actually syntactic
|
||
sugar for abstraction:
|
||
</P>
|
||
<PRE>
|
||
lin f x = t === lin f = \x -> t
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc38"></A>
|
||
<H3>The ``resource`` module type</H3>
|
||
<P>
|
||
Operator definitions can be included in a concrete syntax.
|
||
But they are usually not really tied to a particular
|
||
set of linearization rules.
|
||
They should rather be seen as <B>resources</B>
|
||
usable in many concrete syntaxes.
|
||
</P>
|
||
<P>
|
||
The <CODE>resource</CODE> module type is used to package
|
||
<CODE>oper</CODE> definitions into reusable resources. Here is
|
||
an example, with a handful of operations to manipulate
|
||
strings and records.
|
||
</P>
|
||
<PRE>
|
||
resource StringOper = {
|
||
oper
|
||
SS : Type = {s : Str} ;
|
||
ss : Str -> SS = \x -> {s = x} ;
|
||
cc : SS -> SS -> SS = \x,y -> ss (x.s ++ y.s) ;
|
||
prefix : Str -> SS -> SS = \p,x -> ss (p ++ x.s) ;
|
||
}
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc39"></A>
|
||
<H3>Opening a resource</H3>
|
||
<P>
|
||
Any number of <CODE>resource</CODE> modules can be
|
||
<B>open</B>ed in a <CODE>concrete</CODE> syntax, which
|
||
makes definitions contained
|
||
in the resource usable in the concrete syntax. Here is
|
||
an example, where the resource <CODE>StringOper</CODE> is
|
||
opened in a new version of <CODE>FoodEng</CODE>.
|
||
</P>
|
||
<PRE>
|
||
concrete FoodEng of Food = open StringOper in {
|
||
|
||
lincat
|
||
S, Item, Kind, Quality = SS ;
|
||
|
||
lin
|
||
Is item quality = cc item (prefix "is" quality) ;
|
||
This k = prefix "this" k ;
|
||
That k = prefix "that" k ;
|
||
QKind k q = cc k q ;
|
||
Wine = ss "wine" ;
|
||
Cheese = ss "cheese" ;
|
||
Fish = ss "fish" ;
|
||
Very = prefix "very" ;
|
||
Fresh = ss "fresh" ;
|
||
Warm = ss "warm" ;
|
||
Italian = ss "Italian" ;
|
||
Expensive = ss "expensive" ;
|
||
Delicious = ss "delicious" ;
|
||
Boring = ss "boring" ;
|
||
}
|
||
</PRE>
|
||
<P></P>
|
||
<P>
|
||
<B>Exercise</B>. Use the same string operations to write <CODE>FoodIta</CODE>
|
||
more concisely.
|
||
</P>
|
||
<A NAME="toc40"></A>
|
||
<H3>Partial application</H3>
|
||
<P>
|
||
<a name="secpartapp"></a>
|
||
</P>
|
||
<P>
|
||
GF, like Haskell, permits <B>partial application</B> of
|
||
functions. An example of this is the rule
|
||
</P>
|
||
<PRE>
|
||
lin This k = prefix "this" k ;
|
||
</PRE>
|
||
<P>
|
||
which can be written more concisely
|
||
</P>
|
||
<PRE>
|
||
lin This = prefix "this" ;
|
||
</PRE>
|
||
<P>
|
||
The first form is perhaps more intuitive to write
|
||
but, once you get used to partial application, you will appreciate its
|
||
conciseness and elegance. The logic of partial application
|
||
is known as <B>currying</B>, with a reference to Haskell B. Curry.
|
||
The idea is that any <I>n</I>-place function can be seen as a 1-place
|
||
function whose value is an <I>n-</I>1 -place function. Thus
|
||
</P>
|
||
<PRE>
|
||
oper prefix : Str -> SS -> SS ;
|
||
</PRE>
|
||
<P>
|
||
can be used as a 1-place function that takes a <CODE>Str</CODE> into a
|
||
function <CODE>SS -> SS</CODE>. The expected linearization of <CODE>This</CODE> is exactly
|
||
a function of such a type, operating on an argument of type <CODE>Kind</CODE>
|
||
whose linearization is of type <CODE>SS</CODE>. Thus we can define the
|
||
linearization directly as <CODE>prefix "this"</CODE>.
|
||
</P>
|
||
<P>
|
||
An important part of the art of functional programming is to decide the order
|
||
of arguments in a function, so that partial application can be used as much
|
||
as possible. For instance, of the operation <CODE>prefix</CODE> we know that it
|
||
will be typically applied to linearization variables with constant strings.
|
||
This is the reason to put the <CODE>Str</CODE> argument before the <CODE>SS</CODE> argument --- not
|
||
the prefixity. A <CODE>postfix</CODE> function would have exactly the same order of arguments.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Define an operation <CODE>infix</CODE> analogous to <CODE>prefix</CODE>,
|
||
such that it allows you to write
|
||
</P>
|
||
<PRE>
|
||
lin Is = infix "is" ;
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc41"></A>
|
||
<H3>Testing resource modules</H3>
|
||
<P>
|
||
To test a <CODE>resource</CODE> module independently, you must import it
|
||
with the flag <CODE>-retain</CODE>, which tells GF to retain <CODE>oper</CODE> definitions
|
||
in the memory; the usual behaviour is that <CODE>oper</CODE> definitions
|
||
are just applied to compile linearization rules
|
||
(this is called <B>inlining</B>) and then thrown away.
|
||
</P>
|
||
<PRE>
|
||
> import -retain StringOper.gf
|
||
</PRE>
|
||
<P>
|
||
The command <CODE>compute_concrete = cc</CODE> computes any expression
|
||
formed by operations and other GF constructs. For example,
|
||
</P>
|
||
<PRE>
|
||
> compute_concrete prefix "in" (ss "addition")
|
||
{
|
||
s : Str = "in" ++ "addition"
|
||
}
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc42"></A>
|
||
<H2>Grammar architecture</H2>
|
||
<P>
|
||
<a name="secarchitecture"></a>
|
||
</P>
|
||
<A NAME="toc43"></A>
|
||
<H3>Extending a grammar</H3>
|
||
<P>
|
||
The module system of GF makes it possible to write a new module that <B>extend</B>s
|
||
an old one. The syntax of extension is
|
||
shown by the following example. We extend <CODE>Food</CODE> into <CODE>MoreFood</CODE> by
|
||
adding a category of questions and two new functions.
|
||
</P>
|
||
<PRE>
|
||
abstract Morefood = Food ** {
|
||
cat
|
||
Question ;
|
||
fun
|
||
QIs : Item -> Quality -> Question ;
|
||
Pizza : Kind ;
|
||
|
||
}
|
||
</PRE>
|
||
<P>
|
||
Parallel to the abstract syntax, extensions can
|
||
be built for concrete syntaxes:
|
||
</P>
|
||
<PRE>
|
||
concrete MorefoodEng of Morefood = FoodEng ** {
|
||
lincat
|
||
Question = {s : Str} ;
|
||
lin
|
||
QIs item quality = {s = "is" ++ item.s ++ quality.s} ;
|
||
Pizza = {s = "pizza"} ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
The effect of extension is that all of the contents of the extended
|
||
and extending module are put together. We also say that the new
|
||
module <B>inherits</B> the contents of the old module.
|
||
</P>
|
||
<P>
|
||
At the same time as extending a module of the same type, a concrete
|
||
syntax module may open resources. Since <CODE>open</CODE> takes effect in
|
||
the module body and not in the extended module, its logical place
|
||
in the module header is after the extend part:
|
||
</P>
|
||
<PRE>
|
||
concrete MorefoodIta of Morefood = FoodIta ** open StringOper in {
|
||
lincat
|
||
Question = SS ;
|
||
lin
|
||
QIs item quality = ss (item.s ++ "<22>" ++ quality.s) ;
|
||
Pizza = ss "pizza" ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
Resource modules can extend other resource modules, in the
|
||
same way as modules of other types can extend modules of the
|
||
same type. Thus it is possible to build resource hierarchies.
|
||
</P>
|
||
<A NAME="toc44"></A>
|
||
<H3>Multiple inheritance</H3>
|
||
<P>
|
||
Specialized vocabularies can be represented as small grammars that
|
||
only do "one thing" each. For instance, the following are grammars
|
||
for fruit and mushrooms
|
||
</P>
|
||
<PRE>
|
||
abstract Fruit = {
|
||
cat Fruit ;
|
||
fun Apple, Peach : Fruit ;
|
||
}
|
||
|
||
abstract Mushroom = {
|
||
cat Mushroom ;
|
||
fun Cep, Agaric : Mushroom ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
They can afterwards be combined into bigger grammars by using
|
||
<B>multiple inheritance</B>, i.e. extension of several grammars at the
|
||
same time:
|
||
</P>
|
||
<PRE>
|
||
abstract Foodmarket = Food, Fruit, Mushroom ** {
|
||
fun
|
||
FruitKind : Fruit -> Kind ;
|
||
MushroomKind : Mushroom -> Kind ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
The main advantages with splitting a grammar to modules are
|
||
<B>reusability</B>, <B>separate compilation</B>, and <B>division of labour</B>.
|
||
Reusability means
|
||
that one and the same module can be put into different uses; for instance,
|
||
a module with mushroom names might be used in a mycological information system
|
||
as well as in a restaurant phrasebook. Separate compilation means that a module
|
||
once compiled into <CODE>.gfc</CODE> need not be compiled again unless changes have
|
||
taken place.
|
||
Division of labour means simply that programmers that are experts in
|
||
special areas can work on modules belonging to those areas.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Refactor <CODE>Food</CODE> by taking apart <CODE>Wine</CODE> into a special
|
||
<CODE>Drink</CODE> module.
|
||
</P>
|
||
<A NAME="toc45"></A>
|
||
<H3>Visualizing module structure</H3>
|
||
<P>
|
||
When you have created all the abstract syntaxes and
|
||
one set of concrete syntaxes needed for <CODE>Foodmarket</CODE>,
|
||
your grammar consists of eight GF modules. To see how their
|
||
dependences look like, you can use the command
|
||
<CODE>visualize_graph = vg</CODE>,
|
||
</P>
|
||
<PRE>
|
||
> visualize_graph
|
||
</PRE>
|
||
<P>
|
||
and the graph will pop up in a separate window:
|
||
</P>
|
||
<P>
|
||
<IMG ALIGN="middle" SRC="foodmarket.png" BORDER="0" ALT="">
|
||
</P>
|
||
<P>
|
||
The graph uses
|
||
</P>
|
||
<UL>
|
||
<LI>oval boxes for abstract modules
|
||
<LI>square boxes for concrete modules
|
||
<LI>black-headed arrows for inheritance
|
||
<LI>white-headed arrows for the concrete-of-abstract relation
|
||
</UL>
|
||
|
||
<P>
|
||
Just as the <CODE>visualize_tree = vt</CODE> command, the freely available tools
|
||
Ghostview and Graphviz are needed. As an alternative, you can again print
|
||
the graph into a <CODE>.dot</CODE> file by using the command <CODE>print_multi = pm</CODE>:
|
||
</P>
|
||
<PRE>
|
||
> print_multi -printer=graph | write_file Foodmarket.dot
|
||
> ! dot -Tpng Foodmarket.dot > Foodmarket.png
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc46"></A>
|
||
<H2>Summary of GF language features</H2>
|
||
<A NAME="toc47"></A>
|
||
<H3>Modules</H3>
|
||
<P>
|
||
The general form of a module is
|
||
<center>
|
||
<I>Moduletype</I> <I>M</I> <I>Of</I> <CODE>=</CODE> (<I>Extends</I> <CODE>**</CODE>)? (<CODE>open</CODE> <I>Opens</I> <CODE>in</CODE>)? <I>Body</I>
|
||
</center>
|
||
where <I>Moduletype</I> is one of <CODE>abstract</CODE>, <CODE>concrete</CODE>, and <CODE>resource</CODE>.
|
||
</P>
|
||
<P>
|
||
If <I>Moduletype</I> is <CODE>concrete</CODE>, the <I>Of</I>-part has the form <CODE>of</CODE> <I>A</I>,
|
||
where <I>A</I> is the name of an abstract module. Otherwise it is empty.
|
||
</P>
|
||
<P>
|
||
The name of the module is given by the identifier <I>M</I>.
|
||
</P>
|
||
<P>
|
||
The optional <I>Extends</I> part is a comma-separated
|
||
list of module names, which have to be modules of
|
||
the same <I>Moduletype</I>. The contents of these modules are <B>inherited</B> by
|
||
<I>M</I>. This means that they are both usable in <I>Body</I> and exported by <I>M</I>,
|
||
i.e. inherited when <I>M</I> is inherited and available when <I>M</I> is opened.
|
||
(Exception: <CODE>oper</CODE> and <CODE>param</CODE> judgements are not inherited from
|
||
<CODE>concrete</CODE> modules.)
|
||
</P>
|
||
<P>
|
||
The optional <I>Opens</I> part is a comma-separated
|
||
list of resource module names. The contents of these
|
||
modules are usable in the <I>Body</I>, but they are not exported.
|
||
</P>
|
||
<P>
|
||
Opening can be <B>qualified</B>, e.g.
|
||
</P>
|
||
<PRE>
|
||
concrete C of A = open (P = Prelude) in ...
|
||
</PRE>
|
||
<P>
|
||
This means that the names from <CODE>Prelude</CODE> are only available in the form
|
||
<CODE>P.name</CODE>. This form of qualifying a name is always possible, and it can
|
||
be used to resolve <B>name conflicts</B>, which result when the same name is
|
||
declared in more than one module that is in scope.
|
||
</P>
|
||
<A NAME="toc48"></A>
|
||
<H3>Judgements</H3>
|
||
<P>
|
||
The <I>Body</I> part consists of judgements. The judgement form table #secjment
|
||
is extended with the following forms:
|
||
</P>
|
||
<TABLE ALIGN="center" CELLPADDING="4" BORDER="1">
|
||
<TR>
|
||
<TH>form</TH>
|
||
<TH>reading</TH>
|
||
<TH COLSPAN="2">module type</TH>
|
||
</TR>
|
||
<TR>
|
||
<TD ALIGN="center"><CODE>oper</CODE> <I>h</I> <CODE>:</CODE> <I>T</I> <CODE>=</CODE> <I>t</I></TD>
|
||
<TD>operation <I>h</I> of type <I>T</I> is defined as <I>t</I></TD>
|
||
<TD>resource, concrete</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD ALIGN="right"><CODE>param</CODE> <I>P</I> <CODE>=</CODE> <I>C1</I> <CODE>|</CODE> ... <CODE>|</CODE> <I>Cn</I></TD>
|
||
<TD>parameter type P has constructors <I>C1...Cn</I></TD>
|
||
<TD>resource, concrete</TD>
|
||
</TR>
|
||
</TABLE>
|
||
|
||
<P></P>
|
||
<P>
|
||
The <CODE>param</CODE> judgement will be explained in the next chapter.
|
||
</P>
|
||
<P>
|
||
The type part of an <CODE>oper</CODE> judgement can be omitted, if the type can be inferred
|
||
by the GF compiler.
|
||
</P>
|
||
<PRE>
|
||
oper hello = "hello" ++ "world" ;
|
||
</PRE>
|
||
<P>
|
||
As a rule, type inference works for all terms except lambda abstracts.
|
||
</P>
|
||
<P>
|
||
<B>Lambda abstracts</B> are expressions of the form <CODE>\</CODE><I>x</I> <CODE>-></CODE> <I>t</I>,
|
||
where <I>x</I> is a variable <B>bound</B> in the expression <I>t</I>, which is the
|
||
<B>body</B> of the lambda abstract. The type of the lambda abstract is
|
||
<I>A</I> <CODE>-></CODE><I>B</I>, where <I>A</I> is the type of the variable <CODE>x</CODE> and
|
||
<I>B</I> the type of the body <I>t</I>.
|
||
</P>
|
||
<P>
|
||
For multiple lambda abstractions, there is a shorthand
|
||
</P>
|
||
<PRE>
|
||
\x,y -> t === \x -> \y -> t
|
||
</PRE>
|
||
<P>
|
||
For <CODE>lin</CODE> judgements, there is the shorthand
|
||
</P>
|
||
<PRE>
|
||
lin f x = t === lin f = \x -> t
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc49"></A>
|
||
<H3>Free variation</H3>
|
||
<P>
|
||
The <CODE>variants</CODE> construct of GF can be used to give a list of
|
||
concrete syntax terms, of the same type, in free variation. For example,
|
||
</P>
|
||
<PRE>
|
||
variants {["does not"] ; "doesn't"}
|
||
</PRE>
|
||
<P>
|
||
A limiting case is the empty variant list <CODE>variants {}</CODE>.
|
||
</P>
|
||
<A NAME="toc50"></A>
|
||
<H3>The context-free grammar format</H3>
|
||
<P>
|
||
The <CODE>.cf</CODE> file format is used for <B>context-free grammars</B>, which are
|
||
always interpretable as GF grammars. Files of this format consist of
|
||
rules of the form
|
||
<center>
|
||
(<I>Label</I> <CODE>.</CODE>)? <I>Cat</I> <CODE>::=</CODE> <I>RHS</I> <CODE>;</CODE>
|
||
</center>
|
||
where the <I>RHS</I> is a sequence of terminals (quoted strings) and
|
||
nonterminals (identifiers). The optional <I>Label</I> gives the abstract
|
||
syntax function created. If it is omitted, a function name is generated
|
||
automatically. Then it is also possible to have more than one <I>RHS</I>,
|
||
separated by <I>|</I>. An empty <I>RHS</I> is interpreted as an empty sequence
|
||
of terminals, not as an empty disjunction.
|
||
</P>
|
||
<P>
|
||
The <B>Extended BNF</B> format (<B>EBNF</B>) can also be used, in files suffixed <CODE>.ebnf</CODE>.
|
||
This format does not allow user-written labels. The right-hand-side of a rule
|
||
can contain everything that is possible in the <CODE>.cf</CODE> format, but also
|
||
optional parts (<CODE>p ?</CODE>), sequences (<CODE>p *</CODE>) and non-empty sequences (<CODE>p +</CODE>).
|
||
For example, the phrases in <CODE>FoodEng</CODE> could be recognized with the following
|
||
EBNF grammar:
|
||
</P>
|
||
<PRE>
|
||
Phrase ::=
|
||
("this" | "that") Quality* ("wine" | "cheese" | "fish") "is" Quality ;
|
||
Quality ::=
|
||
("very"* ("fresh" | "warm" | "boring" | "Italian" | "expensive")) ;
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc51"></A>
|
||
<H3>Character encoding</H3>
|
||
<P>
|
||
The default encoding is iso-latin-1. UTF-8 can be set by the flag <CODE>coding=utf8</CODE>
|
||
in the grammar. The resource grammar libraries are in iso-latin-1, except Russian
|
||
and Arabic, which are in UTF-8. The resources may be changed to UTF-8 in future.
|
||
Letters in identifiers must currently be iso-latin-1.
|
||
</P>
|
||
<A NAME="toc52"></A>
|
||
<H1>Grammars with parameters</H1>
|
||
<P>
|
||
<a name="chapfour"></a>
|
||
</P>
|
||
<P>
|
||
In this chapter, we will introduce the techniques needed for
|
||
describing the inflection of words, as well as the rules by
|
||
which correct word forms are selected in syntactic combinations.
|
||
These techniques are already needed in a very slight extension
|
||
of the Food grammar of the previous chapter. While explaining
|
||
how the linguistic problems are solved for English and Italian,
|
||
we also cover all the language constructs GF has for
|
||
defining concrete syntax.
|
||
</P>
|
||
<P>
|
||
It is in principle possible to skip this chapter and go directly
|
||
to the next, since the use of the GF Resource Grammar library
|
||
makes it unnecessary to use any more constructs of GF than we
|
||
have already covered: parameters could be left to library implementors.
|
||
</P>
|
||
<A NAME="toc53"></A>
|
||
<H2>The problem: words have to be inflected</H2>
|
||
<P>
|
||
Suppose we want to say, with the vocabulary included in
|
||
<CODE>Food.gf</CODE>, things like
|
||
<center>
|
||
<I>these Italian wines are delicious</I>
|
||
</center>
|
||
The new grammatical facility we need are the plural forms
|
||
of nouns and verbs (<I>wines, are</I>), as opposed to their
|
||
singular forms.
|
||
</P>
|
||
<P>
|
||
The introduction of plural forms requires two things:
|
||
</P>
|
||
<UL>
|
||
<LI>the <B>inflection</B> of nouns and verbs in singular and plural
|
||
<LI>the <B>agreement</B> of the verb to subject:
|
||
the verb must have the same number as the subject
|
||
</UL>
|
||
|
||
<P>
|
||
Different languages have different types of inflection and agreement.
|
||
For instance, Italian has also agreement in gender (masculine vs. feminine).
|
||
In a multilingual grammar,
|
||
we want to express such differences between languages in the
|
||
concrete syntax while ignoring them in the abstract syntax.
|
||
</P>
|
||
<P>
|
||
To be able to do all this, we need one new judgement form
|
||
and some new expression forms.
|
||
We also need to generalize linearization types
|
||
from strings to more complex types.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Make a list of the possible forms that nouns,
|
||
adjectives, and verbs can have in some languages that you know.
|
||
</P>
|
||
<A NAME="toc54"></A>
|
||
<H2>Parameters and tables</H2>
|
||
<P>
|
||
We define the <B>parameter type</B> of number in English by
|
||
using a new form of judgement:
|
||
</P>
|
||
<PRE>
|
||
param Number = Sg | Pl ;
|
||
</PRE>
|
||
<P>
|
||
This judgement defines the parameter type <CODE>Number</CODE> by listing
|
||
its two <B>constructors</B>, <CODE>Sg</CODE> and <CODE>Pl</CODE> (common shorthands for
|
||
singular and plural).
|
||
</P>
|
||
<P>
|
||
To state that <CODE>Kind</CODE> expressions in English have a linearization
|
||
depending on number, we replace the linearization type <CODE>{s : Str}</CODE>
|
||
with a type where the <CODE>s</CODE> field is a <B>table</B> depending on number:
|
||
</P>
|
||
<PRE>
|
||
lincat Kind = {s : Number => Str} ;
|
||
</PRE>
|
||
<P>
|
||
The <B>table type</B> <CODE>Number => Str</CODE> is in many respects similar to
|
||
a function type (<CODE>Number -> Str</CODE>). The main difference is that the
|
||
argument type of a table type must always be a parameter type. This means
|
||
that the argument-value pairs can be listed in a finite table. The following
|
||
example shows such a table:
|
||
</P>
|
||
<PRE>
|
||
lin Cheese = {
|
||
s = table {
|
||
Sg => "cheese" ;
|
||
Pl => "cheeses"
|
||
}
|
||
} ;
|
||
</PRE>
|
||
<P>
|
||
The table consists of <B>branches</B>, where a <B>pattern</B> on the
|
||
left of the arrow <CODE>=></CODE> is assigned a <B>value</B> on the right.
|
||
</P>
|
||
<P>
|
||
The application of a table to a parameter is done by the <B>selection</B>
|
||
operator <CODE>!</CODE>, which is computed by <B>pattern matching</B>: it returns
|
||
the value from the first branch whose pattern matches the
|
||
selection argument. For instance,
|
||
</P>
|
||
<PRE>
|
||
table {Sg => "cheese" ; Pl => "cheeses"} ! Pl
|
||
===> "cheeses"
|
||
</PRE>
|
||
<P>
|
||
As syntactic sugar for table selections, we can define the
|
||
<B>case expressions</B>, which are common in functional programming and also
|
||
handy to use in GF.
|
||
</P>
|
||
<PRE>
|
||
case e of {...} === table {...} ! e
|
||
</PRE>
|
||
<P></P>
|
||
<P>
|
||
A parameter type can have any number of constructors, and these can
|
||
also take arguments from other parameter types. For instance, an accurate
|
||
type system for English verbs (except <I>be</I>) is
|
||
</P>
|
||
<PRE>
|
||
param VerbForm = VPresent Number | VPast | VPastPart | VPresPart ;
|
||
</PRE>
|
||
<P>
|
||
This system expresses accurately the fact that only the present tense has
|
||
number variation. (Agreement also requires variation in person, but
|
||
this can be defined in syntax rules, by picking the singular form for third person
|
||
singular subjects and the plural forms for all others). As an example of
|
||
a table, here are the forms of the verb <I>drink</I>:
|
||
</P>
|
||
<PRE>
|
||
table {
|
||
VPresent Sg => "drinks" ;
|
||
VPresent Pl => "drink" ;
|
||
VPast => "drank" ;
|
||
VPastPart => "drunk" ;
|
||
VPresPart => "drinking"
|
||
}
|
||
</PRE>
|
||
<P></P>
|
||
<P>
|
||
<B>Exercise</B>. In an earlier exercise (previous section),
|
||
you made a list of the possible
|
||
forms that nouns, adjectives, and verbs can have in some languages that
|
||
you know. Now take some of the results and implement them by
|
||
using parameter type definitions and tables. Write them into a <CODE>resource</CODE>
|
||
module, which you can test by using the command <CODE>compute_concrete</CODE>.
|
||
</P>
|
||
<A NAME="toc55"></A>
|
||
<H2>Inflection tables and paradigms</H2>
|
||
<P>
|
||
All English common nouns are inflected for number, most of them in the
|
||
same way: the plural form is obtained from the singular by adding the
|
||
ending <I>s</I>. This rule is an example of
|
||
a <B>paradigm</B> --- a formula telling how a class of words is inflected.
|
||
</P>
|
||
<P>
|
||
From the GF point of view, a paradigm is a function that takes
|
||
a <B>lemma</B> --- also known as a <B>dictionary form</B> or a <B>citation form</B> --- and
|
||
returns an inflection
|
||
table of desired type. Paradigms are not functions in the sense of the
|
||
<CODE>fun</CODE> judgements of abstract syntax (which operate on trees and not
|
||
on strings), but operations defined in <CODE>oper</CODE> judgements.
|
||
The following operation defines the regular noun paradigm of English:
|
||
</P>
|
||
<PRE>
|
||
oper regNoun : Str -> {s : Number => Str} = \dog -> {
|
||
s = table {
|
||
Sg => dog ;
|
||
Pl => dog + "s"
|
||
}
|
||
} ;
|
||
</PRE>
|
||
<P>
|
||
The <B>gluing</B> operator <CODE>+</CODE> tells that
|
||
the string held in the variable <CODE>dog</CODE> and the ending <CODE>"s"</CODE>
|
||
are written together to form one <B>token</B>. Thus, for instance,
|
||
</P>
|
||
<PRE>
|
||
(regNoun "cheese").s ! Pl ===> "cheese" + "s" ===> "cheeses"
|
||
</PRE>
|
||
<P>
|
||
A more complex example are regular verbs:
|
||
</P>
|
||
<PRE>
|
||
oper regVerb : Str -> {s : VerbForm => Str} = \talk -> {
|
||
s = table {
|
||
VPresent Sg => talk + "s" ;
|
||
VPresent Pl => talk ;
|
||
VPresPart => talk + "ing" ;
|
||
_ => talk + "ed"
|
||
}
|
||
} ;
|
||
</PRE>
|
||
<P>
|
||
Notice how a catch-all case for the past tense and the past participle
|
||
is expressed by using a <B>wild card</B> pattern <CODE>_</CODE>. Here again, pattern matching
|
||
tries all patterns in order until it finds a matching pattern;
|
||
and it is the wild card that is the first match for both <CODE>VPast</CODE> and
|
||
<CODE>VPastPart</CODE>.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Identify cases in which the <CODE>regNoun</CODE> paradigm does not
|
||
apply in English, and implement some alternative paradigms.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Implement some regular paradigms for other languages you have
|
||
considered in earlier exercises.
|
||
</P>
|
||
<A NAME="toc56"></A>
|
||
<H2>Using parameters in concrete syntax</H2>
|
||
<P>
|
||
We can now enrich the concrete syntax definitions to
|
||
comprise morphology. This will permit a more radical
|
||
variation between languages (e.g. English and Italian)
|
||
than just the use of different words. In general,
|
||
parameters and linearization types are different in
|
||
different languages --- but this does not prevent using a
|
||
the common abstract syntax.
|
||
</P>
|
||
<P>
|
||
We consider a grammar <CODE>Foods</CODE>, which is similar to
|
||
<CODE>Food</CODE>, with the addition two rules for forming plural items:
|
||
</P>
|
||
<PRE>
|
||
fun These, Those : Kind -> Item ;
|
||
</PRE>
|
||
<P>
|
||
We also add a noun which in Italian has the feminine case; all nouns in
|
||
<CODE>Food</CODE> were carefully chosen to be masculine!
|
||
</P>
|
||
<PRE>
|
||
fun Pizza : Kind ;
|
||
</PRE>
|
||
<P>
|
||
This noun will force us to deal with gender in the Italian grammar,
|
||
which is what we need for the grammar to scale up for larger applications.
|
||
</P>
|
||
<A NAME="toc57"></A>
|
||
<H3>Agreement</H3>
|
||
<P>
|
||
In the English <CODE>Foods</CODE> grammar, we need just one type of parameters:
|
||
<CODE>Number</CODE> as defined above. The phrase-forming rule
|
||
</P>
|
||
<PRE>
|
||
fun Is : Item -> Quality -> Phrase ;
|
||
</PRE>
|
||
<P>
|
||
is affected by the number because of <B>subject-verb agreement</B>.
|
||
In English, agreement says that the verb of a sentence
|
||
must be inflected in the number of the subject. Thus we will linearize
|
||
</P>
|
||
<PRE>
|
||
Is (This Pizza) Warm ===> "this pizza is warm"
|
||
Is (These Pizza) Warm ===> "these pizzas are warm"
|
||
</PRE>
|
||
<P>
|
||
Here it is the <B>copula</B>, i.e. the verb <I>be</I> that is affected. We define
|
||
the copula as the operation
|
||
</P>
|
||
<PRE>
|
||
oper copula : Number -> Str = \n ->
|
||
case n of {
|
||
Sg => "is" ;
|
||
Pl => "are"
|
||
} ;
|
||
</PRE>
|
||
<P>
|
||
We don't need to inflect the copula for person and tense in this grammar.
|
||
</P>
|
||
<P>
|
||
The form of the copula in a sentence depends on the
|
||
<B>subject</B> of the sentence, i.e. the item
|
||
that is qualified. This means that an <CODE>Item</CODE> must have such a number to provide.
|
||
The obvious way to guarantee this is by including a number field in
|
||
the linearization type:
|
||
</P>
|
||
<PRE>
|
||
lincat Item = {s : Str ; n : Number} ;
|
||
</PRE>
|
||
<P>
|
||
Now we can write precisely the <CODE>Is</CODE> rule that expresses agreement:
|
||
</P>
|
||
<PRE>
|
||
lin Is item qual = {s = item.s ++ copula item.n ++ qual.s} ;
|
||
</PRE>
|
||
<P>
|
||
The copula receives the number that it needs from the subject item.
|
||
</P>
|
||
<A NAME="toc58"></A>
|
||
<H3>Determiners</H3>
|
||
<P>
|
||
Let us turn to <CODE>Item</CODE> subjects and see how they receive their
|
||
numbers. The two rules
|
||
</P>
|
||
<PRE>
|
||
fun This, These : Kind -> Item ;
|
||
</PRE>
|
||
<P>
|
||
form <CODE>Item</CODE>s from <CODE>Kind</CODE>s by adding <B>determiners</B>, either
|
||
<I>this</I> or <I>these</I>. The determiners
|
||
require different numbers of their <CODE>Kind</CODE> arguments: <CODE>This</CODE>
|
||
requires the singular (<I>this pizza</I>) and <CODE>These</CODE> the plural
|
||
(<I>these pizzas</I>). The <CODE>Kind</CODE> is the same in both cases: <CODE>Pizza</CODE>.
|
||
Thus a <CODE>Kind</CODE> must have both singular and plural forms.
|
||
The obvious way to express this is by using a table:
|
||
</P>
|
||
<PRE>
|
||
lincat Kind = {s : Number => Str} ;
|
||
</PRE>
|
||
<P>
|
||
The linearization rules for <CODE>This</CODE> and <CODE>These</CODE> can now be written
|
||
</P>
|
||
<PRE>
|
||
lin This kind = {
|
||
s = "this" ++ kind.s ! Sg ;
|
||
n = Sg
|
||
} ;
|
||
|
||
lin These kind = {
|
||
s = "these" ++ kind.s ! Pl ;
|
||
n = Pl
|
||
} ;
|
||
</PRE>
|
||
<P>
|
||
The grammatical relation between the determiner and the noun is similar to
|
||
agreement, but due to some differences into which we don't go here
|
||
it is often called <B>government</B>.
|
||
</P>
|
||
<P>
|
||
Since the same pattern for determination is used four times in
|
||
the <CODE>FoodsEng</CODE> grammar, we codify it as an operation,
|
||
</P>
|
||
<PRE>
|
||
oper det :
|
||
Str -> Number -> {s : Number => Str} -> {s : Str ; n : Number} =
|
||
\det,n,kind -> {
|
||
s = det ++ kind.s ! n ;
|
||
n = n
|
||
} ;
|
||
</PRE>
|
||
<P>
|
||
Now we can write, for instance,
|
||
</P>
|
||
<PRE>
|
||
lin This = det Sg "this" ;
|
||
lin These = det Pl "these" ;
|
||
</PRE>
|
||
<P>
|
||
Notice the order of arguments that permits partial
|
||
application (<a href="#secpartapp">here</a>).
|
||
</P>
|
||
<P>
|
||
In a more <B>lexicalized</B> grammar, determiners would be made into a
|
||
category of their own and given an inherent number:
|
||
</P>
|
||
<PRE>
|
||
lincat Det = {s : Str ; n : Number} ;
|
||
fun Det : Det -> Kind -> Item ;
|
||
lin Det det kind = {
|
||
s = det.s ++ kind.s ! det.n ;
|
||
n = det.n
|
||
} ;
|
||
</PRE>
|
||
<P>
|
||
Linguistically motivated grammars, such as the GF resource grammars,
|
||
usually favour lexicalized treatments of words; see <a href="#seclexical">here</a> below.
|
||
Notice that the fields of the record in <CODE>Det</CODE> are precisely the two
|
||
arguments needed in the <CODE>det</CODE> operation.
|
||
</P>
|
||
<A NAME="toc59"></A>
|
||
<H3>Parametric vs. inherent features</H3>
|
||
<P>
|
||
<CODE>Kind</CODE>s, as in general <B>common nouns</B> in English, have both singular
|
||
and plural forms; what form is chosen is determined by the construction
|
||
in which the noun is used. We say that the number is a
|
||
<B>parametric feature</B> of nouns. In GF, parametric features
|
||
appear as argument types of tables in linearization types.
|
||
</P>
|
||
<PRE>
|
||
lincat Kind = {s : Number => Str} ;
|
||
</PRE>
|
||
<P>
|
||
<CODE>Item</CODE>s, as in general <B>noun phrases</B> in English, don't
|
||
have variation in number. The number is instead an <B>inherent feature</B>,
|
||
which the noun phrase passes to the verb. In GF, inherent features
|
||
appear as record fields in linearization types.
|
||
</P>
|
||
<PRE>
|
||
lincat Item = {s : Str ; n : Number} ;
|
||
</PRE>
|
||
<P>
|
||
A category can have both parametric and inherent features. As we will see
|
||
in the Italian <CODE>Foods</CODE> grammar, nouns have parametric number and
|
||
inherent gender:
|
||
</P>
|
||
<PRE>
|
||
lincat Kind = {s : Number => Str ; g : Gender} ;
|
||
</PRE>
|
||
<P>
|
||
Nothing prevents the same parameter type from appearing both
|
||
as parametric and inherent feature, or the appearance of several inherent
|
||
features of the same type, etc. Determining the linearization types
|
||
of categories is one of the most crucial steps in the design of a GF
|
||
grammar. These two conditions must be in balance:
|
||
</P>
|
||
<UL>
|
||
<LI>existence: what forms are possible to build by morphological and
|
||
other means?
|
||
<LI>need: what features are expected via agreement or government?
|
||
</UL>
|
||
|
||
<P>
|
||
Grammar books and dictionaries give good advice on existence; for instance,
|
||
an Italian dictionary has entries such as
|
||
<center>
|
||
<B>uomo</B>, pl. <I>uomini</I>, n.m. "man"
|
||
</center>
|
||
which tells that <I>uomo</I> is a masculine noun with the plural form <I>uomini</I>.
|
||
From this alone, or with a couple more examples, we can generalize to the type
|
||
for all nouns in Italian: they have both singular and plural forms and thus
|
||
a parametric number, and they have an inherent gender.
|
||
</P>
|
||
<P>
|
||
The distinction between parametric and inherent features can be stated in
|
||
object-oriented programming terms: a linearization type is like a <B>class</B>,
|
||
which has a <B>method</B> for linearization and also some <B>attributes</B>.
|
||
In this class, the parametric features appear as arguments to the
|
||
linearization method, whereas the inherent features appear as attributes.
|
||
</P>
|
||
<P>
|
||
For words, inherent features are usually given <I>ad hoc</I> as lexical information.
|
||
For combinations, they are typically <I>inherited</I> from some part of the construction.
|
||
For instance, qualified noun constructs in Italian inherit their gender from noun part
|
||
(called the <B>head</B> of the construction in linguistics):
|
||
</P>
|
||
<PRE>
|
||
lin QKind qual kind =
|
||
let gen = kind.g in {
|
||
s = table {n => kind.s ! n ++ qual.s ! gen ! n} ;
|
||
g = gen
|
||
} ;
|
||
</PRE>
|
||
<P>
|
||
This rule uses a <B>local definition</B> (also known as a <B>let expression</B>) to
|
||
avoid computing <CODE>kind.g</CODE> twice, and also to express the linguistic
|
||
generalization that it is the same gender that is both passed to
|
||
the adjective and inherited by the construct.
|
||
The parametric number feature is in this rule passed to both the noun and
|
||
the adjective. In the table, a <B>variable pattern</B> is used to match
|
||
any possible number. Variables introduced in patterns are in scope in
|
||
the right-hand sides of corresponding branches. Again, it is good to
|
||
use a variable to express the linguistic generalization that the number
|
||
is passed to the parts, rather than expand the table into <CODE>Sg</CODE> and <CODE>Pl</CODE>
|
||
branches.
|
||
</P>
|
||
<P>
|
||
Sometimes the puzzle of making agreement and government work in a grammar has
|
||
several solutions. For instance, <B>precedence</B> in programming languages can
|
||
be equivalently described by a parametric or an inherent feature
|
||
(see <a href="#secprecedence">here</a> below).
|
||
</P>
|
||
<P>
|
||
In natural language applications that use the resource grammar library,
|
||
all parameters are hidden from the user, who thereby does not need to bother
|
||
about them. The only thing that she has to think about is what linguistic
|
||
categories are given as linearization types to each semantic category.
|
||
</P>
|
||
<P>
|
||
For instance, the GF resource grammar library has a category <CODE>NP</CODE> of
|
||
noun phrases, <CODE>AP</CODE> of adjectival phrases, and <CODE>Cl</CODE> of sentence-like clauses.
|
||
In the implementation of <CODE>Foods</CODE> <a href="#secenglish">here</a>, we will define
|
||
</P>
|
||
<PRE>
|
||
lincat Phrase = Cl ; Item = NP ; Quality = AP ;
|
||
</PRE>
|
||
<P>
|
||
To express that an item has a quality, we will use a resource function
|
||
</P>
|
||
<PRE>
|
||
mkCl : NP -> AP -> Cl ;
|
||
</PRE>
|
||
<P>
|
||
in the linearization rule:
|
||
</P>
|
||
<PRE>
|
||
lin Is = mkCl ;
|
||
</PRE>
|
||
<P>
|
||
In this way, we have no need to think about parameters and agreement.
|
||
<a href="#chapfive">the fifth chapter</a> will show a complete implementation of <CODE>Foods</CODE> by the
|
||
resource grammar, port it to many new languages, and extend it with
|
||
many new constructs.
|
||
</P>
|
||
<A NAME="toc60"></A>
|
||
<H2>An English concrete syntax for Foods with parameters</H2>
|
||
<P>
|
||
We repeat some of the rules above by showing the entire
|
||
module <CODE>FoodsEng</CODE>, equipped with parameters. The parameters and
|
||
operations are, for the sake of brevity, included in the same module
|
||
and not in a separate <CODE>resource</CODE>. However, some string operations
|
||
from the library <CODE>Prelude</CODE> are used.
|
||
</P>
|
||
<PRE>
|
||
--# -path=.:prelude
|
||
|
||
concrete FoodsEng of Foods = open Prelude in {
|
||
|
||
lincat
|
||
S, Quality = SS ;
|
||
Kind = {s : Number => Str} ;
|
||
Item = {s : Str ; n : Number} ;
|
||
|
||
lin
|
||
Is item quality = ss (item.s ++ copula item.n ++ quality.s) ;
|
||
This = det Sg "this" ;
|
||
That = det Sg "that" ;
|
||
These = det Pl "these" ;
|
||
Those = det Pl "those" ;
|
||
QKind quality kind = {s = table {n => quality.s ++ kind.s ! n}} ;
|
||
Wine = regNoun "wine" ;
|
||
Cheese = regNoun "cheese" ;
|
||
Fish = noun "fish" "fish" ;
|
||
Pizza = regNoun "pizza" ;
|
||
Very = prefixSS "very" ;
|
||
Fresh = ss "fresh" ;
|
||
Warm = ss "warm" ;
|
||
Italian = ss "Italian" ;
|
||
Expensive = ss "expensive" ;
|
||
Delicious = ss "delicious" ;
|
||
Boring = ss "boring" ;
|
||
|
||
param
|
||
Number = Sg | Pl ;
|
||
|
||
oper
|
||
det : Number -> Str -> {s : Number => Str} -> {s : Str ; n : Number} =
|
||
\n,d,cn -> {
|
||
s = d ++ cn.s ! n ;
|
||
n = n
|
||
} ;
|
||
noun : Str -> Str -> {s : Number => Str} =
|
||
\man,men -> {s = table {
|
||
Sg => man ;
|
||
Pl => men
|
||
}
|
||
} ;
|
||
regNoun : Str -> {s : Number => Str} =
|
||
\car -> noun car (car + "s") ;
|
||
copula : Number -> Str =
|
||
\n -> case n of {
|
||
Sg => "is" ;
|
||
Pl => "are"
|
||
} ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
To find the Prelude library --- or in general,
|
||
GF files located in other directories, a <B>path directive</B> is needed
|
||
either on the command line or as the first line of
|
||
the topmost file compiled.
|
||
The paths in the path list are separated by colons (<CODE>:</CODE>), and every item
|
||
is interpreted primarily relative to the current directory and, secondarily,
|
||
to the value of <CODE>GF_LIB_PATH</CODE> (<B>GF library path</B>). Hence it is a
|
||
good idea to make <CODE>GF_LIB_PATH</CODE> to point into your <CODE>GF/lib/</CODE> whenever
|
||
you start working in GF. For instance, in the Bash shell this is done by
|
||
</P>
|
||
<PRE>
|
||
% export GF_LIB_PATH=<the location of GF/lib in your file system>
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc61"></A>
|
||
<H2>More on inflection paradigms</H2>
|
||
<P>
|
||
<a name="secinflection"></a>
|
||
</P>
|
||
<P>
|
||
Let us try to extend the English noun paradigms so that we can
|
||
deal with all nouns, not just the regular ones. The goal is to
|
||
provide a morphology module that is maximally easy to use when
|
||
words are added to the lexicon. In fact, we can think of a
|
||
division of labour where a linguistically trained grammarian
|
||
writes a morphology and hands it over to the lexicon writer
|
||
who knows much less about the rules of inflection.
|
||
</P>
|
||
<P>
|
||
In passing, we will introduce some new GF constructs: local definitions,
|
||
regular expression patterns, and operation overloading.
|
||
</P>
|
||
<A NAME="toc62"></A>
|
||
<H3>Worst-case functions</H3>
|
||
<P>
|
||
To start with, it is useful to perform <B>data abstraction</B> from the type
|
||
of nouns by writing a constructor operation, a <B>worst-case function</B>:
|
||
</P>
|
||
<PRE>
|
||
oper mkNoun : Str -> Str -> Noun = \x,y -> {
|
||
s = table {
|
||
Sg => x ;
|
||
Pl => y
|
||
}
|
||
} ;
|
||
</PRE>
|
||
<P>
|
||
This presupposes that we have defined
|
||
</P>
|
||
<PRE>
|
||
oper Noun : Type = {s : Number => Str} ;
|
||
</PRE>
|
||
<P>
|
||
Using <CODE>mkNoun</CODE>, we can define
|
||
</P>
|
||
<PRE>
|
||
lin Mouse = mkNoun "mouse" "mice" ;
|
||
</PRE>
|
||
<P>
|
||
and
|
||
</P>
|
||
<PRE>
|
||
oper regNoun : Str -> Noun = \x -> mkNoun x (x + "s") ;
|
||
</PRE>
|
||
<P>
|
||
instead of writing the inflection tables explicitly.
|
||
</P>
|
||
<P>
|
||
Nouns like <I>mouse</I>-<I>mice</I>, are so irregular that
|
||
it hardly makes sense to see them as instances of a
|
||
paradigm that forms the plural from the singular form.
|
||
But in general, as we will see, there can be different
|
||
regular patterns in a language.
|
||
</P>
|
||
<P>
|
||
The grammar engineering advantage of worst-case functions is that
|
||
the author of the resource module may change the definitions of
|
||
<CODE>Noun</CODE> and <CODE>mkNoun</CODE>, and still retain the
|
||
interface (i.e. the system of type signatures) that makes it
|
||
correct to use these functions in concrete modules. In programming
|
||
terms, <CODE>Noun</CODE> is then treated as an <B>abstract datatype</B>:
|
||
its definition is not available, but only an indirect way of constructing
|
||
its objects.
|
||
</P>
|
||
<P>
|
||
A case where a change of the <CODE>Noun</CODE> type could
|
||
actually happen is if we introduces <B>case</B> (nominative or
|
||
genitive) in the noun inflection:
|
||
</P>
|
||
<PRE>
|
||
param Case = Nom | Gen ;
|
||
|
||
oper Noun : Type = {s : Number => Case => Str} ;
|
||
</PRE>
|
||
<P>
|
||
Now we have to redefine the worst-case function
|
||
</P>
|
||
<PRE>
|
||
oper mkNoun : Str -> Str -> Noun = \x,y -> {
|
||
s = table {
|
||
Sg => table {
|
||
Nom => x ;
|
||
Gen => x + "'s"
|
||
} ;
|
||
Pl => table {
|
||
Nom => y ;
|
||
Gen => y + case last y of {
|
||
"s" => "'" ;
|
||
_ => "'s"
|
||
}
|
||
}
|
||
} ;
|
||
</PRE>
|
||
<P>
|
||
But up from this level, we can retain the old definitions
|
||
</P>
|
||
<PRE>
|
||
lin Mouse = mkNoun "mouse" "mice" ;
|
||
oper regNoun : Str -> Noun = \x -> mkNoun x (x + "s") ;
|
||
</PRE>
|
||
<P>
|
||
which will just compute to different values now.
|
||
</P>
|
||
<P>
|
||
In the last definition of <CODE>mkNoun</CODE>, we used a case expression
|
||
on the last character of the plural form to decide if the genitive
|
||
should be formed with an <CODE>'</CODE> (as in <I>dogs</I>-<I>dogs'</I>) or with
|
||
<CODE>'s</CODE> (as in <I>mice</I>-<I>mice's</I>). The expression <CODE>last y</CODE>
|
||
uses the <CODE>Prelude</CODE> operation
|
||
</P>
|
||
<PRE>
|
||
last : Str -> Str ;
|
||
</PRE>
|
||
<P>
|
||
The case expression uses <B>pattern matching over strings</B>, which
|
||
is supported in GF, alongside with pattern matching over
|
||
parameters.
|
||
</P>
|
||
<A NAME="toc63"></A>
|
||
<H3>Intelligent paradigms</H3>
|
||
<P>
|
||
Between the completely regular <I>dog</I>-<I>dogs</I> and the completely
|
||
irregular <I>mouse</I>-<I>mice</I>, there are some
|
||
predictable variations:
|
||
</P>
|
||
<UL>
|
||
<LI>nouns ending with an <I>y</I>: <I>fly</I>-<I>flies</I>, except if
|
||
a vowel precedes the <I>y</I>: <I>boy</I>-<I>boys</I>
|
||
<LI>nouns ending with <I>s</I>, <I>ch</I>, and a number of
|
||
other endings: <I>bus</I>-<I>buses</I>, <I>leech</I>-<I>leeches</I>
|
||
</UL>
|
||
|
||
<P>
|
||
One way to deal with them would be to provide alternative paradigms:
|
||
</P>
|
||
<PRE>
|
||
noun_y : Str -> Noun = \fly -> mkNoun fly (init fly + "ies") ;
|
||
noun_s : Str -> Noun = \bus -> mkNoun bus (bus + "es") ;
|
||
</PRE>
|
||
<P>
|
||
The Prelude function <CODE>init</CODE> drops the last character of a token.
|
||
But this solution has some drawbacks:
|
||
</P>
|
||
<UL>
|
||
<LI>it can be difficult to select the correct paradigm
|
||
<LI>it can be difficult to remember the names of the different paradigms
|
||
</UL>
|
||
|
||
<P>
|
||
To help the lexicon builder in this task, the morphology programmer
|
||
can put some intelligence in the regular noun paradigm. The easiest
|
||
way to express this in GF is by the use of <B>regular expression patterns</B>:
|
||
</P>
|
||
<PRE>
|
||
regNoun : Str -> Noun = \w ->
|
||
let
|
||
ws : Str = case w of {
|
||
_ + ("a" | "e" | "i" | "o") + "o" => w + "s" ; -- bamboo
|
||
_ + ("s" | "x" | "sh" | "o") => w + "es" ; -- bus, hero
|
||
_ + "z" => w + "zes" ;-- quiz
|
||
_ + ("a" | "e" | "o" | "u") + "y" => w + "s" ; -- boy
|
||
x + "y" => x + "ies" ;-- fly
|
||
_ => w + "s" -- car
|
||
}
|
||
in
|
||
mkNoun w ws
|
||
</PRE>
|
||
<P>
|
||
In this definition, we have used a local definition just in order to
|
||
structure the code, even though there is no multiple evaluation to eliminate.
|
||
In the case expression itself, we have used
|
||
</P>
|
||
<UL>
|
||
<LI><B>disjunctive patterns</B> <I>P</I> <CODE>|</CODE> <I>Q</I>
|
||
<LI><B>concatenation patterns</B> <I>P</I> <CODE>+</CODE> <I>Q</I>
|
||
</UL>
|
||
|
||
<P>
|
||
The patterns are ordered in such a way that, for instance,
|
||
the suffix <CODE>"oo"</CODE> prevents <I>bamboo</I> from matching the suffix
|
||
<CODE>"o"</CODE>.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. The same rules that form plural nouns in English also
|
||
apply in the formation of third-person singular verbs.
|
||
Write a regular verb paradigm that uses this idea, but first
|
||
rewrite <CODE>regNoun</CODE> so that the analysis needed to build <I>s</I>-forms
|
||
is factored out as a separate <CODE>oper</CODE>, which is shared with
|
||
<CODE>regVerb</CODE>.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Extend the verb paradigms to cover all verb forms
|
||
in English, with special care taken of variations with the suffix
|
||
<I>ed</I> (e.g. <I>try</I>-<I>tried</I>, <I>use</I>-<I>used</I>).
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Implement the German <B>Umlaut</B> operation on word stems.
|
||
The operation changes the vowel of the stressed stem syllable as follows:
|
||
<I>a</I> to <I><EFBFBD></I>, <I>au</I> to <I><EFBFBD>u</I>, <I>o</I> to <I><EFBFBD></I>, and <I>u</I> to <I><EFBFBD></I>. You
|
||
can assume that the operation only takes syllables as arguments. Test the
|
||
operation to see whether it correctly changes <I>Arzt</I> to <I><EFBFBD>rzt</I>,
|
||
<I>Baum</I> to <I>B<EFBFBD>um</I>, <I>Topf</I> to <I>T<EFBFBD>pf</I>, and <I>Kuh</I> to <I>K<EFBFBD>h</I>.
|
||
</P>
|
||
<A NAME="toc64"></A>
|
||
<H3>Function types with variables</H3>
|
||
<P>
|
||
In <a href="#chapsix">the sixth chapter</a>, we will introduce <B>dependent function types</B>, where
|
||
the value type depends on the argument. For this end, we need a notation
|
||
that binds a variable to the argument type, as in
|
||
</P>
|
||
<PRE>
|
||
switchOff : (k : Kind) -> Action k
|
||
</PRE>
|
||
<P>
|
||
Function types <I>without</I>
|
||
variables are actually a shorthand notation: writing
|
||
</P>
|
||
<PRE>
|
||
PredVP : NP -> VP -> S
|
||
</PRE>
|
||
<P>
|
||
is shorthand for
|
||
</P>
|
||
<PRE>
|
||
PredVP : (x : NP) -> (y : VP) -> S
|
||
</PRE>
|
||
<P>
|
||
or any other naming of the variables. Actually the use of variables
|
||
sometimes shortens the code, since they can share a type:
|
||
</P>
|
||
<PRE>
|
||
octuple : (x,y,z,u,v,w,s,t : Str) -> Str
|
||
</PRE>
|
||
<P>
|
||
If a bound variable is not used, it can here, as elsewhere in GF, be replaced by
|
||
a wildcard:
|
||
</P>
|
||
<PRE>
|
||
octuple : (_,_,_,_,_,_,_,_ : Str) -> Str
|
||
</PRE>
|
||
<P>
|
||
A good practice for functions with many arguments of the same type
|
||
is to indicate the number of arguments:
|
||
</P>
|
||
<PRE>
|
||
octuple : (x1,_,_,_,_,_,_,x8 : Str) -> Str
|
||
</PRE>
|
||
<P>
|
||
One can also use heuristic variable names to document what
|
||
information each argument is expected to provide.
|
||
This is very handy in the types of inflection paradigms:
|
||
</P>
|
||
<PRE>
|
||
mkNoun : (mouse,mice : Str) -> Noun
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc65"></A>
|
||
<H3>Separating operation types and definitions</H3>
|
||
<P>
|
||
In grammars intended as libraries, it is useful to separate oparation
|
||
definitions from their type signatures. The user is only interested
|
||
in the type, whereas the definition is kept for the implementor and
|
||
the maintainer. This is possible by using separate <CODE>oper</CODE> fragments
|
||
for the two parts:
|
||
</P>
|
||
<PRE>
|
||
oper regNoun : Str -> Noun ;
|
||
oper regNoun s = mkNoun s (s + "s") ;
|
||
</PRE>
|
||
<P>
|
||
The type checker combines the two into one <CODE>oper</CODE> judgement to see
|
||
if the definition matches the type. Notice that, in this syntax, it
|
||
is moreover possible to bind the argument variables on the left hand side
|
||
instead of using lambda abstration.
|
||
</P>
|
||
<P>
|
||
In the library module, the type signatures are typically placed in
|
||
the beginning and the definitions in the end. A more radical separation
|
||
can be achieved by using the <CODE>interface</CODE> and <CODE>instance</CODE> module types
|
||
(see <a href="#secinterface">here</a>): the type signatures are placed in the interface
|
||
and the definitions in the instance.
|
||
</P>
|
||
<A NAME="toc66"></A>
|
||
<H3>Overloading of operations</H3>
|
||
<P>
|
||
Large libraries, such as the GF Resource Grammar Library, may define
|
||
hundreds of names. This can be unpractical
|
||
for both the library author and the user: the author has to invent longer
|
||
and longer names which are not always intuitive,
|
||
and the author has to learn or at least be able to find all these names.
|
||
A solution to this problem, adopted by languages such as C++,
|
||
is <B>overloading</B>: one and the same name can be used for several functions.
|
||
When such a name is used, the
|
||
compiler performs <B>overload resolution</B> to find out which of
|
||
the possible functions is meant. Overload resolution is based on
|
||
the types of the functions: all functions that
|
||
have the same name must have different types.
|
||
</P>
|
||
<P>
|
||
In C++, functions with the same name can be scattered everywhere in the program.
|
||
In GF, they must be grouped together in <CODE>overload</CODE> groups. Here is an example
|
||
of an overload group, giving the different ways to define nouns in English:
|
||
</P>
|
||
<PRE>
|
||
oper mkN : overload {
|
||
mkN : (dog : Str) -> Noun ; -- regular nouns
|
||
mkN : (mouse,mice : Str) -> Noun ; -- irregular nouns
|
||
}
|
||
</PRE>
|
||
<P>
|
||
Intuitively, the function comes very close to the way in which
|
||
regular and irregular words are given in most dictionaries. If the
|
||
word is regular, just one form is needed. If it is irregular,
|
||
more forms are given. There is no need to use explicit paradigm
|
||
names.
|
||
</P>
|
||
<P>
|
||
The <CODE>mkN</CODE> example gives only the possible types of the overloaded
|
||
operation. Their definitions can be given separately, possibly in another module.
|
||
Here is a definition of the above overload group:
|
||
</P>
|
||
<PRE>
|
||
oper mkN = overload {
|
||
mkN : (dog : Str) -> Noun = regNoun ;
|
||
mkN : (mouse,mice : Str) -> Noun = mkNoun ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
Notice that the types of the branches must be repeated so that they can be
|
||
associated with proper definitions; the order of the branches has no
|
||
significance.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Design a system of English verb paradigms presented by
|
||
an overload group.
|
||
</P>
|
||
<A NAME="toc67"></A>
|
||
<H3>Morphological analysis and morphology quiz</H3>
|
||
<P>
|
||
Even though morphology is in GF
|
||
mostly used as an auxiliary for syntax, it
|
||
can also be useful on its own right. The command <CODE>morpho_analyse = ma</CODE>
|
||
can be used to read a text and return for each word the analyses that
|
||
it has in the current concrete syntax.
|
||
</P>
|
||
<PRE>
|
||
> read_file bible.txt | morpho_analyse
|
||
</PRE>
|
||
<P>
|
||
In the same way as translation exercises, morphological exercises can
|
||
be generated, by the command <CODE>morpho_quiz = mq</CODE>. Usually,
|
||
the category is then set to some lexical category. For instance,
|
||
French irregular verbs in the resource grammar library can be trained as
|
||
follows:
|
||
</P>
|
||
<PRE>
|
||
% gf -path=alltenses:prelude $GF_LIB_PATH/alltenses/IrregFre.gfc
|
||
|
||
> morpho_quiz -cat=V
|
||
|
||
Welcome to GF Morphology Quiz.
|
||
...
|
||
|
||
r<>appara<72>tre : VFin VCondit Pl P2
|
||
r<>apparaitriez
|
||
> No, not r<>apparaitriez, but
|
||
r<>appara<72>triez
|
||
Score 0/1
|
||
</PRE>
|
||
<P>
|
||
Just like translation exercises, a list of morphological exercises can be generated
|
||
off-line and saved in a
|
||
file for later use, by the command <CODE>morpho_list = ml</CODE>
|
||
</P>
|
||
<PRE>
|
||
> morpho_list -number=25 -cat=V | write_file exx.txt
|
||
</PRE>
|
||
<P>
|
||
The <CODE>number</CODE> flag gives the number of exercises generated.
|
||
</P>
|
||
<A NAME="toc68"></A>
|
||
<H2>The Italian Foods grammar</H2>
|
||
<P>
|
||
<a name="secitalian"></a>
|
||
</P>
|
||
<P>
|
||
We conclude the parametrization of the Food grammar by presenting an
|
||
Italian variant, now complete with parameters, inflection, and
|
||
agreement.
|
||
</P>
|
||
<P>
|
||
The header part is similar to English:
|
||
</P>
|
||
<PRE>
|
||
--# -path=.:prelude
|
||
|
||
concrete FoodsIta of Foods = open Prelude in {
|
||
</PRE>
|
||
<P>
|
||
Parameters include not only number but also gender.
|
||
</P>
|
||
<PRE>
|
||
param
|
||
Number = Sg | Pl ;
|
||
Gender = Masc | Fem ;
|
||
</PRE>
|
||
<P>
|
||
Qualities are inflected for gender and number, whereas kinds
|
||
have a parametric number (as in English) and an inherent gender.
|
||
Items have an inherent number (as in English) but also gender.
|
||
</P>
|
||
<PRE>
|
||
lincat
|
||
Phr = SS ;
|
||
Quality = {s : Gender => Number => Str} ;
|
||
Kind = {s : Number => Str ; g : Gender} ;
|
||
Item = {s : Str ; g : Gender ; n : Number} ;
|
||
</PRE>
|
||
<P>
|
||
A Quality is expressed by an adjective, which in Italian has one form for each
|
||
gender-number combination.
|
||
</P>
|
||
<PRE>
|
||
oper
|
||
adjective : (_,_,_,_ : Str) -> {s : Gender => Number => Str} =
|
||
\nero,nera,neri,nere -> {
|
||
s = table {
|
||
Masc => table {
|
||
Sg => nero ;
|
||
Pl => neri
|
||
} ;
|
||
Fem => table {
|
||
Sg => nera ;
|
||
Pl => nere
|
||
}
|
||
}
|
||
} ;
|
||
</PRE>
|
||
<P>
|
||
The very common case of regular adjectives works by adding
|
||
endings to the stem.
|
||
</P>
|
||
<PRE>
|
||
regAdj : Str -> {s : Gender => Number => Str} = \nero ->
|
||
let ner = init nero
|
||
in adjective nero (ner + "a") (ner + "i") (ner + "e") ;
|
||
</PRE>
|
||
<P></P>
|
||
<P>
|
||
For noun inflection, there are several paradigms; since only two forms
|
||
are ever needed, we will just give them explicitly (the resource grammar
|
||
library also has a paradigm that takes the singular form and infers the
|
||
plural and the gender from it).
|
||
</P>
|
||
<PRE>
|
||
noun : Str -> Str -> Gender -> {s : Number => Str ; g : Gender} =
|
||
\vino,vini,g -> {
|
||
s = table {
|
||
Sg => vino ;
|
||
Pl => vini
|
||
} ;
|
||
g = g
|
||
} ;
|
||
</PRE>
|
||
<P>
|
||
As in <CODE>FoodEng</CODE>, we need only number variation for the copula.
|
||
</P>
|
||
<PRE>
|
||
copula : Number -> Str =
|
||
\n -> case n of {
|
||
Sg => "<22>" ;
|
||
Pl => "sono"
|
||
} ;
|
||
</PRE>
|
||
<P>
|
||
Determination is more complex than in English, because of gender:
|
||
it uses separate determiner forms for the two genders, and selects
|
||
one of them as function of the noun determined.
|
||
</P>
|
||
<PRE>
|
||
det : Number -> Str -> Str -> {s : Number => Str ; g : Gender} ->
|
||
{s : Str ; g : Gender ; n : Number} =
|
||
\n,m,f,cn -> {
|
||
s = case cn.g of {Masc => m ; Fem => f} ++ cn.s ! n ;
|
||
g = cn.g ;
|
||
n = n
|
||
} ;
|
||
</PRE>
|
||
<P>
|
||
Here is, finally, the complete set of linearization rules.
|
||
</P>
|
||
<PRE>
|
||
lin
|
||
Is item quality =
|
||
ss (item.s ++ copula item.n ++ quality.s ! item.g ! item.n) ;
|
||
This = det Sg "questo" "questa" ;
|
||
That = det Sg "quello" "quella" ;
|
||
These = det Pl "questi" "queste" ;
|
||
Those = det Pl "quelli" "quelle" ;
|
||
QKind quality kind = {
|
||
s = \\n => kind.s ! n ++ quality.s ! kind.g ! n ;
|
||
g = kind.g
|
||
} ;
|
||
Wine = noun "vino" "vini" Masc ;
|
||
Cheese = noun "formaggio" "formaggi" Masc ;
|
||
Fish = noun "pesce" "pesci" Masc ;
|
||
Pizza = noun "pizza" "pizze" Fem ;
|
||
Very qual = {s = \\g,n => "molto" ++ qual.s ! g ! n} ;
|
||
Fresh = adjective "fresco" "fresca" "freschi" "fresche" ;
|
||
Warm = regAdj "caldo" ;
|
||
Italian = regAdj "italiano" ;
|
||
Expensive = regAdj "caro" ;
|
||
Delicious = regAdj "delizioso" ;
|
||
Boring = regAdj "noioso" ;
|
||
}
|
||
</PRE>
|
||
<P></P>
|
||
<P>
|
||
<B>Exercise</B>. Experiment with multilingual generation and translation in the
|
||
<CODE>Foods</CODE> grammars.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Add items, qualities, and determiners to the grammar, and try to get
|
||
their inflection and inherent features right.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Write a concrete syntax of <CODE>Food</CODE> for a language of your choice,
|
||
now aiming for complete grammatical correctness by the use of parameters.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Measure the size of the context-free grammar corresponding to
|
||
<CODE>FoodsIta</CODE>. You can do this by printing the grammar in the context-free format
|
||
(<CODE>print_grammar -printer=cfg</CODE>) and counting the lines.
|
||
</P>
|
||
<A NAME="toc69"></A>
|
||
<H2>Discontinuous constituents</H2>
|
||
<P>
|
||
A linearization type may contain more strings than one.
|
||
An example of where this is useful are English particle
|
||
verbs, such as <I>switch off</I>. The linearization of
|
||
a sentence may place the object between the verb and the particle:
|
||
<I>he switched it off</I>.
|
||
</P>
|
||
<P>
|
||
The following judgement defines transitive verbs as
|
||
<B>discontinuous constituents</B>, i.e. as having a linearization
|
||
type with two strings and not just one.
|
||
</P>
|
||
<PRE>
|
||
lincat TV = {s : Number => Str ; part : Str} ;
|
||
</PRE>
|
||
<P>
|
||
In the abstract syntax, we can now have a rule that combines a subject and an
|
||
object item with a transitive verb to form a sentence:
|
||
</P>
|
||
<PRE>
|
||
fun AppTV : Item -> TV -> Item -> Phrase ;
|
||
</PRE>
|
||
<P>
|
||
The linearization rule places the object between the two parts of the verb:
|
||
</P>
|
||
<PRE>
|
||
lin AppTV subj tv obj =
|
||
{s = subj.s ++ tv.s ! subj.n ++ obj.s ++ tv.part} ;
|
||
</PRE>
|
||
<P>
|
||
There is no restriction in the number of discontinuous constituents
|
||
(or other fields) a <CODE>lincat</CODE> may contain. The only condition is that
|
||
the fields must be built from records, tables,
|
||
parameters, and <CODE>Str</CODE>, but not functions.
|
||
</P>
|
||
<P>
|
||
Notice that the parsing and linearization commands only give accurate
|
||
results for categories whose linearization type has a unique <CODE>Str</CODE>
|
||
valued field labelled <CODE>s</CODE>. Therefore, discontinuous constituents
|
||
are not a good idea in top-level categories accessed by the users
|
||
of a grammar application.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Define the language <CODE>a^n b^n c^n</CODE> in GF, i.e.
|
||
any number of <I>a</I>'s followed by the same number of <I>b</I>'s and
|
||
the same number of <I>c</I>'s. This language is not context-free,
|
||
but can be defined in GF by using discontinuous constituents.
|
||
</P>
|
||
<A NAME="toc70"></A>
|
||
<H2>Strings at compile time vs. run time</H2>
|
||
<P>
|
||
A common difficulty in GF are the conditions under which tokens
|
||
can be created. Tokens are created in the following ways:
|
||
</P>
|
||
<UL>
|
||
<LI>quoted string: <CODE>"foo"</CODE>
|
||
<LI>gluing : <CODE>t + s</CODE>
|
||
<LI>predefined operations <CODE>init, tail, tk, dp</CODE>
|
||
<LI>pattern matching over strings
|
||
</UL>
|
||
|
||
<P>
|
||
The general principle is that
|
||
<I>tokens must be known at compile time</I>. This means that the above operations
|
||
may not have <B>run-time variables</B> in their arguments. Run-time variables, in
|
||
turn, are the variables that stand for function arguments in linearization rules.
|
||
</P>
|
||
<P>
|
||
Hence it is not legal to write
|
||
</P>
|
||
<PRE>
|
||
cat Noun ;
|
||
fun Plural : Noun -> Noun ;
|
||
lin Plural n = {s = n.s + "s"} ;
|
||
</PRE>
|
||
<P>
|
||
because <CODE>n</CODE> is a run-time variable. Also
|
||
</P>
|
||
<PRE>
|
||
lin Plural n = {s = (regNoun n).s ! Pl} ;
|
||
</PRE>
|
||
<P>
|
||
is incorrect with <CODE>regNoun</CODE> as defined <a href="#secinflection">here</a>, because the run-time
|
||
variable is eventually sent to string pattern matching and gluing.
|
||
</P>
|
||
<P>
|
||
Writing tokens together without a space is an often-wanted behaviour, for instance,
|
||
with punctuation marks. Thus one might try to write
|
||
</P>
|
||
<PRE>
|
||
lin Question p = {s = p + "?"} ;
|
||
</PRE>
|
||
<P>
|
||
which is incorrect. The way to go is to use an <B>unlexer</B> that creates correct spacing
|
||
after linearization. Correspondingly, a <B>lexer</B> that e.g. analyses <CODE>"warm?"</CODE> into
|
||
to tokens is needed before parsing. This can be done by using flags:
|
||
</P>
|
||
<PRE>
|
||
flags lexer=text ; unlexer=text ;
|
||
</PRE>
|
||
<P>
|
||
works in the desired way for English text. More on lexers and unlexers will be
|
||
told <a href="#seclexing">here</a>.
|
||
</P>
|
||
<A NAME="toc71"></A>
|
||
<H2>Summary of GF language features</H2>
|
||
<A NAME="toc72"></A>
|
||
<H3>Parameter and table types</H3>
|
||
<P>
|
||
A judgement of the form
|
||
<center>
|
||
<CODE>param</CODE> <I>P</I> <CODE>=</CODE> <I>C1</I> <I>X1</I> <CODE>|</CODE> ... <CODE>|</CODE> <I>Cn</I> <I>Xn</I>
|
||
</center>
|
||
defines a <B>parameter type</B> <I>P</I> with <B>constructors</B> <I>C1</I> ... <I>Cn</I>.
|
||
Each constructor has a <B>context</B> <I>X</I>, which is a (possibly empty)
|
||
sequence of parameter types. A <B>parameter value</B> is an application
|
||
of a constructor to a sequence of parameter values from each type in
|
||
its context.
|
||
</P>
|
||
<P>
|
||
In addition to types defined in <CODE>param</CODE> judgements, also
|
||
records of parameter types are parameter types. Their values are records
|
||
of corresponding field values.
|
||
</P>
|
||
<P>
|
||
Moreover, the type <CODE>Ints</CODE> <I>n</I> is a parameter type for any positive
|
||
integer <I>n</I>, and its values are <CODE>0</CODE>, ..., <I>n-1</I>.
|
||
</P>
|
||
<P>
|
||
A <B>table type</B> <I>P</I> <CODE>=></CODE> <I>T</I> must have a parameter type <I>P</I> as
|
||
its argument type. The normal form of an object of this type is a <B>table</B>
|
||
<center>
|
||
<CODE>table {</CODE> <I>V1</I> <CODE>=></CODE> <I>t1</I> <CODE>;</CODE> ... <CODE>;</CODE> <I>Vm</I> <CODE>=></CODE> <I>tm</I> <CODE>}</CODE>
|
||
</center>
|
||
which has a <B>branch</B> for every parameter value <I>Vi</I> of type <I>P</I>.
|
||
A table can be given in many other ways by using pattern matching.
|
||
</P>
|
||
<P>
|
||
Tables with only one branch are a common special case.
|
||
GF provides syntactic sugar for writing one-branch tables concisely:
|
||
</P>
|
||
<PRE>
|
||
\\P,...,Q => t === table {P => ... table {Q => t} ...}
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc73"></A>
|
||
<H3>Pattern matching</H3>
|
||
<P>
|
||
<a name="secmatching"></a>
|
||
</P>
|
||
<P>
|
||
We will list all forms of patterns that can be used in table branches.
|
||
the following are available for any parameter types, as well
|
||
as for the types <CODE>Int</CODE> and <CODE>Str</CODE>
|
||
</P>
|
||
<UL>
|
||
<LI>a constructor pattern <I>C P1 ... Pn</I> matches any value <I>C V1 ... Vn</I> where
|
||
each <I>Vi</I> matches <I>Pi</I>,
|
||
and binds the union of all variables bound in the subpatterns <I>Pi</I>
|
||
<LI>a record pattern
|
||
<CODE>{</CODE> <I>r1</I> <CODE>=</CODE> <I>P1</I> <CODE>;</CODE> ... <CODE>;</CODE> <I>r1</I> <CODE>=</CODE> <I>P1</I> <CODE>}</CODE>
|
||
matches any record that has values of the corresponding fields.
|
||
and binds the union of all variables bound in the subpatterns <I>Pi</I>
|
||
<LI>a variable pattern <I>x</I>
|
||
(identifier other than constant parameter) matches any value, and
|
||
binds <I>x</I> to this value
|
||
<LI>the wild card <CODE>_</CODE> matches any value
|
||
<LI>a disjunctive pattern <I>P</I> <CODE>|</CODE> <I>Q</I> matches anything that
|
||
either <I>P</I> or <I>Q</I> matches; bindings must be the same in both
|
||
<LI>a negative pattern <CODE>-</CODE><I>P</I> matches anything that <I>P</I> does not match;
|
||
no bindings are returned
|
||
<LI>an alias pattern <I>x</I> <CODE>@</CODE> <I>P</I> matches whatever value <I>P</I> matches and
|
||
binds <I>x</I> to this value; also the bindings in <I>P</I> are returned
|
||
</UL>
|
||
|
||
<P>
|
||
The following patterns are only available for the type <CODE>Str</CODE>:
|
||
</P>
|
||
<UL>
|
||
<LI>a string literal pattern, e.g. <CODE>"s"</CODE>, matches the same string
|
||
<LI>a concatenation pattern <I>P</I> <CODE>+</CODE> <I>Q</I> matches any string that consists
|
||
of a prefix matching <I>P</I> and a suffix matching <I>Q</I>;
|
||
the union of bindings is returned
|
||
<LI>a repetition pattern <I>P</I><CODE>*</CODE> matches any string that can be decomposed
|
||
into strings that match <I>P</I>; no bindings are returned
|
||
</UL>
|
||
|
||
<P>
|
||
The following pattern is only available for the types <CODE>Int</CODE> and <CODE>Ints</CODE> <I>n</I>:
|
||
</P>
|
||
<UL>
|
||
<LI>an integer literal pattern, e.g. <CODE>214</CODE>, matches the same integer
|
||
</UL>
|
||
|
||
<P>
|
||
Pattern matching is performed in the order in which the branches
|
||
appear in the table: the branch of the first matching pattern is followed.
|
||
The type checker reject sets of patterns that are not exhaustive, and
|
||
warns for completely overshadowed patterns.
|
||
To guarantee exhaustivity when the infinite types <CODE>Int</CODE> and <CODE>Str</CODE> are
|
||
used as argument types, the last pattern must be a "catch-all" variable
|
||
or wild card.
|
||
</P>
|
||
<P>
|
||
It follows from the definition of record pattern matching
|
||
that it can utilize partial records: the branch
|
||
</P>
|
||
<PRE>
|
||
{g = Fem} => t
|
||
</PRE>
|
||
<P>
|
||
in a table of type <CODE>{g : Gender ; n : Number} => T</CODE> means the same as
|
||
</P>
|
||
<PRE>
|
||
{g = Fem ; n = _} => t
|
||
</PRE>
|
||
<P>
|
||
Variables in regular expression patterns
|
||
are always bound to the <B>first match</B>, which is the first
|
||
in the sequence of binding lists. For example:
|
||
</P>
|
||
<UL>
|
||
<LI><CODE>x + "e" + y</CODE> matches <CODE>"peter"</CODE> with <CODE>x = "p", y = "ter"</CODE>
|
||
<LI><CODE>x + "er"*</CODE> matches <CODE>"burgerer"</CODE> with ``x = "burg"
|
||
</UL>
|
||
|
||
<A NAME="toc74"></A>
|
||
<H3>Overloading</H3>
|
||
<P>
|
||
Judgements of the <CODE>oper</CODE> form can introduce overloaded functions.
|
||
The syntax is record-like, but all fields must have the same
|
||
name and different types.
|
||
</P>
|
||
<PRE>
|
||
oper mkN = overload {
|
||
mkN : (dog : Str) -> Noun = regNoun ;
|
||
mkN : (mouse,mice : Str) -> Noun = mkNoun ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
To give just the type of an overloaded operation, the record type
|
||
syntax is used.
|
||
</P>
|
||
<PRE>
|
||
oper mkN : overload {
|
||
mkN : (dog : Str) -> Noun ; -- regular nouns
|
||
mkN : (mouse,mice : Str) -> Noun ; -- irregular nouns
|
||
}
|
||
</PRE>
|
||
<P>
|
||
Overloading is not possible in other forms of judgement.
|
||
</P>
|
||
<A NAME="toc75"></A>
|
||
<H3>Local definitions</H3>
|
||
<P>
|
||
Local definitions ("<CODE>let</CODE> expressions") can appear in groups:
|
||
</P>
|
||
<PRE>
|
||
oper regNoun : Str -> Noun = \vino ->
|
||
let
|
||
vin : Str = init vino ;
|
||
o = last vino
|
||
in
|
||
...
|
||
</PRE>
|
||
<P>
|
||
The type can be omitted if it can be inferred. Later definitions may
|
||
refer to earlier ones.
|
||
</P>
|
||
<A NAME="toc76"></A>
|
||
<H3>Supplementary constructs</H3>
|
||
<P>
|
||
The rest of the GF language constructs are presented for the sake
|
||
of completeness. They will not be used in the rest of this tutorial.
|
||
</P>
|
||
<H4>Record extension and subtyping</H4>
|
||
<P>
|
||
Record types and records can be <B>extended</B> with new fields. For instance,
|
||
in German it is natural to see transitive verbs as verbs with a case, which
|
||
is usually accusative or dative, and is passed to the object of the verb.
|
||
The symbol <CODE>**</CODE> is used for both record types and record objects.
|
||
</P>
|
||
<PRE>
|
||
lincat TV = Verb ** {c : Case} ;
|
||
|
||
lin Follow = regVerb "folgen" ** {c = Dative} ;
|
||
</PRE>
|
||
<P>
|
||
To extend a record type or a record with a field whose label it
|
||
already has is a type error. It is also an error to extend a type or
|
||
object that is not a record.
|
||
</P>
|
||
<P>
|
||
A record type <I>T</I> is a <B>subtype</B> of another one <I>R</I>, if <I>T</I> has
|
||
all the fields of <I>R</I> and possibly other fields. For instance,
|
||
an extension of a record type is always a subtype of it.
|
||
If <I>T</I> is a subtype of <I>R</I>, then <I>R</I> is a <B>supertype</B> of <I>T</I>.
|
||
</P>
|
||
<P>
|
||
If <I>T</I> is a subtype of <I>R</I>, an object of <I>T</I> can be used whenever
|
||
an object of <I>R</I> is required.
|
||
For instance, a transitive verb can be used whenever a verb is required.
|
||
</P>
|
||
<P>
|
||
<B>Covariance</B> means that a function returning a record <I>T</I> as value can
|
||
also be used to return a value of a supertype <I>R</I>.
|
||
<B>Contravariance</B> means that a function taking an <I>R</I> as argument
|
||
can also be applied to any object of a subtype <I>T</I>.
|
||
</P>
|
||
<H4>Tuples and product types</H4>
|
||
<P>
|
||
Product types and tuples are syntactic sugar for record types and records:
|
||
</P>
|
||
<PRE>
|
||
T1 * ... * Tn === {p1 : T1 ; ... ; pn : Tn}
|
||
<t1, ..., tn> === {p1 = T1 ; ... ; pn = Tn}
|
||
</PRE>
|
||
<P>
|
||
Thus the labels <CODE>p1, p2,...</CODE> are hard-coded.
|
||
As patterns, tuples are translated to record patterns in the
|
||
same way as tuples to records; partial patterns make it
|
||
possible to write, slightly surprisingly,
|
||
</P>
|
||
<PRE>
|
||
case <g,n,p> of {
|
||
<Fem> => t
|
||
...
|
||
}
|
||
</PRE>
|
||
<P></P>
|
||
<H4>Prefix-dependent choices</H4>
|
||
<P>
|
||
Sometimes a token has different forms depending on the token
|
||
that follows. An example is the English indefinite article,
|
||
which is <I>an</I> if a vowel follows, <I>a</I> otherwise.
|
||
Which form is chosen can only be decided at run time, i.e.
|
||
when a string is actually build. GF has a special construct for
|
||
such tokens, the <CODE>pre</CODE> construct exemplified in
|
||
</P>
|
||
<PRE>
|
||
oper artIndef : Str =
|
||
pre {"a" ; "an" / strs {"a" ; "e" ; "i" ; "o"}} ;
|
||
</PRE>
|
||
<P>
|
||
Thus
|
||
</P>
|
||
<PRE>
|
||
artIndef ++ "cheese" ---> "a" ++ "cheese"
|
||
artIndef ++ "apple" ---> "an" ++ "apple"
|
||
</PRE>
|
||
<P>
|
||
This very example does not work in all situations: the prefix
|
||
<I>u</I> has no general rules, and some problematic words are
|
||
<I>euphemism, one-eyed, n-gram</I>. Since the branches are matched in
|
||
order, it is possible to write
|
||
</P>
|
||
<PRE>
|
||
oper artIndef : Str =
|
||
pre {"a" ;
|
||
"a" / strs {"eu" ; "one"} ;
|
||
"an" / strs {"a" ; "e" ; "i" ; "o" ; "n-"}
|
||
} ;
|
||
</PRE>
|
||
<P>
|
||
Somewhat illogically, the default value is given as the first element in the list.
|
||
</P>
|
||
<P>
|
||
<I>Prefix-dependent choice may be deprecated in GF version 3.</I>
|
||
</P>
|
||
<A NAME="toc77"></A>
|
||
<H1>Using the resource grammar library</H1>
|
||
<P>
|
||
<a name="chapfive"></a>
|
||
</P>
|
||
<P>
|
||
In this chapter, we will take a look at the GF resource grammar library.
|
||
We will use the library to implement the <CODE>Foods</CODE> grammar of the
|
||
previous chapter
|
||
and port it to some new languages. Some new concepts of GF's module system
|
||
are also introduced, most notably the technique of <B>parametrized modules</B>,
|
||
which has become an important "design pattern" for multilingual grammars.
|
||
</P>
|
||
<A NAME="toc78"></A>
|
||
<H2>The coverage of the library</H2>
|
||
<P>
|
||
The GF Resource Grammar Library contains grammar rules for
|
||
10 languages. In addition, 2 languages are available as yet incomplete
|
||
implementations, and a few more are under construction. The purpose
|
||
of the library is to define the low-level morphological and syntactic
|
||
rules of languages, and thereby enable application programmers
|
||
to concentrate on the semantic and stylistic
|
||
aspects of their grammars. The guiding principle is that
|
||
<center>
|
||
grammar checking becomes type checking
|
||
</center>
|
||
that is, whatever is type-correct in the resource grammar is also
|
||
grammatically correct.
|
||
</P>
|
||
<P>
|
||
The intended level of application grammarians
|
||
is that of a skilled programmer with
|
||
a practical knowledge of the target languages, but without
|
||
theoretical knowledge about their grammars.
|
||
Such a combination of
|
||
skills is typical of programmers who, for instance, want to localize
|
||
language software to new languages.
|
||
</P>
|
||
<P>
|
||
The current resource languages are
|
||
</P>
|
||
<UL>
|
||
<LI><CODE>Ara</CODE>bic (incomplete)
|
||
<LI><CODE>Cat</CODE>alan (incomplete)
|
||
<LI><CODE>Dan</CODE>ish
|
||
<LI><CODE>Eng</CODE>lish
|
||
<LI><CODE>Fin</CODE>nish
|
||
<LI><CODE>Fre</CODE>nch
|
||
<LI><CODE>Ger</CODE>man
|
||
<LI><CODE>Ita</CODE>lian
|
||
<LI><CODE>Nor</CODE>wegian
|
||
<LI><CODE>Rus</CODE>sian
|
||
<LI><CODE>Spa</CODE>nish
|
||
<LI><CODE>Swe</CODE>dish
|
||
</UL>
|
||
|
||
<P>
|
||
The first three letters (<CODE>Eng</CODE> etc) are used in grammar module names.
|
||
We use the three-letter codes for languages from the ISO 639 standard.
|
||
</P>
|
||
<P>
|
||
The incomplete Arabic and Catalan implementations are
|
||
sufficient for use in some applications; they both contain, amoung other
|
||
things, complete inflectional morphology.
|
||
</P>
|
||
<A NAME="toc79"></A>
|
||
<H2>The structure of the library</H2>
|
||
<P>
|
||
<a name="seclexical"></a>
|
||
</P>
|
||
<A NAME="toc80"></A>
|
||
<H3>Lexical vs. phrasal rules</H3>
|
||
<P>
|
||
So far we have looked at grammars from a semantic point of view:
|
||
a grammar defines a system of meanings (specified in the abstract syntax) and
|
||
tells how they are expressed in some language (as specified in the concrete syntax).
|
||
In resource grammars, as often in the linguistic tradition, the goal is more modest:
|
||
to specify the <B>grammatically correct combinations of words</B>, whatever their
|
||
meanings are. With this more modest goal, it is possible to achieve a much
|
||
wider coverage than with semantic grammars.
|
||
</P>
|
||
<P>
|
||
Given the focus on <I>words</I> and their combinations,
|
||
the resource grammar has two kinds of categories and two kinds of rules:
|
||
</P>
|
||
<UL>
|
||
<LI>lexical:
|
||
<UL>
|
||
<LI>lexical categories, to classify words
|
||
<LI>lexical rules, to define words and their properties
|
||
</UL>
|
||
</UL>
|
||
|
||
<UL>
|
||
<LI>phrasal (combinatorial, syntactic):
|
||
<UL>
|
||
<LI>phrasal categories, to classify phrases of arbitrary size
|
||
<LI>phrasal rules, to combine phrases into larger phrases
|
||
</UL>
|
||
</UL>
|
||
|
||
<P>
|
||
Some grammar formalisms make a formal distinction between
|
||
the lexical and syntactic
|
||
components; sometimes it is necessary to use separate formalisms for these
|
||
two kinds of rules. GF has no such restrictions.
|
||
Nevertheless, it has turned out
|
||
to be a good discipline to maintain a distinction between
|
||
the lexical and syntactic components in the resource grammar. This fits
|
||
also well with what is needed in applications: while syntactic structures
|
||
are more or less the same across applications, vocabularies can be
|
||
very different.
|
||
</P>
|
||
<A NAME="toc81"></A>
|
||
<H3>Lexical categories</H3>
|
||
<P>
|
||
Within lexical categories, there is a further classification
|
||
into <B>closed</B> and <B>open</B> categories. The definining property
|
||
of closed categories is that the
|
||
words in them can easily be enumerated; it is very seldom that any
|
||
new words are introduced in them. In general, closed categories
|
||
contain <B>structural words</B>, also known as <B>function words</B>.
|
||
Examples of closed categories are
|
||
</P>
|
||
<PRE>
|
||
QuantSg ; -- singular quantifier e.g. "this"
|
||
QuantPl ; -- plural quantifier e.g. "those"
|
||
AdA ; -- adadjective e.g. "very"
|
||
</PRE>
|
||
<P>
|
||
We have already used words of all these categories in the <CODE>Food</CODE>
|
||
examples; they have just not been assigned a category, but
|
||
treated as <B>syncategorematic</B>. In GF, a syncategoramatic
|
||
word is one that is introduced in a linearization rule of
|
||
some construction alongside with some other expressions that
|
||
are combined; there is no abstract syntax tree for that word
|
||
alone. Thus in the rules
|
||
</P>
|
||
<PRE>
|
||
fun That : Kind -> Item ;
|
||
lin That k = {"that" ++ k.s} ;
|
||
</PRE>
|
||
<P>
|
||
the word <I>that</I> is syncategoramatic. In linguistically motivated
|
||
grammars, syncategorematic words are avoided, whereas in
|
||
semantically motivated grammars, structural words are typically treated
|
||
as syncategoramatic. This is partly so because the function expressed
|
||
by a structural word in one language is often expressed by some other
|
||
means than an individual word in another. For instance, the definite
|
||
article <I>the</I> is a determiner word in English, whereas Swedish expresses
|
||
determination by inflecting the determined noun: <I>the wine</I> is <I>vinet</I>
|
||
in Swedish.
|
||
</P>
|
||
<P>
|
||
As for open categories, we will start with these two:
|
||
</P>
|
||
<PRE>
|
||
N ; -- noun e.g. "pizza"
|
||
A ; -- adjective e.g. "good"
|
||
</PRE>
|
||
<P>
|
||
Later in this chapter we will also need verbs of different kinds.
|
||
</P>
|
||
<P>
|
||
<I>Note</I>. Having adadjectives as a closed category is not quite right, because
|
||
one can form adadjectives from adjectives: <I>incredibly warm</I>.
|
||
</P>
|
||
<A NAME="toc82"></A>
|
||
<H3>Lexical rules</H3>
|
||
<P>
|
||
The words of closed categories can be listed once and for all in a
|
||
library. In the first example, the <CODE>Foods</CODE> grammar of the previous section,
|
||
we will use the following structural words from the <CODE>Syntax</CODE> module:
|
||
</P>
|
||
<PRE>
|
||
this_QuantSg, that_QuantSg : QuantSg ;
|
||
these_QuantPl, those_QuantPl : QuantPl ;
|
||
very_AdA : AdA ;
|
||
</PRE>
|
||
<P>
|
||
The naming convention for lexical rules is that we use a word followed by
|
||
the category. In this way we can for instance distinguish the quantifier
|
||
<I>that</I> from the conjunction <I>that</I>.
|
||
</P>
|
||
<P>
|
||
Open lexical categories have no objects in <CODE>Syntax</CODE>. Such objects
|
||
will be built as they are needed in applications. The abstract
|
||
syntax of words in applications is already familiar, e.g.
|
||
</P>
|
||
<PRE>
|
||
fun Wine : Kind ;
|
||
</PRE>
|
||
<P>
|
||
The concrete syntax can be given directly, e.g.
|
||
</P>
|
||
<PRE>
|
||
lin Wine = mkN "wine" ;
|
||
</PRE>
|
||
<P>
|
||
by using the morphological paradigm library <CODE>ParadigmsEng</CODE>.
|
||
However, there are some advantages in giving the concrete syntax
|
||
indirectly, via the creation of a <B>resource lexicon</B>. In this lexicon,
|
||
there will be entries such as
|
||
</P>
|
||
<PRE>
|
||
oper wine_N : N = mkN "wine" ;
|
||
</PRE>
|
||
<P>
|
||
which can then be used in the linearization rules,
|
||
</P>
|
||
<PRE>
|
||
lin Wine = wine_N ;
|
||
</PRE>
|
||
<P>
|
||
One advantage of this indirect method is that each new word gives
|
||
an addition to a reusable resource lexicon, instead of just doing
|
||
the job of implementing the application. Another advantage will
|
||
be shown <a href="#secfunctor">here</a>: the possibility to write functors over
|
||
lexicon interfaces.
|
||
</P>
|
||
<A NAME="toc83"></A>
|
||
<H3>Phrasal categories</H3>
|
||
<P>
|
||
There are just four phrasal categories needed in the first application:
|
||
</P>
|
||
<PRE>
|
||
Cl ; -- clause e.g. "this pizza is good"
|
||
NP ; -- noun phrase e.g. "this pizza"
|
||
CN ; -- common noun e.g. "warm pizza"
|
||
AP ; -- adjectival phrase e.g. "very warm"
|
||
</PRE>
|
||
<P>
|
||
Clauses are, roughly, the same as declarative sentences; we will
|
||
define in <a href="#secextended">here</a> a sentence <CODE>S</CODE> as a clause that has a fixed tense.
|
||
The distinction between common nouns and noun phrases is that common nouns
|
||
cannot generally be used alone as subjects (?<I>dog sleeps</I>),
|
||
whereas noun phrases can (<I>the dog sleeps</I>).
|
||
Noun phrases can be built from common nouns by adding determiners,
|
||
such as quantifiers; but there are also other kinds of noun phrases, e.g.
|
||
pronouns.
|
||
</P>
|
||
<P>
|
||
The syntactic combinations we need are the following:
|
||
</P>
|
||
<PRE>
|
||
mkCl : NP -> AP -> Cl ; -- e.g. "this pizza is very warm"
|
||
mkNP : QuantSg -> CN -> NP ; -- e.g. "this pizza"
|
||
mkNP : QuantPl -> CN -> NP ; -- e.g. "these pizzas"
|
||
mkCN : AP -> CN -> CN ; -- e.g. "warm pizza"
|
||
mkAP : AdA -> AP -> AP ; -- e.g. "very warm"
|
||
</PRE>
|
||
<P>
|
||
To start building phrases, we need rules of <B>lexical insertion</B>, which
|
||
form phrases from single words:
|
||
</P>
|
||
<PRE>
|
||
mkCN : N -> NP ;
|
||
mkAP : A -> AP ;
|
||
</PRE>
|
||
<P>
|
||
Notice that all (or, as many as possible) operations in the resource library
|
||
have the name <CODE>mk</CODE><I>C</I>, where <I>C</I> is the value category of the operation.
|
||
This means of course heavy overloading. For instance, the current library
|
||
(version 1.2) has no less than 23 operations named <CODE>mkNP</CODE>!
|
||
</P>
|
||
<P>
|
||
Now the sentence
|
||
<center>
|
||
<I>these very warm pizzas are Italian</I>
|
||
</center>
|
||
can be built as follows:
|
||
</P>
|
||
<PRE>
|
||
mkCl
|
||
(mkNP these_QuantPl
|
||
(mkCN (mkAP very_AdA (mkAP warm_A)) (mkCN pizza_CN)))
|
||
(mkAP italian_AP)
|
||
</PRE>
|
||
<P>
|
||
The task we are facing now is to define the concrete syntax of <CODE>Foods</CODE> so that
|
||
this syntactic tree gives the value of linearizing the semantic tree
|
||
</P>
|
||
<PRE>
|
||
Is (These (QKind (Very Warm) Pizza)) Italian
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc84"></A>
|
||
<H2>The resource API</H2>
|
||
<P>
|
||
The resource library API is divided into language-specific
|
||
and language-independent parts. To put it roughly,
|
||
</P>
|
||
<UL>
|
||
<LI>the syntax API is language-independent, i.e. has the same types and
|
||
functions for all languages.
|
||
Its name is <CODE>Syntax</CODE><I>L</I> for each language <I>L</I>
|
||
<LI>the morphology API is language-specific, i.e. has partly
|
||
different types and functions
|
||
for different languages.
|
||
Its name is <CODE>Paradigms</CODE><I>L</I> for each language <I>L</I>
|
||
</UL>
|
||
|
||
<P>
|
||
A full documentation of the API is available on-line in the
|
||
<B>resource synopsis</B>.
|
||
For the examples of this chapter, we will only need a
|
||
fragment of the full API. The fragment needed for <CODE>Foods</CODE> has
|
||
already been introduced, but let us summarize the descriptions
|
||
by giving tables of the same form as used in the resource synopsis.
|
||
</P>
|
||
<P>
|
||
Thus we will make use of the following categories from the module <CODE>Syntax</CODE>.
|
||
</P>
|
||
<TABLE CELLPADDING="4" BORDER="1">
|
||
<TR>
|
||
<TH>Category</TH>
|
||
<TH>Explanation</TH>
|
||
<TH COLSPAN="2">Example</TH>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>Cl</CODE></TD>
|
||
<TD>clause (sentence), with all tenses</TD>
|
||
<TD><I>she looks at this</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>AP</CODE></TD>
|
||
<TD>adjectival phrase</TD>
|
||
<TD><I>very warm</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>CN</CODE></TD>
|
||
<TD>common noun (without determiner)</TD>
|
||
<TD><I>red house</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>NP</CODE></TD>
|
||
<TD>noun phrase (subject or object)</TD>
|
||
<TD><I>the red house</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>AdA</CODE></TD>
|
||
<TD>adjective-modifying adverb,</TD>
|
||
<TD><I>very</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>QuantSg</CODE></TD>
|
||
<TD>singular quantifier</TD>
|
||
<TD><I>these</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>QuantPl</CODE></TD>
|
||
<TD>plural quantifier</TD>
|
||
<TD><I>these</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>A</CODE></TD>
|
||
<TD>one-place adjective</TD>
|
||
<TD><I>warm</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>N</CODE></TD>
|
||
<TD>common noun</TD>
|
||
<TD><I>house</I></TD>
|
||
</TR>
|
||
</TABLE>
|
||
|
||
<P></P>
|
||
<P>
|
||
We will use the following syntax rules from <CODE>Syntax</CODE>.
|
||
</P>
|
||
<TABLE CELLPADDING="4" BORDER="1">
|
||
<TR>
|
||
<TH>Function</TH>
|
||
<TH>Type</TH>
|
||
<TH COLSPAN="2">Example</TH>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>mkCl</CODE></TD>
|
||
<TD><CODE>NP -> AP -> Cl</CODE></TD>
|
||
<TD><I>John is very old</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>mkNP</CODE></TD>
|
||
<TD><CODE>QuantSg -> CN -> NP</CODE></TD>
|
||
<TD><I>this old man</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>mkNP</CODE></TD>
|
||
<TD><CODE>QuantPl -> CN -> NP</CODE></TD>
|
||
<TD><I>these old man</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>mkCN</CODE></TD>
|
||
<TD><CODE>N -> CN</CODE></TD>
|
||
<TD><I>house</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>mkCN</CODE></TD>
|
||
<TD><CODE>AP -> CN -> CN</CODE></TD>
|
||
<TD><I>very big blue house</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>mkAP</CODE></TD>
|
||
<TD><CODE>A -> AP</CODE></TD>
|
||
<TD><I>old</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>mkAP</CODE></TD>
|
||
<TD><CODE>AdA -> AP -> AP</CODE></TD>
|
||
<TD><I>very very old</I></TD>
|
||
</TR>
|
||
</TABLE>
|
||
|
||
<P></P>
|
||
<P>
|
||
We will use the following structural words from <CODE>Syntax</CODE>.
|
||
</P>
|
||
<TABLE CELLPADDING="4" BORDER="1">
|
||
<TR>
|
||
<TH>Function</TH>
|
||
<TH>Type</TH>
|
||
<TH COLSPAN="2">In English</TH>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>this_QuantSg</CODE></TD>
|
||
<TD><CODE>QuantSg</CODE></TD>
|
||
<TD><I>this</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>that_QuantSg</CODE></TD>
|
||
<TD><CODE>QuantSg</CODE></TD>
|
||
<TD><I>that</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>these_QuantPl</CODE></TD>
|
||
<TD><CODE>QuantPl</CODE></TD>
|
||
<TD><I>this</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>those_QuantPl</CODE></TD>
|
||
<TD><CODE>QuantPl</CODE></TD>
|
||
<TD><I>that</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>very_AdA</CODE></TD>
|
||
<TD><CODE>AdA</CODE></TD>
|
||
<TD><I>very</I></TD>
|
||
</TR>
|
||
</TABLE>
|
||
|
||
<P></P>
|
||
<P>
|
||
For English, we will use the following part of <CODE>ParadigmsEng</CODE>.
|
||
</P>
|
||
<TABLE CELLPADDING="4" BORDER="1">
|
||
<TR>
|
||
<TH>Function</TH>
|
||
<TH COLSPAN="2">Type</TH>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>mkN</CODE></TD>
|
||
<TD><CODE>(dog : Str) -> N</CODE></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>mkN</CODE></TD>
|
||
<TD><CODE>(man,men : Str) -> N</CODE></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>mkA</CODE></TD>
|
||
<TD><CODE>(cold : Str) -> A</CODE></TD>
|
||
</TR>
|
||
</TABLE>
|
||
|
||
<P></P>
|
||
<P>
|
||
For Italian, we need just the following part of <CODE>ParadigmsIta</CODE>
|
||
(Exercise). The "smart" paradigms will take care of variations
|
||
such as <I>formaggio</I>-<I>formaggi</I>, and also infer the genders
|
||
correctly.
|
||
</P>
|
||
<TABLE CELLPADDING="4" BORDER="1">
|
||
<TR>
|
||
<TH>Function</TH>
|
||
<TH COLSPAN="2">Type</TH>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>mkN</CODE></TD>
|
||
<TD><CODE>(vino : Str) -> N</CODE></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>mkA</CODE></TD>
|
||
<TD><CODE>(caro : Str) -> A</CODE></TD>
|
||
</TR>
|
||
</TABLE>
|
||
|
||
<P></P>
|
||
<P>
|
||
For German, we will use the following part of <CODE>ParadigmsGer</CODE>.
|
||
</P>
|
||
<TABLE CELLPADDING="4" BORDER="1">
|
||
<TR>
|
||
<TH>Function</TH>
|
||
<TH COLSPAN="2">Type</TH>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>Gender</CODE></TD>
|
||
<TD><CODE>Type</CODE></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>masculine</CODE></TD>
|
||
<TD><CODE>Gender</CODE></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>feminine</CODE></TD>
|
||
<TD><CODE>Gender</CODE></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>neuter</CODE></TD>
|
||
<TD><CODE>Gender</CODE></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>mkN</CODE></TD>
|
||
<TD><CODE>(Stufe : Str) -> N</CODE></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>mkN</CODE></TD>
|
||
<TD><CODE>(Bild,Bilder : Str) -> Gender -> N</CODE></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>mkA</CODE></TD>
|
||
<TD><CODE>(klein : Str) -> A</CODE></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>mkA</CODE></TD>
|
||
<TD><CODE>(gut,besser,beste : Str) -> A</CODE></TD>
|
||
</TR>
|
||
</TABLE>
|
||
|
||
<P></P>
|
||
<P>
|
||
For Finnish, we only need the smart regular paradigms:
|
||
</P>
|
||
<TABLE CELLPADDING="4" BORDER="1">
|
||
<TR>
|
||
<TH>Function</TH>
|
||
<TH COLSPAN="2">Type</TH>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>mkN</CODE></TD>
|
||
<TD><CODE>(talo : Str) -> N</CODE></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>mkA</CODE></TD>
|
||
<TD><CODE>(hieno : Str) -> A</CODE></TD>
|
||
</TR>
|
||
</TABLE>
|
||
|
||
<P></P>
|
||
<P>
|
||
<B>Exercise</B>. Try out the morphological paradigms in different languages. Do
|
||
as follows:
|
||
</P>
|
||
<PRE>
|
||
> i -path=alltenses:prelude -retain alltenses/ParadigmsGer.gfr
|
||
> cc mkN "Farbe"
|
||
> cc mkA "gut" "besser" "beste"
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc85"></A>
|
||
<H2>Example: English</H2>
|
||
<P>
|
||
<a name="secenglish"></a>
|
||
</P>
|
||
<P>
|
||
We work with the abstract syntax <CODE>Foods</CODE> from <a href="#chaptwo">the fourth chapter</a>, and
|
||
build first an English implementation. Now we can do it without
|
||
thinking about inflection and agreement, by just picking appropriate
|
||
functions from the resource grammar library.
|
||
</P>
|
||
<P>
|
||
The concrete syntax opens <CODE>SyntaxEng</CODE> and <CODE>ParadigmsEng</CODE>
|
||
to get access to the resource libraries needed. In order to find
|
||
the libraries, a <CODE>path</CODE> directive is prepended. It contains
|
||
two resource subdirectories --- <CODE>present</CODE> and <CODE>prelude</CODE> ---
|
||
which are found relative to the environment variable <CODE>GF_LIB_PATH</CODE>.
|
||
It also contains the current directory <CODE>.</CODE> and the directory <CODE>../foods</CODE>,
|
||
in which <CODE>Foods.gf</CODE> resides.
|
||
</P>
|
||
<PRE>
|
||
--# -path=.:../foods:present:prelude
|
||
|
||
concrete FoodsEng of Foods = open SyntaxEng,ParadigmsEng in {
|
||
</PRE>
|
||
<P>
|
||
As linearization types, we will use clauses for <CODE>Phrase</CODE>, noun phrases
|
||
for <CODE>Item</CODE>, common nouns for <CODE>Kind</CODE>, and adjectival phrases for <CODE>Quality</CODE>.
|
||
</P>
|
||
<PRE>
|
||
lincat
|
||
Phrase = Cl ;
|
||
Item = NP ;
|
||
Kind = CN ;
|
||
Quality = AP ;
|
||
</PRE>
|
||
<P>
|
||
These types fit perfectly with the way we have used the categories
|
||
in the application; hence
|
||
the combination rules we need almost write themselves automatically:
|
||
</P>
|
||
<PRE>
|
||
lin
|
||
Is item quality = mkCl item quality ;
|
||
This kind = mkNP this_QuantSg kind ;
|
||
That kind = mkNP that_QuantSg kind ;
|
||
These kind = mkNP these_QuantPl kind ;
|
||
Those kind = mkNP those_QuantPl kind ;
|
||
QKind quality kind = mkCN quality kind ;
|
||
Very quality = mkAP very_AdA quality ;
|
||
</PRE>
|
||
<P>
|
||
We write the lexical part of the grammar by using resource paradigms directly.
|
||
Notice that we have to apply the lexical insertion rules to get type-correct
|
||
linearizations. Notice also that we need to use the two-place noun paradigm for
|
||
<I>fish</I>, but everythins else is regular.
|
||
</P>
|
||
<PRE>
|
||
Wine = mkCN (mkN "wine") ;
|
||
Pizza = mkCN (mkN "pizza") ;
|
||
Cheese = mkCN (mkN "cheese") ;
|
||
Fish = mkCN (mkN "fish" "fish") ;
|
||
Fresh = mkAP (mkA "fresh") ;
|
||
Warm = mkAP (mkA "warm") ;
|
||
Italian = mkAP (mkA "Italian") ;
|
||
Expensive = mkAP (mkA "expensive") ;
|
||
Delicious = mkAP (mkA "delicious") ;
|
||
Boring = mkAP (mkA "boring") ;
|
||
}
|
||
</PRE>
|
||
<P></P>
|
||
<P>
|
||
<B>Exercise</B>. Compile the grammar <CODE>FoodsEng</CODE> and generate
|
||
and parse some sentences.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Write a concrete syntax of <CODE>Foods</CODE> for Italian
|
||
or some other language included in the resource library. You can
|
||
compare the results with the hand-written
|
||
grammars presented earlier in this tutorial.
|
||
</P>
|
||
<A NAME="toc86"></A>
|
||
<H2>Functor implementation of multilingual grammars</H2>
|
||
<P>
|
||
<a name="secfunctor"></a>
|
||
</P>
|
||
<P>
|
||
If you did the exercise of writing a concrete syntax of <CODE>Foods</CODE> for some other
|
||
language, you probably noticed that much of the code looks exactly the same
|
||
as for English. The reason for this is that the <CODE>Syntax</CODE> API is the
|
||
same for all languages. This is in turn possible because
|
||
all languages (at least those in the resource package)
|
||
implement the same syntactic structures. Moreover, languages tend to use the
|
||
syntactic structures in similar ways, even though this is not exceptionless.
|
||
But usually, it is only the lexical parts of a concrete syntax that
|
||
we need to write anew for a new language. Thus, to port a grammar to
|
||
a new language, you
|
||
</P>
|
||
<OL>
|
||
<LI>copy the concrete syntax of a given language
|
||
<LI>change the words (strings and inflection paradigms)
|
||
</OL>
|
||
|
||
<P>
|
||
Now, programming by copy-and-paste is not worthy of a functional programmer!
|
||
So, can we write a <I>function</I> that takes care of the shared parts of grammar modules?
|
||
Yes, we can. It is not a function in the <CODE>fun</CODE> or <CODE>oper</CODE> sense, but
|
||
a function operating on modules, called a <B>functor</B>. This construct
|
||
is familiar from the functional programming
|
||
languages ML and OCaml, but it does not
|
||
exist in Haskell. It also bears some resemblance to templates in C++.
|
||
Functors are also known as <B>parametrized modules</B>.
|
||
</P>
|
||
<P>
|
||
In GF, a functor is a module that <CODE>open</CODE>s one or more <B>interfaces</B>.
|
||
An <CODE>interface</CODE> is a module similar to a <CODE>resource</CODE>, but it only
|
||
contains the <I>types</I> of <CODE>oper</CODE>s, not their definitions. You can think
|
||
of an interface as a kind of a record type. The <CODE>oper</CODE> names are the
|
||
labels of this record type. The corresponding <I>record</I> is called an
|
||
<B>instance</B> of the interface.
|
||
Thus a functor is a module-level function taking instances as
|
||
arguments and producing modules as values.
|
||
</P>
|
||
<P>
|
||
Let us now write a functor implementation of the <CODE>Food</CODE> grammar.
|
||
Consider its module header first:
|
||
</P>
|
||
<PRE>
|
||
incomplete concrete FoodsI of Foods = open Syntax, LexFoods in
|
||
</PRE>
|
||
<P>
|
||
A functor is distinguished from an ordinary module by the leading
|
||
keyword <CODE>incomplete</CODE>.
|
||
</P>
|
||
<P>
|
||
In the functor-function analogy, <CODE>FoodsI</CODE> would be presented as a function
|
||
with the following type signature:
|
||
</P>
|
||
<PRE>
|
||
FoodsI :
|
||
instance of Syntax -> instance of LexFoods -> concrete of Foods
|
||
</PRE>
|
||
<P>
|
||
It takes as arguments instances of two interfaces:
|
||
</P>
|
||
<UL>
|
||
<LI><CODE>Syntax</CODE>, the resource grammar interface
|
||
<LI><CODE>LexFoods</CODE>, the domain-specific lexicon interface
|
||
</UL>
|
||
|
||
<P>
|
||
Functors opening <CODE>Syntax</CODE> and a domain lexicon interface are in fact
|
||
so typical in GF applications, that this structure could be called
|
||
a <B>design pattern</B>
|
||
for GF grammars. What makes this pattern so useful is, again, that
|
||
languages tend to use the same syntactic structures and only differ in words.
|
||
</P>
|
||
<P>
|
||
We will show the exact syntax of interfaces and instances in next Section.
|
||
Here it is enough to know that we have
|
||
</P>
|
||
<UL>
|
||
<LI><CODE>SyntaxGer</CODE>, an instance of <CODE>Syntax</CODE>
|
||
<LI><CODE>LexFoodsGer</CODE>, an instance of <CODE>LexFoods</CODE>
|
||
</UL>
|
||
|
||
<P>
|
||
Then we can complete the German implementation by "applying" the functor:
|
||
</P>
|
||
<PRE>
|
||
FoodI SyntaxGer LexFoodsGer : concrete of Foods
|
||
</PRE>
|
||
<P>
|
||
The GF syntax for doing so is
|
||
</P>
|
||
<PRE>
|
||
concrete FoodsGer of Foods = FoodsI with
|
||
(Syntax = SyntaxGer),
|
||
(LexFoods = LexFoodsGer) ;
|
||
</PRE>
|
||
<P>
|
||
Notice that this is the <I>whole</I> module, not just a header of it.
|
||
The module body is received from <CODE>FoodsI</CODE>, by instantiating the
|
||
interface constants with their definitions given in the German
|
||
instances. A module of this form, characterized by the keyword <CODE>with</CODE>, is
|
||
called a <B>functor instantiation</B>.
|
||
</P>
|
||
<P>
|
||
Here is the complete code for the functor <CODE>FoodsI</CODE>:
|
||
</P>
|
||
<PRE>
|
||
--# -path=.:../foods:present:prelude
|
||
|
||
incomplete concrete FoodsI of Foods = open Syntax, LexFoods in {
|
||
lincat
|
||
Phrase = Cl ;
|
||
Item = NP ;
|
||
Kind = CN ;
|
||
Quality = AP ;
|
||
lin
|
||
Is item quality = mkCl item quality ;
|
||
This kind = mkNP this_QuantSg kind ;
|
||
That kind = mkNP that_QuantSg kind ;
|
||
These kind = mkNP these_QuantPl kind ;
|
||
Those kind = mkNP those_QuantPl kind ;
|
||
QKind quality kind = mkCN quality kind ;
|
||
Very quality = mkAP very_AdA quality ;
|
||
|
||
Wine = mkCN wine_N ;
|
||
Pizza = mkCN pizza_N ;
|
||
Cheese = mkCN cheese_N ;
|
||
Fish = mkCN fish_N ;
|
||
Fresh = mkAP fresh_A ;
|
||
Warm = mkAP warm_A ;
|
||
Italian = mkAP italian_A ;
|
||
Expensive = mkAP expensive_A ;
|
||
Delicious = mkAP delicious_A ;
|
||
Boring = mkAP boring_A ;
|
||
}
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc87"></A>
|
||
<H2>Interfaces and instances</H2>
|
||
<P>
|
||
<a name="secinterface"></a>
|
||
</P>
|
||
<P>
|
||
Let us now define the <CODE>LexFoods</CODE> interface:
|
||
</P>
|
||
<PRE>
|
||
interface LexFoods = open Syntax in {
|
||
oper
|
||
wine_N : N ;
|
||
pizza_N : N ;
|
||
cheese_N : N ;
|
||
fish_N : N ;
|
||
fresh_A : A ;
|
||
warm_A : A ;
|
||
italian_A : A ;
|
||
expensive_A : A ;
|
||
delicious_A : A ;
|
||
boring_A : A ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
In this interface, only lexical items are declared. In general, an
|
||
interface can declare any functions and also types. The <CODE>Syntax</CODE>
|
||
interface does so.
|
||
</P>
|
||
<P>
|
||
Here is a German instance of the interface.
|
||
</P>
|
||
<PRE>
|
||
instance LexFoodsGer of LexFoods = open SyntaxGer, ParadigmsGer in {
|
||
oper
|
||
wine_N = mkN "Wein" ;
|
||
pizza_N = mkN "Pizza" "Pizzen" feminine ;
|
||
cheese_N = mkN "K<>se" "K<>sen" masculine ;
|
||
fish_N = mkN "Fisch" ;
|
||
fresh_A = mkA "frisch" ;
|
||
warm_A = mkA "warm" "w<>rmer" "w<>rmste" ;
|
||
italian_A = mkA "italienisch" ;
|
||
expensive_A = mkA "teuer" ;
|
||
delicious_A = mkA "k<>stlich" ;
|
||
boring_A = mkA "langweilig" ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
Notice that when an interface opens an interface, such as <CODE>Syntax</CODE>,
|
||
here, then its instance has to open an instance of it. But the instance
|
||
may also open some other resources --- very typically, like here,
|
||
a domain lexicon instance opens a <CODE>Paradigms</CODE> module.
|
||
</P>
|
||
<P>
|
||
Just to complete the picture, we repeat the German functor instantiation
|
||
for <CODE>FoodsI</CODE>, this time with a path directive that makes it compilable.
|
||
</P>
|
||
<PRE>
|
||
--# -path=.:../foods:present:prelude
|
||
|
||
concrete FoodsGer of Foods = FoodsI with
|
||
(Syntax = SyntaxGer),
|
||
(LexFoods = LexFoodsGer) ;
|
||
</PRE>
|
||
<P></P>
|
||
<P>
|
||
<B>Exercise</B>. Compile and test <CODE>FoodsGer</CODE>.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Refactor <CODE>FoodsEng</CODE> into a functor instantiation.
|
||
</P>
|
||
<A NAME="toc88"></A>
|
||
<H2>Adding languages to a functor implementation</H2>
|
||
<P>
|
||
Once we have an application grammar defined by using a functor,
|
||
adding a new language is simple. Just two modules need to be written:
|
||
</P>
|
||
<UL>
|
||
<LI>a domain lexicon instance
|
||
<LI>a functor instantiation
|
||
</UL>
|
||
|
||
<P>
|
||
The functor instantiation is completely mechanical to write.
|
||
Here is one for Finnish:
|
||
</P>
|
||
<PRE>
|
||
--# -path=.:../foods:present:prelude
|
||
|
||
concrete FoodsFin of Foods = FoodsI with
|
||
(Syntax = SyntaxFin),
|
||
(LexFoods = LexFoodsFin) ;
|
||
</PRE>
|
||
<P>
|
||
The domain lexicon instance requires some knowledge of the words of the
|
||
language: what words are used for which concepts, how the words are
|
||
inflected, plus features such as genders. Here is a lexicon instance for
|
||
Finnish:
|
||
</P>
|
||
<PRE>
|
||
instance LexFoodsFin of LexFoods = open SyntaxFin, ParadigmsFin in {
|
||
oper
|
||
wine_N = mkN "viini" ;
|
||
pizza_N = mkN "pizza" ;
|
||
cheese_N = mkN "juusto" ;
|
||
fish_N = mkN "kala" ;
|
||
fresh_A = mkA "tuore" ;
|
||
warm_A = mkA "l<>mmin" ;
|
||
italian_A = mkA "italialainen" ;
|
||
expensive_A = mkA "kallis" ;
|
||
delicious_A = mkA "herkullinen" ;
|
||
boring_A = mkA "tyls<6C>" ;
|
||
}
|
||
</PRE>
|
||
<P></P>
|
||
<P>
|
||
<B>Exercise</B>. Instantiate the functor <CODE>FoodsI</CODE> to some language of
|
||
your choice.
|
||
</P>
|
||
<A NAME="toc89"></A>
|
||
<H2>Division of labour revisited</H2>
|
||
<P>
|
||
One purpose with the resource grammars was stated to be a division
|
||
of labour between linguists and application grammarians. We can now
|
||
reflect on what this means more precisely, by asking ourselves what
|
||
skills are required of grammarians working on different components.
|
||
</P>
|
||
<P>
|
||
Building a GF application starts from the abstract syntax. Writing
|
||
an abstract syntax requires
|
||
</P>
|
||
<UL>
|
||
<LI>understanding of the semantic structure of the application domain
|
||
<LI>knowledge of the GF fragment with categories and functions
|
||
</UL>
|
||
|
||
<P>
|
||
If the concrete syntax is written by using a functor, the programmer
|
||
has to decide what parts of the implementation are put to the interface
|
||
and what parts are shared in the functor. This requires
|
||
</P>
|
||
<UL>
|
||
<LI>knowing how the domain concepts are expressed in natural language
|
||
<LI>knowledge of the resource grammar library --- the categories and combinators
|
||
<LI>understanding what parts are likely to be expressed in language-dependent
|
||
ways, so that they are put to an interface and not the functor
|
||
<LI>knowledge of the GF fragment with function applications and strings
|
||
</UL>
|
||
|
||
<P>
|
||
Instantiating a ready-made functor to a new language is less demanding.
|
||
It requires essentially
|
||
</P>
|
||
<UL>
|
||
<LI>knowing how the domain words are expressed in the language
|
||
<LI>knowing, roughly, how these words are inflected
|
||
<LI>knowledge of the paradigms available in the library
|
||
<LI>knowledge of the GF fragment with function applications and strings
|
||
</UL>
|
||
|
||
<P>
|
||
Notice that none of these tasks requires the use of GF records, tables,
|
||
or parameters. Thus only a small fragment of GF is needed; the rest of
|
||
GF is only relevant for those who write the libraries. Essentially,
|
||
all the machinery introduced in <a href="#chaptwo">the fourth chapter</a> is unnecessary!
|
||
</P>
|
||
<P>
|
||
Of course, grammar writing is not always just straightforward usage of libraries.
|
||
For example, GF can be used for other languages than just those in the
|
||
libraries --- for both natural and formal languages. A knowledge of records
|
||
and tables can, unfortunately, also be needed for understanding GF's error
|
||
messages.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Design a small grammar that can be used for controlling
|
||
an MP3 player. The grammar should be able to recognize commands such
|
||
as <I>play this song</I>, with the following variations:
|
||
</P>
|
||
<UL>
|
||
<LI>verbs: <I>play</I>, <I>remove</I>
|
||
<LI>objects: <I>song</I>, <I>artist</I>
|
||
<LI>determiners: <I>this</I>, <I>the previous</I>
|
||
<LI>verbs without arguments: <I>stop</I>, <I>pause</I>
|
||
</UL>
|
||
|
||
<P>
|
||
The implementation goes in the following phases:
|
||
</P>
|
||
<OL>
|
||
<LI>abstract syntax
|
||
<LI>functor and lexicon interface
|
||
<LI>lexicon instance for the first language
|
||
<LI>functor instantiation for the first language
|
||
<LI>lexicon instance for the second language
|
||
<LI>functor instantiation for the second language
|
||
<LI>...
|
||
</OL>
|
||
|
||
<A NAME="toc90"></A>
|
||
<H2>Restricted inheritance</H2>
|
||
<P>
|
||
A functor implementation using the resource <CODE>Syntax</CODE> interface
|
||
works well as long as all concepts are expressed by using the same structures
|
||
in all languages. If this is not the case, the deviant linearization can
|
||
be made into a parameter and moved to the domain lexicon interface.
|
||
</P>
|
||
<P>
|
||
The <CODE>Foods</CODE> grammar works so well that we have to
|
||
take a contrived example: assume that English has
|
||
no word for <CODE>Pizza</CODE>, but has to use the paraphrase <I>Italian pie</I>.
|
||
This paraphrase is no longer a noun <CODE>N</CODE>, but a complex phrase
|
||
in the category <CODE>CN</CODE>. An obvious way to solve this problem is
|
||
to change interface <CODE>LexFoods</CODE> so that the constant declared for
|
||
<CODE>Pizza</CODE> gets a new type:
|
||
</P>
|
||
<PRE>
|
||
oper pizza_CN : CN ;
|
||
</PRE>
|
||
<P>
|
||
But this solution is unstable: we may end up changing the interface
|
||
and the function with each new language, and we must every time also
|
||
change the interface instances for the old languages to maintain
|
||
type correctness.
|
||
</P>
|
||
<P>
|
||
A better solution is to use <B>restricted inheritance</B>: the English
|
||
instantiation inherits the functor implementation except for the
|
||
constant <CODE>Pizza</CODE>. This is how we write:
|
||
</P>
|
||
<PRE>
|
||
--# -path=.:../foods:present:prelude
|
||
|
||
concrete FoodsEng of Foods = FoodsI - [Pizza] with
|
||
(Syntax = SyntaxEng),
|
||
(LexFoods = LexFoodsEng) **
|
||
open SyntaxEng, ParadigmsEng in {
|
||
|
||
lin Pizza = mkCN (mkA "Italian") (mkN "pie") ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
Restricted inheritance is available for all inherited modules. One can for
|
||
instance exclude some mushrooms and pick up just some fruit in
|
||
the <CODE>FoodMarket</CODE> example "Rsecarchitecture:
|
||
</P>
|
||
<PRE>
|
||
abstract Foodmarket = Food, Fruit [Peach], Mushroom - [Agaric]
|
||
</PRE>
|
||
<P>
|
||
A concrete syntax of <CODE>Foodmarket</CODE> must then have the same inheritance
|
||
restrictions, in order to be well-typed with respect to the abstract syntax.
|
||
</P>
|
||
<A NAME="toc91"></A>
|
||
<H2>Grammar reuse</H2>
|
||
<P>
|
||
The alert reader has certainly noticed an analogy between <CODE>abstract</CODE>
|
||
and <CODE>concrete</CODE>, on the one hand, and <CODE>interface</CODE> and <CODE>instance</CODE>,
|
||
on the other. Why are these two pairs of module types kept separate
|
||
at all? There is, in fact, a very close correspondence between
|
||
judgements in the two kinds of modules:
|
||
</P>
|
||
<PRE>
|
||
cat C <---> oper C : Type
|
||
|
||
fun f : A <---> oper f : A
|
||
|
||
lincat C = T <---> oper C : Type = T
|
||
|
||
lin f = t <---> oper f : A = t
|
||
</PRE>
|
||
<P>
|
||
But there are also some differences:
|
||
</P>
|
||
<UL>
|
||
<LI><CODE>abstract</CODE> and <CODE>concrete</CODE> modules define <B>top-level grammars</B>, i.e.
|
||
grammars that can be used for parsing and linearization; this is because
|
||
<LI>the types and terms in <CODE>concrete</CODE> modules are restricted to a subset
|
||
of those available in <CODE>interface</CODE>, <CODE>instance</CODE>, and <CODE>resource</CODE>
|
||
<LI><CODE>param</CODE> judgements have no counterparts in top-level grammars
|
||
</UL>
|
||
|
||
<P>
|
||
The term that can be used for interfaces, instances, and resources is
|
||
<B>resource-level grammars</B>.
|
||
From these explanations and the above translations it follows that top-level
|
||
grammars are, in a sense, a special case of resource-level grammars.
|
||
</P>
|
||
<P>
|
||
Thus, indeed, abstract syntax modules can be used like interfaces, and concrete syntaxes
|
||
as their instances. The use of top-level grammars as resources
|
||
is called <B>grammar reuse</B>. Whether a library module is a top-level or a
|
||
resource-level module is mostly invisible to application programmers
|
||
(see the Summary <a href="#seclock">here</a>
|
||
for an exception to this). The GF resource grammar
|
||
library itself is in fact built in two layers:
|
||
</P>
|
||
<UL>
|
||
<LI>the <B>ground resource</B>: a set of top-level grammars for syntactic structures
|
||
<LI>the <B>surface resource</B>: a resource-level grammar with overloaded operations
|
||
defined in terms of the ground resource
|
||
</UL>
|
||
|
||
<P>
|
||
Both the ground
|
||
resource and the surface resource can be used by application programmers,
|
||
but it is the surface resource that we use in this book. Because of overloading,
|
||
it has much fewer function names and also flatter trees. For instance, the clause
|
||
<center>
|
||
<I>these very warm pizzas are Italian</I>
|
||
</center>
|
||
which in the surface resource can be built as
|
||
</P>
|
||
<PRE>
|
||
mkCl
|
||
(mkNP these_QuantPl
|
||
(mkCN (mkAP very_AdA (mkAP warm_A)) (mkCN pizza_CN)))
|
||
(mkAP italian_AP)
|
||
</PRE>
|
||
<P>
|
||
has in the ground resource the much more complex tree
|
||
</P>
|
||
<PRE>
|
||
PredVP
|
||
(DetCN (DetPl (PlQuant this_Quant) NoNum NoOrd)
|
||
(AdjCN (AdAP very_AdA (PositA warm_A)) (UseN pizza_N)))
|
||
(UseComp (CompAP (PositA italian_A)))
|
||
</PRE>
|
||
<P>
|
||
The main advantage of using the ground resource is that the trees can then be found
|
||
by using the parser, as shown in the next section. Otherwise, the overloaded surface
|
||
resource constants are much easier to use.
|
||
</P>
|
||
<P>
|
||
Needless to say, once a library has been defined in some way, it is easy to
|
||
build layers of <B>derived libraries</B> on top of it, by using grammar reuse
|
||
and, in the case of multilingual libraries, functors. This is indeed how
|
||
the surface resource has been implemented: as a functored parametrized on
|
||
the abstract syntax of the ground resource.
|
||
</P>
|
||
<A NAME="toc92"></A>
|
||
<H2>Browsing the resource with GF commands</H2>
|
||
<P>
|
||
<a name="secbrowsing"></a>
|
||
</P>
|
||
<P>
|
||
In addition to reading the
|
||
<A HREF="../../lib/resource-1.0/synopsis.html">resource synopsis</A>, you
|
||
can find resource function combinations by using the parser. This
|
||
is so because the resource library is in the end implemented as
|
||
a top-level <CODE>abstract-concrete</CODE> grammar, on which parsing
|
||
and linearization work.
|
||
</P>
|
||
<P>
|
||
Unfortunately, currently (GF 2.8)
|
||
only English and the Scandinavian languages can be
|
||
parsed within acceptable computer resource limits when the full
|
||
resource is used.
|
||
</P>
|
||
<P>
|
||
To look for a syntax tree in the overload API by parsing, do like this:
|
||
</P>
|
||
<PRE>
|
||
% gf -path=alltenses:prelude $GF_LIB_PATH/alltenses/OverLangEng.gfc
|
||
|
||
> p -cat=S -overload "this grammar is too big"
|
||
mkS (mkCl (mkNP this_QuantSg grammar_N) (mkAP too_AdA big_A))
|
||
</PRE>
|
||
<P>
|
||
The <CODE>-overload</CODE> option given to the parser is a directive to find the
|
||
shallowest overloaded term that matches the parse tree.
|
||
</P>
|
||
<P>
|
||
To view linearizations in all languages by parsing from English:
|
||
</P>
|
||
<PRE>
|
||
% gf $GF_LIB_PATH/alltenses/langs.gfcm
|
||
|
||
> p -cat=S -lang=LangEng "this grammar is too big" | tb
|
||
UseCl TPres ASimul PPos (PredVP (DetCN (DetSg (SgQuant this_Quant)
|
||
NoOrd) (UseN grammar_N)) (UseComp (CompAP (AdAP too_AdA (PositA big_A)))))
|
||
Den h<>r grammatiken <20>r f<>r stor
|
||
Esta gram<61>tica es demasiado grande
|
||
(Cyrillic: eta grammatika govorit des'at' jazykov)
|
||
Denne grammatikken er for stor
|
||
Questa grammatica <20> troppo grande
|
||
Diese Grammatik ist zu gro<72>
|
||
Cette grammaire est trop grande
|
||
T<>m<EFBFBD> kielioppi on liian suuri
|
||
This grammar is too big
|
||
Denne grammatik er for stor
|
||
</PRE>
|
||
<P>
|
||
This method shows the unambiguous ground resource functions and not
|
||
the overloaded ones. It uses a precompiled grammar package of the GFCM or GFCC
|
||
format; see <a href="#chapeight">the eighth chapter</a> for more information on this.
|
||
</P>
|
||
<P>
|
||
Unfortunately, the Russian grammar uses at the moment a different
|
||
character encoding than the rest and is therefore not displayed correctly
|
||
in a terminal window. However, the GF syntax editor does display all
|
||
examples correctly --- again, using the ground resource:
|
||
</P>
|
||
<PRE>
|
||
% gfeditor $GF_LIB_PATH/alltenses/langs.gfcm
|
||
</PRE>
|
||
<P>
|
||
When you have constructed the tree, you will see the following screen:
|
||
</P>
|
||
<P>
|
||
<center>
|
||
</P>
|
||
<P>
|
||
<IMG ALIGN="right" SRC="10lang-small.png" BORDER="0" ALT="">
|
||
</P>
|
||
<P>
|
||
</center>
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Find the resource grammar translations for the following
|
||
English phrases (parse in the category <CODE>Phr</CODE>). You can first try to
|
||
build the terms manually.
|
||
</P>
|
||
<P>
|
||
<I>every man loves a woman</I>
|
||
</P>
|
||
<P>
|
||
<I>this grammar speaks more than ten languages</I>
|
||
</P>
|
||
<P>
|
||
<I>which languages aren't in the grammar</I>
|
||
</P>
|
||
<P>
|
||
<I>which languages did you want to speak</I>
|
||
</P>
|
||
<A NAME="toc93"></A>
|
||
<H2>An extended Foods grammar</H2>
|
||
<P>
|
||
<a name="secextended"></a>
|
||
</P>
|
||
<P>
|
||
Now that we know how to find information in the resource grammar,
|
||
we can easily extend the <CODE>Foods</CODE> fragment considerably. We shall enable
|
||
the following new expressions:
|
||
</P>
|
||
<UL>
|
||
<LI>questions: <I>Is this pizza Italian?</I> <I>Which pizza do you want to eat?</I>
|
||
<LI>imperatives: <I>Eat that pizza please!</I>
|
||
<LI>denials: <I>These pizzas are not Italian.</I>
|
||
<LI>verbs: <I>eat</I>, <I>pay</I>
|
||
<LI>guests, in addition to food items: <I>I, you, this lady</I>
|
||
</UL>
|
||
|
||
<A NAME="toc94"></A>
|
||
<H3>Abstract syntax</H3>
|
||
<P>
|
||
Since we don't want to change the already existing <CODE>Foods</CODE> module,
|
||
we build an extension of it, <CODE>ExtFoods</CODE>:
|
||
</P>
|
||
<PRE>
|
||
abstract ExtFoods = Foods ** {
|
||
|
||
flags startcat=Move ;
|
||
|
||
cat
|
||
Move ; -- dialogue move: declarative, question, or imperative
|
||
Verb ; -- transitive verb
|
||
Guest ; -- guest in restaurant
|
||
GuestKind ; -- type of guest
|
||
|
||
fun
|
||
MAssert : Phrase -> Move ; -- This pizza is warm.
|
||
MDeny : Phrase -> Move ; -- This pizza isn't warm.
|
||
MAsk : Phrase -> Move ; -- Is this pizza warm?
|
||
|
||
PVerb : Guest -> Verb -> Item -> Phrase ; -- we eat this pizza
|
||
PVerbWant : Guest -> Verb -> Item -> Phrase ; -- we want to eat this pizza
|
||
|
||
WhichVerb :
|
||
Kind -> Guest -> Verb -> Move ; -- Which pizza do you eat?
|
||
WhichVerbWant :
|
||
Kind -> Guest -> Verb -> Move ; -- Which pizza do you want to eat?
|
||
WhichIs : Kind -> Quality -> Move ; -- Which wine is Italian?
|
||
|
||
Do : Verb -> Item -> Move ; -- Pay this wine!
|
||
DoPlease : Verb -> Item -> Move ; -- Pay this wine please!
|
||
|
||
I, You, We : Guest ;
|
||
|
||
GThis, GThat, GThese, GThose : GuestKind -> Guest ;
|
||
|
||
Eat, Drink, Pay : Verb ;
|
||
|
||
Lady, Gentleman : GuestKind ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
The concrete syntax is implemented by a functor that extends the
|
||
already defined functor <CODE>FoodsI</CODE>.
|
||
</P>
|
||
<PRE>
|
||
incomplete concrete ExtFoodsI of ExtFoods =
|
||
FoodsI ** open Syntax, LexFoods in {
|
||
|
||
flags lexer=text ; unlexer=text ;
|
||
</PRE>
|
||
<P>
|
||
The flags set up a lexer and unlexer that can deal with sentence-initial
|
||
capital letters and proper spacing with punctuation (see <a href="#seclexing">here</a>
|
||
for more information on lexers and unlexers).
|
||
</P>
|
||
<A NAME="toc95"></A>
|
||
<H3>Linearization types</H3>
|
||
<P>
|
||
If we look at the resource documentation, we find several categories
|
||
that are above the clause level and can thus host different kinds
|
||
of dialogue moves:
|
||
</P>
|
||
<TABLE ALIGN="center" CELLPADDING="4" BORDER="1">
|
||
<TR>
|
||
<TH>Category</TH>
|
||
<TH>Explanation</TH>
|
||
<TH COLSPAN="2">Example</TH>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>Text</CODE></TD>
|
||
<TD>text consisting of phrases</TD>
|
||
<TD><I>He is here. Why?</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>Phr</CODE></TD>
|
||
<TD>phrase in a text</TD>
|
||
<TD><I>but be quiet please</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>Utt</CODE></TD>
|
||
<TD>sentence, question, word...</TD>
|
||
<TD><I>be quiet</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>S</CODE></TD>
|
||
<TD>declarative sentence</TD>
|
||
<TD><I>she lived here</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>QS</CODE></TD>
|
||
<TD>question</TD>
|
||
<TD><I>where did she live</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>Imp</CODE></TD>
|
||
<TD>imperative</TD>
|
||
<TD><I>look at this</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>QCl</CODE></TD>
|
||
<TD>question clause, with all tenses</TD>
|
||
<TD><I>why does she walk</I></TD>
|
||
</TR>
|
||
</TABLE>
|
||
|
||
<P></P>
|
||
<P>
|
||
We also find that only the category <CODE>Text</CODE> contains punctuation marks.
|
||
So we choose this as the linearization type of <CODE>Move</CODE>. The other types
|
||
are quite obvious.
|
||
</P>
|
||
<PRE>
|
||
lincat
|
||
Move = Text ;
|
||
Verb = V2 ;
|
||
Guest = NP ;
|
||
GuestKind = CN ;
|
||
</PRE>
|
||
<P>
|
||
The category <CODE>V2</CODE> of <B>two-place verbs</B> includes both
|
||
<B>transitive verbs</B> that take <B>direct objects</B> (e.g. <I>we watch him</I>)
|
||
and verbs that take other kinds of <B>complements</B>, often with
|
||
prepositions (<I>we look at him</I>). In a multilingual grammar, it is
|
||
not guaranteed that transitive verbs are transitive in all languages,
|
||
so the more general notion of two-place verb is more appropriate.
|
||
</P>
|
||
<A NAME="toc96"></A>
|
||
<H3>Linearization rules</H3>
|
||
<P>
|
||
Now we need to find constructors that combine the new categories in
|
||
appropriate ways. To form a text from a clause, we first make it into
|
||
a sentence with <CODE>mkS</CODE>, and then apply <CODE>mkText</CODE>:
|
||
</P>
|
||
<PRE>
|
||
lin MAssert p = mkText (mkS p) ;
|
||
</PRE>
|
||
<P>
|
||
The function <CODE>mkS</CODE> has in the resource synopsis been given the type
|
||
</P>
|
||
<PRE>
|
||
mkS : (Tense) -> (Ant) -> (Pol) -> Cl -> S
|
||
</PRE>
|
||
<P>
|
||
Parentheses around type names do not make any difference for the GF compiler,
|
||
but in the synopsis notation they indicate <B>optionality</B>: any of the
|
||
optional arguments can be omitted, and there is an instance of <CODE>mkS</CODE>
|
||
available. For each optional type, it uses the <B>default value</B> for that
|
||
type, which for the <B>polarity</B> <CODE>Pol</CODE> is positive i.e. unnegated.
|
||
To build a negative sentence, we use an explicit polarity constructor:
|
||
</P>
|
||
<PRE>
|
||
MDeny p = mkText (mkS negativePol p) ;
|
||
</PRE>
|
||
<P>
|
||
Of course, we could have used <CODE>positivePol</CODE> in the first rule, instead of
|
||
relying on the default. (The types <CODE>Tense</CODE> and <CODE>Ant</CODE> will be explained
|
||
<a href="#sectense">here</a>.)
|
||
</P>
|
||
<P>
|
||
Phrases can be made into <B>question sentences</B>, which in turn can be
|
||
made into texts in a similar way as sentences; the default
|
||
punctuation mark is not the full stop but the question mark.
|
||
</P>
|
||
<PRE>
|
||
MAsk p = mkText (mkQS p) ;
|
||
</PRE>
|
||
<P>
|
||
There is an <CODE>mkCl</CODE> instance that directly builds a clause from a noun phrase,
|
||
a two-place verb, and another noun phrase.
|
||
</P>
|
||
<PRE>
|
||
PVerb = mkCl ;
|
||
</PRE>
|
||
<P>
|
||
The auxiliary verb <I>want</I> requires a <B>verb phrase</B> (<CODE>VP</CODE>) as its complement. It
|
||
can be built from a two-place verb and its noun phrase complement.
|
||
</P>
|
||
<PRE>
|
||
PVerbWant guest verb item = mkCl guest want_VV (mkVP verb item) ;
|
||
</PRE>
|
||
<P>
|
||
The <B>interrogative determiner</B> (<CODE>IDet</CODE>) <I>which</I> can be combined with
|
||
a common noun to form an <B>interrogative phrase</B> (<CODE>IP</CODE>). This <CODE>IP</CODE> can then
|
||
be used as a subject in a <B>question clause</B> (<CODE>QCl</CODE>), which in turn is
|
||
made into a <CODE>QS</CODE> and finally to a <CODE>Text</CODE>.
|
||
</P>
|
||
<PRE>
|
||
WhichIs kind quality =
|
||
mkText (mkQS (mkQCl (mkIP whichSg_IDet kind) (mkVP quality))) ;
|
||
</PRE>
|
||
<P>
|
||
When interrogative phrases are used as <I>objects</I>, the resource library
|
||
uses a category named <CODE>Slash</CODE> of
|
||
objectless sentences. The name cames from the <B>slash categories</B> of the
|
||
GPSG grammar formalism
|
||
(Gazdar & al. 1985). Slashes can be formed from subjects and two-place verbs,
|
||
also with an intervening auxiliary verb.
|
||
</P>
|
||
<PRE>
|
||
WhichVerb kind guest verb =
|
||
mkText (mkQS (mkQCl (mkIP whichSg_IDet kind)
|
||
(mkSlash guest verb))) ;
|
||
WhichVerbWant kind guest verb =
|
||
mkText (mkQS (mkQCl (mkIP whichSg_IDet kind)
|
||
(mkSlash guest want_VV verb))) ;
|
||
</PRE>
|
||
<P>
|
||
Finally, we form the <B>imperative</B> (<CODE>Imp</CODE>) of a transitive verb
|
||
and its object. We make it into a <B>polite</B> form utterance, and finally
|
||
into a <CODE>Text</CODE> with an exclamation mark.
|
||
</P>
|
||
<PRE>
|
||
Do verb item =
|
||
mkText
|
||
(mkPhr (mkUtt politeImpForm (mkImp verb item))) exclMarkPunct ;
|
||
DoPlease verb item =
|
||
mkText
|
||
(mkPhr (mkUtt politeImpForm (mkImp verb item)) please_Voc)
|
||
exclMarkPunct ;
|
||
</PRE>
|
||
<P>
|
||
The rest of the concrete syntax is straightforward use of structural words,
|
||
</P>
|
||
<PRE>
|
||
I = mkNP i_Pron ;
|
||
You = mkNP youPol_Pron ;
|
||
We = mkNP we_Pron ;
|
||
GThis = mkNP this_QuantSg ;
|
||
GThat = mkNP that_QuantSg ;
|
||
GThese = mkNP these_QuantPl ;
|
||
GThose = mkNP those_QuantPl ;
|
||
</PRE>
|
||
<P>
|
||
and of the food lexicon,
|
||
</P>
|
||
<PRE>
|
||
Eat = eat_V2 ;
|
||
Drink = drink_V2 ;
|
||
Pay = pay_V2 ;
|
||
Lady = lady_N ;
|
||
Gentleman = gentleman_N ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
Notice that we have no reason to build an extension of <CODE>LexFoods</CODE>, but we just
|
||
add words to the old one. Since <CODE>LexFoods</CODE> instances are resource modules,
|
||
the superfluous definitions that they contain have no effect on the
|
||
modules that just <CODE>open</CODE> them, and thus the smaller <CODE>Foods</CODE> grammars
|
||
don't suffer from the additions we make.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Port the <CODE>ExtFoods</CODE> grammars to some new languages, building
|
||
on the <CODE>Foods</CODE> implementations from previous sections, and using the functor
|
||
defined in this section.
|
||
</P>
|
||
<A NAME="toc97"></A>
|
||
<H2>Tenses</H2>
|
||
<P>
|
||
<a name="sectense"></a>
|
||
</P>
|
||
<P>
|
||
When compiling the <CODE>ExtFoods</CODE> grammars, we have used the path
|
||
</P>
|
||
<PRE>
|
||
--# -path=.:../foods:present:prelude
|
||
</PRE>
|
||
<P>
|
||
where the library subdirectory <CODE>present</CODE> refers to a restricted version
|
||
of the resource that covers only the present tense of verbs and sentences.
|
||
Having this version available is motivatad by efficiency reasons: tenses
|
||
produce in many languages a manifold of forms and combinations, which
|
||
multiply the size of the grammar; at the same time, many applications,
|
||
both technical ones and spoken dialogues, only need the present tense.
|
||
</P>
|
||
<P>
|
||
But it is easy change the grammars so that they admit of the full set
|
||
of tenses. It is enough to change the path to
|
||
</P>
|
||
<PRE>
|
||
--# -path=.:../foods:alltenses:prelude
|
||
</PRE>
|
||
<P>
|
||
and recompile the grammars from source (flag <CODE>-src</CODE>); the libraries are
|
||
not recompiled, because their sources cannot be found on the path list.
|
||
Then it is possible to see all the tenses of
|
||
phrases, by using the <CODE>-all</CODE> flag in linearization:
|
||
</P>
|
||
<PRE>
|
||
> gr -cat=Phrase | l -all
|
||
This wine is delicious
|
||
Is this wine delicious
|
||
This wine isn't delicious
|
||
Isn't this wine delicious
|
||
This wine is not delicious
|
||
Is this wine not delicious
|
||
This wine has been delicious
|
||
Has this wine been delicious
|
||
This wine hasn't been delicious
|
||
Hasn't this wine been delicious
|
||
This wine has not been delicious
|
||
Has this wine not been delicious
|
||
This wine was delicious
|
||
Was this wine delicious
|
||
This wine wasn't delicious
|
||
Wasn't this wine delicious
|
||
This wine was not delicious
|
||
Was this wine not delicious
|
||
This wine had been delicious
|
||
Had this wine been delicious
|
||
This wine hadn't been delicious
|
||
Hadn't this wine been delicious
|
||
This wine had not been delicious
|
||
Had this wine not been delicious
|
||
This wine will be delicious
|
||
Will this wine be delicious
|
||
This wine won't be delicious
|
||
Won't this wine be delicious
|
||
This wine will not be delicious
|
||
Will this wine not be delicious
|
||
This wine will have been delicious
|
||
Will this wine have been delicious
|
||
This wine won't have been delicious
|
||
Won't this wine have been delicious
|
||
This wine will not have been delicious
|
||
Will this wine not have been delicious
|
||
This wine would be delicious
|
||
Would this wine be delicious
|
||
This wine wouldn't be delicious
|
||
Wouldn't this wine be delicious
|
||
This wine would not be delicious
|
||
Would this wine not be delicious
|
||
This wine would have been delicious
|
||
Would this wine have been delicious
|
||
This wine wouldn't have been delicious
|
||
Wouldn't this wine have been delicious
|
||
This wine would not have been delicious
|
||
Would this wine not have been delicious
|
||
</PRE>
|
||
<P>
|
||
In addition to tenses, the linearization writes all parametric
|
||
variations --- polarity and word order (direct vs. inverted) --- as
|
||
well as the variation between contracted and full negation words.
|
||
Of course, the list is even longer in languages that have more
|
||
tenses and moods, e.g. the Romance languages.
|
||
</P>
|
||
<P>
|
||
In the <CODE>ExtFoods</CODE> grammar, tenses never find their way to the
|
||
top level of <CODE>Move</CODE>s. Therefore it is useless to carry around
|
||
the clause and verb tenses given in the <CODE>alltenses</CODE> set of libraries.
|
||
But with the library, it is easy to add tenses to <CODE>Move</CODE>s. For
|
||
instance, one can add the rules
|
||
</P>
|
||
<PRE>
|
||
fun MAssertFut : Phrase -> Move ; -- I will pay this wine
|
||
fun MAssertPastPerf : Phrase -> Move ; -- I had paid that wine
|
||
lin MAssertFut p = mkText (mkS futureTense p) ;
|
||
lin MAssertPastPerf p = mkText (mkS pastTense anteriorAnt p) ;
|
||
</PRE>
|
||
<P>
|
||
Comparison with <CODE>MAssert</CODE> above shows that the absence of the tense
|
||
and anteriority features defaults to present simultaneous tenses.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Measure the size of the context-free grammar corresponding to
|
||
some concrete syntax of <CODE>ExtFoods</CODE> with all tenses.
|
||
You can do this by printing the grammar in the context-free format
|
||
(<CODE>print_grammar -printer=cfg</CODE>) and counting the lines.
|
||
</P>
|
||
<A NAME="toc98"></A>
|
||
<H2>Summary of GF language features</H2>
|
||
<A NAME="toc99"></A>
|
||
<H3>Interfaces and instances</H3>
|
||
<P>
|
||
An <B>interface module</B> (<CODE>interface</CODE> <I>I</I>) is like a <CODE>resource</CODE> module,
|
||
the difference being that it does not need to give definitions in
|
||
its <CODE>oper</CODE> and <CODE>param</CODE> judgements. Definitions are, however,
|
||
allowed, and they may use constants that appear undefined in the
|
||
module. For example, here is an interface for predication, which
|
||
is parametrized on NP case and agreement features, and on the constituent
|
||
order:
|
||
</P>
|
||
<PRE>
|
||
interface Predication = {
|
||
param
|
||
Case ;
|
||
Agreement ;
|
||
oper
|
||
subject : Case ;
|
||
object : Case ;
|
||
order : (verb,subj,obj : String) -> String ;
|
||
|
||
NP : Type = {s : Case => Str ; a : Agreement} ;
|
||
TV : Type = {s : Agreement => Str} ;
|
||
|
||
sentence : TV -> NP -> NP -> {s : Str} = \verb,subj,obj -> {
|
||
s = order (verb ! subj.a) (subj ! subject) (obj ! object) ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
An <B>instance module</B> (<CODE>instance</CODE> <I>J</I> <CODE>of</CODE> <I>I</I>) is also like a
|
||
<CODE>resource</CODE>, but it is compiled in union with the interface that it
|
||
is an instance <CODE>of</CODE>. This means that the definitions given in the
|
||
instance are type-checked with respect to the types given in the
|
||
interface. Moreover, overwriting types or definitions given in the interface
|
||
is not allowed. But it is legal for an instance to contain definitions
|
||
not included in the corresponding interface. Here is an instance of
|
||
<CODE>Predication</CODE>, suitable for languages like English.
|
||
</P>
|
||
<PRE>
|
||
instance PredicationSimpleSVO of Predication = {
|
||
param
|
||
Case = Nom | Acc | Gen ;
|
||
Agreement = Agr Number Person ;
|
||
|
||
-- two new types
|
||
Number = Sg | Pl ;
|
||
Person = P1 | P2 | P3 ;
|
||
|
||
oper
|
||
subject = Nom ;
|
||
object = Acc ;
|
||
order = \verb,subj,obj -> subj ++ verb ++ obj ;
|
||
|
||
-- the rest of the definitions don't need repetition
|
||
}
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc100"></A>
|
||
<H3>Grammar reuse</H3>
|
||
<P>
|
||
<a name="seclock"></a>
|
||
</P>
|
||
<P>
|
||
Abstract syntax modules can be used like interfaces, and concrete syntaxes
|
||
as their instances. The following translations then take place:
|
||
</P>
|
||
<PRE>
|
||
cat C ---> oper C : Type
|
||
|
||
fun f : A ---> oper f : A*
|
||
|
||
lincat C = T ---> oper C : Type = T'
|
||
|
||
lin f = t ---> oper f : A* = t'
|
||
</PRE>
|
||
<P>
|
||
This translation is called <B>grammar reuse</B>. It uses a homomorphism
|
||
from abstract types and terms to the concrete types and terms. For the
|
||
sake of more type safety, the types are not exactly the same. Currently
|
||
(GF 2.8), the type <I>T'</I> formed from the linearization type <I>T</I> of
|
||
a category <I>C</I> is <I>T</I> extended with a dummy <B>lock field</B>. Thus
|
||
</P>
|
||
<PRE>
|
||
lincat C = T ---> oper C = T ** {lock_C : {}}
|
||
</PRE>
|
||
<P>
|
||
and the linearization terms are lifted correspondingly. The user of
|
||
a GF library should never see any lock fields; when they appear in
|
||
the compiler's warnings, they indicate that some library category is
|
||
constructed improperly by a user program.
|
||
</P>
|
||
<A NAME="toc101"></A>
|
||
<H3>Functors</H3>
|
||
<P>
|
||
A <B>parametrized module</B>, aka. an <B>incomplete module</B>, or a
|
||
<B>functor</B>, is any module that <CODE>open</CODE>s an <CODE>interface</CODE> (or
|
||
an <CODE>abstract</CODE>). Several interfaces may be opened by one
|
||
functor. The module header must be prefixed by the word <CODE>incomplete</CODE>.
|
||
Here is a typical example, using the resource <CODE>Syntax</CODE> and
|
||
a domain specific lexicon:
|
||
</P>
|
||
<PRE>
|
||
incomplete concrete DomainI of Domain = open Syntax, Lex in {...} ;
|
||
</PRE>
|
||
<P>
|
||
A <B>functor instantiation</B> is a module that inherits a functor and
|
||
provides an instance to each of its open interfaces. Here is an example:
|
||
</P>
|
||
<PRE>
|
||
concrete DomainSwe of Domain = DomainI with
|
||
(Syntax = SyntaxSwe),
|
||
(Lex = LexSwe) ;
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc102"></A>
|
||
<H3>Restricted inheritance</H3>
|
||
<P>
|
||
A module of any type can make <B>restricted inheritance</B>, which is
|
||
either exclusion or inclusion:
|
||
</P>
|
||
<PRE>
|
||
module M = A[f,g], B-[k] ** ...
|
||
</PRE>
|
||
<P>
|
||
A concrete syntax given to an abstract syntax that uses restricted inheritance
|
||
must make the corresponding restrictions. In addition, the concrete syntax can
|
||
make its own restrictions in order to redefine inherited linearization types and
|
||
rules.
|
||
</P>
|
||
<P>
|
||
Overriding old definitions without explicit restrictions is not allowed.
|
||
</P>
|
||
<A NAME="toc103"></A>
|
||
<H1>Refining semantics in abstract syntax</H1>
|
||
<P>
|
||
<a name="chapsix"></a>
|
||
</P>
|
||
<P>
|
||
While the concrete syntax constructs of GF have been already
|
||
covered, there is much more that can be done in the abstract
|
||
syntax. The techniques of <B>dependent types</B> and
|
||
<B>higher order abstract syntax</B> are introduced in this chapter,
|
||
which thereby concludes the presentation of the GF language.
|
||
</P>
|
||
<P>
|
||
Many of the examples in this chapter are somewhat less close to
|
||
applications than the ones shown before. Moreover, the tools for
|
||
embedded grammars in <a href="#chapeight">the eighth chapter</a> do not yet fully support dependent
|
||
types and higher order abstract syntax.
|
||
</P>
|
||
<A NAME="toc104"></A>
|
||
<H2>GF as a logical framework</H2>
|
||
<P>
|
||
In this chapter, we will show how
|
||
to encode advanced semantic concepts in an abstract syntax.
|
||
We use concepts inherited from <B>type theory</B>. Type theory
|
||
is the basis of many systems known as <B>logical frameworks</B>, which are
|
||
used for representing mathematical theorems and their proofs on a computer.
|
||
In fact, GF has a logical framework as its proper part:
|
||
this part is the abstract syntax.
|
||
</P>
|
||
<P>
|
||
In a logical framework, the formalization of a mathematical theory
|
||
is a set of type and function declarations. The following is an example
|
||
of such a theory, represented as an <CODE>abstract</CODE> module in GF.
|
||
</P>
|
||
<PRE>
|
||
abstract Arithm = {
|
||
cat
|
||
Prop ; -- proposition
|
||
Nat ; -- natural number
|
||
fun
|
||
Zero : Nat ; -- 0
|
||
Succ : Nat -> Nat ; -- the successor of x
|
||
Even : Nat -> Prop ; -- x is even
|
||
And : Prop -> Prop -> Prop ; -- A and B
|
||
}
|
||
</PRE>
|
||
<P>
|
||
This example does not show any new type-theoretical constructs yet, but
|
||
it could nevertheless be used as a part of a proof system for arithmetic.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Give a concrete syntax of <CODE>Arithm</CODE>, preferably
|
||
by using the resource library.
|
||
</P>
|
||
<A NAME="toc105"></A>
|
||
<H2>Dependent types</H2>
|
||
<P>
|
||
<a name="secsmarthouse"></a>
|
||
</P>
|
||
<P>
|
||
<B>Dependent types</B> are a characteristic feature of GF,
|
||
inherited from the <B>constructive type theory</B> of Martin-L<>f and
|
||
distinguishing GF from most other grammar formalisms and
|
||
functional programming languages.
|
||
</P>
|
||
<P>
|
||
Dependent types can be used for stating stronger
|
||
<B>conditions of well-formedness</B> than ordinary types.
|
||
A simple example is a "smart house" system, which
|
||
defines voice commands for household appliances. This example
|
||
is borrowed from the
|
||
Regulus Book
|
||
(Rayner & al. 2006).
|
||
</P>
|
||
<P>
|
||
One who enters a smart house can use a spoken <CODE>Command</CODE> to dim lights, switch
|
||
on the fan, etc. For <CODE>Device</CODE>s of each <CODE>Kind</CODE>, there is a set of
|
||
<CODE>Action</CODE>s that can be performed on them; thus one can dim the lights but
|
||
not the fan, for example. These dependencies can be expressed
|
||
by making the type <CODE>Action</CODE> dependent on <CODE>Kind</CODE>. We express these
|
||
dependencies in <CODE>cat</CODE> declarations by attaching argument types to
|
||
categories:
|
||
</P>
|
||
<PRE>
|
||
cat
|
||
Command ;
|
||
Kind ;
|
||
Device Kind ; -- argument type Kind
|
||
Action Kind ;
|
||
</PRE>
|
||
<P>
|
||
The crucial use of the dependencies is made in the rule for forming commands:
|
||
</P>
|
||
<PRE>
|
||
fun CAction : (k : Kind) -> Action k -> Device k -> Command ;
|
||
</PRE>
|
||
<P>
|
||
In other words: an action and a device can be combined into a command only
|
||
if they are of the same <CODE>Kind</CODE> <CODE>k</CODE>. If we have the functions
|
||
</P>
|
||
<PRE>
|
||
DKindOne : (k : Kind) -> Device k ; -- the light
|
||
|
||
light, fan : Kind ;
|
||
dim : Action light ;
|
||
</PRE>
|
||
<P>
|
||
we can form the syntax tree
|
||
</P>
|
||
<PRE>
|
||
CAction light dim (DKindOne light)
|
||
</PRE>
|
||
<P>
|
||
but we cannot form the trees
|
||
</P>
|
||
<PRE>
|
||
CAction light dim (DKindOne fan)
|
||
CAction fan dim (DKindOne light)
|
||
CAction fan dim (DKindOne fan)
|
||
</PRE>
|
||
<P>
|
||
Linearization rules are written as usual: the concrete syntax does not
|
||
know if a category is a dependent type. In English, one could write as follows:
|
||
</P>
|
||
<PRE>
|
||
lincat Action = {s : Str} ;
|
||
lin CAction _ act dev = {s = act.s ++ dev.s} ;
|
||
</PRE>
|
||
<P>
|
||
Notice that the argument for <CODE>Kind</CODE> does not appear in the linearization;
|
||
therefore it is good practice to make this clear by
|
||
using a wild card for it, rather than a real
|
||
variable.
|
||
As we will show,
|
||
the type checker can reconstruct the kind from the <CODE>dev</CODE> argument.
|
||
</P>
|
||
<P>
|
||
Parsing with dependent types is performed in two phases:
|
||
</P>
|
||
<OL>
|
||
<LI>context-free parsing
|
||
<LI>filtering through type checker
|
||
</OL>
|
||
|
||
<P>
|
||
If you just parse in the usual way, you don't enter the second phase, and
|
||
the <CODE>kind</CODE> argument is not found:
|
||
</P>
|
||
<PRE>
|
||
> parse "dim the light"
|
||
CAction ? dim (DKindOne light)
|
||
</PRE>
|
||
<P>
|
||
Moreover, type-incorrect commands are not rejected:
|
||
</P>
|
||
<PRE>
|
||
> parse "dim the fan"
|
||
CAction ? dim (DKindOne fan)
|
||
</PRE>
|
||
<P>
|
||
The question mark <CODE>?</CODE> is a <B>metavariable</B>, and is returned by the parser
|
||
for any subtree that is suppressed by a linearization rule.
|
||
These are exactly the same kind of metavariables as were used <a href="#secediting">here</a>
|
||
to mark incomplete parts of trees in the syntax editor.
|
||
</P>
|
||
<P>
|
||
To get rid of metavariables, we must feed the parse result into the
|
||
second phase of <B>solving</B> them. The <CODE>solve</CODE> process uses the dependent
|
||
type checker to restore the values of the metavariables. It is invoked by
|
||
the command <CODE>put_tree = pt</CODE> with the flag <CODE>-transform=solve</CODE>:
|
||
</P>
|
||
<PRE>
|
||
> parse "dim the light" | put_tree -transform=solve
|
||
CAction light dim (DKindOne light)
|
||
</PRE>
|
||
<P>
|
||
The <CODE>solve</CODE> process may fail, in which case no tree is returned:
|
||
</P>
|
||
<PRE>
|
||
> parse "dim the fan" | put_tree -transform=solve
|
||
no tree found
|
||
</PRE>
|
||
<P></P>
|
||
<P>
|
||
<B>Exercise</B>. Write an abstract syntax module with above contents
|
||
and an appropriate English concrete syntax. Try to parse the commands
|
||
<I>dim the light</I> and <I>dim the fan</I>, with and without <CODE>solve</CODE> filtering.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Perform random and exhaustive generation, with and without
|
||
<CODE>solve</CODE> filtering.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Add some device kinds and actions to the grammar.
|
||
</P>
|
||
<A NAME="toc106"></A>
|
||
<H2>Polymorphism</H2>
|
||
<P>
|
||
<a name="secpolymorphic"></a>
|
||
</P>
|
||
<P>
|
||
Sometimes an action can be performed on all kinds of devices. It would be
|
||
possible to introduce separate <CODE>fun</CODE> constants for each kind-action pair,
|
||
but this would be tedious. Instead, one can use <B>polymorphic</B> actions,
|
||
i.e. actions that take a <CODE>Kind</CODE> as an argument and produce an <CODE>Action</CODE>
|
||
for that <CODE>Kind</CODE>:
|
||
</P>
|
||
<PRE>
|
||
fun switchOn, switchOff : (k : Kind) -> Action k ;
|
||
</PRE>
|
||
<P>
|
||
Functions that are not polymorphic are <B>monomorphic</B>. However, the
|
||
dichotomy into monomorphism and full polymorphism is not always sufficient
|
||
for good semantic modelling: very typically, some actions are defined
|
||
for a proper subset of devices, but not just one. For instance, both doors and
|
||
windows can be opened, whereas lights cannot.
|
||
We will return to this problem by introducing the
|
||
concept of <B>restricted polymorphism</B> later,
|
||
after a section on proof objects.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. The grammar <CODE>ExtFoods</CODE> <a href="#secextended">here</a> permits the
|
||
formation of phrases such as <I>we drink this fish</I> and <I>we eat this wine</I>.
|
||
A way to prevent them is to distinguish between eatable and drinkable food items.
|
||
Another, related problem is that there is some duplicated code
|
||
due to a category distinction between guests and food items, for instance,
|
||
two constructors for the determiner <I>this</I>. This problem can also
|
||
be solved by dependent types. Rewrite the abstract syntax in <CODE>Foods</CODE> and
|
||
<CODE>ExtFoods</CODE> by using such a type system, and also update the concrete syntaxes.
|
||
If you do this right, you only have to change the functor modules
|
||
<CODE>FoodsI</CODE> and <CODE>ExtFoodsI</CODE> in the concrete syntax.
|
||
</P>
|
||
<A NAME="toc107"></A>
|
||
<H3>Digression: dependent types in concrete syntax</H3>
|
||
<P>
|
||
The <B>functional fragment</B> of GF
|
||
terms and types comprises function types, applications, lambda
|
||
abstracts, constants, and variables. This fragment is the same in
|
||
abstract and concrete syntax. In particular,
|
||
dependent types are also available in concrete syntax.
|
||
We have not made use of them yet,
|
||
but we will now look at one example of how they
|
||
can be used.
|
||
</P>
|
||
<P>
|
||
Those readers who are familiar with functional programming languages
|
||
like ML and Haskell, may already have missed <B>polymorphic</B>
|
||
functions. For instance, Haskell programmers have access to
|
||
the functions
|
||
</P>
|
||
<PRE>
|
||
const :: a -> b -> a
|
||
const c _ = c
|
||
|
||
flip :: (a -> b -> c) -> b -> a -> c
|
||
flip f y x = f x y
|
||
</PRE>
|
||
<P>
|
||
which can be used for any given types <CODE>a</CODE>,<CODE>b</CODE>, and <CODE>c</CODE>.
|
||
</P>
|
||
<P>
|
||
The GF counterpart of polymorphic functions are <B>monomorphic</B>
|
||
functions with explicit <B>type variables</B> --- a techniques that we already
|
||
used in abstract syntax for modelling actions that can be performed
|
||
on all kinds of devices. Thus the above definitions can be written
|
||
</P>
|
||
<PRE>
|
||
oper const :(a,b : Type) -> a -> b -> a =
|
||
\_,_,c,_ -> c ;
|
||
|
||
oper flip : (a,b,c : Type) -> (a -> b ->c) -> b -> a -> c =
|
||
\_,_,_,f,x,y -> f y x ;
|
||
</PRE>
|
||
<P>
|
||
When the operations are used, the type checker requires
|
||
them to be equipped with all their arguments; this may be a nuisance
|
||
for a Haskell or ML programmer. They have not been used very much,
|
||
except in the <CODE>Coordination</CODE> module of the resource library.
|
||
</P>
|
||
<A NAME="toc108"></A>
|
||
<H2>Proof objects</H2>
|
||
<P>
|
||
Perhaps the most well-known idea in constructive type theory is
|
||
the <B>Curry-Howard isomorphism</B>, also known as the
|
||
<B>propositions as types principle</B>. Its earliest formulations
|
||
were attempts to give semantics to the logical systems of
|
||
propositional and predicate calculus. In this section, we will consider
|
||
a more elementary example, showing how the notion of proof is useful
|
||
outside mathematics, as well.
|
||
</P>
|
||
<P>
|
||
We use the already shown category of unary (also known as Peano-style)
|
||
natural numbers:
|
||
</P>
|
||
<PRE>
|
||
cat Nat ;
|
||
fun Zero : Nat ;
|
||
fun Succ : Nat -> Nat ;
|
||
</PRE>
|
||
<P>
|
||
The <B>successor function</B> <CODE>Succ</CODE> generates an infinite
|
||
sequence of natural numbers, beginning from <CODE>Zero</CODE>.
|
||
</P>
|
||
<P>
|
||
We then define what it means for a number <I>x</I> to be <I>less than</I>
|
||
a number <I>y</I>. Our definition is based on two axioms:
|
||
</P>
|
||
<UL>
|
||
<LI><CODE>Zero</CODE> is less than <CODE>Succ</CODE> <I>y</I> for any <I>y</I>.
|
||
<LI>If <I>x</I> is less than <I>y</I>, then <CODE>Succ</CODE> <I>x</I> is less than <CODE>Succ</CODE> <I>y</I>.
|
||
</UL>
|
||
|
||
<P>
|
||
The most straightforward way of expressing these axioms in type theory
|
||
is with a dependent type <CODE>Less</CODE> <I>x y</I>, and two functions constructing
|
||
its objects:
|
||
</P>
|
||
<PRE>
|
||
cat Less Nat Nat ;
|
||
fun lessZ : (y : Nat) -> Less Zero (Succ y) ;
|
||
fun lessS : (x,y : Nat) -> Less x y -> Less (Succ x) (Succ y) ;
|
||
</PRE>
|
||
<P>
|
||
Objects formed by <CODE>lessZ</CODE> and <CODE>lessS</CODE> are
|
||
called <B>proof objects</B>: they establish the truth of certain
|
||
mathematical propositions.
|
||
For instance, the fact that 2 is less that
|
||
4 has the proof object
|
||
</P>
|
||
<PRE>
|
||
lessS (Succ Zero) (Succ (Succ (Succ Zero)))
|
||
(lessS Zero (Succ (Succ Zero)) (lessZ (Succ Zero)))
|
||
</PRE>
|
||
<P>
|
||
whose type is
|
||
</P>
|
||
<PRE>
|
||
Less (Succ (Succ Zero)) (Succ (Succ (Succ (Succ Zero))))
|
||
</PRE>
|
||
<P>
|
||
which is the formalization of the proposition that 2 is less than 4.
|
||
</P>
|
||
<P>
|
||
GF grammars can be used to provide a <B>semantic control</B> of
|
||
well-formedness of expressions. We have already seen examples of this:
|
||
the grammar of well-formed actions on household devices. By introducing proof objects
|
||
we have now added an even more powerful technique of expressing semantic conditions.
|
||
</P>
|
||
<P>
|
||
A simple example of the use of proof objects is the definition of
|
||
well-formed <I>time spans</I>: a time span is expected to be from an earlier to
|
||
a later time:
|
||
</P>
|
||
<PRE>
|
||
from 3 to 8
|
||
</PRE>
|
||
<P>
|
||
is thus well-formed, whereas
|
||
</P>
|
||
<PRE>
|
||
from 8 to 3
|
||
</PRE>
|
||
<P>
|
||
is not. The following rules for spans impose this condition
|
||
by using the <CODE>Less</CODE> predicate:
|
||
</P>
|
||
<PRE>
|
||
cat Span ;
|
||
fun span : (m,n : Nat) -> Less m n -> Span ;
|
||
</PRE>
|
||
<P></P>
|
||
<P>
|
||
<B>Exercise</B>. Write an abstract and concrete syntax with the
|
||
concepts of this section, and experiment with it in GF.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Define the notions of "even" and "odd" in terms
|
||
of proof objects. <B>Hint</B>. You need one function for proving
|
||
that 0 is even, and two other functions for propagating the
|
||
properties.
|
||
</P>
|
||
<A NAME="toc109"></A>
|
||
<H3>Proof-carrying documents</H3>
|
||
<P>
|
||
Another possible application of proof objects is <B>proof-carrying documents</B>:
|
||
to be semantically well-formed, the abstract syntax of a document must contain a proof
|
||
of some property, although the proof is not shown in the concrete document.
|
||
Think, for instance, of small documents describing flight connections:
|
||
</P>
|
||
<P>
|
||
<I>To fly from Gothenburg to Prague, first take LH3043 to Frankfurt, then OK0537 to Prague.</I>
|
||
</P>
|
||
<P>
|
||
The well-formedness of this text is partly expressible by dependent typing:
|
||
</P>
|
||
<PRE>
|
||
cat
|
||
City ;
|
||
Flight City City ;
|
||
fun
|
||
Gothenburg, Frankfurt, Prague : City ;
|
||
LH3043 : Flight Gothenburg Frankfurt ;
|
||
OK0537 : Flight Frankfurt Prague ;
|
||
</PRE>
|
||
<P>
|
||
This rules out texts saying <I>take OK0537 from Gothenburg to Prague</I>.
|
||
However, there is a
|
||
further condition saying that it must be possible to
|
||
change from LH3043 to OK0537 in Frankfurt.
|
||
This can be modelled as a proof object of a suitable type,
|
||
which is required by the constructor
|
||
that connects flights.
|
||
</P>
|
||
<PRE>
|
||
cat
|
||
IsPossible (x,y,z : City)(Flight x y)(Flight y z) ;
|
||
fun
|
||
Connect : (x,y,z : City) ->
|
||
(u : Flight x y) -> (v : Flight y z) ->
|
||
IsPossible x y z u v -> Flight x z ;
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc110"></A>
|
||
<H2>Restricted polymorphism</H2>
|
||
<P>
|
||
In the first version of the smart house grammar <CODE>Smart</CODE>,
|
||
all Actions were either of
|
||
</P>
|
||
<UL>
|
||
<LI><B>monomorphic</B>: defined for one Kind
|
||
<LI><B>polymorphic</B>: defined for all Kinds
|
||
</UL>
|
||
|
||
<P>
|
||
To make this scale up for new Kinds, we can refine this to
|
||
<B>restricted polymorphism</B>: defined for Kinds of a certain <B>class</B>
|
||
</P>
|
||
<P>
|
||
The notion of class can be expressed in abstract syntax
|
||
by using the Curry-Howard isomorphism as follows:
|
||
</P>
|
||
<UL>
|
||
<LI>a class is a <B>predicate</B> of Kinds --- i.e. a type depending of Kinds
|
||
<LI>a Kind is in a class if there is a proof object of this type
|
||
</UL>
|
||
|
||
<P>
|
||
Here is an example with switching and dimming. The classes are called
|
||
<CODE>switchable</CODE> and <CODE>dimmable</CODE>.
|
||
</P>
|
||
<PRE>
|
||
cat
|
||
Switchable Kind ;
|
||
Dimmable Kind ;
|
||
fun
|
||
switchable_light : Switchable light ;
|
||
switchable_fan : Switchable fan ;
|
||
dimmable_light : Dimmable light ;
|
||
|
||
switchOn : (k : Kind) -> Switchable k -> Action k ;
|
||
dim : (k : Kind) -> Dimmable k -> Action k ;
|
||
</PRE>
|
||
<P>
|
||
One advantage of this formalization is that classes for new
|
||
actions can be added incrementally.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Write a new version of the <CODE>Smart</CODE> grammar with
|
||
classes, and test it in GF.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Add some actions, kinds, and classes to the grammar.
|
||
Try to port the grammar to a new language. You will probably find
|
||
out that restricted polymorphism works differently in different languages.
|
||
For instance, in Finnish not only doors but also TVs and radios
|
||
can be "opened", which means switching them on.
|
||
</P>
|
||
<A NAME="toc111"></A>
|
||
<H2>Variable bindings</H2>
|
||
<P>
|
||
<a name="secbinding"></a>
|
||
</P>
|
||
<P>
|
||
Mathematical notation and programming languages have
|
||
expressions that <B>bind</B> variables. For instance,
|
||
a universally quantifier proposition
|
||
</P>
|
||
<PRE>
|
||
(All x)B(x)
|
||
</PRE>
|
||
<P>
|
||
consists of the <B>binding</B> <CODE>(All x)</CODE> of the variable <CODE>x</CODE>,
|
||
and the <B>body</B> <CODE>B(x)</CODE>, where the variable <CODE>x</CODE> can have
|
||
<B>bound occurrences</B>.
|
||
</P>
|
||
<P>
|
||
Variable bindings appear in informal mathematical language as well, for
|
||
instance,
|
||
</P>
|
||
<PRE>
|
||
for all x, x is equal to x
|
||
|
||
the function that for any numbers x and y returns the maximum of x+y
|
||
and x*y
|
||
|
||
Let x be a natural number. Assume that x is even. Then x + 3 is odd.
|
||
</PRE>
|
||
<P>
|
||
In type theory, variable-binding expression forms can be formalized
|
||
as functions that take functions as arguments. The universal
|
||
quantifier is defined
|
||
</P>
|
||
<PRE>
|
||
fun All : (Ind -> Prop) -> Prop
|
||
</PRE>
|
||
<P>
|
||
where <CODE>Ind</CODE> is the type of individuals and <CODE>Prop</CODE>,
|
||
the type of propositions. If we have, for instance, the equality predicate
|
||
</P>
|
||
<PRE>
|
||
fun Eq : Ind -> Ind -> Prop
|
||
</PRE>
|
||
<P>
|
||
we may form the tree
|
||
</P>
|
||
<PRE>
|
||
All (\x -> Eq x x)
|
||
</PRE>
|
||
<P>
|
||
which corresponds to the ordinary notation
|
||
</P>
|
||
<PRE>
|
||
(All x)(x = x).
|
||
</PRE>
|
||
<P>
|
||
An abstract syntax where trees have functions as arguments, as in
|
||
the two examples above, has turned out to be precisely the right
|
||
thing for the semantics and computer implementation of
|
||
variable-binding expressions. The advantage lies in the fact that
|
||
only one variable-binding expression form is needed, the lambda abstract
|
||
<CODE>\x -> b</CODE>, and all other bindings can be reduced to it.
|
||
This makes it easier to implement mathematical theories and reason
|
||
about them, since variable binding is tricky to implement and
|
||
to reason about. The idea of using functions as arguments of
|
||
syntactic constructors is known as <B>higher-order abstract syntax</B>.
|
||
</P>
|
||
<P>
|
||
The question now arises: how to define linearization rules
|
||
for variable-binding expressions?
|
||
Let us first consider universal quantification,
|
||
</P>
|
||
<PRE>
|
||
fun All : (Ind -> Prop) -> Prop
|
||
</PRE>
|
||
<P>
|
||
In GF, we write
|
||
</P>
|
||
<PRE>
|
||
lin All B = {s = "(" ++ "All" ++ B.$0 ++ ")" ++ B.s}
|
||
</PRE>
|
||
<P>
|
||
to obtain the form shown above.
|
||
This linearization rule brings in a new GF concept --- the <CODE>$0</CODE>
|
||
field of <CODE>B</CODE> containing a bound variable symbol.
|
||
The general rule is that, if an argument type of a function is
|
||
itself a function type <CODE>A -> C</CODE>, the linearization type of
|
||
this argument is the linearization type of <CODE>C</CODE>
|
||
together with a new field <CODE>$0 : Str</CODE>. In the linearization rule
|
||
for <CODE>All</CODE>, the argument <CODE>B</CODE> thus has the linearization
|
||
type
|
||
</P>
|
||
<PRE>
|
||
{$0 : Str ; s : Str},
|
||
</PRE>
|
||
<P>
|
||
since the linearization type of <CODE>Prop</CODE> is
|
||
</P>
|
||
<PRE>
|
||
{s : Str}
|
||
</PRE>
|
||
<P>
|
||
In other words, the linearization of a function
|
||
consists of a linearization of the body together with a
|
||
field for a linearization of the bound variable.
|
||
Those familiar with type theory or lambda calculus
|
||
should notice that GF requires trees to be in
|
||
<B>eta-expanded</B> form in order for this to make sense:
|
||
for any function of type
|
||
</P>
|
||
<PRE>
|
||
A -> B
|
||
</PRE>
|
||
<P>
|
||
an eta-expanded syntax tree has the form
|
||
</P>
|
||
<PRE>
|
||
\x -> b
|
||
</PRE>
|
||
<P>
|
||
where <CODE>b : B</CODE> under the assumption <CODE>x : A</CODE>.
|
||
It is in this form that an expression can be analysed
|
||
as having a bound variable and a body, which can be put into
|
||
a linearization record.
|
||
</P>
|
||
<P>
|
||
Given the linearization rule
|
||
</P>
|
||
<PRE>
|
||
lin Eq a b = {s = "(" ++ a.s ++ "=" ++ b.s ++ ")"}
|
||
</PRE>
|
||
<P>
|
||
the linearization of
|
||
</P>
|
||
<PRE>
|
||
\x -> Eq x x
|
||
</PRE>
|
||
<P>
|
||
is the record
|
||
</P>
|
||
<PRE>
|
||
{$0 = "x", s = ["( x = x )"]}
|
||
</PRE>
|
||
<P>
|
||
Thus we can compute the linearization of the formula,
|
||
</P>
|
||
<PRE>
|
||
All (\x -> Eq x x) --> {s = "[( All x ) ( x = x )]"}.
|
||
</PRE>
|
||
<P>
|
||
But how did we get the linearization of the variable <CODE>x</CODE>
|
||
into the string <CODE>"x"</CODE>? GF grammars have no rules for
|
||
this: it is just hard-wired in GF that variable symbols are
|
||
linearized into the same strings that represent them in
|
||
the print-out of the abstract syntax.
|
||
</P>
|
||
<P>
|
||
To be able to <I>parse</I> variable symbols, however, GF needs to know what
|
||
to look for (instead of e.g. trying to parse <I>any</I>
|
||
string as a variable). What strings are parsed as variable symbols
|
||
is defined in the lexical analysis part of GF parsing
|
||
</P>
|
||
<PRE>
|
||
> p -cat=Prop -lexer=codevars "(All x)(x = x)"
|
||
All (\x -> Eq x x)
|
||
</PRE>
|
||
<P>
|
||
(see more details on lexers <a href="#seclexing">here</a>). If several variables are bound in the
|
||
same argument, the labels are <CODE>$0, $1, $2</CODE>, etc.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Write an abstract syntax of the whole
|
||
<B>predicate calculus</B>, with the
|
||
<B>connectives</B> "and", "or", "implies", and "not", and the
|
||
<B>quantifiers</B> "exists" and "for all". Use higher-order functions
|
||
to guarantee that unbounded variables do not occur.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Write a concrete syntax for your favourite
|
||
notation of predicate calculus. Use Latex as target language
|
||
if you want nice output. You can also try producing boolean
|
||
expressions of some programming language. Use as many parenthesis as you need to
|
||
guarantee non-ambiguity.
|
||
</P>
|
||
<A NAME="toc112"></A>
|
||
<H2>Semantic definitions</H2>
|
||
<P>
|
||
<a name="secdefdef"></a>
|
||
</P>
|
||
<P>
|
||
Just like any functional programming language, abstract syntax in
|
||
GF has declarations of functions, telling what the type of a function is.
|
||
But we have not yet shown how to <B>compute</B>
|
||
these functions: all we can do is provide them with arguments
|
||
and linearize the resulting terms.
|
||
Since our main interest is the well-formedness of expressions,
|
||
this has not yet bothered
|
||
us very much. As we will see, however, computation does play a role
|
||
even in the well-formedness of expressions when dependent types are
|
||
present.
|
||
</P>
|
||
<P>
|
||
GF has a form of judgement for <B>semantic definitions</B>,
|
||
marked by the key word <CODE>def</CODE>. At its simplest, it is just
|
||
the definition of one constant, e.g.
|
||
</P>
|
||
<PRE>
|
||
fun one : Nat ;
|
||
def one = Succ Zero ;
|
||
</PRE>
|
||
<P>
|
||
Notice a <CODE>def</CODE> definition can only be given to names declared by
|
||
<CODE>fun</CODE> judgements in the same module; it is not possible to define
|
||
an inherited name.
|
||
</P>
|
||
<P>
|
||
We can also define a function with arguments,
|
||
</P>
|
||
<PRE>
|
||
fun twice : Nat -> Nat ;
|
||
def twice x = plus x x ;
|
||
</PRE>
|
||
<P>
|
||
which is still a special case of the most general notion of
|
||
definition, that of a group of <B>pattern equations</B>:
|
||
</P>
|
||
<PRE>
|
||
fun plus : Nat -> Nat -> Nat ;
|
||
def
|
||
plus x Zero = x ;
|
||
plus x (Succ y) = Succ (Sum x y) ;
|
||
</PRE>
|
||
<P>
|
||
To compute a term is, as in functional programming languages,
|
||
simply to follow a chain of reductions until no definition
|
||
can be applied. For instance, we compute
|
||
</P>
|
||
<PRE>
|
||
Sum one one -->
|
||
Sum (Succ Zero) (Succ Zero) -->
|
||
Succ (sum (Succ Zero) Zero) -->
|
||
Succ (Succ Zero)
|
||
</PRE>
|
||
<P>
|
||
Computation in GF is performed with the <CODE>pt</CODE> command and the
|
||
<CODE>compute</CODE> transformation, e.g.
|
||
</P>
|
||
<PRE>
|
||
> p -tr "1 + 1" | pt -transform=compute -tr | l
|
||
sum one one
|
||
Succ (Succ Zero)
|
||
s(s(0))
|
||
</PRE>
|
||
<P></P>
|
||
<P>
|
||
The <CODE>def</CODE> definitions of a grammar induce a notion of
|
||
<B>definitional equality</B> among trees: two trees are
|
||
definitionally equal if they compute into the same tree.
|
||
Thus, trivially, all trees in a chain of computation
|
||
(such as the one above) are definitionally equal to each other.
|
||
In general, there can be infinitely many definitionally equal trees.
|
||
</P>
|
||
<P>
|
||
An important property of definitional equality is that it is
|
||
<B>extensional</B>, i.e. has to do with the sameness of semantic value.
|
||
Linearization, on the other hand, is an <B>intensional</B> operation,
|
||
i.e. has to do with the sameness of expression. This means that
|
||
<CODE>def</CODE> definitions are <I>not</I> evaluated as linearization steps.
|
||
Intensionality is a crucial property of linearization, since we want
|
||
to use it for things like tracing a chain of evaluation.
|
||
For instance, each of the steps of the computation above
|
||
has a different linearization into standard arithmetic notation:
|
||
</P>
|
||
<PRE>
|
||
1 + 1
|
||
s(0) + s(0)
|
||
s(s(0) + 0)
|
||
s(s(0))
|
||
</PRE>
|
||
<P>
|
||
In most programming languages, the operations that can be performed on
|
||
expressions are extensional, i.e. give equal values to equal arguments.
|
||
But GF has both extensional and intensional operations.
|
||
Type checking is extensional:
|
||
in the type theory with dependent types, types may depend on terms,
|
||
and types depending on definitionally equal terms are
|
||
equal types. For instance,
|
||
</P>
|
||
<PRE>
|
||
Less Zero one
|
||
Less Zero (Succ Zero))
|
||
</PRE>
|
||
<P>
|
||
are equal types. Hence, any tree that type checks as a proof that
|
||
1 is odd also type checks as a proof that the successor of 0 is odd.
|
||
(Recall, in this connection, that the
|
||
arguments a category depends on never play any role
|
||
in the linearization of trees of that category,
|
||
nor in the definition of the linearization type.)
|
||
</P>
|
||
<P>
|
||
When pattern matching is performed with <CODE>def</CODE> equations, it is
|
||
crucial to distinguish between <B>constructors</B> and other functions
|
||
(cf. <a href="#secmatching">here</a> on pattern matching in concrete syntax).
|
||
GF has a judgement form <CODE>data</CODE> to tell that a category has
|
||
certain functions as constructors:
|
||
</P>
|
||
<PRE>
|
||
data Nat = Succ | Zero ;
|
||
</PRE>
|
||
<P>
|
||
Unlike in Haskell and ML, new constructors can be added to
|
||
a type with new <CODE>data</CODE> judgements. The type signatures of constructors
|
||
are given separately, in ordinary <CODE>fun</CODE> judgements.
|
||
One can also write directly
|
||
</P>
|
||
<PRE>
|
||
data Succ : Nat -> Nat ;
|
||
</PRE>
|
||
<P>
|
||
which is syntactic sugar for the pair of judgements
|
||
</P>
|
||
<PRE>
|
||
fun Succ : Nat -> Nat ;
|
||
data Nat = Succ ;
|
||
</PRE>
|
||
<P>
|
||
If we did not mark <CODE>Zero</CODE> as <CODE>data</CODE>, the definition
|
||
</P>
|
||
<PRE>
|
||
fun isZero : Nat -> Bool ;
|
||
def isZero Zero = True ;
|
||
def isZero _ = False ;
|
||
</PRE>
|
||
<P>
|
||
would return <CODE>True</CODE> for all arguments, because the pattern <CODE>Zero</CODE>
|
||
would be treated as a variable and it would hence match all values!
|
||
This is a common pitfall in GF.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Implement an interpreter of a small functional programming
|
||
language with natural numbers, lists, pairs, lambdas, etc. Use higher-order
|
||
abstract syntax with semantic definitions. As onject language, use
|
||
your favourite programming language.
|
||
</P>
|
||
<A NAME="toc113"></A>
|
||
<H2>Summary of GF language features</H2>
|
||
<A NAME="toc114"></A>
|
||
<H3>Judgements</H3>
|
||
<P>
|
||
We have generalized the <CODE>cat</CODE> judgement form and introduced two new forms
|
||
for abstract syntax:
|
||
</P>
|
||
<TABLE ALIGN="center" CELLPADDING="4" BORDER="1">
|
||
<TR>
|
||
<TH>form</TH>
|
||
<TH COLSPAN="2">reading</TH>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>cat</CODE> <I>C</I> <I>G</I></TD>
|
||
<TD><I>C</I> is a category in context <I>G</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>def</CODE> <I>f</I> <I>P1</I> ... <I>Pn</I> <CODE>=</CODE> t</TD>
|
||
<TD>function <I>f</I> applied to <I>P1</I>...<I>Pn</I> has value <I>t</I></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>data</CODE> <I>C</I> <CODE>=</CODE> <I>C1</I> <CODE>|</CODE> ... <CODE>|</CODE> <I>Cn</I></TD>
|
||
<TD>category <I>C</I> has constructors <I>C1</I>...<I>Cn</I></TD>
|
||
</TR>
|
||
</TABLE>
|
||
|
||
<P></P>
|
||
<P>
|
||
The <B>context</B> in the <CODE>cat</CODE> judgement has the form
|
||
</P>
|
||
<PRE>
|
||
(x1 : T1) ... (xn : Tn)
|
||
</PRE>
|
||
<P>
|
||
where the types <I>T1 ... Tn</I> may be increasingly dependent. To form a
|
||
type, <I>C</I> must be equipped with arguments of each type in the
|
||
context, satisfying the dependencies. As syntactic sugar, we have
|
||
</P>
|
||
<PRE>
|
||
T G === (x : T) G
|
||
</PRE>
|
||
<P>
|
||
if <I>x</I> does not occur in <I>G</I>. The linearization type definition of a
|
||
category does not mention the context.
|
||
</P>
|
||
<P>
|
||
In <CODE>def</CODE> judgements, the arguments <I>P1</I>...<I>Pn</I> can be constructor and
|
||
variable patterns as well as wild cards, and the binding and
|
||
evaluation rules are the same as <a href="#secmatching">here</a>.
|
||
</P>
|
||
<P>
|
||
A <CODE>data</CODE> judgement states that the names on the right-hand side are constructors
|
||
of the category on the left-hand side. The precise types of the constructors are
|
||
given in the <CODE>fun</CODE> judgements introducing them; the value type of a constructor
|
||
of <I>C</I> must be of the form <I>C a1 ... am</I>. As syntactic sugar,
|
||
</P>
|
||
<PRE>
|
||
data f : A1 ... An -> C a1 ... am ===
|
||
fun f : A1 ... An -> C a1 ... am ; data C = f ;
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc115"></A>
|
||
<H3>Dependent function types</H3>
|
||
<P>
|
||
A <B>dependent function type</B> has the form
|
||
</P>
|
||
<PRE>
|
||
(x : A) -> B
|
||
</PRE>
|
||
<P>
|
||
where <I>B</I> depends on a variable <I>x</I> of type <I>A</I>. We have the
|
||
following syntactic sugar:
|
||
</P>
|
||
<PRE>
|
||
(x,y : A) -> B === (x : A) -> (y : A) -> B
|
||
|
||
(_ : A) -> B === (x : A) -> B if B does not depend on x
|
||
|
||
A -> B === (_ : A) -> B
|
||
</PRE>
|
||
<P>
|
||
A <CODE>fun</CODE> function in abstract syntax may have function types as
|
||
argument types. This is called <B>higher-order abstract syntax</B>.
|
||
The linearization of an argument
|
||
</P>
|
||
<PRE>
|
||
\z0, ..., zn -> b : (x0 : A1) -> ... -> (xn : An) -> B
|
||
</PRE>
|
||
<P>
|
||
if formed from the linearization of <I>b*</I> of <I>b</I> by adding
|
||
fields that hold the variable symbols:
|
||
</P>
|
||
<PRE>
|
||
b* ** {$0 = "z0" ; ... ; $n = "zn"}
|
||
</PRE>
|
||
<P>
|
||
If an argument function is itself a higher-order function, its
|
||
bound variables cannot be reached in linearization. Thus, in a sense,
|
||
the higher-order abstract syntax of GF is just <B>second-orde abstract syntax</B>.
|
||
</P>
|
||
<P>
|
||
A <B>syntax tree</B> is a well-typed term in <B>beta-eta normal form</B>, which
|
||
means that
|
||
</P>
|
||
<UL>
|
||
<LI>its type is a basic type, i.e. it is not a partial application;
|
||
<LI>its arguments are in eta normal form, i.e. either full applications or
|
||
lambda abstractions with bodies that are full applications;
|
||
<LI>it has no beta redexes, i.e. applications of abstractions.
|
||
</UL>
|
||
|
||
<P>
|
||
Terms that are not in this form may occur as arguments of dependent types
|
||
and in <CODE>def</CODE> judgements, but they cannot be linearized.
|
||
</P>
|
||
<A NAME="toc116"></A>
|
||
<H1>Grammars of formal languages</H1>
|
||
<P>
|
||
<a name="chapseven"></a>
|
||
</P>
|
||
<P>
|
||
In this chapter, we will write a grammar for arithmetic expressions as known
|
||
from school mathematics and many programming languages. We will see how to
|
||
define precedences in GF, how to include built-in integers in grammars, and
|
||
how to deal with spaces between tokens in desired ways. As an alternative concrete
|
||
syntax, we will generate code for a JVM-like stack machine. We will conclude
|
||
by extending the language with variable declarations and assignments, which
|
||
are handled in a type-safe way by using higher-order abstract syntax.
|
||
</P>
|
||
<P>
|
||
To write grammars for formal languages is usually less challenging than for
|
||
natural languages. There are standard tools for this task, such as the YACC
|
||
family of parser generators. Using GF would be overkill for many projects,
|
||
and come with a penalty in efficiency. However, it is still worth while to
|
||
look at this task. A typical application of GF are natural-language interfaces
|
||
to formal systems: in such applications, the translation between natural and
|
||
formal language can be defined as a multilingual grammar. The use of higher-order
|
||
abstract syntax, together with dependent types, provides a way to define a
|
||
complete compiler in GF.
|
||
</P>
|
||
<A NAME="toc117"></A>
|
||
<H2>Arithmetic expressions</H2>
|
||
<A NAME="toc118"></A>
|
||
<H3>Abstract syntax</H3>
|
||
<P>
|
||
We want to write a grammar for what is usually called <B>expressions</B>
|
||
in programming languages. The expressions are built from integers by
|
||
the binary operations of addition, subtraction, multiplication, and
|
||
division. The abstract syntax is easy to write. We call it <CODE>Calculator</CODE>,
|
||
since it can be used as the basis of a calculator.
|
||
</P>
|
||
<PRE>
|
||
abstract Calculator = {
|
||
|
||
cat Exp ;
|
||
|
||
fun
|
||
EPlus, EMinus, ETimes, EDiv : Exp -> Exp -> Exp ;
|
||
EInt : Int -> Exp ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
Notice the use of the category <CODE>Int</CODE>. It is a built-in category of
|
||
integers. Its syntax trees are denoted by <B>integer literals</B>, which are
|
||
sequences of digits. For instance,
|
||
</P>
|
||
<PRE>
|
||
5457455814608954681 : Int
|
||
</PRE>
|
||
<P>
|
||
These are the only objects of type <CODE>Int</CODE>:
|
||
grammars are not allowed to declare functions with <CODE>Int</CODE> as value type.
|
||
</P>
|
||
<A NAME="toc119"></A>
|
||
<H3>Concrete syntax: a simple approach</H3>
|
||
<P>
|
||
Arithmetic expressions should be unambiguous. If we write
|
||
</P>
|
||
<PRE>
|
||
2 + 3 * 4
|
||
</PRE>
|
||
<P>
|
||
it should be parsed as one, but not both, of
|
||
</P>
|
||
<PRE>
|
||
EPlus (EInt 2) (ETimes (EInt 3) (EInt 4))
|
||
ETimes (EPlus (EInt 2) (EInt 3)) (EInt 4)
|
||
</PRE>
|
||
<P>
|
||
Under normal conventions, the former is chosen, because
|
||
multiplication has <B>higher precedence</B> than addition.
|
||
If we want to express the latter tree, we have to use
|
||
parentheses:
|
||
</P>
|
||
<PRE>
|
||
(2 + 3) * 4
|
||
</PRE>
|
||
<P>
|
||
However, it is not completely trivial to decide when to use
|
||
parentheses and when not. We will therefore begin with a
|
||
concrete syntax that always uses parentheses around binary
|
||
operator applications.
|
||
</P>
|
||
<PRE>
|
||
concrete CalculatorP of Calculator = {
|
||
|
||
lincat
|
||
Exp = SS ;
|
||
lin
|
||
EPlus = infix "+" ;
|
||
EMinus = infix "-" ;
|
||
ETimes = infix "*" ;
|
||
EDiv = infix "/" ;
|
||
EInt i = i ;
|
||
|
||
oper
|
||
infix : Str -> SS -> SS -> SS = \f,x,y ->
|
||
ss ("(" ++ x.s ++ f ++ y.s ++ ")") ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
Now we will obtain
|
||
</P>
|
||
<PRE>
|
||
> linearize EPlus (EInt 2) (ETimes (EInt 3) (EInt 4))
|
||
( 2 + ( 3 * 4 ) )
|
||
</PRE>
|
||
<P>
|
||
The first problem, even more urgent than superfluous parentheses, is
|
||
to get rid of superfluous spaces and to recognize integer literals
|
||
in the parser.
|
||
</P>
|
||
<A NAME="toc120"></A>
|
||
<H2>Lexing and unlexing</H2>
|
||
<P>
|
||
<a name="seclexing"></a>
|
||
</P>
|
||
<P>
|
||
The input of parsing in GF is not just a string, but a list of
|
||
<B>tokens</B>. By default, a list of tokens is obtained from a string
|
||
by analysing it into <B>words</B>, which means chunks separated by
|
||
spaces. Thus for instance
|
||
</P>
|
||
<PRE>
|
||
"(12 + (3 * 4))"
|
||
</PRE>
|
||
<P>
|
||
is split into the tokens
|
||
</P>
|
||
<PRE>
|
||
"(12", "+", "(3". "*". "4))"
|
||
</PRE>
|
||
<P>
|
||
The parser then tries to find each of these tokens among the terminals
|
||
of the grammar, i.e. among the strings that can appear in linearizations.
|
||
In our example, only the tokens <CODE>"+"</CODE> and <CODE>"*"</CODE> can be found, and
|
||
parsing therefore fails.
|
||
</P>
|
||
<P>
|
||
The proper way to split the above string into tokens would be
|
||
</P>
|
||
<PRE>
|
||
"(", "12", "+", "(", "3", "*", "4", ")", ")"
|
||
</PRE>
|
||
<P>
|
||
Moreover, the tokens <CODE>"12"</CODE>, <CODE>"3"</CODE>, and <CODE>"4"</CODE> should not be sought
|
||
among the terminals in the grammar, but treated as integer tokens, which
|
||
are defined outside the grammar. Since GF aims to be fully general, such
|
||
conventions are not built in: it must be possible for a grammar to have
|
||
tokens such as <CODE>"12"</CODE> and <CODE>"12)"</CODE>. Therefore, GF has a way to select
|
||
a <B>lexer</B>, a function that splits strings into tokens and classifies
|
||
them into terminals, literalts, etc.
|
||
</P>
|
||
<P>
|
||
A lexer can be given as a flag to the parsing command:
|
||
</P>
|
||
<PRE>
|
||
> parse -cat=Exp -lexer=codelit "(2 + (3 * 4))"
|
||
EPlus (EInt 2) (ETimes (EInt 3) (EInt 4))
|
||
</PRE>
|
||
<P>
|
||
Since the lexer is usually a part of the language specification, it
|
||
makes sense to put it in the concrete syntax by using the judgement
|
||
</P>
|
||
<PRE>
|
||
flags lexer = codelit ;
|
||
</PRE>
|
||
<P>
|
||
The problem of getting correct spacing after linearization is likewise solved
|
||
by an <B>unlexer</B>:
|
||
</P>
|
||
<PRE>
|
||
> l -unlexer=code EPlus (EInt 2) (ETimes (EInt 3) (EInt 4))
|
||
(2 + (3 * 4))
|
||
</PRE>
|
||
<P>
|
||
Also this flag is usually put into the concrete syntax file.
|
||
</P>
|
||
<P>
|
||
The lexers and unlexers that are available in the GF system can be
|
||
seen by
|
||
</P>
|
||
<PRE>
|
||
> help -lexer
|
||
> help -unlexer
|
||
</PRE>
|
||
<P>
|
||
A table of the most common lexers and unlexers is given in the Summary
|
||
section 7.8.
|
||
</P>
|
||
<A NAME="toc121"></A>
|
||
<H2>Precedence and fixity</H2>
|
||
<P>
|
||
<a name="secprecedence"></a>
|
||
</P>
|
||
<P>
|
||
Here is a summary of the usual
|
||
precedence rules in mathematics and programming languages:
|
||
</P>
|
||
<UL>
|
||
<LI>Integer constants and expressions in parentheses have the highest precedence.
|
||
<LI>Multiplication and division have equal precedence, lower than the highest
|
||
but higher than addition and subtraction, which are again equal.
|
||
<LI>All the four binary operations are <B>left-associative</B>, which means that
|
||
e.g. <CODE>1 + 2 + 3</CODE> means the same as <CODE>(1 + 2) + 3</CODE>.
|
||
</UL>
|
||
|
||
<P>
|
||
One way of dealing with precedences in compiler books is by dividing expressions
|
||
into three categories:
|
||
</P>
|
||
<UL>
|
||
<LI>expressions: addition and subtraction
|
||
<LI>terms: multiplication and division
|
||
<LI>factors: constants and expressions in parentheses
|
||
</UL>
|
||
|
||
<P>
|
||
The context-free grammar, also taking care of associativity, is the following:
|
||
</P>
|
||
<PRE>
|
||
Exp ::= Exp "+" Term | Exp "-" Term | Term ;
|
||
Term ::= Term "*" Fact | Term "/" Fact | Fact ;
|
||
Fact ::= Int | "(" Exp ")" ;
|
||
</PRE>
|
||
<P>
|
||
A compiler, however, does not want to make a semantic distinction between the
|
||
three categories. Nor does it want to build syntax trees with the
|
||
<B>coercions</B> that enable the use of a higher level expressions on a lower, and
|
||
encode the use of parentheses. In compiler tools such as YACC, building abstract
|
||
syntax trees is performed as a <B>semantic action</B>. For instance, if the parser
|
||
recognizes an expression in parentheses, the action is to return only the
|
||
expression, without encoding the parentheses.
|
||
</P>
|
||
<P>
|
||
In GF, semantic actions could be encoded by using <CODE>def</CODE> definitions introduced
|
||
<a href="#secdefdef">here</a>. But there is a more straightforward way of thinking about
|
||
precedences: we introduce a parameter for precedence, and treat it as
|
||
an inherent feature of expressions:
|
||
</P>
|
||
<PRE>
|
||
oper
|
||
param Prec = Ints 2 ;
|
||
TermPrec : Type = {s : Str ; p : Prec} ;
|
||
|
||
mkPrec : Prec -> Str -> TermPrec = \p,s -> {s = s ; p = p} ;
|
||
|
||
lincat
|
||
Exp = TermPrec ;
|
||
</PRE>
|
||
<P>
|
||
This example shows another way to use built-in integers in GF:
|
||
the type <CODE>Ints 2</CODE> is a parameter type, whose values are the integers
|
||
<CODE>0,1,2</CODE>. These are the three precedence levels we need. The main idea
|
||
is to compare the inherent precedence of an expression with the context
|
||
in which it is used. If the precedence is higher than or equal to
|
||
the expected, then
|
||
no parentheses are needed. Otherwise they are. We encode this rule in
|
||
the operation
|
||
</P>
|
||
<PRE>
|
||
oper usePrec : TermPrec -> Prec -> Str = \x,p ->
|
||
case lessPrec x.p p of {
|
||
True => "(" x.s ")" ;
|
||
False => x.s
|
||
} ;
|
||
</PRE>
|
||
<P>
|
||
With this operation, we can build another one, that can be used for
|
||
defining left-associative infix expressions:
|
||
</P>
|
||
<PRE>
|
||
infixl : Prec -> Str -> (_,_ : TermPrec) -> TermPrec = \p,f,x,y ->
|
||
mkPrec p (usePrec x p ++ f ++ usePrec y (nextPrec p)) ;
|
||
</PRE>
|
||
<P>
|
||
Constant-like expressions (the highest level) can be built simply by
|
||
</P>
|
||
<PRE>
|
||
constant : Str -> TermPrec = mkPrec 2 ;
|
||
</PRE>
|
||
<P>
|
||
All these operations can be found in the library module <CODE>lib/prelude/Formal</CODE>,
|
||
so we don't have to define them in our own code. Also the auxiliary operations
|
||
<CODE>nextPrec</CODE> and <CODE>lessPrec</CODE> used in their definitions are defined there.
|
||
The library has 5 levels instead of 3.
|
||
</P>
|
||
<P>
|
||
Now we can express the whole concrete syntax of <CODE>Calculator</CODE> compactly:
|
||
</P>
|
||
<PRE>
|
||
concrete CalculatorC of Calculator = open Formal, Prelude in {
|
||
|
||
flags lexer = codelit ; unlexer = code ; startcat = Exp ;
|
||
|
||
lincat Exp = TermPrec ;
|
||
|
||
lin
|
||
EPlus = infixl 0 "+" ;
|
||
EMinus = infixl 0 "-" ;
|
||
ETimes = infixl 1 "*" ;
|
||
EDiv = infixl 1 "/" ;
|
||
EInt i = constant i.s ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
Let us just take one more look at the operation <CODE>usePrec</CODE>, which decides whether
|
||
to put parentheses around a term or not. The case where parentheses are not needed
|
||
around a string was defined as the string itself.
|
||
However, this would imply that superfluous parentheses
|
||
are never correct. A more liberal grammar is obtained by using the operation
|
||
</P>
|
||
<PRE>
|
||
parenthOpt : Str -> Str = \s -> variants {s ; "(" ++ s ++ ")"} ;
|
||
</PRE>
|
||
<P>
|
||
which is actually used in the <CODE>Formal</CODE> library.
|
||
But even in this way, we can only allow one pair of superfluous parentheses.
|
||
Thus the parameter-based grammar has not quite reached the goal
|
||
of implementing the same language as the expression-term-factor grammar.
|
||
But it has the advantage of eliminating precedence distinctions from the
|
||
abstract syntax.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Define non-associative and right-associative infix operations
|
||
analogous to <CODE>infixl</CODE>.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Add a constructor that puts parentheses around expressions
|
||
to raise their precedence, but that is eliminated by a <CODE>def</CODE> definition.
|
||
Test parsing with and without a pipe to <CODE>pt -transform=compute</CODE>.
|
||
</P>
|
||
<A NAME="toc122"></A>
|
||
<H2>Code generation as linearization</H2>
|
||
<P>
|
||
The classical use of grammars of programming languages is in <B>compilers</B>,
|
||
which translate one language into another. Typically the source language of
|
||
a compiler is a high-level language and the target language is a machine
|
||
language. The hub of a compiler is abstract syntax: the <B>front end</B> of
|
||
the compiler parses source language strings into abstract syntax trees, and
|
||
the <B>back end</B> linearizes these trees into the target language. This processing
|
||
model is of course what GF uses for natural language translation; the main
|
||
difference is that, in GF, the compiler could run in the opposite direction as
|
||
well, that is, function as a <B>decompiler</B>. (In full-size compilers, the
|
||
abstract syntax is also transformed by several layers of semantic analysis
|
||
and optimizations, before the target code is generated; this can destroy
|
||
reversibility and hence decompilation.)
|
||
</P>
|
||
<P>
|
||
More for the sake of illustration
|
||
than as a serious compiler, let us write a concrete
|
||
syntax of <CODE>Calculator</CODE> that generates machine code similar to JVM (Java Virtual
|
||
Machine). JVM is a so-called <B>stack machine</B>, whose code follows the
|
||
<B>postfix</B> notation, also known as <B>reverse Polish</B> notation. Thus the
|
||
expression
|
||
</P>
|
||
<PRE>
|
||
2 + 3 * 4
|
||
</PRE>
|
||
<P>
|
||
is translated to
|
||
</P>
|
||
<PRE>
|
||
iconst 2 : iconst 3 ; iconst 4 ; imul ; iadd
|
||
</PRE>
|
||
<P>
|
||
The linearization rules are not difficult to give:
|
||
</P>
|
||
<PRE>
|
||
lin
|
||
EPlus = postfix "iadd" ;
|
||
EMinus = postfix "isub" ;
|
||
ETimes = postfix "imul" ;
|
||
EDiv = postfix "idiv" ;
|
||
EInt i = ss ("iconst" ++ i.s) ;
|
||
oper
|
||
postfix : Str -> SS -> SS -> SS = \op,x,y ->
|
||
ss (x.s ++ ";" ++ y.s ++ ";" ++ op) ;
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc123"></A>
|
||
<H2>Speaking aloud arithmetic expressions</H2>
|
||
<P>
|
||
Natural languages have sometimes difficulties in expressing mathematical
|
||
formulas unambiguously, because they have no universal device of parentheses.
|
||
For arithmetic formulas, a solution exists:
|
||
</P>
|
||
<PRE>
|
||
2 + 3 * 4
|
||
</PRE>
|
||
<P>
|
||
can be expressed
|
||
</P>
|
||
<PRE>
|
||
the sum of 2 and the product of 3 and 4
|
||
</PRE>
|
||
<P>
|
||
However, this format is very verbose and unnatural, and becomes
|
||
impossible to understand when the complexity of expressions grows.
|
||
Fortunately, spoken language
|
||
has a very nice way of using <B>pauses</B> for disambiguation. This device was
|
||
introduced by Torbj<62>rn Lager (personal communication, 2003)
|
||
as an input mechanism to a calculator dialogue
|
||
system; it seems to correspond very closely to how we actually speak when we
|
||
want to communicate arithmetic expressions. Another application would be as
|
||
a part of a programming assistant that reads aloud code.
|
||
</P>
|
||
<P>
|
||
The idea is that, after every completed operation, there is a pause. Try this
|
||
by speaking aloud the following lines, making a pause instead of pronouncing the
|
||
word <CODE>PAUSE</CODE>:
|
||
</P>
|
||
<PRE>
|
||
2 plus 3 times 4 PAUSE
|
||
2 plus 3 PAUSE times 4 PAUSE
|
||
</PRE>
|
||
<P>
|
||
A grammar implementing this convention is again simple to write:
|
||
</P>
|
||
<PRE>
|
||
lin
|
||
EPlus = infix "plus" ;
|
||
EMinus = infix "minus" ;
|
||
ETimes = infix "times" ;
|
||
EDiv = infix ["divided by"] ;
|
||
EInt i = i ;
|
||
oper
|
||
infix : Str -> SS -> SS -> SS = \op,x,y ->
|
||
ss (x.s ++ op ++ y.s ++ "PAUSE") ;
|
||
</PRE>
|
||
<P>
|
||
Intuitively, a pause is taken to give the hearer time to compute an
|
||
intermediate result.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Is the pause-based grammar unambiguous? Test with random examples!
|
||
</P>
|
||
<A NAME="toc124"></A>
|
||
<H2>Programs with variables</H2>
|
||
<P>
|
||
A useful extension of arithmetic expressions is a <B>straight code</B> programming
|
||
language. The programs of this language are <B>assignments</B> of the form <CODE>x = exp</CODE>,
|
||
which assign expressions to variables. Expressions can moreover contain variables
|
||
that have been given values in previous assignments.
|
||
</P>
|
||
<P>
|
||
In this language, we use two new categories: programs and variables.
|
||
A program is a sequence of assignments, where a variable is given a value.
|
||
Logically, we want to distinguish <B>initializations</B> from other assignments:
|
||
these are the assignments where a variable is given a value for the first time.
|
||
Just like in C-like languages,
|
||
we prefix an initializing assignment with the type of the variable.
|
||
Here is an example of a piece of code written in the language:
|
||
</P>
|
||
<PRE>
|
||
int x = 2 + 3 ;
|
||
int y = x + 1 ;
|
||
x = x + 9 * y ;
|
||
</PRE>
|
||
<P>
|
||
We define programs by the following constructors:
|
||
</P>
|
||
<PRE>
|
||
fun
|
||
PEmpty : Prog ;
|
||
PInit : Exp -> (Var -> Prog) -> Prog ;
|
||
PAss : Var -> Exp -> Prog -> Prog ;
|
||
</PRE>
|
||
<P>
|
||
The interesting constructor is <CODE>PInit</CODE>, which uses
|
||
higher-order abstract syntax for making the initialized variable available in
|
||
the <B>continuation</B> of the program. The abstract syntax tree for the above code
|
||
is
|
||
</P>
|
||
<PRE>
|
||
PInit (EPlus (EInt 2) (EInt 3)) (\x ->
|
||
PInit (EPlus (EVar x) (EInt 1)) (\y ->
|
||
PAss x (EPlus (EVar x) (ETimes (EInt 9) (EVar y)))
|
||
PEmpty))
|
||
</PRE>
|
||
<P>
|
||
Since we want to prevent the use of uninitialized variables in programs, we
|
||
don't give any constructors for <CODE>Var</CODE>! We just have a rule for using variables
|
||
as expressions:
|
||
</P>
|
||
<PRE>
|
||
fun EVar : Var -> Exp ;
|
||
</PRE>
|
||
<P>
|
||
The rest of the grammar is just the same as for arithmetic expressions
|
||
<a href="#secprecedence">here</a>. The best way to implement it is perhaps by writing a
|
||
module that extends the expression module. The most natural start category
|
||
of the extension is <CODE>Prog</CODE>.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Extend the straight-code language to expressions of type <CODE>float</CODE>.
|
||
To guarantee type safety, you can define a category <CODE>Typ</CODE> of types, and
|
||
make <CODE>Exp</CODE> and <CODE>Var</CODE> dependent on <CODE>Typ</CODE>. Basic floating point expressions
|
||
can be formed from literal of the built-in GF type <CODE>Float</CODE>. The arithmetic
|
||
operations should be made polymorphic (as <a href="#secpolymorphic">here</a>).
|
||
</P>
|
||
<A NAME="toc125"></A>
|
||
<H3>The concrete syntax of assignments</H3>
|
||
<P>
|
||
We can define a C-like concrete syntax by using GF's <CODE>$</CODE> variables, as explained
|
||
<a href="#secbinding">here</a>.
|
||
</P>
|
||
<P>
|
||
In a JVM-like syntax, we need two more instructions: <CODE>iload</CODE> <I>x</I>, which
|
||
loads (pushes on the stack) the value of the variable <I>x</I>, and <CODE>istore</CODE> <I>x</I>,
|
||
which stores the value of the currently topmost expression in the variable <I>x</I>.
|
||
Thus the code for the example in the previous section is
|
||
</P>
|
||
<PRE>
|
||
iconst 2 ; iconst 3 ; iadd ; istore x ;
|
||
iload x ; iconst 1 ; iadd ; istore y ;
|
||
iload x ; iconst 9 ; iload y ; imul ; iadd ; istore x ;
|
||
</PRE>
|
||
<P>
|
||
Those familiar with JVM will notice that we are using <B>symbolic addresses</B>, i.e.
|
||
variable names instead of integer offsets in the memory. Neither real JVM nor
|
||
our variant makes any distinction between the initialization and reassignment
|
||
of a variable.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Finish the implementation of the
|
||
C-to-JVM compiler by extending the expression modules
|
||
to straight code programs.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. If you made the exercise of adding floating point numbers to
|
||
the language, you can now cash out the main advantage of type checking
|
||
for code generation: selecting type-correct JVM instructions. The floating
|
||
point instructions are precisely the same as the integer one, except that
|
||
the prefix is <CODE>f</CODE> instead of <CODE>i</CODE>, and that <CODE>fconst</CODE> takes floating
|
||
point literals as arguments.
|
||
</P>
|
||
<A NAME="toc126"></A>
|
||
<H3>A liberal syntax of variables</H3>
|
||
<P>
|
||
In many applications, the task of GF is just linearization and parsing;
|
||
keeping track of bound variables and other semantic constraints is
|
||
the task of other parts of the program. For instance, if we want to
|
||
write a natural language interface that reads aloud C code, we can
|
||
quite as well use a context-free grammar of C, and leave it to the C
|
||
compiler to check that variables make sense. In such a program, we may
|
||
want to treat variables as <I>Strings</I>, i.e. to have a constructor
|
||
</P>
|
||
<PRE>
|
||
fun VString : String -> Var ;
|
||
</PRE>
|
||
<P>
|
||
The built-in category <CODE>String</CODE> has as its values <B>string literals</B>,
|
||
which are strings in double quotes. The lexer and unlexer <CODE>codelit</CODE>
|
||
restore and remove the quotes; when the lexer finds a token that is
|
||
neither a terminal in the grammar nor an integer literal, it sends
|
||
it to the parser as a string literal.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Write a grammar for straight code without higher-order
|
||
abstract syntax.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Extend the liberal straight code grammar to <CODE>while</CODE> loops and
|
||
some other program constructs, and investigate if you can build a reasonable spoken
|
||
language generator for this fragment.
|
||
</P>
|
||
<A NAME="toc127"></A>
|
||
<H2>Conclusion</H2>
|
||
<P>
|
||
Since formal languages are syntactically simpler than natural languages, it
|
||
is no wonder that their grammars can be defined in GF. Some thought is needed
|
||
for dealing with precedences and spacing, but much of it is encoded in GF's
|
||
libraries and built-in lexers and unlexers. If the sole purpose of a grammar
|
||
is to implement a programming language, then the <B>BNF Converter</B> tool
|
||
(BNFC) is more appropriate than GF:
|
||
<center>
|
||
<CODE>www.cs.chalmers.se/~markus/BNFC/</CODE>
|
||
</center>
|
||
BNFC uses standard YACC-like parser tools. GF has flags for printing
|
||
grammars in the BNFC format.
|
||
</P>
|
||
<P>
|
||
The most common applications of GF grammars of formal languages
|
||
are in natural-language interfaces of various kinds.
|
||
These systems don't usually need semantic control in GF abstract
|
||
syntax. However, the situation can be different if the interface also comprises
|
||
an interactive syntax editor, as in the GF-Key system
|
||
(Beckert & al. 2006, Burke & Johannisson 2005).
|
||
In that system, the editor is used for guiding programmers only to write
|
||
type-correct code.
|
||
</P>
|
||
<P>
|
||
The technique of continuations in modelling programming languages has recently
|
||
been applied to natural language, for processing <B>anaphoric reference</B>,
|
||
e.g. pronouns. It may be good to know that GF has the machinery available;
|
||
for the time being, however (GF 2.8), dependent types and
|
||
higher-order abstract syntax are not supported by the embedded GF implementations
|
||
in Haskell and Java.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. The book <I>C programming language</I> by Kernighan and Ritchie
|
||
(p. 123, 2nd edition, 1988) describes an English-like syntax for pointer and
|
||
array declarations, and a C program for translating between English and C.
|
||
The following example pair shows all the expression forms needed:
|
||
</P>
|
||
<PRE>
|
||
char (*(*x[3])())[5]
|
||
|
||
x: array[3] of pointer to function returning
|
||
pointer to array[5] of char
|
||
</PRE>
|
||
<P>
|
||
Implement these translations by a GF grammar.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Design a natural-language interface to Unix command lines.
|
||
It should be able to express verbally commands such as
|
||
<CODE>cat, cd, grep, ls, mv, rm, wc</CODE> and also
|
||
pipes built from them.
|
||
</P>
|
||
<A NAME="toc128"></A>
|
||
<H2>Summary of GF language constructs</H2>
|
||
<A NAME="toc129"></A>
|
||
<H3>Lexers and unlexers</H3>
|
||
<P>
|
||
Lexers are set by the flag <CODE>lexer</CODE> and unlexers by the flag <CODE>unlexer</CODE>.
|
||
</P>
|
||
<TABLE ALIGN="center" CELLPADDING="4" BORDER="1">
|
||
<TR>
|
||
<TH>lexer</TH>
|
||
<TH COLSPAN="2">description</TH>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>words</CODE></TD>
|
||
<TD>(default) tokens are separated by spaces or newlines</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>literals</CODE></TD>
|
||
<TD>like words, but integer and string literals recognized</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>chars</CODE></TD>
|
||
<TD>each character is a token</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>code</CODE></TD>
|
||
<TD>program code conventions (uses Haskell's lex)</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>text</CODE></TD>
|
||
<TD>with conventions on punctuation and capital letters</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>codelit</CODE></TD>
|
||
<TD>like code, but recognize literals (unknown words as strings)</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>textlit</CODE></TD>
|
||
<TD>like text, but recognize literals (unknown words as strings)</TD>
|
||
</TR>
|
||
</TABLE>
|
||
|
||
<P></P>
|
||
<TABLE ALIGN="center" CELLPADDING="4" BORDER="1">
|
||
<TR>
|
||
<TH>unlexer</TH>
|
||
<TH COLSPAN="2">description</TH>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>unwords</CODE></TD>
|
||
<TD>(default) space-separated token list</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>text</CODE></TD>
|
||
<TD>format as text: punctuation, capitals, paragraph <p></TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>code</CODE></TD>
|
||
<TD>format as code (spacing, indentation)</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>textlit</CODE></TD>
|
||
<TD>like text, but remove string literal quotes</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>codelit</CODE></TD>
|
||
<TD>like code, but remove string literal quotes</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>concat</CODE></TD>
|
||
<TD>remove all spaces</TD>
|
||
</TR>
|
||
</TABLE>
|
||
|
||
<P></P>
|
||
<A NAME="toc130"></A>
|
||
<H3>Built-in abstract syntax types</H3>
|
||
<P>
|
||
There are three built-in types. Their syntax trees are literals of corresponding kinds:
|
||
</P>
|
||
<UL>
|
||
<LI><CODE>Int</CODE>, with nonnegative integer literals e.g. <CODE>987031434</CODE>
|
||
<LI><CODE>Float</CODE>, with nonnegative floating-point literals e.g. <CODE>907.219807</CODE>
|
||
<LI><CODE>String</CODE>, with string literals e.g. <CODE>"foo"</CODE>
|
||
</UL>
|
||
|
||
<P>
|
||
Their linearization type is uniformly <CODE>{s : Str}</CODE>.
|
||
</P>
|
||
<A NAME="toc131"></A>
|
||
<H1>Embedded grammars</H1>
|
||
<P>
|
||
<a name="chapeight"></a>
|
||
</P>
|
||
<P>
|
||
GF grammars can be used as parts of programs written in other programming
|
||
languages. Haskell and Java.
|
||
This facility is based on several components:
|
||
</P>
|
||
<UL>
|
||
<LI>a portable format for multilingual GF grammars
|
||
<LI>an interpreter for this format written in the host language
|
||
<LI>an API that enables reading grammar files and calling the interpreter
|
||
<LI>a way to manipulate abstract syntax trees in the host language
|
||
</UL>
|
||
|
||
<P>
|
||
In this chapter, we will show the basic ways of producing such
|
||
<B>embedded grammars</B> and using them in Haskell, Java, and JavaScript programs.
|
||
We will build a simple example application in each language:
|
||
</P>
|
||
<UL>
|
||
<LI>a question-answering system in Haskell
|
||
<LI>a translator GUI in Java
|
||
<LI>a multilingual syntax editor in JavaScript
|
||
</UL>
|
||
|
||
<P>
|
||
Moreover, we will use how grammar applications can be extended to
|
||
spoken language by generating <B>language models</B> for speech recognition
|
||
in various standard formats.
|
||
</P>
|
||
<A NAME="toc132"></A>
|
||
<H2>The portable grammar format</H2>
|
||
<P>
|
||
The portable format is called GFCC, "GF Canonical Compiled". A file
|
||
of this form can be produced from GF by the command
|
||
</P>
|
||
<PRE>
|
||
> print_multi -printer=gfcc | write_file FILE.gfcc
|
||
</PRE>
|
||
<P>
|
||
Files written in this format can also be imported in the GF system,
|
||
which recognizes the suffix <CODE>.gfcc</CODE> and builds the multilingual
|
||
grammar in memory.
|
||
</P>
|
||
<P>
|
||
<I>This applies to GF version 3 and upwards. Older GF used a format suffixed</I>
|
||
<CODE>.gfcm</CODE>.
|
||
<I>At the moment of writing, also the Java interpreter still uses the GFCM format.</I>
|
||
</P>
|
||
<P>
|
||
GFCC is, in fact, the recommended format in
|
||
which final grammar products are distributed, because they
|
||
are stripped from superfluous information and can be started and applied
|
||
faster than sets of separate modules.
|
||
</P>
|
||
<P>
|
||
Application programmers have never any need to read or modify GFCC files.
|
||
Also in this sense, they play the same role as machine code in
|
||
general-purpose programming.
|
||
</P>
|
||
<A NAME="toc133"></A>
|
||
<H2>The embedded interpreter and its API</H2>
|
||
<P>
|
||
The interpreter is a kind of a miniature GF system, which can parse and
|
||
linearize with grammars. But it can only perform a subset of the commands of
|
||
the GF system. For instance, it
|
||
cannot compile source grammars into the GFCC format; the compiler is the most
|
||
heavy-weight component of the GF system, and should not be carried around
|
||
in end-user applications.
|
||
Since GFCC is much
|
||
simpler than source GF, building an interpreter is relatively easy.
|
||
Full-scale interpreters currently exist in Haskell and Java, and partial
|
||
ones in C++, JavaScript, and Prolog. We will in this chapter focus
|
||
on Haskell, Java, and JavaScript.
|
||
</P>
|
||
<P>
|
||
Application programmers never need to read or modify the interpreter.
|
||
They only need to access it via its API.
|
||
</P>
|
||
<A NAME="toc134"></A>
|
||
<H2>Embedded GF applications in Haskell</H2>
|
||
<P>
|
||
Readers unfamiliar with Haskell, or who just want to program in Java, can safely
|
||
skip this section. Everything will be repeated in the corresponding Java
|
||
section. However, seeing the Haskell code may still be helpful because
|
||
Haskell is in many ways closer to GF than Java is. In particular, recursive
|
||
types of syntax trees and pattern matching over them are very similar in
|
||
Haskell and GF,
|
||
but require a complex encoding with classes and visitors in Java.
|
||
</P>
|
||
<A NAME="toc135"></A>
|
||
<H3>The EmbedAPI module</H3>
|
||
<P>
|
||
The Haskell API contains (among other things) the following types and functions:
|
||
</P>
|
||
<PRE>
|
||
module EmbedAPI where
|
||
|
||
type MultiGrammar
|
||
type Language
|
||
type Category
|
||
type Tree
|
||
|
||
file2grammar :: FilePath -> IO MultiGrammar
|
||
|
||
linearize :: MultiGrammar -> Language -> Tree -> String
|
||
parse :: MultiGrammar -> Language -> Category -> String -> [Tree]
|
||
|
||
linearizeAll :: MultiGrammar -> Tree -> [String]
|
||
linearizeAllLang :: MultiGrammar -> Tree -> [(Language,String)]
|
||
|
||
parseAll :: MultiGrammar -> Category -> String -> [[Tree]]
|
||
parseAllLang :: MultiGrammar -> Category -> String -> [(Language,[Tree])]
|
||
|
||
languages :: MultiGrammar -> [Language]
|
||
categories :: MultiGrammar -> [Category]
|
||
startCat :: MultiGrammar -> Category
|
||
</PRE>
|
||
<P>
|
||
This is the only module that needs to be imported in the Haskell application.
|
||
It is available as a part of the GF distribution, in the file
|
||
<CODE>src/GF/GFCC/API.hs</CODE>.
|
||
</P>
|
||
<A NAME="toc136"></A>
|
||
<H3>First application: a translator</H3>
|
||
<P>
|
||
Let us first build a stand-alone translator, which can translate
|
||
in any multilingual grammar between any languages in the grammar.
|
||
The whole code for this translator is here:
|
||
</P>
|
||
<PRE>
|
||
module Main where
|
||
|
||
import GF.GFCC.API
|
||
import System (getArgs)
|
||
|
||
main :: IO ()
|
||
main = do
|
||
file:_ <- getArgs
|
||
gr <- file2grammar file
|
||
interact (translate gr)
|
||
|
||
translate :: MultiGrammar -> String -> String
|
||
translate gr = case parseAllLang gr (startCat gr) s of
|
||
(lg,t:_):_ -> unlines [linearize gr l t | l <- languages gr, l /= lg]
|
||
_ -> "NO PARSE"
|
||
</PRE>
|
||
<P>
|
||
To run the translator, first compile it by
|
||
</P>
|
||
<PRE>
|
||
% ghc --make -o trans Translator.hs
|
||
</PRE>
|
||
<P>
|
||
Then produce a GFCC file. For instance, the <CODE>Food</CODE> grammar set can be
|
||
compiled as follows:
|
||
</P>
|
||
<PRE>
|
||
% gfc --make FoodEng.gf FoodIta.gf
|
||
</PRE>
|
||
<P>
|
||
This produces the file <CODE>Food.gfcc</CODE> (its name comes from the abstract syntax).
|
||
</P>
|
||
<P>
|
||
<I>The gfc batch compiler program is available in GF 3 and upwards.</I>
|
||
<I>In earlier versions, the appropriate command can be piped to gf:</I>
|
||
</P>
|
||
<PRE>
|
||
% echo "pm -printer=gfcc | wf Food.gfcc" | gf FoodEng.gf FoodIta.gf
|
||
</PRE>
|
||
<P>
|
||
Equivalently, the grammars could be read into GF shell and the <CODE>pm</CODE> command
|
||
issued from there. But the unix command has the advantage that it can
|
||
be put into a <CODE>Makefile</CODE> to automate the compilation of an application.
|
||
</P>
|
||
<P>
|
||
The Haskell library function <CODE>interact</CODE> makes the <CODE>trans</CODE> program work
|
||
like a Unix filter, which reads from standard input and writes to standard
|
||
output. Therefore it can be a part of a pipe and read and write files.
|
||
The simplest way to translate is to <CODE>echo</CODE> input to the program:
|
||
</P>
|
||
<PRE>
|
||
% echo "this wine is delicious" | ./trans Food.gfcc
|
||
questo vino <20> delizioso
|
||
</PRE>
|
||
<P>
|
||
The result is given in all languages except the input language.
|
||
</P>
|
||
<A NAME="toc137"></A>
|
||
<H3>A looping translator</H3>
|
||
<P>
|
||
If the user wants to translate many expressions in a sequence, it
|
||
is cumbersome to have to start the translator over and over again,
|
||
because reading the grammar and building the parser always takes
|
||
time. The translator of the previous section is easy to modify
|
||
to enable this: just change <CODE>interact</CODE> in the main function to
|
||
<CODE>loop</CODE>. It is not a standard Haskell function, so its definition has
|
||
to be included:
|
||
</P>
|
||
<PRE>
|
||
loop :: (String -> String) -> IO ()
|
||
loop trans = do
|
||
s <- getLine
|
||
if s == "quit" then putStrLn "bye" else do
|
||
putStrLn $ trans s
|
||
loop trans
|
||
</PRE>
|
||
<P>
|
||
The loop keeps on translating line by line until the input line
|
||
is <CODE>quit</CODE>.
|
||
</P>
|
||
<A NAME="toc138"></A>
|
||
<H3>A question-answer system</H3>
|
||
<P>
|
||
<a name="secmathprogram"></a>
|
||
</P>
|
||
<P>
|
||
The next application is also a translator, but it adds a
|
||
<B>transfer</B> component to the grammar. Transfer is a function that
|
||
takes the input syntax tree into some other syntax tree, which is
|
||
then linearized and shown back to the user. The transfer function we
|
||
are going to use is one that computes a question into an answer.
|
||
The program accepts simple questions about arithmetic and answers
|
||
"yes" or "no" in the language in which the question was made:
|
||
</P>
|
||
<PRE>
|
||
Is 123 prime?
|
||
No.
|
||
77 est impair ?
|
||
Oui.
|
||
</PRE>
|
||
<P>
|
||
The main change that is needed to the pure translator is to give
|
||
the type of <CODE>translate</CODE> an extra argument: a transfer function.
|
||
</P>
|
||
<PRE>
|
||
translate :: (Tree -> Tree) -> MultiGrammar -> String -> String
|
||
</PRE>
|
||
<P>
|
||
You can think of ordinary translation as a special case where
|
||
transfer is the identity function (<CODE>id</CODE> in Haskell).
|
||
</P>
|
||
<P>
|
||
Also the behaviour of returning the reply in different languages
|
||
should be changed so that the reply is returned in the <I>same</I> language.
|
||
Here is the complete definition of <CODE>translate</CODE> in the new form.
|
||
</P>
|
||
<PRE>
|
||
translate tr gr = case parseAllLang gr (startCat gr) s of
|
||
(lg,t:_):_ -> linearize gr lg (tr t)
|
||
_ -> "NO PARSE"
|
||
</PRE>
|
||
<P>
|
||
To complete the system, we have to define the transfer function.
|
||
So, how can we write a function from from abstract syntax trees
|
||
to abstract syntax trees? The embedded API does not make
|
||
the constructors of the type <CODE>Tree</CODE> available for users. Even if it did, it would
|
||
be quite complicated to use the type, and programs would be likely
|
||
to produce trees that are ill-typed in GF and therefore cannot
|
||
be linearized.
|
||
</P>
|
||
<A NAME="toc139"></A>
|
||
<H3>Exporting GF datatypes</H3>
|
||
<P>
|
||
The way to go in defining transfer is to use GF's tree constructors, that
|
||
is, the <CODE>fun</CODE> functions, as if they were Haskell's data constructors.
|
||
There is enough resemblance between GF and Haskell to make this possible
|
||
in most cases. It is even possible in Java, as we shall see later.
|
||
</P>
|
||
<P>
|
||
Thus every category of GF is translated into a Haskell datatype, where the
|
||
functions producing a value of that category are treated as constructors.
|
||
The translation is obtained by using the batch compiler with the command
|
||
</P>
|
||
<PRE>
|
||
% gfc -haskell Food.gfcc
|
||
</PRE>
|
||
<P>
|
||
It is also possible to produce the Haskell file together with GFCC, by
|
||
</P>
|
||
<PRE>
|
||
% gfc --make -haskell FoodEng.gf FoodIta.gf
|
||
</PRE>
|
||
<P>
|
||
The result is a file named <CODE>GSyntax.hs</CODE>, containing a
|
||
module named <CODE>GSyntax</CODE>.
|
||
</P>
|
||
<P>
|
||
<I>In GF before version 3, the same result is obtained from within GF, by the command</I>
|
||
</P>
|
||
<PRE>
|
||
> print_grammar -printer=gfcc_haskell | write_file GSyntax.hs
|
||
</PRE>
|
||
<P></P>
|
||
<P>
|
||
As an example, we take
|
||
the grammar we are going to use for queries. The abstract syntax is
|
||
</P>
|
||
<PRE>
|
||
abstract Math = {
|
||
|
||
flags startcat = Question ;
|
||
|
||
cat Answer ; Question ; Object ;
|
||
|
||
fun
|
||
Even : Object -> Question ;
|
||
Odd : Object -> Question ;
|
||
Prime : Object -> Question ;
|
||
Number : Int -> Object ;
|
||
|
||
Yes : Answer ;
|
||
No : Answer ;
|
||
}
|
||
</PRE>
|
||
<P>
|
||
It is translated to the following system of datatypes:
|
||
</P>
|
||
<PRE>
|
||
newtype GInt = GInt Integer
|
||
|
||
data GAnswer =
|
||
GYes
|
||
| GNo
|
||
|
||
data GObject = GNumber GInt
|
||
|
||
data GQuestion =
|
||
GPrime GObject
|
||
| GOdd GObject
|
||
| GEven GObject
|
||
</PRE>
|
||
<P>
|
||
All type and constructor names are prefixed with a <CODE>G</CODE> to prevent clashes.
|
||
</P>
|
||
<P>
|
||
Now it is possible to define functions from and to these datatype, in Haskell.
|
||
Haskell's type checker guarantees that the functions are well-typed also with
|
||
respect to GF. Here is a question-to-answer function for this language:
|
||
</P>
|
||
<PRE>
|
||
answer :: GQuestion -> GAnswer
|
||
answer p = case p of
|
||
GOdd x -> test odd x
|
||
GEven x -> test even x
|
||
GPrime x -> test prime x
|
||
|
||
value :: GObject -> Int
|
||
value e = case e of
|
||
GNumber (GInt i) -> fromInteger i
|
||
|
||
test :: (Int -> Bool) -> GObject -> GAnswer
|
||
test f x = if f (value x) then GYes else GNo
|
||
</PRE>
|
||
<P>
|
||
So it is the function <CODE>answer</CODE> that we want to apply as transfer.
|
||
The only problem is the <I>type</I> of this function: the parsing and
|
||
linearization method of <CODE>API</CODE> work with <CODE>Tree</CODE>s and not
|
||
with <CODE>GQuestion</CODE>s and <CODE>GAnswers</CODE>.
|
||
</P>
|
||
<P>
|
||
Fortunately the Haskell translation of GF takes care of translating
|
||
between trees and the generated datatypes. This is done by using
|
||
a class with the required translation methods:
|
||
</P>
|
||
<PRE>
|
||
class Gf a where
|
||
gf :: a -> Tree
|
||
fg :: Tree -> a
|
||
</PRE>
|
||
<P>
|
||
The Haskell code generator also generates instances of these classes
|
||
for each datatype, for example,
|
||
</P>
|
||
<PRE>
|
||
instance Gf GQuestion where
|
||
gf (GEven x1) = DTr [] (AC (CId "Even")) [gf x1]
|
||
gf (GOdd x1) = DTr [] (AC (CId "Odd")) [gf x1]
|
||
gf (GPrime x1) = DTr [] (AC (CId "Prime")) [gf x1]
|
||
fg t =
|
||
case t of
|
||
DTr [] (AC (CId "Even")) [x1] -> GEven (fg x1)
|
||
DTr [] (AC (CId "Odd")) [x1] -> GOdd (fg x1)
|
||
DTr [] (AC (CId "Prime")) [x1] -> GPrime (fg x1)
|
||
_ -> error ("no Question " ++ show t)
|
||
</PRE>
|
||
<P>
|
||
Needless to say, <CODE>GSyntax</CODE> is a module that a programmer
|
||
never needs to look into, let alone change: it is enough to know that it
|
||
contains a systematic encoding and decoding between an abstract syntax
|
||
and Haskell datatypes, where
|
||
</P>
|
||
<UL>
|
||
<LI>all GF names are in Haskell prefixed with <CODE>G</CODE>
|
||
<LI><CODE>gf</CODE> translates from Haskell to GF
|
||
<LI><CODE>fg</CODE> translates from GF to Haskell
|
||
</UL>
|
||
|
||
<A NAME="toc140"></A>
|
||
<H3>Putting it all together</H3>
|
||
<P>
|
||
Here is the complete code for the Haskell module <CODE>TransferLoop.hs</CODE>.
|
||
</P>
|
||
<PRE>
|
||
module Main where
|
||
|
||
import GF.GFCC.API
|
||
import TransferDef (transfer)
|
||
|
||
main :: IO ()
|
||
main = do
|
||
gr <- file2grammar "Math.gfcc"
|
||
loop (translate transfer gr)
|
||
|
||
loop :: (String -> String) -> IO ()
|
||
loop trans = do
|
||
s <- getLine
|
||
if s == "quit" then putStrLn "bye" else do
|
||
putStrLn $ trans s
|
||
loop trans
|
||
|
||
translate :: (Tree -> Tree) -> MultiGrammar -> String -> String
|
||
translate tr gr = case parseAllLang gr (startCat gr) s of
|
||
(lg,t:_):_ -> linearize gr lg (tr t)
|
||
_ -> "NO PARSE"
|
||
</PRE>
|
||
<P>
|
||
This is the <CODE>Main</CODE> module, which just needs a function <CODE>transfer</CODE> from
|
||
<CODE>TransferDef</CODE> in order to compile. In the current application, this module
|
||
looks as follows:
|
||
</P>
|
||
<PRE>
|
||
module TransferDef where
|
||
|
||
import GF.GFCC.API (Tree)
|
||
import GSyntax
|
||
|
||
transfer :: Tree -> Tree
|
||
transfer = gf . answer . fg
|
||
|
||
answer :: GQuestion -> GAnswer
|
||
answer p = case p of
|
||
GOdd x -> test odd x
|
||
GEven x -> test even x
|
||
GPrime x -> test prime x
|
||
|
||
value :: GObject -> Int
|
||
value e = case e of
|
||
GNumber (GInt i) -> fromInteger i
|
||
|
||
test :: (Int -> Bool) -> GObject -> GAnswer
|
||
test f x = if f (value x) then GYes else GNo
|
||
|
||
prime :: Int -> Bool
|
||
prime x = elem x primes where
|
||
primes = sieve [2 .. x]
|
||
sieve (p:xs) = p : sieve [ n | n <- xs, n `mod` p > 0 ]
|
||
sieve [] = []
|
||
</PRE>
|
||
<P>
|
||
This module, in turn, needs <CODE>GSyntax</CODE> to compile, and the main module
|
||
needs <CODE>Math.gfcc</CODE> to run. To automate the production of the system,
|
||
we write a <CODE>Makefile</CODE> as follows:
|
||
</P>
|
||
<PRE>
|
||
all:
|
||
gfc --make -haskell MathEng.gf MathFre.gf
|
||
ghc --make -o ./math TransferLoop.hs
|
||
strip math
|
||
</PRE>
|
||
<P>
|
||
(Notice that the empty segments starting the command lines in a Makefile must be tabs.)
|
||
Now we can compile the whole system by just typing
|
||
</P>
|
||
<PRE>
|
||
make
|
||
</PRE>
|
||
<P>
|
||
Then you can run it by typing
|
||
</P>
|
||
<PRE>
|
||
./math
|
||
</PRE>
|
||
<P>
|
||
Well --- you will of course need some concrete syntaxes of <CODE>Math</CODE> in order
|
||
to succeed. We have defined ours by using the resource functor design pattern,
|
||
as explained <a href="#secfunctor">here</a>.
|
||
</P>
|
||
<P>
|
||
Just to summarize, the source of the application consists of the following files:
|
||
</P>
|
||
<PRE>
|
||
Makefile -- a makefile
|
||
Math.gf -- abstract syntax
|
||
Math???.gf -- concrete syntaxes
|
||
TransferDef.hs -- definition of question-to-answer function
|
||
TransferLoop.hs -- Haskell Main module
|
||
</PRE>
|
||
<P></P>
|
||
<A NAME="toc141"></A>
|
||
<H2>Embedded GF applications in Java</H2>
|
||
<P>
|
||
When an API for GFCC in Java is available,
|
||
we will write the same applications in Java as
|
||
were written in Haskell above. Until then, we will
|
||
build another kind of application, which does not require
|
||
modification of generated Java code.
|
||
</P>
|
||
<P>
|
||
More information on embedded GF grammars in Java can be found in the document
|
||
</P>
|
||
<PRE>
|
||
www.cs.chalmers.se/~bringert/gf/gf-java.html
|
||
</PRE>
|
||
<P>
|
||
by Bj<42>rn Bringert.
|
||
</P>
|
||
<A NAME="toc142"></A>
|
||
<H3>Translets</H3>
|
||
<P>
|
||
A Java system needs many more files than a Haskell system.
|
||
To get started, one can fetch the package <CODE>gfc2java</CODE> from
|
||
</P>
|
||
<PRE>
|
||
www.cs.chalmers.se/~bringert/darcs/gfc2java/
|
||
</PRE>
|
||
<P>
|
||
by using the Darcs version control system as described in the <CODE>gf-java</CODE> home page.
|
||
</P>
|
||
<P>
|
||
The <CODE>gfc2java</CODE> package contains a script <CODE>build-translet</CODE>, which can be applied
|
||
to any <CODE>.gfcm</CODE> file to create a <B>translet</B>, a small translation GUI. Foor the <CODE>Food</CODE>
|
||
grammars of <a href="#chapthree">the third chapter</a>, we first create a file <CODE>food.gfcm</CODE> by
|
||
</P>
|
||
<PRE>
|
||
% echo "pm | wf food.gfcm" | gf FoodEng.gf FoodIta.gf
|
||
</PRE>
|
||
<P>
|
||
and then run
|
||
</P>
|
||
<PRE>
|
||
% build_translet food.gfcm
|
||
</PRE>
|
||
<P>
|
||
The resulting file <CODE>translate-food.jar</CODE> can be run with
|
||
</P>
|
||
<PRE>
|
||
% java -jar translate-food.jar
|
||
</PRE>
|
||
<P>
|
||
The translet looks like this:
|
||
</P>
|
||
<P>
|
||
<IMG ALIGN="right" SRC="food-translet.png" BORDER="0" ALT="">
|
||
</P>
|
||
<A NAME="toc143"></A>
|
||
<H3>Dialogue systems</H3>
|
||
<P>
|
||
A question-answer system is a special case of a <B>dialogue system</B>, where the user and
|
||
the computer communicate by writing or, even more properly, by speech. The <CODE>gf-java</CODE>
|
||
homepage provides an example of a most simple dialogue system imaginable, where two
|
||
the conversation has just two rules:
|
||
</P>
|
||
<UL>
|
||
<LI>if the user says <I>here you go</I>, the system says <I>thanks</I>
|
||
<LI>if the user says <I>thanks</I>, the system says <I>you are welcome</I>
|
||
</UL>
|
||
|
||
<P>
|
||
The conversation can be made in both English and Swedish; the user's initiative
|
||
decides which language the system replies in. Thus the structure is very similar
|
||
to the <CODE>math</CODE> program <a href="#secmathprogram">here</a>. The GF and
|
||
Java sources of the program can be
|
||
found in
|
||
</P>
|
||
<PRE>
|
||
www.cs.chalmers.se/~bringert/darcs/simpledemo
|
||
</PRE>
|
||
<P>
|
||
again accessible with the Darcs version control system.
|
||
</P>
|
||
<A NAME="toc144"></A>
|
||
<H2>Language models for speech recognition</H2>
|
||
<P>
|
||
The standard way of using GF in speech recognition is by building
|
||
<B>grammar-based language models</B>. To this end, GF comes with compilers
|
||
into several formats that are used in speech recognition systems.
|
||
One such format is GSL, used in the <A HREF="http://www.nuance.com">Nuance speech recognizer</A>.
|
||
It is produced from GF simply by printing a grammar with the flag
|
||
<CODE>-printer=gsl</CODE>. The following example uses the smart house grammar defined
|
||
<a href="#secsmarthouse">here</a>.
|
||
</P>
|
||
<PRE>
|
||
> import -conversion=finite SmartEng.gf
|
||
> print_grammar -printer=gsl
|
||
|
||
;GSL2.0
|
||
; Nuance speech recognition grammar for SmartEng
|
||
; Generated by GF
|
||
|
||
.MAIN SmartEng_2
|
||
|
||
SmartEng_0 [("switch" "off") ("switch" "on")]
|
||
SmartEng_1 ["dim" ("switch" "off")
|
||
("switch" "on")]
|
||
SmartEng_2 [(SmartEng_0 SmartEng_3)
|
||
(SmartEng_1 SmartEng_4)]
|
||
SmartEng_3 ("the" SmartEng_5)
|
||
SmartEng_4 ("the" SmartEng_6)
|
||
SmartEng_5 "fan"
|
||
SmartEng_6 "light"
|
||
</PRE>
|
||
<P>
|
||
Other formats available via the <CODE>-printer</CODE> flag include:
|
||
</P>
|
||
<TABLE ALIGN="center" CELLPADDING="4" BORDER="1">
|
||
<TR>
|
||
<TH>Format</TH>
|
||
<TH COLSPAN="2">Description</TH>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>gsl</CODE></TD>
|
||
<TD>Nuance GSL speech recognition grammar</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>jsgf</CODE></TD>
|
||
<TD>Java Speech Grammar Format (JSGF)</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>jsgf_sisr_old</CODE></TD>
|
||
<TD>JSGF with semantic tags in SISR WD 20030401 format</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>srgs_abnf</CODE></TD>
|
||
<TD>SRGS ABNF format</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>srgs_xml</CODE></TD>
|
||
<TD>SRGS XML format</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>srgs_xml_prob</CODE></TD>
|
||
<TD>SRGS XML format, with weights</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>slf</CODE></TD>
|
||
<TD>finite automaton in the HTK SLF format</TD>
|
||
</TR>
|
||
<TR>
|
||
<TD><CODE>slf_sub</CODE></TD>
|
||
<TD>finite automaton with sub-automata in HTK SLF</TD>
|
||
</TR>
|
||
</TABLE>
|
||
|
||
<P></P>
|
||
<P>
|
||
All currently available formats can be seen in gf with <CODE>help -printer</CODE>.
|
||
</P>
|
||
<A NAME="toc145"></A>
|
||
<H2>Dependent types and spoken language models</H2>
|
||
<P>
|
||
We have used dependent types to control semantic well-formedness
|
||
in grammars. This is important in traditional type theory
|
||
applications such as proof assistants, where only mathematically
|
||
meaningful formulas should be constructed. But semantic filtering has
|
||
also proved important in speech recognition, because it reduces the
|
||
ambiguity of the results.
|
||
</P>
|
||
<P>
|
||
Now, GSL is a context-free format, so how does it cope with dependent types?
|
||
In general, dependent types can give rise to infinitely many basic types
|
||
(exercise!), whereas a context-free grammar can by definition only have
|
||
finitely many nonterminals.
|
||
</P>
|
||
<P>
|
||
This is where the flag <CODE>-conversion=finite</CODE> is needed in the <CODE>import</CODE>
|
||
command. Its effect is to convert a GF grammar with dependent types to
|
||
one without, so that each instance of a dependent type is replaced by
|
||
an atomic type. This can then be used as a nonterminal in a context-free
|
||
grammar. The <CODE>finite</CODE> conversion presupposes that every
|
||
dependent type has only finitely many instances, which is in fact
|
||
the case in the <CODE>Smart</CODE> grammar.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. If you have access to the Nuance speech recognizer,
|
||
test it with GF-generated language models for <CODE>SmartEng</CODE>. Do this
|
||
both with and without <CODE>-conversion=finite</CODE>.
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Construct an abstract syntax with infinitely many instances
|
||
of dependent types.
|
||
</P>
|
||
<A NAME="toc146"></A>
|
||
<H3>Statistical language models</H3>
|
||
<P>
|
||
An alternative to grammar-based language models are
|
||
<B>statistical language models</B> (<B>SLM</B>s). An SLM is
|
||
built from a <B>corpus</B>, i.e. a set of utterances. It specifies the
|
||
probability of each <B>n-gram</B>, i.e. sequence of <I>n</I> words. The
|
||
typical value of <I>n</I> is 2 (bigrams) or 3 (trigrams).
|
||
</P>
|
||
<P>
|
||
One advantage of SLMs over grammar-based models is that they are
|
||
<B>robust</B>, i.e. they can be used to recognize sequences that would
|
||
be out of the grammar or the corpus. Another advantage is that
|
||
an SLM can be built "for free" if a corpus is available.
|
||
</P>
|
||
<P>
|
||
However, collecting a corpus can require a lot of work, and writing
|
||
a grammar can be less demanding, especially with tools such as GF or
|
||
Regulus. This advantage of grammars can be combined with robustness
|
||
by creating a back-up SLM from a <B>synthesized corpus</B>. This means
|
||
simply that the grammar is used for generating such a corpus.
|
||
In GF, this can be done with the <CODE>generate_trees</CODE> command.
|
||
As with grammar-based models, the quality of the SLM is better
|
||
if meaningless utterances are excluded from the corpus. Thus
|
||
a good way to generate an SLM from a GF grammar is by using
|
||
dependent types and filter the results through the type checker:
|
||
</P>
|
||
<PRE>
|
||
> generate_trees | put_trees -transform=solve | linearize
|
||
</PRE>
|
||
<P>
|
||
The method of creating statistical language model from corpora synthesized
|
||
from GF grammars is applied and evaluated in (Jonson 2006).
|
||
</P>
|
||
<P>
|
||
<B>Exercise</B>. Measure the size of the corpus generated from
|
||
<CODE>SmartEng</CODE> (defined <a href="#secsmarthouse">here</a>), with and without type checker filtering.
|
||
</P>
|
||
|
||
<!-- html code generated by txt2tags 2.3 (http://txt2tags.sf.net) -->
|
||
<!-- cmdline: txt2tags -thtml -\-toc gf-tutorial.txt -->
|
||
</BODY></HTML>
|