GF Resource Grammar Library
Author: Aarne Ranta
Last update: %%date(%c)

% NOTE: this is a txt2tags file.
% Create an html file from this file using:
% txt2tags --toc gf-resource.txt

%!target:html

%!postproc(html): #NEW

#NEW

==GF = Grammatical Framework==

GF is a grammar formalism based on functional programming and type theory.

GF was designed to be nice for //ordinary programmers// to use: by this we mean programmers without training in linguistics.

The mission of GF is to make natural-language applications available for ordinary programmers, in tasks like
- software documentation
- domain-specific translation
- human-computer interaction
- dialogue systems


Thus GF is //not// primarily another theoretical framework for linguists.

#NEW

==Multilingual grammars==

A GF grammar consists of an abstract syntax and a set of concrete syntaxes.

**Abstract syntax**: language-independent representation
```
cat Prop ; Nat ;
fun Even : Nat -> Prop ;
fun NInt : Int -> Nat ;
```

**Concrete syntax**: mapping from abstract syntax trees to strings in a language (English, French, German, Swedish,...)
```
lin Even x = {s = x.s ++ "is" ++ "even"} ;     -- English
lin Even x = {s = x.s ++ "est" ++ "pair"} ;    -- French
lin Even x = {s = x.s ++ "ist" ++ "gerade"} ;  -- German
lin Even x = {s = x.s ++ "är" ++ "jämnt"} ;    -- Swedish
```

We can **translate** between languages via the abstract syntax:
```
4 is even       4 ist gerade
          \    /
       Even (NInt 4)
          /    \
4 est pair      4 är jämnt
```

But is it really so simple?

#NEW

==Difficulties with concrete syntax==

Most languages have rules of **inflection**, **agreement**, and **word order**, which have to be obeyed when putting together expressions.

The previous multilingual grammar breaks these rules in many situations:
- //2 and 3 is even//
- //la somme de 3 et de 5 est pair//
- //wenn 2 ist gerade, dann 2+2 ist gerade//
- //om 2 är jämnt, 2+2 är jämnt//


All these sentences are grammatically incorrect.
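
The first of these incorrect sentences can be reproduced with a small self-contained grammar. The following is a minimal sketch, not part of the library: the module names ``Arith`` and ``ArithEng`` and the function ``And`` are hypothetical additions made only for this illustration, and each module is assumed to live in its own file.

```
-- File Arith.gf (hypothetical module name)
abstract Arith = {
  cat Prop ; Nat ;
  fun Even : Nat -> Prop ;
  fun NInt : Int -> Nat ;
  fun And  : Nat -> Nat -> Nat ;  -- hypothetical: conjunction of two numbers
}

-- File ArithEng.gf (hypothetical module name)
concrete ArithEng of Arith = {
  lincat Prop, Nat = {s : Str} ;              -- plain strings, no features
  lin Even x  = {s = x.s ++ "is" ++ "even"} ;
  lin NInt i  = {s = i.s} ;                   -- predefined Int is linearized as {s : Str}
  lin And x y = {s = x.s ++ "and" ++ y.s} ;
}
```

With this grammar, the tree ``Even (And (NInt 2) (NInt 3))`` is linearized as "2 and 3 is even": since ``Nat`` carries no number feature, the verb has no way to agree with the plural subject.
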
#NEW

==Solving the difficulties==

GF has tools for expressing the linguistic rules that are needed to produce correct translations in different languages.

Instead of just strings, we need **parameters**, **tables**, and **record types**. For instance, French:
```
param Mod = Ind | Subj ;
param Gen = Masc | Fem ;

lincat Nat  = {s : Str ; g : Gen} ;
lincat Prop = {s : Mod => Str} ;

lin Even x = {
  s = table {
    m => x.s ++
         case m of {Ind => "est" ; Subj => "soit"} ++
         case x.g of {Masc => "pair" ; Fem => "paire"}
    }
  } ;
```

To learn more about these constructs, consult GF documentation, e.g. the [../../../doc/tutorial/gf-tutorial2.html New Grammarian's Tutorial]. However, in what follows we will show how to avoid learning them and still write linguistically correct grammars.

#NEW

==Language + Libraries==

Writing natural language grammars still requires theoretical knowledge about the language.

Which kind of programmer is easier to find?
- one who can write a sorting algorithm
- one who can write a grammar for Swedish determiners


In mainstream programming, sorting algorithms are not written by hand but taken from **libraries**.

In the same way, we want to create grammar libraries that encapsulate basic linguistic facts.

Cf. the Java success story: the language is only half of the success; the libraries are the other half.

#NEW

==Example of library-based grammar writing==

To define a Swedish expression of a mathematical predicate from scratch:
```
Even x =
  let jämn = case <x.n,x.g> of {
    <Sg,Utr>   => "jämn" ;
    <Sg,Neutr> => "jämnt" ;
    <Pl,_>     => "jämna"
    }
  in
  {s = table {
    Main => x.s ! Nom ++ "är" ++ jämn ;
    Inv  => "är" ++ x.s ! Nom ++ jämn ;
    Sub  => x.s ! Nom ++ "är" ++ jämn
    }
  }
```

To use library functions for syntax and morphology:
```
Even = predA (regA "jämn") ;
```

For the French version, we write
```
Even = predA (regA "pair") ;
```

#NEW

==Questions in grammar library design==

What should there be in the library?
- morphology, lexicon, syntax, semantics,...

How do we organize and present the library?
- division into modules, level of granularity
- "school grammar" vs. sophisticated linguistic concepts


Where do we get the data from?
- automatic extraction or hand-writing?
- reuse of existing resources?


Extra constraint: we want open-source free software and hence cannot use existing proprietary resources.

#NEW

==Answers to questions in grammar library design==

The current GF resource grammar library has made the following decisions:

The library has, for each language
- complete morphology, some lexicon (500 words), representative fragment of syntax, very little semantics


Organization and presentation:
- division into top-level (API) modules, and internal modules (only interesting for resource implementors)
- the API is, as much as possible, common to the different languages
- we favour "school grammar" concepts rather than innovative linguistic theory


Where do we get the data from?
- morphology and syntax are hand-written
- the 500-word lexicon is hand-written, but a tool is provided for automatic lexicon extraction
- we have not reused existing resources


The resource grammar library is entirely open-source free software (under the GNU GPL).

#NEW

==The scope of a resource grammar library for a language==

All morphological paradigms

Basic lexicon of structural, common, and irregular words

Basic syntactic structures

Currently,
- //no// semantics,
- //no// language-specific structures if not necessary for expressivity.


#NEW

==Success criteria==

Grammatical correctness

Semantic coverage: you can express whatever you want.

Usability as library for non-linguists.

(Bonus for linguists:) nice generalizations w.r.t. language families, using the module system of GF.

#NEW

==These are not our success criteria==

Language coverage: to be able to parse all expressions.

Example: the French //passé simple// tense, although covered by the morphology, is not used in the language-independent API, but only the //passé composé// is.
However, an application accessing the French-specific (or Romance-specific) modules can use the passé simple.

Semantic correctness: only to produce meaningful expressions.

Example: the following sentences can be generated
```
colourless green ideas sleep furiously

the time is seventy past forty-two
```
However, an application grammar can use a domain-specific semantics to guarantee semantic well-formedness.

(Warning for linguists:) theoretical innovation in syntax is not among the goals (and it would be hidden from users anyway!).

#NEW

==So where is semantics?==

GF incorporates a **Logical Framework** and is therefore capable of expressing logical semantics //à la// Montague or any other flavour, including anaphora and discourse.

But we do //not// try to give semantics once and for all for the whole language.

Instead, we expect semantics to be given in **application grammars** built on semantic models of different domains.

Example application: number theory
```
fun Even : Nat -> Prop ;           -- a mathematical predicate

lin Even = predA (regA "even") ;   -- English translation
lin Even = predA (regA "pair") ;   -- French translation
lin Even = predA (regA "jämn") ;   -- Swedish translation
```
How could the resource predict that just //these// translations are correct in this domain?

Application grammars are built by experts of these domains who, thanks to resource grammars, no longer need to be experts in linguistics.

#NEW

==Languages==

The current GF Resource Project covers ten languages:
- ``Dan``ish
- ``Eng``lish
- ``Fin``nish
- ``Fre``nch
- ``Ger``man
- ``Ita``lian
- ``Nor``wegian
- ``Rus``sian
- ``Spa``nish
- ``Swe``dish


The first three letters (``Dan`` etc.) are used in grammar module names.

#NEW

==Library structure 1: language-independent API==

- ``Lang`` is the top module collecting all of the following.
- syntactic ``Categories`` (parts of speech, word classes), e.g.
```
V ; NP ; CN ; Det ;  -- verb, noun phrase, common noun, determiner
```
- ``Rules`` for combining words and phrases, e.g.
```
DetNP : Det -> CN -> NP ;  -- combine Det and CN into NP
```
- the most common ``Structural`` words (determiners, conjunctions, pronouns) (now 83), e.g.
```
and_Conj : Conj ;
```
- ``Numerals``, number words from 1 to 999,999 with their inflections, e.g.
```
n8 : Digit ;
```
- ``Basic`` lexicon of (now 218) frequent everyday words
```
man_N : N ;
```


In addition, and not included in ``Lang``, there is
- ``SwadeshLex``, lexicon of (now 206) words from the [http://en.wiktionary.org/wiki/Swadesh_List Swadesh list], e.g.
```
squeeze_V : V ;
```


Of course, there is some overlap between ``SwadeshLex`` and the other modules.

#NEW

==Library structure 2: language-dependent modules==

- morphological ``Paradigms``, e.g. Swedish
```
mkN : Str -> Str -> Str -> Str -> Gender -> N ;  -- worst-case nouns
mkN : Str -> N ;                                 -- regular nouns
```
- (in some languages) irregular ``Verbs``, e.g.
```
angripa_V = irregV "angripa" "angrep" "angripit" ;
```
- (not yet available) ``Ext``ended syntax with language-specific rules
```
PassBli : V2 -> NP -> VP ;  -- bli överkörd av ngn
```


#NEW

==How much can be language-independent?==

For the ten languages we have considered, it //is// possible to implement the current API.

Reservations:
- this does not necessarily extend to all other languages
- this does not necessarily cover the most idiomatic expressions of each language
- this may not be the easiest API to implement (e.g.
negation and inversion with //do// in English suggest that some other structure would be more natural)
- it is not guaranteed that the same structure has the same semantics in all the different languages


#NEW

==Library structure: language-independent API==

%#center
[Lang.gif]
%#center

#NEW

==API documentation==

[Categories.html Categories]

[Rules.html Rules]

Two alternative views on sentence formation by predication: [Clause.html Clause], [Verbphrase.html Verbphrase]

[Structural.html Structural]

[Time.html Time]

[Basic.html Basic]

[Lang.html Lang]

See also [../../resource-1.0/doc/gfdoc resource v 1.0 documentation], now implemented for English, German, and Swedish.

#NEW

==Paradigms documentation==

[ParadigmsEng.html English paradigms]
[BasicEng.html example use of English paradigms]
[VerbsEng.html English verbs]

[ParadigmsFin.html Finnish paradigms]
[BasicFin.html example use of Finnish paradigms]

[ParadigmsFre.html French paradigms]
[BasicFre.html example use of French paradigms]
[VerbsFre.html French verbs]

[ParadigmsIta.html Italian paradigms]
[BasicIta.html example use of Italian paradigms]
[BeschIta.html Italian verb conjugations]

[ParadigmsNor.html Norwegian paradigms]
[BasicNor.html example use of Norwegian paradigms]
[VerbsNor.html Norwegian verbs]

[ParadigmsSpa.html Spanish paradigms]
[BasicSpa.html example use of Spanish paradigms]
[BeschSpa.html Spanish verb conjugations]

[ParadigmsSwe.html Swedish paradigms]
[BasicSwe.html example use of Swedish paradigms]
[VerbsSwe.html Swedish verbs]

#NEW

==Use as top-level grammar: testing==

Import a set of ``LangX`` grammars:
```
i english/LangEng.gf
i swedish/LangSwe.gf
```
Alternatively, you can ``make`` a precompiled package of all the languages by using ``lib/resource/Makefile``:
```
make
gf langs.gfcm
```
Then you can test with translation, random generation, morphological analysis...
```
> p -lang=LangEng "I have loved her." | l -lang=LangFre
Je l' ai aimée.

> gr -cat=NP | l -multi
The sock
Strumpan
Strømpen
La media
La calza
La chaussette
Sukka
```

#NEW

==Use as top-level grammar: language learning quizzes==

Morpho quiz with words (e.g. French verbs):
```
i french/VerbsFre.gf
mq -cat=V
```
Morpho quiz with phrases (e.g. Swedish clauses):
```
i swedish/LangSwe.gf
mq -cat=Cl
```
Translation quiz with sentences (e.g. sentences from English to Swedish):
```
i english/LangEng.gf
i swedish/LangSwe.gf
tq -cat=S LangEng LangSwe
```

#NEW

==Use as library==

Import directly by ``open``:
```
concrete AppNor of App = open LangNor, ParadigmsNor in {...}
```
(Note for the users of GF 2.1 and older: the dummy ``reuse`` modules and their bulky ``.gfr`` versions are no longer needed!)

If you need to convert resource records to strings, and don't want to know the concrete type (as you never should), you can use
```
Predef.toStr : (L : Type) -> L -> Str ;
```
``L`` must be a linearization type. For instance,
```
toStr LangNor.CN (ModAP (PositADeg old_ADeg) (UseN car_N))
---> "gammel bil"
```

#NEW

==Use as library through parser==

You can use the parser with a ``LangX`` grammar when developing a resource.

Using the ``-v`` option shows if the parser fails because of unknown words.
```
> p -cat=S -v -lexer=words "jag ska åka till Chalmers"
unknown tokens [TS "åka",TS "Chalmers"]
```
Then try to select words that ``LangX`` recognizes:
```
> p -cat=S "jag ska springa till Danmark"
UseCl (PosTP TFuture ASimul)
  (AdvCl (SPredV i_NP run_V)
    (AdvPP (PrepNP to_Prep (UsePN (PNCountry Denmark)))))
```
Use these API structures and extend the vocabulary to match your needs.
```
åka_V = lexV "åker" ;
Chalmers = regPN "Chalmers" neutrum ;
```

#NEW

==Syntax editor as library browser==

You can run the syntax editor on ``LangX`` to find resource API functions through context-sensitive menus. For instance, the shell command
```
gfeditor LangEng.gf LangFre.gf
```
opens the editor with English and French views.
The [http://www.cs.chalmers.se/%7Eaarne/GF2.0/doc/javaGUImanual/javaGUImanual.htm Editor User Manual] gives more information on the use of the editor.

A restriction of the editor is that it does not give access to ``ParadigmsX`` modules. An IDE extending the editor to a grammar programming tool is work in progress.

#NEW

==Example application: a small translation system==

In this system, you can express questions and answers of the following forms:
```
Who chases mice ?
Whom does the lion chase ?
The dog chases cats.
```
We build the abstract syntax in two phases:
- [example/Questions.gf Questions] defines question and answer forms independently of domain
- [example/Animals.gf Animals] defines a lexicon with animals and things that animals do.


The concrete syntax of English is built in three phases:
- [example/HandQuestionsI.gf QuestionsI] is a parametrized module using the API module ``Resource``.
- [example/QuestionsEng.gf QuestionsEng] is an instantiation of the API with ``ResourceEng``.
- [example/AnimalsEng.gf AnimalsEng] is a concrete syntax of ``Animals`` using ``ParadigmsEng`` and ``VerbsEng``.


The concrete syntax of Swedish is built upon ``QuestionsI`` in a similar way, with the modules [example/QuestionsSwe.gf QuestionsSwe] and [example/AnimalsSwe.gf AnimalsSwe].

The concrete syntax of French consists similarly of the modules [example/QuestionsFre.gf QuestionsFre] and [example/AnimalsFre.gf AnimalsFre].

#NEW

==Compiling the example application==

The resources are bulky, and it therefore takes a lot of time and memory to load the grammars. However, they can be compiled into the ``gfcm`` (**GF canonical multilingual**) format, which is almost one thousand times smaller and faster to load for this set of grammars.

To produce an end-user multilingual grammar ``animals.gfcm``, write the sequence of compilation commands in a ``gfs`` (**GF script**) file, say [example/mkAnimals.gfs ``mkAnimals.gfs``], and then call GF with
```
gf <mkAnimals.gfs
```
To try out the grammar,
```
> i animals.gfcm
> gr | l -multi
vem jagar hundar ?
qui chasse des chiens ?
who chases dogs ?
```

#NEW

==Grammar writing by examples==

(New in GF 2.3)

You can use the resource grammar as a parser on a special file format, ``.gfe`` ("GF examples"). Here is the real source, [example/QuestionsI.gfe QuestionsI.gfe], which generated [example/QuestionsI.gf QuestionsI.gf] when you executed the GF command
```
i -ex AnimalsEng.gf
```
Since ``QuestionsI`` is an incomplete module ("functor"), it need only be built once. This is why only the first command in ``mkAnimals.gfs`` needs the flag ``-ex``.

Of course, the grammar of any language can be created by parsing any language, as long as they have a common resource API. The use of the English resource is generally recommended, because it is smaller and faster to parse than the other languages.

#NEW

==Constants and variables in examples==

The file [example/QuestionsI.gfe QuestionsI.gfe] uses as resource ``LangEng``, which contains all resource syntax and a lexicon of ca. 300 words. A linearization rule, such as
```
lin Who love_V2 man_N = in Phr "who loves men ?" ;
```
uses as argument variables constants for words that can be found in the lexicon. It is due to this that the example can be parsed. When the resulting rule,
```
lin Who love_V2 man_N =
  QuestPhrase (UseQCl (PosTP TPresent ASimul)
    (QPredV2 who8one_IP love_V2 (IndefNumNP NoNum (UseN man_N)))) ;
```
is read by the GF compiler, the identifiers ``love_V2`` and ``man_N`` are not treated as constants, but, following the normal binding rules of functional languages, as bound variables. This is what gives the example method the generality that is needed.
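
To make the binding explicit: because ``love_V2`` and ``man_N`` occur as argument variables on the left-hand side, the resulting rule should be equivalent, up to renaming of bound variables, to the sketch below. The names ``v`` and ``n`` are chosen here only for illustration and do not appear in the generated file.

```
-- same rule as above, with the bound variables renamed to v and n
lin Who v n =
  QuestPhrase (UseQCl (PosTP TPresent ASimul)
    (QPredV2 who8one_IP v (IndefNumNP NoNum (UseN n)))) ;
```

Only ``who8one_IP`` and the syntactic functions remain constants taken from the resource API; the verb and the noun are supplied by whoever applies ``Who``.
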

To write linearization rules by examples one thus has to know at least one abstract syntax constant for each category for which one needs a variable.

#NEW

==Extending the lexicon on the fly==

The greatest limitation of the example method is that the lexicon may lack many of the words that are needed in examples. If parsing fails because of this, the compiler gives a list of unknown words in its error message. An obvious solution is, of course, to extend the resource lexicon and try again. A more light-weight solution is to add a **substitution** to the example. For instance, if you want the example "the pope" but the lexicon does not have the word "pope", you can write
```
lin Pope = in NP "the man" {man_N = regN "pope"} ;
```
The resulting linearization rule is initially
```
lin Pope = DefOneNP (UseN man_N) ;
```
but the substitution changes this to
```
lin Pope = DefOneNP (UseN (regN "pope")) ;
```
In this way, you do not have to extend the resource lexicon, but you need to open the Paradigms module to compile the resulting term.

Of course, the substituted expressions may come from another language than the main language of the example:
```
lin Pope = in NP "the man" {man_N = regN "pape" masculine} ;
```
If several substitutions are needed, semicolons are used as separators:
```
{man_N = regN "pope" ; walk_V = regV "pray"} ;
```

#NEW

==Implementation details: low-level files==

**For developers of resource grammars.** The modules listed in this section should never be imported in application grammars.

Each of the API implementations uses the following auxiliary resource modules:
- ``Types``, the morphological paradigms and word classes
- ``Morpho``, inflection machinery
- ``Syntax``, complex categories and their combinations


In addition, the following language-independent modules from ``lib/prelude`` are used.
- ``Predef``, operations whose definitions are hard-coded in GF
- ``Prelude``, generic string and boolean operations
- ``Coordination``, coordination structures for arbitrary categories


#NEW

==Implementation details: the structure of low-level files==

%#center
[Low.gif]
%#center

#NEW

==How to change a resource grammar?==

In many cases, the source of a bug is in one of the low-level modules. Try to trace it back there by starting from the high-level module.

(Much more to be written...)

#NEW

==How to write a resource grammar?==

Start with a more limited goal, e.g. to implement the ``stoneage`` grammar (``examples/stoneage``) for your language.

For this, you need
- most of ``Types``
- most of ``Morpho``
- some of ``Syntax``
- most of ``Paradigms``


A useful command to test ``oper``s:
```
i -retain MorphoRot.gf
cc regNoun "foo"
```
See also [../../resource-1.0/doc/Resource-HOWTO.html Resource-HOWTO] (under construction).

#NEW

==The use of parametrized modules==

In two language families, a lot of code is shared.
- Romance: French, Italian, Spanish
- Scandinavian: Danish, Norwegian, Swedish


The structure looks like this.

[]

#NEW

==Current status==

| Language  | v0.6 | v0.9 | v1.0 | Paradigms | Lexicon | Verbs |
| Arabic    |  -   |  -   |  +   |     X     |    X    |   -   |
| Danish    |  -   |  X   |  X   |     X     |    X    |   X   |
| English   |  X   |  X   |  X   |     X     |    X    |   X   |
| Finnish   |  X   |  +   |  X   |     X     |    X    |   0   |
| French    |  X   |  X   |  X   |     X     |    X    |   X   |
| German    |  X   |  -   |  X   |     X     |    X    |   X   |
| Italian   |  X   |  X   |  X   |     X     |    X    |   X   |
| Norwegian |  -   |  X   |  X   |     X     |    X    |   X   |
| Russian   |  X   |  X   |  X   |     X     |    X    |   -   |
| Spanish   |  -   |  X   |  X   |     X     |    X    |   X   |
| Swedish   |  X   |  X   |  X   |     X     |    X    |   X   |

- ``X`` = implemented (few exceptions may occur)
- ``+`` = implemented for a large part
- ``*`` = linguistic material ready for implementation
- ``-`` = not implemented
- ``0`` = not applicable


#NEW

==Known bugs and limitations==

(//The listed limitations are ones that do not follow from the table on the previous page//.)

Danish

English

Finnish: compiling the heuristic regular paradigms is slow; possessive and interrogative suffixes have no proper lexer.

French: no inverted questions

German

Italian: no binding of clitics with infinitive

Norwegian

Russian: missing rules for ordinal numbers

Spanish

Swedish

#NEW

==Obtaining it==

Get the grammar package from [http://sourceforge.net/project/showfiles.php?group_id=132285 GF Download Page]. The current libraries are in ``lib/resource-1.0``. Version 0.9 is in ``lib/resource-0.9``. Version 0.6 is in ``lib/resource-0.6``.

The very latest version of GF and its libraries is in the [Darcs repository http://www.cs.chalmers.se/Cs/Research/Language-technology/darcs/GF/doc/darcs.html].