Grammars as Software Libraries
Author: Aarne Ranta
Last update: %%date(%c)

% NOTE: this is a txt2tags file.
% Create an html file from this file using:
% txt2tags --toc gslt-sem-2006.txt

%!target:html

%!postproc(html): #NEW <!-- NEW -->

#NEW

==Setting==

Current funding
- VR: Library-Based Grammar Engineering (2006-2008)
  - Lars Borin (Swedish)
  - Robin Cooper (Computational Linguistics)
  - Sibylle Schupp and Aarne Ranta (Computer Science)


Previous funding
- VR: Record Types and Dialogue Semantics (2003-2005)
- VINNOVA: Interactive Language Technology (2001-2004)


Main applications
- TALK: multilingual and multimodal dialogue systems
- WebALT: multilingual generation of mathematical teaching material
- KeY: multilingual authoring of software specifications


#NEW

==People==

Staff contributions to grammar libraries:
- Björn Bringert
- Markus Forsberg
- Harald Hammarström
- Janna Khegai
- Aarne Ranta


Student projects on libraries:
- Inger Andersson & Therese Söderberg: Spanish morphology
- Ludmilla Bogavac: Russian morphology
- Ali El Dada: Arabic morphology and syntax
- Muhammad Humayoun: Urdu morphology
- Michael Pellauer: Estonian morphology


#NEW

==Software Libraries==

The main device of **division of labour** in programming.

Instead of writing a sorting algorithm over and over again,
programmers take it from a library. You write (in Haskell),
```
Data.List.sort xs
```
instead of a lot of code actually implementing sorting.

Practical advantages:
- division of labour
- faster development of new software
- quality guarantee and automatic improvements


#NEW

==Abstraction==

Libraries promote **abstraction**: you abstract away from details.

The use of libraries is therefore good programming style.

It is also **scientifically interesting** to create libraries:
you have to think about the abstractions in your domain of expertise.

Notice: libraries can bring abstraction to almost any language,
provided it has support for functions or macros.

#NEW

==Grammars as libraries?==

Example: we want to create a GUI (Graphical User Interface) button
that says //yes//, and **localize** it to different languages:
```
Yes   Ja   Kyllä   Oui   Ja   Sì
```
Possible ways to do this:
+ Look the word up in dictionaries for the different languages
```
yesButton english = button "Yes"
yesButton swedish = button "Ja"
yesButton finnish = button "Kyllä"
```
+ Hire more programmers to perform localization in different languages


#NEW

3. Use a library ``GUIText`` such that you can write
```
yesButton lang = button (render lang GUIText.Yes)
```

#NEW

==A slightly more advanced example==

This is what you often see as feedback from a program:
```
You have 1 messages.
```
Or perhaps with a little more thought:
```
You have 1 message(s).
```
The code that should be written is of course
```
mess n = "You have" +++ show n +++ messages ++ "."
  where
    messages = if n==1 then "message" else "messages"
    x +++ y  = x ++ " " ++ y   -- concatenation with a space in between
```
(E.g. VoiceXML gives support for this.)

#NEW

==Problems with the more advanced example==

The same as with "Yes": you have to know the words "you", "have", "message".

//Moreover//, you have to know the inflection of the equivalent of "message":
```
if n == 1 then "meddelande" else "meddelanden"
```
//Moreover//, you have to know how the noun agrees with different numbers
(e.g. Arabic):
```
if n == 1 then "risAlaö"
else if n == 2 then "risAlatAn"
else if n < 11 then "rasA'il"
else "risAlaö"
```

#NEW

==More problems with the advanced example==

You also have to know the case required by the verb "have"
(e.g. Finnish: nominative in singular, partitive in plural).

//Moreover//, you have to know the proper way to address the user politely:
```
Du har 3 meddelanden / Ni har 3 meddelanden
Vous avez 3 messages / Tu as 3 messages
```
(This can also depend on the country and the kind of program.)

#NEW

==A library-based solution==

In analogy with the "Yes" case, you write
```
mess lang n = render lang (MailText.YouHaveMessages n)
```
Hmm, is this so smart?
What if you want to say
```
You have 4 documents.
You have 5 jewels.
I have 7 surprises.
```
It is time to move from **canned text** to a **grammar**.

#NEW

==An improved library-based solution==

You may want to write
```
mess  lang n = render lang (Have PolYou (Num n Message))
sword lang n = render lang (Have FamYou (Num n Jewel))
surpr lang n = render lang (Have I (Num n Surprise))
```
For this purpose, you need a library with the following API
(Application Programmer's Interface):
```
Have    : NounPhrase -> NounPhrase -> Sentence
PolYou  : NounPhrase
FamYou  : NounPhrase
Num     : Int -> Noun -> NounPhrase
Message : Noun
```
You also need a top-level rendering function
```
render : Language -> Sentence -> String
```

#NEW

==An optimal solution?==

The library API for a language will certainly grow big and become difficult to use.
Why couldn't I just write
```
mess lang n = render lang (parse english "you have n messages")
```
To this end, the API should provide the top-level function
```
parse : Language -> String -> Sentence
```
The library that we will present actually has this as well!

#NEW

The only complication is that ``parse`` does not always return just one sentence.
There may be zero:
```
you have n mesaggse
```
or many:
```
Have PolYou  (Num n Message)
Have FamYou  (Num n Message)
Have PlurYou (Num n Message)
```

#NEW

==The components of a grammar library==

The library has **construction functions** like
```
Have   : NounPhrase -> NounPhrase -> Sentence
PolYou : NounPhrase
```
These functions build **grammatical structures**, which
can have different realizations in different languages.

Therefore we also need **realization functions**,
```
render : Language -> Sentence -> String
parse  : Language -> String -> [Sentence]
```
Both of them require major linguistic expertise to write - but,
once this is done, they can be used with very little linguistic
knowledge by application programmers!
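
To make the shape of such an API concrete, here is a minimal Haskell sketch, assuming a single language and a three-noun lexicon. It is a toy model, not the GF library: the data types mirror the construction functions named above, ``render`` is a hand-written English realization, and ``parse`` is a brute-force inverse that simply tries every candidate tree. The helper names ``np``, ``nounForm`` and ``numbersIn`` are hypothetical.

```haskell
-- Toy model of the API above (illustration only, not the GF library).

data Language = English deriving (Eq, Show)

data Noun = Message | Jewel | Surprise
  deriving (Eq, Show, Enum, Bounded)

-- Construction functions are modelled as data constructors.
data NounPhrase = I | PolYou | FamYou | Num Int Noun
  deriving (Eq, Show)

data Sentence = Have NounPhrase NounPhrase
  deriving (Eq, Show)

-- Realization: trees to strings (English only in this sketch).
render :: Language -> Sentence -> String
render English (Have subj obj) = unwords [np subj, "have", np obj]

np :: NounPhrase -> String
np I           = "I"
np PolYou      = "you"   -- English does not mark politeness
np FamYou      = "you"
np (Num n cn)  = show n ++ " " ++ nounForm n cn

-- Number agreement of the noun with the numeral.
nounForm :: Int -> Noun -> String
nounForm n cn = if n == 1 then stem else stem ++ "s"
  where
    stem = case cn of
      Message  -> "message"
      Jewel    -> "jewel"
      Surprise -> "surprise"

-- Parsing by generate-and-test: keep every tree that renders to the input.
-- The result is a list because there may be zero or many analyses.
parse :: Language -> String -> [Sentence]
parse lang s =
  [ t
  | subj <- [I, PolYou, FamYou]
  , cn   <- [minBound .. maxBound]
  , n    <- numbersIn s
  , let t = Have subj (Num n cn)
  , render lang t == s
  ]

-- All tokens of the input that read as integers.
numbersIn :: String -> [Int]
numbersIn s = [ n | w <- words s, [(n, "")] <- [reads w] ]
```

In this sketch, ``parse English "you have 3 messages"`` returns two trees (one with ``PolYou``, one with ``FamYou``), and a misspelled input returns the empty list, which is exactly the complication noted above.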
#NEW

==Implementing a grammar library in GF==

GF = Grammatical Framework

Those who know GF will already have recognized the introduction as a
seduction argument leading to GF.

In GF,
- construction functions = **abstract syntax**
- realization functions = **concrete syntax**


#NEW

Simplest possible example:
```
abstract GUIText = {
  cat Text ;
  fun Yes : Text ;
}
concrete GUITextEng of GUIText = {
  lin Yes = ss "yes" ;
}
concrete GUITextFin of GUIText = {
  lin Yes = ss "kyllä" ;
}
```

#NEW

==Linearization and parsing==

The realization function is, for each language, implemented by
**linearization rules** (``lin``).

The linearization rules directly give the ``render`` method:
```
render english x = GUITextEng.lin x
```
The GF formalism moreover has the property of **reversibility**:
a set of linearization rules automatically generates a parser as well.

%While reversibility has a minor importance for the applications
%shown above, it is crucial for other applications of GF grammars.

#NEW

==Applying GF==

**multilingual grammar** = abstract syntax + concrete syntaxes

Examples of the idea:
- multilingual authoring
- domain-specific translation
- dialogue systems


#NEW

==Domain, ontology, idiom==

An abstract syntax represents
- a **semantic model**
- an **ontology**


The concrete syntax defines how the concepts of the ontology
are represented in a language.

The following requirements are made:
- linguistic correctness (inflection, agreement, word order,...)
- semantic correctness (express the intended concepts)
- conformance to the domain idiom (use proper terms and phrasing)


Benefit: translation via a semantic model of the domain can reach high quality.

Problem: the expertise of both a linguist and a domain expert is required.

#NEW

==Example domain==

Arithmetic of natural numbers: abstract syntax
```
cat Prop ; Nat ;
fun Even : Nat -> Prop ;
```
**Concrete syntax**: mapping from abstract syntax trees to strings
in a language (English, French, German, Swedish,...)
```
lin Even x = {s = x.s ++ "is" ++ "even"} ;
lin Even x = {s = x.s ++ "est" ++ "pair"} ;
lin Even x = {s = x.s ++ "ist" ++ "gerade"} ;
lin Even x = {s = x.s ++ "är" ++ "jämnt"} ;
```

#NEW

==Translation system==

We can **translate** between languages via the abstract syntax:
```
4 is even             4 ist gerade
          \           /
          Even (NInt 4)
          /           \
4 est pair            4 är jämnt
```
This idea is used e.g. in the WebALT project to generate mathematical
teaching material in 7 languages.

But is it really so simple?

#NEW

==Difficulties with concrete syntax==

The previous multilingual grammar breaks the rules of grammar in many situations:
```
2 and 3 is even
la somme de 3 et de 5 est pair
wenn 2 ist gerade, dann 2+2 ist gerade
om 2 är jämnt, 2+2 är jämnt
```
All these sentences are grammatically incorrect.

#NEW

==Solving the difficulties==

GF can express the linguistic rules that are needed to
produce correct translations. (Expressive power
between TAG and HPSG, but the language is more high-level.)

Instead of just strings, we need **parameters**, **tables**,
and **record types**. For instance, French:
```
param Mod = Ind | Subj ;
param Gen = Masc | Fem ;

lincat Nat  = {s : Str ; g : Gen} ;
lincat Prop = {s : Mod => Str} ;

lin Even x = {s = table {
  m => x.s ++
       case m of {Ind => "est" ; Subj => "soit"} ++
       case x.g of {Masc => "pair" ; Fem => "paire"}
  }
} ;
```
Linguistic knowledge dominates in the size of this grammar.

#NEW

==Application grammars vs. resource grammars==

Application grammar ("semantic grammar")
- abstract syntax: domain semantics
- concrete syntax: "controlled language"
- author: domain expert


Resource grammar ("syntactic grammar")
- abstract syntax: linguistic structures
- concrete syntax: (approximation of) entire language
- author: linguist


#NEW

==Concrete syntax using library==

Language-independent API
```
cat S ; NP ; A ;
fun predA : NP -> A -> S ;
oper regA : Str -> A ;
```
Implementation for four languages
```
lincat
  Prop = S ;
  Nat  = NP ;
lin
  Even x = predA x (regA "even") ;    -- English
  Even x = predA x (regA "jämn") ;    -- Swedish
  Even x = predA x (regA "pair") ;    -- French
  Even x = predA x (regA "gerade") ;  -- German
```
Notice: the choice of adjective is domain expert knowledge.

#NEW

==Design questions for the grammar library==

What should there be in the library?
- morphology, lexicon, syntax, semantics,...


How do we organize and present the library?
- division into modules, level of granularity
- "school grammar" vs. sophisticated linguistic concepts


Where to get the data from?
- automatic extraction or hand-writing?
- reuse of existing resources?


Extra constraint: we want open-source free software and
hence cannot use existing proprietary resources.

#NEW

==Design decisions==

The current GF resource grammar library has, for each language,
- complete morphology
- lexicon of the most important structural words
- test lexicon of ca. 300 content words
- representative fragment of syntax (cf. CLE (Core Language Engine))
- rather flat semantics (cf. Quasi-Logical Form of CLE)


Organization and presentation:
- top-level (API) modules
- internal modules (only interesting for resource implementors)
- we favour "school grammar" concepts rather than innovative linguistic theory
- tool ``gfdoc`` for generating HTML from grammars


#NEW

==Design decisions, cont'd==

Where do we get the data from?
- morphology and syntax are hand-written
- the test lexicon is hand-written
- APIs for manual lexicon extension
- tool for automatic lexicon extraction
- we have not reused existing resources


The resource grammar library is entirely open-source free software
(under the GNU GPL).

#NEW

==Success criteria==

Grammatical correctness of everything generated.

Semantic coverage: you can express whatever you want.

Usability as a library for non-linguists.

(Bonus for linguists:) nice generalizations w.r.t. language
families, using the module system of GF.

#NEW

==These are not our success criteria==

Language coverage: to be able to parse all expressions.
- Example: French //passé simple//, although covered by the morphology, is not available through the language-independent API.


Semantic correctness: only to produce meaningful expressions.
- Example: the following sentences can be generated
```
colourless green ideas sleep furiously
the time is seventy past forty-two
```

(Warning for linguists:) theoretical innovation in syntax is not
among the goals (and it would be hidden from users anyway!).

#NEW

==So where is semantics?==

Application grammars typically use domain-specific semantics
to guarantee semantic well-formedness.

GF incorporates a **Logical Framework** and is therefore
capable of expressing logical semantics //à la// Montague
or any other flavour, including anaphora and discourse.

But we do //not// try to give semantics once and
for all for the whole language.

Instead, we expect semantics to be given in
**application grammars** built on semantic models
of different domains.
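
The division of labour between syntactic and semantic well-formedness can be illustrated by analogy in Haskell. The following sketch is not GF code, and all names in it (``renderPast``, ``Time``, ``mkTime``, ``renderTime``) are hypothetical: ``renderPast`` stands for a resource-level function that accepts any two numbers, while the smart constructor ``mkTime`` stands for an application-level domain model that only admits meaningful clock times (here assumed to be 1-59 minutes past an hour on a 12-hour clock).

```haskell
-- Hypothetical analogy (not GF code): resource level vs. application level.

-- Resource level: any two numbers yield a grammatical sentence,
-- including nonsense such as "the time is 70 past 42".
renderPast :: Int -> Int -> String
renderPast m h = "the time is " ++ show m ++ " past " ++ show h

-- Application level: a clock time in the domain model.
data Time = Time Int Int
  deriving (Eq, Show)

-- Smart constructor enforcing the assumed domain constraints:
-- minutes 1..59, hours 1..12.
mkTime :: Int -> Int -> Maybe Time
mkTime m h
  | m >= 1 && m <= 59 && h >= 1 && h <= 12 = Just (Time m h)
  | otherwise                              = Nothing

-- Only well-formed times reach the resource-level rendering.
renderTime :: Time -> String
renderTime (Time m h) = renderPast m h
```

With these definitions ``mkTime 70 42`` is ``Nothing``, so the nonsensical sentence cannot be produced through the application-level interface, although ``renderPast 70 42`` still yields a grammatical string.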
#NEW

==Levels of representation==

No fixed set of levels; here are some examples:
```
2 is even
2 är jämnt
```
In ``Arithm``
```
Even 2
```
In ``Predication`` (high level resource API)
```
predA (IntNP 2) (regA "even")
predA (IntNP 2) (regA "jämn")
```
In ``Lang`` (ground level resource API)
```
UseCl TPres ASimul PPos
  (PredVP (UsePN (IntPN 2)) (UseComp (CompAP (PositA (regA "even")))))
UseCl TPres ASimul PPos
  (PredVP (UsePN (IntPN 2)) (UseComp (CompAP (PositA (regA "jämn")))))
```

#NEW

==Languages==

The current GF Resource Project covers ten languages:
- ``Dan``ish
- ``Eng``lish
- ``Fin``nish
- ``Fre``nch
- ``Ger``man
- ``Ita``lian
- ``Nor``wegian
- ``Rus``sian
- ``Spa``nish
- ``Swe``dish


The first three letters (``Dan`` etc.) are used in grammar module names.

In addition, we have parts (morphology) of Arabic, Estonian, and Urdu.

#NEW

==Library structure 1: language-independent API==

[Lang.png]

[Resource index page index.html]

[Examples of each category gfdoc/Cat.html]

#NEW

==Library structure 2: language-dependent modules==

- morphological paradigms, e.g. ``ParadigmsSwe``
```
mkN  : (x1,_,_,x4 : Str) -> N ;  -- worst-case noun constructor
regN : Str -> N ;                -- regular noun constructor
```
- (in some languages) irregular verbs (and other words), e.g. ``IrregSwe``
```
angripa_V = irregV "angripa" "angrep" "angripit" ;
```
- (not yet available) extended syntax with language-specific rules, e.g. ``ExtNor``
```
PostPoss : CN -> Pron -> NP ;  -- bilen min
```

#NEW

==How much can be language-independent?==

For the ten languages we have considered, it //is// possible to
implement the current API.

Reservations:
- does not necessarily extend to all other languages
- does not necessarily cover the most idiomatic expressions of each language
- may not be the easiest API to implement (e.g. negation and inversion with //do// in English suggest that some other structure would be more natural)
- no guarantee that the same structure has the same semantics in all languages


#NEW

==Parametrized modules==

We can go even further than sharing an abstract API:
we can share implementations among related languages.

Exploited in two families:
- Romance: French, Italian, Spanish
- Scandinavian: Danish, Norwegian, Swedish


[The declarations of Scandinavian syntax differences ../scandinavian/DiffScand.gf]

#NEW

==Using the library==

Simplest case: use the API in the same way for all languages.
- **+** grammar localization for free
- **-** not the best idioms for each language


In practice: use the API in different ways for different languages
```
Name x y = predNP (GenCN x (regN "name")) (StringNP y)  -- Eng: x's name is y
Name x y = predV2 x heta_V2 (StringNP y)                -- Swe: x heter y
```
This amounts to **compile-time transfer**.

Writing an application grammar requires more native-speaker knowledge
than writing a resource grammar!

#NEW

==Lexicon extension==

We cannot anticipate all vocabulary needed in application grammars.

Therefore we provide high-level paradigms to add new words.
Example heuristic, from [ParadigmsSwe gfdoc/ParadigmsSwe.html]:
```
regV : (leker : Str) -> V ;

regV leker = case leker of {
  lek + ("a" | "ar") => conj1 (lek + "a") ;
  lek + "er"         => conj2 (lek + "a") ;
  bo  + "r"          => conj3 bo
  }
```

#NEW

==Example low-level morphological definition==

```
decl2Noun : Str -> N = \bil ->
  let
    bb : Str * Str = case bil of {
      pojk + "e"                 => <pojk + "ar",     bil + "n"> ;
      nyck + "e" + l@("l" | "r") => <nyck + l + "ar", bil + "n"> ;
      sock + "e" + "n"           => <sock + "nar",    sock + "nen"> ;
      _                          => <bil + "ar",      bil + "en">
      } ;
  in mkN bil bb.p2 bb.p1 (bb.p1 + "na") ;
```

#NEW

==Some formats that can be generated from GF grammars==

```
-printer=lbnf           BNF Converter, thereby C/Bison, Java/JavaCup
-printer=fullform       full-form lexicon, short format
-printer=xml            XML: DTD for the pg command, object for st
-printer=gsl            Nuance GSL speech recognition grammar
-printer=jsgf           Java Speech Grammar Format
-printer=srgs_xml       SRGS XML format
-printer=srgs_xml_prob  SRGS XML format, with weights
-printer=slf            a finite automaton in the HTK SLF format
-printer=regular        a regular grammar in a simple BNF
-printer=gfc-prolog     gfc in prolog format (also pg)
```

#NEW

==Corpus generation==

The most general format is **multilingual treebank** generation:
```
> gr -tr | l -multi
UseCl TCond AAnter PPos (PredVP (DetCN (DetSg DefSg NoOrd)
  (AdjCN (PositA young_A) (UseN man_N))) (ComplV2 love_V2 (UsePron she_Pron)))
den unga mannen skulle ha älskat henne
der junge Mann würde sie geliebt haben
le jeune homme l' aurait aimée
the young man would have loved her
```
A special case is corpus generation, either exhaustive or random,
with or without probability weights attached to constructors.

Cf. Rebecca Jonson this afternoon.
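
The idea of exhaustive treebank generation can be sketched in a few lines of Haskell. This is a toy illustration, not how GF implements ``gr`` and ``l -multi``: it assumes a tiny abstract syntax based on the arithmetic example, three hand-written linearizations, and a hypothetical extra predicate ``Odd`` so that there is more than one tree per numeral. Random generation is left out because it would need a random-number library.

```haskell
-- Toy multilingual treebank generator (illustration only, not GF).

-- Abstract syntax: a fragment of the arithmetic example.
-- Odd is a hypothetical addition, not part of the slides' grammar.
data Nat  = NInt Int           deriving (Eq, Show)
data Prop = Even Nat | Odd Nat deriving (Eq, Show)

data Lang = Eng | Ger | Fre
  deriving (Eq, Show, Enum, Bounded)

-- Concrete syntax: one linearization function shared by all languages,
-- with the language-specific words looked up separately.
linearize :: Lang -> Prop -> String
linearize lang (Even (NInt n)) = unwords [show n, copula lang, evenA lang]
linearize lang (Odd  (NInt n)) = unwords [show n, copula lang, oddA lang]

copula :: Lang -> String
copula Eng = "is"
copula Ger = "ist"
copula Fre = "est"

evenA :: Lang -> String
evenA Eng = "even"
evenA Ger = "gerade"
evenA Fre = "pair"

oddA :: Lang -> String
oddA Eng = "odd"
oddA Ger = "ungerade"
oddA Fre = "impair"

-- Exhaustive generation: every tree over the given numerals.
allTrees :: [Int] -> [Prop]
allTrees ns = [ p (NInt n) | n <- ns, p <- [Even, Odd] ]

-- A treebank pairs each tree with its linearization in every language.
treebank :: [Int] -> [(Prop, [String])]
treebank ns =
  [ (t, [ linearize l t | l <- [minBound .. maxBound] ])
  | t <- allTrees ns
  ]
```

For instance, ``treebank [4]`` contains the tree ``Even (NInt 4)`` paired with its English, German and French strings. A monolingual corpus is obtained by keeping only one column of the treebank.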
#NEW

==Use as program components==

Haskell, Java, Prolog

Parsing, generation, translation

Push-button creation of spoken language translators (using Nuance)

#NEW

==Related work==

CLE = Core Language Engine
- the closest point of comparison as regards coverage and purpose
- resource API similar to "Quasi-Logical Form"
- parametrized modules instead of grammar porting via macro packages
- grammar specialization via partial evaluation instead of explanation-based learning
- therefore, transfer at compile time as often as possible


Lingo Matrix project (HPSG)
- methodology rather than formal discipline for multilingual grammars
- wider coverage
- not intended as a library, no grammar specialization?


%http://www.boost.org/