GSLT sem, final version

aarne
2006-01-30 15:23:35 +00:00
parent 1ba06050ef
commit fb281a33ca


@@ -44,7 +44,7 @@ Staff contributions to grammar libraries:
- Aarne Ranta - Aarne Ranta
Student projects on libraries: Student projects on grammar libraries:
- Inger Andersson & Therese Söderberg: Spanish morphology - Inger Andersson & Therese Söderberg: Spanish morphology
- Ludmilla Bogavac: Russian morphology - Ludmilla Bogavac: Russian morphology
- Ali El Dada: Arabic morphology and syntax - Ali El Dada: Arabic morphology and syntax
@@ -52,6 +52,12 @@ Student projects on libraries:
- Michael Pellauer: Estonian morphology - Michael Pellauer: Estonian morphology
Technology, also:
- Håkan Burden
- Hans-Joachim Daniels
- Kristofer Johannisson
- Peter Ljunglöf
#NEW #NEW
@@ -67,7 +73,6 @@ the programmers take it from a library. You write (in Haskell),
instead of a lot of code actually implementing sorting. instead of a lot of code actually implementing sorting.
Practical advantages: Practical advantages:
- division of labour
- faster development of new software - faster development of new software
- quality guarantee and automatic improvements - quality guarantee and automatic improvements
@@ -109,11 +114,20 @@ Possible ways to do this:
#NEW #NEW
3. Use a library ``GUIText`` such that you can write 3. Use a library ``Text`` such that you can write
``` ```
yesButton lang = button (render lang GUIText.Yes) yesButton lang = button (Text.render lang Text.Yes)
```
The library has an API (Application Programmer's Interface) with:
+ A repository of text elements such as
```
Yes : Text
No : Text
```
+ A function rendering text elements in different languages:
```
render : Language -> Text -> String
``` ```
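A minimal Haskell sketch of such a ``Text`` library, for illustration only. The ``Language`` constructors, the ``Button`` type and the ``button`` function are assumptions standing in for a real GUI toolkit; the Finnish strings are the ones used in the GF example later in these slides.

```haskell
-- Hypothetical sketch of the Text library API; all names are illustrative.
data Language = English | Finnish
  deriving (Show, Eq)

-- Repository of text elements.
data Text = Yes | No
  deriving (Show, Eq)

-- Render a text element in the given language.
render :: Language -> Text -> String
render English Yes = "yes"
render English No  = "no"
render Finnish Yes = "kyllä"
render Finnish No  = "ei"

-- Stand-in for a GUI toolkit's button constructor (assumption).
newtype Button = Button String
  deriving (Show, Eq)

button :: String -> Button
button = Button

-- The application code never mentions a concrete string.
yesButton :: Language -> Button
yesButton lang = button (render lang Yes)
```

Adding a language then means adding equations to ``render``; ``yesButton`` stays unchanged.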
#NEW #NEW
@@ -134,7 +148,7 @@ The code that should be written is of course
where where
messages = if n==1 then "message" else "messages" messages = if n==1 then "message" else "messages"
``` ```
(E.g. VoiceXML gives support for this.) (E.g. VoiceXML supports this.)
#NEW #NEW
@@ -163,8 +177,11 @@ of "message":
==More problems with the advanced example== ==More problems with the advanced example==
You also have to know the case required by the verb "have" You also have to know the case required by the verb "have"
(e.g. Finnish: nominative in singular, partitive in plural). e.g. Finnish:
```
1 viesti -- nominative
4 viestiä -- partitive
```
//Moreover//, you have to know what is the proper way to politely //Moreover//, you have to know what is the proper way to politely
address the user: address the user:
``` ```
@@ -180,7 +197,7 @@ address the user:
In analogy with the "Yes" case, you write In analogy with the "Yes" case, you write
``` ```
mess lang n = render lang (MailText.YouHaveMessages n) mess lang n = render lang (Text.YouHaveMessages n)
``` ```
Hmm, is this so smart? What about if you want to say Hmm, is this so smart? What about if you want to say
``` ```
@@ -202,8 +219,7 @@ You may want to write
sword lang n = render lang (Have FamYou (Num n Jewel)) sword lang n = render lang (Have FamYou (Num n Jewel))
surpr lang n = render lang (Have I (Num n Surprise)) surpr lang n = render lang (Have I (Num n Surprise))
``` ```
For this purpose, you need a library with the following API For this purpose, you need a library with the API
(Application Programmer's Interface):
``` ```
Have : NounPhrase -> NounPhrase -> Sentence Have : NounPhrase -> NounPhrase -> Sentence
@@ -213,16 +229,13 @@ For this purpose, you need a library with the following API
Num : Int -> Noun -> NounPhrase Num : Int -> Noun -> NounPhrase
Message : Noun Message : Noun
``` Jewel : Noun
You also need a top-level rendering function
```
render : Language -> Sentence -> String
``` ```
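The API above can be sketched as Haskell data types, with a renderer for English only. This is an assumption-laden illustration: ``renderEng``, ``renderNP`` and ``nounForm`` are hypothetical names, the plural rule (add "s") only works for these three nouns, and polite and familiar "you" happen to coincide in English.

```haskell
-- Hypothetical sketch: the API constructors as Haskell data types.
data Noun = Message | Jewel | Surprise
  deriving (Show, Eq)

data NounPhrase = I | FamYou | PolYou | Num Int Noun
  deriving (Show, Eq)

data Sentence = Have NounPhrase NounPhrase
  deriving (Show, Eq)

-- Number agreement: singular for exactly one, plural otherwise.
nounForm :: Int -> Noun -> String
nounForm n noun
  | n == 1    = stem
  | otherwise = stem ++ "s"
  where
    stem = case noun of
      Message  -> "message"
      Jewel    -> "jewel"
      Surprise -> "surprise"

renderNP :: NounPhrase -> String
renderNP I            = "I"
renderNP FamYou       = "you"
renderNP PolYou       = "you"
renderNP (Num n noun) = show n ++ " " ++ nounForm n noun

-- English rendering of a sentence; the verb agrees with the subject.
renderEng :: Sentence -> String
renderEng (Have subj obj) =
  renderNP subj ++ " " ++ verb ++ " " ++ renderNP obj
  where
    verb = case subj of
      Num 1 _ -> "has"
      _       -> "have"
```

With these definitions, ``renderEng (Have PolYou (Num 1 Message))`` gives "you have 1 message" and ``renderEng (Have I (Num 3 Surprise))`` gives "I have 3 surprises". The application programmer no longer writes the singular/plural test by hand.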
#NEW #NEW
==An optimal solution?== ==The ultimate solution?==
The library API for language will certainly grow big and become The library API for language will certainly grow big and become
difficult to use. Why couldn't I just write difficult to use. Why couldn't I just write
@@ -241,14 +254,18 @@ The library that we will present actually has this as well!
The only complication is that ``parse`` does not always return The only complication is that ``parse`` does not always return
just one sentence. Those may be zero: just one sentence. Those may be zero:
``` ```
you have n mesaggse "you have n mesaggse"
``` ```
or many: or many:
``` ```
"you have n messages"
Have PolYou (Num n Message) Have PolYou (Num n Message)
Have FamYou (Num n Message) Have FamYou (Num n Message)
Have PlurYou (Num n Message) Have PlurYou (Num n Message)
``` ```
Thus some amount of interaction is needed.
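One way to organize that interaction, sketched in Haskell. The ``Outcome`` type, ``classify`` and ``toyParse`` are hypothetical; ``toyParse`` only hard-codes the two inputs shown above and returns trees as plain strings.

```haskell
-- Hypothetical sketch: classify the result list of a parser.
data Outcome a
  = NoParse        -- zero results: ask the user to rephrase
  | Unique a       -- exactly one result: use it
  | Ambiguous [a]  -- several results: ask the user to choose
  deriving (Show, Eq)

classify :: [a] -> Outcome a
classify []  = NoParse
classify [x] = Unique x
classify xs  = Ambiguous xs

-- Toy stand-in for parse, covering only the inputs from the slide.
toyParse :: String -> [String]
toyParse "you have n messages" =
  [ "Have PolYou (Num n Message)"
  , "Have FamYou (Num n Message)"
  , "Have PlurYou (Num n Message)"
  ]
toyParse _ = []
```

Here ``classify (toyParse "you have n mesaggse")`` is ``NoParse``, while the correctly spelled input yields ``Ambiguous`` with three candidate trees.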
#NEW #NEW
@@ -268,7 +285,7 @@ Therefore we also need **realization functions**,
render : Language -> Sentence -> String render : Language -> Sentence -> String
parse : Language -> String -> [Sentence] parse : Language -> String -> [Sentence]
``` ```
Both of them require major linguistic expertise to write - but, Both of them require linguistic expertise to write - but,
once this is done, they can be used with very little linguistic once this is done, they can be used with very little linguistic
knowledge by application programmers! knowledge by application programmers!
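To illustrate how little the application side needs, here is a Haskell sketch that combines the two functions into a translator. It uses the tiny Yes/No tree type instead of ``Sentence``, and its ``parse`` simply tries every tree and keeps those whose rendering matches. That brute-force inversion is an assumption made for the sketch, not how GF derives parsers.

```haskell
-- Hypothetical sketch: translation as parsing followed by rendering.
data Language = English | Finnish
  deriving (Show, Eq)

data Text = Yes | No
  deriving (Show, Eq, Enum, Bounded)

render :: Language -> Text -> String
render English Yes = "yes"
render English No  = "no"
render Finnish Yes = "kyllä"
render Finnish No  = "ei"

-- Brute-force inverse of render over the finite set of trees.
parse :: Language -> String -> [Text]
parse lang s = [t | t <- [minBound .. maxBound], render lang t == s]

-- Translate through the abstract trees; one output per parse.
translate :: Language -> Language -> String -> [String]
translate from to = map (render to) . parse from
```

``translate English Finnish "yes"`` returns ``["kyllä"]``, and an input outside the grammar returns the empty list.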
@@ -291,17 +308,20 @@ In GF,
Simplest possible example: Simplest possible example:
``` ```
abstract GUIText = { abstract Text = {
cat Text ; cat Text ;
fun Yes : Text ; fun Yes : Text ;
fun No : Text ;
} }
concrete GUITextEng of GUIText = { concrete TextEng of Text = {
lin Yes = ss "yes" ; lin Yes = ss "yes" ;
lin No = ss "no" ;
} }
concrete GUITextFin of GUIText = { concrete TextFin of Text = {
lin Yes = ss "kyllä" ; lin Yes = ss "kyllä" ;
lin No = ss "ei" ;
} }
``` ```
@@ -315,11 +335,11 @@ The realization function is, for each language, implemented by
The linearization rules directly give the ``render`` method: The linearization rules directly give the ``render`` method:
``` ```
render english x = GUITextEng.lin x render english x = TextEng.lin x
``` ```
The GF formalism moreover has the property of **reversibility**: The GF formalism moreover has the property of **reversibility**:
a set of linearization rules automatically generates a parser as - a set of linearization rules automatically generates a parser.
well.
%While reversibility has a minor importance for the applications %While reversibility has a minor importance for the applications
%shown above, it is crucial for other applications of GF grammars. %shown above, it is crucial for other applications of GF grammars.
@@ -332,8 +352,8 @@ well.
**multilingual grammar** = abstract syntax + concrete syntaxes **multilingual grammar** = abstract syntax + concrete syntaxes
Examples of the idea: Examples of the idea:
- multilingual authoring
- domain-specific translation - domain-specific translation
- multilingual authoring
- dialogue systems - dialogue systems
@@ -342,17 +362,17 @@ Examples of the idea:
==Domain, ontology, idiom== ==Domain, ontology, idiom==
An abstract syntax represents An abstract syntax has other names:
- a **semantic model** - a **semantic model**
- an **ontology** - an **ontology**
The concrete syntax defines how the concepts of the ontology The concrete syntax defines how the ontology
are represented in a language. is represented in a language.
The following requirements are made: The following requirements are made:
- linguistic correctness (inflection, agreement, word order,...) - linguistic correctness (inflection, agreement, word order,...)
- semantic correctness (express the intended concepts) - semantic correctness (express the concepts properly)
- conformance to the domain idiom (use proper terms and phrasing) - conformance to the domain idiom (use proper terms and phrasing)
@@ -373,17 +393,17 @@ Arithmetic of natural numbers: abstract syntax
**Concrete syntax**: mapping from abstract syntax trees to strings in a language **Concrete syntax**: mapping from abstract syntax trees to strings in a language
(English, French, German, Swedish,...) (English, French, German, Swedish,...)
``` ```
lin Even x = {s = x.s ++ "is" ++ "even"} ; lin Even x = {s = x.s ++ "is" ++ "even"} ;
lin Even x = {s = x.s ++ "est" ++ "pair"} ; lin Even x = {s = x.s ++ "est" ++ "pair"} ;
lin Even x = {s = x.s ++ "ist" ++ "gerade"} ; lin Even x = {s = x.s ++ "ist" ++ "gerade"} ;
lin Even x = {s = x.s ++ "är" ++ "jämnt"} ; lin Even x = {s = x.s ++ "är" ++ "jämnt"} ;
``` ```
#NEW #NEW
==Translation system== ==Translation system==
We can **translate** between languages via the abstract syntax: We can translate using the abstract syntax as interlingua:
``` ```
4 is even 4 ist gerade 4 is even 4 ist gerade
\ / \ /
@@ -405,7 +425,7 @@ The previous multilingual grammar breaks these rules in many situations:
2 and 3 is even 2 and 3 is even
la somme de 3 et de 5 est pair la somme de 3 et de 5 est pair
wenn 2 ist gerade, dann 2+2 ist gerade wenn 2 ist gerade, dann 2+2 ist gerade
om 2 är jämnt, 2+2 är jämnt om x är jämnt, summan av x och 2 är jämnt
``` ```
All these sentences are grammatically incorrect. All these sentences are grammatically incorrect.
@@ -415,11 +435,10 @@ All these sentences are grammatically incorrect.
==Solving the difficulties== ==Solving the difficulties==
GF can express the linguistic rules that are needed to GF //can// express the linguistic rules that are needed to
produce correct translations. (Expressive power produce correct translations:
between TAG and HPSG, but the language is more high-level.)
Instead of just strings, we need **parameters**, **tables**, In addition to strings, we use **parameters**, **tables**,
and **record types**. For instance, French: and **record types**. For instance, French:
``` ```
param Mod = Ind | Subj ; param Mod = Ind | Subj ;
@@ -455,20 +474,33 @@ Resource grammar ("syntactic grammar")
- author: linguist - author: linguist
#NEW
==GF as programming language==
The expressive power is between TAG and HPSG.
The language is more high-level: a modern, **typed functional programming language**.
It enables linguistic generalizations and abstractions.
But we don't want to bother application grammarians with these details.
We have built a **module system** that can hide details.
#NEW #NEW
==Concrete syntax using library== ==Concrete syntax using library==
Language-independent API Assume the following API
``` ```
cat S ; NP ; A ; cat S ; NP ; A ;
fun predA : NP -> A -> S ; fun predA : A -> NP -> S ;
oper regA : Str -> A ; oper regA : Str -> A ;
``` ```
Implementation for four languages Now implement ``Even`` for four languages
``` ```
lincat lincat
Prop = S ; Prop = S ;
@@ -479,11 +511,11 @@ Implementation for four languages
Even = predA (regA "pair") ; -- French Even = predA (regA "pair") ; -- French
Even = predA (regA "gerade") ; -- German Even = predA (regA "gerade") ; -- German
``` ```
Notice: choice of adjective is domain expert knowledge. Notice: the choice of adjective is domain expert knowledge.
#NEW #NEW
==Design questions for grammar the library== ==Design questions for the grammar library==
What should there be in the library? What should there be in the library?
- morphology, lexicon, syntax, semantics,... - morphology, lexicon, syntax, semantics,...
@@ -506,7 +538,7 @@ hence cannot use existing proprietary resources.
#NEW #NEW
==Design decisions== ==Design decisions==
The current GF resource grammar library has, for each language, Coverage, for each language:
- complete morphology - complete morphology
- lexicon of the most important structural words - lexicon of the most important structural words
- test lexicon of ca. 300 content words - test lexicon of ca. 300 content words
@@ -514,13 +546,16 @@ The current GF resource grammar library has, for each language,
- rather flat semantics (cf. Quasi-Logical Form of CLE) - rather flat semantics (cf. Quasi-Logical Form of CLE)
Organization and presentation: Organization:
- top-level (API) modules - top-level (API) modules
- internal modules (only interesting for resource implementors) - Ground API + special-purpose APIs ("macro packages")
- we favour "school grammar" concepts rather than innovative linguistic theory - "school grammar" concepts rather than advanced linguistic theory
- tool ``gfdoc`` for generating HTML from grammars
Presentation:
- tool ``gfdoc`` for generating HTML from grammars
- example collections
#NEW #NEW
==Design decisions, cont'd== ==Design decisions, cont'd==
@@ -533,17 +568,14 @@ Where do we get the data from?
- we have not reused existing resources - we have not reused existing resources
The resource grammar library is entirely The resource grammar library is entirely open-source free software (under GNU GPL license).
open-source free software (under GNU GPL license).
#NEW #NEW
==Success criteria== ==Success criteria and evaluation==
Grammatical correctness of everything generated. Grammatical correctness of everything generated.
@@ -551,56 +583,58 @@ Semantic coverage: you can express whatever you want.
Usability as library for non-linguists. Usability as library for non-linguists.
(Bonus for linguists:) nice generalizations w.r.t. language Evaluation: tested in third-party projects.
families, using the module system of GF.
#NEW #NEW
==These are not our success criteria== ==These are not our success criteria==
Language coverage: to be able to parse all expressions. Language coverage:
- to be able to parse all expressions.
- Example: French //passé simple//, although covered by the - Example: French //passé simple//, although covered by the
morphology, is not available through the language-independent API. morphology, is not available through the language-independent API.
- But: reconsidered to improve example-based grammar writing
Semantic correctness: only to produce meaningful expressions. Semantic correctness:
- only to produce meaningful expressions.
- Example: the following sentences can be generated - Example: the following sentences can be generated
``` ```
colourless green ideas sleep furiously colourless green ideas sleep furiously
the time is seventy past forty-two the time is seventy past forty-two
``` ```
(Warning for linguists:) theoretical innovation in Linguistic innovation in syntax:
syntax is not among the goals - rather a presentation of "known facts"
(and it would be hidden from users anyway!). - innovation would be hidden from users anyway...
#NEW #NEW
==So where is semantics?== ==Where is semantics?==
Application grammars typically use domain-specific Application grammars use domain-specific
semantics to guarantee semantic well-formedness. semantics to guarantee semantic well-formedness.
GF incorporates a **Logical Framework** and is therefore GF incorporates a **Logical Framework** and can express
capable of expressing logical semantics //à la// Montague - logical semantics //à la// Montague
or any other flavour, including anaphora and discourse. - anaphora and discourse using dependent types
Language-independent API is a rough semantic model.
But we do //not// try to give semantics once and But we do //not// try to give semantics once and
for all for the whole language. for all for the whole language.
Instead, we expect semantics to be given in
**application grammars** built on semantic models
of different domains.
#NEW #NEW
==Levels of representation== ==Representations in different APIs==
No fixed set of levels; here some examples: **Grammar composition**: any grammar can serve as resource to another one.
No fixed set of representation levels; here some examples for
``` ```
2 is even 2 is even
2 är jämnt 2 är jämnt
@@ -616,8 +650,10 @@ In ``Predication`` (high level resource API)
``` ```
In ``Lang`` (ground level resource API) In ``Lang`` (ground level resource API)
``` ```
UseCl TPres ASimul PPos (PredVP (UsePN (IntPN 2)) (UseComp (CompAP (PositA (regA "even"))))) UseCl TPres ASimul PPos (PredVP (UsePN (IntPN 2))
UseCl TPres ASimul PPos (PredVP (UsePN (IntPN 2)) (UseComp (CompAP (PositA (regA "jämn"))))) (UseComp (CompAP (PositA (regA "even")))))
UseCl TPres ASimul PPos (PredVP (UsePN (IntPN 2))
(UseComp (CompAP (PositA (regA "jämn")))))
``` ```
@@ -632,15 +668,15 @@ The current GF Resource Project covers ten languages:
- ``Fre``nch - ``Fre``nch
- ``Ger``man - ``Ger``man
- ``Ita``lian - ``Ita``lian
- ``Nor``wegian - ``Nor``wegian (bokmål)
- ``Rus``sian - ``Rus``sian
- ``Spa``nish - ``Spa``nish
- ``Swe``dish - ``Swe``dish
The first three letters (``Dan`` etc) are used in grammar module names Implementation of API v 1.0 projected for the end of February.
In addition, we have parts (morphology) of Arabic, Estonian, and Urdu In addition, we have parts (morphology) of Arabic, Estonian, Latin, and Urdu
#NEW #NEW
@@ -652,26 +688,44 @@ In addition, we have parts (morphology) of Arabic, Estonian, and Urdu
[Examples of each category gfdoc/Cat.html] [Examples of each category gfdoc/Cat.html]
Cf. "matrix" in BLARK, LinGo
#NEW #NEW
==Library structure 2: language-dependent modules== ==Library structure 2: language-dependent APIs==
- morphological paradigms, e.g. ``ParadigmsSwe`` - morphological paradigms, e.g. ``ParadigmsSwe``
``` ```
mkN : (x1,_,_,x4 : Str) -> N ; -- worst-case noun constructor mkN : (man,mannen,män,männen : Str) -> N ; -- worst-case nouns
regN : Str -> N ; -- regular noun constructor regV : (leker : Str) -> V ; -- regular verbs
``` ```
- (in some languages) irregular verbs (and other words), e.g. ``IrregSwe`` - irregular words esp. verbs, e.g. ``IrregSwe``
``` ```
angripa_V = irregV "angripa" "angrep" "angripit" ; angripa_V = irregV "angripa" "angrep" "angripit" ;
``` ```
- (not yet available) extended syntax with language-specific rules, e.g. ``ExtNor`` - extended syntax with language-specific rules, e.g. ``ExtNor``
``` ```
PostPoss : CN -> Pron -> NP ; -- bilen min PostPoss : CN -> Pron -> NP ; -- bilen min
``` ```
#NEW
==Difficulties encountered==
English: negation and auxiliary vs. non-auxiliary verbs
Finnish: object case
German: double infinitives
Romance: clitic pronouns
Scandinavian: determiners
//In particular//: how to make the grammars efficient
#NEW #NEW
==How much can be language-independent?== ==How much can be language-independent?==
@@ -682,10 +736,33 @@ Reservations:
- does not necessarily extend to all other languages - does not necessarily extend to all other languages
- does not necessarily cover the most idiomatic expressions of each language - does not necessarily cover the most idiomatic expressions of each language
- may not be the easiest API to implement (e.g. negation and - may not be the easiest API to implement
inversion with //do// in English suggest that some other - e.g. negation and inversion with //do// in English suggest that some other
structure would be more natural) structure would be more natural
- no guaranteed that same structure has the same semantics in all different languages
- the structures may not have the same semantics in all different languages
#NEW
==Using the library==
Simplest case: use the API in the same way for all languages.
- **+** grammar localization for free
- **-** not the best idioms for each language
In practice: use the API in different ways for different languages
```
-- Eng: x's name is y
Name x y = predNP (GenCN x (regN "name")) (StringNP y)
-- Swe: x heter y
Name x y = predV2 x heta_V2 (StringNP y)
```
This amounts to **compile-time transfer**.
Surprisingly, writing an application grammar requires more native-speaker knowledge
than writing a resource grammar!
#NEW #NEW
@@ -703,23 +780,6 @@ Exploited in two families:
#NEW
==Using the library==
Simplest case: use the API in the same way for all languages.
- **+** grammar localization for free
- **-** not the best idioms for each language
In practice: use the API in different ways for different languages
```
Name x y = predNP (GenCN x (regN "name")) (StringNP y) -- Eng: x's name is y
Name x y = predV2 x heta_V2 (StringNP y) -- Swe: x heter y
```
This amounts to **compile-time transfer**.
Writing an application grammar requires more native-speaker knowledge
than writing a resource grammar!
@@ -774,29 +834,6 @@ Example heuristic, from [ParadigmsSwe gfdoc/ParadigmsSwe.html]:
``` ```
#NEW
==Corpus generation==
The most general format is **multilingual treebank** generation:
```
> gr -tr | l -multi
UseCl TCond AAnter PPos (PredVP (DetCN (DetSg DefSg NoOrd)
(AdjCN (PositA young_A) (UseN man_N))) (ComplV2 love_V2 (UsePron she_Pron)))
den unga mannen skulle ha älskat henne
der junge Mann würde sie geliebt haben
le jeune homme l' aurait aimée
the young man would have loved her
```
A special case is corpus generation, either exhaustive or random with
or without probability weights attached to constructors.
Cf. Rebecca Jonson this afternoon.
#NEW #NEW
==Use as program components== ==Use as program components==
@@ -807,6 +844,49 @@ Parsing, generation, translation
Push-button creation of spoken language translators (using Nuance) Push-button creation of spoken language translators (using Nuance)
#NEW
==Grammar library as linguistic resource==
Can we use the libraries outside domain-specific fragments?
We seem to be approaching full coverage from below.
The resource API is not good for heavy-duty parsing (too abstract and
therefore too inefficient).
Two ideas:
- write shallow parsers as application grammars
- generate corpora and use statistical parsing methods
#NEW
==Corpus generation==
The most general format is **multilingual treebank** generation:
```
> gr -tr | l -multi
UseCl TCond AAnter PNeg (PredVP (DetCN (DetSg DefSg NoOrd)
(AdjCN (PositA young_A) (UseN woman_N))) (ComplV2 love_V2 (UsePron he_Pron)))
The young woman wouldn't have loved him
Den unga kvinnan skulle inte ha älskat honom
Den unge kvinna ville ikke ha elska ham
La joven mujer no lo habría amado
La giovane donna non lo avrebbe amato
La jeune femme ne l' aurait pas aimé
Nuori nainen ei olisi rakastanut häntä
```
This is either exhaustive or random, possibly
with probability weights attached to constructors.
A special case is **corpus generation**: keep just one language.
Can this be useful? Cf. Rebecca Jonson this afternoon.
#NEW #NEW
==Related work== ==Related work==
@@ -818,10 +898,23 @@ CLE = Core Language Engine
- therefore, transfer at compile time as often as possible - therefore, transfer at compile time as often as possible
Lingo Matrix project (HPSG) LinGo Matrix project (HPSG)
- methodology rather than formal discipline for multilingual grammars - methodology rather than formal discipline for multilingual grammars
- wider coverage
- not aimed as library, no grammar specialization? - not aimed as library, no grammar specialization?
- wider coverage - parsing real texts
Parsing detached from grammar (Nivre) - grammar detached from parsing
#NEW
==Demo==
Stoneage grammar, based on the Swadesh word list.
Implemented as application on top of the resource grammar.
Illustrate generation and spoken-language parsing.
%http://www.boost.org/ %http://www.boost.org/