The GF Resource Grammar Library Version 1.0
Author: Aarne Ranta <aarne (at) cs.chalmers.se>
Last update: %%date(%c)

% NOTE: this is a txt2tags file.
% Create an html file from this file using:
% txt2tags --toc clt2006.txt

%!target:html

%!postproc(html): #NEW <!-- NEW -->

#NEW

==Plan==

Purpose

Background

Coverage

Structure

How to use

How to implement a new language

How to extend the API

#NEW

==Purpose==

===Library for applications===

High-level access to grammatical rules

E.g. //You have k new messages// rendered in each of ten languages //X//:
```
render X (Have (You (Number (k (New Message)))))
```

Usability for different purposes
- translation systems
- software localization
- dialogue systems
- language teaching


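The idea of one language-independent tree rendered by per-language rules can be sketched in Python. This is a toy model, not GF code and not the library's API: the rule tables, the ``render`` function, the English and Swedish strings, and the choice k = 3 are illustrative assumptions, and number agreement (k = 1) is ignored.

```
# Hypothetical toy model: one shared tree, one rule table per language.
TREE = ("Have", "You", ("Number", 3, ("New", "Message")))

RULES = {
    "Eng": {
        "You": lambda: "you",
        "Message": lambda: "messages",
        "New": lambda cn: "new " + cn,
        "Number": lambda k, cn: f"{k} {cn}",
        "Have": lambda subj, obj: f"{subj} have {obj}",
    },
    "Swe": {
        "You": lambda: "du",
        "Message": lambda: "meddelanden",
        "New": lambda cn: "nya " + cn,
        "Number": lambda k, cn: f"{k} {cn}",
        "Have": lambda subj, obj: f"{subj} har {obj}",
    },
}


def render(lang, tree):
    """Render a tree (nested tuples) with the rule table of one language."""
    if not isinstance(tree, tuple):
        # Leaf: a named constant, or a literal such as the number k.
        rule = RULES[lang].get(tree) if isinstance(tree, str) else None
        return rule() if rule else str(tree)
    head, *args = tree
    return RULES[lang][head](*(render(lang, a) for a in args))


for lang in RULES:
    print(lang, render(lang, TREE))
```

In this model, adding a language means adding one more rule table; the tree itself stays untouched.
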
#NEW

===Grammar as parser===

Often in NLP, a grammar is just high-level code for a parser.

But writing a grammar can be inadequate for parsing:
- too much manual work
- too inefficient
- not robust
- too ambiguous


Moreover, a grammar fine-tuned for parsing may not be reusable
- for generation
- for specialized grammars
- as a library


#NEW

===Grammar as language definition===

Linguistic ontology: **abstract syntax**

E.g. adjectival modification
```
AdjCN : AP -> CN -> CN ;
```

Rendering in different languages: **concrete syntax**

Resource grammars take a generation perspective rather than a parsing perspective
- abstract syntax serves as a key to expressions in different languages


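The split between abstract and concrete syntax can be modelled in a few lines of Python. This is a sketch under stated assumptions, not GF code: only the name ``AdjCN`` comes from the rule above, while ``house_CN``, ``white_AP``, the record layout, and the ``lin`` function are hypothetical. The English model uses plain strings; the French model needs a gender on the noun so that the adjective can agree and follow it.

```
# One abstract tree, two hypothetical concrete syntaxes.
eng = {
    "house_CN": "house",
    "white_AP": "white",
    # English: adjective before noun, no agreement.
    "AdjCN": lambda ap, cn: f"{ap} {cn}",
}

fre = {
    "house_CN": {"s": "maison", "g": "fem"},
    "white_AP": {"masc": "blanc", "fem": "blanche"},
    # French: adjective after noun, agreeing with the noun's gender.
    "AdjCN": lambda ap, cn: {"s": f"{cn['s']} {ap[cn['g']]}", "g": cn["g"]},
}


def lin(conc, tree):
    """Linearize an abstract tree (nested tuples) with one concrete syntax."""
    if isinstance(tree, str):
        return conc[tree]
    fun, *args = tree
    return conc[fun](*(lin(conc, a) for a in args))


tree = ("AdjCN", "white_AP", "house_CN")
print(lin(eng, tree))
print(lin(fre, tree)["s"])
```

The caller only ever writes the tree; word order and agreement are hidden in each concrete syntax.
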
#NEW

===Usability by non-linguists===

Division of labour: resource grammars hide linguistic details

Presentation: "school grammar" concepts, dictionary-like conventions

API = Application Programmer's Interface

Documentation: ``gfdoc``

IDE = Interactive Development Environment (forthcoming)

Example-based grammar writing
```
render Ita (parse Eng "you have k messages")
```

#NEW

===Scientific interest===

Linguistics
- definition of linguistic ontology
- coping with different problems in different languages
- sharing concrete-syntax code between languages
- creating a resource for other NLP applications


Computer science
- data structures for grammar rules
- type systems for grammars
- algorithms: parsing, generation, grammar compilation
- domain-specific programming language (GF)
- module system


#NEW

==Background==

===History===

2002: v. 0.2
- English, French, German, Swedish


2003: v. 0.6
- module system
- added Finnish, Italian, Russian
- used in KeY


2005: v. 0.9
- tenses
- added Danish, Norwegian, Spanish; no German
- used in WebALT


2006: v. 1.0
- approximate CLE coverage
- reorganized module system and implementation
- not yet implemented (as of 4/3/2006) for Danish and Russian


#NEW

===Authors===

Janna Khegai (Russian modules, forthcoming),
Bjorn Bringert (many Swadesh lexica),
Carlos Gonzalia (Spanish cardinals),
Patrik Jansson (Swedish cardinals),
Aarne Ranta.

We are grateful for contributions and
comments from several other people who have used this and
the previous versions of the resource library, including
Ana Bove,
David Burke,
Lauri Carlson,
Gloria Casanellas,
Karin Cavallin,
Hans-Joachim Daniels,
Kristofer Johannisson,
Anni Laine,
Wanjiku Ng'ang'a,
Jordi Saludes.

#NEW

===Related work===

CLE (Core Language Engine,
[Book 1992 http://mitpress.mit.edu/catalog/item/default.asp?tid=7739&ttype=2])
- English, Swedish, French, Danish
- uses Definite Clause Grammars, implementation in Prolog
- coverage for SACTI corpus,
  [Spoken Language Translator (2001) http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=0521770777]
- grammar specialization via explanation-based learning


[LinGO Grammar Matrix http://www.delph-in.net/matrix/]
- English, German, Japanese, Spanish, ...
- uses HPSG, implementation in LKB
- a check list for parallel grammar implementations


[Pargram http://www2.parc.com/istl/groups/nltt/pargram/]
- Aimed: Arabic, Chinese, English, French, German, Hungarian, Japanese,
  Malagasy, Norwegian, Turkish, Urdu, Vietnamese, and Welsh
- uses LFG
- one set of big grammars, transfer rules


Rosetta Machine Translation ([Book 1994 http://citeseer.ist.psu.edu/181924.html])
- Dutch, English, French
- uses M-grammars, compositional translation inspired by Montague
- compositional transfer rules


#NEW

==Coverage==

===Languages===

The current GF Resource Project covers ten languages:
- ``Dan``ish
- ``Eng``lish
- ``Fin``nish
- ``Fre``nch
- ``Ger``man
- ``Ita``lian
- ``Nor``wegian (bokmål)
- ``Rus``sian
- ``Spa``nish
- ``Swe``dish


In addition, parts (morphology) of Arabic, Estonian, Latin, and Urdu are covered.

API 1.0 is not yet implemented for Danish and Russian.

#NEW

===Morphology and lexicon===

Complete inflection engine
- all word classes
- all forms
- all inflectional paradigms


Basic lexicon
- 100 structural words
- 350 content words, mainly for testing
- these include the 207 [Swadesh words http://en.wiktionary.org/wiki/Swadesh_List]


It is more important to enable lexicon extensions than to
provide a huge lexicon.
- technical lexica can have very special words, which tend to be regular


#NEW

===Syntactic structures===

Texts:
sequences of phrases with punctuation

Phrases:
declaratives, questions, imperatives, vocatives

Tense, mood, and polarity:
present, past, future, conditional ; simultaneous, anterior ; positive, negative

Questions:
yes-no, "wh" ; direct, indirect

Clauses:
main, relative, embedded (subject, object, adverbial)

Verb phrases:
intransitive, transitive, ditransitive, prepositional

Noun phrases:
proper names, pronouns, determiners, possessives, cardinals and ordinals

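The three dimensions listed under tense, mood, and polarity combine freely, so every clause has 4 x 2 x 2 = 16 forms. A minimal Python sketch of that count, using only the value names from the list above (the variable names are illustrative, not library identifiers):

```
from itertools import product

TENSES = ["present", "past", "future", "conditional"]
ANTERIORITY = ["simultaneous", "anterior"]
POLARITY = ["positive", "negative"]

# Every clause form is one choice from each dimension.
forms = list(product(TENSES, ANTERIORITY, POLARITY))

print(len(forms))
for form in forms[:3]:
    print(form)
```
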
#NEW

===Quantitative measures===

67 categories

150 abstract syntax combination rules

100 structural words

350 content words in a test lexicon

Lines of source code (4/3/2006):
```
abstract      1131
english       2344
german        2386
finnish       3396
norwegian     1257
swedish       1465
scandinavian  1023
french        3246   -- Besch + Irreg + Morpho 2111
italian       7797   -- Besch 6512
spanish       7120   -- Besch 5877
romance       1066
```

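The Romance totals are dominated by verb tables. Assuming that the figure after ``--`` counts lines already included in the language total (the table does not say so explicitly), the remaining code per language can be worked out as follows; the dictionary names are illustrative.

```
# Figures copied from the table above.
TOTAL = {"french": 3246, "italian": 7797, "spanish": 7120}
# Assumption: these lines are part of the totals, not additional to them.
MORPHOLOGY_TABLES = {"french": 2111, "italian": 6512, "spanish": 5877}

remaining = {lang: TOTAL[lang] - MORPHOLOGY_TABLES[lang] for lang in TOTAL}

for lang, lines in remaining.items():
    print(lang, lines)
```

Under that assumption the remainders are 1135 (French), 1285 (Italian), and 1243 (Spanish) lines, close to the Norwegian and Swedish figures.
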
#NEW

==Structure of the API==

===Language-independent ground API===

[Lang.png]

#NEW

===The structure of a text sentence===

```
John walks.

TFullStop : Phr -> Text -> Text
  (PhrUtt : PConj -> Utt -> Voc -> Phr
    NoPConj
    (UttS : S -> Utt
      (UseCl : Tense -> Anter -> Pol -> Cl -> S
        TPres
        ASimul
        PPos
        (PredVP : NP -> VP -> Cl
          (UsePN : PN -> NP
            john_PN)
          (UseV : V -> VP
            walk_V))))
    NoVoc)
  TEmpty
```

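The tree for //John walks.// is well typed: each function receives arguments of exactly the categories its signature asks for. A small Python sketch of that check follows. The function signatures are copied from the tree above; the categories of the constants (``NoPConj``, ``TPres``, ``john_PN``, and so on) are inferred from their argument positions, and the ``type_of`` checker is a hypothetical illustration, not GF's type checker.

```
# Signatures: function name -> (argument categories, result category).
SIG = {
    "TFullStop": (["Phr", "Text"], "Text"),
    "PhrUtt": (["PConj", "Utt", "Voc"], "Phr"),
    "UttS": (["S"], "Utt"),
    "UseCl": (["Tense", "Anter", "Pol", "Cl"], "S"),
    "PredVP": (["NP", "VP"], "Cl"),
    "UsePN": (["PN"], "NP"),
    "UseV": (["V"], "VP"),
    # Constants: categories inferred from where they occur in the tree.
    "NoPConj": ([], "PConj"),
    "NoVoc": ([], "Voc"),
    "TEmpty": ([], "Text"),
    "TPres": ([], "Tense"),
    "ASimul": ([], "Anter"),
    "PPos": ([], "Pol"),
    "john_PN": ([], "PN"),
    "walk_V": ([], "V"),
}


def type_of(tree):
    """Return the category of a tree, raising TypeError on a mismatch."""
    fun, *args = tree if isinstance(tree, tuple) else (tree,)
    arg_cats, result = SIG[fun]
    actual = [type_of(a) for a in args]
    if actual != arg_cats:
        raise TypeError(f"{fun} expects {arg_cats}, got {actual}")
    return result


JOHN_WALKS = (
    "TFullStop",
    ("PhrUtt",
     "NoPConj",
     ("UttS",
      ("UseCl", "TPres", "ASimul", "PPos",
       ("PredVP", ("UsePN", "john_PN"), ("UseV", "walk_V")))),
     "NoVoc"),
    "TEmpty",
)

print(type_of(JOHN_WALKS))
```

Swapping a ``VP`` into the subject position of ``PredVP`` makes ``type_of`` raise a ``TypeError``, which is the kind of error a typed abstract syntax rules out before any string is produced.
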
#NEW

===Structure in syntax editor===

[editor.png]

#NEW

===Language-dependent paradigm modules===

====Regular paradigms====

Every language implements these regular patterns that take
"dictionary forms" as arguments.
```
regN : Str -> N
regA : Str -> A
regV : Str -> V
```
Their usefulness varies. For instance, they
are all quite good in Finnish and English.
In Swedish, less so:
```
regN "val" ---> val, valen, valar, valarna
```
Initializing a lexicon with ``regX``s is
usually a good starting point in grammar development.

#NEW

====Regular paradigms====

In Swedish, giving the gender of ``N`` improves the result a lot:
```
regGenN "val" neutrum ---> val, valet, val, valen
```

There are also special constructs taking other forms:
```
mk2N : (nyckel,nycklar : Str) -> N
mk1N : (bilarna : Str) -> N

irregV : (dricka, drack, druckit : Str) -> V
```

Regular verbs are actually implemented the
[Lexin http://lexin.nada.kth.se/sve-sve.shtml] way
```
regV : (talar : Str) -> V
```

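The difference between the one-argument pattern and the gender-aware one can be sketched in Python. This is a deliberately naive model that only reproduces the two ``val`` examples given for ``regN`` and ``regGenN``; the function names ``reg_n`` and ``reg_gen_n`` are hypothetical, and the library's paradigms presumably inspect word endings in ways not modelled here.

```
def reg_n(stem):
    """Naive utrum pattern: sg indef, sg def, pl indef, pl def."""
    return [stem, stem + "en", stem + "ar", stem + "arna"]


def reg_gen_n(stem, gender):
    """Same pattern, but neuter nouns get their own endings."""
    if gender == "neutrum":
        return [stem, stem + "et", stem, stem + "en"]
    return reg_n(stem)


print(reg_n("val"))
print(reg_gen_n("val", "neutrum"))
```
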
#NEW

====Worst-case paradigms====

To cover all situations, worst-case paradigms are given. E.g. Swedish
```
mkN : (apa,apan,apor,aporna : Str) -> N
mkA : (liten, litet, lilla, sma, mindre, minst, minsta : Str) -> A
mkV : (supa,super,sup,söp,supit,supen : Str) -> V
```

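A worst-case paradigm simply takes every form as an argument, and a regular paradigm can be seen as the worst case applied to computed forms. The Python sketch below models this for the four noun forms of ``mkN``; the table keys and the naive ``reg_n`` are illustrative assumptions, not the library's definitions.

```
def mk_n(sg_indef, sg_def, pl_indef, pl_def):
    """Worst case: the caller supplies every form explicitly."""
    return {
        ("sg", "indef"): sg_indef,
        ("sg", "def"): sg_def,
        ("pl", "indef"): pl_indef,
        ("pl", "def"): pl_def,
    }


def reg_n(stem):
    """A regular pattern is the worst case with computed arguments."""
    return mk_n(stem, stem + "en", stem + "ar", stem + "arna")


apa = mk_n("apa", "apan", "apor", "aporna")
print(apa[("pl", "def")])
print(reg_n("val")[("sg", "def")])
```

The naive ``reg_n`` would produce the wrong plural for ``apa``, which is the kind of case where the explicit four-argument call is still needed.
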
#NEW

====Irregular words====

Irregular words are defined in ``IrregX``, e.g. Swedish:
```
draga_V : V =
  mkV (variants { "dra"; "draga"}) (variants { "drar" ; "drager"})
      (variants { "dra" ; "drag" }) "drog" "dragit" "dragen" ;
```
Goal: eliminate the user's need for worst-case functions.

#NEW

===Language-dependent syntax extensions===

#NEW

===Special-purpose APIs===

Syntactic structures that are not shared by all languages.

Not implemented yet.

Candidates:
- ``Nor`` post-possessives: ``bilen min``
- ``Fre`` question forms: ``est-ce que tu dors ?``


#NEW

===How to use as top-level grammar===

#NEW

===Compiling===

It is a good idea to compile the library, so that it can be opened faster:
```
GF/lib/resource-1.0% make

writes GF/lib/alltenses
       GF/lib/present
       GF/lib/resource-1.0/langs.gfcm
```
If you don't intend to change the library, you never need to process the source
files again. Just run one of:
```
gf -nocf langs.gfcm      -- all 8 languages

gf -nocf -path=alltenses:prelude alltenses/LangSwe.gfc  -- Swedish only

gf -nocf -path=alltenses:prelude present/LangSwe.gfc    -- Swedish only, present tense only
```

#NEW

===Parsing===

The default parser does not work!

The MCFG parser works in some languages, after a wait of approximately 20 seconds:
```
p -mcfg -lang=LangEng -cat=S "I would see her"

p -mcfg -lang=LangSwe -cat=S "jag skulle se henne"

p -mcfg -lang=LangNor -cat=S "jeg ville se henne"
```
Parsing in ``present/`` versions is quicker.

#NEW

===Treebank generation===

Multilingual treebank entry = tree + linearizations

Some examples of treebank generation, assuming ``langs.gfcm``
```
gr -cat=S -number=10 -cf | tb                 -- 10 random S

gt -cat=Phr -depth=4 | tb -xml | wf ex.xml    -- all Phr to depth 4, into file ex.xml
```
Regression testing
```
rf ex.xml | tb -c     -- read treebank from file and compare to present grammars
```
Updating a treebank
```
rf old.xml | tb -trees | tb -xml | wf new.xml  -- read old from file, write new to file
```

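The regression test amounts to re-linearizing every stored tree and reporting the strings that changed. A minimal Python sketch of that comparison follows; it does not reproduce the ``tb -c`` implementation or the XML treebank format, and the in-memory treebank, the ``compare`` function, and the stand-in ``linearize`` (with a deliberately broken Swedish form) are illustrative assumptions.

```
# A treebank entry is a tree plus its linearizations, keyed here by a tree string.
TREEBANK = {
    "PredVP (UsePN john_PN) (UseV walk_V)": {
        "Eng": "John walks",
        "Swe": "John går",
    },
}


def compare(treebank, linearize):
    """Return (tree, lang, expected, actual) for every entry that changed."""
    diffs = []
    for tree, expected in treebank.items():
        for lang, string in expected.items():
            actual = linearize(lang, tree)
            if actual != string:
                diffs.append((tree, lang, string, actual))
    return diffs


def linearize(lang, tree):
    """Stand-in for the current grammars, with a regression in Swedish."""
    current = {"Eng": "John walks", "Swe": "John gå"}
    return current[lang]


for diff in compare(TREEBANK, linearize):
    print(diff)
```

An empty result means the present grammars still reproduce the stored treebank.
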
#NEW

===Treebank-based parsing===

Brute-force method that helps if real parsing is more expensive.
```
make treebank                -- make treebank with all languages

gf -treebank langs.xml       -- start GF by reading the treebank

> ut -strings -treebank=LangIta    -- show all Ita strings

> ut -treebank=LangIta -raw "Quello non si romperebbe"      -- look up a string

> i -nocf langs.gfcm          -- read grammar to be able to linearize

> ut -treebank=LangIta "Quello non si romperebbe" | l -multi  -- translate to all
```

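Treebank-based parsing replaces the parser with a lookup: index the treebank by the strings of one language, find the trees for an input string, and read off their linearizations in another language. The Python sketch below shows the idea on a two-entry treebank; the data, the tree strings (including the hypothetical ``sleep_V``), and the function names are illustrative assumptions and say nothing about how ``ut -treebank`` is implemented.

```
TREEBANK = {
    "PredVP (UsePN john_PN) (UseV walk_V)": {"Eng": "John walks", "Swe": "John går"},
    "PredVP (UsePN john_PN) (UseV sleep_V)": {"Eng": "John sleeps", "Swe": "John sover"},
}


def build_index(treebank, lang):
    """Map each string of one language to the trees that produce it."""
    index = {}
    for tree, lins in treebank.items():
        index.setdefault(lins[lang], []).append(tree)
    return index


def translate(treebank, source, target, string):
    """Look the string up instead of parsing it, then read off the target side."""
    trees = build_index(treebank, source).get(string, [])
    return [treebank[tree][target] for tree in trees]


print(translate(TREEBANK, "Swe", "Eng", "John sover"))
print(translate(TREEBANK, "Swe", "Eng", "John springer"))
```

The method only covers strings that are already in the treebank, which is the price paid for avoiding the parser.
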
#NEW

===Morphology===

Use morphological analyser
```
gf -nocf -retain -path=alltenses:prelude alltenses/LangSwe.gf
> ma "jag kan inte höra vad du säger"
```

Try out a morphology quiz
```
> mq -cat=V
```

Try out inflection patterns
```
gf -retain -path=alltenses:prelude alltenses/ParadigmsSwe.gfr
> cc regV "lyser"
```

#NEW

===Syntax editing===

We start a demo by
``` gfeditor langs.gfcm

[editor.png]

#NEW

===Efficient parsing via application grammar===


#NEW

==How to use as library==

===Specialization through parametrized modules===

#NEW

===Compile-time transfer===

#NEW

===A natural division into modules===

#NEW

===Example-based grammar writing===

#NEW

==How to implement a new language==

===Ordinary modules===

#NEW

===Parametrized modules===

#NEW

===The kernel of the API===

#NEW

===How to proceed===


#NEW

==How to extend the API==

#NEW

===Extend old modules or add a new one?===