<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html><head><title>GF Resources for Swedish</title></head>
<body bgcolor="#ffffff" text="#000000">
<center>

<img src="gf-logo.gif">

<h1>GF Resources for Swedish</h1>

<p>

Språkdata Seminar, Gothenburg, 1 March 2005

</p><p>

Aarne Ranta

</p><p>

<tt>aarne@cs.chalmers.se</tt>
</p></center>

<!-- NEW -->
<h2>Plan</h2>

<a href="01-gf-resource.html">Introduction to resource grammars</a> (pp. 1-16)

<p>

Swedish morphology and lexicon in GF

<p>

Syntax case study: Swedish sentence structure

<p>

Danish and Norwegian through parametrization

<!-- NEW -->
<h2>Swedish morphology and lexicon</h2>

Lexicon: list of words with inflection and other morphological
information (gender of nouns etc).

<p>

Paradigms: set of functions for extending the lexicon.

<!-- NEW -->
<h3>Parts of speech</h3>

A <b>word class</b> is a record type, with
<b>parametric</b> and <b>inherent</b> features (<tt>param</tt>eters).
For example, nouns are the type
<pre>
  N = {s : Number => Species => Case => Str ; g : Gender} ;
</pre>
where the table <tt>s</tt> varies over the parametric features,
the field <tt>g</tt> holds the inherent gender, and
<pre>
  param
    Species = Indef | Def ;
    Number  = Sg | Pl ;
    Case    = Nom | Gen ;
    Gender  = Utr | Neutr ;
</pre>

<!-- NEW -->
<h3>Defining a lexical unit</h3>

Every lexical unit has a word class as its type.
The <b>type checker</b> of GF verifies that all and only the
relevant information of the unit is given. For instance,
an entry for the noun <i>bil</i> ("car") looks as follows.
<pre>
  bil =
   {s = table {
      Sg => table {
        Indef => table {Nom => "bil"     ; Gen => "bils"} ;
        Def   => table {Nom => "bilen"   ; Gen => "bilens"}
        } ;
      Pl => table {
        Indef => table {Nom => "bilar"   ; Gen => "bilars"} ;
        Def   => table {Nom => "bilarna" ; Gen => "bilarnas"}
        }
      } ;
    g = Utr
   }
</pre>
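<p>
As a minimal sketch of how such an entry is used: a form is selected from the
table with <tt>!</tt>, and the inherent gender is projected with <tt>.g</tt>.
The names <tt>bilarnasForm</tt> and <tt>bilGender</tt> are hypothetical and serve
only as illustration; the sketch assumes that <tt>bil</tt> and the parameter types
are in scope, including a <tt>Gender</tt> type with the constructors
<tt>Utr</tt> and <tt>Neutr</tt>.
<pre>
  oper
    -- select the plural, definite, genitive form: the string "bilarnas"
    bilarnasForm : Str = bil.s ! Pl ! Def ! Gen ;

    -- project the inherent feature: Utr
    bilGender : Gender = bil.g ;
</pre>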

<!-- NEW -->

<h3>The Golden Rule of Functional Programming</h3>

Whenever you find yourself programming by "copy and paste", write
a <b>function</b> instead.

<p>

Thus do <i>not</i> write
<pre>
  gran =
   {s = table {
      Sg => table {
        Indef => table {Nom => "gran"     ; Gen => "grans"} ;
        Def   => table {Nom => "granen"   ; Gen => "granens"}
        } ;
      Pl => table {
        Indef => table {Nom => "granar"   ; Gen => "granars"} ;
        Def   => table {Nom => "granarna" ; Gen => "granarnas"}
        }
      } ;
    g = Utr
   }
</pre>

<!-- NEW -->

<h3>Inflectional paradigms as functions</h3>

Instead, write a <b>paradigm</b> that can be applied to any word
that is "inflected in the same way":
<pre>
  decl2 : Str -> N = \bil ->
   {s = table {
      Sg => table {
        Indef => table {Nom => bil + ""     ; Gen => bil + "s"} ;
        Def   => table {Nom => bil + "en"   ; Gen => bil + "ens"}
        } ;
      Pl => table {
        Indef => table {Nom => bil + "ar"   ; Gen => bil + "ars"} ;
        Def   => table {Nom => bil + "arna" ; Gen => bil + "arnas"}
        }
      } ;
    g = Utr
   }
</pre>
This function can be used over and over again:
<pre>
  bil  = decl2 "bil" ;
  gran = decl2 "gran" ;
  dag  = decl2 "dag" ;
</pre>

<!-- NEW -->

<h3>High-level definition of paradigms</h3>

Recall: functions instead of copy-and-paste!

<p>

First define (for each word class) a <b>worst-case function</b>:
<pre>
  mkN : (apa,apan,apor,aporna : Str) -> N = \apa,apan,apor,aporna ->
   {s = table {
      Sg => table {
        Indef => mkCase apa ;
        Def   => mkCase apan
        } ;
      Pl => table {
        Indef => mkCase apor ;
        Def   => mkCase aporna
        }
      } ;
    g = case last apan of {
      "n" => Utr ;
      _   => Neutr
      }
   } ;
</pre>
where we uniformly produce the genitive by
<pre>
  mkCase : Str -> Case => Str = \f -> table {
    Nom => f ;
    Gen => f + case last f of {
      "s" | "x" => [] ;
      _         => "s"
      }
    } ;
</pre>
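<p>
A small sketch of what <tt>mkCase</tt> computes, worked out by hand from the
definition above. The constant names <tt>bilGen</tt> and <tt>saxGen</tt> are
hypothetical and serve only as illustration.
<pre>
  oper
    -- "bil" ends in "l", so the genitive adds "s": "bils"
    bilGen : Str = (mkCase "bil") ! Gen ;

    -- "sax" ends in "x", so the empty string is added: "sax"
    saxGen : Str = (mkCase "sax") ! Gen ;
</pre>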

<!-- NEW -->

<h3>High-level definition of paradigms</h3>

Then define, for instance, the five declensions as follows:
<pre>
  decl1 : Str -> N = \apa -> let ap = init apa in
    mkN apa (apa + "n") (ap + "or") (ap + "orna") ;

  decl2 : Str -> N = \bil ->
    mkN bil (bil + "en") (bil + "ar") (bil + "arna") ;

  decl3 : Str -> N = \fil ->
    mkN fil (fil + "en") (fil + "er") (fil + "erna") ;

  decl4 : Str -> N = \rike ->
    mkN rike (rike + "t") (rike + "n") (rike + "na") ;

  decl5 : Str -> N = \lik ->
    mkN lik (lik + "et") lik (lik + "en") ;
</pre>
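<p>
As a sketch of how a lexicon might use these declensions, assuming
<tt>mkN</tt> and the <tt>decl</tt> functions work as intended above:
each entry is a single application, and the gender follows from the last
letter of the definite singular. The entry names are hypothetical; the
comments list the four forms passed to <tt>mkN</tt> and the resulting gender.
<pre>
  oper
    apa_N : N = decl1 "apa" ;  -- apa, apan, apor, aporna ; Utr
    bil_N : N = decl2 "bil" ;  -- bil, bilen, bilar, bilarna ; Utr
    fil_N : N = decl3 "fil" ;  -- fil, filen, filer, filerna ; Utr
    lik_N : N = decl5 "lik" ;  -- lik, liket, lik, liken ; Neutr
</pre>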

<!-- NEW -->

<h3>What paradigms are there?</h3>

Swedish nouns traditionally have 5 declensions. But each of them has
slight variations. For instance, the "2nd declension" has the following:
<pre>
  gosse  - gossar    -- 211
  nyckel - nycklar   -- 231
  seger  - segrar    -- 232
  öken   - öknar     -- 233
  hummer - humrar    -- 238
  kam    - kammar    -- 241
  mun    - munnar    -- 243
</pre>
and many more (S. Hellberg, <i>The Morphology of Present-Day Swedish</i>,
Almqvist &amp; Wiksell, Stockholm, 1978). In addition, we have at least
<pre>
  mås - mås    -- genitive form without s
  sax - sax
</pre>

<!-- NEW -->

<h3>High-level access to paradigms</h3>

The "naïve user" does not want to go through 500 noun paradigms and
pick the right one.

<p>

A much more efficient method is the one used in
dictionaries: give <i>two</i> (or more) forms instead of one.
Our "dictionary heuristic function" covers the following cases:
<pre>
  flicka - flickor
  kor    - kor      (koret)
  ko     - kor      (kon)
  ros    - rosor    (rosen)
  bil    - bilar
  nyckel - nycklar
  hummer - humrar
  rike   - riken
  lik    - lik      (liket)
  lärare - lärare   (läraren)
</pre>

<!-- NEW -->

<h3>The definition of the dictionary heuristic</h3>

<pre>
  reg2Noun : Str -> Str -> Subst = \bil,bilar ->
   let
     l  = last bil ;
     b  = Predef.tk 2 bil ;
     ar = Predef.dp 2 bilar
   in
   case ar of {
     "or" => case l of {
        "a" => decl1Noun bil ;
        "r" => sLik bil ;
        "o" => mkNoun bil (bil + "n")  bilar (bilar + "na") ;
        _   => mkNoun bil (bil + "en") bilar (bilar + "na")
        } ;
     "ar" => ifTok Subst (Predef.tk 2 bilar) bil
               (decl2Noun bil)
               (case l of {
                  "e" => decl2Noun bil ;
                  _   => mkNoun bil (bil + "n") bilar (bilar + "na")
                  }
               ) ;
     "er" => decl3Noun bil ;
     "en" => ifTok Subst bil bilar (sLik bil) (sRike bil) ; -- ben-ben
     _ => ifTok Subst bil bilar (
            case Predef.dp 3 bil of {
              "are" => sKikare (init bil) ;
              _     => decl5Noun bil
              }
            )
            (decl5Noun bil) --- rest case with lots of garbage
     } ;
</pre>
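<p>
To see how the heuristic dispatches, the following sketch traces a few calls
by hand. The constant names are hypothetical; each comment states which branch
of <tt>reg2Noun</tt> is taken, assuming that <tt>ifTok</tt> returns its first
alternative when the two tokens are equal and the second otherwise.
<pre>
  oper
    -- plural in "or", singular in "a": decl1Noun "flicka"
    flicka_N : Subst = reg2Noun "flicka" "flickor" ;

    -- plural in "or", singular in "o": mkNoun "ko" "kon" "kor" "korna"
    ko_N : Subst = reg2Noun "ko" "kor" ;

    -- plural in "ar", and "bilar" minus its last two letters is "bil":
    -- decl2Noun "bil"
    bil_N : Subst = reg2Noun "bil" "bilar" ;

    -- plural in "ar", but "nycklar" minus two letters is "nyckl", not "nyckel":
    -- mkNoun "nyckel" "nyckeln" "nycklar" "nycklarna"
    nyckel_N : Subst = reg2Noun "nyckel" "nycklar" ;

    -- plural in "en" and different from the singular: sRike "rike"
    rike_N : Subst = reg2Noun "rike" "riken" ;

    -- no special ending, plural equal to singular, singular in "are":
    -- sKikare "lärar"
    larare_N : Subst = reg2Noun "lärare" "lärare" ;
</pre>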

<!-- NEW -->

<h3>When in doubt...</h3>

Test in GF by generating the table
<pre>
  > cc mk2N "öken" "öknar"
  {s = table Number {
     Sg => table {
       Indef => table Case {
         Nom => "öken" ;
         Gen => "ökens"
         } ;
       Def => table Case {
         Nom => "ökenn" ;
         Gen => "ökenns"
         }
     ...
  }
</pre>
Here the heuristic gets the definite forms wrong
(<i>ökenn</i> instead of <i>öknen</i>).
Use the worst-case function if the heuristic does not work:
<pre>
  mkN "öken" "öknen" "öknar" "öknarna"
</pre>

<!-- NEW -->

<h3>The module <tt>ParadigmsSwe</tt></h3>

For each main category (<tt>N</tt>, <tt>A</tt>, <tt>V</tt>): a worst-case
function and a couple of "regular" patterns.
<pre>
  mkN  : (apa,apan,apor,aporna : Str) -> N ;
  mk2N : (nyckel,nycklar : Str) -> N ;

  mkV    : (supa,super,sup,söp,supit,supen : Str) -> V ;
  regV   : (tala : Str) -> V ;
  mk2V   : (leka,leker : Str) -> V ;
  irregV : (dricka, drack, druckit : Str) -> V ;
</pre>
Construction functions for subcategorization.
<pre>
  mkV2  : V -> Preposition -> V2 ;
  dirV2 : V -> V2 ;
  mkV3  : V -> Preposition -> Preposition -> V3 ;
</pre>
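<p>
A lexicon built on this module could look roughly as follows. This is only a
sketch: the module name <tt>TestLex</tt> and the entry names are hypothetical,
and it assumes that <tt>ParadigmsSwe</tt> exports the types <tt>N</tt>,
<tt>V</tt>, <tt>V2</tt> and the functions with exactly the signatures shown
above, which may differ in a released version of the library.
<pre>
  resource TestLex = open ParadigmsSwe in {
    oper
      -- nouns: worst case with four forms, or the two-form heuristic
      apa_N    : N = mkN "apa" "apan" "apor" "aporna" ;
      nyckel_N : N = mk2N "nyckel" "nycklar" ;

      -- verbs: regular, two-form, and irregular patterns
      tala_V   : V = regV "tala" ;
      leka_V   : V = mk2V "leka" "leker" ;
      dricka_V : V = irregV "dricka" "drack" "druckit" ;

      -- subcategorization: a verb taking a direct object
      dricka_V2 : V2 = dirV2 dricka_V ;
  }
</pre>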

<!-- NEW -->

<h3>Morphology extraction</h3>

Idea: search for <b>characteristic forms</b> of paradigms in a corpus.
<pre>
  paradigm decl1
    = ap+"a"
      {ap+"a" & ap+"or" };
</pre>
For instance, if you find <i>klocka</i> and <i>klockor</i>, add
<pre>
  klocka_N = decl1 "klocka" ;
</pre>
to the lexicon.

<p>

The notation for extraction and its implementation are
developed by Markus Forsberg and Harald Hammarström.

<!-- NEW -->

<h3>False positives</h3>

Problem: false positives, e.g. <i>bra - bror</i>.

<p>

Solution: restrict stem with a regular expression
<pre>
  paradigm decl1 [ap : char* vowel char*]
    = ap+"a"
      {ap+"a" & ap+"or" };
</pre>
In general, exclude stems shorter than 3 characters.

<p>

It is necessary to check the results manually.

<!-- NEW -->

<h3>Patterns for what?</h3>

"Irregular patterns" are possible, e.g.
<pre>
  paradigm vEI [sm:OneOrMore, t:OneOrMore]
    = sm+"i"+t+"a"
      {sm+"i"+t+"a" & (sm+"e"+t | sm+"i"+t+"it")} ;
</pre>
For rare patterns, it is more productive to build the
corresponding part of the lexicon manually.

<!-- NEW -->

<h3>Current Swedish resource lexicon</h3>

49,749 lemmas (1,000 manual, rest extracted),
605,453 word forms.
No subcategorization information.

<p>

Uses the
<a href="http://www.cs.chalmers.se/~markus/FM/">
Functional Morphology</a>
format, which can be translated to GF, XFST, LEXC,
MySQL, ...

<p>

FM's "native" analysis engine is based on a trie
and includes compound analysis using algorithms
from G. Huet's
<a href="http://sanskrit.inria.fr/huet/ZEN/">Zen Toolkit</a>.
Analysis speed is 12,000 words per minute
with compound analysis, 50,000 without
(on an Intel M 1.5 GHz laptop).

<!-- NEW -->

<h2>Syntax case study: Swedish noun phrases</h2>

Problem: describe agreement and inheritance of definiteness
when a determiner is added to a common noun, possibly modified by
an adjective:
<p>
<i>
en bil<br>
bilen<p>
en stor bil<br>
den stora bilen<p>
denna bil<br>
denna stora bil
</i>
</p>

<!-- NEW -->

<h3>Abstract syntax for noun phrases</h3>

The <b>abstract syntax</b> of a GF grammar defines what grammatical
structures there are, without telling how they are realized as strings.

<p>

The relevant fragment consists of 6 <b>categories</b> and
4 <b>functions</b>
<pre>
  cat
    N ;    -- simple (lexical) common noun, e.g. "bil"
    CN ;   -- possibly complex common noun, e.g. "stor bil"
    Det ;  -- determiner, e.g. "denna"
    NP ;   -- noun phrase, e.g. "bilen"
    A ;    -- simple (lexical) adjective, e.g. "stor"
    AP ;   -- adjectival phrase, e.g. "stor"
  fun
    UseN  : N -> CN ;
    UseA  : A -> AP ;
    DetCN : Det -> CN -> NP ;
    ModA  : A -> CN -> CN ;
</pre>

<!-- NEW -->

<h3>Types of complex nouns and noun phrases</h3>

Just as for words, we assign <b>linearization types</b> to
phrase categories. They are similar to the lexical types,
but often with some extra information.
<pre>
  lincat
    CN  = {s : Number => SpeciesP => Case => Str ; g : Gender ; isComplex : Bool} ;
    NP  = {s : NPForm => Str ; g : Gender ; n : Number ; p : Person} ;
    Det = {s : Gender => Str ; n : Number ; b : SpeciesP} ;
    AP  = {s : AdjFormPos => Case => Str} ;
</pre>
Here we use some new parameter types:
<pre>
  param
    SpeciesP   = IndefP | DefP Species ;
    NPForm     = PNom | PAcc | PGen GenNum ;
    GenNum     = ASg Gender | APl ;
    AdjFormPos = Strong GenNum | Weak ;
</pre>
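<p>
As an illustration of the determiner type, here is a sketch of a record for
the indefinite article: its form depends on the gender of the noun, and it is
inherently singular and indefinite, as in <i>en bil</i>. The name
<tt>enDet</tt> is hypothetical, and the record is written as an <tt>oper</tt>
with the type spelled out so that it does not depend on any particular
abstract syntax function.
<pre>
  oper
    enDet : {s : Gender => Str ; n : Number ; b : SpeciesP} =
      {s = table {Utr => "en" ; Neutr => "ett"} ;  -- form chosen by the noun's gender
       n = Sg ;                                    -- the noun must be singular
       b = IndefP                                  -- the noun must be indefinite
      } ;
</pre>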

<!-- NEW -->

<h3>Building noun phrases with a determiner</h3>

Mutual agreement:
<ul>
<li> the determiner gets the gender of the noun
<li> the noun gets the number and definiteness of the determiner
</ul>
<pre>
  DetCN : Det -> CN -> NP = \en, man ->
    {s = \\c => en.s ! man.g ++
                man.s ! en.n ! en.b ! npCase c ;
     g = genNoun man.g ;
     n = en.n ;
     p = P3
    } ;
</pre>

<!-- NEW -->

<h3></h3>

<!-- NEW -->

<h3></h3>

<!-- NEW -->

<h2>Syntax case study: Swedish sentence structure</h2>

Data: freedom of word order in main clauses
<p>
<i>
jag har inte sett dig idag<br>
dig har jag inte sett idag<br>
idag har jag inte sett dig<br>
inte har jag sett dig idag<br>
sett har jag inte dig idag (??)<br>
sett dig har jag inte idag<p>
</i>
Rigid order in questions...
<p>
<i>
har jag inte sett dig idag
</i>
<p>
... and in subordinate clauses
<p>
<i>
jag inte har sett dig idag
</i>
<p>

<!-- NEW -->

<h3>The topological model</h3>

<!-- NEW -->

<h3>The <tt>Sats</tt> data type</h3>

<!-- NEW -->

<h3>Building clauses from <tt>Sats</tt></h3>

<!-- NEW -->

<h3>Construction of <tt>Sats</tt></h3>

Notice: we want to treat <tt>Sats</tt> as an abstract data type.

<!-- NEW -->

<h3>Verb subcategorization patterns formalized</h3>

<!-- NEW -->

<h3>Adding adverbials</h3>

<!-- NEW -->

<h3>Coverage of verb patterns in Swedish Academy Grammar</h3>

<!-- NEW -->

<h3>Remaining problems</h3>

<!-- NEW -->

<h2>Danish and Norwegian through parametrization</h2>

</body>
</html>