<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html><head><title>GF Resources for Swedish</title></head>
 <body bgcolor="#ffffff" text="#000000">
<center>

<img src="gf-logo.gif">

<h1>GF Resources for Swedish</h1>

<p>

Språkdata Seminar, Gothenburg, 1 March 2005

</p><p>

Aarne Ranta

</p><p>

<tt>aarne@cs.chalmers.se</tt>
</p></center>


<!-- NEW -->
<h2>Plan</h2>

Introduction to resource grammars

<p>

Swedish morphology and lexicon in GF

<p>

Syntax case study: Swedish determiners

<p>

Syntax case study: Swedish sentence structure

<p>

Danish and Norwegian through parametrization


<!-- NEW -->
<h2>GF = Grammatical Framework</h2>

A grammar formalism based on functional programming and type theory.

<p>

Designed to be nice for <i>ordinary programmers</i> to use.

<p>

Mission: to make natural-language applications available for
ordinary programmers, in tasks like
<ul>
<li> software documentation
<li> domain-specific translation
<li> human-computer interaction
<li> dialogue systems
</ul>
Thus <i>not</i> primarily another theoretical framework for
linguists.


<!-- NEW -->
<h2>Multilingual grammars</h2>

<b>Abstract syntax</b>: language-independent representation
<pre>
  cat Prop ; Nat ;
  fun Even : Nat -> Prop ;
</pre>
<b>Concrete syntax</b>: mapping from abstract syntax trees to strings in a language
(English, French, German, Swedish,...)
<pre>
  lin Even x = {s = x.s ++ "is" ++ "even"} ;

  lin Even x = {s = x.s ++ "est" ++ "pair"} ;

  lin Even x = {s = x.s ++ "ist" ++ "gerade"} ;

  lin Even x = {s = x.s ++ "är" ++ "jämnt"} ;
</pre>
We can <b>translate</b> between languages via the abstract syntax.
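<p>

The idea can be mimicked outside GF. The following is a toy Python sketch, not GF and not how GF is implemented: one abstract tree is mapped to strings by one linearization function per language. The names <tt>lin_even</tt>, <tt>LINEARIZERS</tt> and <tt>linearize</tt> are invented for this illustration.

```python
# Toy model of the slide's grammar: an abstract tree is a nested tuple,
# and each language supplies its own linearization of the function Even.
# All names here are hypothetical; this is an illustration, not GF.

def lin_even(copula, adjective):
    # Mirrors  lin Even x = {s = x.s ++ copula ++ adjective}
    return lambda x: {"s": x["s"] + " " + copula + " " + adjective}

LINEARIZERS = {
    "Eng": {"Even": lin_even("is", "even")},
    "Fre": {"Even": lin_even("est", "pair")},
    "Ger": {"Even": lin_even("ist", "gerade")},
    "Swe": {"Even": lin_even("är", "jämnt")},
}

def linearize(lang, tree):
    """Map an abstract tree to a string record in the given language."""
    if isinstance(tree, str):          # a numeral literal such as "2"
        return {"s": tree}
    fun, arg = tree
    return LINEARIZERS[lang][fun](linearize(lang, arg))

tree = ("Even", "2")                   # abstract syntax: Even 2
for lang in LINEARIZERS:
    print(lang, linearize(lang, tree)["s"])
```

Translation then amounts to recovering the tree from one language and linearizing it in another; the sketch only shows the second half.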

<p>

Is it really so simple?


<!-- NEW -->
<h2>Difficulties with concrete syntax</h2>

Most languages have rules of <b>inflection</b>, <b>agreement</b>,
and <b>word order</b>, which have to be obeyed when putting together
expressions.

<p>

The previous multilingual grammar breaks these rules in many situations:
<p><i>
2 and 3 is even<br>
la somme de 3 et de 5 est pair<br>
wenn 2 ist gerade, dann 2+2 ist gerade<br>
om 2 är jämnt, 2+2 är jämnt<br>
</i>

<!-- NEW -->
<h2>Solving the difficulties</h2>

GF has tools for expressing the linguistic rules that are needed to
produce correct translations in different languages.

<p>

Instead of just strings, we need <b>parameters</b>, <b>tables</b>,
and <b>record types</b>. For instance, French:
<pre>
  param Mod = Ind | Subj ;
  param Gen = Masc | Fem ;

  lincat Nat = {s : Str ; g : Gen} ;
  lincat Prop = {s : Mod => Str} ;

  lin Even x = {s =
      table {
        m => x.s ++
             case m   of {Ind  => "est" ;  Subj => "soit"} ++
             case x.g of {Masc => "pair" ; Fem  => "paire"}
      }
    } ;
</pre>
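<p>

For readers more used to mainstream languages, here is a rough Python analogue of the French rule, assuming that tables can be modelled as dictionaries and records as dictionaries with fixed keys. The two noun phrases <tt>somme</tt> and <tt>produit</tt> are made-up test data.

```python
# Sketch of the French rule above: a proposition is a table from mood to
# string, and the adjective agrees with the inherent gender of the argument.
# The entries `somme` and `produit` are invented examples.

COPULA = {"Ind": "est", "Subj": "soit"}   # selected by the mood parameter
PAIR = {"Masc": "pair", "Fem": "paire"}   # selected by the gender of x

def even(x):
    # x is a record {s : Str ; g : Gen}; the result is {s : Mod => Str}
    return {"s": {m: f"{x['s']} {COPULA[m]} {PAIR[x['g']]}" for m in COPULA}}

somme = {"s": "la somme de 3 et de 5", "g": "Fem"}
produit = {"s": "le produit de 2 et de 3", "g": "Masc"}

print(even(somme)["s"]["Ind"])     # la somme de 3 et de 5 est paire
print(even(produit)["s"]["Subj"])  # le produit de 2 et de 3 soit pair
```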


<!-- NEW -->
<h2>Language + Libraries</h2>

Writing natural language grammars still requires
theoretical knowledge about the language.

<p>

Which kind of a programmer is easier to find?
<ul>
<li> one who can write a sorting algorithm
<li> one who can write a grammar for Swedish determiners
</ul>

<p>

In mainstream programming, sorting algorithms are not
written by hand but taken from <b>libraries</b>.

<p>

In the same way, we want to create grammar libraries that encapsulate
basic linguistic facts.

<p>

Cf. the Java success story: the language is only half of the
success; the libraries are the other half.


<!-- NEW -->
<h2>Example of library-based grammar writing</h2>

To define a Swedish expression of a mathematical predicate from scratch:
<pre>
  Even x =
    let jämn = case &lt;x.n,x.g> of {
      &lt;Sg,Utr>   => "jämn" ;
      &lt;Sg,Neutr> => "jämnt" ;
      &lt;Pl,_>     => "jämna"
      }
    in
    {s = table {
      Main => x.s ! Nom ++ "är" ++ jämn ;
      Inv  => "är" ++ x.s ! Nom ++ jämn ;
      Sub  => x.s ! Nom ++ "är" ++ jämn
      }
    }
</pre>
To use library functions for syntax and morphology:
<pre>
  Even = predA (regA "jämn") ;
</pre>


<!-- NEW -->
<h2>Questions in grammar library design</h2>

What should there be in the library?
<br>
<li> morphology, lexicon, syntax, semantics,...

<p>

How do we organize and present the library?
<br>
<li> division into modules, level of granularity
<br>
<li> "school grammar" vs. sophisticated linguistic concepts

<p>

Where do we get the data from?
<br>
<li> automatic extraction or hand-writing?
<br>
<li> reuse of existing resources?

<p>

Extra constraint: we want open-source free software.


<!-- NEW -->
<h2>The scope of the resource grammar library</h2>

All morphological paradigms

<p>

Basic lexicon of structural, common, and irregular words

<p>

Basic syntactic structures

<p>

Currently,<br>
<li> <i>no</i> semantics,<br>
<li> <i>no</i> language-specific structures if not necessary for expressivity.


<!-- NEW -->
<h2>Success criteria</h2>

Grammatical correctness

<p>

Semantic coverage: you can express whatever you want.

<p>

Usability as library for non-linguists.

<p>

(Bonus for linguists:) nice generalizations w.r.t. language
families, using the module system of GF.


<!-- NEW -->
<h2>These are not our success criteria</h2>

Language coverage: you can parse all expressions. Example:
the French <i>passé simple</i> tense, although covered by the
morphology, is not used in the language-independent API;
only the <i>passé composé</i> is.

<p>

Semantic correctness
<pre>
  colourless green ideas sleep furiously

  the time is seventy past forty-two
</pre>

<p>

(Warning for linguists:) theoretical innovation in
syntax (and it will all be hidden anyway!)


<!-- NEW -->
<h2>So where is semantics?</h2>

GF incorporates a <b>Logical Framework</b> and is therefore
capable of expressing logical semantics <i>à la</i> Montague
or any other flavour, including anaphora and discourse.

<p>

But we do <i>not</i> try to give semantics once and
for all for the whole language.

<p>

Instead, we expect semantics to be given in
<b>application grammars</b> built on semantic models
of different domains.

<p>

Example application: number theory
<pre>
  fun Even : Nat -> Prop ;         -- a mathematical predicate

  lin Even = predA (regA "even") ; -- English translation
  lin Even = predA (regA "pair") ; -- French translation
  lin Even = predA (regA "jämn") ; -- Swedish translation
</pre>
How could the resource predict that just <i>these</i>
translations are correct in this domain?


<!-- NEW -->
<h2>Languages</h2>

The current GF Resource Project covers ten languages:
<ul>
<li><tt>Dan</tt>ish
<li><tt>Eng</tt>lish
<li><tt>Fin</tt>nish
<li><tt>Fre</tt>nch
<li><tt>Ger</tt>man
<li><tt>Ita</tt>lian
<li><tt>Nor</tt>wegian
<li><tt>Rus</tt>sian
<li><tt>Spa</tt>nish
<li><tt>Swe</tt>dish
</ul>
The first three letters (<tt>Dan</tt> etc.) are used in grammar module names.


<!-- NEW -->
<h2>Library structure 1: language-independent API</h2>

<li> syntactic <tt>Categories</tt> (parts of speech, word classes)

<p>

<li> <tt>Rules</tt> for combining words and phrases, e.g.

<p>

<li> the most common <tt>Structural</tt> words (determiners,
conjunctions, pronouns), e.g.

<!-- NEW -->
<h2>Library structure 2: language-dependent modules</h2>

<li> morphological <tt>Paradigms</tt>, e.g.

<p>

<li> <tt>Lexicon</tt> of frequent words


<p>

<li> <tt>Ext</tt>ended syntax with language-specific rules


<!-- NEW -->
<h2>How much can be language-independent?</h2>

For the ten languages we have considered, it <i>is</i> possible
to implement the current API.

<p>

Reservations:
<ul>
<li> does not necessarily extend to all other languages
<li> does not necessarily cover the most idiomatic expressions
     of each language
<li> may not be the easiest API to implement (e.g. negation and
inversion with  <i>do</i> in English suggest that some other
structure would be more natural)
<li> does not guarantee that the same structure has the same semantics
in different languages
</ul>


<!-- NEW -->
<h2>Swedish morphology and lexicon</h2>

Lexicon: list of words with inflection and other morphological
information (gender of nouns etc).

<p>

Paradigms: set of functions for extending the lexicon.


<!-- NEW -->
<h3>Parts of speech</h3>

A <b>word class</b> is a record type, with
<b>parametric</b> and <b>inherent</b> features (<tt>param</tt>eters).
For example, nouns have the type
<pre>
  N = {s : Number => Species => Case => Str ; g : Gender} ;
</pre>
where
<pre>
  param
    Species = Indef | Def ;
    Number  = Sg | Pl ;
    Case    = Nom | Gen ;
</pre>


<!-- NEW -->
<h3>Defining a lexical unit</h3>

Every lexical unit has a word class as its type.
The <b>type checker</b> of GF verifies that all and only the
relevant information of the unit is given. For instance,
an entry for the noun <i>bil</i> ("car") looks as follows.
<pre>
  bil =
  {s = table {
     Sg => table {
       Indef => table {Nom => "bil"     ; Gen => "bils" } ;
       Def   => table {Nom => "bilen"   ; Gen => "bilens" }
       } ;
     Pl => table {
       Indef => table {Nom => "bilar"   ; Gen => "bilars" } ;
       Def   => table {Nom => "bilarna" ; Gen => "bilarnas" }
       }
     } ;
   g = Utr
  }
</pre>

<!-- NEW -->

<h3>The Golden Rule of Functional Programming</h3>

Whenever you find yourself programming by "copy and paste", write
a <b>function</b> instead.

<p>

Thus do <i>not</i> write
<pre>
  gran =
  {s = table {
     Sg => table {
       Indef => table {Nom => "gran"     ; Gen => "grans" } ;
       Def   => table {Nom => "granen"   ; Gen => "granens" }
       } ;
     Pl => table {
       Indef => table {Nom => "granar"   ; Gen => "granars" } ;
       Def   => table {Nom => "granarna" ; Gen => "granarnas" }
       }
     } ;
   g = Utr
  }
</pre>


<!-- NEW -->

<h3>Inflectional paradigms as  functions</h3>

Instead, write a <b>paradigm</b> that can be applied to any word
that is "inflected in the same way":
<pre>
  decl2 : Str -> N = \bil ->
  {s = table {
     Sg => table {
       Indef => table {Nom => bil + ""     ; Gen => bil + "s" } ;
       Def   => table {Nom => bil + "en"   ; Gen => bil + "ens" }
       } ;
     Pl => table {
       Indef => table {Nom => bil + "ar"   ; Gen => bil + "ars" } ;
       Def   => table {Nom => bil + "arna" ; Gen => bil + "arnas" }
       }
     } ;
   g = Utr
  }
</pre>
This function can be used over and over again:
<pre>
  bil  = decl2 "bil" ;
  gran = decl2 "gran" ;
  dag  = decl2 "dag" ;
</pre>
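<p>

The same move from a copied table to a function can be shown in Python. This is only a sketch: nested dictionaries stand in for GF tables, and the function name <tt>decl2</tt> is borrowed from the slide for readability.

```python
# A paradigm as a function: one stem in, the full inflection table out.
# Nested dicts play the role of the GF table Number => Species => Case => Str.

def decl2(stem):
    return {
        "s": {
            "Sg": {"Indef": {"Nom": stem, "Gen": stem + "s"},
                   "Def": {"Nom": stem + "en", "Gen": stem + "ens"}},
            "Pl": {"Indef": {"Nom": stem + "ar", "Gen": stem + "ars"},
                   "Def": {"Nom": stem + "arna", "Gen": stem + "arnas"}},
        },
        "g": "Utr",  # this paradigm fixes the inherent gender
    }

bil, gran, dag = decl2("bil"), decl2("gran"), decl2("dag")
print(gran["s"]["Pl"]["Def"]["Gen"])  # granarnas
```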


<!-- NEW -->

<h3>High-level definition of paradigms</h3>

Recall: functions instead of copy-and-paste!

<p>

First define (for each word class) a <b>worst-case function</b>:
<pre>
  mkN : (apa,apan,apor,aporna : Str) -> N = \apa,apan,apor,aporna ->
  {s = table {
     Sg => table {
       Indef => mkCase apa ;
       Def   => mkCase apan
       } ;
     Pl => table {
       Indef => mkCase apor ;
       Def   => mkCase aporna
       }
     } ;
   g = case last apan of {
         "n" => Utr ;
         _   => Neutr
         }
  } ;
</pre>
where we uniformly produce the genitive by
<pre>
  mkCase : Str -> Case => Str = \f -> table {
      Nom => f ;
      Gen => f + case last f of {
        "s" | "x" => [] ;
        _ => "s"
        }
      } ;
</pre>
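<p>

The genitive rule is small enough to check by hand. A Python sketch of the same logic follows; the name <tt>mk_case</tt> is just a transliteration of the GF name.

```python
# Genitive formation as in mkCase: add "s" unless the form already
# ends in "s" or "x".

def mk_case(form):
    gen = form if form.endswith(("s", "x")) else form + "s"
    return {"Nom": form, "Gen": gen}

print(mk_case("bil"))   # {'Nom': 'bil', 'Gen': 'bils'}
print(mk_case("mås"))   # {'Nom': 'mås', 'Gen': 'mås'}
print(mk_case("sax"))   # {'Nom': 'sax', 'Gen': 'sax'}
```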


<!-- NEW -->

<h3>High-level definition of paradigms</h3>

Then define, for instance, the five declensions as follows:
<pre>
  decl1 : Str -> N = \apa -> let ap = init apa in
    mkN apa (apa + "n") (ap + "or") (ap + "orna") ;

  decl2 : Str -> N = \bil ->
    mkN bil (bil + "en") (bil + "ar") (bil + "arna") ;

  decl3 : Str -> N = \fil ->
    mkN fil (fil + "en") (fil + "er") (fil + "erna") ;

  decl4 : Str -> N = \rike -> let rik = init rike in
    mkN rike (rike + "t") (rike + "n") (rik + "ena") ;

  decl5 : Str -> N = \lik ->
    mkN lik (lik + "et")  lik  (lik + "en") ;
</pre>
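<p>

As a cross-check of the five definitions, here is a Python sketch in which each declension is a thin wrapper around a worst-case function. The helper <tt>mk_case</tt> is repeated so that the block is self-contained, and the gender guess from the last letter of the definite form follows the worst-case function on the previous slide.

```python
# Worst-case function plus five declensions, modelled with dicts.

def mk_case(form):
    return {"Nom": form,
            "Gen": form if form.endswith(("s", "x")) else form + "s"}

def mk_n(apa, apan, apor, aporna):
    return {
        "s": {"Sg": {"Indef": mk_case(apa), "Def": mk_case(apan)},
              "Pl": {"Indef": mk_case(apor), "Def": mk_case(aporna)}},
        # gender is read off the definite singular: -n is uter, else neuter
        "g": "Utr" if apan.endswith("n") else "Neutr",
    }

def decl1(apa):
    ap = apa[:-1]                      # drop the final "a"
    return mk_n(apa, apa + "n", ap + "or", ap + "orna")

def decl2(bil):
    return mk_n(bil, bil + "en", bil + "ar", bil + "arna")

def decl3(fil):
    return mk_n(fil, fil + "en", fil + "er", fil + "erna")

def decl4(rike):
    rik = rike[:-1]                    # drop the final "e"
    return mk_n(rike, rike + "t", rike + "n", rik + "ena")

def decl5(lik):
    return mk_n(lik, lik + "et", lik, lik + "en")

print(decl1("flicka")["s"]["Pl"]["Def"]["Nom"])  # flickorna
print(decl4("rike")["s"]["Pl"]["Def"]["Gen"])    # rikenas
```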


<!-- NEW -->

<h3>What paradigms are there?</h3>

Swedish nouns traditionally have 5 declensions. But each of them has
slight variations. For instance, the "2nd declension" has the following:
<pre>
  gosse  - gossar  -- 211
  nyckel - nycklar -- 231
  seger  - segrar  -- 232
  öken   - öknar   -- 233
  hummer - humrar  -- 238
  kam    - kammar  -- 241
  mun    - munnar  -- 243
</pre>
and many more (S. Hellberg, <i>The Morphology of Present-Day Swedish</i>,
Almqvist & Wiksell, Stockholm, 1978). In addition, we have at least
<pre>
  mås - mås -- genitive form without s
  sax - sax
</pre>


<!-- NEW -->

<h3>High-level access to paradigms</h3>

The "naïve user" does not want to go through 500 noun paradigms and
pick the right one.

<p>

A much more efficient method is the one used in
dictionaries: give <i>two</i> (or more) forms instead of one.
Our "dictionary heuristic function" covers the following cases:
<pre>
  flicka - flickor
  kor    - kor     (koret)
  ko     - kor     (kon)
  ros    - rosor   (rosen)
  bil    - bilar
  nyckel - nycklar
  hummer - humrar
  rike   - riken
  lik    - lik     (liket)
  lärare - lärare  (läraren)
</pre>


<!-- NEW -->

<h3>The definition of the dictionary heuristic</h3>

<pre>
reg2Noun : Str -> Str -> Subst = \bil,bilar ->
   let
     l  = last bil ;
     b  = Predef.tk 2 bil ;
     ar = Predef.dp 2 bilar
   in
   case ar of {
      "or" => case l of {
         "a" => decl1Noun bil ;
         "r" => sLik bil ;
         "o" => mkNoun bil (bil + "n")  bilar (bilar + "na") ;
         _   => mkNoun bil (bil + "en") bilar (bilar + "na")
         } ;
      "ar" => ifTok Subst (Predef.tk 2 bilar) bil
                 (decl2Noun bil)
                 (case l of {
                    "e" => decl2Noun bil ;
                    _   => mkNoun bil (bil + "n") bilar (bilar + "na")
                    }
                 ) ;
      "er" => decl3Noun bil ;
      "en" => ifTok Subst bil bilar (sLik bil) (sRike bil) ; -- ben-ben
      _ => ifTok Subst bil bilar (
             case Predef.dp 3 bil of {
                "are" => sKikare (init bil) ;
                _ => decl5Noun bil
                }
             )
             (decl5Noun bil) --- rest case with lots of garbage
      } ;
</pre>
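<p>

To make the control flow easier to follow, here is a much simplified Python sketch of a two-form heuristic. It is not a transcription of <tt>reg2Noun</tt>: it handles only a few of the listed cases, returns just the four nominative forms, and will guess wrongly for nouns with stem changes such as <i>nyckel</i> or <i>öken</i>.

```python
# Simplified two-form heuristic: guess the four nominative forms
# (sg indef, sg def, pl indef, pl def) from singular and plural.
# Covers only a handful of patterns; everything else raises an error.

def reg2_noun(sg, pl):
    if pl.endswith("or") and sg.endswith("a"):       # flicka - flickor
        return (sg, sg + "n", pl, pl + "na")
    if pl.endswith(("ar", "er")):                    # bil - bilar, fil - filer
        sg_def = sg + "n" if sg.endswith("e") else sg + "en"
        return (sg, sg_def, pl, pl + "na")
    if pl == sg + "n":                               # rike - riken
        return (sg, sg + "t", pl, pl + "a")
    if pl == sg and sg.endswith("are"):              # lärare - lärare
        return (sg, sg + "n", pl, sg[:-1] + "na")
    if pl == sg:                                     # lik - lik
        return (sg, sg + "et", pl, sg + "en")
    raise ValueError(f"no heuristic for {sg} - {pl}")

print(reg2_noun("flicka", "flickor"))
print(reg2_noun("rike", "riken"))
print(reg2_noun("lärare", "lärare"))
```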

<!-- NEW -->

<h3>When in doubt...</h3>

Test in GF by generating the table
<pre>
  > cc mk2N "öken" "öknar"
  {s = table Number {
    Sg => table {
      Indef => table Case {
        Nom => "öken" ;
        Gen => "ökens"
      } ;
      Def => table Case {
        Nom => "ökenn" ;
        Gen => "ökenns"
      }
    ...
  }
</pre>
Use the worst-case function if the heuristic does not work:
<pre>
  mkN "öken" "öknen" "öknar" "öknarna"
</pre>


<!-- NEW -->

<h3>The module <tt>ParadigmsSwe</tt></h3>

For each main category - <tt>N</tt>, <tt>A</tt>, <tt>V</tt> - a worst-case
function and a couple of "regular" patterns.
<pre>
  mkN    : (apa,apan,apor,aporna : Str) -> N ;
  mk2N   : (nyckel,nycklar : Str) -> N ;

  mkV    : (supa,super,sup,söp,supit,supen : Str) -> V ;
  regV   : (tala : Str) -> V ;
  mk2V   : (leka,leker : Str) -> V ;
  irregV : (dricka, drack, druckit  : Str) -> V ;
</pre>
Construction functions for subcategorization.
<pre>
  mkV2   : V -> Preposition -> V2 ;
  dirV2  : V -> V2 ;
  mkV3   : V -> Preposition -> Preposition -> V3 ;
</pre>


<!-- NEW -->

<h3>Morphology extraction</h3>

Idea: search for <b>characteristic forms</b> of paradigms in a corpus.
<pre>
  paradigm decl1
    = ap+"a"
      {ap+"a" & ap+"or" };
</pre>
For instance, if you find <i>klocka</i> and <i>klockor</i>, add
<pre>
  klocka = decl1 "klocka" ;
</pre>
to the lexicon.

<p>

The notation for extraction and its implementation are
developed by Markus Forsberg and Harald Hammarström.
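<p>

The search itself can be pictured with a few lines of Python. This is only an illustration of the idea under the assumption that the corpus is a plain list of word forms; it is not the extraction tool mentioned above, and <tt>extract_decl1</tt> is a hypothetical name.

```python
# Look for the characteristic forms of the first declension:
# a word ending in "a" whose stem also occurs with the ending "or".

def extract_decl1(corpus_words):
    words = set(corpus_words)
    lemmas = []
    for w in sorted(words):
        if w.endswith("a") and w[:-1] + "or" in words:
            lemmas.append(w)
    return lemmas

corpus = "en klocka två klockor flicka flickor bil bilar".split()
print(extract_decl1(corpus))  # ['flicka', 'klocka']
```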


<!-- NEW -->

<h3>False positives</h3>

Problem: false positives, e.g. <i>bra - bror</i>.

<p>

Solution: restrict stem with a regular expression
<pre>
  paradigm decl1 [ap : char* vowel char*]
    = ap+"a"
      {ap+"a" & ap+"or" };
</pre>
In general, exclude stems shorter than 3 characters.

<p>

To guarantee quality, it is necessary to check the results manually.
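<p>

A Python sketch of the two filters, assuming the stem constraint can be written as an ordinary regular expression and that the vowel set is <tt>aeiouyåäö</tt>. The function name and the default minimum length are illustrative choices.

```python
import re

# Stem must contain at least one vowel (char* vowel char*) and be
# at least three characters long before a candidate is accepted.
VOWELS = "aeiouyåäö"
STEM = re.compile(f"[a-zåäö]*[{VOWELS}][a-zåäö]*")

def extract_decl1(corpus_words, min_stem=3):
    words = set(corpus_words)
    found = []
    for w in sorted(words):
        stem = w[:-1]
        if (w.endswith("a") and stem + "or" in words
                and len(stem) >= min_stem and STEM.fullmatch(stem)):
            found.append(w)
    return found

print(extract_decl1(["bra", "bror", "klocka", "klockor"]))  # ['klocka']
```

The pair <i>bra - bror</i> is rejected twice over: the stem <i>br</i> has no vowel and is shorter than three characters.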


<!-- NEW -->

<h3>Patterns for what?</h3>

"Irregular patterns" are possible, e.g.
<pre>
  paradigm vEI [sm:OneOrMore, t:OneOrMore]
    = sm+"i"+t+"a"
      {sm+"i"+t+"a" & (sm+"e"+t | sm+"i"+t+"it")} ;
</pre>
For rare patterns, it is more productive to build the
corresponding part of lexicon manually.


<!-- NEW -->

<h3>Current Swedish resource lexicon</h3>

49,749 lemmas (1,000 manual, rest extracted),
605,453 word forms.
No subcategorization information.

<p>

Uses the
<a href="http://www.cs.chalmers.se/~markus/FM/">
Functional Morphology</a>
format, which can be translated to GF, XFST, LEXC,
MySQL,...

<p>

FM's "native" analysis engine is based on a trie
and includes compound analysis using algorithms
from G. Huet's
<a href="http://sanskrit.inria.fr/huet/ZEN/">Zen Toolkit</a>.
Analysis speed is 12,000 words per minute
with compound analysis, 50,000 without
(on an Intel M1.5 GHz laptop).


<!-- NEW -->

<h2>Syntax case study: Swedish determiners</h2>

Problem: describe agreement and inheritance of definiteness
when a determiner is added to a common noun, possibly modified by
an adjective:
<p>
<i>
en bil<br>
bilen<p>
en stor bil<br>
den stora bilen<p>
denna bil<br>
denna stora bil
</i>
</p>

<!-- NEW -->

<h3>Abstract syntax for noun phrases</h3>

The <b>abstract syntax</b> of a GF grammar defines what grammatical
structures there are, without saying how they are expressed in a language.

<p>

The relevant fragment consists of 5 <b>categories</b> and
4 <b>functions</b>
<pre>
  cat
    N ;   -- simple (lexical) common noun, e.g. "bil"
    CN ;  -- possibly complex common noun, e.g. "stor bil"
    Det ; -- determiner,                   e.g. "denna"
    NP ;  -- noun phrase,                  e.g. "bilen"
    AP ;  -- adjectival phrase,            e.g. "stor"
  fun
    UseN  : N -> CN ;
    UseA  : A -> AP ;
    DetCN : Det -> CN -> NP ;
    ModAP : AP -> CN -> CN ;
</pre>


<!-- NEW -->

<h3>The type of complex common nouns</h3>

Phrase categories are similar to lexical types,
but often with some extra information.

<p>

Complex common nouns have the following linearization type
<pre>
  CN = {
    s : Number => SpeciesP => Case => Str ;
    g : Gender ;
    isComplex : Bool
    } ;
</pre>
Here we use a new parameter type,
<pre>
  SpeciesP   = IndefP | DefP Species ;
</pre>
to distinguish between three forms:
<pre>
  IndefP     => "stor bil"
  DefP Indef => "stora bil"
  DefP Def   => "stora bilen"
</pre>


<!-- NEW -->

<h3>Simple and complex common nouns</h3>

The boolean feature <tt>isComplex</tt> is <tt>False</tt> for
"one-word" <tt>CN</tt>s. Adjectival modification makes it
<tt>True</tt>. Notice also the agreement of the adjective with the
noun.
<pre>
  UseN hus =
    {s = \\n,b,c => hus.s ! n ! unSpeciesP b ! c ;
     g = hus.g ;
     isComplex = False
    } ;

  ModAP Stor Nybil =
    {s = \\n, b, c =>
           let
             stor  = Stor.s ! mkAdjForm (unSpeciesAdjP b) n Nybil.g ! Nom ;
             nybil = Nybil.s ! n ! b ! c
           in preOrPost Stor.p nybil stor ;
     g = Nybil.g ;
     isComplex = True
     } ;
</pre>


<!-- NEW -->

<h3>The type of noun phrases</h3>

A noun phrase must carry the <b>agreement features</b> that
are passed to a predicate: number, gender, and person.
<pre>
  NP = {
    s : NPForm => Str ;
    g : Gender ;
    n : Number ;
    p : Person
    } ;
</pre>
Since pronouns have special accusative and possessive forms,
the case of noun phrases is richer than the case of nouns.
<pre>
  NPForm = PNom | PAcc | PGen GenNum ;

  GenNum = ASg Gender | APl ;
</pre>


<!-- NEW -->

<h3>The type of determiners</h3>

A determiner agrees with the noun in gender, and determines the
number and species of the noun.
<pre>
  Det = {
    s : Gender => Str ;
    n : Number ;
    b : SpeciesP
    } ;
</pre>
Some examples:
<pre>
  en_Det    = {s = genForms "en"    "ett"   ; n = Sg ; b = IndefP} ;

  denna_Det = {s = genForms "denna" "detta" ; n = Sg ; b = DefP Indef} ;

  den_Det   = {s = genForms "den"   "det"   ; n = Sg ; b = DefP Def} ;

  dessa_Det = {s = \\ _ =>  "dessa"         ; n = Pl ; b = DefP Indef} ;
</pre>


<!-- NEW -->

<h3>Building noun phrases with a determiner</h3>

Mutual agreement:
<ul>
<li> the determiner gets the gender of the noun
<li> the noun gets the number and definiteness of the determiner
</ul>
<pre>
  DetCN : Det -> CN -> NP = \en, man ->
    {s = \\c => en.s ! man.g ++
                man.s ! en.n ! en.b ! npCase c ;
     g = genNoun man.g ;
     n = en.n ;
     p = P3
    } ;
</pre>
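<p>

The two directions of agreement can be traced in a small Python sketch. It is a simplification: the noun phrase keeps only nominative and genitive instead of the full <tt>NPForm</tt>, the three <tt>SpeciesP</tt> values are plain string keys, and the entry <tt>stor_bil</tt> is written out by hand.

```python
# Mutual agreement in DetCN: the determiner is selected by the noun's
# gender, the noun form by the determiner's number and species.

def cases(nom):
    return {"Nom": nom, "Gen": nom + "s"}

stor_bil = {
    "s": {"Sg": {"IndefP": cases("stor bil"),
                 "DefP Indef": cases("stora bil"),
                 "DefP Def": cases("stora bilen")},
          "Pl": {"IndefP": cases("stora bilar"),
                 "DefP Indef": cases("stora bilar"),
                 "DefP Def": cases("stora bilarna")}},
    "g": "Utr",
    "isComplex": True,
}

en_det = {"s": {"Utr": "en", "Neutr": "ett"}, "n": "Sg", "b": "IndefP"}
denna_det = {"s": {"Utr": "denna", "Neutr": "detta"}, "n": "Sg", "b": "DefP Indef"}
den_det = {"s": {"Utr": "den", "Neutr": "det"}, "n": "Sg", "b": "DefP Def"}
dessa_det = {"s": {"Utr": "dessa", "Neutr": "dessa"}, "n": "Pl", "b": "DefP Indef"}

def det_cn(det, cn):
    def form(case):
        words = [det["s"][cn["g"]], cn["s"][det["n"]][det["b"]][case]]
        return " ".join(w for w in words if w)
    return {"s": {c: form(c) for c in ("Nom", "Gen")},
            "g": cn["g"], "n": det["n"], "p": "P3"}

for det in (en_det, denna_det, den_det, dessa_det):
    print(det_cn(det, stor_bil)["s"]["Nom"])
```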

<!-- NEW -->

<h3>Definite noun phrases</h3>

Rather than a determiner like the English "the", we use
a primitive way of forming definite noun phrases.
<pre>
  DefNP : CN -> NP ;
</pre>
So we can deal with the fact that only complex common nouns
get a determiner word.
<pre>
  DefNP storbil = case storbil.isComplex of {
    True  => DetCN den_Det   storbil ;
    False => DetCN empty_Det storbil
    }
     where
       empty_Det = {s = \\_ => [] ; n = Sg ; b = DefP Def} ;
</pre>
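<p>

The role of <tt>isComplex</tt> is perhaps easiest to see in a stripped-down Python sketch that keeps only the singular definite nominative. The four noun entries are hand-written test data, and the function names are transliterations of the GF names.

```python
# DefNP: a complex common noun gets the article "den/det",
# a one-word noun gets an empty determiner.

def det_cn(det, cn):
    words = [det["s"][cn["g"]], cn["s"][det["n"]][det["b"]]]
    return " ".join(w for w in words if w)   # skip the empty determiner

def def_np(cn):
    den = {"s": {"Utr": "den", "Neutr": "det"}, "n": "Sg", "b": "DefP Def"}
    empty = {"s": {"Utr": "", "Neutr": ""}, "n": "Sg", "b": "DefP Def"}
    return det_cn(den if cn["isComplex"] else empty, cn)

def cn(form, gender, is_complex):
    return {"s": {"Sg": {"DefP Def": form}}, "g": gender, "isComplex": is_complex}

bil = cn("bilen", "Utr", False)
stor_bil = cn("stora bilen", "Utr", True)
hus = cn("huset", "Neutr", False)
stort_hus = cn("stora huset", "Neutr", True)

for noun in (bil, stor_bil, hus, stort_hus):
    print(def_np(noun))
```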


<!-- NEW -->

<h2>Syntax case study: Swedish sentence structure</h2>

Data: freedom in word order in main clause
<pre>
   jag har inte sett dig idag
   dig har jag inte sett idag
   idag har jag inte sett dig
   inte har jag sett dig idag
  *sett har jag inte dig idag
   sett dig har jag inte idag
</pre>
Rigid order in questions...
<pre>
  har jag inte sett dig idag
</pre>
... and in subordinate clauses
<pre>
  jag inte har sett dig idag
</pre>


<!-- NEW -->

<h3>The topological model</h3>

P. Diderichsen, for Danish, 1946; here acc. to Jörgensen & Svensson,
<i>Nusvensk grammatik</i> (Gleerups, 2001).

<p>

A sentence (<tt>Sats</tt>) consists
of different <b>fields</b>
<pre>
    Nexus Field       Content Field
    -----------       -------------
    V1   N1    A1      V2   N2    A2
    har  jag  inte    sett  dig  idag
</pre>


<!-- NEW -->

<h3>Main clause, inverted clause, subordinate clause</h3>

The declarative <b>main clause</b> has an initial <b>fundament</b>,
in which (almost) any of the fields (except V1) may occur:
<pre>
    Fundament  Nexus Field       Content Field
    ---------  -----------       -------------
               V1   N1    A1      V2   N2    A2
    jag        har  _    inte    sett  dig  idag
    inte       har  jag  _       sett  dig  idag
    dig        har  jag  inte    sett  _    idag
    idag       har  jag  inte    sett  dig  _
</pre>
The inverted clause has a rigid order
<pre>
    V1   N1    A1      V2   N2    A2
    har  jag  inte    sett  dig  idag
</pre>
The subordinate clause has another rigid order
<pre>
    N1   A1    V1      V2   N2    A2
    jag  inte  har    sett  dig  idag
</pre>


<!-- NEW -->

<h3>The <tt>Sats</tt> data type</h3>

What would be more natural than to use a record type in GF?
<pre>
  Sats = {
    s1 : Str ;  -- V1
    s2 : Str ;  -- N1
    s3 : Str ;  -- A1
    s4 : Str ;  -- V2
    s5 : Str ;  -- N2
    s6 : Str    -- A2
    } ;
</pre>
A "finished" sentence depends on one parameter with three values,
<pre>
  S = {s : Order => Str} ;

  Order = Main | Inv | Sub ;
</pre>


<!-- NEW -->

<h3>Building clauses from <tt>Sats</tt></h3>

<pre>
   SSats sats =
      let
        har  = sats.s1 ;
        jag  = sats.s2 ;
        inte = sats.s3 ;
        sett = sats.s4 ;
        dig  = sats.s5 ;
        idag = sats.s6
      in {s = table {
        Main => variants {
          jag  ++ har ++ inte ++ sett ++ dig  ++ idag ;
          inte ++ har ++ jag  ++ sett ++ dig  ++ idag ;
          dig  ++ har ++ jag  ++ inte ++ sett ++ idag ;
          idag ++ har ++ jag  ++ inte ++ sett ++ dig
          } ;
        Inv => har ++ jag  ++ inte ++ sett ++ dig ++ idag ;
        Sub => jag ++ inte ++ har  ++ sett ++ dig ++ idag
        }
      } ;
</pre>
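<p>

The same table can be tried out in Python. In this sketch the six fields are attributes of a small data class, the GF <tt>variants</tt> are modelled as a list of alternatives, and the field names <tt>v1 ... a2</tt> follow the topological labels rather than <tt>s1 ... s6</tt>.

```python
from dataclasses import dataclass

@dataclass
class Sats:
    v1: str   # finite verb
    n1: str   # subject
    a1: str   # sentence adverb
    v2: str   # non-finite verb
    n2: str   # object
    a2: str   # content adverb

def join(*fields):
    return " ".join(f for f in fields if f)   # empty fields are skipped

def s_sats(s):
    return {
        "Main": [join(s.n1, s.v1, s.a1, s.v2, s.n2, s.a2),
                 join(s.a1, s.v1, s.n1, s.v2, s.n2, s.a2),
                 join(s.n2, s.v1, s.n1, s.a1, s.v2, s.a2),
                 join(s.a2, s.v1, s.n1, s.a1, s.v2, s.n2)],
        "Inv": [join(s.v1, s.n1, s.a1, s.v2, s.n2, s.a2)],
        "Sub": [join(s.n1, s.a1, s.v1, s.v2, s.n2, s.a2)],
    }

sats = Sats("har", "jag", "inte", "sett", "dig", "idag")
for order, alternatives in s_sats(sats).items():
    for sentence in alternatives:
        print(order, sentence)
```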


<!-- NEW -->

<h3>Some refinements</h3>

<li> Add tense variation to sentences; compound tenses have more
variants than simple ones.
<pre>
  festat har jag igår
  sova ska jag idag
</pre>
<li> Add boolean features telling which places are occupied; certain
combinations can then be blocked or enabled.
<pre>
   sovit har jag idag
  *sett har jag dig idag
   sett dig har jag idag
</pre>
<li> Add an <b>extraposition</b> field to enable subcategorization patterns
as in
<pre>
   du har sagt mig att han kommer idag
   att han kommer idag har du sagt mig
</pre>


<!-- NEW -->

<h3>Construction of <tt>Sats</tt></h3>

Following the general principle of <b>data abstraction</b>,
we treat <tt>Sats</tt> as an abstract data type.

<p>

This means that we don't write the records explicitly, but
use a handful of functions for constructing them:
<pre>
  mkSats : NounPhrase -> Verb -> Sats = \subj,verb ->
     let
       harsovit = verbSForm verb Act
     in
     {s1 = \\sf => (harsovit sf).fin ;
      s2 = subj.s ! PNom ;
      s3 = negation ;
      s4 = \\sf => (harsovit sf).inf ++ verb.s1 ;
      s5, s6, s7 = [] ;
      e3,e4,e5,e6,e7 = False
      } ;
</pre>


<!-- NEW -->

<h3>More constructors of <tt>Sats</tt></h3>

<pre>
  insertObject : Sats -> Str -> Sats = \sats, obj ->
     {s5 = sats.s5 ++ obj ;
      s1 = sats.s1 ; s2 = sats.s2 ; s3 = sats.s3 ; s4 = sats.s4 ; s6 = sats.s6 ; s7 = sats.s7 ;
      e5 = True ;
      e3 = sats.e3 ; e4 = sats.e4 ; e6 = sats.e6 ; e7 = sats.e7
      } ;

  insertExtrapos : Sats -> Str -> Sats = ...

  mkSatsObject : NounPhrase -> Verb -> Str -> Sats = \subj,verb,obj ->
    insertObject (mkSats subj verb) obj ;

  mkSatsCopula : NounPhrase -> Str -> Sats = \subj,obj ->
    mkSatsObject subj (verbVara ** {s1 = []}) obj ;
</pre>
N.B. these would be nicer to define if GF had record field overwriting:
<pre>
  insertObject : Sats -> Str -> Sats = \sats, obj ->
    sats ** {s5 = sats.s5 ++ obj ; e5 = True} ;
</pre>
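<p>

For comparison, this is what such a functional record update looks like in Python, where <tt>dataclasses.replace</tt> copies a record and overrides selected fields. The field set is cut down to what the example needs; it is a sketch of the wished-for style, not a statement about GF.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Sats:
    s1: str = ""      # V1
    s2: str = ""      # N1
    s5: str = ""      # N2, the object field
    e5: bool = False  # is the object field occupied?

def insert_object(sats, obj):
    # Only the changed fields are mentioned; the rest are copied over.
    return replace(sats, s5=(sats.s5 + " " + obj).strip(), e5=True)

du_ser = Sats(s1="ser", s2="du")
du_ser_mig = insert_object(du_ser, "mig")
print(du_ser_mig)
```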


<!-- NEW -->

<h3>Verb subcategorization patterns formalized</h3>

<pre>
    -- du sover
    SatsV = mkSats ;

    -- du ser mig
    SatsV2 subj verb obj =
      mkSatsObject subj verb (verb.s2 ++ obj.s ! PAcc) ;

    -- du föredrar honom framför mig
    SatsV3 subj verb obj1 obj2 =
      mkSatsObject subj verb (verb.s2 ++ obj1.s ! PAcc ++ verb.s3 ++ obj2.s ! PAcc) ;

    -- du säger att det regnar
    SatsVS subj verb sent =
      insertExtrapos (mkSats subj verb) (optStr infinAtt ++ sent.s ! Sub) ;

    -- du undrar vem som kommer
    SatsVQ subj verb quest =
      insertExtrapos (mkSats subj verb) (quest.s ! IndirQ) ;
</pre>


<!-- NEW -->

<h3>Verb subcategorization patterns (continued)</h3>

<pre>
    -- du berättade mig att det hade regnat
    SatsV2S subj verb obj sent =
      insertExtrapos
        (mkSatsObject subj verb (verb.s2 ++ obj.s ! PAcc))
        (optStr infinAtt ++ sent.s ! Sub) ;

    -- du frågar mig om det regnar
    SatsV2Q subj verb obj quest =
      insertExtrapos
        (mkSatsObject subj verb (verb.s2 ++ obj.s ! PAcc))
        (quest.s ! IndirQ) ;

    -- du är trött
    SatsAP subj adj =
      mkSatsCopula subj (adj.s ! predFormAdj subj.g subj.n ! Nom) ;
</pre>


<!-- NEW -->

<h3>Coverage of verb patterns in Swedish Academy Grammar</h3>

The <a href="http://www.ling.gu.se/~karinc/G3/karin.html">comparison</a>
by
<a href="http://www.ling.gu.se/~karinc/">Karin Cavallin</a>
has given us guidelines.

<p>

We have tried to add at least those patterns that are meaningful in
the language-independent API.


<!-- NEW -->

<h3>Remaining problems</h3>

Building the fundament when there are many adverbs in the A1 and A2 slots.
<pre>
  på torget har jag sett dig idag
  idag har jag sett dig på torget
  ? idag på torget har jag sett dig
</pre>
Interrogative and relative pronouns
<pre>
  som jag har sett idag
  Vem har du sett idag?
  När och var har du sett henne?
</pre>
The resource grammar has an old treatment without topology: can we
make it nicer?


<!-- NEW -->

<h2>Danish and Norwegian through parametrization</h2>

Swedish, Danish, and Norwegian are "pretty similar".
There are differences in
<ul>
<li> vocabulary: <tt>flicka - pige - jente</tt>
<li> orthography: <tt>kaka - kage - kake</tt>
<li> some parameters, e.g. Norwegian's three genders
<li> determiner syntax:
  <pre>
  den stora bilen, denna stora bil
  den store bil,   denne store bil
  den store bilen, denne store bilen
  </pre>
<li> special constructions, e.g. Norwegian's <tt>bilen min</tt>
</ul>
Things like agreement and word order are quite the same, at least in
the resource API fragment.

<p>

Can we abstract away from the differences and build the three
grammars together <i>without copy and paste</i>?


<!-- NEW -->

<h3>Parametrized modules</h3>

The ultimate technique for avoiding copy and paste.
Here's a simple linguistic example: case and agreement,
<pre>
  interface Agreement = {
    param Agr ; Case ;
    oper subject : Case
    }

  incomplete concrete PredAgr of Pred = {
    lincat
      NP = {s : Case => Str ; a : Agr} ;
      VP = {s : Agr => Str} ;
    lin
      PredVP np vp = {s = np.s ! subject ++ vp.s ! np.a} ;
    }

  instance AgreementFin of Agreement = {
    param Agr  = {n : Number ; p : Person} ;
    param Case = Nom | Gen | ... | Instr ; -- 14 values
    oper subject = Nom ;
    }

  concrete PredFin of Pred = PredAgr with (Agreement = AgreementFin) ;
</pre>
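<p>

A loose Python analogue may help: the incomplete module becomes a function that takes an instance of the interface and returns the finished rules. The Finnish entries <tt>mina</tt> and <tt>nukkua</tt> are invented test data, and <tt>SimpleNamespace</tt> merely stands in for a module.

```python
from types import SimpleNamespace

def pred_agr(agreement):
    """Build the PredVP rule from any instance of the Agreement interface."""
    def pred_vp(np, vp):
        # subject case comes from the instance, agreement from the NP
        return {"s": np["s"][agreement.subject] + " " + vp["s"][np["a"]]}
    return SimpleNamespace(pred_vp=pred_vp)

# instance AgreementFin: the subject case is the nominative
agreement_fin = SimpleNamespace(subject="Nom")

# concrete PredFin = PredAgr with (Agreement = AgreementFin)
pred_fin = pred_agr(agreement_fin)

mina = {"s": {"Nom": "minä", "Gen": "minun"}, "a": ("Sg", "P1")}
nukkua = {"s": {("Sg", "P1"): "nukun", ("Sg", "P3"): "nukkuu"}}

print(pred_fin.pred_vp(mina, nukkua)["s"])  # minä nukun
```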


<!-- NEW -->

<h3>The Scandinavian module structure</h3>

<center>
<img src="ScanMod.gif">
</center>
<font size=2>green = instantiation (no work) ; yellow = instance (some work) ;
red = specific (full work)
</font>

<!-- NEW -->

<h3>Differences in <tt>Types</tt></h3>

Scandinavian interface:
<pre>
  param
    Gender ;
    NounGender ;
</pre>
Swedish instance:
<pre>
    Gender = Utr | Neutr ;
    NounGender = NUtr Sex | NNeutr ;
</pre>
Danish instance:
<pre>
    Gender = Utr | Neutr ;
    NounGender : Type = Gender ;
</pre>
Norwegian instance:
<pre>
    Gender = Masc | Fem | Neutr ;
    NounGender : Type = Gender ;
</pre>


<!-- NEW -->

<h3>Differences in <tt>Syntax</tt></h3>

Scandinavian interface (fragment):
<pre>
  oper
    specDefPhrase : Species ;
    verbVara, verbHava, verbSkola, verbFinnas : V ;
    relPron  : RP ;
    comparÄn, infinAtt, negInte : Str ;
</pre>
Swedish instance:
<pre>
    specDefPhrase = Def ;
    verbVara      = vara_V ; ...
    relPron       = relPronForms "som" "vars" ;
    comparÄn      = "än" ;
</pre>
Danish instance:
<pre>
    specDefPhrase = Indef ;
    verbVara      = være_V ; ...
    relPron       = relPronForms "som" "hvis" ;
    comparÄn      = "end" ;
</pre>
Norwegian instance:
<pre>
    specDefPhrase = Def ;
    verbVara      = være_V ; ...
    relPron       = relPronForms "som" "hvis" ;
    comparÄn      = "enn" ;
</pre>


<!-- NEW -->

<h3>Example: definite noun phrases</h3>

Here is the definite article, with just one parameter in place:
<pre>
  DefNP storbil = case storbil.isComplex of {
    True  => DetCN den_Det   storbil ;
    False => DetCN empty_Det storbil
    }
     where
       empty_Det = {s = \\_ => [] ; n = Sg ; b = DefP specDefPhrase} ;
</pre>
For <i>denna</i>, which is in the lexicon, we just have different entries
<pre>
  {s = genForms "denna" "detta" ; n = Sg ; b = DefP Indef}         -- Swe
  {s = genForms "denne" "dette" ; n = Sg ; b = DefP specDefPhrase} -- Dan, Nor
</pre>


<!-- NEW -->

<h3>Can we generate the lexicon?</h3>

Idea:
<pre>
  word + paradigm in Swedish ---> word + paradigm in Danish/Norwegian
</pre>
The word is transformed by "sound laws", the paradigm by a general correspondence.
Example:
<pre>
  decl1 "jacka"  ---> decl1 "jakke"
</pre>
This is computed to
<pre>
 {s : SubstForm => Str = table {
    SF Sg Indef Nom => "jakke" ;
    SF Sg Indef Gen => "jakkes" ;
    SF Sg Def Nom   => "jakka" ;
    SF Sg Def Gen   => "jakkas" ;
    SF Pl Indef Nom => "jakker" ;
    SF Pl Indef Gen => "jakkers" ;
    SF Pl Def Nom   => "jakkene" ;
    SF Pl Def Gen   => "jakkenes"
  } ;
  g = Fem
 }
</pre>
Notice: we do <i>not</i> need to assume translation equivalence.
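<p>

A minimal Python sketch of the idea follows. The two rewrite rules (<i>ck</i> to <i>kk</i>, final <i>a</i> to <i>e</i>) are assumptions chosen only to reproduce the examples on these slides; they are not the actual sound laws, and the paradigm name is passed through unchanged as the simplest possible correspondence.

```python
# Transfer a lexicon entry (paradigm, word) from Swedish to Norwegian.
# The rules below are illustrative guesses, not a real set of sound laws.

SOUND_LAWS_NOR = [("ck", "kk")]

def swe_to_nor(word):
    for old, new in SOUND_LAWS_NOR:
        word = word.replace(old, new)
    if word.endswith("a"):             # final -a becomes -e
        word = word[:-1] + "e"
    return word

def transfer(entry):
    paradigm, word = entry
    return (paradigm, swe_to_nor(word))

print(transfer(("decl1", "jacka")))  # ('decl1', 'jakke')
print(transfer(("decl1", "kaka")))   # ('decl1', 'kake')
```

Any output of such a transfer would still need manual checking, as with the extracted lexicon.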



</body>
</html>