GF Resources for Swedish

Språkdata Seminar, Gothenburg, 1 March 2005

Aarne Ranta

aarne@cs.chalmers.se

Plan

Introduction to resource grammars

Swedish morphology and lexicon in GF

Syntax case study: Swedish sentence structure

Danish and Norwegian through parametrization

Swedish morphology and lexicon

Lexicon: list of words with inflection and other morphological information (gender of nouns etc).

Paradigms: set of functions for extending the lexicon.

Parts of speech

A word class is a record type with parametric features (arguments of inflection tables) and inherent features (fixed fields of the record). For example, nouns are the type
  N = {s : Number => Species => Case => Str ; g : Gender} ;
where
  param
    Species = Indef | Def ;
    Number  = Sg | Pl ;
    Case    = Nom | Gen ;

Defining a lexical unit

Every lexical unit has a word class as its type. The type checker of GF verifies that all and only the relevant information of the unit is given. For instance, an entry for the noun bil ("car") looks as follows.
  bil =
  {s = table {
     Sg => table {
       Indef => table {Nom => "bil"     ; Gen => "bils" } ;
       Def   => table {Nom => "bilen"   ; Gen => "bilens" }
       } ;
     Pl => table {
       Indef => table {Nom => "bilar"   ; Gen => "bilars" } ;
       Def   => table {Nom => "bilarna" ; Gen => "bilarnas" }
       }
     } ;
   g = Utr
  }
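
A form is picked from such a record by projecting the field s and applying the table with the selection operator !. The following is a minimal sketch that wraps the definitions above in a resource module so they can be loaded as one file. The module name BilDemo, the oper name bilarnas and the declaration of Gender (with the values Utr and Neutr used on these slides) are assumptions added for illustration.

```gf
resource BilDemo = {
  param
    Species = Indef | Def ;
    Number  = Sg | Pl ;
    Case    = Nom | Gen ;
    Gender  = Utr | Neutr ;  -- assumed declaration; the slides only use the values

  oper
    -- the noun type from the slide, as a type synonym
    N : Type = {s : Number => Species => Case => Str ; g : Gender} ;

    -- the full entry for "bil", as given above
    bil : N =
      {s = table {
         Sg => table {
           Indef => table {Nom => "bil"     ; Gen => "bils" } ;
           Def   => table {Nom => "bilen"   ; Gen => "bilens" }
           } ;
         Pl => table {
           Indef => table {Nom => "bilar"   ; Gen => "bilars" } ;
           Def   => table {Nom => "bilarna" ; Gen => "bilarnas" }
           }
         } ;
       g = Utr
      } ;

    -- selecting one form: plural, definite, genitive
    bilarnas : Str = bil.s ! Pl ! Def ! Gen ;
}
```

After importing the file with the -retain flag, the command cc bilarnas should compute the string bilarnas, and cc bil.g should give Utr.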

The Golden Rule of Functional Programming

Whenever you find yourself programming by "copy and paste", write a function instead.

Thus do not write

  gran =
  {s = table {
     Sg => table {
       Indef => table {Nom => "gran"     ; Gen => "grans" } ;
       Def   => table {Nom => "granen"   ; Gen => "granens" }
       } ;
     Pl => table {
       Indef => table {Nom => "granar"   ; Gen => "granars" } ;
       Def   => table {Nom => "granarna" ; Gen => "granarnas" }
       }
     } ;
   g = Utr
  }

Inflectional paradigms as functions

Instead, write a paradigm that can be applied to any word that is "inflected in the same way":
  decl2 : Str -> N = \bil ->
  {s = table {
     Sg => table {
       Indef => table {Nom => bil + ""     ; Gen => bil + "s" } ;
       Def   => table {Nom => bil + "en"   ; Gen => bil + "ens" }
       } ;
     Pl => table {
       Indef => table {Nom => bil + "ar"   ; Gen => bil + "ars" } ;
       Def   => table {Nom => bil + "arna" ; Gen => bil + "arnas" }
       }
     } ;
   g = Utr
  }
This function can be used over and over again:
  bil  = decl2 "bil" ;
  gran = decl2 "gran" ;
  dag  = decl2 "dag" ;

High-level definition of paradigms

Recall: functions instead of copy-and-paste!

First define (for each word class) a worst-case function:

  mkN : (apa,apan,apor,aporna : Str) -> N = \apa,apan,apor,aporna ->
  {s = table {
     Sg => table {
       Indef => mkCase apa ;
       Def   => mkCase apan
       } ;
     Pl => table {
       Indef => mkCase apor ;
       Def   => mkCase aporna
       }
     } ;
   g = case last apan of {
         "n" => Utr ;
         _   => Neutr
         }
  } ;
where we uniformly produce the genitive by
  mkCase : Str -> Case => Str = \f -> table {
      Nom => f ;
      Gen => f + case last f of {
        "s" | "x" => [] ;
        _ => "s"
        }
      } ;

High-level definition of paradigms

Then define, for instance, the five declensions as follows:
  decl1 : Str -> N = \apa -> let ap = init apa in
    mkN apa (apa + "n") (ap + "or") (ap + "orna") ;

  decl2 : Str -> N = \bil -> 
    mkN bil (bil + "en") (bil + "ar") (bil + "arna") ;

  decl3 : Str -> N = \fil -> 
    mkN fil (fil + "en") (fil + "er") (fil + "erna") ;

  decl4 : Str -> N = \rike -> 
    mkN rike (rike + "t") (rike + "n") (rike + "na") ;

  decl5 : Str -> N = \lik -> 
    mkN lik (lik + "et")  lik  (lik + "en") ;
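
To try the paradigms out, the pieces can be collected into one resource module. This is a sketch under stated assumptions, not the library source: the module name DeclDemo and the Gender declaration are added here, and only three of the five declensions are repeated. The functions init and last are taken from the standard Prelude module.

```gf
resource DeclDemo = open Prelude in {
  param
    Species = Indef | Def ;
    Number  = Sg | Pl ;
    Case    = Nom | Gen ;
    Gender  = Utr | Neutr ;  -- assumed declaration

  oper
    N : Type = {s : Number => Species => Case => Str ; g : Gender} ;

    -- genitive: add "s" unless the form already ends in "s" or "x"
    mkCase : Str -> Case => Str = \f -> table {
      Nom => f ;
      Gen => f + case last f of {
        "s" | "x" => [] ;
        _ => "s"
        }
      } ;

    -- worst-case function: four forms, gender guessed from the definite singular
    mkN : (apa,apan,apor,aporna : Str) -> N = \apa,apan,apor,aporna ->
      {s = table {
         Sg => table {Indef => mkCase apa  ; Def => mkCase apan} ;
         Pl => table {Indef => mkCase apor ; Def => mkCase aporna}
         } ;
       g = case last apan of {
             "n" => Utr ;
             _   => Neutr
             }
      } ;

    decl1 : Str -> N = \apa -> let ap = init apa in
      mkN apa (apa + "n") (ap + "or") (ap + "orna") ;

    decl2 : Str -> N = \bil ->
      mkN bil (bil + "en") (bil + "ar") (bil + "arna") ;

    decl4 : Str -> N = \rike ->
      mkN rike (rike + "t") (rike + "n") (rike + "na") ;
}
```

With the module imported using the -retain flag, cc (decl4 "rike").s ! Pl ! Def ! Nom should compute rikena, cc (decl1 "apa").g should give Utr, and cc mkCase "sax" ! Gen should return sax unchanged.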

What paradigms are there?

Swedish nouns traditionally have 5 declensions. But each of them has slight variations. For instance, the "2nd declension" has the following:
  gosse  - gossar  -- 211
  nyckel - nycklar -- 231
  seger  - segrar  -- 232
  öken   - öknar   -- 233
  hummer - humrar  -- 238
  kam    - kammar  -- 241
  mun    - munnar  -- 243
and many more (S. Hellberg, The Morphology of Present-Day Swedish, Almqvist & Wiksell, Stockholm, 1978). In addition, we have at least
  mås - mås -- genitive form without s
  sax - sax 

High-level access to paradigms

The "naïve user" does not want to go through 500 noun paradigms and pick the right one.

A much more efficient method is the one used in dictionaries: give two (or more) forms instead of one. Our "dictionary heuristic function" covers the following cases:

  flicka - flickor
  kor    - kor     (koret)
  ko     - kor     (kon)
  ros    - rosor   (rosen)
  bil    - bilar
  nyckel - nycklar
  hummer - humrar
  rike   - riken
  lik    - lik     (liket)
  lärare - lärare  (läraren)

The definition of the dictionary heuristic

reg2Noun : Str -> Str -> Subst = \bil,bilar -> 
   let 
     l  = last bil ;
     b  = Predef.tk 2 bil ; 
     ar = Predef.dp 2 bilar 
   in 
   case ar of {
      "or" => case l of {
         "a" => decl1Noun bil ;
         "r" => sLik bil ;
         "o" => mkNoun bil (bil + "n")  bilar (bilar + "na") ;
         _   => mkNoun bil (bil + "en") bilar (bilar + "na")
         } ;
      "ar" => ifTok Subst (Predef.tk 2 bilar) bil 
                 (decl2Noun bil)
                 (case l of {
                    "e" => decl2Noun bil ;
                    _   => mkNoun bil (bil + "n") bilar (bilar + "na") 
                    }
                 ) ;
      "er" => decl3Noun bil ;
      "en" => ifTok Subst bil bilar (sLik bil) (sRike bil) ; -- ben-ben
      _ => ifTok Subst bil bilar (
             case Predef.dp 3 bil of {
                "are" => sKikare (init bil) ; 
                _ => decl5Noun bil
                }
             )
             (decl5Noun bil) --- rest case with lots of garbage
      } ; 

When in doubt...

Test in GF by generating the table with the cc command:
  > cc mk2N "öken" "öknar"
  {s = table Number {
    Sg => table {
      Indef => table Case {
        Nom => "öken" ;
        Gen => "ökens"
      } ;
      Def => table Case {
        Nom => "ökenn" ;
        Gen => "ökenns"
      }
    ...
  }
Use the worst-case function if the heuristic does not work, as here, where it produces ökenn instead of öknen:
  mkN "öken" "öknen" "öknar" "öknarna"

The module ParadigmsSwe

For each main category (N, A, V): a worst-case function and a couple of "regular" patterns.
  mkN  : (apa,apan,apor,aporna : Str) -> N ;
  mk2N : (nyckel,nycklar : Str) -> N ;

  mkV    : (supa,super,sup,söp,supit,supen : Str) -> V ;
  regV   : (tala : Str) -> V ;
  mk2V   : (leka,leker : Str) -> V ;
  irregV : (dricka, drack, druckit  : Str) -> V ;
Construction functions for subcategorization.
  mkV2  : V -> Preposition -> V2 ;
  dirV2 : V -> V2 ;
  mkV3  : V -> Preposition -> Preposition -> V3 ;

Morphology extraction

Idea: search for characteristic forms of paradigms in a corpus.
  paradigm decl1
    = ap+"a"
      {ap+"a" & ap+"or" };
For instance, if you find klocka and klockor, add
  klocka_N = decl1 "klocka" ;
to the lexicon.

The notation for extraction and its implementation are developed by Markus Forsberg and Harald Hammarström.

False positives

Problem: false positives, e.g. bra - bror.

Solution: restrict stem with a regular expression

  paradigm decl1 [ap : char* vowel char*]
    = ap+"a"
      {ap+"a" & ap+"or" };
In general, exclude stems shorter than 3 characters.

It is necessary to check the results manually.

Patterns for what?

"Irregular patterns" are possible, e.g.
  paradigm vEI [sm:OneOrMore, t:OneOrMore]
    = sm+"i"+t+"a"
      {sm+"i"+t+"a" & (sm+"e"+t | sm+"i"+t+"it")} ;
For rare patterns, it is more productive to build the corresponding part of the lexicon manually.

Current Swedish resource lexicon

49,749 lemmas (1,000 manual, rest extracted), 605,453 word forms. No subcategorization information.

Uses the Functional Morphology (FM) format, which can be translated to GF, XFST, LEXC, MySQL, ...

FM's "native" analysis engine is based on a trie and includes compound analysis using algorithms from G. Huet's Zen Toolkit. Analysis speed is 12,000 words per minute with compound analysis, 50,000 without (on an Intel M 1.5 GHz laptop).

Syntax case study: Swedish noun phrases

Problem: describe agreement and inheritance of definiteness when a determiner is added to a common noun, possibly modified by an adjective:

en bil
bilen

en stor bil
den stora bilen

denna bil
denna stora bil

Abstract syntax for noun phrases

The abstract syntax of a GF grammar defines what grammatical structures there are, without telling how they are realized in a concrete language.

The relevant fragment consists of 5 categories and 4 functions (A, the lexical category of adjectives, comes from the lexicon):

  cat 
    N ;   -- simple (lexical) common noun, e.g. "bil"
    CN ;  -- possibly complex common noun, e.g. "stor bil"
    Det ; -- determiner,                   e.g. "denna"
    NP ;  -- noun phrase,                  e.g. "bilen"
    AP ;  -- adjectival phrase,            e.g. "stor"
  fun
    UseN  : N -> CN ;
    UseA  : A -> AP ;
    DetCN : Det -> CN -> NP ;
    ModA  : A -> CN -> CN ;

Types of complex nouns and noun phrases

Just as for words, we assign linearization types to phrase categories. They are similar to the lexical types, but often carry some extra information.
  lincat
    CN  = {s : Number => SpeciesP => Case => Str ; g : Gender ; isComplex : Bool} ;
    NP  = {s : NPForm => Str ; g : Gender ; n : Number ; p : Person} ;
    Det = {s : Gender => Str ; n : Number ; b : SpeciesP} ;
    AP  = {s : AdjFormPos => Case => Str} ; 
Here we use some new parameter types:
  param
    SpeciesP   = IndefP | DefP Species ;  
    NPForm     = PNom | PAcc | PGen GenNum ;
    GenNum     = ASg Gender | APl ;
    AdjFormPos = Strong GenNum | Weak ;

Building noun phrases with a determiner

Mutual agreement:
  DetCN : Det -> CN -> NP = \en, man -> 
    {s = \\c => en.s ! man.g ++ 
                man.s ! en.n ! en.b ! npCase c ;
     g = genNoun man.g ; 
     n = en.n ; 
     p = P3
    } ;
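
The mutual agreement can be tried out in a reduced setting. The sketch below is a simplification under stated assumptions, not the resource grammar itself: the module name NPDemo and the helper species are hypothetical, CN omits the isComplex field, and NP is reduced to a Case table, so that npCase and genNoun are not needed. The reading that DefP Indef marks a definite determiner taking an indefinite noun form (as in denna bil) is an interpretation of the parameter types above.

```gf
resource NPDemo = {
  param
    Species  = Indef | Def ;
    Number   = Sg | Pl ;
    Case     = Nom | Gen ;
    Gender   = Utr | Neutr ;          -- assumed declaration
    SpeciesP = IndefP | DefP Species ;

  oper
    N   : Type = {s : Number => Species  => Case => Str ; g : Gender} ;
    CN  : Type = {s : Number => SpeciesP => Case => Str ; g : Gender} ; -- simplified
    Det : Type = {s : Gender => Str ; n : Number ; b : SpeciesP} ;
    NP  : Type = {s : Case => Str ; g : Gender ; n : Number} ;          -- simplified

    -- hypothetical helper: the noun form that a determiner asks for
    species : SpeciesP -> Species = \b -> case b of {
      IndefP => Indef ;
      DefP s => s
      } ;

    useN : N -> CN = \bil ->
      {s = \\n,b,c => bil.s ! n ! species b ! c ; g = bil.g} ;

    -- the determiner agrees with the noun's gender;
    -- the noun takes its number and species from the determiner
    detCN : Det -> CN -> NP = \en,man ->
      {s = \\c => en.s ! man.g ++ man.s ! en.n ! en.b ! c ;
       g = man.g ;
       n = en.n
      } ;

    bil_N : N =
      {s = table {
         Sg => table {
           Indef => table {Nom => "bil"     ; Gen => "bils" } ;
           Def   => table {Nom => "bilen"   ; Gen => "bilens" }
           } ;
         Pl => table {
           Indef => table {Nom => "bilar"   ; Gen => "bilars" } ;
           Def   => table {Nom => "bilarna" ; Gen => "bilarnas" }
           }
         } ;
       g = Utr
      } ;

    en_Det    : Det =
      {s = table {Utr => "en"    ; Neutr => "ett"}   ; n = Sg ; b = IndefP} ;
    denna_Det : Det =
      {s = table {Utr => "denna" ; Neutr => "detta"} ; n = Sg ; b = DefP Indef} ;
}
```

With this module imported using the -retain flag, cc (detCN en_Det (useN bil_N)).s ! Nom should give the tokens en bil, and cc (detCN denna_Det (useN bil_N)).s ! Gen should give denna bils.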

Syntax case study: Swedish sentence structure

Data: freedom in word order in main clause

jag har inte sett dig idag
dig har jag inte sett idag
idag har jag inte sett dig
inte har jag sett dig idag
sett har jag inte dig idag (??)
sett dig har jag inte idag

Rigid order in questions...

har jag inte sett dig idag

... and in subordinate clauses

jag inte har sett dig idag

The topological model

The Sats data type

Building clauses from Sats

Construction of Sats

Notice: we want to treat Sats as an abstract data type.

Verb subcategorization patterns formalized

Adding adverbials

Coverage of verb patterns in Swedish Academy Grammar

Remaining problems

Danish and Norwegian through parametrization