GF Resources for Swedish

Språkdata Seminar, Gothenburg, 1 March 2005

Aarne Ranta

aarne@cs.chalmers.se

Plan

Introduction to resource grammars

Swedish morphology and lexicon in GF

Syntax case study: Swedish sentence structure

Danish and Norwegian through parametrization

Swedish morphology and lexicon

Lexicon: list of words with inflection and other morphological information (gender of nouns, etc.).

Paradigms: set of functions for extending the lexicon.

Parts of speech

A word class is a record type with parametric and inherent features (parameters). For example, nouns have the type
  N = {s : Number => Species => Case => Str ; g : Gender} ;
where s is the inflection table over the parametric features, g is the inherent gender, and
  param
    Species = Indef | Def ;
    Number  = Sg | Pl ;
    Case    = Nom | Gen ;
    Gender  = Utr | Neutr ;

Defining a lexical unit

Every lexical unit has a word class as its type. The type checker of GF verifies that all and only the relevant information of the unit is given. For instance, an entry for the noun bil ("car") looks as follows.
  bil =
  {s = table {
     Sg => table {
       Indef => table {Nom => "bil"     ; Gen => "bils" } ;
       Def   => table {Nom => "bilen"   ; Gen => "bilens" }
       } ;
     Pl => table {
       Indef => table {Nom => "bilar"   ; Gen => "bilars" } ;
       Def   => table {Nom => "bilarna" ; Gen => "bilarnas" }
       }
     } ;
   g = Utr
  }

The Golden Rule of Functional Programming

Whenever you find yourself programming by "copy and paste", write a function instead.

Thus do not write

  gran =
  {s = table {
     Sg => table {
       Indef => table {Nom => "gran"     ; Gen => "grans" } ;
       Def   => table {Nom => "granen"   ; Gen => "granens" }
       } ;
     Pl => table {
       Indef => table {Nom => "granar"   ; Gen => "granars" } ;
       Def   => table {Nom => "granarna" ; Gen => "granarnas" }
       }
     } ;
   g = Utr
  }

Inflectional paradigms as functions

Instead, write a paradigm that can be applied to any word that is "inflected in the same way":
  decl2 : Str -> N = \bil ->
  {s = table {
     Sg => table {
       Indef => table {Nom => bil + ""     ; Gen => bil + "s" } ;
       Def   => table {Nom => bil + "en"   ; Gen => bil + "ens" }
       } ;
     Pl => table {
       Indef => table {Nom => bil + "ar"   ; Gen => bil + "ars" } ;
       Def   => table {Nom => bil + "arna" ; Gen => bil + "arnas" }
       }
     } ;
   g = Utr
  }
This function can be used over and over again:
  bil  = decl2 "bil" ;
  gran = decl2 "gran" ;
  dag  = decl2 "dag" ;

High-level definition of paradigms

Recall: functions instead of copy-and-paste!

First define (for each word class) a worst-case function:

  mkN : (apa,apan,apor,aporna : Str) -> N = \apa,apan,apor,aporna ->
  {s = table {
     Sg => table {
       Indef => mkCase apa ;
       Def   => mkCase apan
       } ;
     Pl => table {
       Indef => mkCase apor ;
       Def   => mkCase aporna
       }
     } ;
   g = case last apan of {
         "n" => Utr ;
         _   => Neutr
         }
  } ;
where we uniformly produce the genitive by
  mkCase : Str -> Case => Str = \f -> table {
      Nom => f ;
      Gen => f + case last f of {
        "s" | "x" => [] ;
        _ => "s"
        }
      } ;
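
The genitive rule can also be tried out outside GF. The following is a minimal Python sketch of the same rule, not part of the GF library; the name mk_case is hypothetical and simply mirrors mkCase.

```python
def mk_case(form: str) -> dict:
    """Model of mkCase: map one nominative form to its Nom/Gen pair."""
    # The genitive adds "s" unless the form already ends in "s" or "x".
    suffix = "" if form[-1:] in ("s", "x") else "s"
    return {"Nom": form, "Gen": form + suffix}


for word in ["bil", "bilen", "mås", "sax"]:
    print(word, mk_case(word))
```

The rule is applied to each of the four nominative forms separately, so the worst-case function only ever needs nominatives as arguments.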

High-level definition of paradigms

Then define, for instance, the five declensions as follows:
  decl1 : Str -> N = \apa -> let ap = init apa in
    mkN apa (apa + "n") (ap + "or") (ap + "orna") ;

  decl2 : Str -> N = \bil -> 
    mkN bil (bil + "en") (bil + "ar") (bil + "arna") ;

  decl3 : Str -> N = \fil -> 
    mkN fil (fil + "en") (fil + "er") (fil + "erna") ;

  decl4 : Str -> N = \rike -> 
    mkN rike (rike + "t") (rike + "n") (rike + "na") ;

  decl5 : Str -> N = \lik -> 
    mkN lik (lik + "et")  lik  (lik + "en") ;
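
To see what the declension functions produce, here is a rough Python model of the worst-case function and the five declensions. It is only a sketch under simplifying assumptions: the names mk_n and decl1 to decl5 are hypothetical mirrors of the GF functions, and only the nominative forms are stored (the genitive is left out).

```python
def mk_n(apa, apan, apor, aporna):
    """Worst-case noun constructor: four nominative forms in, entry out."""
    # Gender is read off the definite singular: final "n" means utrum.
    gender = "Utr" if apan.endswith("n") else "Neutr"
    forms = {
        ("Sg", "Indef"): apa,
        ("Sg", "Def"): apan,
        ("Pl", "Indef"): apor,
        ("Pl", "Def"): aporna,
    }
    return {"s": forms, "g": gender}


def decl1(apa):
    ap = apa[:-1]  # drop the final vowel, like init in GF
    return mk_n(apa, apa + "n", ap + "or", ap + "orna")


def decl2(bil):
    return mk_n(bil, bil + "en", bil + "ar", bil + "arna")


def decl3(fil):
    return mk_n(fil, fil + "en", fil + "er", fil + "erna")


def decl4(rike):
    return mk_n(rike, rike + "t", rike + "n", rike + "na")


def decl5(lik):
    return mk_n(lik, lik + "et", lik, lik + "en")


for entry in (decl1("apa"), decl2("bil"), decl4("rike"), decl5("lik")):
    print(entry["g"], list(entry["s"].values()))
```

Each declension is a one-line call to the worst-case constructor, which is the point of the design: the table structure and the gender rule are written once.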

What paradigms are there?

Swedish nouns traditionally have 5 declensions. But each of them has slight variations. For instance, the "2nd declension" has the following:
  gosse  - gossar  -- 211
  nyckel - nycklar -- 231
  seger  - segrar  -- 232
  öken   - öknar   -- 233
  hummer - humrar  -- 238
  kam    - kammar  -- 241
  mun    - munnar  -- 243
and many more (S. Hellberg, The Morphology of Present-Day Swedish, Almqvist & Wiksell, Stockholm, 1978). In addition, we have at least the nouns whose genitive is identical to the nominative:
  mås - mås -- genitive form without s
  sax - sax -- likewise after x

High-level access to paradigms

The "naïve user" does not want to go through 500 noun paradigms and pick the right one.

A much more efficient method is the one used in dictionaries: give two (or more) forms instead of one. Our "dictionary heuristic function" covers the following cases:

  flicka - flickor
  kor    - kor     (koret)
  ko     - kor     (kon)
  ros    - rosor   (rosen)
  bil    - bilar
  nyckel - nycklar
  hummer - humrar
  rike   - riken
  lik    - lik     (liket)
  lärare - lärare  (läraren)

The definition of the dictionary heuristic

reg2Noun : Str -> Str -> Subst = \bil,bilar -> 
   let 
     l  = last bil ;
     b  = Predef.tk 2 bil ; 
     ar = Predef.dp 2 bilar 
   in 
   case ar of {
      "or" => case l of {
         "a" => decl1Noun bil ;
         "r" => sLik bil ;
         "o" => mkNoun bil (bil + "n")  bilar (bilar + "na") ;
         _   => mkNoun bil (bil + "en") bilar (bilar + "na")
         } ;
      "ar" => ifTok Subst (Predef.tk 2 bilar) bil 
                 (decl2Noun bil)
                 (case l of {
                    "e" => decl2Noun bil ;
                    _   => mkNoun bil (bil + "n") bilar (bilar + "na") 
                    }
                 ) ;
      "er" => decl3Noun bil ;
      "en" => ifTok Subst bil bilar (sLik bil) (sRike bil) ; -- ben-ben
      _ => ifTok Subst bil bilar (
             case Predef.dp 3 bil of {
                "are" => sKikare (init bil) ; 
                _ => decl5Noun bil
                }
             )
             (decl5Noun bil) --- rest case with lots of garbage
      } ; 
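
The branching of the heuristic is easier to follow on concrete data. The Python sketch below is a simplified model of it, not a translation of the library code: the name reg2_noun is hypothetical, the behaviour of decl1Noun, decl2Noun, decl3Noun, decl5Noun, sLik, sRike and sKikare is assumed from the example words, and only the four nominative forms are returned.

```python
def reg2_noun(sg, pl):
    """Guess (sg indef, sg def, pl indef, pl def) from singular and plural."""
    last = sg[-1:]
    ending = pl[-2:]

    def lik(w):   # assumed sLik/decl5Noun: lik, liket, lik, liken
        return (w, w + "et", w, w + "en")

    def rike(w):  # assumed sRike: rike, riket, riken, rikena
        return (w, w + "t", w + "n", w + "na")

    if ending == "or":
        if last == "a":
            return (sg, sg + "n", pl, pl + "na")   # flicka - flickor
        if last == "r":
            return lik(sg)                         # kor - kor (koret)
        if last == "o":
            return (sg, sg + "n", pl, pl + "na")   # ko - kor (kon)
        return (sg, sg + "en", pl, pl + "na")      # ros - rosor
    if ending == "ar":
        if pl[:-2] == sg:
            return (sg, sg + "en", pl, pl + "na")  # bil - bilar
        # Stem changes in the plural (nyckel, hummer) or final -e (gosse);
        # the GF code separates these two cases, this model merges them.
        return (sg, sg + "n", pl, pl + "na")
    if ending == "er":
        return (sg, sg + "en", pl, pl + "na")      # fil - filer
    if ending == "en":
        return lik(sg) if sg == pl else rike(sg)   # ben - ben, rike - riken
    if sg == pl and sg[-3:] == "are":
        return (sg, sg + "n", sg, sg[:-1] + "na")  # lärare - lärare
    return lik(sg)                                 # lik - lik and the rest


for pair in [("flicka", "flickor"), ("nyckel", "nycklar"),
             ("lärare", "lärare"), ("öken", "öknar")]:
    print(pair, reg2_noun(*pair))
```

For öken - öknar the sketch returns the definite form ökenn, the same wrong guess that the GF heuristic makes: two forms are not always enough to pick the paradigm.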

When in doubt...

Test in GF by generating the table:
  > cc mk2N "öken" "öknar"
  {s = table Number {
    Sg => table {
      Indef => table Case {
        Nom => "öken" ;
        Gen => "ökens"
      } ;
      Def => table Case {
        Nom => "ökenn" ;
        Gen => "ökenns"
      }
    ...
  }
Here the definite forms ökenn and ökenns are wrong; the correct definite singular is öknen. Use the worst-case function if the heuristic does not work:
  mkN "öken" "öknen" "öknar" "öknarna"

The module ParadigmsSwe

For each main category (N, A, V), there is a worst-case function and a couple of "regular" patterns:
  mkN  : (apa,apan,apor,aporna : Str) -> N ;
  mk2N : (nyckel,nycklar : Str) -> N ;

  mkV    : (supa,super,sup,söp,supit,supen : Str) -> V ;
  regV   : (tala : Str) -> V ;
  mk2V   : (leka,leker : Str) -> V ;
  irregV : (dricka, drack, druckit  : Str) -> V ;
Construction functions for subcategorization:
  mkV2  : V -> Preposition -> V2 ;
  dirV2 : V -> V2 ;
  mkV3  : V -> Preposition -> Preposition -> V3 ;

Morphology extraction

Idea: search for characteristic forms of paradigms in a corpus.
  paradigm decl1
    = ap+"a"
      {ap+"a" & ap+"or" };
For instance, if you find klocka and klockor, add
  klocka_N = decl1 "klocka" ;
to the lexicon.
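
The search itself is simple to sketch. The Python fragment below is an illustration only, not the extraction tool of Forsberg and Hammarström: the function name extract_decl1 and the toy corpus are made up, and the pattern is hard-coded to the decl1 constraint that ap+"a" and ap+"or" both occur.

```python
def extract_decl1(word_forms):
    """Return lexicon entries for stems attested with both -a and -or."""
    forms = set(word_forms)
    entries = []
    for form in sorted(forms):
        if form.endswith("a"):
            stem = form[:-1]
            # Both characteristic forms must occur in the corpus.
            if stem + "or" in forms:
                entries.append(f'{form}_N = decl1 "{form}" ;')
    return entries


corpus = "klocka klockor bil bilar flicka".split()
print(extract_decl1(corpus))
```

In this toy corpus flicka is not extracted, because its plural flickor is not attested.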

The notation for extraction and its implementation are developed by Markus Forsberg and Harald Hammarström.

False positives

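Problem: false positives, e.g. bra - bror, where the stem br matches the pattern although bra ("good") and bror ("brother") are unrelated words.

Solution: restrict the stem with a regular expression: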

  paradigm decl1 [ap : char* vowel char*]
    = ap+"a"
      {ap+"a" & ap+"or" };
In general, exclude stems shorter than 3 characters.

It is necessary to check the results manually.
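
The effect of the two restrictions can be sketched in Python. This is an assumption-laden illustration rather than the actual extraction tool: the function name, the vowel inventory and the toy word list are hypothetical, and char* vowel char* is read as "the stem contains at least one vowel".

```python
import re

VOWELS = "aeiouyåäö"  # assumed vowel inventory, lower case only
STEM = re.compile(rf".*[{VOWELS}].*")


def extract_decl1_restricted(word_forms, min_stem_len=3):
    """Find decl1 lemmas whose stem is long enough and contains a vowel."""
    forms = set(word_forms)
    lemmas = []
    for form in sorted(forms):
        if not form.endswith("a"):
            continue
        stem = form[:-1]
        if len(stem) < min_stem_len:
            continue  # general rule: skip very short stems
        if not STEM.fullmatch(stem):
            continue  # paradigm-specific rule: stem needs a vowel
        if stem + "or" in forms:
            lemmas.append(form)
    return lemmas


words = ["bra", "bror", "klocka", "klockor"]
print(extract_decl1_restricted(words))
print(extract_decl1_restricted(words, min_stem_len=0))
```

Both calls leave out bra: the stem br fails the length test in the first call and the vowel test in the second.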

Patterns for what?

"Irregular patterns" are possible, e.g.
  paradigm vEI [sm:OneOrMore, t:OneOrMore]
    = sm+"i"+t+"a"
      {sm+"i"+t+"a" & (sm+"e"+t | sm+"i"+t+"it")} ;
For rare patterns, it is more productive to build the corresponding part of the lexicon manually.

Current Swedish resource lexicon

49,749 lemmas (1,000 manual, rest extracted), 605,453 word forms. No subcategorization information.

Uses the Functional Morphology (FM) format, which can be translated to GF, XFST, LEXC, MySQL, ...

FM's "native" analysis engine is based on a trie and includes compound analysis using algorithms from G. Huet's Zen Toolkit. Analysis speed is 12,000 words per minute (200 per second) with compound analysis and 50,000 per minute (about 830 per second) without, on an Intel M 1.5 GHz laptop.
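
As a rough picture of trie-based analysis, here is a small Python sketch. It is not FM's engine and does not use the Zen algorithms: the class, the analysis strings and the naive two-part compound split are all assumptions made for illustration.

```python
class Trie:
    """Character trie mapping word forms to lists of analyses."""

    def __init__(self):
        self.children = {}
        self.analyses = []

    def insert(self, form, analysis):
        node = self
        for ch in form:
            node = node.children.setdefault(ch, Trie())
        node.analyses.append(analysis)

    def lookup(self, form):
        node = self
        for ch in form:
            node = node.children.get(ch)
            if node is None:
                return []
        return list(node.analyses)


def analyse(trie, word):
    """Return whole-word analyses, else naive two-part compound analyses."""
    whole = [[a] for a in trie.lookup(word)]
    if whole:
        return whole
    compounds = []
    for i in range(1, len(word)):
        for left in trie.lookup(word[:i]):
            for right in trie.lookup(word[i:]):
                compounds.append([left, right])
    return compounds


lexicon = Trie()
lexicon.insert("bil", "bil N Sg Indef Nom")
lexicon.insert("bilar", "bil N Pl Indef Nom")
lexicon.insert("nyckel", "nyckel N Sg Indef Nom")

print(analyse(lexicon, "bilar"))
print(analyse(lexicon, "bilnyckel"))
```

The compound branch tries every split point, which suggests why analysis with compounds is slower than plain lookup. A real analyser would also restrict which forms may appear as the first part of a compound.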

Syntax case study: Swedish sentence structure

Danish and Norwegian through parametrization