
morphodict: purely morphological unilingual dictionaries

Aarne Ranta 2020-03-02 -- 2021-05-27

UNDER CONSTRUCTION, INCOMPLETE AND BUGGY

The vision

Vision 1: if you need the noun "stjärna" in Swedish, you will find it as MorphoDictSwe.stjärna_N.

Vision 2: if you analyse a Swedish text that contains the word "stjärnornas", it will be returned as MorphoDictSwe.stjärna_N.

Vision 3: this will work for all words of Swedish and all other RGL languages. Only seldom will you need ParadigmsSwe.

What is contained

The guiding principle is to provide a single source for each lemgram (i.e. linearization record: an inflection table plus inherent features). Function names should be easy to guess:

  • baseform_Category

Baseforms that have many different lemgrams are an exception. They should be disambiguated by adding the differing forms, as in

  • lie_lay_V ("lie, lay, lain")
  • lie_lied_V ("lie, lied, lied")

Such distinctions are made in all cases where there are alternative inflections, even if there is no sense distinction:

  • learn_learned_V ("learn, learned, learned")
  • learn_learnt_V ("learn, learnt, learnt")

Hence,

  • no variants should appear in the MorphoDict
  • no entries should be duplicated if their lemgrams are the same
  • hence, in particular, sense distinctions do not result in different entries
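The naming and deduplication rules above can be sketched as follows. This is a minimal illustration in Python, not part of the MorphoDict tooling: the function name_entries and its input format (category plus a tuple of characteristic forms, baseform first) are hypothetical, and the "bank" entries in the usage note are invented to show two senses collapsing into one entry. The sketch assumes that all lemgrams sharing a baseform list the same number of forms, and it disambiguates by the first position where they differ.

```python
from collections import defaultdict

def name_entries(entries):
    """Return a sorted list of (function_name, forms) pairs.

    entries: iterable of (category, forms); forms is a tuple of
    characteristic forms with the baseform first (hypothetical format).
    """
    # Group distinct lemgrams by (baseform, category).
    # Identical lemgrams (e.g. mere sense distinctions) collapse into one.
    groups = defaultdict(list)
    for cat, forms in entries:
        forms = tuple(forms)
        key = (forms[0], cat)
        if forms not in groups[key]:
            groups[key].append(forms)

    named = []
    for (base, cat), lemgrams in groups.items():
        if len(lemgrams) == 1:
            # Unique lemgram: baseform_Category
            named.append((f"{base}_{cat}", lemgrams[0]))
            continue
        # Several lemgrams: find the first non-base position where they differ
        # (assumes equally long form tuples) and add that form to the name.
        idx = next(i for i in range(1, len(lemgrams[0]))
                   if len({g[i] for g in lemgrams}) > 1)
        for forms in lemgrams:
            named.append((f"{base}_{forms[idx]}_{cat}", forms))
    return sorted(named)
```

Given the two "lie" verbs, the two "learn" verbs and two copies of ("bank", "banks") as N, this yields bank_N, learn_learned_V, learn_learnt_V, lie_lay_V and lie_lied_V. With three or more lemgrams the chosen form may not be unique for each of them, so a real tool would need a fallback.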

The dictionary will also exclude multiwords consisting of several tokens. Most of the time, even compounds written as single tokens should be excluded. However, as the status of a compound is not always clear, and since they do not create spurious morphological analyses, they can be tolerated, in particular if extracted from legacy sources.

Since multiwords and compounds are excluded, Paradigms and MakeStructural should for each language provide API functions for easy definitions of them, preferably of the form

 mkC : Str -> C -> C

A single function of this form is not enough when gluing compounds and concatenation compounds need separate functions.

Open question: what to do with compound prepositions that are common in e.g. English? The above principles imply

 according_to_Prep = mkPrep "according" to_Prep

defined outside MorphoDictEng, so that mkPrep comes from ParadigmsEng and to_Prep from MorphoDictEng. This may seem to go against tradition, but it follows the general guidelines of morphological dictionaries.

Relevant categories

In addition to sense distinctions, MorphoDict ignores subcategorizations. One reason is that, just like senses (although to a lesser degree), they are open-ended and sometimes vague. Another reason is that different subcategory variants overload morphological analysis.

The most numerous categories to be addressed are content words:

  • A
  • Adv
  • Interj
  • N
  • PN
  • Symb
  • V

In addition, structural words should appear here with their native lemma names:

  • Conj
  • Det
  • IAdv
  • IDet
  • IP
  • NP (special NP-like "pronouns", such as "somebody")
  • Prep
  • Pron (in the RGL only covering personal pronouns)
  • Punct
  • Subj

Additional language-specific categories can be included if the reasons are clear. They must then be importable from the Paradigms module for that language, together with mk functions. The Extend module may also put them in use in syntax.

Following the model of the Universal Tagset, we add to Extend a category X for unspecified words, with the linearization type {s : Str}. Hence it can only be used for uninflected strings of unclear status.

Naming

As stated before,

  • functionname = baseform_category if there is a unique lemgram
  • = baseform_number_category if there is a need to disambiguate

The disambiguation numbering should reflect the frequency or probability of the lemgram, but this is just a recommendation, since the frequency is not always known.

The baseform should be the native alphabet baseform in Unicode letters, which is as such a valid GF identifier. However, if the word contains characters that are not legal in identifiers, the function name should simply be enclosed in single quotes, rather than inventing transliterations. If function names are formed by the API function PGF.mkCId, these conventions are automatically followed.
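The quoting convention can be illustrated with a small sketch. It is written in Python for illustration only and does not reproduce PGF.mkCId. Two points are assumptions rather than statements from this document: that a plain identifier is a letter followed by letters, digits, underscores or apostrophes (an approximation of GF's lexical rules), and that a backslash or single quote inside a quoted name is escaped with a backslash. The helper names are hypothetical.

```python
def is_plain_identifier(name: str) -> bool:
    """Approximate test for an identifier that needs no quoting (assumption)."""
    if not name or not name[0].isalpha():
        return False
    return all(ch.isalnum() or ch in "_'" for ch in name[1:])

def function_name(baseform: str, category: str) -> str:
    """Build baseform_Category, enclosing it in single quotes when needed."""
    raw = f"{baseform}_{category}"
    if is_plain_identifier(raw):
        return raw
    # Assumed escaping of backslash and single quote inside quoted names.
    escaped = raw.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"
```

Under these assumptions, function_name("stjärna", "N") gives stjärna_N unchanged, while function_name("vis-à-vis", "Prep") gives 'vis-à-vis_Prep' with the native spelling preserved inside the quotes.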

Coding conventions

To enable easy ocular and automatic inspection,

  • write one entry per line, each prefixed by the fun or lin keyword
  • sort the entries alphabetically
  • use paradigms with enough arguments to make the characteristic forms explicit
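The first two conventions are what makes automatic inspection straightforward. As a rough sketch (Python, with a hypothetical function name check_entries), a checker only has to look at the first two tokens of each line. It assumes that "alphabetically" means plain code-point order of the function names, which this document does not specify, and it tracks fun and lin lines separately.

```python
def check_entries(lines):
    """Return a list of (line_number, message) for convention violations."""
    problems = []
    previous = {"fun": None, "lin": None}  # last name seen per keyword
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # blank lines are ignored
        parts = text.split()
        keyword = parts[0]
        if keyword not in ("fun", "lin") or len(parts) < 2:
            problems.append((number, "entry does not start with fun or lin"))
            continue
        name = parts[1]
        last = previous[keyword]
        # Code-point comparison; a real tool might need language-specific collation.
        if last is not None and name < last:
            problems.append((number, f"{name} is out of order after {last}"))
        previous[keyword] = name
    return problems
```

Module headers and closing braces would have to be stripped before passing lines to such a checker, since they do not start with fun or lin.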

To guarantee compatibility with the rest of the RGL and application grammars,

  • paradigms used should be imported from Paradigms and MakeStructural rather than defined in MorphoDict itself
  • import of low-level modules such as Res should be avoided
  • MorphoDict should be self-contained, i.e. not inherit from other modules such as Structural or Irreg. But it is OK to open them in a qualified mode to use when defining linearizations.

Bootstrapping with MkMorphodict

Example run, English:

  gf -make ../english/DictEng.gf
  runghc MkMorphodict.hs pgf MorphoDictEng.config DictEngAbs.pgf MorphoDictEng

Or, if you have raw data from another source, of the format "N woman women", you can do

  runghc MkMorphodict.hs raw MorphoDictEng.config raw_words_eng.txt MorphoDictEng

The script needs a configuration file mapping legacy categories and form lists to parts of GF code:

  N : N mkN 0 2
  A : A mkA 0 2 4 6
  V : V mkV 0 4 2
  V2 : V mkV 0 4 2
  Adv : Adv mkAdv 0
  Prep : Prep mkPrep 0
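One way to read these configuration lines is sketched below in Python. This is an interpretation, not a description of what MkMorphodict.hs actually does: it assumes the fields are the legacy category, the target category, the paradigm name, and zero-based indices into the legacy list of forms. The helper names and the four-form noun list used in the usage note are hypothetical.

```python
def parse_config_line(line):
    """Parse e.g. 'N : N mkN 0 2' into ('N', ('N', 'mkN', [0, 2]))."""
    legacy, rest = line.split(":", 1)
    target_cat, paradigm, *indices = rest.split()
    return legacy.strip(), (target_cat, paradigm, [int(i) for i in indices])

def make_entry(config, legacy_cat, forms):
    """Return the (fun, lin) lines for one word, given its list of forms."""
    target_cat, paradigm, indices = config[legacy_cat]
    # The first selected form is taken as the baseform (assumption).
    name = f"{forms[indices[0]]}_{target_cat}"
    args = " ".join(f'"{forms[i]}"' for i in indices)
    return (f"fun {name} : {target_cat} ;",
            f"lin {name} = {paradigm} {args} ;")
```

With the line "N : N mkN 0 2" and the assumed form list woman, woman's, women, women's, this produces `fun woman_N : N ;` and `lin woman_N = mkN "woman" "women" ;`. The V2 line shows a legacy subcategory being mapped to plain V. Name quoting and disambiguation are left out of this sketch.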

In addition, it needs header files containing lines to be prefixed to the generated files:

  concrete MorphoDictEng of MorphoDictEngAbs =
    CatEng [N,A,V,Adv,Prep] **
    open
      ParadigmsEng
    in
  {

  abstract MorphoDictEngAbs =
    Cat [N,A,V,Adv,Prep] **
  {

For more details, we refer to MkMorphodict.hs for the time being.

If the config and header files are sound, the script produces compilable GF files. They also mostly comply with the guidelines given in this document.

Some things TODO:

  • deal with multiwords such as "more regular" generated by Paradigms
  • use references to native Irreg files instead of very long smart paradigms
  • support increments in addition to overwrites

Things to do

To support the construction of a MorphoDict, the following should be provided in Paradigms:

  • explicit smart paradigms with characteristic forms and inherent features for each category
  • API constants for all inherent features that are needed
  • compound-constructing functions for all categories that need them
  • the extra categories that one wants to include in that language