From 29af125799bfbacbb4cc775926b664d8ca801016 Mon Sep 17 00:00:00 2001
From: aarneranta
Date: Thu, 27 May 2021 09:42:47 +0200
Subject: [PATCH] rewrote morphodict.README.md

---
 src/morphodict/README    |  38 --------------
 src/morphodict/README.md | 109 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 109 insertions(+), 38 deletions(-)
 delete mode 100644 src/morphodict/README
 create mode 100644 src/morphodict/README.md

diff --git a/src/morphodict/README b/src/morphodict/README
deleted file mode 100644
index 51a4a7073..000000000
--- a/src/morphodict/README
+++ /dev/null
@@ -1,38 +0,0 @@
-MkMorphoDict: Extracting a minimal morphological dictionary from an existing GF dictionary.
-
-Aarne Ranta 2020-03-02
-
-principles:
-
-There should be a single source for each lemgram (i.e. inflection table of a word)
-Functions names should be easy to guess: baseform_Category (but avoiding accidental errors if this is not a unique key)
-
-Hence,
-
-Functions are 1-to-1 with lemgrams, i.e. inflection tables, thus
-  - no sense distinctions
-  - no subcategorizations
-  - no variants
-
-Functionname = baseform_category, with exceptions
-  - same baseform_Category, different inflection tables: lie_1_V, lie_2_V
-  - words that have non-ident characters: 'bird\'s-eye_A'
-  - words that start with non-letters: W_'tween_Adv
-
-Example run, English:
-
-  gf -make ../english/DictEng.gf
-  runghc MkMorphodict.hs DictEngAbs.pgf MorphoDictEng
-
-Result: 64923 -> 56599 functions, of which 21679 could be compounds
-
-Swedish, using a dump of SALDO (not available in these sources)
-
-  cd saldo/
-  runghc SaldoGF.hs
-  # combine abs.tmp with Saldo.header to obtain Saldo.gf
-  # combine cnc.tmp with SaldoSwe.header to obtain SaldoSwe.gf
-  gf -make SaldoSwe.gf
-  cd ..
-  runghc MkMorphodict.hs saldo/Saldo.pgf MorphoDictSwe
-
diff --git a/src/morphodict/README.md b/src/morphodict/README.md
new file mode 100644
index 000000000..8717c2d0e
--- /dev/null
+++ b/src/morphodict/README.md
@@ -0,0 +1,109 @@
+# morphodict: purely morphological unilingual dictionaries
+
+Aarne Ranta 2020-03-02 -- 2021-05-27
+
+UNDER CONSTRUCTION, INCOMPLETE AND BUGGY
+
+## The vision
+
+Vision 1: if you need the noun "stjärna" in Swedish, you will find it
+as `MorphoDictSwe.stjärna_N`.
+
+Vision 2: if you analyse a Swedish text that contains the word "stjärnornas", the analysis will return it as `MorphoDictSwe.stjärna_N`.
+
+Vision 3: this will work for all words of Swedish and of all other RGL languages. Only seldom will you need `ParadigmsSwe`.
+
+
+## What is contained
+
+The guiding principle is to provide a single source for each *lemgram* (i.e. a linearization record: an inflection table plus inherent features).
+Function names should be easy to guess:
+- `baseform_Category`
+
+Baseforms that have several different lemgrams are an exception.
+They should be numbered as
+- `lie_1_V` ("lie, lay, lain")
+- `lie_2_V` ("lie, lied, lied")
+
+Such distinctions are made in all cases where there are alternative inflections, even if there is no sense distinction:
+- `learn_1_V` ("learn, learned, learned")
+- `learn_2_V` ("learn, learnt, learnt")
+
+Hence,
+- no `variants` should appear in the MorphoDict
+- no entries should be duplicated if their lemgrams are the same
+- in particular, sense distinctions do not result in different entries
+
+The dictionary will also exclude *multiwords* consisting of several tokens.
+Most of the time, even *compounds* written as single tokens should be excluded.
+However, since the status of a compound is not always clear, and since compounds do not create spurious morphological analyses, they can be tolerated, in particular if extracted from legacy sources.
+
+
+## Relevant categories
+
+In addition to sense distinctions, MorphoDict ignores subcategorizations.
+One reason is that, just like senses (although to a lesser degree), they are open-ended and sometimes vague.
+Another reason is that different subcategory variants overload morphological analysis.
+
+The most numerous categories to be addressed are content words:
+- `A`
+- `Adv`
+- `Interj`
+- `N`
+- `PN`
+- `Symb`
+- `V`
+
+In addition, structural words should appear here with their native lemma names:
+- `Conj`
+- `Det`
+- `IAdv`
+- `IDet`
+- `IP`
+- `NP` (special NP-like "pronouns", such as "somebody")
+- `Prep`
+- `Pron` (in the RGL only covering personal pronouns)
+- `Punct`
+- `Subj`
+
+Additional language-specific categories can be included if the reasons are clear.
+They must then be defined in the `Ext` module for that language.
+
+Following the model of the Universal Tagset, we add a category `X` for unspecified words in `Ext`, with the linearization type `{s : Str}`.
+Hence it can only be used for uninflected strings with unclear status.
+
+## Naming
+
+As stated before,
+- `functionname` = `baseform_category` if there is a unique lemgram
+- `functionname` = `baseform_number_category` if there is a need to disambiguate
+
+The disambiguation numbering should reflect the frequency or probability of the lemgram, but this is just a recommendation, since the frequency is not always known.
+
+The baseform should be the native alphabet baseform in Unicode letters, which is, as such, a valid GF identifier.
+However, if the word contains characters that are not legal in identifiers, the function name should simply be enclosed in single quotes, rather than inventing transliterations.
+If function names are formed by the API function `PGF.mkCId`, these conventions are automatically followed.
+
+
+## Bootstrapping with `MkMorphoDict`
+
+THIS WAS AN EARLY EXPERIMENT, TO BE UPDATED
+
+Example run, English:
+
+    gf -make ../english/DictEng.gf
+    runghc MkMorphodict.hs DictEngAbs.pgf MorphoDictEng
+
+Result: 64923 -> 56599 functions, of which 21679 could be compounds
+
+Swedish, using a dump of SALDO (not available in these sources)
+```
+  cd saldo/
+  runghc SaldoGF.hs
+  # combine abs.tmp with Saldo.header to obtain Saldo.gf
+  # combine cnc.tmp with SaldoSwe.header to obtain SaldoSwe.gf
+  gf -make SaldoSwe.gf
+  cd ..
+  runghc MkMorphodict.hs saldo/Saldo.pgf MorphoDictSwe
+```
+
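Reviewer note: the naming and deduplication rules in the new README ("What is contained" and "Naming") can be illustrated with a small Haskell sketch. This is not the code of `MkMorphodict.hs`, and it does not use `PGF.mkCId`. The names `Lemgram`, `isPlainIdent`, `quoteIfNeeded`, `functionName` and `assignNames` are hypothetical, and so is the identifier rule (a letter followed by letters, digits, `_` or `'`). Numbering follows input order here, since frequencies are not assumed to be known.

```haskell
import Data.Char (isAlpha, isDigit)
import Data.List (nub)
import qualified Data.Map.Strict as Map

-- Hypothetical stand-in for a lemgram: just the list of inflected forms.
type Lemgram = [String]

-- Assumed identifier rule: a letter followed by letters, digits, '_' or '\''.
isPlainIdent :: String -> Bool
isPlainIdent []       = False
isPlainIdent (c : cs) = isAlpha c && all ok cs
  where ok x = isAlpha x || isDigit x || x == '_' || x == '\''

-- Enclose a name in single quotes when it is not a plain identifier,
-- backslash-escaping embedded quotes and backslashes.
quoteIfNeeded :: String -> String
quoteIfNeeded s
  | isPlainIdent s = s
  | otherwise      = "'" ++ concatMap escape s ++ "'"
  where
    escape '\'' = "\\'"
    escape '\\' = "\\\\"
    escape c    = [c]

-- baseform_category, or baseform_number_category when an index is given.
functionName :: String -> Maybe Int -> String -> String
functionName base index cat =
  quoteIfNeeded (base ++ maybe "" (\i -> "_" ++ show i) index ++ "_" ++ cat)

-- One function per distinct lemgram; number only when a (baseform, category)
-- pair has more than one distinct lemgram. Numbers follow input order.
assignNames :: [(String, String, Lemgram)] -> [(String, Lemgram)]
assignNames entries = concatMap name (Map.toList grouped)
  where
    grouped = Map.fromListWith (flip (++)) [((b, c), [l]) | (b, c, l) <- entries]
    name ((b, c), ls) = case nub ls of
      [l] -> [(functionName b Nothing c, l)]
      ls' -> [(functionName b (Just i) c, l) | (i, l) <- zip [1 ..] ls']
```

Under these assumptions, two entries for "lie" with different inflection tables become `lie_1_V` and `lie_2_V`, two entries with identical tables collapse into one, and a baseform such as "bird's-eye" yields the quoted name `'bird\'s-eye_A'` instead of a transliteration.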