rewrote morphodict.README.md

2021-05-27 09:42:47 +02:00
parent cc75637704
commit 29af125799
2 changed files with 109 additions and 38 deletions
--- a/src/morphodict/README
+++ b/src/morphodict/README
@@ -1,38 +0,0 @@
-MkMorphoDict: Extracting a minimal morphological dictionary from an existing GF dictionary.
-
-Aarne Ranta 2020-03-02
-
-principles:
-
-There should be a single source for each lemgram (i.e. inflection table of a word)
-Functions names should be easy to guess: baseform_Category (but avoiding accidental errors if this is not a unique key)
-
-Hence,
-
-Functions are 1-to-1 with lemgrams, i.e. inflection tables, thus
-     - no sense distinctions
-     - no subcategorizations
-     - no variants
-
-Functionname = baseform_category, with exceptions
-     - same baseform_Category, different inflection tables: lie_1_V, lie_2_V
-     - words that have non-ident characters: 'bird\'s-eye_A'
-     - words that start with non-letters: W_'tween_Adv
-
-Example run, English:
-
-   gf -make ../english/DictEng.gf
-   runghc MkMorphodict.hs DictEngAbs.pgf MorphoDictEng
-
-Result: 64923 ->  56599 functions, of which 21679 could be compounds
-
-Swedish, using a dump of SALDO (not available in these sources)
-
-  cd saldo/
-  runghc SaldoGF.hs
-  # combine abs.tmp with Saldo.header to obtain Saldo.gf
-  # combine cnc.tmp with SaldoSwe.header to obtain SaldoSwe.gf
-  gf -make SaldoSwe.gf
-  cd ..
-  runghc MkMorphodict.hs saldo/Saldo.pgf MorphoDictSwe
-
--- a/src/morphodict/README.md
+++ b/src/morphodict/README.md
@@ -0,0 +1,109 @@
+# morphodict: purely morphological unilingual dictionaries
+
+Aarne Ranta 2020-03-02 -- 2021-05-27
+
+UNDER CONSTRUCTION, INCOMPLETE AND BUGGY
+
+## The vision
+
+Vision 1: if you need the noun "stjärna" in Swedish, you will find it
+as `MorphoDictSwe.stjärna_N`.
+
+Vision 2: if you analyse a Swedish text that contains the word "stjärnornas", it will be returned as `MorphoDictSwe.stjärna_N`.
+
+Vision 3: this will work for all words of Swedish and all other RGL languages. Only seldom will you need `ParadigmsSwe`.
+
+
+## What is contained
+
+The guiding principle is to provide a single source for each *lemgram* (i.e. linearization record: an inflection table plus inherent features).
+Function names should be easy to guess:
+- `baseform_Category`
+
+Baseforms that have many different lemgrams are an exception.
+They should be numbered as
+- `lie_1_V` ("lie, lay, lain") 
+- `lie_2_V` ("lie, lied, lied")
+
+Such distinctions are made in all cases where there are alternative inflections, even if there is no sense distinction:
+- `learn_1_V` ("learn, learned, learned")
+- `learn_2_V` ("learn, learnt, learnt")
+
+Hence,
+- no `variants` should appear in the MorphoDict
+- no entries should be duplicated if their lemgrams are the same
+- hence, in particular, sense distinctions do not result in different entries
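+
+The rules above can be sketched as a small Haskell function. This is only an illustration, not the code of `MkMorphodict.hs`: the `Entry` type and the `lemgrams` function are hypothetical, and an inflection table is simplified to a list of forms.
+
+```haskell
+import Data.List (nub)
+import qualified Data.Map.Strict as Map
+
+-- Hypothetical entry type: baseform, category, list of inflected forms.
+type Entry = (String, String, [String])
+
+-- Group entries by (baseform, category) and keep each distinct
+-- inflection table once, in order of first appearance.
+lemgrams :: [Entry] -> Map.Map (String, String) [[String]]
+lemgrams es =
+  Map.map nub (Map.fromListWith (flip (++)) [((b, c), [fs]) | (b, c, fs) <- es])
+
+main :: IO ()
+main = mapM_ print (Map.toList (lemgrams sample))
+  where
+    sample =
+      [ ("lie", "V", ["lie", "lay", "lain"])
+      , ("lie", "V", ["lie", "lied", "lied"])
+      , ("lie", "V", ["lie", "lay", "lain"])  -- another sense, same lemgram
+      , ("star", "N", ["star", "stars"])
+      ]
+```
+
+Here the third "lie" entry collapses into the first, so "lie" keeps two lemgrams and "star" keeps one.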
+
+The dictionary will also exclude *multiwords* consisting of several tokens.
+Most of the time, even *compounds* written as single tokens should be excluded.
+However, as the status of a compound is not always clear, and since they do not create spurious morphological analyses, they can be tolerated, in particular if extracted from legacy sources.
+
+
+## Relevant categories 
+
+In addition to sense distinctions, MorphoDict ignores subcategorizations.
+One reason is that, just like senses (although to a lesser degree), they are open-ended and sometimes vague.
+Another reason is that different subcategory variants overload morphological analysis.
+
+The most numerous categories to be addressed are content words:
+- `A`
+- `Adv`
+- `Interj`
+- `N`
+- `PN`
+- `Symb`
+- `V`
+
+In addition, structural words should appear here with their native lemma names:
+- `Conj`
+- `Det`
+- `IAdv`
+- `IDet`
+- `IP`
+- `NP` (special NP-like "pronouns", such as "somebody")
+- `Prep`
+- `Pron` (in the RGL only covering personal pronouns)
+- `Punct`
+- `Subj`
+
+Additional language-specific categories can be included if the reasons are clear.
+They must then be defined in the `Ext` module for that language.
+
+Following the model of the Universal Tagset, we add a category `X` for unspecified words in `Ext`, with the linearization type `{s : Str}`.
+Hence it can only be used for uninflected strings with unclear status.
+
+## Naming
+
+As stated before,
+- `functionname` = `baseform_category` if there is a unique lemgram
+- `functionname` = `baseform_number_category` if there is a need to disambiguate
+
+The disambiguation numbering should reflect the frequency or probability of the lemgram, but this is just a recommendation, since the frequency is not always known.
+
+The baseform should be the native alphabet baseform in Unicode letters, which is as such a valid GF identifier.
+However, if the word contains characters that are not legal in identifiers, the function name should simply be enclosed in single quotes, rather than inventing transliterations.
+If function names are formed by the API function `PGF.mkCId`, these conventions are automatically followed.
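+
+A minimal sketch of these naming conventions in plain Haskell, for readers who do not use `PGF.mkCId`. The helpers `isPlainIdent`, `quoteIfNeeded` and `nameLemgrams` are hypothetical. The identifier rule used here (a letter followed by letters, digits, `_` or `'`) is an assumption, and the backslash escaping follows the `'bird\'s-eye_A'` example of the earlier README.
+
+```haskell
+import Data.Char (isAlpha, isAlphaNum)
+
+-- Assumed rule: a plain identifier starts with a letter and continues
+-- with letters, digits, underscores or apostrophes.
+isPlainIdent :: String -> Bool
+isPlainIdent []       = False
+isPlainIdent (c : cs) = isAlpha c && all ok cs
+  where ok x = isAlphaNum x || x `elem` "_'"
+
+-- Wrap a name in single quotes unless it is a plain identifier;
+-- quotes and backslashes inside the name are escaped.
+quoteIfNeeded :: String -> String
+quoteIfNeeded s
+  | isPlainIdent s = s
+  | otherwise      = "'" ++ concatMap esc s ++ "'"
+  where
+    esc c | c `elem` "'\\" = ['\\', c]
+          | otherwise      = [c]
+
+-- Names for the n lemgrams that share a baseform and a category:
+-- unnumbered if there is exactly one, numbered from 1 otherwise.
+nameLemgrams :: String -> String -> Int -> [String]
+nameLemgrams base cat n
+  | n <= 0    = []
+  | n == 1    = [quoteIfNeeded (base ++ "_" ++ cat)]
+  | otherwise = [quoteIfNeeded (base ++ "_" ++ show i ++ "_" ++ cat) | i <- [1 .. n]]
+
+main :: IO ()
+main = mapM_ putStrLn $
+  nameLemgrams "stjärna" "N" 1 ++ nameLemgrams "lie" "V" 2 ++ nameLemgrams "bird's-eye" "A" 1
+```
+
+With this input the sketch yields `stjärna_N`, `lie_1_V`, `lie_2_V` and `'bird\'s-eye_A'`. Ordering the numbered variants by frequency, as recommended above, is left to the caller.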
+
+
+## Bootstrapping with `MkMorphoDict`
+
+THIS WAS AN EARLY EXPERIMENT, TO BE UPDATED
+
+Example run, English:
+
+    gf -make ../english/DictEng.gf
+    runghc MkMorphodict.hs DictEngAbs.pgf MorphoDictEng
+
+Result: 64923 ->  56599 functions, of which 21679 could be compounds
+
+Swedish, using a dump of SALDO (not available in these sources):
+```
+  cd saldo/
+  runghc SaldoGF.hs
+  # combine abs.tmp with Saldo.header to obtain Saldo.gf
+  # combine cnc.tmp with SaldoSwe.header to obtain SaldoSwe.gf
+  gf -make SaldoSwe.gf
+  cd ..
+  runghc MkMorphodict.hs saldo/Saldo.pgf MorphoDictSwe
+```
+