forked from GitHub/gf-rgl
rewrote morphodict.README.md
@@ -1,38 +0,0 @@
MkMorphoDict: Extracting a minimal morphological dictionary from an existing GF dictionary.

Aarne Ranta 2020-03-02

principles:

There should be a single source for each lemgram (i.e. inflection table of a word)
Functions names should be easy to guess: baseform_Category (but avoiding accidental errors if this is not a unique key)

Hence,

Functions are 1-to-1 with lemgrams, i.e. inflection tables, thus
- no sense distinctions
- no subcategorizations
- no variants

Functionname = baseform_category, with exceptions
- same baseform_Category, different inflection tables: lie_1_V, lie_2_V
- words that have non-ident characters: 'bird\'s-eye_A'
- words that start with non-letters: W_'tween_Adv

Example run, English:

    gf -make ../english/DictEng.gf
    runghc MkMorphodict.hs DictEngAbs.pgf MorphoDictEng

Result: 64923 -> 56599 functions, of which 21679 could be compounds

Swedish, using a dump of SALDO (not available in these sources)

    cd saldo/
    runghc SaldoGF.hs
    # combine abs.tmp with Saldo.header to obtain Saldo.gf
    # combine cnc.tmp with SaldoSwe.header to obtain SaldoSwe.gf
    gf -make SaldoSwe.gf
    cd ..
    runghc MkMorphodict.hs saldo/Saldo.pgf MorphoDictSwe
109
src/morphodict/README.md
Normal file
@@ -0,0 +1,109 @@
# morphodict: purely morphological unilingual dictionaries

Aarne Ranta 2020-03-02 -- 2021-05-27

UNDER CONSTRUCTION, INCOMPLETE AND BUGGY

## The vision

Vision 1: if you need the noun "stjärna" in Swedish, you will find it
as `MorphoDictSwe.stjärna_N`.

Vision 2: if you analyse a Swedish text that contains the word "stjärnornas", it will be returned as `MorphoDictSwe.stjärna_N`.

Vision 3: this will work for all words of Swedish and all other RGL languages. Only seldom will you need `ParadigmsSwe`.
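
Visions 1 and 2 amount to a two-way mapping between function names and word forms. The toy Python sketch below illustrates that idea with a hand-written inflection table. The `LEXICON` data and the helpers `build_analysis_index` and `analyse` are hypothetical stand-ins for what a compiled GF grammar would provide; they are not part of GF or MorphoDict.

```python
# Toy lexicon: function name -> inflected forms (hand-written for illustration).
LEXICON = {
    "stjärna_N": [
        "stjärna", "stjärnas", "stjärnan", "stjärnans",
        "stjärnor", "stjärnors", "stjärnorna", "stjärnornas",
    ],
}


def build_analysis_index(lexicon):
    """Invert the lexicon: word form -> set of function names that produce it."""
    index = {}
    for function_name, forms in lexicon.items():
        for form in forms:
            index.setdefault(form, set()).add(function_name)
    return index


INDEX = build_analysis_index(LEXICON)


def analyse(word):
    """Return the sorted function names whose tables contain the word form."""
    return sorted(INDEX.get(word, set()))
```

With this toy data, `analyse("stjärnornas")` yields the single function name `stjärna_N`, and an unknown word yields an empty list.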

## What is contained

The guiding principle is to provide a single source for each *lemgram*, i.e. a linearization record consisting of an inflection table plus inherent features.
Function names should be easy to guess:
- `baseform_Category`

Baseforms that have several different lemgrams are an exception.
They should be numbered as
- `lie_1_V` ("lie, lay, lain")
- `lie_2_V` ("lie, lied, lied")

Such distinctions are made in all cases where there are alternative inflections, even if there is no sense distinction:
- `learn_1_V` ("learn, learned, learned")
- `learn_2_V` ("learn, learnt, learnt")

Hence,
- no `variants` should appear in the MorphoDict
- no entries should be duplicated if their lemgrams are the same
- in particular, sense distinctions do not result in different entries
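
The rules above can be sketched in Python as follows. This is a minimal illustration, assuming each source entry is a hypothetical `(baseform, category, forms)` triple; it is not the format or the algorithm used by `MkMorphodict.hs`. Numbers are assigned in first-seen order here, as a placeholder for a frequency-based order.

```python
def assign_names(entries):
    """Map function names to lemgrams, one name per distinct lemgram.

    entries: iterable of (baseform, category, forms), forms being a tuple of strings.
    """
    groups = {}
    for baseform, category, forms in entries:
        lemgrams = groups.setdefault((baseform, category), [])
        if forms not in lemgrams:  # identical lemgram: keep a single entry
            lemgrams.append(forms)

    named = {}
    for (baseform, category), lemgrams in groups.items():
        if len(lemgrams) == 1:
            # unique lemgram: plain baseform_Category
            named[f"{baseform}_{category}"] = lemgrams[0]
        else:
            # several lemgrams: number them starting from 1
            for number, forms in enumerate(lemgrams, start=1):
                named[f"{baseform}_{number}_{category}"] = forms
    return named


ENTRIES = [
    ("lie", "V", ("lie", "lay", "lain")),   # "be in a horizontal position"
    ("lie", "V", ("lie", "lay", "lain")),   # another sense, same lemgram
    ("lie", "V", ("lie", "lied", "lied")),  # "tell an untruth"
    ("star", "N", ("star", "stars")),
]
NAMED = assign_names(ENTRIES)
```

Here the two senses that share the table "lie, lay, lain" collapse into `lie_1_V`, the table "lie, lied, lied" becomes `lie_2_V`, and `star_N` needs no number.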

The dictionary will also exclude *multiwords* consisting of several tokens.
Most of the time, even *compounds* written as single tokens should be excluded.
However, as the status of a compound is not always clear, and since compounds do not create spurious morphological analyses, they can be tolerated, in particular if extracted from legacy sources.

## Relevant categories

In addition to sense distinctions, MorphoDict ignores subcategorizations.
One reason is that, just like senses (although to a lesser degree), they are open-ended and sometimes vague.
Another reason is that multiple subcategory variants of the same word overload morphological analysis with redundant results.
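
As an illustration of ignoring subcategorizations, the Python sketch below maps RGL subcategory names to their plain category and drops the resulting duplicates. The mapping table and the helper names are assumptions made for this example, and the entry format `(baseform, category)` is hypothetical.

```python
# Assumed mapping from RGL subcategorization categories to the plain category.
BASE_CATEGORY = {
    "V2": "V", "V3": "V", "VV": "V", "VS": "V", "VQ": "V", "VA": "V",
    "V2V": "V", "V2S": "V", "V2Q": "V", "V2A": "V",
    "N2": "N", "N3": "N",
    "A2": "A",
}


def morphological_category(category):
    """Return the plain category; categories without an entry are unchanged."""
    return BASE_CATEGORY.get(category, category)


def collapse(entries):
    """entries: iterable of (baseform, category); returns sorted unique pairs."""
    return sorted({(base, morphological_category(cat)) for base, cat in entries})
```

For example, the three entries `("give", "V")`, `("give", "V2")` and `("give", "V3")` collapse into the single pair `("give", "V")`.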

The most numerous categories to be addressed are content words:
- `A`
- `Adv`
- `Interj`
- `N`
- `PN`
- `Symb`
- `V`

In addition, structural words should appear here with their native lemma names:
- `Conj`
- `Det`
- `IAdv`
- `IDet`
- `IP`
- `NP` (special NP-like "pronouns", such as "somebody")
- `Prep`
- `Pron` (in the RGL only covering personal pronouns)
- `Punct`
- `Subj`

Additional language-specific categories can be included if the reasons are clear.
They must then be defined in the `Ext` module for that language.

Following the model of the Universal Tagset, we add a category `X` for unspecified words in `Ext`, with the linearization type `{s : Str}`.
Hence it can only be used for uninflected strings with unclear status.

## Naming

As stated before,
- `functionname` = `baseform_Category` if there is a unique lemgram
- `functionname` = `baseform_number_Category` if there is a need to disambiguate

The disambiguation numbering should reflect the frequency or probability of the lemgram, but this is just a recommendation, since the frequency is not always known.

The baseform should be the native alphabet baseform in Unicode letters, which is as such a valid GF identifier.
However, if the word contains characters that are not legal in identifiers, the function name should simply be enclosed in single quotes, rather than inventing transliterations.
If function names are formed by the API function `PGF.mkCId`, these conventions are automatically followed.
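
The naming convention can be sketched in Python as below. This is not `PGF.mkCId`, only an illustration of the rule. It assumes that a plain identifier is a letter followed by letters, digits, `_` or `'`, and that quotes and backslashes inside a quoted name are escaped with a backslash, following the form `'bird\'s-eye_A'`. The helper names are hypothetical.

```python
def is_plain_identifier(name):
    """Assumed rule: a letter, then letters, digits, underscores or apostrophes."""
    if not name or not name[0].isalpha():
        return False
    return all(ch.isalnum() or ch in "_'" for ch in name)


def function_name(baseform, category, number=None):
    """Build baseform_Category or baseform_number_Category, quoting if needed."""
    parts = [baseform]
    if number is not None:
        parts.append(str(number))
    parts.append(category)
    name = "_".join(parts)
    if is_plain_identifier(name):
        return name
    # Not a legal plain identifier: enclose in single quotes, escaping \ and '.
    escaped = name.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"
```

Under these assumptions, `function_name("stjärna", "N")` gives `stjärna_N`, `function_name("lie", "V", 1)` gives `lie_1_V`, and `function_name("bird's-eye", "A")` gives the quoted name `'bird\'s-eye_A'`.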


## Bootstrapping with `MkMorphoDict`

THIS WAS AN EARLY EXPERIMENT, TO BE UPDATED

Example run, English:

    gf -make ../english/DictEng.gf
    runghc MkMorphodict.hs DictEngAbs.pgf MorphoDictEng

Result: 64923 -> 56599 functions, of which 21679 could be compounds

Swedish, using a dump of SALDO (not available in these sources):
```
cd saldo/
runghc SaldoGF.hs
# combine abs.tmp with Saldo.header to obtain Saldo.gf
# combine cnc.tmp with SaldoSwe.header to obtain SaldoSwe.gf
gf -make SaldoSwe.gf
cd ..
runghc MkMorphodict.hs saldo/Saldo.pgf MorphoDictSwe
```