mirror of
https://github.com/GrammaticalFramework/gf-rgl.git
synced 2026-05-27 08:58:55 -06:00
rewrote morphodict.README.md
@@ -1,38 +0,0 @@
MkMorphoDict: Extracting a minimal morphological dictionary from an existing GF dictionary.

Aarne Ranta 2020-03-02

principles:

There should be a single source for each lemgram (i.e. inflection table of a word)
Function names should be easy to guess: baseform_Category (but avoiding accidental errors if this is not a unique key)

Hence,

Functions are 1-to-1 with lemgrams, i.e. inflection tables, thus
- no sense distinctions
- no subcategorizations
- no variants

Function name = baseform_Category, with exceptions
- same baseform_Category, different inflection tables: lie_1_V, lie_2_V
- words that have non-ident characters: 'bird\'s-eye_A'
- words that start with non-letters: W_'tween_Adv

Example run, English:

    gf -make ../english/DictEng.gf
    runghc MkMorphodict.hs DictEngAbs.pgf MorphoDictEng

Result: 64923 -> 56599 functions, of which 21679 could be compounds

Swedish, using a dump of SALDO (not available in these sources)

    cd saldo/
    runghc SaldoGF.hs
    # combine abs.tmp with Saldo.header to obtain Saldo.gf
    # combine cnc.tmp with SaldoSwe.header to obtain SaldoSwe.gf
    gf -make SaldoSwe.gf
    cd ..
    runghc MkMorphodict.hs saldo/Saldo.pgf MorphoDictSwe

109
src/morphodict/README.md
Normal file
@@ -0,0 +1,109 @@
# morphodict: purely morphological unilingual dictionaries

Aarne Ranta 2020-03-02 -- 2021-05-27

UNDER CONSTRUCTION, INCOMPLETE AND BUGGY

## The vision

Vision 1: if you need the noun "stjärna" in Swedish, you will find it as `MorphoDictSwe.stjärna_N`.

Vision 2: if you analyse a Swedish text that contains the word "stjärnornas", it will be returned as `MorphoDictSwe.stjärna_N`.

Vision 3: this will work for all words of Swedish and all other RGL languages. Only seldom will you need `ParadigmsSwe`.

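Vision 2 amounts to a reverse lookup from inflected forms to function names. The following is a minimal sketch of that idea in plain Python, not the GF runtime: `TOY_LEXICON`, `build_index` and `analyse` are hypothetical names, and the hand-written form list stands in for the inflection table that a real `MorphoDictSwe` would supply.

```python
from collections import defaultdict

# Hypothetical toy lexicon: function name -> inflected forms.
# In a real setting the forms would come from the compiled grammar.
TOY_LEXICON = {
    "stjärna_N": [
        "stjärna", "stjärnan", "stjärnor", "stjärnorna",
        "stjärnas", "stjärnans", "stjärnors", "stjärnornas",
    ],
}


def build_index(lexicon):
    """Invert the lexicon: inflected form -> set of function names."""
    index = defaultdict(set)
    for function, forms in lexicon.items():
        for form in forms:
            index[form].add(function)
    return index


def analyse(index, token):
    """Return the function names whose tables contain the token."""
    return sorted(index.get(token, ()))


index = build_index(TOY_LEXICON)
print(analyse(index, "stjärnornas"))  # ['stjärna_N']
```

Because each lemgram has exactly one function, a form maps to one name per lemgram that contains it, which is what keeps the analysis free of duplicates.
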
## What is contained

The guiding principle is to provide a single source for each *lemgram* (i.e. a linearization record: an inflection table plus inherent features).
Function names should be easy to guess:
- `baseform_Category`

Baseforms that have several different lemgrams are an exception.
They should be numbered as
- `lie_1_V` ("lie, lay, lain")
- `lie_2_V` ("lie, lied, lied")

Such distinctions are made in all cases where there are alternative inflections, even if there is no sense distinction:
- `learn_1_V` ("learn, learned, learned")
- `learn_2_V` ("learn, learnt, learnt")

Hence,
- no `variants` should appear in the MorphoDict
- no entries should be duplicated if their lemgrams are the same
- in particular, sense distinctions do not result in different entries

The dictionary will also exclude *multiwords* consisting of several tokens.
Most of the time, even *compounds* written as single tokens should be excluded.
However, as the status of a compound is not always clear, and since they do not create spurious morphological analyses, they can be tolerated, in particular if extracted from legacy sources.

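The rules above (one entry per lemgram, numbering only when a baseform has several lemgrams) can be sketched as follows. This is an illustration under stated assumptions, not the logic of `MkMorphodict.hs`: `assign_names` is a hypothetical helper, a lemgram is simplified to a tuple of forms, and numbers follow input order because no frequency data is assumed.

```python
from collections import defaultdict


def assign_names(entries):
    """Map function names to lemgrams.

    entries: iterable of (baseform, category, forms), where forms is a
    tuple of strings standing in for the full linearization record.
    """
    groups = defaultdict(list)
    for baseform, category, forms in entries:
        lemgrams = groups[(baseform, category)]
        if forms not in lemgrams:  # identical lemgrams collapse into one entry
            lemgrams.append(forms)

    named = {}
    for (baseform, category), lemgrams in groups.items():
        if len(lemgrams) == 1:
            named[f"{baseform}_{category}"] = lemgrams[0]
        else:
            # Several lemgrams for one baseform: number them from 1.
            for number, forms in enumerate(lemgrams, start=1):
                named[f"{baseform}_{number}_{category}"] = forms
    return named


entries = [
    ("lie", "V", ("lie", "lay", "lain")),
    ("lie", "V", ("lie", "lied", "lied")),
    ("star", "N", ("star", "stars")),  # e.g. the celestial sense
    ("star", "N", ("star", "stars")),  # e.g. the celebrity sense, same lemgram
]
print(sorted(assign_names(entries)))  # ['lie_1_V', 'lie_2_V', 'star_N']
```

The two `star` entries differ only in sense, so they share one function, while the two inflections of `lie` get separate numbered functions.
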
## Relevant categories

In addition to sense distinctions, MorphoDict ignores subcategorizations.
One reason is that, just like senses (although to a lesser degree), they are open-ended and sometimes vague.
Another reason is that a separate entry for each subcategory variant would overload morphological analysis with duplicate results.

The most numerous categories to be addressed are content words:
- `A`
- `Adv`
- `Interj`
- `N`
- `PN`
- `Symb`
- `V`

In addition, structural words should appear here with their native lemma names:
- `Conj`
- `Det`
- `IAdv`
- `IDet`
- `IP`
- `NP` (special NP-like "pronouns", such as "somebody")
- `Prep`
- `Pron` (in the RGL only covering personal pronouns)
- `Punct`
- `Subj`

Additional language-specific categories can be included if the reasons are clear.
They must then be defined in the `Ext` module for that language.

Following the model of the Universal Tagset, we add a category `X` for unspecified words in `Ext`, with the linearization type `{s : Str}`.
Hence it can only be used for uninflected strings with unclear status.

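One way to check that a dictionary respects the category lists above is to read the category off the end of each function name. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the category is the part after the last underscore and that `X` is the only extension category in use.

```python
# Category sets copied from the lists above; X is the catch-all from Ext.
CONTENT_CATS = {"A", "Adv", "Interj", "N", "PN", "Symb", "V"}
STRUCTURAL_CATS = {"Conj", "Det", "IAdv", "IDet", "IP", "NP",
                   "Prep", "Pron", "Punct", "Subj"}
EXT_CATS = {"X"}


def category_of(function_name):
    """Return the text after the last underscore, or None if there is none."""
    name = function_name
    # Quoted identifiers carry the category inside the quotes.
    if len(name) >= 2 and name[0] == "'" and name[-1] == "'":
        name = name[1:-1]
    base, sep, cat = name.rpartition("_")
    return cat if sep and base else None


def has_relevant_category(function_name):
    """True if the function's category is one MorphoDict is meant to cover."""
    return category_of(function_name) in (CONTENT_CATS | STRUCTURAL_CATS | EXT_CATS)


print(has_relevant_category("stjärna_N"))  # True
print(has_relevant_category("give_V3"))    # False: a subcategorization
```
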
## Naming

As stated before,
- `functionname` = `baseform_Category` if there is a unique lemgram
- `functionname` = `baseform_number_Category` if there is a need to disambiguate

The disambiguation numbering should reflect the frequency or probability of the lemgram, but this is just a recommendation, since the frequency is not always known.

The baseform should be the native alphabet baseform in Unicode letters, which is as such a valid GF identifier.
However, if the word contains characters that are not legal in identifiers, the function name should simply be enclosed in single quotes; no transliteration should be invented.
If function names are formed by the API function `PGF.mkCId`, these conventions are automatically followed.

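The naming convention can be illustrated with a small Python sketch. It does not reproduce `PGF.mkCId`: `function_name` and `is_plain_ident` are hypothetical helpers, the identifier test (a letter followed by letters, digits, underscores or apostrophes) is an approximation of GF's lexical rules, and escaping an inner quote with a backslash is an assumption.

```python
from typing import Optional


def is_plain_ident(name: str) -> bool:
    """Approximate test for a name that needs no quoting (assumption, not GF's lexer)."""
    if not name or not name[0].isalpha():
        return False
    return all(c.isalnum() or c in "_'" for c in name[1:])


def function_name(baseform: str, category: str, number: Optional[int] = None) -> str:
    """Build baseform_Category or baseform_number_Category, quoting if needed."""
    parts = [baseform]
    if number is not None:
        parts.append(str(number))
    parts.append(category)
    name = "_".join(parts)
    if is_plain_ident(name):
        return name
    # Keep the native spelling; quote the whole name instead of transliterating.
    escaped = name.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


print(function_name("stjärna", "N"))     # stjärna_N
print(function_name("lie", "V", 2))      # lie_2_V
print(function_name("bird's-eye", "A"))  # 'bird\'s-eye_A'
```

The baseform is kept in its native spelling in every case; only the surrounding quotes change when the name is not a plain identifier.
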
## Bootstrapping with `MkMorphoDict`

THIS WAS AN EARLY EXPERIMENT, TO BE UPDATED

Example run, English:

    gf -make ../english/DictEng.gf
    runghc MkMorphodict.hs DictEngAbs.pgf MorphoDictEng

Result: 64923 -> 56599 functions, of which 21679 could be compounds

Swedish, using a dump of SALDO (not available in these sources)
```
cd saldo/
runghc SaldoGF.hs
# combine abs.tmp with Saldo.header to obtain Saldo.gf
# combine cnc.tmp with SaldoSwe.header to obtain SaldoSwe.gf
gf -make SaldoSwe.gf
cd ..
runghc MkMorphodict.hs saldo/Saldo.pgf MorphoDictSwe
```