forked from GitHub/gf-rgl

updated morphodict/README.md with MkMorphodict help

aarneranta
2021-05-28 14:44:52 +02:00
parent 7c4546f3c3
commit 38d852a5bb
2 changed files with 47 additions and 24 deletions


@@ -20,7 +20,6 @@ import System.Environment (getArgs)
-- example:
-- gf -make ../english/DictEng.gf
-- runghc
-- runghc MkMorphodict.hs pgf MorphoDictEng.config DictEngAbs.pgf MorphoDictEng
-- 64923 -> 56599 functions
@@ -138,7 +137,7 @@ mkMorphoDict env =
_ -> []
renames :: [RawRule] -> [RuleData]
-- renames fls = [((mkFun (f ++ [show i,c]),c),l) | (i,((f,c),l)) <- zip [1..] fls] -- disambiguate with int
--- renames fls = [((mkFun (f ++ [show i,c]),c),l) | (i,((f,c),l)) <- zip [1..] fls] -- disambiguate with int
renames fls = [((mkFun (f ++ fs ++ [c]),c),l) | (i,(((f,c),l),fs)) <- zip [1..] (zip fls (minimize fls))] -- disambiguate with different forms
minimize :: [RawRule] -> [[String]]


@@ -21,13 +21,13 @@ Functions names should be easy to guess:
- `baseform_Category`
Baseforms that have many different lemgrams are an exception.
- They should be numbered as
- - `lie_1_V` ("lie, lay, lain")
- - `lie_2_V` ("lie, lied lied")
+ They should be disambiguated by adding the differing forms, as in
+ - `lie_lay_V` ("lie, lay, lain")
+ - `lie_lied_V` ("lie, lied lied")
Such distinctions are made in all cases where there are alternative inflections, even if there is no sense distinction:
- - `learn_1_V` ("learn, learned, learned")
- - `learn_2_V` ("learn, learnt, learnt")
+ - `learn_learned_V` ("learn, learned, learned")
+ - `learn_learnt_V` ("learn, learnt, learnt")
Hence,
- no `variants` should appear in the MorphoDict
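The naming scheme above (append a differing form instead of a running number) can be sketched in Haskell as follows. This is only an illustration, not the `minimize` function from `MkMorphodict.hs`: the names `distinguishing` and `funNames` are hypothetical, and the sketch assumes that all entries for a baseform have form lists of equal length and that the first position where the lists disagree is enough to tell every entry apart.

```haskell
import Data.List (intercalate, transpose)

-- | One distinguishing form per entry: the column at the first position
-- where the entries do not all agree. Nothing if no position differs.
-- (Hypothetical helper; assumes form lists of equal length.)
distinguishing :: [[String]] -> Maybe [String]
distinguishing formLists =
  case filter (not . allSame) (transpose formLists) of
    (column : _) -> Just column
    []           -> Nothing
  where
    allSame []       = True
    allSame (x : xs) = all (== x) xs

-- | Function names for entries that share a baseform (the first form)
-- and a category. A single entry keeps the plain baseform_Category name.
funNames :: String -> [[String]] -> [String]
funNames cat formLists =
  case distinguishing formLists of
    Just extras ->
      [ intercalate "_" [base, extra, cat]
      | (base : _, extra) <- zip formLists extras ]
    Nothing ->
      [ intercalate "_" [base, cat] | (base : _) <- formLists ]
```

In GHCi, `funNames "V" [["lie","lay","lain"],["lie","lied","lied"]]` evaluates to `["lie_lay_V","lie_lied_V"]`. With three or more entries the first differing position may still leave two names equal, so a fuller version would have to look at further positions.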
@@ -115,26 +115,50 @@ To guarantee compatibility with the rest of the RGL and application grammars,
## Bootstrapping with `MkMorphoDict`
THIS WAS AN EARLY EXPERIMENT, TO BE UPDATED
Example run, English:
gf -make ../english/DictEng.gf
runghc MkMorphodict.hs DictEngAbs.pgf MorphoDictEng
Result: 64923 -> 56599 functions, of which 21679 could be compounds
Swedish, using a dump of SALDO (not available in these sources)
```
cd saldo/
runghc SaldoGF.hs
# combine abs.tmp with Saldo.header to obtain Saldo.gf
# combine cnc.tmp with SaldoSwe.header to obtain SaldoSwe.gf
gf -make SaldoSwe.gf
cd ..
runghc MkMorphodict.hs saldo/Saldo.pgf MorphoDictSwe
gf -make ../english/DictEng.gf
runghc MkMorphodict.hs pgf MorphoDictEng.config DictEngAbs.pgf MorphoDictEng
```
Or, if you have raw data from another source, of the format "N woman women", you can do
```
runghc MkMorphodict.hs raw MorphoDictEng.config raw_words_eng.txt MorphoDictEng
```
The script needs a *configuration file* mapping legacy categories and forms lists to parts of GF code:
```
N : N mkN 0 2
A : A mkA 0 2 4 6
V : V mkV 0 4 2
V2 : V mkV 0 4 2
Adv : Adv mkAdv 0
Prep : Prep mkPrep 0
```
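To make the configuration format more concrete, here is a minimal Haskell sketch of how one such line could be read and applied. It is written under assumptions and is not taken from `MkMorphodict.hs`: the type `ConfigRule` and the functions `parseConfigLine`, `selectForms` and `linRule` are hypothetical, the numbers are read as zero-based positions in an entry's list of forms, and the shape of the generated `lin` rule is a guess at what the script emits.

```haskell
import Data.Maybe (listToMaybe)
import Text.Read (readMaybe)

-- | One configuration line, e.g. "N : N mkN 0 2" (hypothetical type).
data ConfigRule = ConfigRule
  { legacyCat   :: String  -- category in the source lexicon
  , gfCat       :: String  -- category in the generated MorphoDict
  , paradigm    :: String  -- smart paradigm to call, e.g. mkN
  , formIndices :: [Int]   -- assumed: positions in the forms list
  } deriving (Eq, Show)

-- | Parse "Legacy : Cat oper i j ..."; Nothing on malformed input.
parseConfigLine :: String -> Maybe ConfigRule
parseConfigLine line =
  case words line of
    (legacy : ":" : cat : oper : idxs) ->
      ConfigRule legacy cat oper <$> mapM readMaybe idxs
    _ -> Nothing

-- | Pick the forms at the given positions; Nothing if one is out of range.
selectForms :: [Int] -> [String] -> Maybe [String]
selectForms idxs forms = mapM pick idxs
  where
    pick i
      | i >= 0 && i < length forms = Just (forms !! i)
      | otherwise                  = Nothing

-- | Render a lin rule, taking the first form as the baseform (assumption).
linRule :: ConfigRule -> [String] -> Maybe String
linRule rule forms = do
  base <- listToMaybe forms
  args <- selectForms (formIndices rule) forms
  let name = base ++ "_" ++ gfCat rule
      quote s = "\"" ++ s ++ "\""
  pure (unwords (["lin", name, "=", paradigm rule] ++ map quote args ++ [";"]))
```

For example, with the rule parsed from `N : N mkN 0 2` and an assumed four-form noun table `["woman","woman's","women","women's"]`, `linRule` yields `lin woman_N = mkN "woman" "women" ;`.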
In addition, it needs *header files* containing lines to be prefixed to the generated files:
```
concrete MorphoDictEng of MorphoDictEngAbs =
CatEng [N,A,V,Adv,Prep] **
open
ParadigmsEng
in
{
```
```
abstract MorphoDictEngAbs =
Cat [N,A,V,Adv,Prep] **
{
```
For more details, we refer to `MkMorphodict.hs` for the time being.
If the config and header files are sound, the script produces compilable GF files.
They also mostly comply to the guidelines given in this document.
Some things TODO:
- deal with multiwords such as "more regular" generated by Paradigms
- use references to native Irreg files instead of very long smart paradigms
- support increments in addition to overwrites
## Things to do
To support the construction of a `MorphoDict`, the following should be guaranteed in the RGL: