forked from GitHub/gf-rgl
updated morphodict/README.md with MkMorphodict help
@@ -20,7 +20,6 @@ import System.Environment (getArgs)
 
 -- example:
 -- gf -make ../english/DictEng.gf
--- runghc
 -- runghc MkMorphodict.hs pgf MorphoDictEng.config DictEngAbs.pgf MorphoDictEng
 -- 64923 -> 56599 functions
 
@@ -138,7 +137,7 @@ mkMorphoDict env =
 _ -> []
 
 renames :: [RawRule] -> [RuleData]
--- renames fls = [((mkFun (f ++ [show i,c]),c),l) | (i,((f,c),l)) <- zip [1..] fls] -- disambiguate with int
+--- renames fls = [((mkFun (f ++ [show i,c]),c),l) | (i,((f,c),l)) <- zip [1..] fls] -- disambiguate with int
 renames fls = [((mkFun (f ++ fs ++ [c]),c),l) | (i,(((f,c),l),fs)) <- zip [1..] (zip fls (minimize fls))] -- disambiguate with different forms
 
 minimize :: [RawRule] -> [[String]]
@@ -21,13 +21,13 @@ Functions names should be easy to guess:
 - `baseform_Category`
 
 Baseforms that have many different lemgrams are an exception.
-They should be numbered as
-- `lie_1_V` ("lie, lay, lain")
-- `lie_2_V` ("lie, lied lied")
+They should be disambiguated by adding the differing forms, as in
+- `lie_lay_V` ("lie, lay, lain")
+- `lie_lied_V` ("lie, lied lied")
 
 Such distinctions are made in all cases where there are alternative inflections, even if there is no sense distinction:
-- `learn_1_V` ("learn, learned, learned")
-- `learn_2_V` ("learn, learnt, learnt")
+- `learn_learned_V` ("learn, learned, learned")
+- `learn_learnt_V` ("learn, learnt, learnt")
 
 Hence,
 - no `variants` should appear in the MorphoDict
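Reviewer note on the naming change in the README hunk above: the following is a minimal sketch of how "disambiguate by adding the differing forms" could work. It is not the `minimize` function from `MkMorphodict.hs`, whose body is not shown in this diff. The `Entry` type and the names `rivals` and `funName` are hypothetical, and the sketch assumes each entry lists its lemma first.

```haskell
-- Hypothetical entry type: (lemma, category, inflection forms, lemma first).
type Entry = (String, String, [String])

-- Other entries that share the lemma and category of the given entry.
rivals :: [Entry] -> Entry -> [Entry]
rivals es e@(lemma, cat, _) =
  [ e' | e'@(l, c, _) <- es, l == lemma, c == cat, e' /= e ]

-- Function name for an entry: plain `lemma_Cat` when the lemma is unique,
-- otherwise `lemma_form_Cat`, where `form` is the first form of this entry
-- that none of its rivals has.
funName :: [Entry] -> Entry -> String
funName es e@(lemma, cat, forms) =
  case [ f | f <- forms, f /= lemma, all (\(_, _, fs) -> f `notElem` fs) rs ] of
    f : _ | not (null rs) -> lemma ++ "_" ++ f ++ "_" ++ cat
    _                     -> lemma ++ "_" ++ cat
  where
    rs = rivals es e

-- Sample data taken from the README examples, plus one unambiguous verb.
lexicon :: [Entry]
lexicon =
  [ ("lie",   "V", ["lie", "lay", "lain"])
  , ("lie",   "V", ["lie", "lied", "lied"])
  , ("learn", "V", ["learn", "learned", "learned"])
  , ("learn", "V", ["learn", "learnt", "learnt"])
  , ("walk",  "V", ["walk", "walked", "walked"])
  ]
```

With this data, `map (funName lexicon) lexicon` gives `lie_lay_V`, `lie_lied_V`, `learn_learned_V`, `learn_learnt_V` and `walk_V`, matching the names the updated README asks for. Unlike the old numbering, these names do not depend on the order of the entries.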
@@ -115,25 +115,49 @@ To guarantee compatibility with the rest of the RGL and application grammars,
 
 ## Bootstrapping with `MkMorphoDict`
 
-THIS WAS AN EARLY EXPERIMENT, TO BE UPDATED
-
 Example run, English:
-
-gf -make ../english/DictEng.gf
-runghc MkMorphodict.hs DictEngAbs.pgf MorphoDictEng
-
-Result: 64923 -> 56599 functions, of which 21679 could be compounds
-
-Swedish, using a dump of SALDO (not available in these sources)
 ```
-cd saldo/
-runghc SaldoGF.hs
-# combine abs.tmp with Saldo.header to obtain Saldo.gf
-# combine cnc.tmp with SaldoSwe.header to obtain SaldoSwe.gf
-gf -make SaldoSwe.gf
-cd ..
-runghc MkMorphodict.hs saldo/Saldo.pgf MorphoDictSwe
+gf -make ../english/DictEng.gf
+runghc MkMorphodict.hs pgf MorphoDictEng.config DictEngAbs.pgf MorphoDictEng
+```
+Or, if you have raw data from another source, of the format "N woman women", you can do
 ```
+runghc MkMorphodict.hs raw MorphoDictEng.config raw_words_eng.txt MorphoDictEng
+```
+The script needs a *configuration file* mapping legacy categories and forms lists to parts of GF code:
+```
+N : N mkN 0 2
+A : A mkA 0 2 4 6
+V : V mkV 0 4 2
+V2 : V mkV 0 4 2
+Adv : Adv mkAdv 0
+Prep : Prep mkPrep 0
+```
+In addition, it needs *header files* containing lines to be prefixed to the generated files:
+```
+concrete MorphoDictEng of MorphoDictEngAbs =
+CatEng [N,A,V,Adv,Prep] **
+open
+ParadigmsEng
+in
+{
+```
+```
+abstract MorphoDictEngAbs =
+Cat [N,A,V,Adv,Prep] **
+{
+```
+For more details, we refer to `MkMorphodict.hs` for the time being.
+
+If the config and header files are sound, the script produces compilable GF files.
+They also mostly comply to the guidelines given in this document.
+
+Some things TODO:
+- deal with multiwords such as "more regular" generated by Paradigms
+- use references to native Irreg files instead of very long smart paradigms
+- support increments in addition to overwrites
+
+
 
 ## Things to do
 
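Reviewer note on the configuration file introduced in the README hunk above: the sketch below shows one way a line such as `N : N mkN 0 2` could be read and applied. It is not taken from `MkMorphodict.hs`. The record `ConfigRule` and the functions `parseRule` and `linRule` are hypothetical, and the sketch assumes that the trailing numbers are zero-based indices into an entry's list of forms, which the README does not state explicitly.

```haskell
import Data.Char (isDigit)

-- Hypothetical record for one configuration line, e.g. "N : N mkN 0 2".
data ConfigRule = ConfigRule
  { legacyCat :: String  -- category name used in the source lexicon
  , gfCat     :: String  -- GF category of the generated function
  , paradigm  :: String  -- Paradigms function to call, e.g. mkN
  , formIxs   :: [Int]   -- forms to pass to it (assumed zero-based)
  } deriving (Show, Eq)

-- Parse "<legacy> : <gf> <paradigm> <index>..."; Nothing on malformed input.
parseRule :: String -> Maybe ConfigRule
parseRule line = case words line of
  (lc : ":" : gc : par : ixs)
    | not (null ixs), all isIndex ixs -> Just (ConfigRule lc gc par (map read ixs))
  _ -> Nothing
  where
    isIndex s = not (null s) && all isDigit s

-- Build a GF `lin` rule from a config rule, a function name and the
-- entry's forms. Indices beyond the end of the form list are skipped.
linRule :: ConfigRule -> String -> [String] -> String
linRule r fun forms =
  unwords (["lin", fun, "=", paradigm r] ++ args ++ [";"])
  where
    args = [ "\"" ++ forms !! i ++ "\"" | i <- formIxs r, i < length forms ]
```

For example, with the rule parsed from `N : N mkN 0 2` and the assumed four-form list `["woman", "woman's", "women", "women's"]`, `linRule` yields `lin woman_N = mkN "woman" "women" ;`. Note that `V2 : V mkV 0 4 2` parses to a rule whose legacy category is `V2` but whose GF category is `V`, which is how the config folds several legacy categories into one.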