updated morphodict/README.md with MkMorphodict help

2021-05-28 14:44:52 +02:00
parent 7c4546f3c3
commit 38d852a5bb
2 changed files with 47 additions and 24 deletions
--- a/src/morphodict/MkMorphodict.hs
+++ b/src/morphodict/MkMorphodict.hs
@@ -20,7 +20,6 @@ import System.Environment (getArgs)

 -- example:
 --   gf -make ../english/DictEng.gf
--   runghc
 --   runghc MkMorphodict.hs pgf MorphoDictEng.config DictEngAbs.pgf MorphoDictEng
 -- 64923 ->  56599 functions

@@ -138,7 +137,7 @@ mkMorphoDict env =
    _ -> []

  renames :: [RawRule] -> [RuleData]
--  renames fls = [((mkFun (f ++ [show i,c]),c),l) | (i,((f,c),l)) <- zip [1..] fls] -- disambiguate with int
+---  renames fls = [((mkFun (f ++ [show i,c]),c),l) | (i,((f,c),l)) <- zip [1..] fls] -- disambiguate with int
  renames fls = [((mkFun (f ++ fs ++ [c]),c),l) | (i,(((f,c),l),fs)) <- zip [1..] (zip fls (minimize fls))] -- disambiguate with different forms

  minimize :: [RawRule] -> [[String]]
--- a/src/morphodict/README.md
+++ b/src/morphodict/README.md
@@ -21,13 +21,13 @@ Functions names should be easy to guess:
 - `baseform_Category`

 Baseforms that have many different lemgrams are an exception.
-They should be numbered as
- `lie_1_V` ("lie, lay, lain") 
- `lie_2_V` ("lie, lied lied")
+They should be disambiguated by adding the differing forms, as in
+- `lie_lay_V` ("lie, lay, lain")
+- `lie_lied_V` ("lie, lied, lied")

 Such distinctions are made in all cases where there are alternative inflections, even if there is no sense distinction:
- `learn_1_V` ("learn, learned, learned")
- `learn_2_V` ("learn, learnt, learnt")
+- `learn_learned_V` ("learn, learned, learned")
+- `learn_learnt_V` ("learn, learnt, learnt")
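
The naming scheme above can be sketched in a few lines of Haskell. This is only an illustration under stated assumptions: `nameEntries` is a hypothetical helper, not the `renames`/`minimize` code in `MkMorphodict.hs`, and it assumes that all entries passed in share the same base form and category and list their forms in the same order. It appends the form at the first position where the entries disagree.

```haskell
import Data.List (intercalate, nub, transpose)

-- Hypothetical helper (not from MkMorphodict.hs): build function names for
-- entries that share a base form and a category. Each entry is its list of
-- inflection forms, base form first.
nameEntries :: String -> [[String]] -> [String]
nameEntries cat entries = map name entries
  where
    -- Positions at which the entries do not all have the same form.
    differing =
      [ i | (i, col) <- zip [0 ..] (transpose entries), length (nub col) > 1 ]
    -- Base form, then (if needed) the form at the first differing position,
    -- then the category, joined by underscores.
    name fs = intercalate "_" (take 1 fs ++ extra fs ++ [cat])
    extra fs = case differing of
      []      -> []
      (i : _) -> take 1 (drop i fs)

main :: IO ()
main = do
  mapM_ putStrLn (nameEntries "V" [["lie", "lay", "lain"], ["lie", "lied", "lied"]])
  mapM_ putStrLn (nameEntries "V" [["walk", "walked", "walked"]])
```

For the two `lie` entries this yields `lie_lay_V` and `lie_lied_V`; an unambiguous entry keeps the plain `baseform_Category` name. The sketch uses a single differing form, so it does not separate three or more entries that agree at the first differing position.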

 Hence,
 - no `variants` should appear in the MorphoDict
@@ -115,26 +115,50 @@ To guarantee compatibility with the rest of the RGL and application grammars,

 ## Bootstrapping with `MkMorphoDict`

-THIS WAS AN EARLY EXPERIMENT, TO BE UPDATED
-
 Example run, English:
-
-   gf -make ../english/DictEng.gf
-   runghc MkMorphodict.hs DictEngAbs.pgf MorphoDictEng
-
-Result: 64923 ->  56599 functions, of which 21679 could be compounds
-
-Swedish, using a dump of SALDO (not available in these sources)
 ```
-  cd saldo/
-  runghc SaldoGF.hs
-  # combine abs.tmp with Saldo.header to obtain Saldo.gf
-  # combine cnc.tmp with SaldoSwe.header to obtain SaldoSwe.gf
-  gf -make SaldoSwe.gf
-  cd ..
-  runghc MkMorphodict.hs saldo/Saldo.pgf MorphoDictSwe
+  gf -make ../english/DictEng.gf
+  runghc MkMorphodict.hs pgf MorphoDictEng.config DictEngAbs.pgf MorphoDictEng
+  ```
+Or, if you have raw data from another source in the format `N woman women`, you can run
 ```
-  
+  runghc MkMorphodict.hs raw MorphoDictEng.config raw_words_eng.txt MorphoDictEng
+  ```
+The script needs a *configuration file* mapping legacy categories and form lists to parts of GF code:
+```
+  N : N mkN 0 2
+  A : A mkA 0 2 4 6
+  V : V mkV 0 4 2
+  V2 : V mkV 0 4 2
+  Adv : Adv mkAdv 0
+  Prep : Prep mkPrep 0
+```
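As a rough illustration of how such a line might be read, here is a minimal Haskell sketch. It is not the parser in `MkMorphodict.hs`. The record `ConfigEntry`, its field names, and the reading of the fields (legacy category, target GF category, paradigm function, indices into the list of forms) are assumptions inferred from the example lines above.

```haskell
import Text.Read (readMaybe)

-- Hypothetical record; the field names are illustrative assumptions.
data ConfigEntry = ConfigEntry
  { legacyCat   :: String  -- category name in the source data, e.g. "V2"
  , gfCat       :: String  -- category used in the generated GF code, e.g. "V"
  , paradigm    :: String  -- paradigm function, e.g. "mkV"
  , formIndices :: [Int]   -- which forms to pass to the paradigm, in order
  } deriving (Eq, Show)

-- Parse one line such as "V2 : V mkV 0 4 2".
-- Returns Nothing if the shape is wrong or an index is not a number.
parseConfigLine :: String -> Maybe ConfigEntry
parseConfigLine line = case words line of
  (old : ":" : new : fun : ixs) -> ConfigEntry old new fun <$> mapM readMaybe ixs
  _                             -> Nothing

main :: IO ()
main = mapM_ (print . parseConfigLine)
  [ "N : N mkN 0 2"
  , "V2 : V mkV 0 4 2"
  , "Adv : Adv mkAdv 0"
  ]
```

Returning `Maybe` keeps malformed lines visible to the caller instead of silently dropping them; whether the real script reports or skips such lines is not specified here.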
+In addition, it needs *header files* containing lines to be prepended to the generated files:
+```
+  concrete MorphoDictEng of MorphoDictEngAbs =
+    CatEng [N,A,V,Adv,Prep] **
+    open
+      ParadigmsEng
+    in
+   {
+```
+```
+  abstract MorphoDictEngAbs =
+    Cat [N,A,V,Adv,Prep] **
+  {
+```
+For more details, see `MkMorphodict.hs` for the time being.
+
+If the config and header files are sound, the script produces compilable GF files.
+They also mostly comply with the guidelines given in this document.
+
+Some things TODO:
+- deal with multiwords such as "more regular" generated by Paradigms
+- use references to native Irreg files instead of very long smart paradigms
+- support increments in addition to overwrites
+
+
+
 ## Things to do

 To support the construction of a `MorphoDict`, the following should be guaranteed in the RGL: