From 38d852a5bb86f73d67b2278ceab810868c9f3b6a Mon Sep 17 00:00:00 2001
From: aarneranta
Date: Fri, 28 May 2021 14:44:52 +0200
Subject: [PATCH] updated morphodict/README.md with MkMorphodict help

---
 src/morphodict/MkMorphodict.hs |  3 +-
 src/morphodict/README.md       | 68 +++++++++++++++++++++++-----------
 2 files changed, 47 insertions(+), 24 deletions(-)

diff --git a/src/morphodict/MkMorphodict.hs b/src/morphodict/MkMorphodict.hs
index 90815481d..ed63cd743 100644
--- a/src/morphodict/MkMorphodict.hs
+++ b/src/morphodict/MkMorphodict.hs
@@ -20,7 +20,6 @@ import System.Environment (getArgs)

 -- example:
 -- gf -make ../english/DictEng.gf
--- runghc
 -- runghc MkMorphodict.hs pgf MorphoDictEng.config DictEngAbs.pgf MorphoDictEng
 -- 64923 -> 56599 functions

@@ -138,7 +137,7 @@ mkMorphoDict env =
 _ -> []

 renames :: [RawRule] -> [RuleData]
--- renames fls = [((mkFun (f ++ [show i,c]),c),l) | (i,((f,c),l)) <- zip [1..] fls] -- disambiguate with int
+--- renames fls = [((mkFun (f ++ [show i,c]),c),l) | (i,((f,c),l)) <- zip [1..] fls] -- disambiguate with int
 renames fls = [((mkFun (f ++ fs ++ [c]),c),l) | (i,(((f,c),l),fs)) <- zip [1..] (zip fls (minimize fls))] -- disambiguate with different forms

 minimize :: [RawRule] -> [[String]]
diff --git a/src/morphodict/README.md b/src/morphodict/README.md
index 590e4c64f..b64723422 100644
--- a/src/morphodict/README.md
+++ b/src/morphodict/README.md
@@ -21,13 +21,13 @@ Functions names should be easy to guess:
 - `baseform_Category`

 Baseforms that have many different lemgrams are an exception.
-They should be numbered as
-- `lie_1_V` ("lie, lay, lain")
-- `lie_2_V` ("lie, lied lied")
+They should be disambiguated by adding the differing forms, as in
+- `lie_lay_V` ("lie, lay, lain")
+- `lie_lied_V` ("lie, lied lied")

 Such distinctions are made in all cases where there are alternative inflections, even if there is no sense distinction:
-- `learn_1_V` ("learn, learned, learned")
-- `learn_2_V` ("learn, learnt, learnt")
+- `learn_learned_V` ("learn, learned, learned")
+- `learn_learnt_V` ("learn, learnt, learnt")

 Hence,
 - no `variants` should appear in the MorphoDict
@@ -115,26 +115,50 @@ To guarantee compatibility with the rest of the RGL and application grammars,

 ## Bootstrapping with `MkMorphoDict`

-THIS WAS AN EARLY EXPERIMENT, TO BE UPDATED
-
 Example run, English:
-
- gf -make ../english/DictEng.gf
- runghc MkMorphodict.hs DictEngAbs.pgf MorphoDictEng
-
-Result: 64923 -> 56599 functions, of which 21679 could be compounds
-
-Swedish, using a dump of SALDO (not available in these sources)
 ```
- cd saldo/
- runghc SaldoGF.hs
- # combine abs.tmp with Saldo.header to obtain Saldo.gf
- # combine cnc.tmp with SaldoSwe.header to obtain SaldoSwe.gf
- gf -make SaldoSwe.gf
- cd ..
- runghc MkMorphodict.hs saldo/Saldo.pgf MorphoDictSwe
+ gf -make ../english/DictEng.gf
+ runghc MkMorphodict.hs pgf MorphoDictEng.config DictEngAbs.pgf MorphoDictEng
+ ```
+Or, if you have raw data from another source, of the format "N woman women", you can do
 ```
-
+ runghc MkMorphodict.hs raw MorphoDictEng.config raw_words_eng.txt MorphoDictEng
+ ```
+The script needs a *configuration file* mapping legacy categories and forms lists to parts of GF code:
+```
+ N : N mkN 0 2
+ A : A mkA 0 2 4 6
+ V : V mkV 0 4 2
+ V2 : V mkV 0 4 2
+ Adv : Adv mkAdv 0
+ Prep : Prep mkPrep 0
+```
+In addition, it needs *header files* containing lines to be prefixed to the generated files:
+```
+ concrete MorphoDictEng of MorphoDictEngAbs =
+ CatEng [N,A,V,Adv,Prep] **
+ open
+ ParadigmsEng
+ in
+ {
+```
+```
+ abstract MorphoDictEngAbs =
+ Cat [N,A,V,Adv,Prep] **
+ {
+```
+For more details, we refer to `MkMorphodict.hs` for the time being.
+
+If the config and header files are sound, the script produces compilable GF files.
+They also mostly comply to the guidelines given in this document.
+
+Some things TODO:
+- deal with multiwords such as "more regular" generated by Paradigms
+- use references to native Irreg files instead of very long smart paradigms
+- support increments in addition to overwrites
+
+
+
 ## Things to do

 To support the construction of a `MorphoDict`, the following should be guaranteed in the RGL: