morphodict: purely morphological unilingual dictionaries
Aarne Ranta 2020-03-02 -- 2021-05-27
UNDER CONSTRUCTION, INCOMPLETE AND BUGGY
The vision
Vision 1: if you need the noun "stjärna" in Swedish, you will find it as MorphoDictSwe.stjärna_N.
Vision 2: if you analyse a Swedish text that contains the word "stjärnornas", it will be returned as MorphoDictSwe.stjärna_N.
Vision 3: this will work for all words of Swedish and all other RGL languages. Only seldom will you need ParadigmsSwe.
What is contained
The guiding principle is to provide a single source for each lemgram, i.e. each linearization record consisting of an inflection table plus inherent features. Function names should be easy to guess:
baseform_Category
Baseforms that have many different lemgrams are an exception. They should be disambiguated by adding the differing forms, as in
- lie_lay_V ("lie, lay, lain")
- lie_lied_V ("lie, lied, lied")
Such distinctions are made in all cases where there are alternative inflections, even if there is no sense distinction:
- learn_learned_V ("learn, learned, learned")
- learn_learnt_V ("learn, learnt, learnt")
Hence,
- no variants should appear in the MorphoDict
- no entries should be duplicated if their lemgrams are the same
- hence, in particular, sense distinctions do not result in different entries
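The rules above can be illustrated with a small Python sketch. It is not part of MkMorphodict.hs; the input format (a category plus a tuple of forms, with the first form taken as the baseform) is an assumption made for this example, and inherent features are left out for brevity.

```python
from collections import defaultdict

# Hypothetical input: (category, inflection forms) pairs.
# The first form is assumed to be the baseform.
entries = [
    ("V", ("lie", "lay", "lain")),
    ("V", ("lie", "lied", "lied")),
    ("V", ("lie", "lay", "lain")),  # same lemgram again, e.g. another sense
    ("N", ("star", "stars")),
]

def group_lemgrams(entries):
    """Drop duplicate lemgrams and group the rest by (baseform, category)."""
    groups = defaultdict(list)
    for cat, forms in entries:
        key = (forms[0], cat)
        if forms not in groups[key]:
            groups[key].append(forms)
    return dict(groups)

groups = group_lemgrams(entries)

# Baseforms with more than one lemgram need a disambiguated function name.
ambiguous = {key for key, lemgrams in groups.items() if len(lemgrams) > 1}
```

Here the repeated "lie, lay, lain" entry is dropped, and ("lie", "V") is the only key reported as needing disambiguation.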
The dictionary will also exclude multiwords consisting of several tokens. Most of the time, even compounds written as single tokens should be excluded. However, as the status of a compound is not always clear, and since they do not create spurious morphological analyses, they can be tolerated, in particular if extracted from legacy sources.
Since multiwords and compounds are excluded, Paradigms and MakeStructural should provide, for each language, API functions that make them easy to define, preferably of the form
mkC : Str -> C -> C
This form is not enough when separate functions are needed for gluing and for concatenating compounds.
Open question: what to do with compound prepositions that are common in e.g. English? The above principles imply
according_to_Prep = mkPrep "according" to_Prep
defined outside MorphoDictEng, so that mkPrep comes from ParadigmsEng and to_Prep from MorphoDictEng.
This may seem to go against tradition, but it follows the general guidelines of morphological dictionaries.
Relevant categories
In addition to sense distinctions, MorphoDict ignores subcategorizations. One reason is that, just like senses (although to a lesser degree), they are open-ended and sometimes vague. Another reason is that different subcategory variants overload morphological analysis.
The most numerous categories to be addressed are content words:
- A
- Adv
- Interj
- N
- PN
- Symb
- V
In addition, structural words should appear here with their native lemma names:
- Conj
- Det
- IAdv
- IDet
- IP
- NP (special NP-like "pronouns", such as "somebody")
- Prep
- Pron (in the RGL only covering personal pronouns)
- Punct
- Subj
Additional language-specific categories can be included if the reasons are clear.
They must then be importable from the Paradigms module for that language, together with mk functions.
The Extend module may also put them in use in syntax.
Following the model of the Universal Tagset, we add a category X for unspecified words in Extend, with the linearization type {s : Str}.
Hence it can only be used for uninflected strings with unclear status.
Naming
As stated before,
- functionname = baseform_category if there is a unique lemgram
- = baseform_number_category if there is a need to disambiguate
The disambiguation numbering should reflect the frequency or probability of the lemgram, but this is just a recommendation, since the frequency is not always known.
The baseform should be the native alphabet baseform in Unicode letters, which is as such a valid GF identifier.
However, if the word contains characters that are not legal in identifiers, the function name should simply be enclosed in single quotes, rather than inventing transliterations.
If function names are formed by the API function PGF.mkCId, these conventions are automatically followed.
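As an illustration of these naming conventions, here is a minimal Python sketch. It is not a reimplementation of PGF.mkCId: the identifier test (a letter followed by letters, digits, underscores or apostrophes) and the backslash escaping inside quoted names are simplifying assumptions, and function_name is a hypothetical helper.

```python
def is_plain_identifier(name):
    """Assumed rule: a letter followed by letters, digits, "_" or "'"."""
    return name[:1].isalpha() and all(c.isalnum() or c in "_'" for c in name)

def function_name(baseform, category, number=None):
    """Build baseform_category, or baseform_number_category if numbered."""
    parts = [baseform] if number is None else [baseform, str(number)]
    name = "_".join(parts + [category])
    if is_plain_identifier(name):
        return name
    # Assumption: backslashes and single quotes are backslash-escaped
    # inside a quoted identifier.
    escaped = name.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"

plain = function_name("stjärna", "N")
numbered = function_name("lie", "V", 1)
quoted = function_name("vis-à-vis", "Prep")
```

Under these assumptions, "stjärna" keeps its native spelling, while "vis-à-vis" is quoted as a whole because the hyphen is not an identifier character.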
Coding conventions
To enable easy ocular and automatic inspection,
- write one entry per line, each prefixed by the fun or lin keyword
- sort the entries alphabetically
- use paradigms with enough arguments to make the characteristic forms explicit
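The first two conventions lend themselves to an automatic check. The following Python sketch shows one possible check; check_entries is a hypothetical helper, it expects only the entry lines of a file (module header and closing brace already removed), and it assumes that "alphabetically" means Python's default string order on function names.

```python
def check_entries(lines):
    """Return a list of problems found in the entry lines of one file."""
    problems = []
    names = []
    for number, line in enumerate(lines, start=1):
        words = line.split()
        if len(words) < 2 or words[0] not in ("fun", "lin"):
            problems.append(f"line {number}: expected an entry starting with fun or lin")
            continue
        names.append(words[1])
    # Assumption: alphabetical order is plain string order on function names.
    if names != sorted(names):
        problems.append("entries are not sorted by function name")
    return problems

good = [
    'lin learn_learned_V = mkV "learn" "learned" "learned" ;',
    'lin learn_learnt_V = mkV "learn" "learnt" "learnt" ;',
    'lin lie_lay_V = mkV "lie" "lay" "lain" ;',
]
bad = [
    'lin star_N = mkN "star" ;',
    'oper x = 1 ;',
    'lin moon_N = mkN "moon" ;',
]
good_problems = check_entries(good)
bad_problems = check_entries(bad)
```

The first list passes cleanly; the second is flagged twice, once for the line that does not start with fun or lin and once for the sorting.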
To guarantee compatibility with the rest of the RGL and application grammars,
- paradigms used should be imported from Paradigms and MakeStructural rather than defined in MorphoDict itself
- import of low-level modules such as Res should be avoided
- MorphoDict should be self-contained, i.e. not inherit from other modules such as Structural or Irreg. But it is OK to open them in a qualified mode to use when defining linearizations.
Bootstrapping with MkMorphoDict
Example run, English:
gf -make ../english/DictEng.gf
runghc MkMorphodict.hs pgf MorphoDictEng.config DictEngAbs.pgf MorphoDictEng
Or, if you have raw data from another source, of the format "N woman women", you can do
runghc MkMorphodict.hs raw MorphoDictEng.config raw_words_eng.txt MorphoDictEng
The script needs a configuration file mapping legacy categories and forms lists to parts of GF code:
N : N mkN 0 2
A : A mkA 0 2 4 6
V : V mkV 0 4 2
V2 : V mkV 0 4 2
Adv : Adv mkAdv 0
Prep : Prep mkPrep 0
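One plausible reading of a configuration line is: legacy category, then the GF category, the paradigm function, and the positions of the characteristic forms in the word's form list. The Python sketch below follows that reading. It is an assumption and not a description of what MkMorphodict.hs actually does; the helper names, the shape of the output lines and the order of the example form list are all hypothetical.

```python
def parse_config_line(line):
    """Split e.g. 'N : N mkN 0 2' into its parts."""
    legacy_cat, rest = line.split(":", 1)
    gf_cat, paradigm, *indices = rest.split()
    return legacy_cat.strip(), gf_cat, paradigm, [int(i) for i in indices]

def make_entry(config_line, forms):
    """Build the fun and lin lines for one word from its full form list."""
    _, gf_cat, paradigm, indices = parse_config_line(config_line)
    name = f"{forms[0]}_{gf_cat}"
    # Pick the characteristic forms in the order given by the config line.
    args = " ".join(f'"{forms[i]}"' for i in indices)
    return f"fun {name} : {gf_cat} ;", f"lin {name} = {paradigm} {args} ;"

# Hypothetical form list, assumed order: sg nom, sg gen, pl nom, pl gen.
fun_line, lin_line = make_entry(
    "N : N mkN 0 2", ["woman", "woman's", "women", "women's"]
)
v2_config = parse_config_line("V2 : V mkV 0 4 2")
```

With this reading, positions 0 and 2 select "woman" and "women" as the arguments of mkN, and the legacy category V2 is mapped to the GF category V.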
In addition, it needs header files containing lines to be prefixed to the generated files:
concrete MorphoDictEng of MorphoDictEngAbs =
CatEng [N,A,V,Adv,Prep] **
open
ParadigmsEng
in
{

abstract MorphoDictEngAbs =
Cat [N,A,V,Adv,Prep] **
{
For more details, we refer to MkMorphodict.hs for the time being.
If the config and header files are sound, the script produces compilable GF files. They also mostly comply with the guidelines given in this document.
Some things TODO:
- deal with multiwords such as "more regular" generated by Paradigms
- use references to native Irreg files instead of very long smart paradigms
- support increments in addition to overwrites
Things to do
To support the construction of a MorphoDict, the following should be provided in Paradigms:
- explicit smart paradigms with characteristic forms and inherent features for each category
- API constants for all inherent features that are needed
- compound-constructing functions for all categories that need them
- the extra categories that one wants to include in that language