gf-core/lib/src/finnish/stemmed/log.txt

AR 28/3/2013
26/3 Morphology from Kotus.
27/3 Senses from Princeton.
27/3
Designed new paradigms. Filtered out problematic or illegal entries (PLURNOUN, ILLEGALVERB, POSTPONE, TODO).
Only 9035 lemmas are missing now.
28/3
Set up an experiment with 3220 complete trees from Penn prepared by Krasimir. First results:
561 no linearization
960 lin with unknowns
around 20 missing syntax constructions, 230 missing words
Tests generated by
gf -run ~/GF/lib/src/ParseEngFin.pgf <wsj.full >4-eng-fin-wsj.txt
with
l -treebank -bind PhrUtt NoPConj (UttS (UseCl (TTAnt TPast ASimul)
29/3
Added most missing syntax constructions.
Some new opers in ParadigmsFin, and 230 more words in DictEngFin: out of 3220 Penn trees now 2721
are completely translated (but mostly not so well...)
317 no lin
182 lin with unknowns
After implementing GerundN and GerundNP, only 40 lin with unknowns. But the implementations are bad:
- applying to run-time V prevents correct vowel harmony
- compound forms in "minen" should use "mis": e.g. hinnoitteleminendetaljit should be hinnoittelemisdetaljit
Counting funs:
gf ../GF/lib/src/ParseEng.pgf <wsj.full >funs-wsj.txt
with
pt -funs PhrUtt NoPConj (UttS (UseCl (TTAnt TPast ASimul) ...
From this, with some ghci commands, created freq-wsj.txt, showing
AdvVP 1174
AdvNP 1075
UsePron 749
PossNP 749
UseV 675
in_Prep 671
and_Conj 659
UseComp 651
IIDig 620
and a total of 4512 funs used in the 3220 trees.
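The ghci commands used for this step are not recorded here. A minimal sketch of the same counting, written in Python rather than Haskell, could look as follows. It assumes that funs-wsj.txt has one tree per line with whitespace-separated function names and no shell noise, which may not match the real output of pt -funs. The helpers count_funs and format_freq are hypothetical, and the sample input is toy data, not corpus data.

```python
from collections import Counter


def count_funs(lines):
    """Count how often each function name occurs across all trees."""
    counts = Counter()
    for line in lines:
        # Assumption: function names are separated by whitespace.
        counts.update(line.split())
    return counts


def format_freq(counts):
    """Render 'name count' lines, most frequent first, ties broken by name."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{name} {n}" for name, n in ranked]


# Toy input standing in for funs-wsj.txt (not real corpus data).
sample = [
    "PhrUtt NoPConj UttS UseCl AdvVP UseV",
    "PhrUtt NoPConj UttS UseCl AdvVP AdvNP UsePron",
]
freq = count_funs(sample)
report = format_freq(freq)
```

On the real file, the lines would come from open("funs-wsj.txt") and the report would be written out as freq-wsj.txt.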
Then created a list of missing funs in ParseFin: there are 8820 of them. However, only 80 missing funs appear in the corpus!
some_Quant 72
anyPl_Det 44
part_of_N2 34
both_Det 32
most_Det 28
ComplN2 21
several_Num 19
another_Quant 19
UseN2 16
neither_Det 11
CNNumNP 8
draw_V2 7
aware_of_A2 7
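A list like the one above can be obtained by keeping only the corpus funs that the target grammar lacks and ranking them by corpus frequency. The following is a sketch under assumptions: missing_in_corpus is a hypothetical helper, the four counts are copied from this log, and the contents of parse_fin_funs are invented for illustration.

```python
def missing_in_corpus(corpus_freq, target_funs):
    """Return (fun, count) pairs for funs that occur in the corpus but are
    absent from the target grammar, most frequent first."""
    missing = {f: n for f, n in corpus_freq.items() if f not in target_funs}
    return sorted(missing.items(), key=lambda kv: (-kv[1], kv[0]))


# Counts taken from this log; the target set is a made-up stand-in
# for the funs actually available in ParseFin.
corpus_freq = {"AdvVP": 1174, "some_Quant": 72, "ComplN2": 21, "UseN2": 16}
parse_fin_funs = {"AdvVP"}

ranked = missing_in_corpus(corpus_freq, parse_fin_funs)
for fun, n in ranked:
    print(fun, n)
```

Ranking by corpus frequency rather than listing all missing funs is what makes the work tractable: only the funs that actually occur need attention first.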
The next thing is to find out why ComplN2 and UseN2 are missing - they should be there.
It turned out that this happens just because there was no N2 in the lexicon. Strange... adding just
"part of" and "idea of" (as well as "familiar with") changes 35 sentences. Now only 9 with unknown
constants. 314 without lin.
Attacked the ten most frequent missing constructs, down to those with 4 occurrences. Now 13 with unknown, 167 missing.
Thus almost 95% complete.
Defined some more, down to the 34th with 2 occurrences. Now 32 missing, 18 with unknown
(version 7, 7-eng-fin-wsj.txt). Thus over 98% complete. Soon time to fix errors in the things covered!
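The completeness figures follow from the counts in this log, taking a tree as complete when it has a linearization with no unknowns. The arithmetic can be checked with a small sketch. The coverage helper is hypothetical, and it assumes that the "no lin"/"missing" and "with unknowns" groups do not overlap, which agrees with the 2721 figure reported on 29/3.

```python
def coverage(total, no_lin, with_unknowns):
    """Return (complete trees, percentage complete).

    Assumes the two failure groups are disjoint."""
    complete = total - no_lin - with_unknowns
    return complete, 100.0 * complete / total


TOTAL = 3220  # trees in the experiment

first = coverage(TOTAL, 561, 960)   # first results, 28/3
second = coverage(TOTAL, 317, 182)  # after adding syntax and words, 29/3
third = coverage(TOTAL, 167, 13)    # after the ten most frequent constructs
fourth = coverage(TOTAL, 32, 18)    # version 7

for complete, pct in (first, second, third, fourth):
    print(f"{complete} of {TOTAL} complete ({pct:.1f}%)")
```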
Fixed obvious errors in "date" (taateli -> päivämäärä) and "force" (polttaminen -> voima). Effect on
24 examples.
30/3
Version 9: Changed subcat's of 170 of 230 V2's (the ones with 3 occurrences or more). One hour's work. Changes in 1124
translations.
Also changed the default genitive of symbol (+n) to +in, to be uniform with the other cases. Works for words ending
in a consonant: Inteln -> Intelin. But a proper morphological analysis with dynamic lex extension is what would
really be needed.
Fixed NounFin.IndefArt, which erroneously added "yksi" to the substantival form of numeral determiners. This changed 125
linearizations - but there are some mistaken parses of numbers in the treebank, in particular years. Also fixed the passive
VP in the infinitive form, with better results in 95 sentences - but this structure should be different in Finnish.
Fixing passive past tenses improved 250 sentences! Incredibly, they had been missing in the RGL. So had the correct
form of the compounds: "minut ollaan nähty" -> "minut on nähty" ("I have been seen").
Fixed the form for NPossNom and NPossGen. It had mistakenly been the Nom form. This gave "rakkausnsa" ("his love").
The proper form is the tk-2 prefix of the essive case (essive "rakkautena" minus its last two characters): "rakkautensa";
the tk-1 genitive won't do ("rakkaudensa").
This improved 81 sentences.
Added NCompound, or form nr 10, to nouns. This may differ from Nom Sg, e.g. käteinenvirtaus -> käteisvirtaus. 107 errors
corrected by this.
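The two compounding examples in this log (käteinen -> käteis, and "minen" -> "mis") fit one simple pattern: a modifier ending in -nen takes -s in compounds. A minimal sketch of that one pattern follows. It is not the GF implementation, where NCompound is a separate noun form, and it ignores every other case where the compound form may differ from Nom Sg. The helpers compound_stem and compound are hypothetical.

```python
def compound_stem(nom_sg):
    """Modifier form for compounds: -nen becomes -s, anything else is
    left as the Nom Sg form (a deliberate simplification)."""
    if nom_sg.endswith("nen"):
        return nom_sg[:-3] + "s"
    return nom_sg


def compound(modifier, head):
    """Join a modifier noun and a head noun into one compound."""
    return compound_stem(modifier) + head


print(compound("käteinen", "virtaus"))
print(compound("hinnoitteleminen", "detaljit"))
```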