a few more words in DictFin, though only the most frequent missing ones - now complete lin for over 90% of the complete trees

aarne
2013-03-29 17:42:47 +00:00
parent 66530dec81
commit 8b97b049e0
2 changed files with 293 additions and 237 deletions
@@ -15,6 +15,15 @@ Set up an experiment with 3220 complete trees from Penn prepared by Krasimir. Fi
around 20 missing syntax constructions, 230 missing words
Tests generated by
gf -run ~/GF/lib/src/ParseEngFin.pgf <wsj.full >4-eng-fin-wsj.txt
with
l -treebank -bind PhrUtt NoPConj (UttS (UseCl (TTAnt TPast ASimul)
29/3
Added most missing syntax constructions.
Some new opers in ParadigmsFin, and 230 more words in DictEngFin: out of 3220 Penn trees now 2721
@@ -26,5 +35,48 @@ After implementing GerundN and GerundNP, only 40 lin with unknowns. But the impl
- applying to run-time V prevents correct vowel harmony
- composite forms with "minen" should be "mis", e.g. hinnoitteleminendetaljit
Counting funs:
gf ../GF/lib/src/ParseEng.pgf <wsj.full >funs-wsj.txt
with
pt -funs PhrUtt NoPConj (UttS (UseCl (TTAnt TPast ASimul) ...
From this, with some ghci commands, created freq-wsj.txt, showing
AdvVP 1174
AdvNP 1075
UsePron 749
PossNP 749
UseV 675
in_Prep 671
and_Conj 659
UseComp 651
IIDig 620
and a total of 4512 funs used in the 3220 trees.
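The tally itself was done with ghci commands that are not recorded here. As a rough illustration only, the same kind of count could be sketched in Python, assuming funs-wsj.txt holds whitespace-separated fun names, one tree per line (an assumption about the `pt -funs` output, not something these notes confirm). The names `count_funs` and `format_freq` are made up for this sketch.

```python
from collections import Counter

def count_funs(lines):
    """Tally every whitespace-separated fun name over all lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # blank lines contribute nothing
    return counts

def format_freq(counts):
    """Render 'Fun N' lines, most frequent first, ties broken by name."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{fun} {n}" for fun, n in ranked]

# Toy input standing in for funs-wsj.txt (assumed format, not real data).
sample = ["AdvVP UseV in_Prep", "AdvVP UsePron", ""]
freq = count_funs(sample)
for row in format_freq(freq):
    print(row)
```

With the real file, `count_funs(open("funs-wsj.txt"))` would play the role of `sample`, and `len(freq)` would give the number of distinct funs used.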
Then created a list of missing funs in ParseFin: there are 8820 of them. However, only 80 of the missing funs appear in the corpus! The most frequent:
some_Quant 72
anyPl_Det 44
part_of_N2 34
both_Det 32
most_Det 28
ComplN2 21
several_Num 19
another_Quant 19
UseN2 16
neither_Det 11
CNNumNP 8
draw_V2 7
aware_of_A2 7
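The filtering step behind this list can be sketched as a set difference followed by a lookup in the corpus counts. This is a minimal Python sketch, assuming the fun names of both grammars are available as plain lists and the corpus counts as a dictionary. The helper name `missing_in_corpus` and the toy data are hypothetical, not taken from the actual grammars.

```python
def missing_in_corpus(source_funs, target_funs, corpus_counts):
    """Funs present in the source grammar but absent from the target,
    restricted to those that actually occur in the corpus.
    Returns (fun, count) pairs, most frequent first."""
    missing = set(source_funs) - set(target_funs)
    used = [(f, corpus_counts[f]) for f in missing if corpus_counts.get(f, 0) > 0]
    return sorted(used, key=lambda kv: (-kv[1], kv[0]))

# Toy data only: a handful of names, not the real grammars or counts.
eng_funs = ["AdvVP", "UseN2", "ComplN2", "some_Quant", "rare_N"]
fin_funs = ["AdvVP"]
counts = {"AdvVP": 5, "some_Quant": 3, "ComplN2": 2, "UseN2": 2}

result = missing_in_corpus(eng_funs, fin_funs, counts)
for fun, n in result:
    print(fun, n)
```

Here `rare_N` is missing from the target but never occurs in the counts, so it drops out. This is the same effect that reduces the long list of missing funs to the few that matter for the corpus.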
The next thing is to find out why ComplN2 and UseN2 are missing - they should be there.
It turned out that this happens just because there was no N2 in the lexicon. Strange... Adding just
"part of" and "idea of" (as well as "familiar with") changes 35 sentences. Now only 9 lin with unknown
constants, and 314 trees without lin.
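As a quick check of the coverage figure, assuming the 314 cases without lin are counted against the same 3220 trees, the share of trees with a lin works out as follows (`lin_coverage` is just an illustrative helper):

```python
def lin_coverage(total_trees, without_lin):
    """Fraction of trees for which a linearization was produced."""
    return (total_trees - without_lin) / total_trees

coverage = lin_coverage(3220, 314)
print(f"{coverage:.1%}")  # (3220 - 314) / 3220 = 2906 / 3220
```

Under that assumption the result is just above 90%.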