forked from GitHub/gf-rgl
a few more words in DictFin, but the most frequent missing ones - now complete lin for over 90% of the complete trees
@@ -15,6 +15,15 @@ Set up an experiment with 3220 complete trees from Penn prepared by Krasimir. Fi

around 20 missing syntax constructions, 230 missing words

Tests generated by

gf -run ~/GF/lib/src/ParseEngFin.pgf <wsj.full >4-eng-fin-wsj.txt

with

l -treebank -bind PhrUtt NoPConj (UttS (UseCl (TTAnt TPast ASimul)


29/3
Added most missing syntax constructions.
Some new opers in ParadigmsFin, and 230 more words in DictEngFin: out of 3220 Penn trees now 2721
@@ -26,5 +35,48 @@ After implementing GerundN and GerundNP, only 40 lin with unknowns. But the impl
- applying to run-time V prevents correct vowel harmony
- composite forms with "minen" should be "mis", e.g. hinnoitteleminendetaljit

Counting funs:

gf ../GF/lib/src/ParseEng.pgf <wsj.full >funs-wsj.txt

with

pt -funs PhrUtt NoPConj (UttS (UseCl (TTAnt TPast ASimul) ...

From this, with some ghci commands, created freq-wsj.txt, showing

AdvVP 1174
AdvNP 1075
UsePron 749
PossNP 749
UseV 675
in_Prep 671
and_Conj 659
UseComp 651
IIDig 620

and a total of 4512 funs used in the 3220 trees.
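
The counting step can be sketched roughly as follows. This is not the ghci session mentioned above (those commands are not recorded here) but a Python stand-in. It assumes that funs-wsj.txt holds whitespace-separated function names, which is a guess about the output layout of `pt -funs`. The helper names `count_funs`, `format_freq` and `write_freq` are hypothetical.

```python
from collections import Counter


def count_funs(lines):
    """Count how often each function name occurs over all lines.

    Assumption: every line is a run of whitespace-separated function
    names; the real layout of funs-wsj.txt is not shown in the notes.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def format_freq(counts):
    """Render 'name count' lines, most frequent first, ties by name."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{name} {n}" for name, n in ranked]


def write_freq(in_path, out_path):
    """Read a fun listing and write the frequency table to out_path."""
    with open(in_path, encoding="utf-8") as f:
        counts = count_funs(f)
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n".join(format_freq(counts)) + "\n")
    return len(counts)  # number of distinct funs seen
```

Calling `write_freq("funs-wsj.txt", "freq-wsj.txt")` would produce a table in the same "name count" shape as the listing above, provided the input assumption holds.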

Then created a list of missing funs in ParseFin: there are 8820 of them. However, only 80 missing funs appear in the corpus!

some_Quant 72
anyPl_Det 44
part_of_N2 34
both_Det 32
most_Det 28
ComplN2 21
several_Num 19
another_Quant 19
UseN2 16
neither_Det 11
CNNumNP 8
draw_V2 7
aware_of_A2 7
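
A minimal sketch of how the "missing but actually used" list could be derived, under stated assumptions: the corpus frequencies are available as "name count" lines (as in freq-wsj.txt), and the set of funs that ParseFin does provide is already at hand. How that set was obtained is not described in these notes, so it is passed in as a plain set. The names `parse_freq` and `missing_in_corpus` are hypothetical.

```python
def parse_freq(lines):
    """Parse 'name count' lines into a dict; blank lines are skipped."""
    freq = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            freq[parts[0]] = int(parts[1])
    return freq


def missing_in_corpus(freq, target_funs):
    """List (name, count) for corpus funs absent from the target grammar.

    freq        -- dict of fun name to corpus frequency
    target_funs -- set of fun names available in the target grammar
                   (assumed input; not produced by this sketch)
    """
    missing = [(name, n) for name, n in freq.items()
               if name not in target_funs]
    # Most frequent first, so the most useful gaps come on top.
    return sorted(missing, key=lambda kv: (-kv[1], kv[0]))
```

Sorting by frequency is what makes the list actionable: only the funs that occur in the corpus matter for coverage, and the top entries affect the most trees.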

The next thing is to find out why ComplN2 and UseN2 are missing - they should be there.
It turned out that this happens just because there was no N2 in the lexicon. Strange... adding just
"part of" and "idea of" (as well as "familiar with") changes 35 sentences. Now only 9 with unknown
constants. 314 without lin.
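
As a quick check of the coverage figure, assuming the 314 trees without lin are counted against the same 3220 Penn trees (the notes do not say so explicitly), the share of trees with a complete linearization works out as below. The function name `coverage` is hypothetical.

```python
def coverage(total, without_lin):
    """Percentage of trees that do get a linearization."""
    return 100.0 * (total - without_lin) / total


# Assumption: 314 is measured against all 3220 trees.
print(round(coverage(3220, 314), 1))  # 90.2
```

That is 2906 of 3220 trees, which agrees with the "over 90%" in the commit message.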