AR 28/3/2013

26/3 Morphology from Kotus.

27/3 Senses from Princeton.

27/3 
Designed new paradigms. Filtered problematic/illegal things (PLURNOUN, ILLEGALVERB, POSTPONE, TODO).
Just 9035 lemmas missing now.

28/3
Set up an experiment with 3220 complete trees from Penn prepared by Krasimir. First results:
  561 no linearization
  960 lin with unknowns

Around 20 syntax constructions and 230 words are missing.

Tests generated by

  gf -run ~/GF/lib/src/ParseEngFin.pgf <wsj.full >4-eng-fin-wsj.txt

where wsj.full contains commands of the form

  l -treebank -bind  PhrUtt NoPConj (UttS (UseCl (TTAnt TPast ASimul) ...


29/3
Added most missing syntax constructions.
Some new opers in ParadigmsFin, and 230 more words in DictEngFin. Of the 3220 Penn trees, 2721
are now completely translated (but mostly not so well...). The rest:
  317 no lin
  182 lin with unknowns

After implementing GerundN and GerundNP, only 40 lin with unknowns. But the implementations are bad:
- applying them to a run-time V prevents correct vowel harmony
- compound forms built on "minen" should use the stem "mis": hinnoitteleminendetaljit should be hinnoittelemisdetaljit

Counting funs:

  gf ../GF/lib/src/ParseEng.pgf <wsj.full >funs-wsj.txt

where wsj.full now contains commands of the form

  pt -funs   PhrUtt NoPConj (UttS (UseCl (TTAnt TPast ASimul) ...

From this, with some ghci commands, created freq-wsj.txt, showing

AdvVP	1174
AdvNP	1075
UsePron	749
PossNP	749
UseV	675
in_Prep	671
and_Conj	659
UseComp	651
IIDig	620

and a total of 4512 distinct funs used in the 3220 trees.
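
The ghci commands themselves were not written down. As a rough sketch of the same counting step in Python (not the original method), assuming that funs-wsj.txt holds the whitespace-separated function names printed by pt -funs and nothing else; count_funs and format_freq are hypothetical helper names, and the sample input is a toy, not corpus data:

```python
from collections import Counter

def count_funs(lines):
    """Count how often each function name occurs.

    Assumption: every whitespace-separated token is a function name.
    Any other shell output mixed into the file must be filtered out first.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def format_freq(counts):
    """Render 'name<TAB>count' lines, most frequent first (ties by name)."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return "\n".join(f"{name}\t{n}" for name, n in ranked)

# Toy input standing in for the real file contents.
sample = ["AdvVP UseV in_Prep", "AdvVP UsePron", "AdvVP in_Prep"]
freq = count_funs(sample)
print(format_freq(freq))
print("distinct funs:", len(freq))
```

For the real file, count_funs(open("funs-wsj.txt")) would do the same; len(freq) is the number of distinct funs, as opposed to the sum of all occurrences.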

Then created a list of missing funs in ParseFin: there are 8820 of them. However, only 80 missing funs appear in the corpus!

some_Quant	72
anyPl_Det	44
part_of_N2	34
both_Det	32
most_Det	28
ComplN2	21
several_Num	19
another_Quant	19
UseN2	16
neither_Det	11
CNNumNP	8
draw_V2	7
aware_of_A2	7
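
How the list of missing funs was produced is not recorded here. A minimal sketch of the filtering step in Python, assuming a frequency table for the corpus and a set of funs that lack a linearization in ParseFin are already available; missing_in_corpus is a hypothetical name, the three counts are copied from the tables in these notes, and zebra_N is a made-up fun that does not occur in the corpus:

```python
def missing_in_corpus(corpus_freq, missing_funs):
    """Return (fun, count) pairs for missing funs that occur in the corpus,
    most frequent first (ties broken by name)."""
    hits = {f: n for f, n in corpus_freq.items() if f in missing_funs}
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))

# Counts taken from the frequency lists in these notes.
corpus_freq = {"AdvVP": 1174, "some_Quant": 72, "ComplN2": 21}
# zebra_N is invented for illustration: missing, but never used in the corpus.
missing_funs = {"some_Quant", "ComplN2", "zebra_N"}

for fun, n in missing_in_corpus(corpus_freq, missing_funs):
    print(f"{fun}\t{n}")
```

Only the intersection matters for the experiment, which is why 8820 missing funs shrink to 80 relevant ones.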

The next thing is to find out why ComplN2 and UseN2 are missing - they should be there.
It turned out that this happened simply because there was no N2 in the lexicon. Strange... Adding just
"part of" and "idea of" (as well as "familiar with") changes 35 sentences. Now only 9 with unknown
constants and 314 without lin.

Attacked the ten most frequent missing constructs, down to those with 4 occurrences. Now 13 with unknown, 167 missing.
Thus almost 95% complete.

Defined some more, down to the 34th with 2 occurrences. Now 32 missing, 18 with unknown 
(version 7, 7-eng-fin-wsj.txt). Thus over 98% complete. Soon time to fix errors in the things covered!
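
The completeness figures can be rechecked with a few lines of Python. This is a sketch under one assumption: a tree counts as complete when it has a linearization and no unknown constants, which matches the 2721 reported on 29/3. The function name coverage and the stage labels are hypothetical:

```python
TOTAL_TREES = 3220  # Penn trees in the experiment

def coverage(no_lin, with_unknowns, total=TOTAL_TREES):
    """Return (complete trees, percentage), where complete means
    neither missing a linearization nor containing unknown constants."""
    complete = total - no_lin - with_unknowns
    return complete, 100.0 * complete / total

# (label, trees without lin, trees with unknowns), as given in the notes.
stages = [
    ("29/3 first run", 317, 182),
    ("first ten constructs", 167, 13),
    ("version 7", 32, 18),
]

for label, no_lin, unknowns in stages:
    complete, pct = coverage(no_lin, unknowns)
    print(f"{label}: {complete} of {TOTAL_TREES} ({pct:.1f}%)")
```

If only the trees without lin were counted as incomplete, the percentages would come out slightly higher.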

Fixed obvious errors in "date" (taateli -> päivämäärä) and "force" (polttaminen -> voima). This affects
24 examples.




