AR 28/3/2013

26/3 Morphology from Kotus.

27/3 Senses from Princeton.

27/3 
Designed new paradigms. Filtered problematic/illegal things (PLURNOUN, ILLEGALVERB, POSTPONE, TODO).
Just 9035 lemmas missing now.

28/3
Set up an experiment with 3220 complete trees from Penn prepared by Krasimir. First results:
  561 no linearization
  960 lin with unknowns

around 20 missing syntax constructions, 230 missing words 

Tests generated by

  gf -run ~/GF/lib/src/ParseEngFin.pgf <wsj.full >4-eng-fin-wsj.txt

with 

  l -treebank -bind  PhrUtt NoPConj (UttS (UseCl (TTAnt TPast ASimul)


29/3
Added most missing syntax constructions.
Some new opers in ParadigmsFin, and 230 more words in DictEngFin: out of 3220 Penn trees now 2721 
are completely translated (but mostly not so well...)
  317 no lin
  182 lin with unknowns

After implementing GerundN and GerundNP, only 40 lin with unknowns. But the implementations are bad:
- applying to run-time V prevents correct vowel harmony
- composite forms with "minen" should be "mis", e.g. hinnoitteleminendetaljit
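The second problem can be illustrated outside GF. A minimal Python sketch, assuming plain string concatenation of compound parts (the function name `compound` is hypothetical and this is not how the grammar builds the forms):

```python
def compound(modifier, head):
    """Join two compound parts, using the 'mis' stem for 'minen' nouns."""
    suffix = "minen"
    if modifier.endswith(suffix):
        # hinnoitteleminen -> hinnoittelemis
        modifier = modifier[:-len(suffix)] + "mis"
    return modifier + head

# The form currently produced is hinnoitteleminendetaljit.
print(compound("hinnoitteleminen", "detaljit"))
```

This only covers the stem change; it does nothing about the vowel harmony problem.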

Counting funs:

  gf ../GF/lib/src/ParseEng.pgf <wsj.full >funs-wsj.txt

with 

  pt -funs   PhrUtt NoPConj (UttS (UseCl (TTAnt TPast ASimul) ...

From this, with some ghci commands, created freq-wsj.txt, showing

AdvVP	1174
AdvNP	1075
UsePron	749
PossNP	749
UseV	675
in_Prep	671
and_Conj	659
UseComp	651
IIDig	620

and a total of 4512 distinct funs used in the 3220 trees.
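The ghci commands used for the count are not recorded here. A rough sketch of the same counting step, written in Python rather than Haskell, assuming that funs-wsj.txt holds whitespace-separated function names (the exact output layout of `pt -funs` is an assumption) and that freq-wsj.txt is tab-separated with the most frequent fun first:

```python
from collections import Counter

def count_funs(text):
    """Count how often each function name occurs in whitespace-separated text."""
    return Counter(text.split())

def format_freqs(freqs):
    """Render counts as 'name<TAB>count' lines, most frequent first."""
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return "\n".join(f"{name}\t{count}" for name, count in ranked)

# Toy input for illustration only, not real corpus data.
sample = "AdvVP UseV in_Prep\nAdvVP UsePron\n"
freqs = count_funs(sample)
print(format_freqs(freqs))
print(len(freqs), "distinct funs")
```

On the real data one would read the file contents instead of `sample`; `len(freqs)` then corresponds to the number of distinct funs.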

Then created a list of missing funs in ParseFin: there are 8820 of them. However, only 80 missing funs appear in the corpus!

some_Quant	72
anyPl_Det	44
part_of_N2	34
both_Det	32
most_Det	28
ComplN2	21
several_Num	19
another_Quant	19
UseN2	16
neither_Det	11
CNNumNP	8
draw_V2	7
aware_of_A2	7

The next thing is to find out why ComplN2 and UseN2 are missing - they should be there.
It turned out that this happens just because there was no N2 in the lexicon. Strange... adding just
"part of" and "idea of" (as well as "familiar with") changes 35 sentences. Now only 9 with unknown
constants. 314 without lin.

Attacked the first ten missing constructs down to 4 occurrences. Now 13 with unknown, 167 missing. 
Thus almost 95% complete.

Defined some more, down to the 34th with 2 occurrences. Now 32 missing, 18 with unknown 
(version 7, 7-eng-fin-wsj.txt). Thus over 98% complete. Soon time to fix errors in the things covered!
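The percentages can be rechecked from the counts in this log. A small Python sketch, assuming "complete" means a tree with neither a missing linearization nor unknown constants (this matches the 2721 figure reported earlier):

```python
TOTAL_TREES = 3220

def coverage(total, no_lin, with_unknowns):
    """Return (number of complete trees, percentage of complete trees)."""
    complete = total - no_lin - with_unknowns
    return complete, 100.0 * complete / total

# (label, no lin, lin with unknowns), all taken from the entries above.
stages = [
    ("28/3 first results", 561, 960),
    ("29/3 after new syntax and words", 317, 182),
    ("29/3 first ten constructs", 167, 13),
    ("29/3 version 7", 32, 18),
]
for label, no_lin, unknowns in stages:
    complete, pct = coverage(TOTAL_TREES, no_lin, unknowns)
    print(f"{label}: {complete} complete ({pct:.1f}%)")
```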

Fixed obvious errors in "date" (taateli -> päivämäärä) and "force" (polttaminen -> voima). Effect on
24 examples.


30/3

Version 9: Changed subcats of 170 of 230 V2s (the ones with 3 occurrences or more). One hour's work. Changes in 1124 
translations. 

Also changed the default genitive of symbol (+n) to +in, to be uniform with the other cases. Works for words ending
in a consonant: Inteln -> Intelin. But a proper morphological analysis with dynamic lex extension is what would
really be needed.
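To make the limitation concrete, here is a naive Python sketch that picks the ending from the final letter of the symbol. The vowel test and the example "Nokia" are assumptions for illustration, not part of the grammar, and a spelling-based rule like this is no substitute for real morphological analysis:

```python
FINNISH_VOWELS = set("aeiouyäö")

def symbol_genitive(symbol):
    """Guess the genitive of an unanalysed symbol from its last letter."""
    if not symbol:
        return symbol
    if symbol[-1].lower() in FINNISH_VOWELS:
        return symbol + "n"   # vowel-final: plain +n
    return symbol + "in"      # consonant-final: linking i, as in Intelin

for name in ["Intel", "Nokia"]:
    print(name, "->", symbol_genitive(name))
```

The current default corresponds to always taking the second branch.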

Fixed NounFin.IndefArt, which erroneously added "yksi" to the substantival form of numeral determiners. This changed 125
linearizations - but there are some mistaken parses of numbers in the treebank, in particular years. Also fixed the passive
VP in the infinitive form, giving better results in 95 sentences - but this structure should be different in Finnish:

  


