Lab 2: Multilingual generation and translation
This lab corresponds to Chapters 5 to 9 of the Notes, but follows them only loosely. Therefore we will structure it according to the exercise sessions rather than chapters.
The abstract syntax is given in the subdirectory grammars/abstract/
- Go to universaldependencies.org and download the Version 2.7+ treebanks.
- Look up the Parallel UD (PUD) treebanks for the 21 languages that have one. They are named e.g. UD_English-PUD/.
- Select a language to compare with English.
- Make statistics about the frequencies of POS tags and dependency labels in your language compared with English: find the top-20 tags/labels and their number of occurrences. What does this tell you about the language? (This can be done with shell or Python programming or, more easily, with the deptreepy or gf-ud tools. The latter is also available on the eduserv server.)
- Draw word alignments for some non-trivial example in the PUD treebank, on paper. Use the same trees as in the previous question. What can you say about the syntactic differences between the languages?
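If you choose the Python route for the frequency statistics, a minimal sketch could look like the following. It assumes the standard ten-column, tab-separated CoNLL-U layout, with the UPOS tag in column 4 and the dependency label in column 8. The function name conllu_counts and the tiny SAMPLE sentence are made up for illustration and are not taken from PUD.

```python
from collections import Counter


def conllu_counts(lines):
    """Count UPOS tags (column 4) and dependency labels (column 8)."""
    pos, deprel = Counter(), Counter()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Keep only ordinary token lines: this skips comments, blank lines,
        # multiword-token ranges such as "1-2" and empty nodes such as "1.1".
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        pos[cols[3]] += 1
        deprel[cols[7]] += 1
    return pos, deprel


# Toy input (hypothetical, not from PUD), one CoNLL-U sentence.
SAMPLE = """# text = The dog barks.
1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tdog\tdog\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tbarks\tbark\tVERB\t_\t_\t0\troot\t_\t_
4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_
""".splitlines()

if __name__ == "__main__":
    pos, deprel = conllu_counts(SAMPLE)
    # most_common(20) gives the top-20 entries with their counts.
    for tag, count in pos.most_common(20):
        print(tag, count)
    for label, count in deprel.most_common(20):
        print(label, count)
```

To run it on a real treebank, pass an open file instead of SAMPLE, for example conllu_counts(open(path, encoding="utf-8")), once for English and once for your language, and compare the two top-20 lists side by side.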
Chapter 2: design the morphological types of the major parts of speech in your selected language
- It is enough to cover NOUN, ADJ, and VERB.
- Use a traditional grammar book or a Wikipedia article to identify the inflectional and inherent features.
- Then use data from PUD to check which morphological features actually occur in the treebank for that language.
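The treebank check in the last step can be sketched in Python as follows. This assumes the CoNLL-U convention that column 6 (FEATS) holds Name=Value pairs separated by "|", with "_" for no features. The function name feature_inventory and the two sample token lines are hypothetical, chosen only to show the shape of the result.

```python
from collections import defaultdict


def feature_inventory(lines, tags=("NOUN", "ADJ", "VERB")):
    """Map each UPOS tag to {feature name: set of values} found in FEATS."""
    inventory = {tag: defaultdict(set) for tag in tags}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Ordinary token lines only (no comments, ranges or empty nodes).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        upos, feats = cols[3], cols[5]
        if upos not in inventory or feats == "_":
            continue
        for pair in feats.split("|"):
            name, _, value = pair.partition("=")
            # A feature may list several values separated by commas.
            inventory[upos][name].update(value.split(","))
    return inventory


# Toy token lines (hypothetical, not from PUD).
FEATS_SAMPLE = [
    "1\tdogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\t_\tMood=Ind|Tense=Pres|VerbForm=Fin\t0\troot\t_\t_",
]

if __name__ == "__main__":
    for tag, features in feature_inventory(FEATS_SAMPLE).items():
        for name in sorted(features):
            print(tag, name, sorted(features[name]))
```

Comparing this inventory with the features listed in your grammar book shows which distinctions the treebank annotates and which it leaves out.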
After lecture 6
- Design a morphology for the main lexical types (N, A, V) with parameters and a couple of paradigms.
- Test it by implementing the lexicon in the MicroLang module. You need to define lincat N,A,V,V2 as well as the paradigms in MicroResource.
To deliver: the lexicon part of files MicroLangX.gf and MicroResourceX.gf for your language of choice X. Follow the structure of MicroLangEng and MicroResourceEng when preparing these.
After lecture 7
- Define the linearization types of main phrasal categories - the remaining categories in MicroLang.
- Define the rest of the linearization rules in MicroLang.
To deliver: MicroLangX and MicroResourceX for your language of choice, with the lexicon part from Session 5 completed with the syntax part.
After lecture 8
- Try out the applications in ../python and read its README carefully.
- Add a concrete syntax for your language to one of the grammars in ../python/, either Query or Draw. The simplest way to do this is first to copy the Eng grammar and then to change the words; the syntax may work well as it is. The result may be a bit unnatural, but it should be natural in a broad sense.
- Compile the grammar with gf -make Query???.gf so that your grammar gets included (the same for Draw).
- Generate phrases in GF by first importing your pgf file and then issuing the command gt | l -treebank; fix your grammar if it looks too bad.
- Test the corresponding Python application with your language.
The Python code with embedded GF grammars will be explained in greater detail in Lecture 9.
To deliver: your grammar module.
Deadline: 29 May 2024. Demo your grammars (both Micro and this one) at the last lecture of the course!
A method for testing your Micro grammar
Since MicroLang is a proper part of the RGL, it can be easily implemented as an application grammar.
How to do this is shown in grammar/functor/, where the implementation consists of two files:
- MicroLangFunctor.gf, which is a generic implementation working for all RGL languages,
- MicroLangFunctorEng.gf, which is a functor instantiation for English, easily reproducible for languages other than Eng.
To use this for testing, you can take the following steps:
- Build a functor instantiation for your language by copying MicroLangFunctorEng.gf and changing Eng in the file name and inside the file to your language code.
- Use GF to create a test file by random generation:
$ echo "gr -number=1000 | l -tabtreebank" | gf english/MicroLangEng.gf functor/MicroLangFunctorEng.gf >test.tmp
- Inspect the resulting file test.tmp. You can also use Unix cut to create separate files for the two versions of the grammar and diff to compare them:
$ cut -f2 test.tmp >test1.tmp
$ cut -f3 test.tmp >test2.tmp
$ diff test1.tmp test2.tmp
52c52
< the hot fire teachs her
---
> the hot fire teaches her
69c69
< the man teachs the apples
---
> the man teaches the apples
122c122
As the result shows, our implementation inflects the verb "teach" incorrectly.
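The same comparison can also be sketched in Python, which keeps the abstract syntax tree next to each differing pair and so points more directly at the function to fix. The sketch assumes, as the cut commands above imply, that each line of test.tmp is tab-separated with the tree in field 1 and the two linearizations in fields 2 and 3. The function name mismatches, the placeholder tree strings and the matching first line are hypothetical.

```python
def mismatches(lines):
    """Yield (line number, tree, field 2, field 3) where fields 2 and 3 differ."""
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a complete tree + two linearizations line
        tree, own, reference = fields[0], fields[1], fields[2]
        if own != reference:
            yield number, tree, own, reference


# Toy data in the assumed layout; "tree1" etc. stand in for real trees.
TREEBANK_SAMPLE = [
    "tree1\tthe man sleeps\tthe man sleeps",
    "tree2\tthe hot fire teachs her\tthe hot fire teaches her",
    "tree3\tthe man teachs the apples\tthe man teaches the apples",
]

if __name__ == "__main__":
    for number, tree, own, reference in mismatches(TREEBANK_SAMPLE):
        print(f"line {number}: {tree}")
        print(f"  < {own}")
        print(f"  > {reference}")
```

On a real test file, replace TREEBANK_SAMPLE with open("test.tmp", encoding="utf-8").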
The Mini grammar can be tested in the same way, by building a reference implementation using the functor in functor/.