revised lab 1 instructions

aarneranta
2021-03-30 15:48:14 +02:00
parent b7c3877756
commit 4da15031d1

@@ -6,15 +6,17 @@ The assignments are submitted via Canvas.
 ## Chapter 1: explore the parallel UD treebank (PUD)
-1. Go to https://universaldependencies.org/ and download Version 2.5 treebanks
+1. Go to https://universaldependencies.org/ and download Version 2.7 treebanks
 2. Look up the Parallel UD treebanks for those 19 languages that have it. They are named e.g. UD_English-PUD/
 3. Select a language to compare with English.
-4. Make statistics about the frequencies of POS tags and dependency labels in your language compared with English.
-For instance, the top-10 tags/labels and their number of occurrences.
+4. Make statistics about the frequencies of POS tags and dependency
+labels in your language compared with English: find the top-20 tags/labels and their number of occurrences.
 What does this tell you about the language?
-5. Convert 2x2 tree from CoNLL format to graphical tree by hand, on paper.
-Select a short English tree and its translation.
-Then select a long English tree and its translation.
+(This can be done with shell or Python programming or with the gf-ud tool.)
+5. Convert the following four trees from CoNLL format to graphical
+trees by hand, on paper.
+- a short English tree (5-10 words, of your choice) and its translation.
+- a long English tree (>25 words) and its translation.
 6. Draw word alignments for some non-trivial example in the PUD treebank, on paper.
 Use the same trees as in the previous question.
 What can you say about the syntactic differences between the languages?
@@ -29,6 +31,7 @@ The assignments are submitted via Canvas.
 ## Chapter 3: UD syntax analysis
 Take a bilingual corpus with English and your own language, and annotate with UD.
+The English text is given in the file `comp-syntax-corpus-english.txt` in this directory.
 The UD annotation that you produce manually can be simplified CoNLL, with just the fields
 `position word postag head label`
@@ -48,10 +51,10 @@ expands to
 `7 world _ NOUN _ _ 4 nmod _ _`
 (Unfortunately, the tabs are not visible in the md output.)
+The conversion to full CoNLL can be done using Python or `gf-ud reduced2conll`
 Once you have full CoNLL, you can use for instance the gfud tool to visualize it.
-The corpus is given in the file comp-syntax-corpus-english.txt in this directory.
 Your task is to
 1. write an English CoNLL file analysing this corpus
 2. translate the corpus to your language