first version of the lab material in place

aarneranta
2021-03-24 09:14:41 +01:00
parent 040c93bddf
commit 7afba566d5
27 changed files with 2514 additions and 0 deletions

lab1/README.md

@@ -0,0 +1,110 @@
# Lab 1: Grammatical analysis
This lab follows Chapters 1-4 in the course notes. Each part is started after the lecture on the corresponding chapter.
The assignments are submitted via Canvas.
## Chapter 1: explore the parallel UD treebank (PUD)
1. Go to https://universaldependencies.org/ and download the Version 2.5 treebanks.
2. Look up the Parallel UD (PUD) treebanks for the 19 languages that have one. Their directories are named like `UD_English-PUD/`.
3. Select a language to compare with English.
4. Compute the frequencies of POS tags and dependency labels in your language and compare them with English.
For instance, the top-10 tags/labels and their number of occurrences.
What does this tell you about the language?
5. Convert 2x2 trees from CoNLL format to graphical trees by hand, on paper.
Select a short English tree and its translation.
Then select a long English tree and its translation.
6. Draw word alignments for some non-trivial examples in the PUD treebank, on paper.
Use the same trees as in the previous question.
What can you say about the syntactic differences between the languages?
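
For the statistics in step 4, a short script is usually easier than counting by hand. The following is a minimal Python sketch, not part of the course material. It assumes the standard 10-column, tab-separated CoNLL-U layout (POS tag in column 4, dependency label in column 8); the function name `count_tags` and the toy sentence are made up for illustration.

```python
from collections import Counter

def count_tags(conllu_lines):
    """Count POS tags (column 4) and dependency labels (column 8)."""
    pos_counts, label_counts = Counter(), Counter()
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        fields = line.split("\t")
        if len(fields) != 10 or not fields[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        pos_counts[fields[3]] += 1
        label_counts[fields[7]] += 1
    return pos_counts, label_counts

# Toy annotation invented for this example (not taken from PUD).
sample = """# text = The dog sees the cat .
1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tdog\tdog\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tsees\tsee\tVERB\t_\t_\t0\troot\t_\t_
4\tthe\tthe\tDET\t_\t_\t5\tdet\t_\t_
5\tcat\tcat\tNOUN\t_\t_\t3\tobj\t_\t_
6\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_
"""

pos, labels = count_tags(sample.splitlines())
print(pos.most_common(10))     # top-10 POS tags with their counts
print(labels.most_common(10))  # top-10 dependency labels with their counts
```

To run it on a real treebank, pass an open file instead of the sample lines, e.g. `count_tags(open(path, encoding="utf-8"))`, where `path` points to the `.conllu` file of the PUD treebank you downloaded.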
## Chapter 2: design the morphological types of the major parts of speech in your selected language
1. It is enough to cover NOUN, ADJ, and VERB.
2. Use a traditional grammar book or a Wikipedia article to identify the inflectional and inherent features.
3. Then use data from PUD to check which morphological features actually occur in the treebank for that language.
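
The check in step 3 can be sketched as follows. This is a hedged Python example under the assumption that the treebank uses the standard CoNLL-U layout, with the POS tag in column 4 and the morphological features in column 6 as `Name=Value` pairs separated by `|`. The function name `features_by_pos` and the sample lines are invented for illustration.

```python
from collections import defaultdict

def features_by_pos(conllu_lines, wanted=("NOUN", "ADJ", "VERB")):
    """Map each wanted POS tag to the feature names and values seen with it."""
    seen = {pos: defaultdict(set) for pos in wanted}
    for line in conllu_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 10 or fields[3] not in seen:
            continue  # comments, blank lines, and other parts of speech
        if fields[5] == "_":
            continue  # no morphological features on this word
        for pair in fields[5].split("|"):
            name, _, value = pair.partition("=")
            seen[fields[3]][name].add(value)
    return seen

# Hypothetical lines in CoNLL-U layout (not taken from PUD).
sample = """1\tcasas\tcasa\tNOUN\t_\tGender=Fem|Number=Plur\t0\troot\t_\t_
2\tblancas\tblanco\tADJ\t_\tGender=Fem|Number=Plur\t1\tamod\t_\t_
1\tcasa\tcasa\tNOUN\t_\tGender=Fem|Number=Sing\t0\troot\t_\t_
"""

result = features_by_pos(sample.splitlines())
for pos_tag, feats in result.items():
    for name in sorted(feats):
        print(pos_tag, name, sorted(feats[name]))
```

Comparing this output with the features listed in your grammar book shows which of them are actually annotated in the treebank.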
## Chapter 3: UD syntax analysis
Take a bilingual corpus with English and your own language, and annotate it with UD.
The UD annotation that you produce manually can be simplified CoNLL, with just the fields
`position word postag head label`
Make sure that each field is exactly one token, so that the whole line has exactly 5 tokens.
This input can be automatically expanded to full CoNLL by adding underscores for the lemma, morphology, and other missing fields, as well as tabs between the fields (if you didn't use tabs already):
`position word _ postag _ _ head label _ _`
Example:
`7 world NOUN 4 nmod`
expands to
`7 world _ NOUN _ _ 4 nmod _ _`
(Unfortunately, the tabs are not visible in the md output.)
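
The expansion can be done with a few lines of code. Below is a minimal Python sketch, assuming the five fields are separated by spaces or tabs. The function names `expand_line` and `expand_text` are hypothetical and not part of gfud or any other tool.

```python
def expand_line(simple_line):
    """Expand 'position word postag head label' to a 10-field CoNLL line."""
    fields = simple_line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {simple_line!r}")
    position, word, postag, head, label = fields
    # Underscores fill the lemma, second POS, morphology, and two final fields.
    return "\t".join([position, word, "_", postag, "_", "_", head, label, "_", "_"])

def expand_text(text):
    """Expand a whole file; blank lines and '#' comment lines are kept as-is."""
    out = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            out.append(line)
        else:
            out.append(expand_line(line))
    return "\n".join(out)

print(expand_line("7 world NOUN 4 nmod"))
```

The `ValueError` catches lines where a field accidentally contains a space, which is the most common reason for a line not having exactly 5 tokens.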
Once you have full CoNLL, you can use for instance the gfud tool to visualize it.
The corpus is given in the file comp-syntax-corpus-english.txt in this directory.
Your task is to
1. write an English CoNLL file analysing this corpus
2. translate the corpus to your language
3. write a CoNLL file analysing your translation
The corpus is a combination of different sources, including the Parallel UD treebank (PUD).
If you want to cheat - or just check your own answer - you can look for those sentences in the official PUD.
The first 12 sentences are POS-tagged, with each word having the form
`word:<POS>`
Hint: you can initialize the task by converting each `word` or `word:<POS>` to a simplified CoNLL line with a dummy head (0) and label (dep), and with the proper position number, of course.
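
One way to follow the hint is sketched below in Python. It is only an illustration: the function name `init_sentence` is made up, and using the UD tag `X` as a placeholder for words without a `:<POS>` suffix is an assumption of this sketch, to be replaced by hand.

```python
import re

# Matches tokens of the form word:<POS>; the word itself may contain ':'.
TAGGED = re.compile(r"^(.+):<([A-Z]+)>$")

def init_sentence(sentence):
    """Turn one corpus line into simplified CoNLL lines with dummy head and label."""
    lines = []
    for position, token in enumerate(sentence.split(), start=1):
        match = TAGGED.match(token)
        if match:
            word, postag = match.groups()
        else:
            word, postag = token, "X"  # placeholder tag, an assumption of this sketch
        lines.append(f"{position} {word} {postag} 0 dep")
    return lines

for line in init_sentence("Who:<PRON> are:<AUX> they:<PRON> ?:<PUNCT>"):
    print(line)
```

The resulting lines already have exactly 5 tokens each, so only the head and label columns (and the placeholder tags in the untagged sentences) remain to be filled in manually.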
Extra: If you want to see the visual trees, you can build the gfud program from
`https://github.com/GrammaticalFramework/gf-ud`
and issue the command
`cat my-file.conllu | ./gfud conll2pdf`
You will need Haskell and GF libraries to build gfud, and LaTeX to show the pdf.
## Chapter 4: phrase structure analysis
1. Construct phrase structure trees for some of the sentences in the corpus used in Chapter 3, both for English and your chosen language.
2. Test the grammar
https://github.com/GrammaticalFramework/gf-ud/blob/master/grammars/English.dbnf
on the corpus from Chapter 3, both for English and your own language.
3. Modify the grammar to suit your language and test it on some of the UD treebanks by using `gf-ud eval`.
The gf-ud program can be found in executable versions (once gunzipped) in
http://www.grammaticalframework.org/~aarne/software/
The source code of gf-ud can be found in
https://github.com/GrammaticalFramework/gf-ud
It can be built with `make` if you have Haskell and also have built the gf-core libraries:
https://github.com/GrammaticalFramework/gf-core
Building from source is not needed if you can use one of the ready-made executables.


@@ -0,0 +1,24 @@
Who:<PRON> are:<AUX> they:<PRON> ?:<PUNCT>
A:<DET> small:<ADJ> town:<NOUN> with:<ADP> two:<NUM> minarets:<NOUN> glides:<VERB> by:<ADV> .:<PUNCT>
I:<PRON> was:<AUX> just:<ADV> a:<DET> boy:<NOUN> with:<ADP> muddy:<ADJ> shoes:<NOUN> .:<PUNCT>
Shenzhen:<PROPN> 's:<PART> traffic:<NOUN> police:<NOUN> have:<AUX> opted:<VERB> for:<ADP> unconventional:<ADJ> penalties:<NOUN> before:<ADV> .:<PUNCT>
The:<DET> study:<NOUN> of:<ADP> volcanoes:<NOUN> is:<AUX> called:<VERB> volcanology:<NOUN> ,:<PUNCT> sometimes:<ADV> spelled:<VERB> vulcanology:<NOUN> .:<PUNCT>
It:<PRON> was:<AUX> conducted:<VERB> just:<ADV> off:<ADP> the:<DET> Mexican:<ADJ> coast:<NOUN> from:<ADP> April:<PROPN> to:<ADP> June:<PROPN> .:<PUNCT>
":<PUNCT> Her:<PRON> voice:<NOUN> literally:<ADV> went:<VERB> around:<ADP> the:<DET> world:<NOUN> ,:<PUNCT> ":<PUNCT> Leive:<PROPN> said:<VERB> .:<PUNCT>
A:<DET> witness:<NOUN> told:<VERB> police:<NOUN> that:<SCONJ> the:<DET> victim:<NOUN> had:<AUX> attacked:<VERB> the:<DET> suspect:<NOUN> in:<ADP> April:<PROPN> .:<PUNCT>
It:<PRON> 's:<AUX> most:<ADV> obvious:<ADJ> when:<SCONJ> a:<DET> celebrity:<NOUN> 's:<PART> name:<NOUN> is:<AUX> initially:<ADV> quite:<ADV> rare:<ADJ> .:<PUNCT>
This:<PRON> has:<AUX> not:<PART> stopped:<VERB> investors:<NOUN> flocking:<VERB> to:<PART> put:<VERB> their:<PRON> money:<NOUN> in:<ADP> the:<DET> funds:<NOUN> .:<PUNCT>
This:<DET> discordance:<NOUN> between:<ADP> economic:<ADJ> data:<NOUN> and:<CCONJ> political:<ADJ> rhetoric:<NOUN> is:<AUX> familiar:<ADJ> ,:<PUNCT> or:<CCONJ> should:<AUX> be:<AUX> .:<PUNCT>
The:<DET> feasibility:<NOUN> study:<NOUN> estimates:<VERB> that:<SCONJ> it:<PRON> would:<AUX> take:<VERB> passengers:<NOUN> about:<ADV> four:<NUM> minutes:<NOUN> to:<PART> cross:<VERB> the:<DET> Potomac:<PROPN> River:<PROPN> on:<ADP> the:<DET> gondola:<NOUN> .:<PUNCT>
he collected cards and traded them with the other boys
this crime carries a penalty of five years in prison
the news was carried to every village in the province
I carry these thoughts in the back of my head
Adam would have been carried over into the life eternal
the casings had rotted away and had to be replaced
she was incensed that this chit of a girl should dare to make a fool of her in front of the class
the landslide he had in the electoral college obscured the narrowness of a victory based on just 43% of the popular vote
United States troops now carry atropine and autoinjectors in their first-aid kits to use in case of organophosphate nerve agent poisoning
he may accomplish by craft in the long run what he cannot do by force and violence in the short one
it has been said that only a hierarchical society with a leisure class at the top can produce works of art
his ingenuous explanation that he would not have burned the church if he had not thought the bishop was in it