add info on sv corpus and deptreepy + minor 2024 updates

Arianna Masciolini
2024-03-21 16:05:12 +01:00
parent 51739f91d5
commit 2c6c134a53


@@ -6,18 +6,14 @@ The assignments are submitted via Canvas.
## Chapter 1: explore the parallel UD treebank (PUD)
1. Go to https://universaldependencies.org/ and download Version 2.7+ treebanks
1. Go to [universaldependencies.org](https://universaldependencies.org/) and download Version 2.7+ treebanks
2. Look up the Parallel UD treebanks for those 21 languages that have it. They are named e.g. `UD_English-PUD/`
3. Select a language to compare with English.
4. Make statistics about the frequencies of POS tags and dependency
labels in your language compared with English: find the top-20 tags/labels and their number of occurrences.
What does this tell you about the language?
(This can be done with shell or Python programming or with the gf-ud tool, which is available on the eduserv server. In Python, you can for example use the [conllu library](https://github.com/EmilStenstrom/conllu))
5. Convert the following four trees from CoNLL format to graphical
trees by hand, on paper.
- a short English tree (5-10 words, of your choice) and its translation.
- a long English tree (>25 words) and its translation.
6. Draw word alignments for some non-trivial example in the PUD treebank, on paper.
4. Make statistics about the frequencies of POS tags and dependency labels in your language compared with English: find the top-20 tags/labels and their number of occurrences. What does this tell you about the language? (This can be done with shell or Python programming or, more easily, with the [deptreepy](https://github.com/aarneranta/deptreepy/) or [gf-ud](https://github.com/grammaticalFramework/gf-ud) tools. The latter is also available on the eduserv server.)
5. Convert the following four trees from CoNLL-U format to graphical trees by hand, on paper.
- a short English tree (5-10 words, of your choice) and its translation.
- a long English tree (>25 words) and its translation.
6. Draw word alignments for some non-trivial example in the PUD treebank, on paper.
Use the same trees as in the previous question.
What can you say about the syntactic differences between the languages?
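
The frequency statistics from step 4 can also be computed in plain Python, without any extra library. The following is only a minimal sketch: the function name `count_tags` is made up for this example, and the file path is whatever you pass on the command line. It relies on the CoNLL-U layout, where each token line has 10 tab-separated columns, with the POS tag (UPOS) in column 4 and the dependency label (DEPREL) in column 8.

```python
import sys
from collections import Counter


def count_tags(conllu_text):
    """Count UPOS tags and dependency labels in a CoNLL-U string."""
    upos, deprel = Counter(), Counter()
    for line in conllu_text.splitlines():
        # Skip blank lines (sentence separators) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if not cols[0].isdigit():
            continue
        upos[cols[3]] += 1
        deprel[cols[7]] += 1  # subtypes such as nsubj:pass are counted as-is
    return upos, deprel


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: the first argument is the path to a .conllu file.
    with open(sys.argv[1], encoding="utf-8") as f:
        upos, deprel = count_tags(f.read())
    for name, counter in (("UPOS", upos), ("DEPREL", deprel)):
        print(name)
        for tag, n in counter.most_common(20):
            print(f"{tag}\t{n}")
```

Running the script once on the English PUD file and once on the file for your chosen language gives two top-20 lists that can be compared side by side.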
@@ -30,18 +26,27 @@ trees by hand, on paper.
## Chapter 3: UD syntax analysis
In this lab, you will annotate a bilingual corpus with English and your own language with UD.
The English text is given in the file `comp-syntax-corpus-english.txt` in this directory.
In this lab, you will annotate a bilingual corpus with UD.
You can either start with an English corpus and translate it into a language of your choice, or start with a Swedish corpus and translate it into English.
Your task is to
1. write an English CoNLL file analysing this corpus
2. translate the corpus to your language
Your task is to:
1. write a CoNLL file analysing your chosen corpus
2. translate it
3. write a CoNLL file analysing your translation
### Option 1: English data
The English text is given in the file [`comp-syntax-corpus-english.txt`](comp-syntax-corpus-english.txt) in this directory.
The corpus is a combination of different sources, including the Parallel UD treebank (PUD).
If you want to cheat - or just check your own answer - you can look for those sentences in the official PUD. You can also compare your analyses with those of an automatic parser, such as [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/), which you can try directly from your browser. These automatic analyses must of course be taken with a grain of salt.
The first 12 sentences are POS-tagged, with each word having the form
### Option 2: Swedish data
The Swedish text is given in the file [`comp-syntax-corpus-swedish.txt`](comp-syntax-corpus-swedish.txt) in this directory.
It consists of teacher-corrected sentences from the [Swedish Learner Language (SweLL) corpus](https://spraakbanken.gu.se/en/resources/swell-gold), which is currently being annotated in UD for the first time.
In this case, there is no "gold standard" to check your answers against, but by choosing this corpus you will directly contribute to an ongoing annotation project.
Of course, you can still compare your solutions with [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/)'s automatic analyses.
In both corpora, the first few sentences are POS-tagged, with each word having the form
`word:<POS>`
@@ -68,16 +73,27 @@ expands to
(Unfortunately, the tabs are not visible in the md output.)
The conversion to full CoNLL can be done using Python or `gf-ud reduced2conll` (available on eduserv) or with [this script](https://gist.github.com/harisont/612a87d20f729aa3411041f873367fa2).
Once you have full CoNLL, you can use for instance the gf-ud tool or [the online CoNLL-U viewer](https://universaldependencies.org/conllu_viewer.html) to visualize it.
Once you have full CoNLL, you can use [deptreepy](https://github.com/aarneranta/deptreepy/), [gf-ud](https://github.com/grammaticalFramework/gf-ud) or [the online CoNLL-U viewer](https://universaldependencies.org/conllu_viewer.html) to visualize it.
If you use the gf-ud tool, you will need to issue the command
With deptreepy, you will need to issue the command
`cat my-file.conllu | python deptreepy.py visualize_conllu > my-file.html`
which creates an HTML file you can open in your web browser.
If you use the gf-ud tool, the command is
`cat my-file.conllu | ./gf-ud conll2pdf`
It is possible that you won't be able to visualize the trees directly on eduserv.
Building gf-ud and running this command on your machine requires Haskell and the GF libraries, as well as LaTeX to show the pdf output.
which generates a PDF. However, this does not support all foreign characters.
## Chapter 4: phrase structure analysis
(It is possible that you won't be able to visualize the trees directly on eduserv.
Building gf-ud and running this command on your machine requires Haskell and the GF libraries, as well as LaTeX to show the pdf output.)
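
Since the tabs are easy to lose when editing a CoNLL file by hand, it can help to check the file before visualizing it. The sketch below is not part of deptreepy or gf-ud; `find_malformed_lines` is a hypothetical helper that only checks that every token line has the 10 tab-separated columns that full CoNLL-U requires.

```python
def find_malformed_lines(conllu_text):
    """Return (line_number, n_columns) for each token line that does not
    have exactly 10 tab-separated columns."""
    problems = []
    for number, line in enumerate(conllu_text.splitlines(), start=1):
        # Blank lines separate sentences and comment lines start with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        n_columns = len(line.split("\t"))
        if n_columns != 10:
            problems.append((number, n_columns))
    return problems
```

A line whose columns are separated by spaces instead of tabs is reported as having a single column, which is a common symptom of an editor that has replaced the tabs.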
## (Chapter 4: phrase structure analysis)
> __NOTE:__ chapter 4 is __not__ required in the 2024 edition of the course.
> You are of course welcome to try out these exercises, but they will not be graded.
### Prerequisites: get `gf-ud` to work
There are multiple ways to use `gf-ud`: