2026 modifications to lab 3

This commit is contained in:
Arianna Masciolini
2026-03-29 21:15:06 +02:00
parent f25a32e983
commit 088f52a0f6
3 changed files with 15 additions and 50 deletions

View File

@@ -12,28 +12,27 @@ Go to [universaldependencies.org](https://universaldependencies.org/) and downlo
Choose a short (5-10 tokens) and a long (>25 words) sentence and convert it from CoNNL-U to a graphical trees by hand.
### Step 2: choose a corpus
Choose one of the two corpora provided in this folder:
Choose a corpus of 25+ sentences.
- [`comp-syntax-corpus-english.txt`](comp-syntax-corpus-english.txt) is a combination of __English__ sentences from different sources, including [the Parallel UD treebank (PUD)](https://github.com/UniversalDependencies/UD_English-PUD/tree/master). If you want to cheat - or just check your answers - you can look for them in the official treebank. You can also compare your analyses with those of an automatic parser, such as [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/), which you can try directly in your browser. These automatic analyses must of course be taken with a grain of salt
- [`comp-syntax-corpus-swedish.txt`](comp-syntax-corpus-swedish.txt) consists of teacher-corrected sentences from the [__Swedish__ Learner Language (SweLL) corpus](https://spraakbanken.gu.se/en/resources/swell-gold), which is currently being annotated in UD for the first time.
In this case, there is no "gold standard" to check your answers against, but you can still compare your solutions with [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/)'s automatic analyses.
If you want to start with __English__, you can use[`comp-syntax-corpus-english.txt`](comp-syntax-corpus-english.txt) is a combination of sentences from different sources, including [the Parallel UD treebank (PUD)](https://github.com/UniversalDependencies/UD_English-PUD/tree/master). If you want to cheat - or just check your answers - you can look for them in the official treebank. You can also compare your analyses with those of an automatic parser, such as [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/), which you can try directly in your browser. These automatic analyses must of course be taken with a grain of salt. Note that the first few sentences of this corpus are pre-tokenized and POS-tagged. Each token is in the form `word:<UPOS>`.
In both corpora, the first few sentences are pre-tokenized and POS-tagged. Each token is in the form
`word:<UPOS>`.
If you want to work with __Swedish__ and might be interested in contributing to an [official UD treebank](https://github.com/universaldependencies/UD_Swedish-SweLL), ask Arianna for [a sample of the Swedish Learner Language corpus](https://spraakbanken.gu.se/en/resources/swell).
If you have other data in mind that you think would be interesting to annotate in UD (not necessarily in English or Swedish), don't hesitate to bring it up during a lab session!
### Step 3: annotate
For each sentence in the corpus, the annotation tasks consists in:
1. analyzing the sentence in UD
2. translating it to a language of your choice
2. translating it to a language of your choice (as long as one of the two versions is in English or Swedish)
3. analyzing your translation
The only required fields are `ID`, `FORM`, `UPOS`, `HEAD` and `DEPREL`.
In the end, you will submit two parallel CoNLL-U files, one containing the analyses of the source sentences and one for the analyses of the translations.
To produce the CoNLL-U files, you may work in your text editor (if you use Visual Studio Code, you can use the [vscode-conllu](https://marketplace.visualstudio.com/items?itemName=lgrobol.vscode-conllu) to get syntax highlighting), use a spreadsheet program and then export to TSV, or use a dedicated graphical annotation tool such as [Arborator](https://arborator.grew.fr/#/).
To produce the CoNLL-U files, you may work in your text editor (you can usually get syntax highlighting by changing the extension to `.tsv`), use a spreadsheet program and then export to TSV, or use a dedicated graphical annotation tool such as [Arborator](https://arborator.grew.fr/#/) (helpful but unstable!).
If you work in your text editor, it might be easier to first write a simplified CoNLL-U, with just the fields `ID`, `FORM`, `UPOS`, `HEAD` and `DEPREL`, separated by tabs, and then expand it to full CoNLL-U with [this script](https://gist.github.com/harisont/612a87d20f729aa3411041f873367fa2) (or similar).
@@ -54,7 +53,7 @@ To fully comply with the CoNLL-U standard, comment lines should consist of key-v
# comment = your comment here
```
but for this assigment lines like
but for this assignment lines like
```
# your comment here
@@ -63,24 +62,14 @@ but for this assigment lines like
are perfectly acceptable too.
### Step 4: make sure your files match the CoNLL-U specification
Once you have full CoNLL, you can use [deptreepy](https://github.com/aarneranta/deptreepy/), [STUnD](https://harisont.github.io/STUnD/) or [the official online CoNNL-U viewer](https://universaldependencies.org/conllu_viewer.html) to visualize it.
With deptreepy, you will need to issue the command
`cat my-file.conllu | python deptreepy.py visualize_conllu > my-file.html`
which creates an HTML file you can open in you web browser.
If you can visualize your trees with any of these tools, that's a very good sign that your file _more or less_ matches the CoNNL-U format!
As a last step, validate your treebank with the official UD validator.
Check your treebank with the official UD validator.
To do that, clone or download the [UD tools repository](https://github.com/UniversalDependencies/tools), move inside the corresponding folder and run
```
python validate.py PATH-TO-YOUR-TREEBANK.conllu --lang=2-LETTER-LANGCODE-FOR-YOUR-LANGUAGE --level=1
python validate.py PATH-TO-YOUR-TREEBANK.conllu --lang=2-LETTER-LANGCODE-FOR-YOUR-LANGUAGE --level=2
```
If you want to check for more subtle errors, you can [go up a few levels](https://harisont.github.io/gfaqs.html#ud-validator).
Level 2 should be enough for part 2, but you can [go up a few levels](https://harisont.github.io/gfaqs.html#ud-validator) to check for more subtle errors.
Submit the two CoNLL-U files on Canvas.
@@ -91,7 +80,7 @@ If you want to install MaChAmp on your own computer, keep in mind that very old
For more information, see [here](https://github.com/machamp-nlp/machamp/issues/42).
### Step 1: setting up MaChAmp
1. optional, but recommended: create a Python virtual environment with the command
1. create a Python virtual environment with the command
```
python -m venv ENVNAME
```
@@ -124,7 +113,7 @@ python scripts/misc/cleanconl.py PATH-TO-A-DATASET-SPLIT
This replaces the contents of your input file with a "cleaned up" version of the same treebank.
### Step 3: training
Copy `compsyn.json` to `machamp/configs` and replace the traning and development data paths with the paths to the files you selected/created in step 2.
Copy `compsyn.json` to `machamp/configs` and replace the training and development data paths with the paths to the files you selected/created in step 2.
You can now train your model by running
@@ -152,4 +141,4 @@ Then, use the `machamp/scripts/misc/conll18_ud_eval.py` script to evaluate the s
python scripts/misc/conll18_ud_eval.py PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu
```
On Canvas, submit the training logs, the predictions and the output of `conll18_ud_eval.py`, along with a short text summarizing your considerations on the performance of the parser, based on the predictions themselves and on the output of the results of the evaluation.
On Canvas, submit the training logs, the predictions and the output of `conll18_ud_eval.py`, along with a short text summarizing your considerations on the performance of the parser, based on the predictions themselves and on the automatic evaluation.