
Lab 3: Universal Dependencies

This lab is divided into two parts. In part 1, you will create a small parallel UD treebank for English/Swedish and a language of your choice. In part 2, you will train a parsing model and evaluate it on your treebank.

Part 1: UD annotation

The goal of this part of the lab is to prepare you to contribute to a UD annotation project. You will familiarize yourself with the CoNLL-U format and annotate your own parallel UD treebank.

Step 1: familiarize yourself with the CoNLL-U format

Go to universaldependencies.org and download a treebank for a language of your choice. Choose a short (5-10 tokens) and a long (>25 tokens) sentence and convert them from CoNLL-U to graphical trees by hand.
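
For reference, a CoNLL-U token line has ten tab-separated fields: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS and MISC, with _ marking empty values. A minimal example (constructed here for illustration, not taken from any official treebank):

# text = The dog barks.
1 The the DET _ _ 2 det _ _
2 dog dog NOUN _ _ 3 nsubj _ _
3 barks bark VERB _ _ 0 root _ _
4 . . PUNCT _ _ 3 punct _ _

HEAD holds the ID of the token's syntactic head (0 for the root) and DEPREL names the relation, so drawing an arrow from each head to its dependent gives you the graphical tree.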

Step 2: choose a corpus

Choose one of the two corpora provided in this folder:

  • comp-syntax-corpus-english.txt is a combination of English sentences from different sources, including the Parallel UD treebank (PUD). If you want to cheat - or just check your answers - you can look for them in the official treebank. You can also compare your analyses with those of an automatic parser, such as UDPipe, which you can try directly in your browser. These automatic analyses must of course be taken with a grain of salt.
  • comp-syntax-corpus-swedish.txt consists of teacher-corrected sentences from the Swedish Learner Language (SweLL) corpus, which is currently being annotated in UD for the first time. In this case, there is no "gold standard" to check your answers against, but you can still compare your solutions with UDPipe's automatic analyses.

In both corpora, the first few sentences are pre-tokenized and POS-tagged. Each token is in the form

word:<UPOS>.
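
For example, the beginning of a pre-annotated sentence might look like this (a constructed illustration, not an actual line from either corpus):

The:<DET> dog:<NOUN> barks:<VERB>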

Step 3: annotate

For each sentence in the corpus, the annotation task consists of:

  1. analyzing the sentence in UD
  2. translating it into a language of your choice
  3. analyzing your translation

The only required fields are ID, FORM, UPOS, HEAD and DEPREL.

In the end, you will submit two parallel CoNLL-U files, one containing the analyses of the source sentences and one for the analyses of the translations.

To produce the CoNLL-U files, you may work in your text editor (if you use Visual Studio Code, you can use the vscode-conllu extension to get syntax highlighting), use a spreadsheet program and then export to TSV, or use a dedicated graphical annotation tool such as Arborator.

If you work in your text editor, it might be easier to first write a simplified CoNLL-U file, with just the fields ID, FORM, UPOS, HEAD and DEPREL, separated by tabs, and then expand it to full CoNLL-U with this script (or similar).

Example:

7 world NOUN 4 nmod

expands to

7 world _ NOUN _ _ 4 nmod _ _
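
If you prefer to roll your own, the expansion is straightforward. Below is a minimal Python sketch (not the course's script, just an illustration under the same assumptions: tab-separated input with exactly the five fields listed above):

import sys

# Expand simplified CoNLL-U (ID, FORM, UPOS, HEAD, DEPREL) to the
# full ten-field format, filling the omitted fields with "_".
for line in sys.stdin:
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        # pass comment lines and sentence-separating blank lines through
        print(line)
        continue
    tok_id, form, upos, head, deprel = line.split("\t")
    print("\t".join([tok_id, form, "_", upos, "_", "_",
                     head, deprel, "_", "_"]))

You would run it as, e.g., cat simplified.conllu | python expand.py > full.conllu (expand.py being a file name of your choosing).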

We recommend that you annotate at least the first few sentences from scratch. When you start feeling confident, you may pre-parse the remaining ones with UDPipe and manually correct the automatic annotation. If you are unsure about an annotation choice you made, you can add a comment line (starting with #) right before the sentence in question. To fully comply with the CoNLL-U standard, comment lines should consist of key-value pairs, e.g.

# comment = your comment here

but for this assignment lines like

# your comment here

are perfectly acceptable too.

Step 4: make sure your files match the CoNLL-U specification

Once you have full CoNLL-U, you can use deptreepy, STUnD or the official online CoNLL-U viewer to visualize it.

With deptreepy, you will need to issue the command

cat my-file.conllu | python deptreepy.py visualize_conllu > my-file.html

which creates an HTML file you can open in your web browser.

If you can visualize your trees with any of these tools, that's a very good sign that your file more or less matches the CoNLL-U format!

As a last step, validate your treebank with the official UD validator. To do that, clone or download the UD tools repository, move inside the corresponding folder and run

python validate.py PATH-TO-YOUR-TREEBANK.conllu --lang=2-LETTER-LANGCODE-FOR-YOUR-LANGUAGE --level=1

If you want to check for more subtle errors, you can increase the validation level: the higher levels add UD-specific and language-specific checks on top of the basic format check.
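
For example, a level 2 check on a hypothetical Swedish treebank (sv being the language code) would look like:

python validate.py my-treebank.conllu --lang=sv --level=2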

Submit the two CoNLL-U files on Canvas.

Part 2: UD parsing

In this part of the lab, you will train and evaluate a UD parsing + POS tagging model. For better performance, you are strongly encouraged to use the MLTGPU server. If you want to install MaChAmp on your own computer, keep in mind that very old and very new Python versions are not supported. For more information, see here.

Step 1: setting up MaChAmp

  1. optional, but recommended: create a Python virtual environment with the command

    python -m venv ENVNAME
    

    and activate it with

    source ENVNAME/bin/activate (Linux/MacOS), or

    ENVNAME\Scripts\activate.bat (Windows)

  2. clone the MaChAmp repository, move inside it and run

    pip3 install -r requirements.txt
    

Step 2: preparing the training and development data

Choose a UD treebank for one of the two languages you annotated in part 1 and download it. If you translated the corpus to a language that does not have a UD treebank, download a treebank for a related language (e.g. Italian if you annotated sentences in Sardinian).

If you are working on MLTGPU, you may choose a large treebank such as Swedish-Talbanken, which is already divided into a training, development and test split.

If you are working on your laptop and/or if your language does not have a lot of data available, you may want to use a smaller treebank, such as Amharic-ATT, which only comes with a test set. In this case, split the test set into a training and a development portion (e.g. 80% of the sentences for training and 20% for development), as sketched below. Make sure both files end with an empty line.
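
If you need to produce the split yourself, a minimal Python sketch along these lines will do (file names are just examples; the 80/20 ratio follows the suggestion above):

import sys

# Split a CoNLL-U file into 80% training and 20% development data,
# keeping whole sentences together (sentences are separated by blank lines).
with open(sys.argv[1], encoding="utf-8") as f:
    sentences = [s for s in f.read().strip().split("\n\n") if s.strip()]

cut = int(len(sentences) * 0.8)
for path, part in [("train.conllu", sentences[:cut]),
                   ("dev.conllu", sentences[cut:])]:
    with open(path, "w", encoding="utf-8") as out:
        # every sentence block is followed by a blank line, so both
        # files end with an empty line as required
        out.write("\n\n".join(part) + "\n\n")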

To ensure that MaChAmp works correctly, preprocess all of your data (including your own test data) by running

python scripts/misc/cleanconl.py PATH-TO-A-DATASET-SPLIT

This replaces the contents of your input file with a "cleaned up" version of the same treebank.

Step 3: training

Copy compsyn.json to machamp/configs and replace the training and development data paths with the paths to the files you selected/created in step 2.
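
For reference, a MaChAmp dataset configuration has roughly the following shape (modelled on MaChAmp's own example configs; the actual compsyn.json may differ in details, but the two *_data_path fields are the ones to edit):

{
    "UD": {
        "train_data_path": "PATH-TO-YOUR-TRAINING-DATA.conllu",
        "dev_data_path": "PATH-TO-YOUR-DEV-DATA.conllu",
        "word_idx": 1,
        "tasks": {
            "upos": {
                "task_type": "seq",
                "column_idx": 3
            },
            "dependency": {
                "task_type": "dependency",
                "column_idx": 6
            }
        }
    }
}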

You can now train your model by running

python3 train.py --dataset_configs configs/compsyn.json --device N

from the MaChAmp folder. If you are working on MLTGPU, replace N with 0 (GPU). If you are using your laptop or EDUSERV, replace it with -1, which instructs MaChAmp to train the model on the CPU.

Everything you see on screen at this stage will be saved in a training log file called logs/compsyn/DATE/log.txt.

Step 4: evaluation

Run your newly trained model with

python predict.py logs/compsyn/DATE/model.pt PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu --device N

This saves your model's predictions, i.e. the trees produced by your new parser, in predictions/OUTPUT-FILE-NAME.conllu. You can take a look at this file to get a first impression of how your model performs.

Then, use the machamp/scripts/misc/conll18_ud_eval.py script to evaluate the system output against your annotations. Among other metrics, it reports POS tagging accuracy and the standard UAS and LAS parsing scores. You can run it as

python scripts/misc/conll18_ud_eval.py PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu

On Canvas, submit the training logs, the predictions and the output of conll18_ud_eval.py, along with a short text summarizing your considerations on the performance of the parser, based on the predictions themselves and on the results of the evaluation.