Files
comp-syntax-gu-mlt/lab3

Lab 3: Universal Dependencies

This lab is divided into two parts. In part 1, you will create a small parallel UD treebank for English/Swedish and a language of your choice. In part 2, you will train a parsing model and evaluate it on your treebank.

Part 1: UD annotation

The goal of this part of the lab is to prepare you to contribute to a UD annotation project. You will familiarize yourself with the CoNLL-U format and annotate your own parallel UD treebank.

Step 1: familiarize with the CoNLL-U format

Go to universaldependencies.org and download a treebank for a language of your choice. Choose a short (5-10 tokens) and a long (>25 tokens) sentence and convert each of them from CoNLL-U to a graphical tree by hand.
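
When converting by hand, it can help to first list each token together with its head and relation. The sketch below (plain Python, no external libraries; it ignores comment lines and does not handle multiword-token ranges or enhanced dependencies) shows the idea on a made-up three-word sentence:

```python
def arcs(conllu: str):
    """Return (form, deprel, head_form) triples for one CoNLL-U sentence."""
    tokens = {}
    for line in conllu.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip comment lines and blank sentence separators
        cols = line.split("\t")
        tokens[cols[0]] = (cols[1], cols[6], cols[7])  # FORM, HEAD, DEPREL
    return [(form, deprel, "ROOT" if head == "0" else tokens[head][0])
            for form, head, deprel in tokens.values()]

# A made-up example sentence, in the 10-column CoNLL-U layout:
sample = (
    "1\tthe\t_\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tcat\t_\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tsleeps\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
)
for form, deprel, head in arcs(sample):
    print(f"{form} --{deprel}--> {head}")
```

Each printed arc corresponds to one edge you would draw in the graphical tree.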

Step 2: choose a corpus

Choose a corpus of 25+ sentences.

If you want to start with English, you can use comp-syntax-corpus-english.txt, a combination of sentences from different sources, including the Parallel UD treebank (PUD). If you want to cheat - or just check your answers - you can look for them in the official treebank. You can also compare your analyses with those of an automatic parser, such as UDPipe, which you can try directly in your browser. These automatic analyses must of course be taken with a grain of salt. Note that the first few sentences of this corpus are pre-tokenized and POS-tagged. Each token is in the form word:<UPOS>.

If you want to work with Swedish and might be interested in contributing to an official UD treebank, ask Arianna for a sample of the Swedish Learner Language corpus.

If you have other data in mind that you think would be interesting to annotate in UD (not necessarily in English or Swedish), don't hesitate to bring it up during a lab session!

Step 3: annotate

For each sentence in the corpus, the annotation task consists of:

  1. analyzing the sentence in UD
  2. translating it to a language of your choice (as long as one of the two versions is in English or Swedish)
  3. analyzing your translation

The only required fields are ID, FORM, UPOS, HEAD and DEPREL.

In the end, you will submit two parallel CoNLL-U files, one containing the analyses of the source sentences and one for the analyses of the translations.

To produce the CoNLL-U files, you may work in your text editor (you can usually get syntax highlighting by changing the extension to .tsv), use a spreadsheet program and then export to TSV, or use a dedicated graphical annotation tool such as Arborator (helpful but unstable!).

If you work in your text editor, it might be easier to first write a simplified CoNLL-U, with just the fields ID, FORM, UPOS, HEAD and DEPREL, separated by tabs, and then expand it to full CoNLL-U with this script (or similar).

Example:

7 world NOUN 4 nmod

expands to

7 world _ NOUN _ _ 4 nmod _ _
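
If you prefer not to use the linked script, the expansion is a simple transformation you can sketch yourself. The function below (an illustration of the same idea, not the linked script; it assumes every input line has exactly the five simplified fields) fills the remaining columns with underscores:

```python
def expand(simplified_line: str) -> str:
    """Expand a tab-separated ID FORM UPOS HEAD DEPREL line
    to a full 10-column CoNLL-U line."""
    tid, form, upos, head, deprel = simplified_line.split("\t")
    # Column order per the CoNLL-U specification:
    # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    return "\t".join([tid, form, "_", upos, "_", "_", head, deprel, "_", "_"])

print(expand("7\tworld\tNOUN\t4\tnmod"))
# 7	world	_	NOUN	_	_	4	nmod	_	_
```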

We recommend that you annotate at least the first few sentences from scratch. When you start feeling confident, you may pre-parse the remaining ones with UDPipe and manually correct the automatic annotation. If you are unsure about an annotation choice you made, you can add a comment line (starting with #) right before the sentence in question. To fully comply with the CoNLL-U standard, comment lines should consist of key-value pairs, e.g.

# comment = your comment here

but for this assignment lines like

# your comment here

are perfectly acceptable too.

Step 4: make sure your files match the CoNLL-U specification

Check your treebank with the official UD validator. To do that, clone or download the UD tools repository, move inside it and run

python validate.py PATH-TO-YOUR-TREEBANK.conllu --lang=2-LETTER-LANGCODE-FOR-YOUR-LANGUAGE --level=2

Level 2 should be enough for part 2, but you can go up a few levels to check for more subtle errors.

Submit the two CoNLL-U files on Canvas.

Part 2: UD parsing

In this part of the lab, you will train and evaluate a UD parsing + POS tagging model. For better performance, you are strongly encouraged to use the MLTGPU server. If you want to install MaChAmp on your own computer, keep in mind that very old and very new Python versions are not supported. For more information, see here.

Step 1: setting up MaChAmp

  1. create a Python virtual environment with the command

    python -m venv ENVNAME
    

    and activate it with

    source ENVNAME/bin/activate (Linux/MacOS), or

    ENVNAME\Scripts\activate.bat (Windows)

  2. clone the MaChAmp repository, move inside it and run

    pip3 install -r requirements.txt
    

Step 2: preparing the training and development data

Choose a UD treebank for one of the two languages you annotated in part 1 and download it. If you translated the corpus to a language that does not have a UD treebank, download a treebank for a related language (e.g. Italian if you annotated sentences in Sardinian).

If you are working on MLTGPU, you may choose a large treebank such as Swedish-Talbanken, which is already divided into a training, development and test split.

If you are working on your laptop and/or if your language does not have a lot of data available, you may want to use a smaller treebank, such as Amharic-ATT, which only comes with a test set. In this case, split the test into a training and a development portion (e.g. 80% of the sentences for training and 20% for development). Make sure both files end with an empty line.

To ensure that MaChAmp works correctly, preprocess all of your data (including your own test data) by running

python scripts/misc/cleanconl.py PATH-TO-A-DATASET-SPLIT

This replaces the contents of your input file with a "cleaned up" version of the same treebank.

Step 3: training

Copy compsyn.json to machamp/configs and replace the training and development data paths with the paths to the files you selected/created in step 2.

You can now train your model by running

python3 train.py --dataset_configs configs/compsyn.json --device N

from the MaChAmp folder. If you are working on MLTGPU, replace N with 0 (GPU). If you are using your laptop or EDUSERV, replace it with -1, which instructs MaChAmp to train the model on the CPU.

Everything you see on screen at this stage will be saved in a training log file called logs/compsyn/DATE/log.txt.

Step 4: evaluation

Run your newly trained model with

python predict.py logs/compsyn/DATE/model.pt PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu --device N

This saves your model's predictions, i.e. the trees produced by your new parser, in predictions/OUTPUT-FILE-NAME.conllu. You can take a look at this file to get a first impression of how your model performs.

Then, use the machamp/scripts/misc/conll18_ud_eval.py script to evaluate the system output against your annotations. You can run it as

python scripts/misc/conll18_ud_eval.py PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu

On Canvas, submit the training logs, the predictions and the output of conll18_ud_eval.py, along with a short text summarizing your considerations on the performance of the parser, based on the predictions themselves and on the automatic evaluation.