# Lab 3: Universal Dependencies

This lab is divided into two parts.
In [part 1](#part-1-ud-annotation), you will create a small parallel UD treebank for English/Swedish and a language of your choice.
In [part 2](#part-2-ud-parsing), you will train a parsing model and evaluate it on your treebank.

## Part 1: UD annotation
The goal of this part of the lab is to enable you to contribute to a UD annotation project. You will familiarize yourself with the CoNLL-U format and annotate your own parallel UD treebank.

### Step 1: familiarize yourself with the CoNLL-U format
Go to [universaldependencies.org](https://universaldependencies.org/) and download a treebank for a language of your choice.
Choose a short (5-10 tokens) and a long (>25 words) sentence and convert each of them from CoNLL-U to a graphical tree by hand.
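
To check a hand-drawn tree, it can help to print the same structure as an indented outline. The following is a minimal sketch, not an official UD tool: the function names `parse_sentence` and `render_tree` and the three-token example sentence are made up for illustration, and the code assumes well-formed, tab-separated CoNLL-U for a single sentence.

```python
def parse_sentence(conllu: str):
    """Return (id, form, head, deprel) tuples for one CoNLL-U sentence."""
    tokens = []
    for line in conllu.strip().splitlines():
        if line.startswith("#"):
            continue  # comment line
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        tokens.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return tokens


def render_tree(tokens, head=0, depth=0):
    """Return indented lines, with each dependent listed under its head."""
    lines = []
    for tok_id, form, tok_head, deprel in tokens:
        if tok_head == head:
            lines.append("  " * depth + f"{form} ({deprel})")
            lines.extend(render_tree(tokens, tok_id, depth + 1))
    return lines


# Made-up example sentence (not taken from any treebank).
example = "\n".join([
    "# text = She reads books",
    "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\treads\tread\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tbooks\tbook\tNOUN\t_\t_\t2\tobj\t_\t_",
])
print("\n".join(render_tree(parse_sentence(example))))
```

The root (the token whose `HEAD` is `0`) appears at the left margin, and every other token is indented one level below its head.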

### Step 2: choose a corpus
Choose a corpus of 25+ sentences.

If you want to start with __English__, you can use [`comp-syntax-corpus-english.txt`](comp-syntax-corpus-english.txt), a combination of sentences from different sources, including [the Parallel UD treebank (PUD)](https://github.com/UniversalDependencies/UD_English-PUD/tree/master). If you want to cheat - or just check your answers - you can look for them in the official treebank. You can also compare your analyses with those of an automatic parser, such as [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/), which you can try directly in your browser. These automatic analyses must of course be taken with a grain of salt. Note that the first few sentences of this corpus are pre-tokenized and POS-tagged. Each token has the form `word:<UPOS>`.
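
If you annotate in a text editor, the pre-tagged sentences can be turned into rows with `ID`, `FORM` and `UPOS` already filled in. The sketch below is only an illustration: the function name `to_rows` is hypothetical, the example sentence is made up, and the code assumes that tokens are separated by spaces and that the tag may or may not be wrapped in angle brackets. `HEAD` and `DEPREL` are left as `_` placeholders for you to fill in by hand.

```python
import re

# "word:<UPOS>" or "word:UPOS"; the greedy first group keeps colons inside the word.
TOKEN_RE = re.compile(r"^(.+):<?([A-Z]+)>?$")


def to_rows(tagged_sentence: str):
    """Turn one pre-tagged sentence into tab-separated ID/FORM/UPOS/HEAD/DEPREL rows."""
    rows = []
    for i, token in enumerate(tagged_sentence.split(), start=1):
        match = TOKEN_RE.match(token)
        if match is None:
            raise ValueError(f"token without a UPOS tag: {token!r}")
        form, upos = match.groups()
        # HEAD and DEPREL are placeholders to be replaced during annotation.
        rows.append("\t".join([str(i), form, upos, "_", "_"]))
    return rows


# Made-up example sentence.
for row in to_rows("The:<DET> cat:<NOUN> sleeps:<VERB> .:<PUNCT>"):
    print(row)
```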


If you want to work with __Swedish__ and might be interested in contributing to an [official UD treebank](https://github.com/universaldependencies/UD_Swedish-SweLL), ask Arianna for [a sample of the Swedish Learner Language corpus](https://spraakbanken.gu.se/en/resources/swell).

If you have other data in mind that you think would be interesting to annotate in UD (not necessarily in English or Swedish), don't hesitate to bring it up during a lab session!

### Step 3: annotate
For each sentence in the corpus, the annotation task consists of:

1. analyzing the sentence in UD
2. translating it to a language of your choice (as long as one of the two versions is in English or Swedish)
3. analyzing your translation

The only required fields are `ID`, `FORM`, `UPOS`, `HEAD` and `DEPREL`.

In the end, you will submit two parallel CoNLL-U files, one containing the analyses of the source sentences and one for the analyses of the translations.

To produce the CoNLL-U files, you may work in your text editor (you can usually get syntax highlighting by changing the extension to `.tsv`), use a spreadsheet program and then export to TSV, or use a dedicated graphical annotation tool such as [Arborator](https://arborator.grew.fr/#/) (helpful but unstable!).

If you work in your text editor, it might be easier to first write a simplified CoNLL-U, with just the fields `ID`, `FORM`, `UPOS`, `HEAD` and `DEPREL`, separated by tabs, and then expand it to full CoNLL-U with [this script](https://gist.github.com/harisont/612a87d20f729aa3411041f873367fa2) (or similar).

Example:

`7  world NOUN  4 nmod`

expands to

`7  world _ NOUN  _ _ 4 nmod  _ _`
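
The expansion itself is simple enough to sketch in a few lines. The code below is not the linked script, just a minimal illustration of the same idea: the function name `expand_line` is hypothetical, and the code assumes exactly five tab-separated fields per token line.

```python
def expand_line(line: str) -> str:
    """Expand one simplified line (ID, FORM, UPOS, HEAD, DEPREL) to 10 columns."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return line  # keep blank lines and comment lines unchanged
    tok_id, form, upos, head, deprel = line.split("\t")
    # Column order: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    return "\t".join([tok_id, form, "_", upos, "_", "_", head, deprel, "_", "_"])


print(expand_line("7\tworld\tNOUN\t4\tnmod"))
```

A line with a different number of fields raises a `ValueError`, which is a quick way to spot rows where a tab was typed as spaces.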

We recommend that you annotate at least the first few sentences from scratch.
When you start feeling confident, you may pre-parse the remaining ones with UDPipe and manually correct the automatic annotation.
If you are unsure about an annotation choice you made, you can add a comment line (starting with `#`) right before the sentence in question.
To fully comply with the CoNLL-U standard, comment lines should consist of key-value pairs, e.g.

```conllu
# comment = your comment here
```

but for this assignment lines like

```
# your comment here
```

are perfectly acceptable too.

### Step 4: make sure your files match the CoNLL-U specification
Check your treebank with the official UD validator.
To do that, clone or download the [UD tools repository](https://github.com/UniversalDependencies/tools), move inside the corresponding folder and run

```
python validate.py PATH-TO-YOUR-TREEBANK.conllu --lang=2-LETTER-LANGCODE-FOR-YOUR-LANGUAGE --level=2
```

Level 2 should be enough for part 2, but you can [go up a few levels](https://harisont.github.io/gfaqs.html#ud-validator) to check for more subtle errors.
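
Before running the validator, you can catch the most common structural slips with a quick check of your own. The following sketch is not a substitute for `validate.py` and covers only a small part of what it tests: the function name `quick_check` is hypothetical, and the code assumes that sentences are separated by exactly one empty line.

```python
def quick_check(conllu: str):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    blocks = [b for b in conllu.strip().split("\n\n") if b.strip()]
    for n, block in enumerate(blocks, start=1):
        rows = [line.split("\t") for line in block.splitlines()
                if not line.startswith("#")]
        bad = [r for r in rows if len(r) != 10]
        if bad:
            problems.append(
                f"sentence {n}: {len(bad)} line(s) without 10 tab-separated columns")
            continue
        words = [r for r in rows if r[0].isdigit()]  # skip ranges such as 1-2
        ids = [int(r[0]) for r in words]
        if ids != list(range(1, len(words) + 1)):
            problems.append(f"sentence {n}: IDs are not 1..{len(words)} in order")
        heads = [r[6] for r in words]
        for tok_id, head in zip(ids, heads):
            # HEAD must be 0 (root) or the ID of another token in the sentence.
            if not head.isdigit() or int(head) > len(words) or int(head) == tok_id:
                problems.append(
                    f"sentence {n}: token {tok_id} has invalid HEAD {head!r}")
        if heads.count("0") != 1:
            problems.append(
                f"sentence {n}: expected exactly one root, found {heads.count('0')}")
    return problems
```

To use it, read your treebank into a string (for example with `open(PATH, encoding="utf-8").read()`) and print the returned list.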

Submit the two CoNLL-U files on Canvas.

## Part 2: UD parsing
In this part of the lab, you will train and evaluate a UD parsing + POS tagging model.
For better performance, you are strongly encouraged to use the MLTGPU server.
If you want to install MaChAmp on your own computer, keep in mind that very old and very new Python versions are not supported.
For more information, see [here](https://github.com/machamp-nlp/machamp/issues/42).

### Step 1: setting up MaChAmp
1. create a Python virtual environment with the command
   ```
   python -m venv ENVNAME
   ```
   and activate it with

   `source ENVNAME/bin/activate` (Linux/macOS), or

   `ENVNAME\Scripts\activate.bat` (Windows)
2. clone [the MaChAmp repository](https://github.com/machamp-nlp/machamp), move inside it and run
   ```
   pip3 install -r requirements.txt
   ```

### Step 2: preparing the training and development data
Choose a UD treebank for one of the two languages you annotated in [part 1](#part-1-ud-annotation) and download it.
If you translated the corpus to a language that does not have a UD treebank, download a treebank for a related language (e.g. Italian if you annotated sentences in Sardinian).

If you are working on MLTGPU, you may choose a large treebank such as [Swedish-Talbanken](https://github.com/UniversalDependencies/UD_Swedish-Talbanken), which is already divided into a training, development and test split.

If you are working on your laptop and/or if your language does not have a lot of data available, you may want to use a smaller treebank, such as [Amharic-ATT](https://github.com/UniversalDependencies/UD_Amharic-ATT), which only comes with a test set.
In this case, split the test set into a training and a development portion (e.g. 80% of the sentences for training and 20% for development).
Make sure both files end with an empty line.
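
One possible way to do the split is sketched below. This is only an illustration: the function name `split_treebank` is hypothetical, shuffling the sentences with a fixed seed is a design choice rather than a requirement of the lab, and the code assumes that sentences are separated by exactly one empty line.

```python
import random


def split_treebank(conllu: str, train_ratio: float = 0.8, seed: int = 0):
    """Split a CoNLL-U string into (train, dev) strings at sentence boundaries."""
    sentences = [s.strip("\n") for s in conllu.strip().split("\n\n") if s.strip()]
    # Shuffle reproducibly so that both portions sample the whole treebank.
    random.Random(seed).shuffle(sentences)
    cut = int(len(sentences) * train_ratio)

    def join(sents):
        # Every sentence, including the last one, is followed by an empty line.
        return "".join(s + "\n\n" for s in sents)

    return join(sentences[:cut]), join(sentences[cut:])


# Usage with placeholder paths:
# with open("PATH-TO-TEST-FILE", encoding="utf-8") as f:
#     train, dev = split_treebank(f.read())
# with open("PATH-TO-TRAIN-FILE", "w", encoding="utf-8") as f:
#     f.write(train)
# with open("PATH-TO-DEV-FILE", "w", encoding="utf-8") as f:
#     f.write(dev)
```

Because each sentence is written out followed by an empty line, both output files end with one.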

To ensure that MaChAmp works correctly, preprocess __all__ of your data (including your own test data) by running

```
python scripts/misc/cleanconl.py PATH-TO-A-DATASET-SPLIT
```

This replaces the contents of your input file with a "cleaned up" version of the same treebank.

### Step 3: training
Copy `compsyn.json` to `machamp/configs` and replace the training and development data paths with the paths to the files you selected/created in step 2.

You can now train your model by running

```
python3 train.py --dataset_configs configs/compsyn.json --device N
```
from the MaChAmp folder.
If you are working on MLTGPU, replace `N` with `0` (GPU). If you are using your laptop or EDUSERV, replace it with `-1`, which instructs MaChAmp to train the model on the CPU.

Everything you see on screen at this stage will be saved in a training log file called `logs/compsyn/DATE/log.txt`.

### Step 4: evaluation
Run your newly trained model with

```
python predict.py logs/compsyn/DATE/model.pt PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu --device N
```

This saves your model's predictions, i.e. the trees produced by your new parser, in `predictions/OUTPUT-FILE-NAME.conllu`.
You can take a look at this file to get a first impression of how your model performs.

Then, use the `machamp/scripts/misc/conll18_ud_eval.py` script to evaluate the system output against your annotations. You can run it as

```
python scripts/misc/conll18_ud_eval.py PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu
```
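
To interpret the two main attachment metrics in the script's output, it helps to see how they are defined. UAS is the share of tokens that receive the correct `HEAD`, and LAS is the share that receive both the correct `HEAD` and the correct `DEPREL`. The sketch below is a simplified illustration and not a replacement for `conll18_ud_eval.py`: the function name `attachment_scores` is hypothetical, and the code assumes that the gold and predicted files have identical tokenization, whereas the official script aligns tokens and reports further metrics. Relation subtypes such as `nmod:poss` are reduced to `nmod`, following the CoNLL 2018 shared task definition of LAS.

```python
def attachment_scores(gold: str, system: str):
    """Return (UAS, LAS) for two CoNLL-U strings with identical tokenization."""

    def arcs(conllu):
        result = []
        for line in conllu.splitlines():
            cols = line.split("\t")
            if len(cols) == 10 and cols[0].isdigit():
                # Keep HEAD and the universal part of DEPREL (drop subtypes).
                result.append((cols[6], cols[7].split(":")[0]))
        return result

    gold_arcs, sys_arcs = arcs(gold), arcs(system)
    if len(gold_arcs) != len(sys_arcs):
        raise ValueError("tokenization differs; use conll18_ud_eval.py instead")
    if not gold_arcs:
        raise ValueError("no tokens found")
    total = len(gold_arcs)
    uas = sum(g[0] == s[0] for g, s in zip(gold_arcs, sys_arcs)) / total
    las = sum(g == s for g, s in zip(gold_arcs, sys_arcs)) / total
    return uas, las
```

By construction LAS can never exceed UAS, so a large gap between the two suggests that your parser finds the right heads but often picks the wrong relation labels.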

On Canvas, submit the training logs, the predictions and the output of `conll18_ud_eval.py`, along with a short text summarizing your observations about the performance of the parser, based both on the predictions themselves and on the automatic evaluation.