minor updates for smoother lab 2 part 2

This commit is contained in:
Arianna Masciolini
2025-05-18 17:27:22 +02:00
parent a2a29f4b35
commit 003f0edbc4
2 changed files with 13 additions and 3 deletions

@@ -64,7 +64,9 @@ Submit the two CoNLL-U files on Canvas.
## Part 2: UD parsing
In this part of the lab, you will train and evaluate a UD parsing + POS tagging model.
For better performance, you are strongly encouraged to use the MLTGPU server.
If you want to install MaChAmp on your own computer, keep in mind that very old and very new Python versions are not supported.
For more information, see [here](https://github.com/machamp-nlp/machamp/issues/42).
### Step 1: setting up MaChAmp
1. optional, but recommended: create a Python virtual environment with the command
@@ -81,7 +83,7 @@ For better performance, you are strongly encouraged to use the MLTGPU server.
pip3 install -r requirements.txt
```
### Step 2: preparing the training and development data
Choose a UD treebank for one of the two languages you annotated in [part 1](#part-1-ud-annotation) and download it.
If you translated the corpus to a language that does not have a UD treebank, download a treebank for a related language (e.g. Italian if you annotated sentences in Sardinian).
@@ -90,6 +92,14 @@ If you are working on MLTGPU, you may choose a large treebank such as [Swedish-T
If you are working on your laptop and/or if your language does not have a lot of data available, you may want to use a smaller treebank, such as [Amharic-ATT](https://github.com/UniversalDependencies/UD_Amharic-ATT), which only comes with a test set.
In this case, split the test set into a training and a development portion (e.g. 80% of the sentences for training and 20% for development), as in the sketch below.
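If you take this route, a few lines of Python can do the split. The following is a minimal sketch, not part of the lab materials; the file names are placeholders for your own treebank's files. It relies on the fact that CoNLL-U sentences are separated by blank lines.
```
# Hypothetical helper, not part of the lab materials: split a CoNLL-U
# file into a training and a development portion on sentence boundaries.
def split_conllu(path, train_path, dev_path, train_frac=0.8):
    with open(path, encoding="utf-8") as f:
        # In CoNLL-U, sentences are separated by blank lines.
        sentences = f.read().strip().split("\n\n")
    cut = int(len(sentences) * train_frac)
    for out_path, chunk in ((train_path, sentences[:cut]),
                            (dev_path, sentences[cut:])):
        with open(out_path, "w", encoding="utf-8") as out:
            out.write("\n\n".join(chunk) + "\n\n")

# Placeholder file names; substitute your own treebank's splits.
split_conllu("am_att-ud-test.conllu", "train.conllu", "dev.conllu")
```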
To ensure that MaChAmp works correctly, preprocess __all__ of your data (including your own test data) by running
```
python scripts/misc/cleanconl.py PATH-TO-A-DATASET-SPLIT
```
This replaces the contents of your input file with a "cleaned up" version of the same treebank.
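Since the script processes one file at a time, you can loop over all of your splits. A hedged convenience sketch, assuming your files live in a `data/` directory (a placeholder path) and that you run it from the machamp repository root:
```
# Hypothetical convenience loop, not part of the lab materials: run
# MaChAmp's cleanconl.py over every CoNLL-U split in a data directory.
# cleanconl.py rewrites each file in place, so keep backups if you
# need the originals.
import glob
import subprocess
import sys

for split in glob.glob("data/*.conllu"):  # "data/" is a placeholder path
    subprocess.run([sys.executable, "scripts/misc/cleanconl.py", split],
                   check=True)
```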
### Step 3: training
Copy `compsyn.json` to `machamp/configs` and replace the training and development data paths with the paths to the files you selected/created in step 2.
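You can also script this edit. The sketch below is an assumption-laden convenience, not part of the lab materials: it presumes the config stores the paths under keys named `train_data_path` and `dev_data_path` somewhere in the (possibly nested) JSON, which is the usual MaChAmp convention; inspect `compsyn.json` to confirm, and substitute your own paths for the placeholders.
```
# Hypothetical sketch, not part of the lab materials: point the MaChAmp
# config at your own data. Assumes keys named "train_data_path" and
# "dev_data_path"; check compsyn.json before relying on this.
import json

NEW_PATHS = {  # placeholder paths; substitute your step 2 files
    "train_data_path": "data/train.conllu",
    "dev_data_path": "data/dev.conllu",
}

def set_paths(node):
    # Walk the JSON tree and overwrite any matching path keys.
    if isinstance(node, dict):
        for key, value in node.items():
            if key in NEW_PATHS:
                node[key] = NEW_PATHS[key]
            else:
                set_paths(value)
    elif isinstance(node, list):
        for item in node:
            set_paths(item)

with open("machamp/configs/compsyn.json", encoding="utf-8") as f:
    config = json.load(f)
set_paths(config)
with open("machamp/configs/compsyn.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=4)
```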
@@ -110,7 +120,7 @@ python predict.py logs/compsyn/DATE/model.pt PATH-TO-YOUR-PART1-TREEBANK predict
and use the `machamp/scripts/misc/conll18_ud_eval.py` script to evaluate the system output against your annotations. You can run it as
```
python scripts/misc/conll18_ud_eval.py PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu
```
On Canvas, submit the training logs, the predictions and the output of `conll18_ud_eval.py`, along with a short text summarizing your observations on the parser's performance.