minor updates for smoother lab 2 part 2

2026-02-08 22:41:05 -07:00 · 2025-05-18 17:27:22 +02:00
parent a2a29f4b35
commit 003f0edbc4
2 changed files with 13 additions and 3 deletions
--- a/lab3/README.md
+++ b/lab3/README.md
@@ -65,6 +65,8 @@ Submit the two CoNLL-U files on Canvas.
 ## Part 2: UD parsing
 In this part of the lab, you will train and evaluate a UD parsing + POS tagging model.
 For better performance, you are strongly encouraged to use the MLTGPU server. 
+If you want to install MaChAmp on your own computer, keep in mind that very old and very new Python versions are not supported. 
+For more information, see [here](https://github.com/machamp-nlp/machamp/issues/42). 

 ### Step 1: setting up MaChAmp
 1. optional, but recommended: create a Python virtual environment with the command
@@ -81,7 +83,7 @@ For better performance, you are strongly encouraged to use the MLTGPU server.
  pip3 install -r requirements.txt
  ```

-### Step 2: selecting the training and development data
+### Step 2: preparing the training and development data
 Choose a UD treebank for one of the two languages you annotated in [part 1](#part-1-ud-annotation) and download it. 
 If you translated the corpus to a language that does not have a UD treebank, download a treebank for a related language (e.g. Italian if you annotated sentences in Sardinian). 

@@ -90,6 +92,14 @@ If you are working on MLTGPU, you may choose a large treebank such as [Swedish-T
 If you are working on your laptop and/or if your language does not have a lot of data available, you may want to use a smaller treebank, such as [Amharic-ATT](https://github.com/UniversalDependencies/UD_Amharic-ATT), which only comes with a test set. 
 In this case, split the test set into a training and a development portion (e.g. 80% of the sentences for training and 20% for development).
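
A minimal sketch of one way to do such a split, assuming the usual CoNLL-U layout in which sentences are separated by blank lines. This is not part of MaChAmp or of the lab materials: the function names, the shuffling step and the fixed seed are illustrative choices.

```python
import random


def read_sentences(path):
    """Return the sentence blocks of a CoNLL-U file, one string per sentence."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.strip():
                current.append(line)
            elif current:
                # A blank line closes the current sentence.
                sentences.append("\n".join(current))
                current = []
    if current:
        # The file did not end with a blank line.
        sentences.append("\n".join(current))
    return sentences


def write_sentences(sentences, path):
    """Write sentence blocks back out, each followed by a blank line."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            f.write(sentence + "\n\n")


def split_treebank(source_path, train_path, dev_path, train_ratio=0.8, seed=42):
    """Shuffle the sentences and write a train/dev split (hypothetical helper)."""
    sentences = read_sentences(source_path)
    # Shuffling is an assumption, not a lab requirement; the seed keeps it reproducible.
    random.Random(seed).shuffle(sentences)
    cut = int(len(sentences) * train_ratio)
    write_sentences(sentences[:cut], train_path)
    write_sentences(sentences[cut:], dev_path)
    return cut, len(sentences) - cut
```

For example, `split_treebank("test.conllu", "train.conllu", "dev.conllu")` (placeholder file names) writes roughly 80% of the sentences to `train.conllu` and the rest to `dev.conllu`, and returns the two sentence counts.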

+To ensure that MaChAmp works correctly, preprocess __all__ of your data (including your own test data) by running 
+
+```
+python scripts/misc/cleanconl.py PATH-TO-A-DATASET-SPLIT
+```
+
+This replaces the contents of your input file with a "cleaned up" version of the same treebank.
+
 ### Step 3: training
 Copy `compsyn.json` to `machamp/configs` and replace the training and development data paths with the paths to the files you selected/created in step 2.

@@ -110,7 +120,7 @@ python predict.py logs/compsyn/DATE/model.pt PATH-TO-YOUR-PART1-TREEBANK predict
 and use the `machamp/scripts/misc/conll18_ud_eval.py` script to evaluate the system output against your annotations. You can run it as

 ```
-python conll18_ud_eval.py PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu
+python scripts/misc/conll18_ud_eval.py PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu
 ```

 On Canvas, submit the training logs, the predictions and the output of `conll18_ud_eval.py`, along with a short text summarizing your considerations on the performance of the parser.
--- a/lab3/machamp_config.json
+++ b/lab3/machamp_config.json