diff --git a/lab3/README.md b/lab3/README.md
index f22467a..f82176b 100644
--- a/lab3/README.md
+++ b/lab3/README.md
@@ -64,7 +64,9 @@ Submit the two CoNLL-U files on Canvas.
 ## Part 2: UD parsing
 
 In this part of the lab, you will train and evaluate a UD parsing + POS tagging model.
-For better performance, you are strongly encouraged to use the MLTGPU server.
+For better performance, you are strongly encouraged to use the MLTGPU server.
+If you want to install MaChAmp on your own computer, keep in mind that very old and very new Python versions are not supported.
+For more information, see [here](https://github.com/machamp-nlp/machamp/issues/42).
 
 ### Step 1: setting up MaChAmp
 1. optional, but recommended: create a Python virtual environment with the command
@@ -81,7 +83,7 @@ For better performance, you are strongly encouraged to use the MLTGPU server.
    pip3 install -r requirements.txt
    ```
 
-### Step 2: selecting the training and development data
+### Step 2: preparing the training and development data
 
 Choose a UD treebank for one of the two languages you annotated in [part 1](#part-1-ud-annotation) and download it.
 If you translated the corpus to a language that does not have a UD treebank, download a treebank for a related language (e.g. Italian if you annotated sentences in Sardinian).
@@ -90,6 +92,14 @@ If you are working on MLTGPU, you may choose a large treebank such as [Swedish-T
 If you are working on your laptop and/or if your language does not have a lot of data available, you may want to use a smaller treebank, such as [Amharic-ATT](https://github.com/UniversalDependencies/UD_Amharic-ATT), which only comes with a test set.
 In this case, split the test into a training and a development portion (e.g. 80% of the sentences for training and 20% for development).
 
+To ensure that MaChAmp works correctly, preprocess __all__ of your data (including your own test data) by running
+
+```
+python scripts/misc/cleanconl.py PATH-TO-A-DATASET-SPLIT
+```
+
+This replaces the contents of your input file with a "cleaned up" version of the same treebank.
+
 ### Step 3: training
 
 Copy `compsyn.json` to `machamp/configs` and replace the traning and development data paths with the paths to the files you selected/created in step 2.
@@ -110,7 +120,7 @@ python predict.py logs/compsyn/DATE/model.pt PATH-TO-YOUR-PART1-TREEBANK predict
 and use the `machamp/scripts/misc/conll18_ud_eval.py` script to evaluate the system output against your annotations. You can run it as
 
 ```
-python conll18_ud_eval.py PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu
+python scripts/misc/conll18_ud_eval.py PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu
 ```
 
 On Canvas, submit the training logs, the predictions and the output of `conll18_ud_eval.py`, along with a short text summarizing your considerations on the performance of the parser.
\ No newline at end of file
diff --git a/lab3/machamp_config.json b/lab3/compsyn.json
similarity index 100%
rename from lab3/machamp_config.json
rename to lab3/compsyn.json
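Step 2 of the updated README tells students with a test-only treebank (such as Amharic-ATT) to split the test file into an 80/20 train/dev portion, but does not show how. A minimal sketch of that split is below; it only assumes the input is a standard CoNLL-U file, where sentences are separated by blank lines. The function name `split_conllu` and the output file names in the usage comment are illustrative, not part of the lab handout.

```python
def split_conllu(text, train_ratio=0.8):
    """Split CoNLL-U text into (train, dev) strings by whole sentences."""
    # CoNLL-U separates sentences with blank lines, so split on those
    # rather than on individual token lines, keeping sentences intact.
    sentences = [s for s in text.strip().split("\n\n") if s.strip()]
    cut = int(len(sentences) * train_ratio)
    # Re-join with blank lines and keep a trailing blank line, as
    # expected by CoNLL-U readers.
    train = "\n\n".join(sentences[:cut]) + "\n\n"
    dev = "\n\n".join(sentences[cut:]) + "\n\n"
    return train, dev

# Example usage (file names are placeholders):
# text = open("test.conllu", encoding="utf-8").read()
# train, dev = split_conllu(text)
# open("train.conllu", "w", encoding="utf-8").write(train)
# open("dev.conllu", "w", encoding="utf-8").write(dev)
```

Splitting on sentence boundaries (rather than on a fixed number of lines) matters because cutting a sentence in half would produce a file that `cleanconl.py` and MaChAmp cannot parse.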