mirror of
https://github.com/GrammaticalFramework/comp-syntax-gu-mlt.git
synced 2026-02-08 22:41:05 -07:00
minor updates for smoother lab 2 part 2
@@ -65,6 +65,8 @@ Submit the two CoNLL-U files on Canvas.
## Part 2: UD parsing
In this part of the lab, you will train and evaluate a UD parsing + POS tagging model.
For better performance, you are strongly encouraged to use the MLTGPU server.
If you want to install MaChAmp on your own computer, keep in mind that very old and very new Python versions are not supported.
For more information, see [here](https://github.com/machamp-nlp/machamp/issues/42).

### Step 1: setting up MaChAmp
1. optional, but recommended: create a Python virtual environment with the command
@@ -81,7 +83,7 @@ For better performance, you are strongly encouraged to use the MLTGPU server.
```
pip3 install -r requirements.txt
```

-### Step 2: selecting the training and development data
+### Step 2: preparing the training and development data
Choose a UD treebank for one of the two languages you annotated in [part 1](#part-1-ud-annotation) and download it.
If you translated the corpus to a language that does not have a UD treebank, download a treebank for a related language (e.g. Italian if you annotated sentences in Sardinian).

@@ -90,6 +92,14 @@ If you are working on MLTGPU, you may choose a large treebank such as [Swedish-T
If you are working on your laptop and/or if your language does not have a lot of data available, you may want to use a smaller treebank, such as [Amharic-ATT](https://github.com/UniversalDependencies/UD_Amharic-ATT), which only comes with a test set.
In this case, split the test set into a training and a development portion (e.g. 80% of the sentences for training and 20% for development).

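One possible way to do the 80/20 split is sketched below. It is only an illustration: the function names, the fixed random seed and the shuffling step are assumptions, not part of the lab materials. It relies on the fact that sentences in a CoNLL-U file are separated by blank lines.

```python
import random
from pathlib import Path


def read_sentences(path):
    """Return the sentence blocks of a CoNLL-U file (blocks are separated by blank lines)."""
    text = Path(path).read_text(encoding="utf-8")
    return [block.strip("\n") for block in text.split("\n\n") if block.strip()]


def split_treebank(sentences, train_ratio=0.8, seed=42):
    """Shuffle the sentences reproducibly and split them into (train, dev)."""
    shuffled = list(sentences)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def write_sentences(sentences, path):
    """Write sentence blocks to a file, each followed by a blank line."""
    Path(path).write_text("".join(s + "\n\n" for s in sentences), encoding="utf-8")
```

To use it, call `read_sentences` on the downloaded test file, pass the result to `split_treebank`, and write the two returned lists to separate files with `write_sentences`. Shuffling before the cut avoids a development set made only of the last documents in the treebank; drop it if you prefer to keep the original order.
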
To ensure that MaChAmp works correctly, preprocess __all__ of your data (including your own test data) by running

```
python scripts/misc/cleanconl.py PATH-TO-A-DATASET-SPLIT
```

This replaces the contents of your input file with a "cleaned up" version of the same treebank.

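Because the script overwrites its input, you may want to keep a copy of each file before cleaning it. The sketch below is one way to do that for several splits at once. The `.orig` backup suffix and the `machamp_dir` and `run` parameters are assumptions made for this example; the only thing taken from the lab instructions is the `scripts/misc/cleanconl.py` command itself.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def clean_with_backup(paths, machamp_dir=".", run=subprocess.run):
    """Back up each file as NAME.orig, then run cleanconl.py on it in place."""
    for path in map(Path, paths):
        backup = path.with_name(path.name + ".orig")
        if not backup.exists():  # never overwrite an earlier backup
            shutil.copy2(path, backup)
        # Same command as in the instructions, run from the MaChAmp directory.
        # The path is made absolute so that changing directory does not break it.
        run(
            [sys.executable, "scripts/misc/cleanconl.py", str(path.resolve())],
            cwd=machamp_dir,
            check=True,
        )
```

Call it with the list of your training, development and test files, and with `machamp_dir` pointing at your MaChAmp checkout.
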
### Step 3: training
Copy `compsyn.json` to `machamp/configs` and replace the training and development data paths with the paths to the files you selected/created in step 2.

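Editing the file by hand is enough, but the same change can be scripted. The sketch below assumes that `compsyn.json` follows the usual layout of MaChAmp dataset configurations, with one top-level entry per dataset and the keys `train_data_path` and `dev_data_path`. Check these key names against your copy of the file before relying on it; the function names are made up for this example.

```python
import json
from pathlib import Path


def set_data_paths(config, train_path, dev_path):
    """Return a copy of the config with new train/dev paths in every dataset entry."""
    updated = {}
    for name, settings in config.items():
        settings = dict(settings)  # shallow copy: leave the input unchanged
        settings["train_data_path"] = train_path  # assumed key name
        settings["dev_data_path"] = dev_path      # assumed key name
        updated[name] = settings
    return updated


def rewrite_config(config_file, train_path, dev_path):
    """Load a JSON config, update its data paths and write it back in place."""
    config_file = Path(config_file)
    config = json.loads(config_file.read_text(encoding="utf-8"))
    new_config = set_data_paths(config, train_path, dev_path)
    config_file.write_text(json.dumps(new_config, indent=4) + "\n", encoding="utf-8")
```
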
@@ -110,7 +120,7 @@ python predict.py logs/compsyn/DATE/model.pt PATH-TO-YOUR-PART1-TREEBANK predict
and use the `machamp/scripts/misc/conll18_ud_eval.py` script to evaluate the system output against your annotations. You can run it as

```
-python conll18_ud_eval.py PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu
+python scripts/misc/conll18_ud_eval.py PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu
```

On Canvas, submit the training logs, the predictions and the output of `conll18_ud_eval.py`, along with a short text summarizing your considerations on the performance of the parser.
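When writing the summary, it can help to know what the two attachment scores reported by the evaluation mean. The sketch below is a simplified illustration, not a replacement for `conll18_ud_eval.py`. It assumes that the gold and predicted files contain exactly the same tokens in the same order, and it ignores relation subtypes (e.g. `nmod:poss` counts as `nmod`). UAS is the share of words with the correct head; LAS additionally requires the correct relation label.

```python
def read_arcs(lines):
    """Collect (HEAD, DEPREL) for every syntactic word in an iterable of CoNLL-U lines."""
    arcs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip sentence boundaries and comment lines
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword token ranges and empty nodes
        arcs.append((cols[6], cols[7].split(":")[0]))  # drop relation subtypes
    return arcs


def attachment_scores(gold_lines, system_lines):
    """Return (UAS, LAS) as fractions between 0 and 1."""
    gold, system = read_arcs(gold_lines), read_arcs(system_lines)
    if not gold or len(gold) != len(system):
        raise ValueError("files are empty or tokenized differently")
    uas = sum(g[0] == s[0] for g, s in zip(gold, system)) / len(gold)
    las = sum(g == s for g, s in zip(gold, system)) / len(gold)
    return uas, las
```

Both functions accept any iterable of lines, so an open file object can be passed directly.
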