mirror of
https://github.com/GrammaticalFramework/comp-syntax-gu-mlt.git
synced 2026-04-28 04:42:50 -06:00
2026 modifications to lab 3
@@ -12,28 +12,27 @@ Go to [universaldependencies.org](https://universaldependencies.org/) and downlo
Choose a short (5-10 tokens) and a long (>25 words) sentence and convert them from CoNLL-U to graphical trees by hand.
### Step 2: choose a corpus
Choose one of the two corpora provided in this folder:
Choose a corpus of 25+ sentences.
- [`comp-syntax-corpus-english.txt`](comp-syntax-corpus-english.txt) is a combination of __English__ sentences from different sources, including [the Parallel UD treebank (PUD)](https://github.com/UniversalDependencies/UD_English-PUD/tree/master). If you want to cheat - or just check your answers - you can look for them in the official treebank. You can also compare your analyses with those of an automatic parser, such as [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/), which you can try directly in your browser. These automatic analyses must of course be taken with a grain of salt
- [`comp-syntax-corpus-swedish.txt`](comp-syntax-corpus-swedish.txt) consists of teacher-corrected sentences from the [__Swedish__ Learner Language (SweLL) corpus](https://spraakbanken.gu.se/en/resources/swell-gold), which is currently being annotated in UD for the first time.
In this case, there is no "gold standard" to check your answers against, but you can still compare your solutions with [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/)'s automatic analyses.
If you want to start with __English__, you can use [`comp-syntax-corpus-english.txt`](comp-syntax-corpus-english.txt), a combination of sentences from different sources, including [the Parallel UD treebank (PUD)](https://github.com/UniversalDependencies/UD_English-PUD/tree/master). If you want to cheat - or just check your answers - you can look for them in the official treebank. You can also compare your analyses with those of an automatic parser, such as [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/), which you can try directly in your browser. These automatic analyses must of course be taken with a grain of salt. Note that the first few sentences of this corpus are pre-tokenized and POS-tagged. Each token is in the form `word:<UPOS>`.
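As an illustration of the `word:<UPOS>` format, here is a minimal Python sketch that splits a pre-tagged sentence into form/tag pairs. It assumes tokens are separated by whitespace and that the tag is literally wrapped in angle brackets after the last `:<`; the helper name `parse_tagged_sentence` and the sample sentence are hypothetical and not part of the course materials.

```python
def parse_tagged_sentence(line):
    """Split a pre-tagged sentence into (form, upos) pairs.

    Assumes whitespace-separated tokens of the form word:<UPOS>.
    Tokens without a tag are returned with None as their UPOS.
    """
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST ":<", so forms containing ":" survive
        form, sep, tag = token.rpartition(":<")
        if sep and tag.endswith(">"):
            pairs.append((form, tag[:-1]))
        else:
            pairs.append((token, None))
    return pairs


if __name__ == "__main__":
    for form, upos in parse_tagged_sentence("The:<DET> cat:<NOUN> sleeps:<VERB> .:<PUNCT>"):
        print(form, upos, sep="\t")
```

The resulting pairs can be used to pre-fill the `FORM` and `UPOS` columns before adding `HEAD` and `DEPREL` by hand.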
In both corpora, the first few sentences are pre-tokenized and POS-tagged. Each token is in the form
`word:<UPOS>`.
If you want to work with __Swedish__ and might be interested in contributing to an [official UD treebank](https://github.com/universaldependencies/UD_Swedish-SweLL), ask Arianna for [a sample of the Swedish Learner Language corpus](https://spraakbanken.gu.se/en/resources/swell).
If you have other data in mind that you think would be interesting to annotate in UD (not necessarily in English or Swedish), don't hesitate to bring it up during a lab session!
### Step 3: annotate
For each sentence in the corpus, the annotation task consists of:
1. analyzing the sentence in UD
2. translating it to a language of your choice
2. translating it to a language of your choice (as long as one of the two versions is in English or Swedish)
3. analyzing your translation
The only required fields are `ID`, `FORM`, `UPOS`, `HEAD` and `DEPREL`.
In the end, you will submit two parallel CoNLL-U files, one containing the analyses of the source sentences and one for the analyses of the translations.
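Since the two files are meant to be parallel, a quick sanity check is that they contain the same number of sentences. The following is a minimal sketch, assuming sentences are separated by blank lines and comment lines start with `#`, as in CoNLL-U; the function name `count_sentences` and the file names in the usage example are hypothetical.

```python
def count_sentences(conllu_text):
    """Count sentence blocks in a CoNLL-U string.

    A sentence is a run of token lines; blank lines end a sentence and
    comment lines (starting with '#') do not count as tokens.
    """
    count = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            in_sentence = False
        elif not line.startswith("#") and not in_sentence:
            in_sentence = True
            count += 1
    return count


if __name__ == "__main__":
    import sys

    # usage (hypothetical file names): python count.py source.conllu translation.conllu
    counts = []
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            counts.append(count_sentences(f.read()))
    print(counts)
```

If the two counts differ, one of the files is probably missing a sentence or a separating blank line.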
To produce the CoNLL-U files, you may work in your text editor (if you use Visual Studio Code, you can use the [vscode-conllu](https://marketplace.visualstudio.com/items?itemName=lgrobol.vscode-conllu) to get syntax highlighting), use a spreadsheet program and then export to TSV, or use a dedicated graphical annotation tool such as [Arborator](https://arborator.grew.fr/#/).
To produce the CoNLL-U files, you may work in your text editor (you can usually get syntax highlighting by changing the extension to `.tsv`), use a spreadsheet program and then export to TSV, or use a dedicated graphical annotation tool such as [Arborator](https://arborator.grew.fr/#/) (helpful but unstable!).
If you work in your text editor, it might be easier to first write a simplified CoNLL-U, with just the fields `ID`, `FORM`, `UPOS`, `HEAD` and `DEPREL`, separated by tabs, and then expand it to full CoNLL-U with [this script](https://gist.github.com/harisont/612a87d20f729aa3411041f873367fa2) (or similar).
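To make the idea of expanding simplified CoNLL-U concrete, here is a minimal Python sketch. It is not the linked script, only an illustration under these assumptions: every token line has exactly the five tab-separated fields `ID`, `FORM`, `UPOS`, `HEAD` and `DEPREL`, and all remaining columns (`LEMMA`, `XPOS`, `FEATS`, `DEPS`, `MISC`) are filled with `_`. The function names are hypothetical.

```python
def expand_line(line):
    """Expand one simplified line (5 fields) to full CoNLL-U (10 fields).

    Blank lines and comment lines are returned unchanged.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return line
    ident, form, upos, head, deprel = line.split("\t")
    # Full column order: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    return "\t".join([ident, form, "_", upos, "_", "_", head, deprel, "_", "_"])


def expand_simplified(text):
    """Expand a whole simplified treebank given as a single string."""
    return "\n".join(expand_line(line) for line in text.split("\n"))


if __name__ == "__main__":
    import sys

    # usage: python expand.py < simplified.tsv > full.conllu
    sys.stdout.write(expand_simplified(sys.stdin.read()))
```

A line with a different number of fields raises a `ValueError`, which is a cheap way to spot stray spaces used instead of tabs.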
@@ -54,7 +53,7 @@ To fully comply with the CoNLL-U standard, comment lines should consist of key-v
# comment = your comment here
```
but for this assigment lines like
but for this assignment lines like
```
# your comment here
@@ -63,24 +62,14 @@ but for this assigment lines like
are perfectly acceptable too.
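The difference between the two comment styles can be sketched in Python as follows. This is an illustration only, assuming that a key-value comment uses ` = ` as the separator; the helper name `parse_comment` is hypothetical, and it is meant to be called only on lines that start with `#`.

```python
def parse_comment(line):
    """Parse a CoNLL-U comment line.

    Returns (key, value) for '# key = value' comments and
    (None, text) for free-form comments such as '# some text'.
    """
    body = line.lstrip("#").strip()
    key, sep, value = body.partition(" = ")
    if sep:
        return key.strip(), value.strip()
    return None, body


if __name__ == "__main__":
    print(parse_comment("# comment = your comment here"))
    print(parse_comment("# your comment here"))
```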
### Step 4: make sure your files match the CoNLL-U specification
Once you have full CoNLL, you can use [deptreepy](https://github.com/aarneranta/deptreepy/), [STUnD](https://harisont.github.io/STUnD/) or [the official online CoNNL-U viewer](https://universaldependencies.org/conllu_viewer.html) to visualize it.
With deptreepy, you will need to issue the command
`cat my-file.conllu | python deptreepy.py visualize_conllu > my-file.html`
which creates an HTML file you can open in you web browser.
If you can visualize your trees with any of these tools, that's a very good sign that your file _more or less_ matches the CoNNL-U format!
As a last step, validate your treebank with the official UD validator.
Check your treebank with the official UD validator.
To do that, clone or download the [UD tools repository](https://github.com/UniversalDependencies/tools), move inside the corresponding folder and run
```
python validate.py PATH-TO-YOUR-TREEBANK.conllu --lang=2-LETTER-LANGCODE-FOR-YOUR-LANGUAGE --level=1
python validate.py PATH-TO-YOUR-TREEBANK.conllu --lang=2-LETTER-LANGCODE-FOR-YOUR-LANGUAGE --level=2
```
If you want to check for more subtle errors, you can [go up a few levels](https://harisont.github.io/gfaqs.html#ud-validator).
Level 2 should be enough for part 2, but you can [go up a few levels](https://harisont.github.io/gfaqs.html#ud-validator) to check for more subtle errors.
Submit the two CoNLL-U files on Canvas.
@@ -91,7 +80,7 @@ If you want to install MaChAmp on your own computer, keep in mind that very old
For more information, see [here](https://github.com/machamp-nlp/machamp/issues/42).
### Step 1: setting up MaChAmp
1. optional, but recommended: create a Python virtual environment with the command
1. create a Python virtual environment with the command
```
python -m venv ENVNAME
```
@@ -124,7 +113,7 @@ python scripts/misc/cleanconl.py PATH-TO-A-DATASET-SPLIT
This replaces the contents of your input file with a "cleaned up" version of the same treebank.
### Step 3: training
Copy `compsyn.json` to `machamp/configs` and replace the traning and development data paths with the paths to the files you selected/created in step 2.
Copy `compsyn.json` to `machamp/configs` and replace the training and development data paths with the paths to the files you selected/created in step 2.
You can now train your model by running
@@ -152,4 +141,4 @@ Then, use the `machamp/scripts/misc/conll18_ud_eval.py` script to evaluate the s
python scripts/misc/conll18_ud_eval.py PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu
```
On Canvas, submit the training logs, the predictions and the output of `conll18_ud_eval.py`, along with a short text summarizing your considerations on the performance of the parser, based on the predictions themselves and on the output of the results of the evaluation.
On Canvas, submit the training logs, the predictions and the output of `conll18_ud_eval.py`, along with a short text summarizing your considerations on the performance of the parser, based on the predictions themselves and on the automatic evaluation.