mirror of
https://github.com/GrammaticalFramework/comp-syntax-gu-mlt.git
synced 2026-02-08 22:41:05 -07:00
rename stuff + new ud lab draft
116
lab3/README.md
Normal file
@@ -0,0 +1,116 @@
|
||||
# Lab 3: Universal Dependencies
|
||||
|
||||
This lab is divided into two parts.
|
||||
In [part 1](#part-1-ud-annotation), you will create a small parallel UD treebank for English or Swedish and a language of your choice.
|
||||
In [part 2](#part-2-ud-parsing), you will train a parsing model and evaluate it on your treebank.
|
||||
|
||||
## Part 1: UD annotation
|
||||
The goal of this part of the lab is for you to become able to contribute to a UD annotation project. You will familiarize yourself with the CoNLL-U format and annotate your own parallel UD treebank.
|
||||
|
||||
### Step 1: familiarize yourself with the CoNLL-U format
|
||||
Go to [universaldependencies.org](https://universaldependencies.org/) and download a treebank for a language of your choice.
|
||||
Choose one short (5-10 tokens) and one long (>25 tokens) sentence and convert each of them from CoNLL-U to a graphical tree by hand.
|
||||
|
||||
### Step 2: choose a corpus
|
||||
Choose one of the two corpora provided in this folder:
|
||||
|
||||
- [`comp-syntax-corpus-english.txt`](comp-syntax-corpus-english.txt) is a combination of __English__ sentences from different sources, including [the Parallel UD treebank (PUD)](https://github.com/UniversalDependencies/UD_English-PUD/tree/master). If you want to cheat - or just check your answers - you can look for them in the official treebank. You can also compare your analyses with those of an automatic parser, such as [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/), which you can try directly in your browser. These automatic analyses must of course be taken with a grain of salt.
|
||||
- [`comp-syntax-corpus-swedish.txt`](comp-syntax-corpus-swedish.txt) consists of teacher-corrected sentences from the [__Swedish__ Learner Language (SweLL) corpus](https://spraakbanken.gu.se/en/resources/swell-gold), which is currently being annotated in UD for the first time.
|
||||
In this case, there is no "gold standard" to check your answers against, but you can still compare your solutions with [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/)'s automatic analyses.
|
||||
|
||||
In both corpora, the first few sentences are pre-tokenized and POS-tagged. Each token is in the form
|
||||
|
||||
`word:<UPOS>`.
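As an illustration, the pre-tagged lines can be read programmatically. The sketch below is not part of the lab materials: the function name `parse_tagged_sentence` is made up for this example, and it assumes that every token follows the `word:<UPOS>` pattern, with or without a space before punctuation. Tokens that do not match the pattern are silently skipped.

```python
import re

# A form (lazily matched, no whitespace) followed by ":<TAG>".
TOKEN_RE = re.compile(r"(\S+?):<([A-Z]+)>")

def parse_tagged_sentence(line):
    """Return the (form, upos) pairs found in one pre-tagged corpus line."""
    return TOKEN_RE.findall(line)

print(parse_tagged_sentence("Who:<PRON> are:<AUX> they:<PRON> ?:<PUNCT>"))
# prints [('Who', 'PRON'), ('are', 'AUX'), ('they', 'PRON'), ('?', 'PUNCT')]
```

The resulting pairs can serve as a starting point for the `FORM` and `UPOS` columns of your CoNLL-U files.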
|
||||
|
||||
### Step 3: annotate
|
||||
For each sentence in the corpus, the annotation task consists of:
|
||||
|
||||
1. analyzing the sentence in UD
|
||||
2. translating it to a language of your choice
|
||||
3. analyzing your translation
|
||||
|
||||
The only required fields are `ID`, `FORM`, `UPOS`, `HEAD` and `DEPREL`.
|
||||
|
||||
In the end, you will submit two parallel CoNLL-U files, one containing the analyses of the source sentences and one for the analyses of the translations.
|
||||
|
||||
To produce the CoNLL-U files, you may work in your text editor (if you use Visual Studio Code, you can use the [vscode-conllu](https://marketplace.visualstudio.com/items?itemName=lgrobol.vscode-conllu) extension to get syntax highlighting) or use a dedicated annotation tool such as [Arborator](https://arborator.grew.fr/#/).
|
||||
|
||||
If you work in your text editor, it might be easier to first write a simplified CoNLL-U, with just the fields `ID`, `FORM`, `UPOS`, `HEAD` and `DEPREL`, separated by tabs, and then expand it to full CoNLL-U with [this script](https://gist.github.com/harisont/612a87d20f729aa3411041f873367fa2) (or similar).
|
||||
|
||||
Example:
|
||||
|
||||
`7 world NOUN 4 nmod`
|
||||
|
||||
expands to
|
||||
|
||||
`7 world _ NOUN _ _ 4 nmod _ _`
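The expansion itself is mechanical. The following is a minimal sketch of the idea, not the linked script: `expand_line` is a hypothetical helper that assumes exactly five tab-separated fields per token line and fills the remaining columns (`LEMMA`, `XPOS`, `FEATS`, `DEPS`, `MISC`) with `_`.

```python
def expand_line(line):
    """Expand 'ID FORM UPOS HEAD DEPREL' into a 10-column CoNLL-U line."""
    line = line.rstrip("\n")
    # Blank lines (sentence boundaries) and comment lines are kept unchanged.
    if not line.strip() or line.startswith("#"):
        return line
    fields = line.split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}: {line!r}")
    tok_id, form, upos, head, deprel = fields
    # Column order: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    return "\t".join([tok_id, form, "_", upos, "_", "_", head, deprel, "_", "_"])

print(expand_line("7\tworld\tNOUN\t4\tnmod").split("\t"))
# prints ['7', 'world', '_', 'NOUN', '_', '_', '4', 'nmod', '_', '_']
```

Applying the function to every line of a simplified file, and writing the results back out with a blank line after each sentence, gives a file with the full ten columns.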
|
||||
|
||||
We recommend that you annotate at least the first few sentences from scratch.
|
||||
When you start feeling confident, you may pre-parse the remaining ones with UDPipe and manually correct the automatic annotation.
|
||||
|
||||
### Step 4: make sure your files match the CoNLL-U specification
|
||||
Once you have full CoNLL-U files, you can use [deptreepy](https://github.com/aarneranta/deptreepy/), [STUnD](https://harisont.github.io/STUnD/) or [the official online CoNLL-U viewer](https://universaldependencies.org/conllu_viewer.html) to visualize them.
|
||||
|
||||
With deptreepy, you will need to issue the command
|
||||
|
||||
`cat my-file.conllu | python deptreepy.py visualize_conllu > my-file.html`
|
||||
|
||||
which creates an HTML file you can open in your web browser.
|
||||
|
||||
If you can visualize your trees with any of these tools, it means that they are in valid CoNLL-U format.
|
||||
If you want to check for more subtle errors, you can try to download and run [the official UD validator](https://github.com/UniversalDependencies/tools/blob/master/validate.py).
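Before running the full validator, a few structural properties can be checked with a short script. The sketch below is only a rough sanity check under simplifying assumptions and does not replace the official validator: `check_sentence` is a hypothetical helper that looks at the number of columns, the token numbering, and whether every `HEAD` points to an existing token with exactly one root. Multiword token ranges and empty nodes are ignored.

```python
def check_sentence(lines):
    """Return a list of problems found in one sentence, given its lines."""
    problems = []
    # Ignore blank lines and comment lines.
    rows = [l.split("\t") for l in lines if l.strip() and not l.startswith("#")]
    for r in rows:
        if len(r) != 10:
            problems.append(f"expected 10 tab-separated fields, got {len(r)}")
    # Only plain integer IDs are checked; ranges (1-2) and empty nodes (1.1) are skipped.
    words = [r for r in rows if len(r) == 10 and r[0].isdigit()]
    if [int(r[0]) for r in words] != list(range(1, len(words) + 1)):
        problems.append("token IDs are not 1..n in order")
    roots = 0
    for r in words:
        head = r[6]
        if not head.isdigit() or int(head) > len(words):
            problems.append(f"token {r[0]}: HEAD {head!r} does not point to a token")
        elif int(head) == int(r[0]):
            problems.append(f"token {r[0]}: HEAD points to the token itself")
        elif int(head) == 0:
            roots += 1
    if roots != 1:
        problems.append(f"expected exactly one root (HEAD 0), found {roots}")
    return problems

sentence = [
    "1\tJohn\t_\tPROPN\t_\t_\t2\tnsubj\t_\t_",
    "2\twalks\t_\tVERB\t_\t_\t0\troot\t_\t_",
    "3\t.\t_\tPUNCT\t_\t_\t2\tpunct\t_\t_",
]
print(check_sentence(sentence))  # prints []
```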
|
||||
|
||||
Submit the two CoNLL-U files on Canvas.
|
||||
|
||||
## Part 2: UD parsing
|
||||
In this part of the lab, you will train and evaluate a UD parsing + POS tagging model.
|
||||
For better performance, you are strongly encouraged to use the MLTGPU server.
|
||||
|
||||
### Step 1: setting up MaChAmp
|
||||
1. optional, but recommended: create a Python virtual environment with the command
|
||||
```
|
||||
python -m venv ENVNAME
|
||||
```
|
||||
and activate it with
|
||||
|
||||
`source ENVNAME/bin/activate` (Linux/macOS), or
|
||||
|
||||
`ENVNAME\Scripts\activate.bat` (Windows)
|
||||
2. clone [the MaChAmp repository](https://github.com/machamp-nlp/machamp), move inside it and run
|
||||
```
|
||||
pip3 install -r requirements.txt
|
||||
```
|
||||
|
||||
### Step 2: selecting the training and development data
|
||||
Choose a UD treebank for one of the two languages you annotated in [part 1](#part-1-ud-annotation) and download it.
|
||||
If you translated the corpus to a language that does not have a UD treebank, download a treebank for a related language (e.g. Italian if you annotated sentences in Sardinian).
|
||||
|
||||
If you are working on MLTGPU, you may choose a large treebank such as [Swedish-Talbanken](https://github.com/UniversalDependencies/UD_Swedish-Talbanken), which is already divided into a training, development and test split.
|
||||
|
||||
If you are working on your laptop and/or if your language does not have a lot of data available, you may want to use a smaller treebank, such as [Amharic-ATT](https://github.com/UniversalDependencies/UD_Amharic-ATT), which only comes with a test set.
|
||||
In this case, split the test set into a training and a development portion (e.g. 80% of the sentences for training and 20% for development).
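One possible way to make such a split is sketched below. It is an illustration rather than a prescribed procedure: `split_conllu` is a hypothetical helper that assumes sentences are separated by single blank lines with Unix line endings, and it cuts the file at a sentence boundary without shuffling.

```python
def split_conllu(text, train_fraction=0.8):
    """Split CoNLL-U text into (train, dev) strings at a sentence boundary."""
    # Sentences are separated by blank lines.
    sentences = [s for s in text.strip().split("\n\n") if s.strip()]
    cut = int(len(sentences) * train_fraction)
    # Every sentence, including the last one, must be followed by a blank line.
    train = "\n\n".join(sentences[:cut]) + "\n\n"
    dev = "\n\n".join(sentences[cut:]) + "\n\n"
    return train, dev

# Tiny stand-in for a treebank: five one-token "sentences".
demo = "".join(f"# sent_id = {i}\n1\tw{i}\t_\tX\t_\t_\t0\troot\t_\t_\n\n" for i in range(5))
train, dev = split_conllu(demo)
print(train.count("# sent_id"), dev.count("# sent_id"))  # prints 4 1
```

To use it on a real treebank, read the test file into a string, call the function, and write the two results to separate `.conllu` files. Keeping the original order is a deliberate simplification; shuffling the sentences first is an alternative if the treebank is ordered by source or genre.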
|
||||
|
||||
### Step 3: training
|
||||
Copy `compsyn.json` to `machamp/configs` and replace the training and development data paths with the paths to the files you selected/created in step 2.
|
||||
|
||||
You can now train your model by running
|
||||
|
||||
```
|
||||
python3 train.py --dataset_configs configs/compsyn.json --device N
|
||||
```
|
||||
from the MaChAmp folder.
|
||||
If you are working on MLTGPU, replace `N` with `0` (GPU). If you are using your laptop or EDUSERV, replace it with `-1`, which instructs MaChAmp to train the model on the CPU.
|
||||
|
||||
### Step 4: evaluation
|
||||
Run your newly trained model with
|
||||
|
||||
```
|
||||
python predict.py logs/compsyn/DATE/model.pt PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu --device N
|
||||
```
|
||||
and use the `machamp/scripts/misc/conll18_ud_eval.py` script to evaluate the system output against your annotations. You can run it as
|
||||
|
||||
```
|
||||
python conll18_ud_eval.py PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu
|
||||
```
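To get a feeling for what the attachment scores mean, they can also be computed by hand. The sketch below is a simplified illustration and not a substitute for `conll18_ud_eval.py`: the helper names `arcs` and `attachment_scores` are made up, both files are assumed to have identical tokenization, and labels are compared as exact strings, so the numbers may differ from those of the official script.

```python
def arcs(conllu_text):
    """Return the (HEAD, DEPREL) pair of every plain token line."""
    result = []
    for line in conllu_text.splitlines():
        fields = line.split("\t")
        # Skip comments, blank lines, multiword ranges (1-2) and empty nodes (1.1).
        if len(fields) == 10 and fields[0].isdigit():
            result.append((fields[6], fields[7]))
    return result

def attachment_scores(gold_text, pred_text):
    """Return (UAS, LAS) as fractions between 0 and 1."""
    gold, pred = arcs(gold_text), arcs(pred_text)
    if not gold or len(gold) != len(pred):
        raise ValueError("files must contain the same, non-zero number of tokens")
    # UAS: correct head; LAS: correct head and correct relation label.
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las

gold = "1\tJohn\t_\tPROPN\t_\t_\t2\tnsubj\t_\t_\n2\twalks\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
pred = "1\tJohn\t_\tPROPN\t_\t_\t2\tobj\t_\t_\n2\twalks\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
print(attachment_scores(gold, pred))  # prints (1.0, 0.5)
```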
|
||||
|
||||
Submit the training logs, the predictions and the output of `conll18_ud_eval.py`, along with a short text summarizing your observations on the performance of the parser.
|
||||
24
lab3/comp-syntax-corpus-english.txt
Normal file
@@ -0,0 +1,24 @@
|
||||
Who:<PRON> are:<AUX> they:<PRON> ?:<PUNCT>
|
||||
A:<DET> small:<ADJ> town:<NOUN> with:<ADP> two:<NUM> minarets:<NOUN> glides:<VERB> by:<ADV> .:<PUNCT>
|
||||
I:<PRON> was:<AUX> just:<ADV> a:<DET> boy:<NOUN> with:<ADP> muddy:<ADJ> shoes:<NOUN> .:<PUNCT>
|
||||
Shenzhen:<PROPN> 's:<PART> traffic:<NOUN> police:<NOUN> have:<AUX> opted:<VERB> for:<ADP> unconventional:<ADJ> penalties:<NOUN> before:<ADV> .:<PUNCT>
|
||||
The:<DET> study:<NOUN> of:<ADP> volcanoes:<NOUN> is:<AUX> called:<VERB> volcanology:<NOUN> ,:<PUNCT> sometimes:<ADV> spelled:<VERB> vulcanology:<NOUN> .:<PUNCT>
|
||||
It:<PRON> was:<AUX> conducted:<VERB> just:<ADV> off:<ADP> the:<DET> Mexican:<ADJ> coast:<NOUN> from:<ADP> April:<PROPN> to:<ADP> June:<PROPN> .:<PUNCT>
|
||||
":<PUNCT> Her:<PRON> voice:<NOUN> literally:<ADV> went:<VERB> around:<ADP> the:<DET> world:<NOUN> ,:<PUNCT> ":<PUNCT> Leive:<PROPN> said:<VERB> .:<PUNCT>
|
||||
A:<DET> witness:<NOUN> told:<VERB> police:<NOUN> that:<SCONJ> the:<DET> victim:<NOUN> had:<AUX> attacked:<VERB> the:<DET> suspect:<NOUN> in:<ADP> April:<PROPN> .:<PUNCT>
|
||||
It:<PRON> 's:<AUX> most:<ADV> obvious:<ADJ> when:<SCONJ> a:<DET> celebrity:<NOUN> 's:<PART> name:<NOUN> is:<AUX> initially:<ADV> quite:<ADV> rare:<ADJ> .:<PUNCT>
|
||||
This:<PRON> has:<AUX> not:<PART> stopped:<VERB> investors:<NOUN> flocking:<VERB> to:<PART> put:<VERB> their:<PRON> money:<NOUN> in:<ADP> the:<DET> funds:<NOUN> .:<PUNCT>
|
||||
This:<DET> discordance:<NOUN> between:<ADP> economic:<ADJ> data:<NOUN> and:<CCONJ> political:<ADJ> rhetoric:<NOUN> is:<AUX> familiar:<ADJ> ,:<PUNCT> or:<CCONJ> should:<AUX> be:<AUX> .:<PUNCT>
|
||||
The:<DET> feasibility:<NOUN> study:<NOUN> estimates:<VERB> that:<SCONJ> it:<PRON> would:<AUX> take:<VERB> passengers:<NOUN> about:<ADV> four:<NUM> minutes:<NOUN> to:<PART> cross:<VERB> the:<DET> Potomac:<PROPN> River:<PROPN> on:<ADP> the:<DET> gondola:<NOUN> .:<PUNCT>
|
||||
he collected cards and traded them with the other boys
|
||||
this crime carries a penalty of five years in prison
|
||||
the news was carried to every village in the province
|
||||
I carry these thoughts in the back of my head
|
||||
Adam would have been carried over into the life eternal
|
||||
the casings had rotted away and had to be replaced
|
||||
she was incensed that this chit of a girl should dare to make a fool of her in front of the class
|
||||
the landslide he had in the electoral college obscured the narrowness of a victory based on just 43% of the popular vote
|
||||
United States troops now carry atropine and autoinjectors in their first-aid kits to use in case of organophosphate nerve agent poisoning
|
||||
he may accomplish by craft in the long run what he cannot do by force and violence in the short one
|
||||
it has been said that only a hierarchical society with a leisure class at the top can produce works of art
|
||||
his ingenuous explanation that he would not have burned the church if he had not thought the bishop was in it
|
||||
24
lab3/comp-syntax-corpus-swedish.txt
Normal file
@@ -0,0 +1,24 @@
|
||||
Jag:<PRON> tycker:<VERB> att:<SCONJ> du:<PRON> ska:<AUX> börja:<VERB> med:<ADP> en:<DET> språkkurs:<NOUN>.:<PUNCT>
|
||||
Flerspråkighet:<NOUN> gynnar:<VERB> oss:<PRON> även:<ADV> på:<ADP> arbetsmarknaden:<NOUN>.:<PUNCT>
|
||||
Språket:<NOUN> är:<AUX> lätt:<ADJ> och:<CCONJ> jag:<PRON> kan:<AUX> läsa:<VERB> utan:<ADP> något:<DET> problem:<NOUN>.:<PUNCT>
|
||||
Man:<PRON> känner:<VERB> sig:<PRON> ensam:<ADJ> när:<SCONJ> man:<PRON> inte:<PART> kan:<AUX> prata:<VERB> språket:<NOUN> bra:<ADV>.:<PUNCT>
|
||||
Det:<PRON> kan:<AUX> vara:<AUX> kroppsspråk:<NOUN> men:<CCONJ> främst:<ADV> sker:<VERB> det:<PRON> genom:<ADP> talet:<NOUN>.:<PUNCT>
|
||||
Språket:<NOUN> är:<AUX> nyckeln:<NOUN> till:<ADP> alla:<DET> låsta:<ADJ> dörrar:<NOUN>,:<PUNCT> har:<AUX> vi:<PRON> hört:<VERB> flera:<ADJ> gånger:<NOUN>.:<PUNCT>
|
||||
Att:<PART> kunna:<VERB> ett:<DET> språk:<NOUN> är:<AUX> en:<DET> av:<ADP> de:<DET> viktigaste:<ADJ> och:<CCONJ> värdefullaste:<ADJ> egenskaper:<NOUN> en:<DET> människa:<NOUN> kan:<AUX> ha:<VERB> så:<SCONJ> det:<PRON> är:<AUX> värt:<ADJ> mer:<ADV> än:<ADP> vad:<PRON> man:<PRON> tror:<VERB>.:<PUNCT>
|
||||
Med:<ADP> andra:<ADJ> ord:<NOUN>,:<PUNCT> språket:<NOUN> är:<AUX> nyckeln:<NOUN> till:<ADP> alla:<DET> låsta:<ADJ> dörrar:<NOUN>,:<PUNCT> men:<CCONJ> det:<PRON> finns:<VERB> viktigare:<ADJ> saker:<NOUN> att:<PART> satsa:<VERB> på:<ADP> som:<PRON> jag:<PRON> kommer:<AUX> att:<PART> nämna:<VERB> längre:<ADV> ner:<ADV>.:<PUNCT>
|
||||
Han:<PRON> kom:<VERB> till:<ADP> Sverige:<PROPN> för:<ADP> 4:<NUM> år:<NOUN> sedan:<ADV>,:<PUNCT> han:<PRON> kunde:<AUX> inte:<PART> tala:<VERB> svenska:<ADJ> språket:<NOUN>,:<PUNCT> ingen:<DET> engelska:<NOUN>,:<PUNCT> han:<PRON> kunde:<AUX> i:<ADP> princip:<NOUN> inte:<PART> kommunicera:<VERB> med:<ADP> någon:<PRON> här:<ADV>.:<PUNCT>
|
||||
För:<ADP> det:<DET> första:<ADJ> hänger:<VERB> språket:<NOUN> ihop:<ADV> med:<ADP> tillhörighet:<NOUN>,:<PUNCT> särskilt:<ADV> för:<ADP> de:<DET> nya:<ADJ> invandrare:<NOUN> som:<PRON> har:<AUX> bestämt:<VERB> sig:<PRON> för:<ADP> att:<PART> flytta:<VERB> och:<CCONJ> bosätta:<VERB> sig:<PRON> i:<ADP> Sverige:<PROPN>.:<PUNCT>
|
||||
Om:<SCONJ> alla:<PRON> hade:<AUX> talat:<VERB> samma:<DET> språk:<NOUN> hade:<AUX> det:<PRON> förmodligen:<ADV> inte:<PART> funnits:<VERB> något:<DET> utanförskap:<NOUN>,:<PUNCT> utan:<CCONJ> man:<PRON> hade:<AUX> fått:<VERB> en:<DET> typ:<NOUN> av:<ADP> gemenskap:<NOUN> där:<ADV> man:<PRON> delar:<VERB> samma:<DET> kultur:<NOUN>.:<PUNCT>
|
||||
Att:<PART> lära:<VERB> sig:<PRON> ett:<DET> språk:<NOUN> är:<AUX> väldigt:<ADV> svårt:<ADJ>,:<PUNCT> speciellt:<ADV> för:<ADP> vuxna:<ADJ> människor:<NOUN>,:<PUNCT> och:<CCONJ> eftersom:<SCONJ> majoritetsspråket:<NOUN> blir:<VERB> en:<DET> viktig:<ADJ> del:<NOUN> i:<ADP> en:<DET> persons:<NOUN> liv:<NOUN> räcker:<VERB> det:<PRON> inte:<PART> att:<PART> tala:<VERB> det:<PRON> på:<ADP> söndagar:<NOUN> utan:<CCONJ> det:<PRON> måste:<AUX> läras:<VERB> in:<PART> som:<SCONJ> ett:<DET> modersmål:<NOUN>,:<PUNCT> vilket:<PRON> finansieras:<VERB> av:<ADP> oss:<PRON> skattebetalare:<NOUN>.:<PUNCT>
|
||||
Avslutningsvis så vill jag förmedla att vi bör rädda världen innan språken.
|
||||
Språket är ganska enkelt, och det är lätt att förstå vad romanen handlar om.
|
||||
Det är även kostsamt för staten att se till att dessa minoritetsspråk lever kvar.
|
||||
Låt mig säga att det är inte för sent att rädda de små språken, vi måste ta steget nu.
|
||||
Att hålla dessa minoritetsspråk vid liv är både slöseri med tid och mycket ekonomiskt krävande.
|
||||
Jag tackar alla lärare på Sfi som hjälper oss för att vi ska kunna bli bättre på svenska språket.
|
||||
Språk skapades för flera tusen år sedan och vissa språk har tynat bort medan några nya har skapats.
|
||||
Samhället behöver flerspråkiga och vägen till kommunikation och till att begripa andras kulturer är ett språk.
|
||||
Om man kan fler språk har man fler möjligheter att använda sig av dem vilket leder till utveckling.
|
||||
Därför tycker jag att vi bör införa ett förbud mot främmande språk i statliga myndigheter och föreningar.
|
||||
Men jag anser först och främst att språket är som själen, det som ger oss livskraft, säregenhet och karaktär.
|
||||
På Sveriges riksdags hemsida kan man läsa om hur Sverige bidrar med att skydda dessa språk med hjälp av statligt bidrag.
|
||||
38
lab3/lecture3-examples.txt
Normal file
@@ -0,0 +1,38 @@
|
||||
John prefers wine to beer.
|
||||
John definitely prefers beer to wine today if he is hungry.
|
||||
John walks.
|
||||
John does not walk.
|
||||
John has walked.
|
||||
John is old.
|
||||
John is a doctor.
|
||||
John is here.
|
||||
Probably the best beer in the world.
|
||||
Why?
|
||||
|
||||
She will.
|
||||
The black cat was seen.
|
||||
There is an elephant in the room.
|
||||
It is too cold in the room.
|
||||
She gave me a hint.
|
||||
She gave a hint to me.
|
||||
John saw a man with a telescope.
|
||||
She walks today.
|
||||
genetically modified
|
||||
|
||||
Why does she walk?
|
||||
|
||||
Why does she not walk?
|
||||
|
||||
Every man in the city works in the city.
|
||||
She is here.
|
||||
She has been here.
|
||||
The reason is that I am tired.
|
||||
He can sing.
|
||||
Does he sing?
|
||||
|
||||
Would he have sung?
|
||||
|
||||
He would not have been tired.
|
||||
He called me a "bad loser".
|
||||
|
||||
|
||||
17
lab3/machamp_config.json
Normal file
@@ -0,0 +1,17 @@
|
||||
{
|
||||
"compsyn": {
|
||||
"train_data_path": "PATH-TO-YOUR-TRAIN-SPLIT",
|
||||
"dev_data_path": "PATH-TO-YOUR-DEV-SPLIT",
|
||||
"word_idx": 1,
|
||||
"tasks": {
|
||||
"upos": {
|
||||
"task_type": "seq",
|
||||
"column_idx": 3
|
||||
},
|
||||
"dependency": {
|
||||
"task_type": "dependency",
|
||||
"column_idx": 6
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||