2026 modifications to lab 3

Arianna Masciolini
2026-03-29 21:15:06 +02:00
parent f25a32e983
commit 088f52a0f6
3 changed files with 15 additions and 50 deletions


@@ -12,28 +12,27 @@ Go to [universaldependencies.org](https://universaldependencies.org/) and downlo
Choose a short (5-10 tokens) and a long (>25 words) sentence and convert them from CoNLL-U to graphical trees by hand.
### Step 2: choose a corpus
Choose one of the two corpora provided in this folder:
Choose a corpus of 25+ sentences.
- [`comp-syntax-corpus-english.txt`](comp-syntax-corpus-english.txt) is a combination of __English__ sentences from different sources, including [the Parallel UD treebank (PUD)](https://github.com/UniversalDependencies/UD_English-PUD/tree/master). If you want to cheat - or just check your answers - you can look for them in the official treebank. You can also compare your analyses with those of an automatic parser, such as [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/), which you can try directly in your browser. These automatic analyses must of course be taken with a grain of salt
- [`comp-syntax-corpus-swedish.txt`](comp-syntax-corpus-swedish.txt) consists of teacher-corrected sentences from the [__Swedish__ Learner Language (SweLL) corpus](https://spraakbanken.gu.se/en/resources/swell-gold), which is currently being annotated in UD for the first time.
In this case, there is no "gold standard" to check your answers against, but you can still compare your solutions with [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/)'s automatic analyses.
If you want to start with __English__, you can use [`comp-syntax-corpus-english.txt`](comp-syntax-corpus-english.txt), a combination of sentences from different sources, including [the Parallel UD treebank (PUD)](https://github.com/UniversalDependencies/UD_English-PUD/tree/master). If you want to cheat - or just check your answers - you can look for them in the official treebank. You can also compare your analyses with those of an automatic parser, such as [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/), which you can try directly in your browser. These automatic analyses must of course be taken with a grain of salt. Note that the first few sentences of this corpus are pre-tokenized and POS-tagged. Each token is in the form `word:<UPOS>`.
In both corpora, the first few sentences are pre-tokenized and POS-tagged. Each token is in the form
`word:<UPOS>`.
If you want to work with __Swedish__ and might be interested in contributing to an [official UD treebank](https://github.com/universaldependencies/UD_Swedish-SweLL), ask Arianna for [a sample of the Swedish Learner Language corpus](https://spraakbanken.gu.se/en/resources/swell).
If you have other data in mind that you think would be interesting to annotate in UD (not necessarily in English or Swedish), don't hesitate to bring it up during a lab session!
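The `word:<UPOS>` form used by the pre-tagged sentences is easy to read programmatically. The following is a minimal sketch, not part of the lab materials: the function name `parse_tagged_sentence` is hypothetical, and it assumes tokens are separated by whitespace, as in the English corpus file.

```python
def parse_tagged_sentence(line):
    """Split a pre-tagged line into (form, upos) pairs.

    Assumes whitespace-separated tokens of the form word:<UPOS>.
    """
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST ":<", so a form that itself
        # contains ":" (e.g. a colon token) is preserved intact.
        form, sep, tag = token.rpartition(":<")
        if not sep or not form or not tag.endswith(">"):
            raise ValueError(f"token not in word:<UPOS> form: {token!r}")
        pairs.append((form, tag[:-1]))  # drop the closing ">"
    return pairs


if __name__ == "__main__":
    example = "It:<PRON> 's:<AUX> most:<ADV> obvious:<ADJ> .:<PUNCT>"
    for form, upos in parse_tagged_sentence(example):
        print(form, upos, sep="\t")
```

The resulting pairs can serve as a starting point for the `FORM` and `UPOS` columns of your annotation; `ID`, `HEAD` and `DEPREL` still have to be added by hand.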
### Step 3: annotate
For each sentence in the corpus, the annotation task consists of:
1. analyzing the sentence in UD
2. translating it to a language of your choice
2. translating it to a language of your choice (as long as one of the two versions is in English or Swedish)
3. analyzing your translation
The only required fields are `ID`, `FORM`, `UPOS`, `HEAD` and `DEPREL`.
In the end, you will submit two parallel CoNLL-U files, one containing the analyses of the source sentences and one for the analyses of the translations.
To produce the CoNLL-U files, you may work in your text editor (if you use Visual Studio Code, you can use the [vscode-conllu](https://marketplace.visualstudio.com/items?itemName=lgrobol.vscode-conllu) to get syntax highlighting), use a spreadsheet program and then export to TSV, or use a dedicated graphical annotation tool such as [Arborator](https://arborator.grew.fr/#/).
To produce the CoNLL-U files, you may work in your text editor (you can usually get syntax highlighting by changing the extension to `.tsv`), use a spreadsheet program and then export to TSV, or use a dedicated graphical annotation tool such as [Arborator](https://arborator.grew.fr/#/) (helpful but unstable!).
If you work in your text editor, it might be easier to first write a simplified CoNLL-U, with just the fields `ID`, `FORM`, `UPOS`, `HEAD` and `DEPREL`, separated by tabs, and then expand it to full CoNLL-U with [this script](https://gist.github.com/harisont/612a87d20f729aa3411041f873367fa2) (or similar).
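To illustrate what such an expansion involves, here is a minimal sketch. It is not the linked script, and the names `expand_line` and `expand` are hypothetical. It assumes each token line holds exactly the five tab-separated fields `ID`, `FORM`, `UPOS`, `HEAD` and `DEPREL`, and fills the remaining CoNLL-U columns (`LEMMA`, `XPOS`, `FEATS`, `DEPS`, `MISC`) with underscores.

```python
def expand_line(line):
    """Expand one simplified line (ID, FORM, UPOS, HEAD, DEPREL)
    to the ten CoNLL-U columns, filling unknown fields with "_"."""
    line = line.rstrip("\n")
    # Blank lines (sentence separators) and comments pass through unchanged.
    if not line.strip() or line.startswith("#"):
        return line
    fields = line.split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}: {line!r}")
    tok_id, form, upos, head, deprel = fields
    # Column order: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    return "\t".join([tok_id, form, "_", upos, "_", "_", head, deprel, "_", "_"])


def expand(text):
    """Expand a whole simplified treebank given as a single string."""
    return "\n".join(expand_line(l) for l in text.splitlines()) + "\n"


if __name__ == "__main__":
    simplified = "# text = Hello !\n1\tHello\tINTJ\t0\troot\n2\t!\tPUNCT\t1\tpunct\n"
    print(expand(simplified), end="")
```

Whichever tool you use for this step, the output still needs to be checked against the CoNLL-U specification afterwards.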
@@ -54,7 +53,7 @@ To fully comply with the CoNLL-U standard, comment lines should consist of key-v
# comment = your comment here
```
but for this assigment lines like
but for this assignment lines like
```
# your comment here
@@ -63,24 +62,14 @@ but for this assigment lines like
are perfectly acceptable too.
### Step 4: make sure your files match the CoNLL-U specification
Once you have full CoNLL-U, you can use [deptreepy](https://github.com/aarneranta/deptreepy/), [STUnD](https://harisont.github.io/STUnD/) or [the official online CoNLL-U viewer](https://universaldependencies.org/conllu_viewer.html) to visualize it.
With deptreepy, you will need to issue the command
`cat my-file.conllu | python deptreepy.py visualize_conllu > my-file.html`
which creates an HTML file you can open in your web browser.
If you can visualize your trees with any of these tools, that's a very good sign that your file _more or less_ matches the CoNLL-U format!
As a last step, validate your treebank with the official UD validator.
Check your treebank with the official UD validator.
To do that, clone or download the [UD tools repository](https://github.com/UniversalDependencies/tools), move inside the corresponding folder and run
```
python validate.py PATH-TO-YOUR-TREEBANK.conllu --lang=2-LETTER-LANGCODE-FOR-YOUR-LANGUAGE --level=1
python validate.py PATH-TO-YOUR-TREEBANK.conllu --lang=2-LETTER-LANGCODE-FOR-YOUR-LANGUAGE --level=2
```
If you want to check for more subtle errors, you can [go up a few levels](https://harisont.github.io/gfaqs.html#ud-validator).
Level 2 should be enough for part 2, but you can [go up a few levels](https://harisont.github.io/gfaqs.html#ud-validator) to check for more subtle errors.
Submit the two CoNLL-U files on Canvas.
@@ -91,7 +80,7 @@ If you want to install MaChAmp on your own computer, keep in mind that very old
For more information, see [here](https://github.com/machamp-nlp/machamp/issues/42).
### Step 1: setting up MaChAmp
1. optional, but recommended: create a Python virtual environment with the command
1. create a Python virtual environment with the command
```
python -m venv ENVNAME
```
@@ -124,7 +113,7 @@ python scripts/misc/cleanconl.py PATH-TO-A-DATASET-SPLIT
This replaces the contents of your input file with a "cleaned up" version of the same treebank.
### Step 3: training
Copy `compsyn.json` to `machamp/configs` and replace the traning and development data paths with the paths to the files you selected/created in step 2.
Copy `compsyn.json` to `machamp/configs` and replace the training and development data paths with the paths to the files you selected/created in step 2.
You can now train your model by running
@@ -152,4 +141,4 @@ Then, use the `machamp/scripts/misc/conll18_ud_eval.py` script to evaluate the s
python scripts/misc/conll18_ud_eval.py PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu
```
On Canvas, submit the training logs, the predictions and the output of `conll18_ud_eval.py`, along with a short text summarizing your considerations on the performance of the parser, based on the predictions themselves and on the output of the results of the evaluation.
On Canvas, submit the training logs, the predictions and the output of `conll18_ud_eval.py`, along with a short text summarizing your considerations on the performance of the parser, based on the predictions themselves and on the automatic evaluation.


@@ -6,7 +6,7 @@ The:<DET> study:<NOUN> of:<ADP> volcanoes:<NOUN> is:<AUX> called:<VERB> volcanol
It:<PRON> was:<AUX> conducted:<VERB> just:<ADV> off:<ADP> the:<DET> Mexican:<ADJ> coast:<NOUN> from:<ADP> April:<PROPN> to:<ADP> June:<PROPN> .:<PUNCT>
":<PUNCT> Her:<PRON> voice:<NOUN> literally:<ADV> went:<VERB> around:<ADP> the:<DET> world:<NOUN> ,:<PUNCT> ":<PUNCT> Leive:<PROPN> said:<VERB> .:<PUNCT>
A:<DET> witness:<NOUN> told:<VERB> police:<NOUN> that:<SCONJ> the:<DET> victim:<NOUN> had:<AUX> attacked:<VERB> the:<DET> suspect:<NOUN> in:<ADP> April:<PROPN> .:<PUNCT>
It:<PRON> 's:<AUX> most:<ADV> obvious:<ADJ> when:<SSUBJ> a:<DET> celebrity:<NOUN> 's:<PART> name:<NOUN> is:<AUX> initially:<ADV> quite:<ADV> rare:<ADJ> .:<PUNCT>
It:<PRON> 's:<AUX> most:<ADV> obvious:<ADJ> when:<SCONJ> a:<DET> celebrity:<NOUN> 's:<PART> name:<NOUN> is:<AUX> initially:<ADV> quite:<ADV> rare:<ADJ> .:<PUNCT>
This:<PRON> has:<AUX> not:<PART> stopped:<VERB> investors:<NOUN> flocking:<VERB> to:<PART> put:<VERB> their:<PRON> money:<NOUN> in:<ADP> the:<DET> funds:<NOUN> .:<PUNCT>
This:<DET> discordance:<NOUN> between:<ADP> economic:<ADJ> data:<NOUN> and:<CCONJ> political:<ADJ> rhetoric:<NOUN> is:<AUX> familiar:<ADJ> ,:<PUNCT> or:<CCONJ> should:<AUX> be:<AUX> .:<PUNCT>
The:<DET> feasibility:<NOUN> study:<NOUN> estimates:<VERB> that:<SCONJ> it:<PRON> would:<AUX> take:<VERB> passengers:<NOUN> about:<ADV> four:<NUM> minutes:<NOUN> to:<PART> cross:<VERB> the:<DET> Potomac:<PROPN> River:<PROPN> on:<ADP> the:<DET> gondola:<NOUN> .:<PUNCT>


@@ -1,24 +0,0 @@
Jag:<PRON> tycker:<VERB> att:<SCONJ> du:<PRON> ska:<AUX> börja:<VERB> med:<ADP> en:<DET> språkkurs:<NOUN>.:<PUNCT>
Flerspråkighet:<NOUN> gynnar:<VERB> oss:<PRON> även:<ADV> på:<ADP> arbetsmarknaden:<NOUN>.:<PUNCT>
Språket:<NOUN> är:<AUX> lätt:<ADJ> och:<CCONJ> jag:<PRON> kan:<AUX> läsa:<VERB> utan:<ADP> något:<DET> problem:<PRON>.:<PUNCT>
Man:<PRON> känner:<VERB> sig:<PRON> ensam:<ADJ> när:<SCONJ> man:<PRON> inte:<PART> kan:<AUX> prata:<VERB> språket:<NOUN> bra:<ADV>.:<PUNCT>
Det:<PRON> kan:<AUX> vara:<AUX> kroppsspråk:<NOUN> men:<CCONJ> främst:<ADV> sker:<VERB> det:<PRON> genom:<ADP> talet:<NOUN>.
Språket:<NOUN> är:<AUX> nyckeln:<NOUN> till:<ADP> alla:<DET> låsta:<ADJ> dörrar:<NOUN>,:<PUNCT> har:<AUX> vi:<PRON> hört:<VERB> flera:<ADJ> gånger:<NOUN>.:<PUNCT>
Att:<PART> kunna:<VERB> ett:<DET> språk:<NOUN> är:<AUX> en:<DET> av:<ADP> de:<DET> viktigaste:<ADJ> och:<CCONJ> värdefullaste:<ADJ> egenskaper:<NOUN> en:<DET> människa:<NOUN> kan:<AUX> ha:<VERB> så:<SCONJ> det:<PRON> är:<AUX> värt:<ADJ> mer:<ADV> än:<ADP> vad:<PRON> man:<PRON> tror:<VERB>.:<PUNCT>
Med:<ADP> andra:<ADJ> ord:<NOUN>,:<PUNCT> språket:<NOUN> är:<AUX> nyckeln:<NOUN> till:<ADP> alla:<DET> låsta:<ADJ> dörrar:<NOUN>,:<PUNCT> men:<CCONJ> det:<PRON> finns:<VERB> viktigare:<ADJ> saker:<NOUN> att:<PART> satsa:<VERB> på:<ADP> som:<PRON> jag:<PRON> kommer:<AUX> att:<PART> nämna:<VERB> längre:<ADV> ner:<ADV>.:<PUNCT>
Han:<PRON> kom:<VERB> till:<ADP> Sverige:<PROPN> för:<ADP> 4:<NUM> år:<NOUN> sedan:<ADV>,:<PUNCT> han:<PRON> kunde:<AUX> inte:<PART> tala:<VERB> svenska:<ADJ> språket:<NOUN>,<PUNCT> ingen:<DET> engelska:<NOUN>,:<PUNCT> han:<PRON> kunde:<AUX> i:<ADP> princip:<NOUN> inte:<PART> kommunicera:<VERB> med:<ADP> någon:<PRON> här<ADV>.:<PUNCT>
För:<ADP> det:<DET> första:<ADJ> hänger:<VERB> språket:<NOUN> ihop:<ADV> med:<ADP> tillhörighet:<NOUN>,:<PUNCT> särskilt:<ADV> för:<ADP> de:<DET> nya:<ADJ> invandrare:<NOUN> som:<PRON> har:<AUX> bestämt:<VERB> sig:<PRON> för:<ADP> att:<PART> flytta:<VERB> och:<CCONJ> bosätta:<VERB> sig:<PRON> i:<ADP> Sverige:<PROPN>.:<PUNCT>
Om:<SCONJ> alla:<PRON> hade:<AUX> talat:<VERB> samma:<DET> språk:<NOUN> hade:<AUX> det:<PRON> förmodligen:<ADV> inte:<PART> funnits:<VERB> något:<DET> utanförskap:<NOUN>,:<PUNCT> utan:<CCONJ> man:<PRON> hade:<AUX> fått:<VERB> en:<DET> typ:<NOUN> av:<ADP> gemenskap:<NOUN> där:<ADV> man:<PRON> delar:<VERB> samma:<DET> kultur:<NOUN>.:<PUNCT>
Att:<PART> lära:<VERB> sig:<PRON> ett:<DET> språk:<NOUN> är:<AUX> väldigt:<ADV> svårt:<ADJ>,:<PUNCT> speciellt:<ADV> för:<ADP> vuxna:<ADJ> människor:<NOUN>,:<PUNCT> och:<CCONJ> eftersom:<SCONJ> majoritetsspråket:<NOUN> blir:<VERB> en:<DET> viktig:<ADJ> del:<NOUN> i:<ADP> en:<DET> persons:<NOUN> liv:<NOUN> räcker:<VERB> det:<PRON> inte:<PART> att:<PART> tala:<VERB> det:<PRON> på:<ADP> söndagar:<NOUN> utan:<CCONJ> det:<PRON> måste:<AUX> läras:<VERB> in:<PART> som:<SCONJ> ett:<DET> modersmål:<NOUN>,:<PUNCT> vilket:<PRON> finansieras:<VERB> av:<ADP> oss:<PRON> skattebetalare:<NOUN>.:<PUNCT>
Avslutningsvis så vill jag förmedla att vi bör rädda världen innan språken.
Språket är ganska enkelt, och det är lätt att förstå vad romanen handlar om.
Det är även kostsamt för staten att se till att dessa minoritetsspråk lever kvar.
Låt mig säga att det är inte för sent att rädda de små språken, vi måste ta steget nu.
Att hålla dessa minoritetsspråk vid liv är både slöseri med tid och mycket ekonomiskt krävande.
Jag tackar alla lärare på Sfi som hjälper oss för att vi ska kunna bli bättre på svenska språket.
Språk skapades för flera tusen år sedan och vissa språk har tynat bort medan några nya har skapats.
Samhället behöver flerspråkiga och vägen till kommunikation och till att begripa andras kulturer är ett språk.
Om man kan fler språk har man fler möjligheter att använda sig av dem vilket leder till utveckling.
Därför tycker jag att vi bör införa ett förbud mot främmande språk i statliga myndigheter och föreningar.
Men jag anser först och främst att språket är som själen, det som ger oss livskraft, säregenhet och karaktär.
På Sveriges riksdags hemsida kan man läsa om hur Sverige bidrar med att skydda dessa språk med hjälp av statligt bidrag.