From 088f52a0f6390e22ab1be8a235019a490082a636 Mon Sep 17 00:00:00 2001
From: Arianna Masciolini
Date: Sun, 29 Mar 2026 21:15:06 +0200
Subject: [PATCH] 2026 modifications to lab 3

---
 lab3/README.md                      | 39 +++++++++++------------------
 lab3/comp-syntax-corpus-english.txt |  2 +-
 lab3/comp-syntax-corpus-swedish.txt | 24 ------------------
 3 files changed, 15 insertions(+), 50 deletions(-)
 delete mode 100644 lab3/comp-syntax-corpus-swedish.txt

diff --git a/lab3/README.md b/lab3/README.md
index 6f57e6d..b705a81 100644
--- a/lab3/README.md
+++ b/lab3/README.md
@@ -12,28 +12,27 @@ Go to [universaldependencies.org](https://universaldependencies.org/) and downlo
 Choose a short (5-10 tokens) and a long (>25 words) sentence and convert it from CoNNL-U to a graphical trees by hand.
 ### Step 2: choose a corpus
-Choose one of the two corpora provided in this folder:
+Choose a corpus of 25+ sentences.
-- [`comp-syntax-corpus-english.txt`](comp-syntax-corpus-english.txt) is a combination of __English__ sentences from different sources, including [the Parallel UD treebank (PUD)](https://github.com/UniversalDependencies/UD_English-PUD/tree/master). If you want to cheat - or just check your answers - you can look for them in the official treebank. You can also compare your analyses with those of an automatic parser, such as [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/), which you can try directly in your browser. These automatic analyses must of course be taken with a grain of salt
-- [`comp-syntax-corpus-swedish.txt`](comp-syntax-corpus-swedish.txt) consists of teacher-corrected sentences from the [__Swedish__ Learner Language (SweLL) corpus](https://spraakbanken.gu.se/en/resources/swell-gold), which is currently being annotated in UD for the first time.
-In this case, there is no "gold standard" to check your answers against, but you can still compare your solutions with [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/)'s automatic analyses.
+If you want to start with __English__, you can use [`comp-syntax-corpus-english.txt`](comp-syntax-corpus-english.txt), a combination of sentences from different sources, including [the Parallel UD treebank (PUD)](https://github.com/UniversalDependencies/UD_English-PUD/tree/master). If you want to cheat - or just check your answers - you can look for them in the official treebank. You can also compare your analyses with those of an automatic parser, such as [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/), which you can try directly in your browser. These automatic analyses must of course be taken with a grain of salt. Note that the first few sentences of this corpus are pre-tokenized and POS-tagged. Each token is in the form `word:`.
-In both corpora, the first few sentences are pre-tokenized and POS-tagged. Each token is in the form
-`word:`.
+If you want to work with __Swedish__ and might be interested in contributing to an [official UD treebank](https://github.com/universaldependencies/UD_Swedish-SweLL), ask Arianna for [a sample of the Swedish Learner Language corpus](https://spraakbanken.gu.se/en/resources/swell).
+
+If you have other data in mind that you think would be interesting to annotate in UD (not necessarily in English or Swedish), don't hesitate to bring it up during a lab session!
 ### Step 3: annotate
 For each sentence in the corpus, the annotation tasks consists in:
 1. analyzing the sentence in UD
-2. translating it to a language of your choice
+2. translating it to a language of your choice (as long as one of the two versions is in English or Swedish)
 3. analyzing your translation
 The only required fields are `ID`, `FORM`, `UPOS`, `HEAD` and `DEPREL`. In the end, you will submit two parallel CoNLL-U files, one containing the analyses of the source sentences and one for the analyses of the translations.
-To produce the CoNLL-U files, you may work in your text editor (if you use Visual Studio Code, you can use the [vscode-conllu](https://marketplace.visualstudio.com/items?itemName=lgrobol.vscode-conllu) to get syntax highlighting), use a spreadsheet program and then export to TSV, or use a dedicated graphical annotation tool such as [Arborator](https://arborator.grew.fr/#/).
+To produce the CoNLL-U files, you may work in your text editor (you can usually get syntax highlighting by changing the extension to `.tsv`), use a spreadsheet program and then export to TSV, or use a dedicated graphical annotation tool such as [Arborator](https://arborator.grew.fr/#/) (helpful but unstable!).
 If you work in your text editor, it might be easier to first write a simplified CoNLL-U, with just the fields `ID`, `FORM`, `UPOS`, `HEAD` and `DEPREL`, separated by tabs, and then expand it to full CoNLL-U with [this script](https://gist.github.com/harisont/612a87d20f729aa3411041f873367fa2) (or similar).
@@ -54,7 +53,7 @@ To fully comply with the CoNLL-U standard, comment lines should consist of key-v
 ```
 # comment = your comment here
 ```
-but for this assigment lines like
+but for this assignment lines like
 ```
 # your comment here
@@ -63,24 +62,14 @@ but for this assigment lines like
 are perfectly acceptable too.
 ### Step 4: make sure your files match the CoNLL-U specification
-Once you have full CoNLL, you can use [deptreepy](https://github.com/aarneranta/deptreepy/), [STUnD](https://harisont.github.io/STUnD/) or [the official online CoNNL-U viewer](https://universaldependencies.org/conllu_viewer.html) to visualize it.
-
-With deptreepy, you will need to issue the command
-
-`cat my-file.conllu | python deptreepy.py visualize_conllu > my-file.html`
-
-which creates an HTML file you can open in you web browser.
-
-If you can visualize your trees with any of these tools, that's a very good sign that your file _more or less_ matches the CoNNL-U format!
-
-As a last step, validate your treebank with the official UD validator.
+Check your treebank with the official UD validator.
 To do that, clone or download the [UD tools repository](https://github.com/UniversalDependencies/tools), move inside the corresponding folder and run
 ```
-python validate.py PATH-TO-YOUR-TREEBANK.conllu --lang=2-LETTER-LANGCODE-FOR-YOUR-LANGUAGE --level=1
+python validate.py PATH-TO-YOUR-TREEBANK.conllu --lang=2-LETTER-LANGCODE-FOR-YOUR-LANGUAGE --level=2
 ```
-If you want to check for more subtle errors, you can [go up a few levels](https://harisont.github.io/gfaqs.html#ud-validator).
+Level 2 should be enough for part 2, but you can [go up a few levels](https://harisont.github.io/gfaqs.html#ud-validator) to check for more subtle errors.
 Submit the two CoNLL-U files on Canvas.
@@ -91,7 +80,7 @@ If you want to install MaChAmp on your own computer, keep in mind that very old
 For more information, see [here](https://github.com/machamp-nlp/machamp/issues/42).
 ### Step 1: setting up MaChAmp
-1. optional, but recommended: create a Python virtual environment with the command
+1. create a Python virtual environment with the command
 ```
 python -m venv ENVNAME
 ```
@@ -124,7 +113,7 @@ python scripts/misc/cleanconl.py PATH-TO-A-DATASET-SPLIT
 This replaces the contents of your input file with a "cleaned up" version of the same treebank.
 ### Step 3: training
-Copy `compsyn.json` to `machamp/configs` and replace the traning and development data paths with the paths to the files you selected/created in step 2.
+Copy `compsyn.json` to `machamp/configs` and replace the training and development data paths with the paths to the files you selected/created in step 2.
 You can now train your model by running
@@ -152,4 +141,4 @@ Then, use the `machamp/scripts/misc/conll18_ud_eval.py` script to evaluate the s
 python scripts/misc/conll18_ud_eval.py PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu
 ```
-On Canvas, submit the training logs, the predictions and the output of `conll18_ud_eval.py`, along with a short text summarizing your considerations on the performance of the parser, based on the predictions themselves and on the output of the results of the evaluation.
\ No newline at end of file
+On Canvas, submit the training logs, the predictions and the output of `conll18_ud_eval.py`, along with a short text summarizing your considerations on the performance of the parser, based on the predictions themselves and on the automatic evaluation.
\ No newline at end of file
diff --git a/lab3/comp-syntax-corpus-english.txt b/lab3/comp-syntax-corpus-english.txt
index 3d0624a..77c1136 100644
--- a/lab3/comp-syntax-corpus-english.txt
+++ b/lab3/comp-syntax-corpus-english.txt
@@ -6,7 +6,7 @@ The: study: of: volcanoes: is: called: volcanol
 It: was: conducted: just: off: the: Mexican: coast: from: April: to: June: .:
 ": Her: voice: literally: went: around: the: world: ,: ": Leive: said: .:
 A: witness: told: police: that: the: victim: had: attacked: the: suspect: in: April: .:
-It: 's: most: obvious: when: a: celebrity: 's: name: is: initially: quite: rare: .:
+It: 's: most: obvious: when: a: celebrity: 's: name: is: initially: quite: rare: .:
 This: has: not: stopped: investors: flocking: to: put: their: money: in: the: funds: .:
 This: discordance: between: economic: data: and: political: rhetoric: is: familiar: ,: or: should: be: .:
 The: feasibility: study: estimates: that: it: would: take: passengers: about: four: minutes: to: cross: the: Potomac: River: on: the: gondola: .:
diff --git a/lab3/comp-syntax-corpus-swedish.txt b/lab3/comp-syntax-corpus-swedish.txt
deleted file mode 100644
index 002d99e..0000000
--- a/lab3/comp-syntax-corpus-swedish.txt
+++ /dev/null
@@ -1,24 +0,0 @@
-Jag: tycker: att: du: ska: börja: med: en: språkkurs:.:
-Flerspråkighet: gynnar: oss: även: på: arbetsmarknaden:.:
-Språket: är: lätt: och: jag: kan: läsa: utan: något: problem:.:
-Man: känner: sig: ensam: när: man: inte: kan: prata: språket: bra:.:
-Det: kan: vara: kroppsspråk: men: främst: sker: det: genom: talet:.
-Språket: är: nyckeln: till: alla: låsta: dörrar:,: har: vi: hört: flera: gånger:.:
-Att: kunna: ett: språk: är: en: av: de: viktigaste: och: värdefullaste: egenskaper: en: människa: kan: ha: så: det: är: värt: mer: än: vad: man: tror:.:
-Med: andra: ord:,: språket: är: nyckeln: till: alla: låsta: dörrar:,: men: det: finns: viktigare: saker: att: satsa: på: som: jag: kommer: att: nämna: längre: ner:.:
-Han: kom: till: Sverige: för: 4: år: sedan:,: han: kunde: inte: tala: svenska: språket:, ingen: engelska:,: han: kunde: i: princip: inte: kommunicera: med: någon: här.:
-För: det: första: hänger: språket: ihop: med: tillhörighet:,: särskilt: för: de: nya: invandrare: som: har: bestämt: sig: för: att: flytta: och: bosätta: sig: i: Sverige:.:
-Om: alla: hade: talat: samma: språk: hade: det: förmodligen: inte: funnits: något: utanförskap:,: utan: man: hade: fått: en: typ: av: gemenskap: där: man: delar: samma: kultur:.:
-Att: lära: sig: ett: språk: är: väldigt: svårt:,: speciellt: för: vuxna: människor:,: och: eftersom: majoritetsspråket: blir: en: viktig: del: i: en: persons: liv: räcker: det: inte: att: tala: det: på: söndagar: utan: det: måste: läras: in: som: ett: modersmål:,: vilket: finansieras: av: oss: skattebetalare:.:
-Avslutningsvis så vill jag förmedla att vi bör rädda världen innan språken.
-Språket är ganska enkelt, och det är lätt att förstå vad romanen handlar om.
-Det är även kostsamt för staten att se till att dessa minoritetsspråk lever kvar.
-Låt mig säga att det är inte för sent att rädda de små språken, vi måste ta steget nu.
-Att hålla dessa minoritetsspråk vid liv är både slöseri med tid och mycket ekonomiskt krävande.
-Jag tackar alla lärare på Sfi som hjälper oss för att vi ska kunna bli bättre på svenska språket.
-Språk skapades för flera tusen år sedan och vissa språk har tynat bort medan några nya har skapats.
-Samhället behöver flerspråkiga och vägen till kommunikation och till att begripa andras kulturer är ett språk.
-Om man kan fler språk har man fler möjligheter att använda sig av dem vilket leder till utveckling.
-Därför tycker jag att vi bör införa ett förbud mot främmande språk i statliga myndigheter och föreningar.
-Men jag anser först och främst att språket är som själen, det som ger oss livskraft, säregenhet och karaktär.
-På Sveriges riksdags hemsida kan man läsa om hur Sverige bidrar med att skydda dessa språk med hjälp av statligt bidrag.