rename stuff + new ud lab draft

This commit is contained in:
Arianna Masciolini
2025-03-21 13:50:23 +01:00
parent 3d9659d987
commit 19fb44c64d
72 changed files with 148 additions and 184 deletions

View File

@@ -1,120 +0,0 @@
# Lab 1: Grammatical analysis
This lab follows Chapters 1-4 in the course notes. Each part is started after the lecture on the corresponding chapter.
The assignments are submitted via Canvas.
## Chapter 1: explore the parallel UD treebank (PUD)
1. Go to [universaldependencies.org](https://universaldependencies.org/) and download Version 2.7+ treebanks
2. Look up the Parallel UD treebanks for those 21 languages that have it. They are named e.g. `UD_English-PUD/`
3. Select a language to compare with English.
4. Make statistics about the frequencies of POS tags and dependency labels in your language compared with English: find the top-20 tags/labels and their number of occurrences. What does this tell you about the language? (This can be done with shell or Python programming or, more easily, with the [deptreepy](https://github.com/aarneranta/deptreepy/) or [gf-ud](https://github.com/grammaticalFramework/gf-ud) tools. The latter is also available on the eduserv server.)
5. Convert the following four trees from CoNLL-U format to graphical trees by hand, on paper.
- a short English tree (5-10 words, of your choice) and its translation.
- a long English tree (>25 words) and its translation.
1. Draw word alignments for some non-trivial example in the PUD treebank, on paper.
Use the same trees as in the previous question.
What can you say about the syntactic differences between the languages?
## Chapter 2: design the morpological types of the major parts of speech in your selected language
1. It is enough to cover NOUN, ADJ, and VERB.
2. Use a traditional grammar book or a Wikipedia article to identify the inflectional and inherent features.
3. Then use data from PUD to check which morphological features actually occur in the treebank for that language.
## Chapter 3: UD syntax analysis
In this lab, you will annotate a bilingual corpus with UD.
You can choose between starting with an English corpus and translate it to a language of your choice, or start with a Swedish corpus to translate into English.
Your task is to:
1. write an CoNLL file analysing your chosen corpus
2. translate it
3. write a CoNLL file analysing your translation
### Option 1: English data
The English text is given in the file [`comp-syntax-corpus-english.txt`](comp-syntax-corpus-english.txt) in this directory.
The corpus is a combination of different sources, including the Parallel UD treebank (PUD).
If you want to cheat - or just check your own answer - you can look for those sentences in the official PUD. You can also compare your analyses with those of an automatic parser, such as [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/), which you can try directly from your browser. These automatic analyses must of course be taken with a grain of salt.
### Option 2: Swedish data
The Swedish text is given in the file [`comp-syntax-corpus-swedish.txt`](comp-syntax-corpus-swedish.txt) in this directory.
It consists of teacher-corrected sentences from the [Swedish Learner Language (SweLL) corpus](https://spraakbanken.gu.se/en/resources/swell-gold)[^1], which is currently being annotated in UD for the first time.
In this case, there is no "gold standard" to check your answers against, but by choosing this corpus you will directly contribute to an ongoing annotation effort.
Of course, you can still compare your solutions with [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/)'s automatic analyses.
In both corpora, the first few sentences are POS-tagged, with each word having the form
`word:<POS>`
Hint: you can initialize the task by converting each word or word:<POS> to a simplified CoNLL line with a dummy head (0) and label (dep), with proper position number of course.
The UD annotation that you produce manually can be simplified CoNLL, with just the fields
`position word postag head label`
Make sure that each field is exactly one token, so that the whole line has exactly 5 tokens.
This input can be automatically expanded to full CoNLL by adding undescores for the lemma, morphology, and other missing fields, as well as tabs between the fields (if you didn't use tabs already).
`position word _ postag _ _ head label _ _`
Example:
`7 world NOUN 4 nmod`
expands to
`7 world _ NOUN _ _ 4 nmod _ _`
(Unfortunately, the tabs are not visible in the md output.)
The conversion to full CoNLL can be done using Python or `gf-ud reduced2conll` (available on eduserv) or with [this script](https://gist.github.com/harisont/612a87d20f729aa3411041f873367fa2).
Once you have full CoNLL, you can use [deptreepy](https://github.com/aarneranta/deptreepy/), [gf-ud](https://github.com/grammaticalFramework/gf-ud) or [the online CoNNL-U viewer](https://universaldependencies.org/conllu_viewer.html) to visualize it.
With deptreepy, you will need to issue the command
`cat my-file.conllu | python deptreepy.py visualize_conllu > my-file.html`
which creates an HTML file you can open in you web browser.
If you use the gf-ud tool, the command is
`cat my-file.conllu | ./gf-ud conll2pdf`
which generates a PDF. However, this does not support all foreign characters.
(It is possible that you won't be able to visualize the trees directly on eduserv.
Building gf-ud and running this command on your machine requires Haskell and the GF libraries, as well as LaTeX to show the pdf output.)
## (Chapter 4: phrase structure analysis)
> __NOTE:__ chapter 4 is __not__ required in the 2024 edition of the course.
> You are of course welcome to try out these exercises, but they will not be graded.
### Prerequisites: get `gf-ud` to work
There are multiple ways to use `gf-ud`:
- using the version that is installed on eduserv
- installing a pre-compiled executable, available for Mac and Ubuntu machines at http://www.grammaticalframework.org/~aarne/software/
- compiling the source code, available at https://github.com/GrammaticalFramework/gf-ud. `gf-ud` can be built:
- with `make` provided that you have the GHC Haskell compiler and the gf-core libraries (available at https://github.com/GrammaticalFramework/gf-core) installed
- with the Haskell Stack tool, by running `stack install`. This will install all the necessary dependency automatically.
### Tasks
1. Construct (by hand) phrase structure trees for some of the sentences in the corpus used in Chapter 3, both for English and your chosen language.
2. Test the grammar at
https://github.com/GrammaticalFramework/gf-ud/blob/master/grammars/English.dbnf
on last week's corpus, both for English and your own language.
In practice, this means:
- running `gf-ud`'s `dbnf` command on (possibly POS-tagged) versions of the sentences in Chapter 3's corpus.
- comparing the CoNNL-U and parse trees obtained in this way with, respectively, your hand-drawn parse trees and the CoNNL-U trees from Chapter 3. Parse tree comparison can be qualitative, while CoNNL-U trees are to be compared quantitatively via `gf-ud eval`.
3. Modify the grammar to suit your language and test it on some of the UD treebanks by using `gf-ud eval`. Try to obtain a `udScore` above 0.60. You are welcome to explain the changes you make.
[^1]: to be precise, the sentences you will use have been extracted from [DaLAJ-GED-SuperLim 2.0](https://spraakbanken.gu.se/en/resources/dalaj-ged-superlim), a publicly available spinoff of the main SweLL corpus.

View File

@@ -1,24 +0,0 @@
Who:<PRON> are:<AUX> they:<PRON> ?:<PUNCT>
A:<DET> small:<ADJ> town:<NOUN> with:<ADP> two:<NUM> minarets:<NOUN> glides:<VERB> by:<ADV> .:<PUNCT>
I:<PRON> was:<AUX> just:<ADV> a:<DET> boy:<NOUN> with:<ADP> muddy:<ADJ> shoes:<NOUN> .:<PUNCT>
Shenzhen:<PROPN> 's:<PART> traffic:<NOUN> police:<NOUN> have:<AUX> opted:<VERB> for:<ADP> unconventional:<ADJ> penalties:<NOUN> before:<ADV> .:<PUNCT>.:<PUNCT>
The:<DET> study:<NOUN> of:<ADP> volcanoes:<NOUN> is:<AUX> called:<VERB> volcanology:<NOUN> ,:<PUNCT> sometimes:<ADV> spelled:<VERB> vulcanology:<NOUN> .:<PUNCT>
It:<PRON> was:<AUX> conducted:<VERB> just:<ADV> off:<ADP> the:<DET> Mexican:<ADJ> coast:<NOUN> from:<ADP> April:<PROPN> to:<ADP> June:<PROPN> .:<PUNCT>
":<PUNCT> Her:<PRON> voice:<NOUN> literally:<ADV> went:<VERB> around:<ADP> the:<DET> world:<NOUN> ,:<PUNCT> ":<PUNCT> Leive:<PROPN> said:<VERB> .:<PUNCT>
A:<DET> witness:<NOUN> told:<VERB> police:<NOUN> that:<SCONJ> the:<DET> victim:<NOUN> had:<AUX> attacked:<VERB> the:<DET> suspect:<NOUN> in:<ADP> April:<PROPN> .:<PUNCT>
It:<PRON> 's:<AUX> most:<ADV> obvious:<ADJ> when:<SSUBJ> a:<DET> celebrity:<NOUN> 's:<PART> name:<NOUN> is:<AUX> initially:<ADV> quite:<ADV> rare:<ADJ> .:<PUNCT>
This:<PRON> has:<AUX> not:<PART> stopped:<VERB> investors:<NOUN> flocking:<VERB> to:<PART> put:<VERB> their:<PRON> money:<NOUN> in:<ADP> the:<DET> funds:<NOUN> .:<PUNCT>
This:<DET> discordance:<NOUN> between:<ADP> economic:<ADJ> data:<NOUN> and:<CCONJ> political:<ADJ> rhetoric:<NOUN> is:<AUX> familiar:<ADJ> ,:<PUNCT> or:<CCONJ> should:<AUX> be:<AUX> .:<PUNCT>
The:<DET> feasibility:<NOUN> study:<NOUN> estimates:<VERB> that:<SCONJ> it:<PRON> would:<AUX> take:<VERB> passengers:<NOUN> about:<ADV> four:<NUM> minutes:<NOUN> to:<PART> cross:<VERB> the:<DET> Potomac:<PROPN> River:<PROPN> on:<ADP> the:<DET> gondola:<NOUN> .:<PUNCT>
he collected cards and traded them with the other boys
this crime carries a penalty of five years in prison
the news was carried to every village in the province
I carry these thoughts in the back of my head
Adam would have been carried over into the life eternal
the casings had rotted away and had to be replaced
she was incensed that this chit of a girl should dare to make a fool of her in front of the class
the landslide he had in the electoral college obscured the narrowness of a victory based on just 43% of the popular vote
United States troops now carry atropine and autoinjectors in their first-aid kits to use in case of organophosphate nerve agent poisoning
he may accomplish by craft in the long run what he cannot do by force and violence in the short one
it has been said that only a hierarchical society with a leisure class at the top can produce works of art
his ingenuous explanation that he would not have burned the church if he had not thought the bishop was in it

View File

@@ -1,24 +0,0 @@
Jag:<PRON> tycker:<VERB> att:<SCONJ> du:<PRON> ska:<AUX> börja:<VERB> med:<ADP> en:<DET> språkkurs:<NOUN>.:<PUNCT>
Flerspråkighet:<NOUN> gynnar:<VERB> oss:<PRON> även:<ADV> på:<ADP> arbetsmarknaden:<NOUN>.:<PUNCT>
Språket:<NOUN> är:<AUX> lätt:<ADJ> och:<CCONJ> jag:<PRON> kan:<AUX> läsa:<VERB> utan:<ADP> något:<DET> problem:<PRON>.:<PUNCT>
Man:<PRON> känner:<VERB> sig:<PRON> ensam:<ADJ> när:<SCONJ> man:<PRON> inte:<PART> kan:<AUX> prata:<VERB> språket:<NOUN> bra:<ADV>.:<PUNCT>
Det:<PRON> kan:<AUX> vara:<AUX> kroppsspråk:<NOUN> men:<CCONJ> främst:<ADV> sker:<VERB> det:<PRON> genom:<ADP> talet:<NOUN>.
Språket:<NOUN> är:<AUX> nyckeln:<NOUN> till:<ADP> alla:<DET> låsta:<ADJ> dörrar:<NOUN>,:<PUNCT> har:<AUX> vi:<PRON> hört:<VERB> flera:<ADJ> gånger:<NOUN>.:<PUNCT>
Att:<PART> kunna:<VERB> ett:<DET> språk:<NOUN> är:<AUX> en:<DET> av:<ADP> de:<DET> viktigaste:<ADJ> och:<CCONJ> värdefullaste:<ADJ> egenskaper:<NOUN> en:<DET> människa:<NOUN> kan:<AUX> ha:<VERB> så:<SCONJ> det:<PRON> är:<AUX> värt:<ADJ> mer:<ADV> än:<ADP> vad:<PRON> man:<PRON> tror:<VERB>.:<PUNCT>
Med:<ADP> andra:<ADJ> ord:<NOUN>,:<PUNCT> språket:<NOUN> är:<AUX> nyckeln:<NOUN> till:<ADP> alla:<DET> låsta:<ADJ> dörrar:<NOUN>,:<PUNCT> men:<CCONJ> det:<PRON> finns:<VERB> viktigare:<ADJ> saker:<NOUN> att:<PART> satsa:<VERB> på:<ADP> som:<PRON> jag:<PRON> kommer:<AUX> att:<PART> nämna:<VERB> längre:<ADV> ner:<ADV>.:<PUNCT>
Han:<PRON> kom:<VERB> till:<ADP> Sverige:<PROPN> för:<ADP> 4:<NUM> år:<NOUN> sedan:<ADV>,:<PUNCT> han:<PRON> kunde:<AUX> inte:<PART> tala:<VERB> svenska:<ADJ> språket:<NOUN>,<PUNCT> ingen:<DET> engelska:<NOUN>,:<PUNCT> han:<PRON> kunde:<AUX> i:<ADP> princip:<NOUN> inte:<PART> kommunicera:<VERB> med:<ADP> någon:<PRON> här<ADV>.:<PUNCT>
För:<ADP> det:<DET> första:<ADJ> hänger:<VERB> språket:<NOUN> ihop:<ADV> med:<ADP> tillhörighet:<NOUN>,:<PUNCT> särskilt:<ADV> för:<ADP> de:<DET> nya:<ADJ> invandrare:<NOUN> som:<PRON> har:<AUX> bestämt:<VERB> sig:<PRON> för:<ADP> att:<PART> flytta:<VERB> och:<CCONJ> bosätta:<VERB> sig:<PRON> i:<ADP> Sverige:<PROPN>.:<PUNCT>
Om:<SCONJ> alla:<PRON> hade:<AUX> talat:<VERB> samma:<DET> språk:<NOUN> hade:<AUX> det:<PRON> förmodligen:<ADV> inte:<PART> funnits:<VERB> något:<DET> utanförskap:<NOUN>,:<PUNCT> utan:<CCONJ> man:<PRON> hade:<AUX> fått:<VERB> en:<DET> typ:<NOUN> av:<ADP> gemenskap:<NOUN> där:<ADV> man:<PRON> delar:<VERB> samma:<DET> kultur:<NOUN>.:<PUNCT>
Att:<PART> lära:<VERB> sig:<PRON> ett:<DET> språk:<NOUN> är:<AUX> väldigt:<ADV> svårt:<ADJ>,:<PUNCT> speciellt:<ADV> för:<ADP> vuxna:<ADJ> människor:<NOUN>,:<PUNCT> och:<CCONJ> eftersom:<SCONJ> majoritetsspråket:<NOUN> blir:<VERB> en:<DET> viktig:<ADJ> del:<NOUN> i:<ADP> en:<DET> persons:<NOUN> liv:<NOUN> räcker:<VERB> det:<PRON> inte:<PART> att:<PART> tala:<VERB> det:<PRON> på:<ADP> söndagar:<NOUN> utan:<CCONJ> det:<PRON> måste:<AUX> läras:<VERB> in:<PART> som:<SCONJ> ett:<DET> modersmål:<NOUN>,:<PUNCT> vilket:<PRON> finansieras:<VERB> av:<ADP> oss:<PRON> skattebetalare:<NOUN>.:<PUNCT>
Avslutningsvis så vill jag förmedla att vi bör rädda världen innan språken.
Språket är ganska enkelt, och det är lätt att förstå vad romanen handlar om.
Det är även kostsamt för staten att se till att dessa minoritetsspråk lever kvar.
Låt mig säga att det är inte för sent att rädda de små språken, vi måste ta steget nu.
Att hålla dessa minoritetsspråk vid liv är både slöseri med tid och mycket ekonomiskt krävande.
Jag tackar alla lärare på Sfi som hjälper oss för att vi ska kunna bli bättre på svenska språket.
Språk skapades för flera tusen år sedan och vissa språk har tynat bort medan några nya har skapats.
Samhället behöver flerspråkiga och vägen till kommunikation och till att begripa andras kulturer är ett språk.
Om man kan fler språk har man fler möjligheter att använda sig av dem vilket leder till utveckling.
Därför tycker jag att vi bör införa ett förbud mot främmande språk i statliga myndigheter och föreningar.
Men jag anser först och främst att språket är som själen, det som ger oss livskraft, säregenhet och karaktär.
På Sveriges riksdags hemsida kan man läsa om hur Sverige bidrar med att skydda dessa språk med hjälp av statligt bidrag.

View File

@@ -1,38 +0,0 @@
John prefers wine to beer.
John definitely prefers beer to wine today if he is hungry.
John walks.
John does not walk.
John has walked.
John is old.
John is a doctor.
John is here.
Probably the best beer in the world.
Why?
She will.
The black cat was seen.
There is an elephant in the room.
It is too cold in the room.
She gave me a hint.
She gave a hint to me.
John saw a man with a telescope.
She walks today.
genetically modified
Why does she walk?
Why does she not walk?
Every man in the city works in the city.
She is here.
She has been here.
The reason is that I am tired.
He can sing.
Does he sing?
Would he have sung?
He would not have been tired.
He called me a "bad loser".

View File

@@ -1,63 +0,0 @@
# UDPipe: quick instructions
## Download and install
The simplest way to use UDPipe is to install the binaries for UDPipe-1, which exist for several operating systems.
They can be downloaded from
https://github.com/ufal/udpipe/releases/download/v1.3.0/udpipe-1.3.0-bin.zip
When you have downloaded and unzipped this file, you will find the binary for your system in a subdirectory, for instance,
```
udpipe-1.3.0-bin/bin-macos/udpipe
```
is the binary for MacOS.
If you include this directory on your PATH, you can run the command `udpipe` from anywhere.
Running the parser for a language requires a model for that language.
Models can be accessed via
https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3131
This page includes a long list of models and a command to install them all.
If you need only some of them, you can do, for instance,
```
$ wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3131//english-lines-ud-2.5-191206.udpipe
```
## Running the parser
Assuming that you have the binary `udpipe` and the model `english-lines-ud-2.5-191206.udpipe` on you path, you can parse a single sentence with
```
$ echo "my hovercraft is full of eels" | udpipe --tokenize --tag --parse english-lines-ud-2.5-191206.udpipe
```
The result is a UD tree in the CoNLL-U notation,
```
# newdoc
# newpar
# sent_id = 1
# text = my hovercraft is full of eels
1 my I PRON P1SG-GEN Number=Sing|Person=1|Poss=Yes|PronType=Prs nmod:poss _ _
2 hovercraft hovercraft NOUN SG-NOM Number=Sing 4 nsubj _ _
3 is be AUX PRES Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin cop _ _
4 full full ADJ POS Degree=Pos 0 root _ _
5 of of ADP _ _ 6 case _ _
6 eels eel NOUN PL-NOM Number=Plur 4 nmod _ SpacesAfter=\n
```
If you also have `gfud` and `pdflatex` on your path, you can pipe the result into `gfud conll2pdf` to see the graphical tree.
As `udpipe` reads standard input, you can read it from a file, such as `lecture3-examples.txt`:
```
$ cat <myfile> | udpipe --tokenize --tag --parse <model>
```
Notice that sentences in that file must either end with a period or be separated by empty lines, because otherwise the whole file is parsed as one sentence.
## Training a new model
If you have a treebank in the CoNLL-U format, you can use it for training a new model, with
```
$ cat <myfile>.conllu | udpipe --tokenizer none --tagger none --train <myfile>.udpipe
```