This commit is contained in:
Arianna Masciolini
2025-03-26 14:25:16 +01:00
54 changed files with 193 additions and 554 deletions


@@ -1,120 +1,95 @@
# Lab 1: Grammatical analysis
# Lab 1: Multilingual generation and translation
In this lab, you will implement the concrete syntax of a grammar for a language of your choice.
The abstract syntax is given in the directory [`../grammar/abstract/`](../grammar/abstract/) and an example concrete syntax for English can be found in [`../grammar/english/`](../grammar/english/).
This lab follows Chapters 1-4 in the course notes. Each part is started after the lecture on the corresponding chapter.
The assignments are submitted via Canvas.
## Chapter 1: explore the parallel UD treebank (PUD)
## Part 1: design the morphological types of the major parts of speech in your selected language
1. Go to [universaldependencies.org](https://universaldependencies.org/) and download Version 2.7+ treebanks
2. Look up the Parallel UD treebanks for those 21 languages that have it. They are named e.g. `UD_English-PUD/`
3. Select a language to compare with English.
4. Make statistics about the frequencies of POS tags and dependency labels in your language compared with English: find the top-20 tags/labels and their number of occurrences. What does this tell you about the language? (This can be done with shell or Python programming or, more easily, with the [deptreepy](https://github.com/aarneranta/deptreepy/) or [gf-ud](https://github.com/grammaticalFramework/gf-ud) tools. The latter is also available on the eduserv server.)
5. Convert the following four trees from CoNLL-U format to graphical trees by hand, on paper.
- a short English tree (5-10 words, of your choice) and its translation.
- a long English tree (>25 words) and its translation.
1. Draw word alignments for some non-trivial example in the PUD treebank, on paper.
Use the same trees as in the previous question.
What can you say about the syntactic differences between the languages?
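The frequency statistics from step 4 can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for deptreepy or gf-ud: the function name `top_counts` is hypothetical, and the code assumes a well-formed CoNLL-U file with ten tab-separated columns per token line.

```python
from collections import Counter

# Zero-based positions of the UPOS and DEPREL columns in CoNLL-U.
UPOS, DEPREL = 3, 7


def top_counts(conllu_text: str, column: int, n: int = 20) -> list[tuple[str, int]]:
    """Return the n most frequent values found in one CoNLL-U column."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in fields[0] or "." in fields[0]:
            continue
        counts[fields[column]] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    sample = (
        "1\tdogs\t_\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
        "2\tchase\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
        "3\tcats\t_\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    )
    print(top_counts(sample, UPOS))
```

Running the same function on the English and the other-language PUD files, once with `UPOS` and once with `DEPREL`, gives the two top-20 lists to compare.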
## Chapter 2: design the morphological types of the major parts of speech in your selected language
1. It is enough to cover NOUN, ADJ, and VERB.
2. Use a traditional grammar book or a Wikipedia article to identify the inflectional and inherent features.
3. Then use data from PUD to check which morphological features actually occur in the treebank for that language.
## Chapter 3: UD syntax analysis
## After lecture 6
In this lab, you will annotate a bilingual corpus with UD.
You can choose between starting with an English corpus and translate it to a language of your choice, or start with a Swedish corpus to translate into English.
1. Design a morphology for the main lexical types (N, A, V) with parameters and a couple of paradigms.
2. Test it by implementing the lexicon in the MicroLang module. You need to define lincat N,A,V,V2 as well as the paradigms in MicroResource.
Your task is to:
*To deliver*: the lexicon part of files MicroGrammarX.gf and MicroResourceX.gf for your language of choice X. Follow the structure of MicroGrammarEng and MicroResourceEng when preparing these.
1. write a CoNLL file analysing your chosen corpus
2. translate it
3. write a CoNLL file analysing your translation
## After lecture 7
### Option 1: English data
The English text is given in the file [`comp-syntax-corpus-english.txt`](comp-syntax-corpus-english.txt) in this directory.
The corpus is a combination of different sources, including the Parallel UD treebank (PUD).
If you want to cheat - or just check your own answer - you can look for those sentences in the official PUD. You can also compare your analyses with those of an automatic parser, such as [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/), which you can try directly from your browser. These automatic analyses must of course be taken with a grain of salt.
1. Define the linearization types of main phrasal categories - the remaining categories in MicroLang.
2. Define the rest of the linearization rules in MicroLang.
### Option 2: Swedish data
The Swedish text is given in the file [`comp-syntax-corpus-swedish.txt`](comp-syntax-corpus-swedish.txt) in this directory.
It consists of teacher-corrected sentences from the [Swedish Learner Language (SweLL) corpus](https://spraakbanken.gu.se/en/resources/swell-gold)[^1], which is currently being annotated in UD for the first time.
In this case, there is no "gold standard" to check your answers against, but by choosing this corpus you will directly contribute to an ongoing annotation effort.
Of course, you can still compare your solutions with [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/)'s automatic analyses.
*To deliver*: MicroLangX and MicroResourceX for your language of choice, with the lexicon part from Session 5 completed with syntax part.
In both corpora, the first few sentences are POS-tagged, with each word having the form
## After lecture 8
`word:<POS>`
1. Try out the applications in `../python` and read its README carefully.
2. Add a concrete syntax for your language to one of the grammars
in `../python/`, either `Query` or `Draw`.
The simplest way to do this
is first to copy the `Eng` grammar and then to change the words; the
syntax may work well as it is. Even though it can be a bit unnatural,
it should be in a wide sense natural.
3. Compile the grammar with `gf -make Query???.gf` so that your grammar
gets included (the same for `Draw`).
4. Generate phrases in GF by first importing your pgf file and then
issuing the command `gt | l -treebank`; fix your grammar if it looks
too bad.
5. Test the corresponding Python application with your language.
Hint: you can initialize the task by converting each word or word:<POS> to a simplified CoNLL line with a dummy head (0) and label (dep), with proper position number of course.
The Python code with embedded GF grammars will be explained in greater
detail in Lecture 9.
The UD annotation that you produce manually can be simplified CoNLL, with just the fields
*To deliver*: your grammar module.
`position word postag head label`
*Deadline*: 29 May 2024. Demo your grammars (both Micro and this one) at
the last lecture of the course!
Make sure that each field is exactly one token, so that the whole line has exactly 5 tokens.
This input can be automatically expanded to full CoNLL by adding underscores for the lemma, morphology, and other missing fields, as well as tabs between the fields (if you didn't use tabs already).
## A method for testing your Micro grammar
`position word _ postag _ _ head label _ _`
Since MicroLang is a proper part of the RGL, it can be easily implemented as an application grammar.
How to do this is shown in `grammar/functor/`, where the implementation consists of two files:
- `MicroLangFunctor.gf` which is a generic implementation working for all RGL languages,
- `MicroLangFunctorEng.gf` which is a *functor instantiation* for English, easily reproducible for other languages than `Eng`.
Example:
To use this for testing, you can take the following steps:
`7 world NOUN 4 nmod`
1. Build a functor instantiation for your language by copying `MicroLangFunctorEng.gf` and changing `Eng` in the file name and inside the file to your language code.
expands to
2. Use GF to create a testfile by random generation:
```
$ echo "gr -number=1000 | l -tabtreebank" | gf english/MicroLangEng.gf functor/MicroLangFunctorEng.gf >test.tmp
```
`7 world _ NOUN _ _ 4 nmod _ _`
3. Inspect the resulting file `test.tmp`.
But you can also use Unix `cut` to create separate files for the two versions of the grammar and `diff` to compare them:
```
$ cut -f2 test.tmp >test1.tmp
$ cut -f3 test.tmp >test2.tmp
$ diff test1.tmp test2.tmp
(Unfortunately, the tabs are not visible in the md output.)
The conversion to full CoNLL can be done using Python or `gf-ud reduced2conll` (available on eduserv) or with [this script](https://gist.github.com/harisont/612a87d20f729aa3411041f873367fa2).
52c52
< the hot fire teachs her
---
> the hot fire teaches her
69c69
< the man teachs the apples
---
> the man teaches the apples
122c122
```
As seen from the result in this case, our implementation has a wrong inflection of the verb "teach".
Once you have full CoNLL, you can use [deptreepy](https://github.com/aarneranta/deptreepy/), [gf-ud](https://github.com/grammaticalFramework/gf-ud) or [the online CoNLL-U viewer](https://universaldependencies.org/conllu_viewer.html) to visualize it.
The Mini grammar can be tested in the same way, by building a reference implementation using the functor in `functor/`.
With deptreepy, you will need to issue the command
`cat my-file.conllu | python deptreepy.py visualize_conllu > my-file.html`
which creates an HTML file you can open in your web browser.
If you use the gf-ud tool, the command is
`cat my-file.conllu | ./gf-ud conll2pdf`
which generates a PDF. However, this does not support all foreign characters.
(It is possible that you won't be able to visualize the trees directly on eduserv.
Building gf-ud and running this command on your machine requires Haskell and the GF libraries, as well as LaTeX to show the pdf output.)
## (Chapter 4: phrase structure analysis)
> __NOTE:__ chapter 4 is __not__ required in the 2024 edition of the course.
> You are of course welcome to try out these exercises, but they will not be graded.
### Prerequisites: get `gf-ud` to work
There are multiple ways to use `gf-ud`:
- using the version that is installed on eduserv
- installing a pre-compiled executable, available for Mac and Ubuntu machines at http://www.grammaticalframework.org/~aarne/software/
- compiling the source code, available at https://github.com/GrammaticalFramework/gf-ud. `gf-ud` can be built:
- with `make` provided that you have the GHC Haskell compiler and the gf-core libraries (available at https://github.com/GrammaticalFramework/gf-core) installed
- with the Haskell Stack tool, by running `stack install`. This will install all the necessary dependencies automatically.
### Tasks
1. Construct (by hand) phrase structure trees for some of the sentences in the corpus used in Chapter 3, both for English and your chosen language.
2. Test the grammar at
https://github.com/GrammaticalFramework/gf-ud/blob/master/grammars/English.dbnf
on last week's corpus, both for English and your own language.
In practice, this means:
- running `gf-ud`'s `dbnf` command on (possibly POS-tagged) versions of the sentences in Chapter 3's corpus.
- comparing the CoNLL-U and parse trees obtained in this way with, respectively, your hand-drawn parse trees and the CoNLL-U trees from Chapter 3. Parse tree comparison can be qualitative, while CoNLL-U trees are to be compared quantitatively via `gf-ud eval`.
3. Modify the grammar to suit your language and test it on some of the UD treebanks by using `gf-ud eval`. Try to obtain a `udScore` above 0.60. You are welcome to explain the changes you make.
[^1]: to be precise, the sentences you will use have been extracted from [DaLAJ-GED-SuperLim 2.0](https://spraakbanken.gu.se/en/resources/dalaj-ged-superlim), a publicly available spinoff of the main SweLL corpus.


@@ -1,63 +0,0 @@
# UDPipe: quick instructions
## Download and install
The simplest way to use UDPipe is to install the binaries for UDPipe-1, which exist for several operating systems.
They can be downloaded from
https://github.com/ufal/udpipe/releases/download/v1.3.0/udpipe-1.3.0-bin.zip
When you have downloaded and unzipped this file, you will find the binary for your system in a subdirectory, for instance,
```
udpipe-1.3.0-bin/bin-macos/udpipe
```
is the binary for MacOS.
If you include this directory on your PATH, you can run the command `udpipe` from anywhere.
Running the parser for a language requires a model for that language.
Models can be accessed via
https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3131
This page includes a long list of models and a command to install them all.
If you need only some of them, you can do, for instance,
```
$ wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3131//english-lines-ud-2.5-191206.udpipe
```
## Running the parser
Assuming that you have the binary `udpipe` and the model `english-lines-ud-2.5-191206.udpipe` on your path, you can parse a single sentence with
```
$ echo "my hovercraft is full of eels" | udpipe --tokenize --tag --parse english-lines-ud-2.5-191206.udpipe
```
The result is a UD tree in the CoNLL-U notation,
```
# newdoc
# newpar
# sent_id = 1
# text = my hovercraft is full of eels
1 my I PRON P1SG-GEN Number=Sing|Person=1|Poss=Yes|PronType=Prs 2 nmod:poss _ _
2 hovercraft hovercraft NOUN SG-NOM Number=Sing 4 nsubj _ _
3 is be AUX PRES Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 4 cop _ _
4 full full ADJ POS Degree=Pos 0 root _ _
5 of of ADP _ _ 6 case _ _
6 eels eel NOUN PL-NOM Number=Plur 4 nmod _ SpacesAfter=\n
```
If you also have `gfud` and `pdflatex` on your path, you can pipe the result into `gfud conll2pdf` to see the graphical tree.
As `udpipe` reads standard input, you can read it from a file, such as `lecture3-examples.txt`:
```
$ cat <myfile> | udpipe --tokenize --tag --parse <model>
```
Notice that sentences in that file must either end with a period or be separated by empty lines, because otherwise the whole file is parsed as one sentence.
## Training a new model
If you have a treebank in the CoNLL-U format, you can use it for training a new model, with
```
$ cat <myfile>.conllu | udpipe --tokenizer none --tagger none --train <myfile>.udpipe
```


@@ -1,87 +0,0 @@
# Lab 2: Multilingual generation and translation
This lab corresponds to Chapters 5 to 9 of the Notes, but follows them only loosely.
Therefore we will structure it according to the exercise sessions
rather than chapters.
The abstract syntax is given in the subdirectory grammars/abstract/
## After lecture 6
1. Design a morphology for the main lexical types (N, A, V) with parameters and a couple of paradigms.
2. Test it by implementing the lexicon in the MicroLang module. You need to define lincat N,A,V,V2 as well as the paradigms in MicroResource.
*To deliver*: the lexicon part of files MicroGrammarX.gf and MicroResourceX.gf for your language of choice X. Follow the structure of MicroGrammarEng and MicroResourceEng when preparing these.
## After lecture 7
1. Define the linearization types of main phrasal categories - the remaining categories in MicroLang.
2. Define the rest of the linearization rules in MicroLang.
*To deliver*: MicroLangX and MicroResourceX for your language of choice, with the lexicon part from Session 5 completed with syntax part.
## After lecture 8
1. Try out the applications in `../python` and read its README carefully.
2. Add a concrete syntax for your language to one of the grammars
in `../python/`, either `Query` or `Draw`.
The simplest way to do this
is first to copy the `Eng` grammar and then to change the words; the
syntax may work well as it is. Even though it can be a bit unnatural,
it should be in a wide sense natural.
3. Compile the grammar with `gf -make Query???.gf` so that your grammar
gets included (the same for `Draw`).
4. Generate phrases in GF by first importing your pgf file and then
issuing the command `gt | l -treebank`; fix your grammar if it looks
too bad.
5. Test the corresponding Python application with your language.
The Python code with embedded GF grammars will be explained in greater
detail in Lecture 9.
*To deliver*: your grammar module.
*Deadline*: 29 May 2024. Demo your grammars (both Micro and this one) at
the last lecture of the course!
## A method for testing your Micro grammar
Since MicroLang is a proper part of the RGL, it can be easily implemented as an application grammar.
How to do this is shown in `grammar/functor/`, where the implementation consists of two files:
- `MicroLangFunctor.gf` which is a generic implementation working for all RGL languages,
- `MicroLangFunctorEng.gf` which is a *functor instantiation* for English, easily reproducible for other languages than `Eng`.
To use this for testing, you can take the following steps:
1. Build a functor instantiation for your language by copying `MicroLangFunctorEng.gf` and changing `Eng` in the file name and inside the file to your language code.
2. Use GF to create a testfile by random generation:
```
$ echo "gr -number=1000 | l -tabtreebank" | gf english/MicroLangEng.gf functor/MicroLangFunctorEng.gf >test.tmp
```
3. Inspect the resulting file `test.tmp`.
But you can also use Unix `cut` to create separate files for the two versions of the grammar and `diff` to compare them:
```
$ cut -f2 test.tmp >test1.tmp
$ cut -f3 test.tmp >test2.tmp
$ diff test1.tmp test2.tmp
52c52
< the hot fire teachs her
---
> the hot fire teaches her
69c69
< the man teachs the apples
---
> the man teaches the apples
122c122
```
As seen from the result in this case, our implementation has a wrong inflection of the verb "teach".
The Mini grammar can be tested in the same way, by building a reference implementation using the functor in `functor/`.
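The `cut`/`diff` comparison above can also be sketched in Python. This is only an illustration under the same assumption the `cut -f2` and `cut -f3` commands make, namely that the second and third tab-separated fields of each line in `test.tmp` hold the linearizations of your own grammar and of the functor-based reference; the function name `find_mismatches` is hypothetical.

```python
def find_mismatches(lines: list[str]) -> list[tuple[int, str, str]]:
    """Return (line number, own output, reference output) for each differing line."""
    mismatches = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        # Lines with fewer than three fields carry no pair to compare.
        if len(fields) < 3:
            continue
        # Assumption: field 2 is your grammar, field 3 the reference (as in the example).
        own, reference = fields[1], fields[2]
        if own != reference:
            mismatches.append((number, own, reference))
    return mismatches


if __name__ == "__main__":
    with open("test.tmp", encoding="utf-8") as handle:
        for number, own, reference in find_mismatches(handle.readlines()):
            print(f"{number}: {own!r} vs {reference!r}")
```

Unlike `diff`, this reports each differing pair on one line, which can be easier to scan when many sentences share the same inflection error.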


@@ -1,92 +0,0 @@
abstract Doctor = {
flags startcat = Phrase ;
cat
Phrase ; -- has she slept?
Fact ; -- she sleeps
Action ; -- sleep
Property ; -- be a doctor
Profession ; -- doctor
Person ; -- she
Place ; -- the hospital
Substance ; -- drugs
Illness ; -- fever
fun
presPosPhrase : Fact -> Phrase ; -- she sleeps
presNegPhrase : Fact -> Phrase ; -- she doesn't sleep
pastPosPhrase : Fact -> Phrase ; -- she has slept
pastNegPhrase : Fact -> Phrase ; -- she has not slept
presQuestionPhrase : Fact -> Phrase ; -- does she sleep
pastQuestionPhrase : Fact -> Phrase ; -- has she slept
impPosPhrase : Action -> Phrase ; -- eat
impNegPhrase : Action -> Phrase ; -- don't eat
actionFact : Person -> Action -> Fact ; -- she vaccinates you
propertyFact : Person -> Property -> Fact ; -- she is a doctor
isProfessionProperty : Profession -> Property ; -- be a doctor
isAtPlaceProperty : Place -> Property ; -- be at the hospital
haveIllnessProperty : Illness -> Property ; -- have a fever
needProfessionProperty : Profession -> Property ; -- need a doctor
theProfessionPerson : Profession -> Person ; -- the doctor
iMascPerson : Person ;
iFemPerson : Person ;
youMascPerson : Person ;
youFemPerson : Person ;
hePerson : Person ;
shePerson : Person ;
goToAction : Place -> Action ; -- go to the hospital
stayAtAction : Place -> Action ; -- stay at home
vaccinateAction : Person -> Action ; -- vaccinate you
examineAction : Person -> Action ; -- examine you
takeSubstanceAction : Substance -> Action ; -- take drugs
coughAction : Action ;
breatheAction : Action ;
vomitAction : Action ;
sleepAction : Action ;
undressAction : Action ;
dressAction : Action ;
eatAction : Action ;
drinkAction : Action ;
smokeAction : Action ;
measureTemperatureAction : Action ;
measureBloodPressureAction : Action ;
hospitalPlace : Place ;
homePlace : Place ;
schoolPlace : Place ;
workPlace : Place ;
doctorProfession : Profession ;
nurseProfession : Profession ;
interpreterProfession : Profession ;
bePregnantProperty : Property ;
beIllProperty : Property ;
beWellProperty : Property ;
beDeadProperty : Property ;
haveAllergiesProperty : Property ;
havePainsProperty : Property ;
haveChildrenProperty : Property ;
feverIllness : Illness ;
fluIllness : Illness ;
headacheIllness : Illness ;
diarrheaIllness : Illness ;
heartDiseaseIllness : Illness ;
lungDiseaseIllness : Illness ;
hypertensionIllness : Illness ;
alcoholSubstance : Substance ;
medicineSubstance : Substance ;
drugsSubstance : Substance ;
}


@@ -1,110 +0,0 @@
concrete DoctorEng of Doctor =
open
SyntaxEng,
ParadigmsEng,
Prelude
in {
-- application using standard RGL
lincat
Phrase = Utt ;
Fact = Cl ;
Action = VP ;
Property = VP ;
Profession = CN ;
Person = NP ;
Place = {at,to : Adv} ;
Substance = NP ;
Illness = NP ;
lin
presPosPhrase fact = mkUtt (mkS fact) ;
presNegPhrase fact = mkUtt (mkS negativePol fact) ;
pastPosPhrase fact = mkUtt (mkS anteriorAnt fact) ;
pastNegPhrase fact = mkUtt (mkS anteriorAnt negativePol fact) ;
presQuestionPhrase fact = mkUtt (mkQS (mkQCl fact)) ;
pastQuestionPhrase fact = mkUtt (mkQS anteriorAnt (mkQCl fact)) ;
impPosPhrase action = mkUtt (mkImp action) ;
impNegPhrase action = mkUtt negativePol (mkImp action) ;
actionFact person action = mkCl person action ;
propertyFact person property = mkCl person property ;
isProfessionProperty profession = mkVP (mkNP a_Det profession) ;
needProfessionProperty profession = mkVP need_V2 (mkNP a_Det profession) ;
isAtPlaceProperty place = mkVP place.at ;
haveIllnessProperty illness = mkVP have_V2 illness ;
theProfessionPerson profession = mkNP the_Det profession ;
iMascPerson = i_NP ;
iFemPerson = i_NP ;
youMascPerson = you_NP ;
youFemPerson = you_NP ;
hePerson = he_NP ;
shePerson = she_NP ;
goToAction place = mkVP (mkVP go_V) place.to ;
stayAtAction place = mkVP (mkVP stay_V) place.at ;
vaccinateAction person = mkVP vaccinate_V2 person ;
examineAction person = mkVP examine_V2 person ;
takeSubstanceAction substance = mkVP take_V2 substance ;
-- end of what could be a functor
--------------------------------
coughAction = mkVP (mkV "cough") ;
breatheAction = mkVP (mkV "breathe") ;
vomitAction = mkVP (mkV "vomit") ;
sleepAction = mkVP (mkV "sleep" "slept" "slept") ;
undressAction = mkVP (mkVP take_V2 (mkNP thePl_Det (mkN "clothe"))) (pAdv "off") ;
dressAction = mkVP (mkVP put_V2 (mkNP thePl_Det (mkN "clothe"))) (pAdv "on") ;
eatAction = mkVP (mkV "eat" "ate" "eaten") ;
drinkAction = mkVP (mkV "drink" "drank" "drunk") ;
smokeAction = mkVP (mkV "smoke") ;
measureTemperatureAction = mkVP (mkV2 (mkV "measure")) (mkNP the_Det (mkN "body temperature")) ;
measureBloodPressureAction = mkVP (mkV2 (mkV "measure")) (mkNP the_Det (mkN "blood pressure")) ;
hospitalPlace = {at = pAdv "at the hospital" ; to = pAdv "to the hospital"} ;
homePlace = {at = pAdv "at home" ; to = pAdv "home"} ;
schoolPlace = {at = pAdv "at school" ; to = pAdv "to school"} ;
workPlace = {at = pAdv "at work" ; to = pAdv "to work"} ;
doctorProfession = mkCN (mkN "doctor") ;
nurseProfession = mkCN (mkN "nurse") ;
interpreterProfession = mkCN (mkN "interpreter") ;
bePregnantProperty = mkVP (mkA "pregnant") ;
beIllProperty = mkVP (mkA "ill") ;
beWellProperty = mkVP (mkA "well") ;
beDeadProperty = mkVP (mkA "dead") ;
haveAllergiesProperty = mkVP have_V2 (mkNP aPl_Det (mkN "allergy")) ;
havePainsProperty = mkVP have_V2 (mkNP aPl_Det (mkN "pain")) ;
haveChildrenProperty = mkVP have_V2 (mkNP aPl_Det (mkN "child" "children")) ;
feverIllness = mkNP a_Det (mkN "fever") ;
fluIllness = mkNP a_Det (mkN "flu") ;
headacheIllness = mkNP a_Det (mkN "headache") ;
diarrheaIllness = mkNP a_Det (mkN "diarrhea") ;
heartDiseaseIllness = mkNP a_Det (mkN "heart disease") ;
lungDiseaseIllness = mkNP a_Det (mkN "lung disease") ;
hypertensionIllness = mkNP (mkN "hypertension") ;
alcoholSubstance = mkNP (mkN "alcohol") ;
medicineSubstance = mkNP a_Det (mkN "drug") ;
drugsSubstance = mkNP aPl_Det (mkN "drug") ;
oper
pAdv : Str -> Adv = ParadigmsEng.mkAdv ;
go_V = mkV "go" "went" "gone" ;
stay_V = mkV "stay" ;
need_V2 = mkV2 (mkV "need") ;
take_V2 = mkV2 (mkV "take" "took" "taken") ;
put_V2 = mkV2 (mkV "put" "put" "put") ;
vaccinate_V2 = mkV2 (mkV "vaccinate") ;
examine_V2 = mkV2 (mkV "examine") ;
}


@@ -1,117 +0,0 @@
--# -path=.:../abstract:../english:../api
-- model implementation using Mini RGL
concrete DoctorMiniEng of Doctor =
open
MiniSyntaxEng,
MiniParadigmsEng,
Prelude
in {
-- application using your own Mini* modules
lincat
Phrase = Utt ;
Fact = Cl ;
Action = VP ;
Property = VP ;
Profession = CN ;
Person = NP ;
Place = {at,to : Adv} ;
Substance = NP ;
Illness = NP ;
lin
presPosPhrase fact = mkUtt (mkS fact) ;
presNegPhrase fact = mkUtt (mkS negativePol fact) ;
pastPosPhrase fact = mkUtt (mkS anteriorAnt fact) ;
pastNegPhrase fact = mkUtt (mkS anteriorAnt negativePol fact) ;
-- presQuestionPhrase fact = mkUtt (mkQS (mkQCl fact)) ;
-- pastQuestionPhrase fact = mkUtt (mkQS anteriorAnt (mkQCl fact)) ;
presQuestionPhrase fact = let p : Utt = mkUtt (mkQS (mkQCl fact)) in p ** {s = p.s ++ SOFT_BIND ++ "?"} ;
pastQuestionPhrase fact = let p : Utt = mkUtt (mkQS anteriorAnt (mkQCl fact)) in p ** {s = p.s ++ SOFT_BIND ++ "?"} ;
impPosPhrase action = mkUtt (mkImp action) ;
impNegPhrase action = mkUtt negativePol (mkImp action) ;
actionFact person action = mkCl person action ;
propertyFact person property = mkCl person property ;
isProfessionProperty profession = mkVP (mkNP a_Det profession) ;
needProfessionProperty profession = mkVP need_V2 (mkNP a_Det profession) ;
isAtPlaceProperty place = mkVP place.at ;
haveIllnessProperty illness = mkVP have_V2 illness ;
theProfessionPerson profession = mkNP the_Det profession ;
iMascPerson = i_NP ;
iFemPerson = i_NP ;
youMascPerson = you_NP ;
youFemPerson = you_NP ;
hePerson = he_NP ;
shePerson = she_NP ;
goToAction place = mkVP (mkVP go_V) place.to ;
stayAtAction place = mkVP (mkVP stay_V) place.at ;
vaccinateAction person = mkVP vaccinate_V2 person ;
examineAction person = mkVP examine_V2 person ;
takeSubstanceAction substance = mkVP take_V2 substance ;
-- end of what could be a functor
--------------------------------
coughAction = mkVP (mkV "cough") ;
breatheAction = mkVP (mkV "breathe") ;
vomitAction = mkVP (mkV "vomit") ;
sleepAction = mkVP (mkV "sleep" "slept" "slept") ;
undressAction = mkVP (mkVP take_V2 (mkNP thePl_Det (mkN "clothe"))) (pAdv "off") ;
dressAction = mkVP (mkVP put_V2 (mkNP thePl_Det (mkN "clothe"))) (pAdv "on") ;
eatAction = mkVP (mkV "eat" "ate" "eaten") ;
drinkAction = mkVP (mkV "drink" "drank" "drunk") ;
smokeAction = mkVP (mkV "smoke") ;
measureTemperatureAction = mkVP (mkV2 (mkV "measure")) (mkNP the_Det (mkN "body temperature")) ;
measureBloodPressureAction = mkVP (mkV2 (mkV "measure")) (mkNP the_Det (mkN "blood pressure")) ;
hospitalPlace = {at = pAdv "at the hospital" ; to = pAdv "to the hospital"} ;
homePlace = {at = pAdv "at home" ; to = pAdv "home"} ;
schoolPlace = {at = pAdv "at school" ; to = pAdv "to school"} ;
workPlace = {at = pAdv "at work" ; to = pAdv "to work"} ;
doctorProfession = mkCN (mkN "doctor") ;
nurseProfession = mkCN (mkN "nurse") ;
interpreterProfession = mkCN (mkN "interpreter") ;
bePregnantProperty = mkVP (mkA "pregnant") ;
beIllProperty = mkVP (mkA "ill") ;
beWellProperty = mkVP (mkA "well") ;
beDeadProperty = mkVP (mkA "dead") ;
haveAllergiesProperty = mkVP have_V2 (mkNP aPl_Det (mkN "allergy")) ;
havePainsProperty = mkVP have_V2 (mkNP aPl_Det (mkN "pain")) ;
haveChildrenProperty = mkVP have_V2 (mkNP aPl_Det (mkN "child" "children")) ;
feverIllness = mkNP a_Det (mkN "fever") ;
fluIllness = mkNP a_Det (mkN "flu") ;
headacheIllness = mkNP a_Det (mkN "headache") ;
diarrheaIllness = mkNP a_Det (mkN "diarrhea") ;
heartDiseaseIllness = mkNP a_Det (mkN "heart disease") ;
lungDiseaseIllness = mkNP a_Det (mkN "lung disease") ;
hypertensionIllness = mkNP (mkN "hypertension") ;
alcoholSubstance = mkNP (mkN "alcohol") ;
medicineSubstance = mkNP a_Det (mkN "drug") ;
drugsSubstance = mkNP aPl_Det (mkN "drug") ;
oper
pAdv : Str -> Adv = MiniParadigmsEng.mkAdv ;
go_V = mkV "go" "went" "gone" ;
stay_V = mkV "stay" ;
need_V2 = mkV2 (mkV "need") ;
take_V2 = mkV2 (mkV "take" "took" "taken") ;
put_V2 = mkV2 (mkV "put" "put" "put") ;
vaccinate_V2 = mkV2 (mkV "vaccinate") ;
examine_V2 = mkV2 (mkV "examine") ;
}

116
lab3/README.md Normal file

@@ -0,0 +1,116 @@
# Lab 3: Universal Dependencies
This lab is divided into two parts.
In [part 1](#part-1-ud-annotation), you will create a small parallel UD treebank for English/Swedish and a language of your choice.
In [part 2](#part-2-ud-parsing), you will train a parsing model and evaluate it on your treebank.
## Part 1: UD annotation
The goal of this part of the lab is for you to become able to contribute to a UD annotation project. You will familiarize yourself with the CoNLL-U format and annotate your own parallel UD treebank.
### Step 1: familiarize yourself with the CoNLL-U format
Go to [universaldependencies.org](https://universaldependencies.org/) and download a treebank for a language of your choice.
Choose a short (5-10 tokens) and a long (>25 tokens) sentence and convert them from CoNLL-U to graphical trees by hand.
### Step 2: choose a corpus
Choose one of the two corpora provided in this folder:
- [`comp-syntax-corpus-english.txt`](comp-syntax-corpus-english.txt) is a combination of __English__ sentences from different sources, including [the Parallel UD treebank (PUD)](https://github.com/UniversalDependencies/UD_English-PUD/tree/master). If you want to cheat - or just check your answers - you can look for them in the official treebank. You can also compare your analyses with those of an automatic parser, such as [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/), which you can try directly in your browser. These automatic analyses must of course be taken with a grain of salt.
- [`comp-syntax-corpus-swedish.txt`](comp-syntax-corpus-swedish.txt) consists of teacher-corrected sentences from the [__Swedish__ Learner Language (SweLL) corpus](https://spraakbanken.gu.se/en/resources/swell-gold), which is currently being annotated in UD for the first time.
In this case, there is no "gold standard" to check your answers against, but you can still compare your solutions with [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/)'s automatic analyses.
In both corpora, the first few sentences are pre-tokenized and POS-tagged. Each token is in the form
`word:<UPOS>`.
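As a starting point for annotating the pre-tagged sentences, each one can be turned into simplified CoNLL-U lines with a placeholder head and label. The sketch below assumes that tokens are separated by whitespace and look literally like `world:<NOUN>`; the function name `tagged_to_simplified` and the dummy values `0` and `dep` are illustrative choices, to be replaced by your own analysis.

```python
def tagged_to_simplified(sentence: str) -> list[str]:
    """Turn 'word:<UPOS>' tokens into tab-separated ID, FORM, UPOS, HEAD, DEPREL lines."""
    lines = []
    for position, token in enumerate(sentence.split(), start=1):
        # Split on the last colon so that word forms containing ':' stay intact.
        form, _, tag = token.rpartition(":")
        upos = tag.strip("<>")
        # HEAD 0 and DEPREL 'dep' are placeholders to be corrected by hand.
        lines.append("\t".join([str(position), form, upos, "0", "dep"]))
    return lines


if __name__ == "__main__":
    print("\n".join(tagged_to_simplified("the:<DET> world:<NOUN>")))
```

The output has one line per token with the correct position number, so that only the `HEAD` and `DEPREL` fields remain to be filled in.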
### Step 3: annotate
For each sentence in the corpus, the annotation task consists of:
1. analyzing the sentence in UD
2. translating it to a language of your choice
3. analyzing your translation
The only required fields are `ID`, `FORM`, `UPOS`, `HEAD` and `DEPREL`.
In the end, you will submit two parallel CoNLL-U files, one containing the analyses of the source sentences and one for the analyses of the translations.
To produce the CoNLL-U files, you may work in your text editor (if you use Visual Studio Code, you can use the [vscode-conllu](https://marketplace.visualstudio.com/items?itemName=lgrobol.vscode-conllu) extension to get syntax highlighting) or use a dedicated annotation tool such as [Arborator](https://arborator.grew.fr/#/).
If you work in your text editor, it might be easier to first write a simplified CoNLL-U, with just the fields `ID`, `FORM`, `UPOS`, `HEAD` and `DEPREL`, separated by tabs, and then expand it to full CoNLL-U with [this script](https://gist.github.com/harisont/612a87d20f729aa3411041f873367fa2) (or similar).
Example:
`7 world NOUN 4 nmod`
expands to
`7 world _ NOUN _ _ 4 nmod _ _`
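The expansion in the example can be sketched as follows. This is a minimal illustration and not the linked script: the function names are hypothetical, and the code assumes that every token line has exactly the five fields `ID`, `FORM`, `UPOS`, `HEAD` and `DEPREL`, separated by tabs or spaces, with no whitespace inside a field.

```python
def expand_line(line: str) -> str:
    """Expand 'ID FORM UPOS HEAD DEPREL' to the ten CoNLL-U columns."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    token_id, form, upos, head, deprel = fields
    # Column order: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    return "\t".join([token_id, form, "_", upos, "_", "_", head, deprel, "_", "_"])


def expand_text(text: str) -> str:
    """Expand every token line, keeping blank lines and comments unchanged."""
    expanded = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            expanded.append(line)
        else:
            expanded.append(expand_line(line))
    return "\n".join(expanded) + "\n"


if __name__ == "__main__":
    print(expand_line("7 world NOUN 4 nmod"))
```

Blank lines are preserved because they mark sentence boundaries in CoNLL-U.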
We recommend that you annotate at least the first few sentences from scratch.
When you start feeling confident, you may pre-parse the remaining ones with UDPipe and manually correct the automatic annotation.
### Step 4: make sure your files match the CoNLL-U specification
Once you have full CoNLL-U, you can use [deptreepy](https://github.com/aarneranta/deptreepy/), [STUnD](https://harisont.github.io/STUnD/) or [the official online CoNLL-U viewer](https://universaldependencies.org/conllu_viewer.html) to visualize it.
With deptreepy, you will need to issue the command
`cat my-file.conllu | python deptreepy.py visualize_conllu > my-file.html`
which creates an HTML file you can open in your web browser.
If you can visualize your trees with any of these tools, it means that they are in valid CoNLL-U format.
If you want to check for more subtle errors, you can try to download and run [the official UD validator](https://github.com/UniversalDependencies/tools/blob/master/validate.py).
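Before running any of these tools, a quick sanity check can catch the most common formatting slips. The sketch below is far less thorough than the official validator and only tests two things: that every token line has ten tab-separated fields and that token IDs count up from 1 within each sentence. The function name `check_conllu` is hypothetical.

```python
def check_conllu(text: str) -> list[str]:
    """Return a list of problems found in CoNLL-U text (empty if none were found)."""
    problems = []
    expected_id = 1
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            expected_id = 1  # a blank line ends the sentence
            continue
        if line.startswith("#"):
            continue  # comment line
        fields = line.split("\t")
        if len(fields) != 10:
            problems.append(f"line {number}: {len(fields)} fields instead of 10")
            expected_id += 1
            continue
        token_id = fields[0]
        # Multiword token ranges (1-2) and empty nodes (1.1) are not counted.
        if "-" in token_id or "." in token_id:
            continue
        if token_id != str(expected_id):
            problems.append(f"line {number}: ID {token_id!r}, expected {expected_id}")
        expected_id += 1
    return problems


if __name__ == "__main__":
    print(check_conllu("1 hello INTJ 0 root\n"))
```

An empty result does not prove that a file is valid; it only means that these two checks passed.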
Submit the two CoNLL-U files on Canvas.
## Part 2: UD parsing
In this part of the lab, you will train and evaluate a UD parsing + POS tagging model.
For better performance, you are strongly encouraged to use the MLTGPU server.
### Step 1: setting up MaChAmp
1. optional, but recommended: create a Python virtual environment with the command
```
python -m venv ENVNAME
```
and activate it with
`source ENVNAME/bin/activate` (Linux/MacOS), or
`ENVNAME/Scripts/activate.bat` (Windows)
2. clone [the MaChAmp repository](https://github.com/machamp-nlp/machamp), move inside it and run
```
pip3 install -r requirements.txt
```
### Step 2: selecting the training and development data
Choose a UD treebank for one of the two languages you annotated in [part 1](#part-1-ud-annotation) and download it.
If you translated the corpus to a language that does not have a UD treebank, download a treebank for a related language (e.g. Italian if you annotated sentences in Sardinian).
If you are working on MLTGPU, you may choose a large treebank such as [Swedish-Talbanken](https://github.com/UniversalDependencies/UD_Swedish-Talbanken), which is already divided into a training, development and test split.
If you are working on your laptop and/or if your language does not have a lot of data available, you may want to use a smaller treebank, such as [Amharic-ATT](https://github.com/UniversalDependencies/UD_Amharic-ATT), which only comes with a test set.
In this case, split the test into a training and a development portion (e.g. 80% of the sentences for training and 20% for development).
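Such a split can be sketched as follows. The function names are hypothetical, and the code assumes that sentences in the CoNLL-U file are separated by blank lines. Shuffling with a fixed seed is one possible design choice: it makes the split reproducible, but you may prefer to keep the original sentence order.

```python
import random


def read_sentences(text: str) -> list[str]:
    """Split CoNLL-U text into sentence blocks separated by blank lines."""
    blocks = [block.strip("\n") for block in text.split("\n\n")]
    return [block for block in blocks if block.strip()]


def split_treebank(sentences: list[str], train_share: float = 0.8, seed: int = 0):
    """Shuffle the sentences reproducibly and cut them into two portions."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


def write_sentences(sentences: list[str]) -> str:
    """Join sentence blocks back into CoNLL-U text, one blank line after each."""
    return "".join(block + "\n\n" for block in sentences)


if __name__ == "__main__":
    demo = [f"# sent_id = {i}\n1\tw\t_\tX\t_\t_\t0\troot\t_\t_" for i in range(10)]
    train, dev = split_treebank(demo)
    print(len(train), len(dev))
```

Writing `write_sentences(train)` and `write_sentences(dev)` to two files gives the training and development paths needed in the next step.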
### Step 3: training
Copy `compsyn.json` to `machamp/configs` and replace the training and development data paths with the paths to the files you selected/created in step 2.
You can now train your model by running
```
python3 train.py --dataset_configs configs/compsyn.json --device N
```
from the MaChAmp folder.
If you are working on MLTGPU, replace `N` with `0` (GPU). If you are using your laptop or EDUSERV, replace it with `-1`, which instructs MaChAmp to train the model on the CPU.
### Step 4: evaluation
Run your newly trained model with
```
python predict.py logs/compsyn/DATE/model.pt PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu --device N
```
and use the `machamp/scripts/misc/conll18_ud_eval.py` script to evaluate the system output against your annotations. You can run it as
```
python conll18_ud_eval.py PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu
```
On Canvas, submit the training logs, the predictions and the output of `conll18_ud_eval.py`, along with a short text summarizing your considerations on the performance of the parser.
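To help interpret the reported scores, the two main attachment metrics can be sketched in simplified form. This is not how `conll18_ud_eval.py` is implemented: the official script also handles tokenization differences and other details, whereas the sketch below assumes that both files contain exactly the same tokens in the same order. The function names are hypothetical.

```python
def read_arcs(text: str) -> list[tuple[str, str]]:
    """Collect (HEAD, DEPREL) for every regular token line in CoNLL-U text."""
    arcs = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Skip multiword token ranges (1-2) and empty nodes (1.1).
        if "-" in fields[0] or "." in fields[0]:
            continue
        arcs.append((fields[6], fields[7]))
    return arcs


def attachment_scores(gold_text: str, predicted_text: str) -> tuple[float, float]:
    """Return (UAS, LAS): share of tokens with correct head, and correct head and label."""
    gold, predicted = read_arcs(gold_text), read_arcs(predicted_text)
    if not gold or len(gold) != len(predicted):
        raise ValueError("both files must contain the same, non-empty token sequence")
    pairs = list(zip(gold, predicted))
    uas = sum(g[0] == p[0] for g, p in pairs) / len(pairs)
    las = sum(g == p for g, p in pairs) / len(pairs)
    return uas, las


if __name__ == "__main__":
    gold = "1\tthe\t_\tDET\t_\t_\t2\tdet\t_\t_\n2\tworld\t_\tNOUN\t_\t_\t0\troot\t_\t_\n"
    print(attachment_scores(gold, gold))
```

A large gap between the two scores suggests that the parser often finds the right head but assigns the wrong relation label, which is worth mentioning in your summary.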

17
lab3/machamp_config.json Normal file

@@ -0,0 +1,17 @@
{
"compsyn": {
"train_data_path": "PATH-TO-YOUR-TRAIN-SPLIT",
"dev_data_path": "PATH-TO-YOUR-DEV-SPLIT",
"word_idx": 1,
"tasks": {
"upos": {
"task_type": "seq",
"column_idx": 3
},
"dependency": {
"task_type": "dependency",
"column_idx": 6
}
}
}
}