From 19fb44c64db81a754876859844ec13924fb4cf3d Mon Sep 17 00:00:00 2001 From: Arianna Masciolini Date: Fri, 21 Mar 2025 13:50:23 +0100 Subject: [PATCH] rename stuff + new ud lab draft --- lab1/README.md | 120 ------------------ lab1/using_udpipe.md | 63 --------- lab3/README.md | 116 +++++++++++++++++ {lab1 => lab3}/comp-syntax-corpus-english.txt | 0 {lab1 => lab3}/comp-syntax-corpus-swedish.txt | 0 {lab1 => lab3}/lecture3-examples.txt | 0 lab3/machamp_config.json | 17 +++ {lab2 => labs1-2}/README.md | 16 ++- .../grammar/abstract/MicroLang.gf | 0 .../grammar/abstract/MicroLang.labels | 0 .../grammar/abstract/MiniGrammar.gf | 0 .../grammar/abstract/MiniLang.gf | 0 .../grammar/abstract/MiniLang.labels | 0 .../grammar/abstract/MiniLexicon.gf | 0 {lab2 => labs1-2}/grammar/api/MiniSyntax.gf | 0 .../grammar/api/MiniSyntaxEng.gf | 0 .../grammar/application-2022/Doctor.gf | 0 .../grammar/application-2022/DoctorEng.gf | 0 .../grammar/application-2022/DoctorMiniEng.gf | 0 .../grammar/english/MicroLangEng.gf | 0 .../grammar/english/MicroLangEng.labels | 0 .../grammar/english/MicroResEng.gf | 0 .../grammar/english/MiniGrammarEng.gf | 0 .../grammar/english/MiniLangEng.gf | 0 .../grammar/english/MiniLangEng.labels | 0 .../grammar/english/MiniLexiconEng.gf | 0 .../grammar/english/MiniParadigmsEng.gf | 0 .../grammar/english/MiniResEng.gf | 0 {lab2 => labs1-2}/grammar/foods/Foods.gf | 0 .../grammar/functor/MicroLangFunctor.gf | 0 .../grammar/functor/MicroLangFunctorEng.gf | 0 .../grammar/functor/MicroLangFunctorSwe.gf | 0 .../grammar/functor/MiniLangFunctor.gf | 0 .../grammar/functor/MiniLangFunctorEng.gf | 0 .../grammar/functor/MiniLangFunctorSwe.gf | 0 .../grammar/intro/MicroLangEng.gf | 0 .../grammar/intro/MicroResEng.gf | 0 .../grammar/italian/MicroLangIta.gf | 0 .../grammar/italian/MicroLangIta.gfo | Bin .../grammar/italian/MicroResIta.gf | 0 .../grammar/italian/MicroResIta.gfo | Bin .../grammar/myproject/MicroLangEng.gf | 0 .../grammar/myproject/MicroLangEng.gfo | Bin 
.../grammar/myproject/MicroLangSwe.gf | 0 .../grammar/myproject/MicroLangSwe.gfo | Bin .../grammar/myproject/MicroResEng.gf | 0 .../grammar/myproject/MicroResEng.gfo | Bin .../grammar/myproject/MicroResSwe.gf | 0 .../grammar/myproject/MicroResSwe.gfo | Bin {lab2 => labs1-2}/grammar/test.gfs | 0 {lab2 => labs1-2}/intro/Intro.gf | 0 {lab2 => labs1-2}/intro/IntroEng.gf | 0 {lab2 => labs1-2}/intro/IntroFre.gf | 0 {lab2 => labs1-2}/intro/english.cf | 0 {lab2 => labs1-2}/wikipedia-2022/Countries.gf | 0 .../wikipedia-2022/CountriesEng.gf | 0 .../wikipedia-2022/CountriesFin.gf | 0 .../wikipedia-2022/CountriesGer.gf | 0 .../wikipedia-2022/CountriesSwe.gf | 0 .../wikipedia-2022/CountryNames.gf | 0 .../wikipedia-2022/CountryNamesEng.gf | 0 .../wikipedia-2022/CountryNamesFin.gf | 0 .../wikipedia-2022/CountryNamesGer.gf | 0 .../wikipedia-2022/CountryNamesSwe.gf | 0 {lab2 => labs1-2}/wikipedia-2022/Facts.gf | 0 {lab2 => labs1-2}/wikipedia-2022/FactsEng.gf | 0 {lab2 => labs1-2}/wikipedia-2022/FactsFin.gf | 0 {lab2 => labs1-2}/wikipedia-2022/FactsGer.gf | 0 {lab2 => labs1-2}/wikipedia-2022/FactsSwe.gf | 0 .../wikipedia-2022/country_facts.py | 0 .../wikipedia-2022/data_facts.py | 0 .../wikipedia-2022/extract_names.py | 0 72 files changed, 148 insertions(+), 184 deletions(-) delete mode 100644 lab1/README.md delete mode 100644 lab1/using_udpipe.md create mode 100644 lab3/README.md rename {lab1 => lab3}/comp-syntax-corpus-english.txt (100%) rename {lab1 => lab3}/comp-syntax-corpus-swedish.txt (100%) rename {lab1 => lab3}/lecture3-examples.txt (100%) create mode 100644 lab3/machamp_config.json rename {lab2 => labs1-2}/README.md (72%) rename {lab2 => labs1-2}/grammar/abstract/MicroLang.gf (100%) rename {lab2 => labs1-2}/grammar/abstract/MicroLang.labels (100%) rename {lab2 => labs1-2}/grammar/abstract/MiniGrammar.gf (100%) rename {lab2 => labs1-2}/grammar/abstract/MiniLang.gf (100%) rename {lab2 => labs1-2}/grammar/abstract/MiniLang.labels (100%) rename {lab2 => 
labs1-2}/grammar/abstract/MiniLexicon.gf (100%) rename {lab2 => labs1-2}/grammar/api/MiniSyntax.gf (100%) rename {lab2 => labs1-2}/grammar/api/MiniSyntaxEng.gf (100%) rename {lab2 => labs1-2}/grammar/application-2022/Doctor.gf (100%) rename {lab2 => labs1-2}/grammar/application-2022/DoctorEng.gf (100%) rename {lab2 => labs1-2}/grammar/application-2022/DoctorMiniEng.gf (100%) rename {lab2 => labs1-2}/grammar/english/MicroLangEng.gf (100%) rename {lab2 => labs1-2}/grammar/english/MicroLangEng.labels (100%) rename {lab2 => labs1-2}/grammar/english/MicroResEng.gf (100%) rename {lab2 => labs1-2}/grammar/english/MiniGrammarEng.gf (100%) rename {lab2 => labs1-2}/grammar/english/MiniLangEng.gf (100%) rename {lab2 => labs1-2}/grammar/english/MiniLangEng.labels (100%) rename {lab2 => labs1-2}/grammar/english/MiniLexiconEng.gf (100%) rename {lab2 => labs1-2}/grammar/english/MiniParadigmsEng.gf (100%) rename {lab2 => labs1-2}/grammar/english/MiniResEng.gf (100%) rename {lab2 => labs1-2}/grammar/foods/Foods.gf (100%) rename {lab2 => labs1-2}/grammar/functor/MicroLangFunctor.gf (100%) rename {lab2 => labs1-2}/grammar/functor/MicroLangFunctorEng.gf (100%) rename {lab2 => labs1-2}/grammar/functor/MicroLangFunctorSwe.gf (100%) rename {lab2 => labs1-2}/grammar/functor/MiniLangFunctor.gf (100%) rename {lab2 => labs1-2}/grammar/functor/MiniLangFunctorEng.gf (100%) rename {lab2 => labs1-2}/grammar/functor/MiniLangFunctorSwe.gf (100%) rename {lab2 => labs1-2}/grammar/intro/MicroLangEng.gf (100%) rename {lab2 => labs1-2}/grammar/intro/MicroResEng.gf (100%) rename {lab2 => labs1-2}/grammar/italian/MicroLangIta.gf (100%) rename {lab2 => labs1-2}/grammar/italian/MicroLangIta.gfo (100%) rename {lab2 => labs1-2}/grammar/italian/MicroResIta.gf (100%) rename {lab2 => labs1-2}/grammar/italian/MicroResIta.gfo (100%) rename {lab2 => labs1-2}/grammar/myproject/MicroLangEng.gf (100%) rename {lab2 => labs1-2}/grammar/myproject/MicroLangEng.gfo (100%) rename {lab2 => 
labs1-2}/grammar/myproject/MicroLangSwe.gf (100%) rename {lab2 => labs1-2}/grammar/myproject/MicroLangSwe.gfo (100%) rename {lab2 => labs1-2}/grammar/myproject/MicroResEng.gf (100%) rename {lab2 => labs1-2}/grammar/myproject/MicroResEng.gfo (100%) rename {lab2 => labs1-2}/grammar/myproject/MicroResSwe.gf (100%) rename {lab2 => labs1-2}/grammar/myproject/MicroResSwe.gfo (100%) rename {lab2 => labs1-2}/grammar/test.gfs (100%) rename {lab2 => labs1-2}/intro/Intro.gf (100%) rename {lab2 => labs1-2}/intro/IntroEng.gf (100%) rename {lab2 => labs1-2}/intro/IntroFre.gf (100%) rename {lab2 => labs1-2}/intro/english.cf (100%) rename {lab2 => labs1-2}/wikipedia-2022/Countries.gf (100%) rename {lab2 => labs1-2}/wikipedia-2022/CountriesEng.gf (100%) rename {lab2 => labs1-2}/wikipedia-2022/CountriesFin.gf (100%) rename {lab2 => labs1-2}/wikipedia-2022/CountriesGer.gf (100%) rename {lab2 => labs1-2}/wikipedia-2022/CountriesSwe.gf (100%) rename {lab2 => labs1-2}/wikipedia-2022/CountryNames.gf (100%) rename {lab2 => labs1-2}/wikipedia-2022/CountryNamesEng.gf (100%) rename {lab2 => labs1-2}/wikipedia-2022/CountryNamesFin.gf (100%) rename {lab2 => labs1-2}/wikipedia-2022/CountryNamesGer.gf (100%) rename {lab2 => labs1-2}/wikipedia-2022/CountryNamesSwe.gf (100%) rename {lab2 => labs1-2}/wikipedia-2022/Facts.gf (100%) rename {lab2 => labs1-2}/wikipedia-2022/FactsEng.gf (100%) rename {lab2 => labs1-2}/wikipedia-2022/FactsFin.gf (100%) rename {lab2 => labs1-2}/wikipedia-2022/FactsGer.gf (100%) rename {lab2 => labs1-2}/wikipedia-2022/FactsSwe.gf (100%) rename {lab2 => labs1-2}/wikipedia-2022/country_facts.py (100%) rename {lab2 => labs1-2}/wikipedia-2022/data_facts.py (100%) rename {lab2 => labs1-2}/wikipedia-2022/extract_names.py (100%) diff --git a/lab1/README.md b/lab1/README.md deleted file mode 100644 index 29e2abe..0000000 --- a/lab1/README.md +++ /dev/null @@ -1,120 +0,0 @@ -# Lab 1: Grammatical analysis - - -This lab follows Chapters 1-4 in the course notes. 
Each part is started after the lecture on the corresponding chapter. -The assignments are submitted via Canvas. - -## Chapter 1: explore the parallel UD treebank (PUD) - -1. Go to [universaldependencies.org](https://universaldependencies.org/) and download Version 2.7+ treebanks -2. Look up the Parallel UD treebanks for those 21 languages that have it. They are named e.g. `UD_English-PUD/` -3. Select a language to compare with English. -4. Make statistics about the frequencies of POS tags and dependency labels in your language compared with English: find the top-20 tags/labels and their number of occurrences. What does this tell you about the language? (This can be done with shell or Python programming or, more easily, with the [deptreepy](https://github.com/aarneranta/deptreepy/) or [gf-ud](https://github.com/grammaticalFramework/gf-ud) tools. The latter is also available on the eduserv server.) -5. Convert the following four trees from CoNLL-U format to graphical trees by hand, on paper. - - a short English tree (5-10 words, of your choice) and its translation. - - a long English tree (>25 words) and its translation. -1. Draw word alignments for some non-trivial example in the PUD treebank, on paper. - Use the same trees as in the previous question. - What can you say about the syntactic differences between the languages? - - -## Chapter 2: design the morpological types of the major parts of speech in your selected language - -1. It is enough to cover NOUN, ADJ, and VERB. -2. Use a traditional grammar book or a Wikipedia article to identify the inflectional and inherent features. -3. Then use data from PUD to check which morphological features actually occur in the treebank for that language. - -## Chapter 3: UD syntax analysis - -In this lab, you will annotate a bilingual corpus with UD. -You can choose between starting with an English corpus and translate it to a language of your choice, or start with a Swedish corpus to translate into English. 
- -Your task is to: - -1. write an CoNLL file analysing your chosen corpus -2. translate it -3. write a CoNLL file analysing your translation - -### Option 1: English data -The English text is given in the file [`comp-syntax-corpus-english.txt`](comp-syntax-corpus-english.txt) in this directory. -The corpus is a combination of different sources, including the Parallel UD treebank (PUD). -If you want to cheat - or just check your own answer - you can look for those sentences in the official PUD. You can also compare your analyses with those of an automatic parser, such as [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/), which you can try directly from your browser. These automatic analyses must of course be taken with a grain of salt. - -### Option 2: Swedish data -The Swedish text is given in the file [`comp-syntax-corpus-swedish.txt`](comp-syntax-corpus-swedish.txt) in this directory. -It consists of teacher-corrected sentences from the [Swedish Learner Language (SweLL) corpus](https://spraakbanken.gu.se/en/resources/swell-gold)[^1], which is currently being annotated in UD for the first time. -In this case, there is no "gold standard" to check your answers against, but by choosing this corpus you will directly contribute to an ongoing annotation effort. -Of course, you can still compare your solutions with [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/)'s automatic analyses. - -In both corpora, the first few sentences are POS-tagged, with each word having the form - -`word:` - -Hint: you can initialize the task by converting each word or word: to a simplified CoNLL line with a dummy head (0) and label (dep), with proper position number of course. - -The UD annotation that you produce manually can be simplified CoNLL, with just the fields - -`position word postag head label` - -Make sure that each field is exactly one token, so that the whole line has exactly 5 tokens. 
- -This input can be automatically expanded to full CoNLL by adding undescores for the lemma, morphology, and other missing fields, as well as tabs between the fields (if you didn't use tabs already). - -`position word _ postag _ _ head label _ _` - -Example: - -`7 world NOUN 4 nmod` - -expands to - -`7 world _ NOUN _ _ 4 nmod _ _` - -(Unfortunately, the tabs are not visible in the md output.) -The conversion to full CoNLL can be done using Python or `gf-ud reduced2conll` (available on eduserv) or with [this script](https://gist.github.com/harisont/612a87d20f729aa3411041f873367fa2). - -Once you have full CoNLL, you can use [deptreepy](https://github.com/aarneranta/deptreepy/), [gf-ud](https://github.com/grammaticalFramework/gf-ud) or [the online CoNNL-U viewer](https://universaldependencies.org/conllu_viewer.html) to visualize it. - -With deptreepy, you will need to issue the command - -`cat my-file.conllu | python deptreepy.py visualize_conllu > my-file.html` - -which creates an HTML file you can open in you web browser. - -If you use the gf-ud tool, the command is - -`cat my-file.conllu | ./gf-ud conll2pdf` - -which generates a PDF. However, this does not support all foreign characters. - -(It is possible that you won't be able to visualize the trees directly on eduserv. -Building gf-ud and running this command on your machine requires Haskell and the GF libraries, as well as LaTeX to show the pdf output.) - -## (Chapter 4: phrase structure analysis) - -> __NOTE:__ chapter 4 is __not__ required in the 2024 edition of the course. -> You are of course welcome to try out these exercises, but they will not be graded. 
- -### Prerequisites: get `gf-ud` to work -There are multiple ways to use `gf-ud`: -- using the version that is installed on eduserv -- installing a pre-compiled executable, available for Mac and Ubuntu machines at http://www.grammaticalframework.org/~aarne/software/ -- compiling the source code, available at https://github.com/GrammaticalFramework/gf-ud. `gf-ud` can be built: - - with `make` provided that you have the GHC Haskell compiler and the gf-core libraries (available at https://github.com/GrammaticalFramework/gf-core) installed - - with the Haskell Stack tool, by running `stack install`. This will install all the necessary dependency automatically. - -### Tasks -1. Construct (by hand) phrase structure trees for some of the sentences in the corpus used in Chapter 3, both for English and your chosen language. - -2. Test the grammar at - - https://github.com/GrammaticalFramework/gf-ud/blob/master/grammars/English.dbnf - - on last week's corpus, both for English and your own language. - In practice, this means: - - running `gf-ud`'s `dbnf` command on (possibly POS-tagged) versions of the sentences in Chapter 3's corpus. - - comparing the CoNNL-U and parse trees obtained in this way with, respectively, your hand-drawn parse trees and the CoNNL-U trees from Chapter 3. Parse tree comparison can be qualitative, while CoNNL-U trees are to be compared quantitatively via `gf-ud eval`. - -3. Modify the grammar to suit your language and test it on some of the UD treebanks by using `gf-ud eval`. Try to obtain a `udScore` above 0.60. You are welcome to explain the changes you make. - -[^1]: to be precise, the sentences you will use have been extracted from [DaLAJ-GED-SuperLim 2.0](https://spraakbanken.gu.se/en/resources/dalaj-ged-superlim), a publicly available spinoff of the main SweLL corpus. 
\ No newline at end of file diff --git a/lab1/using_udpipe.md b/lab1/using_udpipe.md deleted file mode 100644 index 5f1741e..0000000 --- a/lab1/using_udpipe.md +++ /dev/null @@ -1,63 +0,0 @@ -# UDPipe: quick instructions - -## Download and install - -The simplest way to use UDPipe is to install the binaries for UDPipe-1, which exist for several operating systems. -They can be downloaded from - -https://github.com/ufal/udpipe/releases/download/v1.3.0/udpipe-1.3.0-bin.zip - -When you have downloaded and unzipped this file, you will find the binary for your system in a subdirectory, for instance, -``` -udpipe-1.3.0-bin/bin-macos/udpipe -``` -is the binary for MacOS. -If you include this directory on your PATH, you can run the command `udpipe` from anywhere. - -Running the parser for a language requires a model for that language. -Models can be accessed via - -https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3131 - -This page includes a long list of models and a command to install them all. 
-If you need only some of them, you can do, for instance, -``` -$ wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3131//english-lines-ud-2.5-191206.udpipe -``` - -## Running the parser - -Assuming that you have the binary `udpipe` and the model `english-lines-ud-2.5-191206.udpipe` on you path, you can parse a single sentence with -``` -$ echo "my hovercraft is full of eels" | udpipe --tokenize --tag --parse english-lines-ud-2.5-191206.udpipe -``` -The result is a UD tree in the CoNLL-U notation, -``` -# newdoc -# newpar -# sent_id = 1 -# text = my hovercraft is full of eels -1 my I PRON P1SG-GEN Number=Sing|Person=1|Poss=Yes|PronType=Prs nmod:poss _ _ -2 hovercraft hovercraft NOUN SG-NOM Number=Sing 4 nsubj _ _ -3 is be AUX PRES Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin cop _ _ -4 full full ADJ POS Degree=Pos 0 root _ _ -5 of of ADP _ _ 6 case _ _ -6 eels eel NOUN PL-NOM Number=Plur 4 nmod _ SpacesAfter=\n -``` -If you also have `gfud` and `pdflatex` on your path, you can pipe the result into `gfud conll2pdf` to see the graphical tree. - -As `udpipe` reads standard input, you can read it from a file, such as `lecture3-examples.txt`: -``` -$ cat | udpipe --tokenize --tag --parse -``` -Notice that sentences in that file must either end with a period or be separated by empty lines, because otherwise the whole file is parsed as one sentence. - - -## Training a new model - -If you have a treebank in the CoNLL-U format, you can use it for training a new model, with -``` -$ cat .conllu | udpipe --tokenizer none --tagger none --train .udpipe -``` - - diff --git a/lab3/README.md b/lab3/README.md new file mode 100644 index 0000000..ca384c0 --- /dev/null +++ b/lab3/README.md @@ -0,0 +1,116 @@ +# Lab 3: Universal Dependencies + +This lab is divided into two parts. +In [part 1](#part-1-ud-annotation), you will create a small parallel UD treebank for English/Swedish and a language of your choice. 
+In [part 2](#part-2-ud-parsing), you will train a parsing model and evaluate it on your treebank. + +## Part 1: UD annotation +The goal of this part of the lab is to enable you to contribute to a UD annotation project. You will familiarize yourself with the CoNLL-U format and annotate your own parallel UD treebank. + +### Step 1: familiarize yourself with the CoNLL-U format +Go to [universaldependencies.org](https://universaldependencies.org/) and download a treebank for a language of your choice. +Choose a short (5-10 tokens) and a long (>25 words) sentence and convert them from CoNLL-U to graphical trees by hand. + +### Step 2: choose a corpus +Choose one of the two corpora provided in this folder: + +- [`comp-syntax-corpus-english.txt`](comp-syntax-corpus-english.txt) is a combination of __English__ sentences from different sources, including [the Parallel UD treebank (PUD)](https://github.com/UniversalDependencies/UD_English-PUD/tree/master). If you want to cheat - or just check your answers - you can look for them in the official treebank. You can also compare your analyses with those of an automatic parser, such as [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/), which you can try directly in your browser. These automatic analyses must of course be taken with a grain of salt. +- [`comp-syntax-corpus-swedish.txt`](comp-syntax-corpus-swedish.txt) consists of teacher-corrected sentences from the [__Swedish__ Learner Language (SweLL) corpus](https://spraakbanken.gu.se/en/resources/swell-gold), which is currently being annotated in UD for the first time. +In this case, there is no "gold standard" to check your answers against, but you can still compare your solutions with [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/)'s automatic analyses. + +In both corpora, the first few sentences are pre-tokenized and POS-tagged. Each token is in the form + +`word:`. + +### Step 3: annotate +For each sentence in the corpus, the annotation task consists of: + +1. 
analyzing the sentence in UD +2. translating it to a language of your choice +3. analyzing your translation + +The only required fields are `ID`, `FORM`, `UPOS`, `HEAD` and `DEPREL`. + +In the end, you will submit two parallel CoNLL-U files, one containing the analyses of the source sentences and one containing the analyses of the translations. + +To produce the CoNLL-U files, you may work in your text editor (if you use Visual Studio Code, you can use the [vscode-conllu](https://marketplace.visualstudio.com/items?itemName=lgrobol.vscode-conllu) extension to get syntax highlighting) or use a dedicated annotation tool such as [Arborator](https://arborator.grew.fr/#/). + +If you work in your text editor, it might be easier to first write a simplified CoNLL-U, with just the fields `ID`, `FORM`, `UPOS`, `HEAD` and `DEPREL`, separated by tabs, and then expand it to full CoNLL-U with [this script](https://gist.github.com/harisont/612a87d20f729aa3411041f873367fa2) (or similar). + +Example: + +`7 world NOUN 4 nmod` + +expands to + +`7 world _ NOUN _ _ 4 nmod _ _` + +We recommend that you annotate at least the first few sentences from scratch. +When you start feeling confident, you may pre-parse the remaining ones with UDPipe and manually correct the automatic annotation. + +### Step 4: make sure your files match the CoNLL-U specification +Once you have full CoNLL-U, you can use [deptreepy](https://github.com/aarneranta/deptreepy/), [STUnD](https://harisont.github.io/STUnD/) or [the official online CoNLL-U viewer](https://universaldependencies.org/conllu_viewer.html) to visualize it. + +With deptreepy, you will need to issue the command + +`cat my-file.conllu | python deptreepy.py visualize_conllu > my-file.html` + +which creates an HTML file you can open in your web browser. + +If you can visualize your trees with any of these tools, it means that they are in valid CoNLL-U format. 
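The expansion from simplified to full CoNLL-U described in step 3 can be sketched in Python as follows. This is only a minimal illustration and not the linked script: the function names `expand_line` and `expand_text` are made up for this example, and the sketch assumes that fields are separated by whitespace and contain no spaces themselves.

```python
def expand_line(line: str) -> str:
    """Expand one simplified line (ID FORM UPOS HEAD DEPREL) to 10 CoNLL-U fields."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    tok_id, form, upos, head, deprel = fields
    # LEMMA, XPOS, FEATS, DEPS and MISC are left unspecified ("_").
    return "\t".join([tok_id, form, "_", upos, "_", "_", head, deprel, "_", "_"])


def expand_text(text: str) -> str:
    """Expand a whole simplified file, keeping comments and blank lines."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            # Blank lines separate sentences; "#" lines are sentence-level comments.
            out.append(stripped)
        else:
            out.append(expand_line(stripped))
    return "\n".join(out) + "\n"
```

Raising an error on lines that do not have exactly five fields makes typos in the simplified file show up early, before the result is passed to a viewer or validator.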
+If you want to check for more subtle errors, you can download and run [the official UD validator](https://github.com/UniversalDependencies/tools/blob/master/validate.py). + +Submit the two CoNLL-U files on Canvas. + +## Part 2: UD parsing +In this part of the lab, you will train and evaluate a model for UD parsing and POS tagging. +For better performance, you are strongly encouraged to use the MLTGPU server. + +### Step 1: setting up MaChAmp +1. optional, but recommended: create a Python virtual environment with the command + ``` + python -m venv ENVNAME + ``` + and activate it with + + `source ENVNAME/bin/activate` (Linux/MacOS), or + + `ENVNAME/Scripts/activate.bat` (Windows) +2. clone [the MaChAmp repository](https://github.com/machamp-nlp/machamp), move inside it and run + ``` + pip3 install -r requirements.txt + ``` + +### Step 2: selecting the training and development data +Choose a UD treebank for one of the two languages you annotated in [part 1](#part-1-ud-annotation) and download it. +If you translated the corpus to a language that does not have a UD treebank, download a treebank for a related language (e.g. Italian if you annotated sentences in Sardinian). + +If you are working on MLTGPU, you may choose a large treebank such as [Swedish-Talbanken](https://github.com/UniversalDependencies/UD_Swedish-Talbanken), which is already divided into training, development and test splits. + +If you are working on your laptop and/or if your language does not have a lot of data available, you may want to use a smaller treebank, such as [Amharic-ATT](https://github.com/UniversalDependencies/UD_Amharic-ATT), which only comes with a test set. +In this case, split the test set into a training and a development portion (e.g. 80% of the sentences for training and 20% for development). + +### Step 3: training +Copy `machamp_config.json` to `machamp/configs/compsyn.json` and replace the training and development data paths with the paths to the files you selected/created in step 2. 
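If you need to create the training and development files yourself, the 80/20 split suggested in step 2 can be sketched as follows. This is a rough illustration under stated assumptions: the helper names (`read_sentences`, `split_treebank`, `to_conllu`) are hypothetical, sentences are assumed to be separated by blank lines as in standard CoNLL-U files, and the shuffle with a fixed seed is a design choice of this sketch, not a requirement of the lab.

```python
import random
import re


def read_sentences(text: str) -> list[str]:
    """Split CoNLL-U text into sentence blocks (separated by blank lines)."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [b for b in blocks if b.strip()]


def split_treebank(text: str, train_ratio: float = 0.8, seed: int = 0):
    """Return (train, dev) lists of sentence blocks."""
    sentences = read_sentences(text)
    # Shuffle with a fixed seed so that the split is reproducible.
    random.Random(seed).shuffle(sentences)
    cut = int(len(sentences) * train_ratio)
    return sentences[:cut], sentences[cut:]


def to_conllu(sentences: list[str]) -> str:
    """Join sentence blocks back into CoNLL-U text (blank line after each)."""
    return "".join(s + "\n\n" for s in sentences)
```

To use it, read the treebank's test file into a string, call `split_treebank`, and write `to_conllu(train)` and `to_conllu(dev)` to two new `.conllu` files whose paths you then put in the configuration.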
+ +You can now train your model by running + +``` +python3 train.py --dataset_configs configs/compsyn.json --device N +``` +from the MaChAmp folder. +If you are working on MLTGPU, replace `N` with `0` (GPU). If you are using your laptop or EDUSERV, replace it with `-1`, which instructs MaChAmp to train the model on the CPU. + +### Step 4: evaluation +Run your newly trained model with + +``` +python predict.py logs/compsyn/DATE/model.pt PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu --device N +``` +and use the `machamp/scripts/misc/conll18_ud_eval.py` script to evaluate the system output against your annotations. You can run it as + +``` +python conll18_ud_eval.py PATH-TO-YOUR-PART1-TREEBANK predictions/OUTPUT-FILE-NAME.conllu +``` + +Submit the training logs, the predictions and the output of `conll18_ud_eval.py`, along with a short text summarizing your considerations on the performance of the parser. \ No newline at end of file diff --git a/lab1/comp-syntax-corpus-english.txt b/lab3/comp-syntax-corpus-english.txt similarity index 100% rename from lab1/comp-syntax-corpus-english.txt rename to lab3/comp-syntax-corpus-english.txt diff --git a/lab1/comp-syntax-corpus-swedish.txt b/lab3/comp-syntax-corpus-swedish.txt similarity index 100% rename from lab1/comp-syntax-corpus-swedish.txt rename to lab3/comp-syntax-corpus-swedish.txt diff --git a/lab1/lecture3-examples.txt b/lab3/lecture3-examples.txt similarity index 100% rename from lab1/lecture3-examples.txt rename to lab3/lecture3-examples.txt diff --git a/lab3/machamp_config.json b/lab3/machamp_config.json new file mode 100644 index 0000000..4c2eea6 --- /dev/null +++ b/lab3/machamp_config.json @@ -0,0 +1,17 @@ +{ + "compsyn": { + "train_data_path": "PATH-TO-YOUR-TRAIN-SPLIT", + "dev_data_path": "PATH-TO-YOUR-DEV-SPLIT", + "word_idx": 1, + "tasks": { + "upos": { + "task_type": "seq", + "column_idx": 3 + }, + "dependency": { + "task_type": "dependency", + "column_idx": 6 + } + } + } +} \ No newline 
at end of file diff --git a/lab2/README.md b/labs1-2/README.md similarity index 72% rename from lab2/README.md rename to labs1-2/README.md index 78e7831..54136de 100644 --- a/lab2/README.md +++ b/labs1-2/README.md @@ -3,8 +3,22 @@ This lab corresponds to Chapters 5 to 9 of the Notes, but follows them only loosely. Therefore we will structure it according to the exercise sessions rather than chapters. -The abstract syntax is given in the subdirectory grammars/abstract/ +The abstract syntax is given in the subdirectory grammars/abstract/ +1. Go to [universaldependencies.org](https://universaldependencies.org/) and download Version 2.7+ treebanks +2. Look up the Parallel UD treebanks for those 21 languages that have it. They are named e.g. `UD_English-PUD/` +3. Select a language to compare with English. +4. Make statistics about the frequencies of POS tags and dependency labels in your language compared with English: find the top-20 tags/labels and their number of occurrences. What does this tell you about the language? (This can be done with shell or Python programming or, more easily, with the [deptreepy](https://github.com/aarneranta/deptreepy/) or [gf-ud](https://github.com/grammaticalFramework/gf-ud) tools. The latter is also available on the eduserv server.) + +1. Draw word alignments for some non-trivial example in the PUD treebank, on paper. + Use the same trees as in the previous question. + What can you say about the syntactic differences between the languages? + + ## Chapter 2: design the morphological types of the major parts of speech in your selected language + +1. It is enough to cover NOUN, ADJ, and VERB. +2. Use a traditional grammar book or a Wikipedia article to identify the inflectional and inherent features. +3. Then use data from PUD to check which morphological features actually occur in the treebank for that language. ## After lecture 6 1. Design a morphology for the main lexical types (N, A, V) with parameters and a couple of paradigms. 
diff --git a/lab2/grammar/abstract/MicroLang.gf b/labs1-2/grammar/abstract/MicroLang.gf
similarity index 100%
rename from lab2/grammar/abstract/MicroLang.gf
rename to labs1-2/grammar/abstract/MicroLang.gf
diff --git a/lab2/grammar/abstract/MicroLang.labels b/labs1-2/grammar/abstract/MicroLang.labels
similarity index 100%
rename from lab2/grammar/abstract/MicroLang.labels
rename to labs1-2/grammar/abstract/MicroLang.labels
diff --git a/lab2/grammar/abstract/MiniGrammar.gf b/labs1-2/grammar/abstract/MiniGrammar.gf
similarity index 100%
rename from lab2/grammar/abstract/MiniGrammar.gf
rename to labs1-2/grammar/abstract/MiniGrammar.gf
diff --git a/lab2/grammar/abstract/MiniLang.gf b/labs1-2/grammar/abstract/MiniLang.gf
similarity index 100%
rename from lab2/grammar/abstract/MiniLang.gf
rename to labs1-2/grammar/abstract/MiniLang.gf
diff --git a/lab2/grammar/abstract/MiniLang.labels b/labs1-2/grammar/abstract/MiniLang.labels
similarity index 100%
rename from lab2/grammar/abstract/MiniLang.labels
rename to labs1-2/grammar/abstract/MiniLang.labels
diff --git a/lab2/grammar/abstract/MiniLexicon.gf b/labs1-2/grammar/abstract/MiniLexicon.gf
similarity index 100%
rename from lab2/grammar/abstract/MiniLexicon.gf
rename to labs1-2/grammar/abstract/MiniLexicon.gf
diff --git a/lab2/grammar/api/MiniSyntax.gf b/labs1-2/grammar/api/MiniSyntax.gf
similarity index 100%
rename from lab2/grammar/api/MiniSyntax.gf
rename to labs1-2/grammar/api/MiniSyntax.gf
diff --git a/lab2/grammar/api/MiniSyntaxEng.gf b/labs1-2/grammar/api/MiniSyntaxEng.gf
similarity index 100%
rename from lab2/grammar/api/MiniSyntaxEng.gf
rename to labs1-2/grammar/api/MiniSyntaxEng.gf
diff --git a/lab2/grammar/application-2022/Doctor.gf b/labs1-2/grammar/application-2022/Doctor.gf
similarity index 100%
rename from lab2/grammar/application-2022/Doctor.gf
rename to labs1-2/grammar/application-2022/Doctor.gf
diff --git a/lab2/grammar/application-2022/DoctorEng.gf b/labs1-2/grammar/application-2022/DoctorEng.gf
similarity index 100%
rename from lab2/grammar/application-2022/DoctorEng.gf
rename to labs1-2/grammar/application-2022/DoctorEng.gf
diff --git a/lab2/grammar/application-2022/DoctorMiniEng.gf b/labs1-2/grammar/application-2022/DoctorMiniEng.gf
similarity index 100%
rename from lab2/grammar/application-2022/DoctorMiniEng.gf
rename to labs1-2/grammar/application-2022/DoctorMiniEng.gf
diff --git a/lab2/grammar/english/MicroLangEng.gf b/labs1-2/grammar/english/MicroLangEng.gf
similarity index 100%
rename from lab2/grammar/english/MicroLangEng.gf
rename to labs1-2/grammar/english/MicroLangEng.gf
diff --git a/lab2/grammar/english/MicroLangEng.labels b/labs1-2/grammar/english/MicroLangEng.labels
similarity index 100%
rename from lab2/grammar/english/MicroLangEng.labels
rename to labs1-2/grammar/english/MicroLangEng.labels
diff --git a/lab2/grammar/english/MicroResEng.gf b/labs1-2/grammar/english/MicroResEng.gf
similarity index 100%
rename from lab2/grammar/english/MicroResEng.gf
rename to labs1-2/grammar/english/MicroResEng.gf
diff --git a/lab2/grammar/english/MiniGrammarEng.gf b/labs1-2/grammar/english/MiniGrammarEng.gf
similarity index 100%
rename from lab2/grammar/english/MiniGrammarEng.gf
rename to labs1-2/grammar/english/MiniGrammarEng.gf
diff --git a/lab2/grammar/english/MiniLangEng.gf b/labs1-2/grammar/english/MiniLangEng.gf
similarity index 100%
rename from lab2/grammar/english/MiniLangEng.gf
rename to labs1-2/grammar/english/MiniLangEng.gf
diff --git a/lab2/grammar/english/MiniLangEng.labels b/labs1-2/grammar/english/MiniLangEng.labels
similarity index 100%
rename from lab2/grammar/english/MiniLangEng.labels
rename to labs1-2/grammar/english/MiniLangEng.labels
diff --git a/lab2/grammar/english/MiniLexiconEng.gf b/labs1-2/grammar/english/MiniLexiconEng.gf
similarity index 100%
rename from lab2/grammar/english/MiniLexiconEng.gf
rename to labs1-2/grammar/english/MiniLexiconEng.gf
diff --git a/lab2/grammar/english/MiniParadigmsEng.gf b/labs1-2/grammar/english/MiniParadigmsEng.gf
similarity index 100%
rename from lab2/grammar/english/MiniParadigmsEng.gf
rename to labs1-2/grammar/english/MiniParadigmsEng.gf
diff --git a/lab2/grammar/english/MiniResEng.gf b/labs1-2/grammar/english/MiniResEng.gf
similarity index 100%
rename from lab2/grammar/english/MiniResEng.gf
rename to labs1-2/grammar/english/MiniResEng.gf
diff --git a/lab2/grammar/foods/Foods.gf b/labs1-2/grammar/foods/Foods.gf
similarity index 100%
rename from lab2/grammar/foods/Foods.gf
rename to labs1-2/grammar/foods/Foods.gf
diff --git a/lab2/grammar/functor/MicroLangFunctor.gf b/labs1-2/grammar/functor/MicroLangFunctor.gf
similarity index 100%
rename from lab2/grammar/functor/MicroLangFunctor.gf
rename to labs1-2/grammar/functor/MicroLangFunctor.gf
diff --git a/lab2/grammar/functor/MicroLangFunctorEng.gf b/labs1-2/grammar/functor/MicroLangFunctorEng.gf
similarity index 100%
rename from lab2/grammar/functor/MicroLangFunctorEng.gf
rename to labs1-2/grammar/functor/MicroLangFunctorEng.gf
diff --git a/lab2/grammar/functor/MicroLangFunctorSwe.gf b/labs1-2/grammar/functor/MicroLangFunctorSwe.gf
similarity index 100%
rename from lab2/grammar/functor/MicroLangFunctorSwe.gf
rename to labs1-2/grammar/functor/MicroLangFunctorSwe.gf
diff --git a/lab2/grammar/functor/MiniLangFunctor.gf b/labs1-2/grammar/functor/MiniLangFunctor.gf
similarity index 100%
rename from lab2/grammar/functor/MiniLangFunctor.gf
rename to labs1-2/grammar/functor/MiniLangFunctor.gf
diff --git a/lab2/grammar/functor/MiniLangFunctorEng.gf b/labs1-2/grammar/functor/MiniLangFunctorEng.gf
similarity index 100%
rename from lab2/grammar/functor/MiniLangFunctorEng.gf
rename to labs1-2/grammar/functor/MiniLangFunctorEng.gf
diff --git a/lab2/grammar/functor/MiniLangFunctorSwe.gf b/labs1-2/grammar/functor/MiniLangFunctorSwe.gf
similarity index 100%
rename from lab2/grammar/functor/MiniLangFunctorSwe.gf
rename to labs1-2/grammar/functor/MiniLangFunctorSwe.gf
diff --git a/lab2/grammar/intro/MicroLangEng.gf b/labs1-2/grammar/intro/MicroLangEng.gf
similarity index 100%
rename from lab2/grammar/intro/MicroLangEng.gf
rename to labs1-2/grammar/intro/MicroLangEng.gf
diff --git a/lab2/grammar/intro/MicroResEng.gf b/labs1-2/grammar/intro/MicroResEng.gf
similarity index 100%
rename from lab2/grammar/intro/MicroResEng.gf
rename to labs1-2/grammar/intro/MicroResEng.gf
diff --git a/lab2/grammar/italian/MicroLangIta.gf b/labs1-2/grammar/italian/MicroLangIta.gf
similarity index 100%
rename from lab2/grammar/italian/MicroLangIta.gf
rename to labs1-2/grammar/italian/MicroLangIta.gf
diff --git a/lab2/grammar/italian/MicroLangIta.gfo b/labs1-2/grammar/italian/MicroLangIta.gfo
similarity index 100%
rename from lab2/grammar/italian/MicroLangIta.gfo
rename to labs1-2/grammar/italian/MicroLangIta.gfo
diff --git a/lab2/grammar/italian/MicroResIta.gf b/labs1-2/grammar/italian/MicroResIta.gf
similarity index 100%
rename from lab2/grammar/italian/MicroResIta.gf
rename to labs1-2/grammar/italian/MicroResIta.gf
diff --git a/lab2/grammar/italian/MicroResIta.gfo b/labs1-2/grammar/italian/MicroResIta.gfo
similarity index 100%
rename from lab2/grammar/italian/MicroResIta.gfo
rename to labs1-2/grammar/italian/MicroResIta.gfo
diff --git a/lab2/grammar/myproject/MicroLangEng.gf b/labs1-2/grammar/myproject/MicroLangEng.gf
similarity index 100%
rename from lab2/grammar/myproject/MicroLangEng.gf
rename to labs1-2/grammar/myproject/MicroLangEng.gf
diff --git a/lab2/grammar/myproject/MicroLangEng.gfo b/labs1-2/grammar/myproject/MicroLangEng.gfo
similarity index 100%
rename from lab2/grammar/myproject/MicroLangEng.gfo
rename to labs1-2/grammar/myproject/MicroLangEng.gfo
diff --git a/lab2/grammar/myproject/MicroLangSwe.gf b/labs1-2/grammar/myproject/MicroLangSwe.gf
similarity index 100%
rename from lab2/grammar/myproject/MicroLangSwe.gf
rename to labs1-2/grammar/myproject/MicroLangSwe.gf
diff --git a/lab2/grammar/myproject/MicroLangSwe.gfo b/labs1-2/grammar/myproject/MicroLangSwe.gfo
similarity index 100%
rename from lab2/grammar/myproject/MicroLangSwe.gfo
rename to labs1-2/grammar/myproject/MicroLangSwe.gfo
diff --git a/lab2/grammar/myproject/MicroResEng.gf b/labs1-2/grammar/myproject/MicroResEng.gf
similarity index 100%
rename from lab2/grammar/myproject/MicroResEng.gf
rename to labs1-2/grammar/myproject/MicroResEng.gf
diff --git a/lab2/grammar/myproject/MicroResEng.gfo b/labs1-2/grammar/myproject/MicroResEng.gfo
similarity index 100%
rename from lab2/grammar/myproject/MicroResEng.gfo
rename to labs1-2/grammar/myproject/MicroResEng.gfo
diff --git a/lab2/grammar/myproject/MicroResSwe.gf b/labs1-2/grammar/myproject/MicroResSwe.gf
similarity index 100%
rename from lab2/grammar/myproject/MicroResSwe.gf
rename to labs1-2/grammar/myproject/MicroResSwe.gf
diff --git a/lab2/grammar/myproject/MicroResSwe.gfo b/labs1-2/grammar/myproject/MicroResSwe.gfo
similarity index 100%
rename from lab2/grammar/myproject/MicroResSwe.gfo
rename to labs1-2/grammar/myproject/MicroResSwe.gfo
diff --git a/lab2/grammar/test.gfs b/labs1-2/grammar/test.gfs
similarity index 100%
rename from lab2/grammar/test.gfs
rename to labs1-2/grammar/test.gfs
diff --git a/lab2/intro/Intro.gf b/labs1-2/intro/Intro.gf
similarity index 100%
rename from lab2/intro/Intro.gf
rename to labs1-2/intro/Intro.gf
diff --git a/lab2/intro/IntroEng.gf b/labs1-2/intro/IntroEng.gf
similarity index 100%
rename from lab2/intro/IntroEng.gf
rename to labs1-2/intro/IntroEng.gf
diff --git a/lab2/intro/IntroFre.gf b/labs1-2/intro/IntroFre.gf
similarity index 100%
rename from lab2/intro/IntroFre.gf
rename to labs1-2/intro/IntroFre.gf
diff --git a/lab2/intro/english.cf b/labs1-2/intro/english.cf
similarity index 100%
rename from lab2/intro/english.cf
rename to labs1-2/intro/english.cf
diff --git a/lab2/wikipedia-2022/Countries.gf b/labs1-2/wikipedia-2022/Countries.gf
similarity index 100%
rename from lab2/wikipedia-2022/Countries.gf
rename to labs1-2/wikipedia-2022/Countries.gf
diff --git a/lab2/wikipedia-2022/CountriesEng.gf b/labs1-2/wikipedia-2022/CountriesEng.gf
similarity index 100%
rename from lab2/wikipedia-2022/CountriesEng.gf
rename to labs1-2/wikipedia-2022/CountriesEng.gf
diff --git a/lab2/wikipedia-2022/CountriesFin.gf b/labs1-2/wikipedia-2022/CountriesFin.gf
similarity index 100%
rename from lab2/wikipedia-2022/CountriesFin.gf
rename to labs1-2/wikipedia-2022/CountriesFin.gf
diff --git a/lab2/wikipedia-2022/CountriesGer.gf b/labs1-2/wikipedia-2022/CountriesGer.gf
similarity index 100%
rename from lab2/wikipedia-2022/CountriesGer.gf
rename to labs1-2/wikipedia-2022/CountriesGer.gf
diff --git a/lab2/wikipedia-2022/CountriesSwe.gf b/labs1-2/wikipedia-2022/CountriesSwe.gf
similarity index 100%
rename from lab2/wikipedia-2022/CountriesSwe.gf
rename to labs1-2/wikipedia-2022/CountriesSwe.gf
diff --git a/lab2/wikipedia-2022/CountryNames.gf b/labs1-2/wikipedia-2022/CountryNames.gf
similarity index 100%
rename from lab2/wikipedia-2022/CountryNames.gf
rename to labs1-2/wikipedia-2022/CountryNames.gf
diff --git a/lab2/wikipedia-2022/CountryNamesEng.gf b/labs1-2/wikipedia-2022/CountryNamesEng.gf
similarity index 100%
rename from lab2/wikipedia-2022/CountryNamesEng.gf
rename to labs1-2/wikipedia-2022/CountryNamesEng.gf
diff --git a/lab2/wikipedia-2022/CountryNamesFin.gf b/labs1-2/wikipedia-2022/CountryNamesFin.gf
similarity index 100%
rename from lab2/wikipedia-2022/CountryNamesFin.gf
rename to labs1-2/wikipedia-2022/CountryNamesFin.gf
diff --git a/lab2/wikipedia-2022/CountryNamesGer.gf b/labs1-2/wikipedia-2022/CountryNamesGer.gf
similarity index 100%
rename from lab2/wikipedia-2022/CountryNamesGer.gf
rename to labs1-2/wikipedia-2022/CountryNamesGer.gf
diff --git a/lab2/wikipedia-2022/CountryNamesSwe.gf b/labs1-2/wikipedia-2022/CountryNamesSwe.gf
similarity index 100%
rename from lab2/wikipedia-2022/CountryNamesSwe.gf
rename to labs1-2/wikipedia-2022/CountryNamesSwe.gf
diff --git a/lab2/wikipedia-2022/Facts.gf b/labs1-2/wikipedia-2022/Facts.gf
similarity index 100%
rename from lab2/wikipedia-2022/Facts.gf
rename to labs1-2/wikipedia-2022/Facts.gf
diff --git a/lab2/wikipedia-2022/FactsEng.gf b/labs1-2/wikipedia-2022/FactsEng.gf
similarity index 100%
rename from lab2/wikipedia-2022/FactsEng.gf
rename to labs1-2/wikipedia-2022/FactsEng.gf
diff --git a/lab2/wikipedia-2022/FactsFin.gf b/labs1-2/wikipedia-2022/FactsFin.gf
similarity index 100%
rename from lab2/wikipedia-2022/FactsFin.gf
rename to labs1-2/wikipedia-2022/FactsFin.gf
diff --git a/lab2/wikipedia-2022/FactsGer.gf b/labs1-2/wikipedia-2022/FactsGer.gf
similarity index 100%
rename from lab2/wikipedia-2022/FactsGer.gf
rename to labs1-2/wikipedia-2022/FactsGer.gf
diff --git a/lab2/wikipedia-2022/FactsSwe.gf b/labs1-2/wikipedia-2022/FactsSwe.gf
similarity index 100%
rename from lab2/wikipedia-2022/FactsSwe.gf
rename to labs1-2/wikipedia-2022/FactsSwe.gf
diff --git a/lab2/wikipedia-2022/country_facts.py b/labs1-2/wikipedia-2022/country_facts.py
similarity index 100%
rename from lab2/wikipedia-2022/country_facts.py
rename to labs1-2/wikipedia-2022/country_facts.py
diff --git a/lab2/wikipedia-2022/data_facts.py b/labs1-2/wikipedia-2022/data_facts.py
similarity index 100%
rename from lab2/wikipedia-2022/data_facts.py
rename to labs1-2/wikipedia-2022/data_facts.py
diff --git a/lab2/wikipedia-2022/extract_names.py b/labs1-2/wikipedia-2022/extract_names.py
similarity index 100%
rename from lab2/wikipedia-2022/extract_names.py
rename to labs1-2/wikipedia-2022/extract_names.py