last year's lecture material moved to directory 2025

This commit is contained in:
aarneranta
2026-03-30 07:43:08 +02:00
parent 088f52a0f6
commit 9d0f650881
39 changed files with 0 additions and 0 deletions

lectures/2025/README.md Normal file

@@ -0,0 +1,175 @@
# Computational Syntax Lectures: Outline
## Lecture 1
Course notes: Chapter 1
Participants' native languages:
Chinese (2), Dutch, English, Finnish, French (2), Greek, Hebrew, Italian (3),
Korean, Persian (2), Polish, Portuguese, Romanian (3), Russian, Spanish, Swedish (2),
Swiss German, West-Assyrian - 24 students, 17 languages + 2 teachers, 1 more language
Formal grammar is no longer expected to match natural language exactly
- analysis: should be wider than the language (we will use UD)
- generation: should be contained in the language (we will use GF)
- in both formats, we aim to use universal concepts for many languages
Phrase structure grammars, context-free = BNF, grammar rules, trees
- example: [english.cf](lecture-01/english.cf)
- testing grammars in GF: import, generate_random, parse, linearize, visualize_parse, help
GF grammars: dividing .cf into abstract and concrete .gf
- example: [Intro*.gf](lecture-01/)
- forms of rules: cat, fun, lincat, lin
- word order switch English-Italian
- to solve next time:
Experiments in GF:
- https://cloud.grammaticalframework.org/minibar/minibar.html
- Grammar: ResourceDemo, Startcat: S
## Lecture 2
Agreement, parameter definitions, variable and inherent features, linearization types
[IntroEng.gf](lecture-02/IntroEng.gf)
For you to do:
- write a concrete syntax for some other language, carefully thinking about agreement
### Instructions for ARM Mac users
The GF download page contains a binary for Macs with an Intel processor, but it will not work on newer Macs, which use an ARM processor (called M1, M2, or M3 by Apple).
We have therefore prepared a binary for these newer Macs.
To download it, open a terminal and do:
```
cd # go to your home directory
mkdir tmp # if the directory tmp does not exist already
cd tmp
wget https://www.grammaticalframework.org/~aarne/gf-mac.gz
```
This is better than downloading via a browser, after which macOS may block the use of the program as "unreliable".
After download, stay in the terminal and do:
```
gunzip gf-mac.gz
mv gf-mac gf
chmod a+x gf
./gf
```
You should now see the GF prompt. Type 'help' to check that it works!
Hint: if any of the terminal commands used above are unfamiliar to you, it is a good time to learn them now.
They will be useful throughout your future career as a programmer!
The most readily available resource is the `man` command, for instance:
```
man gunzip
```
The next step is to move the binary to a place where it can be found from anywhere in your system.
One standard place is `/usr/local/bin`:
```
mv gf /usr/local/bin
```
If you get "permission denied", you will have to write
```
sudo mv gf /usr/local/bin
```
and enter your password.
Then you can try
```
cd
gf
```
to verify that GF works in your home directory.
After that, you can test it in the course GitHub directory
```
cd comp-syntax-gu-mlt/lectures/lecture-02
gf
> import IntroEng.gf # in GF
```
You can work here for a while.
The next step will be to install the RGL, but this can wait a bit.
The instructions in https://www.grammaticalframework.org/download/index-3.11.html should work even for the ARM Mac.
## Lecture 3
Course notes: Chapter 2, Chapter 5
Analysing UD data with shell commands:
```
cat treebanks/UD_Swedish-Talbanken/sv_talbanken-ud-train.conllu | cut -f4 | grep -v "#" | sort
cat treebanks/UD_Swedish-Talbanken/sv_talbanken-ud-train.conllu | cut -f4 | grep -v "#" | sort -u
cat treebanks/UD_Swedish-Talbanken/sv_talbanken-ud-train.conllu | cut -f4 | grep -v "#" | sort -u | wc
```
Again, make sure to learn to use these shell commands!
Adding deptreepy to the pipeline:
```
cat treebanks/UD_English-EWT/en_ewt-ud-train.conllu | ./deptreepy.py "statistics POS"
cat treebanks/UD_English-EWT/en_ewt-ud-train.conllu | ./deptreepy.py "match_wordlines (POS X)"
cat treebanks/UD_English-EWT/en_ewt-ud-train.conllu | ./deptreepy.py "statistics FEATS"
cat treebanks/UD_English-EWT/en_ewt-ud-train.conllu | ./deptreepy.py "match_wordlines (POS NOUN) | statistics FEATS"
```
Download deptreepy and the UD treebanks, and run the same commands on treebanks of other languages!
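The same statistics can also be computed in plain Python, which may help to see what the pipelines above do. A minimal sketch (the three-token sample below is made up, with only the first four CoNLL-U columns filled in):

```python
from collections import Counter

def pos_statistics(conllu):
    """Count POS tags (column 4), like `cut -f4 | grep -v "#" | sort | uniq -c`."""
    counts = Counter()
    for line in conllu.splitlines():
        fields = line.split("\t")
        # skip comment lines and anything without at least four columns
        if len(fields) >= 4 and not line.startswith("#"):
            counts[fields[3]] += 1
    return counts

sample = "1\tthe\tthe\tDET\n2\tcat\tcat\tNOUN\n3\tsleeps\tsleep\tVERB"
print(pos_statistics(sample))
```

Piping a real treebank file into such a script is essentially what deptreepy's `statistics POS` does, with more bookkeeping.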
We confirmed the Swedish inflection table by looking up a word at https://svenska.se/, also learning which features are inherent and which are variable.
We started MorphologyEng.gf and MorphologySwe.gf in lecture-03/.
## Lecture 7
We took a look at the RGL synopsis, https://www.grammaticalframework.org/lib/doc/synopsis/
We focused on a few things:
- the hierarchic view of categories (Chapter 1)
- Sentence/Clause distinction, looking at "inflection tables" of clauses in https://cloud.grammaticalframework.org/minibar/minibar.html (ResourceDemo, startcat Cl)
- verb valencies: V, V2, V3, VA, VV, etc., and the "sense distinctions" that come with different valency patterns and are typically translated with different words
Examples of verb valencies of "look":
- V: look ; titta
- V2: look at ; titta på
- V2: look for ; leta efter
- V2: look after ; ta hand om
- V2: look like ; se ut som
- V3: look up ; slå upp
- VA: look (good) ; se (bra) ut
We also briefly discussed complements vs. adjuncts and pointed out that they can be difficult to distinguish and that UD does not even try.
## Lecture 8
Installing RGL: a binary release can be found in
https://github.com/GrammaticalFramework/gf-rgl/releases/tag/20250429
Steps:
1. Download rgl-20250429.tgz
2. Put it into some good place, for instance ~/GF or /usr/local/lib
3. Uncompress it with `tar xvfz`
4. The top directory created is lib, with subdirectories alltenses, prelude, present. List them to see lots of .gfo files
5. Export the absolute path to this lib as the value of the environment variable `GF_LIB_PATH`, which GF recognizes: `export GF_LIB_PATH=/Users/aarne/GF/lib` if this is where you have placed it.
6. This export command can also be added to your .bashrc or .zprofile, or whatever shell initialization file you have
When you have done this, you can test if it works in the following way:
```
$ gf
> i alltenses/LangEng.gfo
> gr -cat=Cl | l -table
```
We also looked at the source of the RGL, obtained by cloning https://github.com/GrammaticalFramework/gf-rgl
The binaries can be compiled from this source, for which you need a Haskell compiler.
If you don't have one, you can still keep the sources just for documentation.
The source modules can also be imported directly in the GF program, but it is easier to compile the whole RGL with `make install`, which requires Haskell.


@@ -0,0 +1,30 @@
abstract Intro = {
cat
S ;
NP ;
VP ;
CN ;
AP ;
Det ;
Pron ;
N ;
A ;
V2 ;
fun
PredVP : NP -> VP -> S ;
ComplV2 : V2 -> NP -> VP ;
DetCN : Det -> CN -> NP ;
AdjCN : AP -> CN -> CN ;
UseN : N -> CN ;
UseA : A -> AP ;
UsePron : Pron -> NP ;
the_Det : Det ;
black_A : A ;
cat_N : N ;
see_V2 : V2 ;
we_Pron : Pron ;
}


@@ -0,0 +1,30 @@
concrete IntroEng of Intro = {
lincat
S = Str ;
NP = Str ;
VP = Str ;
CN = Str ;
AP = Str ;
Det = Str ;
Pron = Str ;
N = Str ;
A = Str ;
V2 = Str ;
lin
PredVP np vp = np ++ vp ;
ComplV2 v2 np = v2 ++ np ;
DetCN det cn = det ++ cn ;
AdjCN ap cn = ap ++ cn ;
UseN n = n ;
UseA a = a ;
UsePron pron = pron ;
the_Det = "the" ;
black_A = "black" ;
cat_N = "cat" ;
see_V2 = "sees" ;
we_Pron = "us" ;
}


@@ -0,0 +1,30 @@
concrete IntroIta of Intro = {
lincat
S = Str ;
NP = Str ;
VP = Str ;
CN = Str ;
AP = Str ;
Det = Str ;
Pron = Str ;
N = Str ;
A = Str ;
V2 = Str ;
lin
PredVP np vp = np ++ vp ;
ComplV2 v2 np = np ++ v2 ;
DetCN det cn = det ++ cn ;
AdjCN ap cn = cn ++ ap ;
UseN n = n ;
UseA a = a ;
UsePron pron = pron ;
the_Det = "il" ;
black_A = "nero" ;
cat_N = "gatto" ;
see_V2 = "vede" ;
we_Pron = "ci" ;
}


@@ -0,0 +1,14 @@
S ::= NP VP ;
NP ::= Det CN ;
NP ::= Pron ;
CN ::= AP CN ;
CN ::= N ;
VP ::= V2 NP ;
AP ::= A ;
Det ::= "the" ;
N ::= "cat" ;
A ::= "black" ;
V2 ::= "sees" ;
Pron ::= "us" ;


@@ -0,0 +1,30 @@
abstract Intro = {
cat
S ;
NP ;
VP ;
CN ;
AP ;
Det ;
Pron ;
N ;
A ;
V2 ;
fun
PredVP : NP -> VP -> S ;
ComplV2 : V2 -> NP -> VP ;
DetCN : Det -> CN -> NP ;
AdjCN : AP -> CN -> CN ;
UseN : N -> CN ;
UseA : A -> AP ;
UsePron : Pron -> NP ;
the_Det : Det ;
black_A : A ;
cat_N : N ;
see_V2 : V2 ;
we_Pron : Pron ;
}


@@ -0,0 +1,40 @@
concrete IntroEng of Intro = {
lincat
S = Str ;
NP = {s : Case => Str ; a : Agr} ;
VP = Agr => Str ;
CN = Str ;
AP = Str ;
Det = Str ;
Pron = {s : Case => Str ; a : Agr} ;
N = Str ;
A = Str ;
V2 = Agr => Str ;
lin
PredVP np vp = np.s ! Nom ++ vp ! np.a ;
ComplV2 v2 np = table {a => v2 ! a ++ np.s ! Acc} ;
DetCN det cn = {
s = table {_ => det ++ cn} ;
a = SgP3
} ;
AdjCN ap cn = ap ++ cn ;
UseN n = n ;
UseA a = a ;
UsePron pron = pron ;
the_Det = "the" ;
black_A = "black" ;
cat_N = "cat" ;
see_V2 = table {SgP3 => "sees" ; Other => "see"} ;
we_Pron = {
s = table {Nom => "we" ; Acc => "us"} ;
a = Other
} ;
param
Case = Nom | Acc ;
Agr = SgP3 | Other ;
}


@@ -0,0 +1,30 @@
concrete IntroIta of Intro = {
lincat
S = Str ;
NP = Str ;
VP = Str ;
CN = Str ;
AP = Str ;
Det = Str ;
Pron = Str ;
N = Str ;
A = Str ;
V2 = Str ;
lin
PredVP np vp = np ++ vp ;
ComplV2 v2 np = np ++ v2 ;
DetCN det cn = det ++ cn ;
AdjCN ap cn = cn ++ ap ;
UseN n = n ;
UseA a = a ;
UsePron pron = pron ;
the_Det = "il" ;
black_A = "nero" ;
cat_N = "gatto" ;
see_V2 = "vede" ;
we_Pron = "ci" ;
}


@@ -0,0 +1,34 @@
resource MorphologyEng = {
param
Number = Sg | Pl ;
oper
Noun : Type = {s : Number => Str} ;
mkNoun : Str -> Str -> Noun = \sg, pl ->
{s = table {Sg => sg ; Pl => pl}} ;
regNoun : Str -> Noun = \sg -> mkNoun sg (sg + "s") ;
smartNoun : Str -> Noun = \sg -> case sg of {
_ + ("s" | "ch" | "sh") => mkNoun sg (sg + "es") ;
_ + ("ay" | "ey" | "oy" | "uy") => regNoun sg ;
x + "y" => mkNoun sg (x + "ies") ;
_ => regNoun sg
} ;
-- to test
teacher_N : Noun = {s = table {Sg => "teacher" ; Pl => "teachers"}} ;
cat_N : Noun = mkNoun "cat" "cats" ;
dog_N : Noun = regNoun "dog" ;
bus_N : Noun = smartNoun "bus" ;
baby_N : Noun = smartNoun "baby" ;
fly_N : Noun = smartNoun "fly" ;
}


@@ -0,0 +1,61 @@
resource MorphologySwe = {
param
Case = Nom | Gen ;
Definite = Ind | Def ;
Gender = Com | Neut ;
Number = Sg | Pl ;
NForm = NF Number Definite Case ; -- NF is a constructor
oper
-- Noun = {s : Number => Definite => Case => Str ; g : Gender} ;
Noun = {s : NForm => Str ; g : Gender} ;
mkNoun : (sin, sig, sdn, sdg, pin, pig, pdn, pdg : Str) -> Gender -> Noun =
\sin, sig, sdn, sdg, pin, pig, pdn, pdg, g -> {
s = table {
NF Sg Ind Nom => sin ;
NF Sg Ind Gen => sig ;
NF Sg Def Nom => sdn ;
NF Sg Def Gen => sdg ;
NF Pl Ind Nom => pin ;
NF Pl Ind Gen => pig ;
NF Pl Def Nom => pdn ;
NF Pl Def Gen => pdg
} ;
g = g
} ;
addS : Str -> Str = \s -> case s of {
_ + ("s" | "x" | "z") => s ;
_ => s + "s"
} ;
mk4Noun : (sin, sdn, pin, pdn : Str) -> Noun =
\sin, sdn, pin, pdn -> {
s = table {
NF Sg Ind Nom => sin ;
NF Sg Ind Gen => addS sin ;
NF Sg Def Nom => sdn ;
NF Sg Def Gen => addS sdn ;
NF Pl Ind Nom => pin ;
NF Pl Ind Gen => addS pin ;
NF Pl Def Nom => pdn ;
NF Pl Def Gen => addS pdn
} ;
g = case sdn of {
_ + "n" => Com ;
_ => Neut
}
} ;
smartNoun : Str -> Noun = \mamma -> case mamma of {
mamm + "a" => mkNoun mamma (mamma + "s") (mamma + "n") (mamma + "ns")
(mamm + "or") (mamm + "ors") (mamm + "orna") (mamm + "ornas")
Com ;
bil => mkNoun bil (bil + "s") (bil + "en") (bil + "ens")
(bil + "ar") (bil + "ars") (bil + "arna") (bil + "arnas") Com
} ;
}


@@ -0,0 +1,21 @@
abstract Agreement = {
cat
NP ;
CN ;
N ;
A ;
Det ;
fun
DetCN : Det -> CN -> NP ; -- this black cat
AdjCN : A -> N -> CN ; -- black cat
UseN : N -> CN ; -- cat
cat_N : N ;
house_N : N ;
black_A : A ;
big_A : A ;
-- simplified demonstratives, just to make English agreement interesting
this_Det : Det ;
these_Det : Det ;
}


@@ -0,0 +1,32 @@
concrete AgreementEng of Agreement = open MorphologyEng in {
lincat
NP = {s: Str; n: Number} ;
CN = Noun ;
N = Noun ;
A = {s: Str} ;
Det = {s: Str; n: Number} ;
lin
DetCN d cn = {
s = d.s ++ (cn.s ! d.n) ;
n = d.n ;
} ;
AdjCN a cn = {
s = \\n => a.s ++ (cn.s ! n) ;
} ;
UseN n = n ;
cat_N = regNoun "cat" ;
house_N = regNoun "house" ;
black_A = {s = "black"} ;
big_A = {s = "big"} ;
this_Det = {
s = "this";
n = Sg ;
} ;
these_Det = {
s = "these";
n = Pl ;
} ;
}


@@ -0,0 +1,42 @@
concrete AgreementSwe of Agreement = open MorphologySwe in {
lincat
NP = {s: Str; a: NPAgreement} ;
CN = Noun ;
N = Noun ;
A = Adjective ;
Det = {s : Gender => Str; n: Number; d: Definite} ; -- gender-dependent form, with number and definiteness
lin
DetCN d cn = {
s = (d.s ! cn.g) ++ (cn.s ! (NF d.n d.d Nom)) ;
a = NPAgr d.n d.d cn.g ;
} ;
AdjCN a n = {
s = \\nf => let agr = NPAgr (nform2number nf) (nform2definite nf) n.g
in (a.s ! agr) ++ (n.s ! nf) ;
g = n.g
} ;
UseN n = n ;
cat_N = mk4Noun "katt" "katten" "katter" "katterna" ;
house_N = mk4Noun "hus" "huset" "hus" "husen" ;
black_A = mk3Adjective "svart" "svart" "svarta" ;
big_A = mk3Adjective "stor" "stort" "stora" ;
this_Det = {
s = table {
Com => "den här" ;
Neut => "det här"
} ;
n = Sg ;
d = Def ;
} ;
these_Det = {
s = table {
Com => "de här" ;
Neut => "de här"
};
n = Pl ;
d = Def ;
} ;
}


@@ -0,0 +1,34 @@
resource MorphologyEng = {
param
Number = Sg | Pl ;
oper
Noun : Type = {s : Number => Str} ;
mkNoun : Str -> Str -> Noun = \sg, pl ->
{s = table {Sg => sg ; Pl => pl}} ;
regNoun : Str -> Noun = \sg -> mkNoun sg (sg + "s") ;
smartNoun : Str -> Noun = \sg -> case sg of {
_ + ("s" | "ch" | "sh") => mkNoun sg (sg + "es") ;
_ + ("ay" | "ey" | "oy" | "uy") => regNoun sg ;
x + "y" => mkNoun sg (x + "ies") ;
_ => regNoun sg
} ;
-- to test
teacher_N : Noun = {s = table {Sg => "teacher" ; Pl => "teachers"}} ;
cat_N : Noun = mkNoun "cat" "cats" ;
dog_N : Noun = regNoun "dog" ;
bus_N : Noun = smartNoun "bus" ;
baby_N : Noun = smartNoun "baby" ;
fly_N : Noun = smartNoun "fly" ;
}


@@ -0,0 +1,85 @@
resource MorphologySwe = {
param
Case = Nom | Gen ;
Definite = Ind | Def ;
Gender = Com | Neut ;
Number = Sg | Pl ;
NForm = NF Number Definite Case ; -- NF is a constructor
NPAgreement = NPAgr Number Definite Gender ;
oper
nform2number : NForm -> Number = \nf -> case nf of {
(NF n _ _) => n
} ;
nform2definite : NForm -> Definite = \nf -> case nf of {
(NF _ d _) => d
} ;
-- Noun = {s : Number => Definite => Case => Str ; g : Gender} ;
Noun = {s : NForm => Str ; g : Gender} ;
Adjective = { s: NPAgreement => Str } ;
mkNoun : (sin, sig, sdn, sdg, pin, pig, pdn, pdg : Str) -> Gender -> Noun =
\sin, sig, sdn, sdg, pin, pig, pdn, pdg, g -> {
s = table {
NF Sg Ind Nom => sin ;
NF Sg Ind Gen => sig ;
NF Sg Def Nom => sdn ;
NF Sg Def Gen => sdg ;
NF Pl Ind Nom => pin ;
NF Pl Ind Gen => pig ;
NF Pl Def Nom => pdn ;
NF Pl Def Gen => pdg
} ;
g = g
} ;
addS : Str -> Str = \s -> case s of {
_ + ("s" | "x" | "z") => s ;
_ => s + "s"
} ;
mk3Adjective : (stor, stort, stora : Str) -> Adjective = \stor, stort, stora -> {
s = table {
NPAgr Sg Ind Com => stor ;
NPAgr Sg Ind Neut => stort ;
NPAgr Sg Def Com => stora ;
NPAgr Sg Def Neut => stora ;
NPAgr Pl Ind Com => stora ;
NPAgr Pl Ind Neut => stora ;
NPAgr Pl Def Com => stora ;
NPAgr Pl Def Neut => stora
}
} ;
mk4Noun : (sin, sdn, pin, pdn : Str) -> Noun =
\sin, sdn, pin, pdn -> {
s = table {
NF Sg Ind Nom => sin ;
NF Sg Ind Gen => addS sin ;
NF Sg Def Nom => sdn ;
NF Sg Def Gen => addS sdn ;
NF Pl Ind Nom => pin ;
NF Pl Ind Gen => addS pin ;
NF Pl Def Nom => pdn ;
NF Pl Def Gen => addS pdn
} ;
g = case sdn of {
_ + "n" => Com ;
_ => Neut
}
} ;
smartNoun : Str -> Noun = \mamma -> case mamma of {
mamm + "a" => mkNoun mamma (mamma + "s") (mamma + "n") (mamma + "ns")
(mamm + "or") (mamm + "ors") (mamm + "orna") (mamm + "ornas")
Com ;
bil => mkNoun bil (bil + "s") (bil + "en") (bil + "ens")
(bil + "ar") (bil + "ars") (bil + "arna") (bil + "arnas") Com
} ;
}


@@ -0,0 +1,33 @@
## Number agreement in NPs
| Singular | Plural |
| --- | --- |
| black cat | black cats |
| musta kissa | __mustat__ kissat |
| gatto nero | gatti __neri__ |
| schwarze Katze | schwarze Katzen |
| chat noir | chats __noirs__ |
| μαύρη γάτα | __μαύρες__ γάτες |
| czarny kot | __czarne__ koty |
| gato negro | gatos __negros__ |
| pisică neagră | pisici __negre__ |
| svart katt | __svarta__ katter |
| zwarte kat | zwarte katten |
| gato preto | gatos __pretos__ |
| черная кошка | __черные__ кошки |
- these black cats - de här svarta katterna
- these black houses - de här svarta husen
- these cats - de här katterna
- these houses - de här husen
- this black cat - den här svarta katten
- this black house - det här svarta huset
- this cat - den här katten
- this house - det här huset
- big cat(s) - stor katt / stora katten / stora katter / stora katterna
- black cat(s) - svart katt / svarta katten / svarta katter / svarta katterna
- big house(s) - stort hus / stora huset / stora hus / stora husen
- black house(s) - svart hus / svarta huset / svarta hus / svarta husen
- cat - katt
- house - hus

lectures/2025/lecture-n-1/.gitignore vendored Normal file

@@ -0,0 +1,94 @@
## Core latex/pdflatex auxiliary files:
*.aux
*.lof
*.log
*.lot
*.fls
*.out
*.toc
## Intermediate documents:
*.dvi
# these rules might exclude image files for figures etc.
# *.ps
# *.eps
# *.pdf
## Bibliography auxiliary files (bibtex/biblatex/biber):
*.bbl
*.bcf
*.blg
*-blx.aux
*-blx.bib
*.run.xml
## Build tool auxiliary files:
*.fdb_latexmk
*.synctex.gz
*.synctex.gz(busy)
*.pdfsync
## Auxiliary and intermediate files from other packages:
# algorithms
*.alg
*.loa
# amsthm
*.thm
# beamer
*.nav
*.snm
*.vrb
# glossaries
*.acn
*.acr
*.glg
*.glo
*.gls
# hyperref
*.brf
# listings
*.lol
# makeidx
*.idx
*.ilg
*.ind
*.ist
# minitoc
*.maf
*.mtc
*.mtc0
# minted
*.pyg
# nomencl
*.nlo
# sagetex
*.sagetex.sage
*.sagetex.py
*.sagetex.scmd
# sympy
*.sout
*.sympy
sympy-plots-for-*.tex/
# todonotes
*.tdo
# xindy
*.xdy
# useless files
color_scheme.png
identicon.png
._wordcount_selection.tex


@@ -0,0 +1,187 @@
\usepackage{tikz}
\usetikzlibrary{calc}
% -------- COLOR SCHEME --------
\definecolor{PrimaryColor}{RGB}{7,79,140} % primary color (blue)
\definecolor{SecondaryColor}{RGB}{242,88,26} % bulleted lists
\definecolor{BackgroundColor}{RGB}{255,255,255} % background & titles (white)
\definecolor{TextColor}{RGB}{0,0,0} % text (black)
\definecolor{ProgBarBGColor}{RGB}{175,175,175} % progress bar background (grey)
% set colours
\setbeamercolor{normal text}{fg=TextColor}\usebeamercolor*{normal text}
\setbeamercolor{alerted text}{fg=PrimaryColor}
\setbeamercolor{section in toc}{fg=PrimaryColor}
\setbeamercolor{structure}{fg=SecondaryColor}
\hypersetup{colorlinks,linkcolor=,urlcolor=SecondaryColor}
% set fonts
\setbeamerfont{itemize/enumerate body}{size=\large}
\setbeamerfont{itemize/enumerate subbody}{size=\normalsize}
\setbeamerfont{itemize/enumerate subsubbody}{size=\small}
% make pixelated bullets
\setbeamertemplate{itemize item}{
\tikz{
\draw[fill=SecondaryColor,draw=none] (0, 0) rectangle(0.1, 0.1);
\draw[fill=SecondaryColor,draw=none] (0.1, 0.1) rectangle(0.2, 0.2);
\draw[fill=SecondaryColor,draw=none] (0, 0.2) rectangle(0.1, 0.3);
}
}
\setbeamertemplate{itemize subitem}{
\tikz{
\draw[fill=SecondaryColor,draw=none] (0, 0) rectangle(0.075, 0.075);
\draw[fill=SecondaryColor,draw=none] (0.075, 0.075) rectangle(0.15, 0.15);
\draw[fill=SecondaryColor,draw=none] (0, 0.15) rectangle(0.075, 0.225);
}
}
\setbeamertemplate{itemize subsubitem}{
\tikz{
\draw[fill=SecondaryColor,draw=none] (0.050, 0.050) rectangle(0.15, 0.15);
}
}
% disable navigation
\setbeamertemplate{navigation symbols}{}
% disable the damn default logo!
\setbeamertemplate{sidebar right}{}
% custom draw the title page above
\setbeamertemplate{title page}{}
% again, manually draw the frame title above
\setbeamertemplate{frametitle}{}
% disable "Figure:" in the captions
% TODO: somehow this doesn't work for md-generated slides
%\setbeamertemplate{caption}{\tiny\insertcaption}
%\setbeamertemplate{caption label separator}{}
% add some space below the footnotes so they don't end up on the progress bar
\setbeamertemplate{footnote}{
\parindent 0em
\noindent
\raggedright
\hbox to 0.8em{\hfil\insertfootnotemark}
\insertfootnotetext
\par
\vspace{2em}
}
% add the same vspace both before and after quotes
\setbeamertemplate{quote begin}{\vspace{0.5em}}
\setbeamertemplate{quote end}{\vspace{0.5em}}
% progress bar counters
\newcounter{showProgressBar}
\setcounter{showProgressBar}{1}
\newcounter{showSlideNumbers}
\setcounter{showSlideNumbers}{1}
\newcounter{showSlideTotal}
\setcounter{showSlideTotal}{1}
% use \makeatletter for our progress bar definitions
% progress bar idea from http://tex.stackexchange.com/a/59749/44221
% slightly adapted for visual purposes here
\makeatletter
\newcount\progressbar@tmpcounta% auxiliary counter
\newcount\progressbar@tmpcountb% auxiliary counter
\newdimen\progressbar@pbwidth %progressbar width
\newdimen\progressbar@tmpdim % auxiliary dimension
\newdimen\slidewidth % auxiliary dimension
\newdimen\slideheight % auxiliary dimension
% make the progress bar go across the screen
\progressbar@pbwidth=\the\paperwidth
\slidewidth=\the\paperwidth
\slideheight=\the\paperheight
% draw everything with tikz
\setbeamertemplate{background}{ % all slides
% progress bar stuff
\progressbar@tmpcounta=\insertframenumber
\progressbar@tmpcountb=\inserttotalframenumber
\progressbar@tmpdim=\progressbar@pbwidth
\divide\progressbar@tmpdim by 100
\multiply\progressbar@tmpdim by \progressbar@tmpcounta
\divide\progressbar@tmpdim by \progressbar@tmpcountb
\multiply\progressbar@tmpdim by 100
\begin{tikzpicture}
% set up the entire slide as the canvas
\useasboundingbox (0,0) rectangle(\the\paperwidth,\the\paperheight);
% background
\fill[color=BackgroundColor] (0,0) rectangle(\the\paperwidth,\the\paperheight);
\ifnum\thepage=1\relax % only title slides
% primary color rectangle
\fill[color=PrimaryColor] (0, 4cm) rectangle(\slidewidth,\slideheight);
% text (title, subtitle, author, date)
\node[anchor=south,text width=\slidewidth-1cm,inner xsep=0.5cm] at (0.5\slidewidth,4cm) {\color{BackgroundColor}\Huge\textbf{\inserttitle}};
\node[anchor=north east,text width=\slidewidth-1cm,align=right] at (\slidewidth-0.4cm,4cm) {\color{PrimaryColor}\large\textbf{\insertsubtitle}};
\node at (0.5\slidewidth,2cm) {\color{PrimaryColor}\LARGE\insertauthor};
\node at (0.5\slidewidth,1.25cm) {\color{PrimaryColor}\Large\insertinstitute};
\node[anchor=south east] at(\slidewidth,0cm) {\color{PrimaryColor}\tiny\insertdate};
\else % other slides
% title bar
\fill[color=PrimaryColor] (0, \slideheight-1cm) rectangle(\slidewidth,\slideheight);
% slide title
\node[anchor=north,text width=\slidewidth-0.75cm,inner xsep=0.5cm,inner ysep=0.25cm] at (0.5\slidewidth,\slideheight) {\color{BackgroundColor}\huge\textbf{\insertframetitle}};
% logo (TODO: autoscale; now it expects 350x350)
\node[anchor=north east] at (\slidewidth-0.25cm,\slideheight+0.06cm){\insertlogo};
% show progress bar
\ifnum \value{showProgressBar}>0\relax%
% progress bar icon in the middle of the screen
\draw[fill=ProgBarBGColor,draw=none] (0cm,0cm) rectangle(\slidewidth,0.25cm);
\draw[fill=PrimaryColor,draw=none] (0cm,0cm) rectangle(\progressbar@tmpdim,0.25cm);
% bottom info
\node[anchor=south west] at(0cm,0.25cm) {\color{PrimaryColor}\tiny\vphantom{lp}\insertsection};
% if slide numbers are active
\ifnum \value{showSlideNumbers}>0\relax%
% if slide totals are active
\ifnum \value{showSlideTotal}>0\relax%
% draw both slide number and slide total
\node[anchor=south east] at(\slidewidth,0.25cm) {\color{PrimaryColor}\tiny\insertframenumber/\inserttotalframenumber};
\else
\node[anchor=south east] at(\slidewidth,0.25cm) {\color{PrimaryColor}\tiny\insertframenumber};
\fi
\fi
\else
% section title in the bottom left
\node[anchor=south west] at(0cm,0cm) {\color{PrimaryColor}\tiny\vphantom{lp}\insertsection};
% if we're showing slide numbers
\ifnum \value{showSlideNumbers}>0\relax%
% if slide totals are active
\ifnum \value{showSlideTotal}>0\relax%
% slide number and slide total
\node[anchor=south east] at(\slidewidth,0cm) {\color{PrimaryColor}\tiny\insertframenumber/\inserttotalframenumber};
\else
\node[anchor=south east] at(\slidewidth,0cm) {\color{PrimaryColor}\tiny\insertframenumber};
\fi
\fi
\fi
\fi
\end{tikzpicture}
}
\makeatother
\AtBeginSection{\frame{\sectionpage}} % section pages
\setbeamertemplate{section page}
{
\begin{tikzpicture}
% set up the entire slide as the canvas
\useasboundingbox (0,0) rectangle(\slidewidth,\slideheight);
\fill[color=BackgroundColor] (-1cm, 2cm) rectangle (\slidewidth, \slideheight+0.1cm);
\fill[color=PrimaryColor] (-1cm, 0.5\slideheight-1cm) rectangle(\slidewidth, 0.5\slideheight+1cm);
\node[text width=\the\paperwidth-1cm,align=center] at (0.4\slidewidth, 0.5\slideheight) {\color{BackgroundColor}\Huge\textbf{\insertsection}};
\end{tikzpicture}
}


@@ -0,0 +1,6 @@
1 the the DET DT Definite=Def|PronType=Art 3 det _ TokenRange=0:3
2 black black ADJ JJ Degree=Pos 3 amod _ TokenRange=4:9
3 cat cat NOUN NN Number=Sing 4 nsubj _ TokenRange=10:13
4 sees see VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ TokenRange=14:18
5 us we PRON PRP Case=Acc|Number=Plur|Person=1|PronType=Prs 4 obj _ TokenRange=19:21
6 now now ADV RB PronType=Dem 4 advmod _ SpaceAfter=No|TokenRange=22:25


@@ -0,0 +1,51 @@
<svg width="317"
height="115"
viewBox="0 0 317 115"
version="1.1"
xmlns="http://www.w3.org/2000/svg">
<text x="5" y="108" font-size="16">the</text>
<text x="42" y="108" font-size="16">black</text>
<text x="97" y="108" font-size="16">cat</text>
<text x="143" y="108" font-size="16">sees</text>
<text x="189" y="108" font-size="16">us</text>
<text x="235" y="108" font-size="16">now</text>
<text x="5" y="93" font-size="10">DET</text>
<text x="42" y="93" font-size="10">ADJ</text>
<text x="97" y="93" font-size="10">NOUN</text>
<text x="143" y="93" font-size="10">VERB</text>
<text x="189" y="93" font-size="10">PRON</text>
<text x="235" y="93" font-size="10">ADV</text>
<path d="M 17 80 Q 17 47 50 47 L 72 47 Q 105 47 105 80"
stroke="black"
fill="none"/>
<line x1="17" y1="75" x2="17" y2="80" stroke="black"/>
<path d="M 17 80 14 74 20 74"/>
<text x="54" y="42" font-size="10">det</text>
<path d="M 55 80 Q 55 63 71 63 L 88 63 Q 104 63 104 80"
stroke="black"
fill="none"/>
<line x1="55" y1="75" x2="55" y2="80" stroke="black"/>
<path d="M 55 80 52 74 58 74"/>
<text x="71" y="58" font-size="10">amod</text>
<path d="M 110 80 Q 110 63 127 63 L 133 63 Q 150 63 150 80"
stroke="black"
fill="none"/>
<line x1="110" y1="75" x2="110" y2="80" stroke="black"/>
<path d="M 110 80 107 74 113 74"/>
<text x="119" y="58" font-size="10">nsubj</text>
<line x1="158" y1="20" x2="158" y2="80" stroke="black"/>
<path d="M 158 80 155 74 161 74"/>
<text x="163" y="28" font-size="10">root</text>
<path d="M 166 80 Q 166 63 183 63 L 189 63 Q 206 63 206 80"
stroke="black"
fill="none"/>
<line x1="206" y1="75" x2="206" y2="80" stroke="black"/>
<path d="M 206 80 203 74 209 74"/>
<text x="179" y="58" font-size="10">obj</text>
<path d="M 165 80 Q 165 47 198 47 L 220 47 Q 253 47 253 80"
stroke="black"
fill="none"/>
<line x1="253" y1="75" x2="253" y2="80" stroke="black"/>
<path d="M 253 80 250 74 256 74"/>
<text x="195" y="42" font-size="10">advmod</text>
</svg>


@@ -0,0 +1,219 @@
---
title: "Training and evaluating \\newline dependency parsers"
subtitle: "(added to the course by popular demand)"
author: "Arianna Masciolini"
theme: "lucid"
logo: "gu.png"
date: "VT25"
institute: "LT2214 Computational Syntax"
---
## Today's topic
\bigskip \bigskip
![](img/sets.png)
# Parsing
## A structured prediction task
Sequence $\to$ structure, e.g.
- natural language sentence $\to$ syntax tree
- code $\to$ AST
- argumentative essay $\to$ argumentative structure
- ...
## Example (argmining)
> Språkbanken has better fika than CLASP: every fika, someone bakes. Sure, CLASP has a better coffee machine. On the other hand, there are more important things than coffee. In fact, most people drink tea in the afternoon.
## Example (argmining)
![](img/argmining.png)
\footnotesize From "A gentle introduction to argumentation mining" (Lindahl et al., 2022)
# Syntactic parsing
## From sentence to tree
From chapter 18 of _Speech and Language Processing_ (Jurafsky & Martin, January 2024 draft):
> Syntactic parsing is the task of assigning a syntactic structure to a sentence
- the structure is usually a _syntax tree_
- two main classes of approaches:
- constituency parsing (e.g. GF)
- dependency parsing (e.g. UD)
## Example (GF)
```
MicroLang> i MicroLangEng.gf
linking ... OK
Languages: MicroLangEng
7 msec
MicroLang> p "the black cat sees us now"
PredVPS (DetCN the_Det (AdjCN (PositA black_A)
(UseN cat_N))) (AdvVP (ComplV2 see_V2 (UsePron
we_Pron)) now_Adv)
```
## Example (GF)
```haskell
PredVPS
(DetCN
the_Det
(AdjCN (PositA black_A) (UseN cat_N))
)
(AdvVP
(ComplV2 see_V2 (UsePron we_Pron))
now_Adv
)
```
## Example (GF)
![](img/gfast.png)
# Dependency parsing
## Example (UD)
![](img/ud.svg)
\small
```
1 the _ DET _ _ 3 det _ _
2 black _ ADJ _ _ 3 amod _ _
3 cat _ NOUN _ _ 4 nsubj _ _
4 sees _ VERB _ _ 0 root _ _
5 us _ PRON _ _ 4 obj _ _
6 now _ ADV _ _ 4 advmod _ _
```
## Two paradigms
- __graph-based algorithms__: find the optimal tree from the set of all possible candidate solutions (or a subset of it)
- __transition-based algorithms__: incrementally build a tree by solving a sequence of classification problems
## Graph-based approaches
$$\hat{t} = \underset{t \in T(s)}{argmax}\, score(s,t)$$
- $t$: candidate tree
- $\hat{t}$: predicted tree
- $s$: input sentence
- $T(s)$: set of candidate trees for $s$
## Complexity
Depends on:
- choice of $T$ (upper bound: $n^{n-1}$, where $n$ is the number of words in $s$)
- scoring function (in the __arc-factored model__, the score of a tree is the sum of the scores of its edges, each scored individually, e.g. by a neural network)
In practice: $O(n^3)$ complexity
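The arc-factored idea can be sketched in a few lines of Python (the arc scores below are toy numbers, not from any trained model; a real parser searches $T(s)$ with a spanning-tree algorithm instead of enumerating):

```python
def tree_score(heads, arc_score):
    """Arc-factored model: a tree's score is the sum of its arc scores.
    heads maps each dependent (1-based) to its head; 0 is the root."""
    return sum(arc_score[(h, d)] for d, h in heads.items())

# toy arc scores for the 2-word sentence "cats sleep"
arc_score = {(0, 1): 1.0, (0, 2): 3.0, (1, 2): 0.5, (2, 1): 2.5}

# the two possible single-rooted trees for 2 words, scored as wholes
candidates = [{1: 0, 2: 1},   # root -> "cats", "cats" -> "sleep"
              {1: 2, 2: 0}]   # root -> "sleep", "sleep" -> "cats"
best = max(candidates, key=lambda t: tree_score(t, arc_score))
print(best)
```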
## Transition-based approaches
- trees are built through a sequence of steps, called _transitions_
- training requires:
- a gold-standard treebank (as for graph-based approaches)
- an _oracle_, i.e. an algorithm that converts each tree into a gold-standard sequence of transitions
- much more efficient: $O(n)$
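A minimal sketch of one common transition system (arc-standard, with SHIFT, LEFT-ARC and RIGHT-ARC moves), replaying the kind of transition sequence an oracle would produce for "the cat sleeps"; in a real parser, a classifier would predict each move:

```python
def parse(words, transitions):
    """Replay an arc-standard transition sequence into (head, dependent) arcs.
    Word indices are 1-based; 0 is the artificial root node."""
    stack, buffer, arcs = [0], list(range(1, len(words) + 1)), []
    for t in transitions:
        if t == "SHIFT":                # move next word onto the stack
            stack.append(buffer.pop(0))
        elif t == "LEFT-ARC":           # stack[-2] becomes dependent of stack[-1]
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif t == "RIGHT-ARC":          # stack[-1] becomes dependent of stack[-2]
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

# det(cat, the), nsubj(sleeps, cat), root(0, sleeps)
arcs = parse(["the", "cat", "sleeps"],
             ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "LEFT-ARC", "RIGHT-ARC"])
```

One transition per step gives the linear-time behaviour mentioned above.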
## Evaluation
2 main metrics:
- __UAS__ (Unlabelled Attachment Score): the fraction of nodes that are attached to the correct dependency head
- __LAS__ (Labelled Attachment Score): the fraction of nodes that are attached to the correct dependency head _with an arc labelled with the correct relation type_[^1]
[^1]: in UD: the `DEPREL` column
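Both metrics are straightforward to compute from per-word (head, deprel) pairs; a sketch using the "the black cat sees us now" example, against a hypothetical (made-up) prediction:

```python
def attachment_scores(gold, pred):
    """UAS and LAS from per-word (head, deprel) pairs; equal length assumed."""
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n   # head only
    las = sum(g == p for g, p in zip(gold, pred)) / n         # head + label
    return uas, las

gold = [(3, "det"), (3, "amod"), (4, "nsubj"), (0, "root"), (4, "obj"), (4, "advmod")]
pred = [(3, "det"), (3, "amod"), (4, "nsubj"), (0, "root"), (4, "iobj"), (2, "advmod")]
uas, las = attachment_scores(gold, pred)  # 5/6 heads correct, 4/6 arcs fully correct
```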
# Specifics of UD parsing
## Not just parsing per se
UD "parsers" typically do a lot more than dependency parsing:
- sentence segmentation
- tokenization
- lemmatization (`LEMMA` column)
- POS tagging (`UPOS` + `XPOS`)
- morphological tagging (`FEATS`)
- ...
Sometimes, some of these tasks are performed __jointly__ to achieve better performance.
## Evaluation (UD-specific)
Some more specific metrics:
- __CLAS__ (Content-word LAS): LAS limited to content words
- __MLAS__ (Morphology-Aware LAS): CLAS that also uses the `FEATS` column
- __BLEX__ (Bi-Lexical dependency score): CLAS that also uses the `LEMMA` column
## Evaluation script output
\small
```
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 100.00 | 100.00 | 100.00 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 100.00 | 100.00 | 100.00 |
UPOS | 98.36 | 98.36 | 98.36 | 98.36
XPOS | 100.00 | 100.00 | 100.00 | 100.00
UFeats | 100.00 | 100.00 | 100.00 | 100.00
AllTags | 98.36 | 98.36 | 98.36 | 98.36
Lemmas | 100.00 | 100.00 | 100.00 | 100.00
UAS | 92.73 | 92.73 | 92.73 | 92.73
LAS | 90.30 | 90.30 | 90.30 | 90.30
CLAS | 88.50 | 88.34 | 88.42 | 88.34
MLAS | 86.72 | 86.56 | 86.64 | 86.56
BLEX | 88.50 | 88.34 | 88.42 | 88.34
```
## Three generations of parsers
(all transition-based)
1. __MaltParser__ (Nivre et al. 2006): "classic" transition-based parser, data-driven but not NN-based
2. __UDPipe__: neural parser, personal favorite
- v1 (Straka et al. 2016): fast, solid software, easy to install and available anywhere
- v2 (Straka et al. 2018): much better results but slower and only available through an API/via the web GUI
3. __MaChAmp__ (van der Goot et al. 2021): transformer-based toolkit for multi-task learning, works on all CoNLL-like data, close to the SOTA, relatively easy to install and train
## MaChAmp config example
```json
{"compsyn": {
"train_data_path": "PATH-TO-YOUR-TRAIN-SPLIT",
"dev_data_path": "PATH-TO-YOUR-DEV-SPLIT",
"word_idx": 1,
"tasks": {
"upos": {
"task_type": "seq",
"column_idx": 3
},
"dependency": {
"task_type": "dependency",
"column_idx": 6}}}}
```
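The indices in the config appear to refer to 0-based CoNLL-U columns, so the example reads words from `FORM`, tags from `UPOS` and heads from `HEAD` (a sketch of the assumed mapping, for plain CoNLL-U input):

```python
# 0-based CoNLL-U column layout
CONLLU_COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                  "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

# the indices used in the config above: word_idx=1, upos=3, dependency=6
print(CONLLU_COLUMNS[1], CONLLU_COLUMNS[3], CONLLU_COLUMNS[6])
# → FORM UPOS HEAD
```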
## Your task (lab 3)
![](img/machamp.png)
1. annotate a small treebank for your language of choice (started yesterday)
2. __train a parser-tagger on a reference UD treebank__ (tomorrow, or maybe even today: installation)
3. evaluate it on your treebank
# To learn more
## Main sources
- chapters 18-19 of the January 2024 draft of _Speech and Language Processing_ (Jurafsky & Martin) (full text available [__here__](https://web.stanford.edu/~jurafsky/slp3/))
- unit 3-2 of Johansson & Kuhlmann's course "Deep Learning for Natural Language Processing" ([__slides and videos__](https://liu-nlp.ai/dl4nlp/modules/module3/))
- section 10.9.2 on parser evaluation from Aarne's course notes (on Canvas)
## Papers describing the parsers
- _MaltParser: A Data-Driven Parser-Generator for Dependency Parsing_ (Nivre et al. 2006) ([__PDF__](http://lrec-conf.org/proceedings/lrec2006/pdf/162_pdf.pdf))
- _UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing_ (Straka et al. 2016) ([__PDF__](https://aclanthology.org/L16-1680.pdf))
- _UDPipe 2.0 Prototype at CoNLL 2018 UD Shared Task_ (Straka et al. 2018) ([__PDF__](https://aclanthology.org/K18-2020.pdf))
- _Massive Choice, Ample Tasks (MACHAMP): A Toolkit for Multi-task Learning in NLP_ (van der Goot et al., 2021) ([__PDF__](https://arxiv.org/pdf/2005.14672))
## CSE courses you may like
1. [DIT231](https://www.gu.se/en/study-gothenburg/programming-language-technology-dit231) Programming language technology
- build a complete compiler
2. [DIT301](https://www.gu.se/en/study-gothenburg/compiler-construction-dit301) Compiler construction
- the hardcore version of 1.
- build another compiler _and optimize it_
3. DIT247 Machine learning for NLP (?)
- has a module on dependency parsing similar to the one in "Deep Learning for Natural Language Processing"

Binary file not shown.


@@ -0,0 +1,94 @@
## Core latex/pdflatex auxiliary files:
*.aux
*.lof
*.log
*.lot
*.fls
*.out
*.toc
## Intermediate documents:
*.dvi
# these rules might exclude image files for figures etc.
# *.ps
# *.eps
# *.pdf
## Bibliography auxiliary files (bibtex/biblatex/biber):
*.bbl
*.bcf
*.blg
*-blx.aux
*-blx.bib
*.run.xml
## Build tool auxiliary files:
*.fdb_latexmk
*.synctex.gz
*.synctex.gz(busy)
*.pdfsync
## Auxiliary and intermediate files from other packages:
# algorithms
*.alg
*.loa
# amsthm
*.thm
# beamer
*.nav
*.snm
*.vrb
# glossaries
*.acn
*.acr
*.glg
*.glo
*.gls
# hyperref
*.brf
# listings
*.lol
# makeidx
*.idx
*.ilg
*.ind
*.ist
# minitoc
*.maf
*.mtc
*.mtc0
# minted
*.pyg
# nomencl
*.nlo
# sagetex
*.sagetex.sage
*.sagetex.py
*.sagetex.scmd
# sympy
*.sout
*.sympy
sympy-plots-for-*.tex/
# todonotes
*.tdo
# xindy
*.xdy
# useless files
color_scheme.png
identicon.png
._wordcount_selection.tex


@@ -0,0 +1,187 @@
\usepackage{tikz}
\usetikzlibrary{calc}
% -------- COLOR SCHEME --------
\definecolor{PrimaryColor}{RGB}{7,79,140} % primary color (blue)
\definecolor{SecondaryColor}{RGB}{242,88,26} % bulleted lists
\definecolor{BackgroundColor}{RGB}{255,255,255} % background & titles (white)
\definecolor{TextColor}{RGB}{0,0,0} % text (black)
\definecolor{ProgBarBGColor}{RGB}{175,175,175} % progress bar background (grey)
% set colours
\setbeamercolor{normal text}{fg=TextColor}\usebeamercolor*{normal text}
\setbeamercolor{alerted text}{fg=PrimaryColor}
\setbeamercolor{section in toc}{fg=PrimaryColor}
\setbeamercolor{structure}{fg=SecondaryColor}
\hypersetup{colorlinks,linkcolor=,urlcolor=SecondaryColor}
% set fonts
\setbeamerfont{itemize/enumerate body}{size=\large}
\setbeamerfont{itemize/enumerate subbody}{size=\normalsize}
\setbeamerfont{itemize/enumerate subsubbody}{size=\small}
% make pixelated bullets
\setbeamertemplate{itemize item}{
\tikz{
\draw[fill=SecondaryColor,draw=none] (0, 0) rectangle(0.1, 0.1);
\draw[fill=SecondaryColor,draw=none] (0.1, 0.1) rectangle(0.2, 0.2);
\draw[fill=SecondaryColor,draw=none] (0, 0.2) rectangle(0.1, 0.3);
}
}
\setbeamertemplate{itemize subitem}{
\tikz{
\draw[fill=SecondaryColor,draw=none] (0, 0) rectangle(0.075, 0.075);
\draw[fill=SecondaryColor,draw=none] (0.075, 0.075) rectangle(0.15, 0.15);
\draw[fill=SecondaryColor,draw=none] (0, 0.15) rectangle(0.075, 0.225);
}
}
\setbeamertemplate{itemize subsubitem}{
\tikz{
\draw[fill=SecondaryColor,draw=none] (0.050, 0.050) rectangle(0.15, 0.15);
}
}
% disable navigation
\setbeamertemplate{navigation symbols}{}
% disable the damn default logo!
\setbeamertemplate{sidebar right}{}
% custom draw the title page above
\setbeamertemplate{title page}{}
% again, manually draw the frame title above
\setbeamertemplate{frametitle}{}
% disable "Figure:" in the captions
% TODO: somehow this doesn't work for md-generated slides
%\setbeamertemplate{caption}{\tiny\insertcaption}
%\setbeamertemplate{caption label separator}{}
% add some space below the footnotes so they don't end up on the progress bar
\setbeamertemplate{footnote}{
\parindent 0em
\noindent
\raggedright
\hbox to 0.8em{\hfil\insertfootnotemark}
\insertfootnotetext
\par
\vspace{2em}
}
% add the same vspace both before and after quotes
\setbeamertemplate{quote begin}{\vspace{0.5em}}
\setbeamertemplate{quote end}{\vspace{0.5em}}
% progress bar counters
\newcounter{showProgressBar}
\setcounter{showProgressBar}{1}
\newcounter{showSlideNumbers}
\setcounter{showSlideNumbers}{1}
\newcounter{showSlideTotal}
\setcounter{showSlideTotal}{1}
% use \makeatletter for our progress bar definitions
% progress bar idea from http://tex.stackexchange.com/a/59749/44221
% slightly adapted for visual purposes here
\makeatletter
\newcount\progressbar@tmpcounta% auxiliary counter
\newcount\progressbar@tmpcountb% auxiliary counter
\newdimen\progressbar@pbwidth %progressbar width
\newdimen\progressbar@tmpdim % auxiliary dimension
\newdimen\slidewidth % auxiliary dimension
\newdimen\slideheight % auxiliary dimension
% make the progress bar go across the screen
\progressbar@pbwidth=\the\paperwidth
\slidewidth=\the\paperwidth
\slideheight=\the\paperheight
% draw everything with tikz
\setbeamertemplate{background}{ % all slides
% progress bar stuff
\progressbar@tmpcounta=\insertframenumber
\progressbar@tmpcountb=\inserttotalframenumber
\progressbar@tmpdim=\progressbar@pbwidth
\divide\progressbar@tmpdim by 100
\multiply\progressbar@tmpdim by \progressbar@tmpcounta
\divide\progressbar@tmpdim by \progressbar@tmpcountb
\multiply\progressbar@tmpdim by 100
\begin{tikzpicture}
% set up the entire slide as the canvas
\useasboundingbox (0,0) rectangle(\the\paperwidth,\the\paperheight);
% background
\fill[color=BackgroundColor] (0,0) rectangle(\the\paperwidth,\the\paperheight);
\ifnum\thepage=1\relax % only title slides
% primary color rectangle
\fill[color=PrimaryColor] (0, 4cm) rectangle(\slidewidth,\slideheight);
% text (title, subtitle, author, date)
\node[anchor=south,text width=\slidewidth-1cm,inner xsep=0.5cm] at (0.5\slidewidth,4cm) {\color{BackgroundColor}\Huge\textbf{\inserttitle}};
\node[anchor=north east,text width=\slidewidth-1cm,align=right] at (\slidewidth-0.4cm,4cm) {\color{PrimaryColor}\large\textbf{\insertsubtitle}};
\node at (0.5\slidewidth,2cm) {\color{PrimaryColor}\LARGE\insertauthor};
\node at (0.5\slidewidth,1.25cm) {\color{PrimaryColor}\Large\insertinstitute};
\node[anchor=south east] at(\slidewidth,0cm) {\color{PrimaryColor}\tiny\insertdate};
\else % other slides
% title bar
\fill[color=PrimaryColor] (0, \slideheight-1cm) rectangle(\slidewidth,\slideheight);
% slide title
\node[anchor=north,text width=\slidewidth-0.75cm,inner xsep=0.5cm,inner ysep=0.25cm] at (0.5\slidewidth,\slideheight) {\color{BackgroundColor}\huge\textbf{\insertframetitle}};
% logo (TODO: autoscale; now it expects 350x350)
\node[anchor=north east] at (\slidewidth-0.25cm,\slideheight+0.06cm){\insertlogo};
% show progress bar
\ifnum \value{showProgressBar}>0\relax%
% progress bar icon in the middle of the screen
\draw[fill=ProgBarBGColor,draw=none] (0cm,0cm) rectangle(\slidewidth,0.25cm);
\draw[fill=PrimaryColor,draw=none] (0cm,0cm) rectangle(\progressbar@tmpdim,0.25cm);
% bottom info
\node[anchor=south west] at(0cm,0.25cm) {\color{PrimaryColor}\tiny\vphantom{lp}\insertsection};
% if slide numbers are active
\ifnum \value{showSlideNumbers}>0\relax%
% if slide totals are active
\ifnum \value{showSlideTotal}>0\relax%
% draw both slide number and slide total
\node[anchor=south east] at(\slidewidth,0.25cm) {\color{PrimaryColor}\tiny\insertframenumber/\inserttotalframenumber};
\else
\node[anchor=south east] at(\slidewidth,0.25cm) {\color{PrimaryColor}\tiny\insertframenumber};
\fi
\fi
\else
% section title in the bottom left
\node[anchor=south west] at(0cm,0cm) {\color{PrimaryColor}\tiny\vphantom{lp}\insertsection};
% if we're showing slide numbers
\ifnum \value{showSlideNumbers}>0\relax%
% if slide totals are active
\ifnum \value{showSlideTotal}>0\relax%
% slide number and slide total
\node[anchor=south east] at(\slidewidth,0cm) {\color{PrimaryColor}\tiny\insertframenumber/\inserttotalframenumber};
\else
\node[anchor=south east] at(\slidewidth,0cm) {\color{PrimaryColor}\tiny\insertframenumber};
\fi
\fi
\fi
\fi
\end{tikzpicture}
}
\makeatother
\AtBeginSection{\frame{\sectionpage}} % section pages
\setbeamertemplate{section page}
{
\begin{tikzpicture}
% set up the entire slide as the canvas
\useasboundingbox (0,0) rectangle(\slidewidth,\slideheight);
\fill[color=BackgroundColor] (-1cm, 2cm) rectangle (\slidewidth, \slideheight+0.1cm);
\fill[color=PrimaryColor] (-1cm, 0.5\slideheight-1cm) rectangle(\slidewidth, 0.5\slideheight+1cm);
\node[text width=\the\paperwidth-1cm,align=center] at (0.4\slidewidth, 0.5\slideheight) {\color{BackgroundColor}\Huge\textbf{\insertsection}};
\end{tikzpicture}
}


@@ -0,0 +1,24 @@
# generator = UDPipe 2, https://lindat.mff.cuni.cz/services/udpipe
# udpipe_model = swedish-talbanken-ud-2.15-241121
# udpipe_model_licence = CC BY-NC-SA
# newdoc
# newpar
# sent_id = 1
# text = den är smog salt och det bra för all kropen
1 den den PRON PN|UTR|SIN|DEF|SUB/OBJ Definite=Def|Gender=Com|Number=Sing|PronType=Prs 4 nsubj _ TokenRange=0:3
2 är vara AUX VB|PRS|AKT Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act 4 cop _ TokenRange=4:6
3 smog smog ADV AB _ 4 advmod _ TokenRange=7:11
4 salt salt ADJ JJ|POS|UTR|SIN|IND|NOM Case=Nom|Definite=Ind|Degree=Pos|Number=Sing 0 root _ TokenRange=12:16
5 och och CCONJ KN _ 7 cc _ TokenRange=17:20
6 det den PRON PN|NEU|SIN|DEF|SUB/OBJ Definite=Def|Gender=Neut|Number=Sing|PronType=Prs 7 nsubj _ TokenRange=21:24
7 bra bra ADJ JJ|POS|UTR/NEU|SIN/PLU|IND/DEF|NOM Case=Nom|Degree=Pos 4 conj _ TokenRange=25:28
8 för för ADP PP _ 10 case _ TokenRange=29:32
9 all all DET DT|UTR|SIN|IND/DEF Gender=Com|Number=Sing|PronType=Tot 10 det _ TokenRange=33:36
10 kropen krop NOUN NN|UTR|SIN|DEF|NOM Case=Nom|Definite=Def|Gender=Com|Number=Sing 7 obl _ SpaceAfter=No|TokenRange=37:43
1 Självklart självklar ADV JJ|POS|NEU|SIN|IND|NOM Degree=Pos 0 root _ ORIG_LABEL=root
2 att att SCONJ SN _ 5 mark _ CorrectionLabels=S-Clause
3 det den PRON PN|NEU|SIN|DEF|SUB/OBJ Definite=Def|Gender=Neut|Number=Sing|PronType=Prs 5 nsubj _ _
4 är vara AUX VB|PRS|AKT Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act 5 cop _ CorrectionLabels=S-Clause
5 viktigt viktig ADJ JJ|POS|NEU|SIN|IND|NOM Case=Nom|Definite=Ind|Degree=Pos|Gender=Neut|Number=Sing 1 csubj _ _
6 . . PUNCT MAD _ 1 punct _ _


@@ -0,0 +1,111 @@
\documentclass{article}
\usepackage[a4paper,margin=0.5in,landscape]{geometry}
\usepackage[utf8]{inputenc}
\begin{document}
%% den är smog salt och det bra för all kropen
\setlength{\unitlength}{0.2mm}
\begin{picture}(531.0,110.0)
\put(0.0,0.0){den}
\put(46.0,0.0){är}
\put(83.0,0.0){smog}
\put(129.0,0.0){salt}
\put(175.0,0.0){och}
\put(230.0,0.0){det}
\put(276.0,0.0){bra}
\put(313.0,0.0){för}
\put(350.0,0.0){all}
\put(387.0,0.0){kropen}
\put(0.0,15.0){{\tiny PRON}}
\put(46.0,15.0){{\tiny AUX}}
\put(83.0,15.0){{\tiny ADV}}
\put(129.0,15.0){{\tiny ADJ}}
\put(175.0,15.0){{\tiny CCONJ}}
\put(230.0,15.0){{\tiny PRON}}
\put(276.0,15.0){{\tiny ADJ}}
\put(313.0,15.0){{\tiny ADP}}
\put(350.0,15.0){{\tiny DET}}
\put(387.0,15.0){{\tiny NOUN}}
\put(0.0,-11.0){{\scriptsize {\slshape den}}}
\put(46.0,-11.0){{\scriptsize {\slshape vara}}}
\put(83.0,-11.0){{\scriptsize {\slshape smog}}}
\put(129.0,-11.0){{\scriptsize {\slshape salt}}}
\put(175.0,-11.0){{\scriptsize {\slshape och}}}
\put(230.0,-11.0){{\scriptsize {\slshape den}}}
\put(276.0,-11.0){{\scriptsize {\slshape bra}}}
\put(313.0,-11.0){{\scriptsize {\slshape för}}}
\put(350.0,-11.0){{\scriptsize {\slshape all}}}
\put(387.0,-11.0){{\scriptsize {\slshape krop}}}
\put(74.5,30.0){\oval(126.67441860465117,100.0)[t]}
\put(11.162790697674417,35.0){\vector(0,-1){5.0}}
\put(63.25,83.0){{\tiny nsubj}}
\put(97.5,30.0){\oval(79.3855421686747,66.66666666666667)[t]}
\put(57.80722891566265,35.0){\vector(0,-1){5.0}}
\put(90.75,66.33333333333334){{\tiny cop}}
\put(116.0,30.0){\oval(39.47826086956522,33.333333333333336)[t]}
\put(96.26086956521739,35.0){\vector(0,-1){5.0}}
\put(102.5,49.66666666666667){{\tiny advmod}}
\put(144.0,110.0){\vector(0,-1){80.0}}
\put(149.0,100.0){{\tiny root}}
\put(235.5,30.0){\oval(98.02970297029702,66.66666666666667)[t]}
\put(186.4851485148515,35.0){\vector(0,-1){5.0}}
\put(231.0,66.33333333333334){{\tiny cc}}
\put(263.0,30.0){\oval(39.47826086956522,33.333333333333336)[t]}
\put(243.26086956521738,35.0){\vector(0,-1){5.0}}
\put(251.75,49.66666666666667){{\tiny nsubj}}
\put(222.5,30.0){\oval(144.9591836734694,100.0)[t]}
\put(294.9795918367347,35.0){\vector(0,-1){5.0}}
\put(213.5,83.0){{\tiny conj}}
\put(360.0,30.0){\oval(69.94594594594595,66.66666666666667)[t]}
\put(325.02702702702703,35.0){\vector(0,-1){5.0}}
\put(351.0,66.33333333333334){{\tiny case}}
\put(378.5,30.0){\oval(28.89189189189189,33.333333333333336)[t]}
\put(364.05405405405406,35.0){\vector(0,-1){5.0}}
\put(371.75,49.66666666666667){{\tiny det}}
\put(351.5,30.0){\oval(108.29729729729729,100.0)[t]}
\put(405.64864864864865,35.0){\vector(0,-1){5.0}}
\put(344.75,83.0){{\tiny obl}}
\end{picture}
\vspace{4mm}
%% Självklart att det är viktigt .
\setlength{\unitlength}{0.2mm}
\begin{picture}(406.0,150.0)
\put(0.0,0.0){Självklart}
\put(100.0,0.0){att}
\put(155.0,0.0){det}
\put(201.0,0.0){är}
\put(238.0,0.0){viktigt}
\put(311.0,0.0){.}
\put(0.0,15.0){{\tiny ADV}}
\put(100.0,15.0){{\tiny SCONJ}}
\put(155.0,15.0){{\tiny PRON}}
\put(201.0,15.0){{\tiny AUX}}
\put(238.0,15.0){{\tiny ADJ}}
\put(311.0,15.0){{\tiny PUNCT}}
\put(0.0,-11.0){{\scriptsize {\slshape självklar}}}
\put(100.0,-11.0){{\scriptsize {\slshape att}}}
\put(155.0,-11.0){{\scriptsize {\slshape den}}}
\put(201.0,-11.0){{\scriptsize {\slshape vara}}}
\put(238.0,-11.0){{\scriptsize {\slshape viktig}}}
\put(311.0,-11.0){{\scriptsize {\slshape .}}}
\put(15.0,150.0){\vector(0,-1){120.0}}
\put(20.0,140.0){{\tiny root}}
\put(179.0,30.0){\oval(135.82608695652175,100.0)[t]}
\put(111.08695652173913,35.0){\vector(0,-1){5.0}}
\put(170.0,83.0){{\tiny mark}}
\put(206.5,30.0){\oval(79.3855421686747,66.66666666666667)[t]}
\put(166.80722891566265,35.0){\vector(0,-1){5.0}}
\put(195.25,66.33333333333334){{\tiny nsubj}}
\put(229.5,30.0){\oval(28.89189189189189,33.333333333333336)[t]}
\put(215.05405405405406,35.0){\vector(0,-1){5.0}}
\put(222.75,49.66666666666667){{\tiny cop}}
\put(139.0,30.0){\oval(236.73949579831933,133.33333333333334)[t]}
\put(257.3697478991597,35.0){\vector(0,-1){5.0}}
\put(127.75,99.66666666666667){{\tiny csubj}}
\put(175.5,30.0){\oval(310.0353697749196,166.66666666666666)[t]}
\put(330.51768488745984,35.0){\vector(0,-1){5.0}}
\put(164.25,116.33333333333333){{\tiny punct}}
\end{picture}
\end{document}


@@ -0,0 +1,637 @@
---
title: "UD as an annotation standard \\newline for learner language"
subtitle: "a case study on L2 Swedish"
author: "Arianna Masciolini"
theme: "lucid"
logo: "gu.png"
date: "VT25"
institute: "LT2214 Computational Syntax"
---
## Learner data
<!--see any problems?-->
\bigskip \bigskip
### English (FCE)
\small
```xml
I also suggest that more plays and films should
<ns type="RV"> <ns type="FV"><i>be taken</i><c>take</c>
</ns> place</ns>.
```
### Italian (VALICO)
\small
```xml
Finse <MC><i>aveva paura</i><c>che aveva paura</c>
</MC> di un <DN><i>rapito</i><c>rapimento</c></DN>.
```
### Swedish (SweLL)
\small
```xml
<sentence> <w ref="1">"</w> <w ref="2" target_form="Det"
correction_label="L-Ref">Den</w> <w ref="3">är</w>
<w ref="4">en</w> <w ref="5">tredjedel</w>
<w ref="6">av</w> <w ref="7">din</w> <w ref="8">dag</w>
<w ref="9">!</w> </sentence>
```
## The problems
- coarse-grained error labels
- exclusive focus on errors
- lots of manual annotation needed
- lack of interoperability between corpora
## The solution: UD
- fine-grained morphosyntactic annotation <!--answers the first two-->
- parsers
- cross-linguistic consistency $\to$ possibility to compare:
- L2 vs. standard
- L1 vs. L2
- different L2s
## L1-L2 treebanks
<!--I did not come up with this, or actually I did, but someone else had already-->
![](img/l1l2.png)
\bigskip
- L2 sentences $\parallel$ correction hypotheses <!--explain hypotheses-->
- no explicit error tagging
<!--I love parallel treebanks btw - concept-alignment thesis with Aarne-->
## UD treebanks of learner language
\bigskip
| **language** | **name** | **size** | **status** | **parallel** |
| ----------: | --------- | -------: | :-----------: | :--------: |
| Chinese | CFL | 451 | released | **yes**\*\* |
| English | ESL | 5124 | retired\* | **yes** |
| English | ESLSpok | 2320 | released | no |
| Italian | Valico | 398 | released | **yes** |
| Korean | KSL | 12977 | released | no |
| Russian | ? | 500 | WIP | **yes** |
| \color{SecondaryColor}Swedish | \color{SecondaryColor}SweLL | \color{SecondaryColor}\~5000 | \color{SecondaryColor}WIP | \color{SecondaryColor}**yes** |
\footnotesize \*available for download but not part of the latest UD release
\newline\footnotesize \**only L2 half available
## Challenges
| **expectations** | **reality** |
| -----: | :----- |
| fine-grained annotation | when the validator allows that |
| parsers | don't work terribly well |
| cross-linguistic consistency | is limited to error-free spans |
## The `root` of the problem
The UD guidelines are designed with standard language in mind
- should we annotate the intended meaning (correction) and/or the observed language use?
- how to handle mismatches between the characteristics of individual tokens and their use in context?
# Treebanking SweLL
## Source corpus
__SweLL-gold__, aka the Swedish Learner Language corpus:
- __genre__: essays (misc topics)
- __learners__: adult L2 Swedish learners with various language backgrounds and proficiency levels
- __annotation__: error tagging, pseudonymization and normalization (minimal edits)
- __license__: CLARIN-ID -PRIV \underline{-NORED} -BY
## Example 0
\setlength{\unitlength}{0.20mm}
\begin{picture}(406.0,150.0)
\put(0.0,0.0){Självklart}
\put(100.0,0.0){\bfseries att}
\put(155.0,0.0){\bfseries det}
\put(201.0,0.0){\bfseries är}
\put(238.0,0.0){viktigt}
\put(311.0,0.0){.}
\put(0.0,-11.0){{\scriptsize {\slshape of.course}}}
\put(100.0,-11.0){{\scriptsize {\slshape that}}}
\put(155.0,-11.0){{\scriptsize {\slshape it}}}
\put(201.0,-11.0){{\scriptsize {\slshape is}}}
\put(238.0,-11.0){{\scriptsize {\slshape important}}}
\put(311.0,-11.0){{\scriptsize {\slshape .}}}
\end{picture}
\bigskip
- \small correction: "Självklart __är det__ viktigt."
- \small translation: "Of course it is important."
## Example 0
\setlength{\unitlength}{0.20mm}
\begin{picture}(406.0,150.0)
\put(0.0,0.0){Självklart}
\put(100.0,0.0){\bfseries att}
\put(155.0,0.0){\bfseries det}
\put(201.0,0.0){\bfseries är}
\put(238.0,0.0){viktigt}
\put(311.0,0.0){.}
\put(0.0,15.0){{\tiny ADV}}
\put(100.0,15.0){{\tiny SCONJ}}
\put(155.0,15.0){{\tiny PRON}}
\put(201.0,15.0){{\tiny AUX}}
\put(238.0,15.0){{\tiny ADJ}}
\put(311.0,15.0){{\tiny PUNCT}}
\put(0.0,-11.0){{\scriptsize {\slshape of.course}}}
\put(100.0,-11.0){{\scriptsize {\slshape that}}}
\put(155.0,-11.0){{\scriptsize {\slshape it}}}
\put(201.0,-11.0){{\scriptsize {\slshape is}}}
\put(238.0,-11.0){{\scriptsize {\slshape important}}}
\put(311.0,-11.0){{\scriptsize {\slshape .}}}
\end{picture}
\bigskip
- \small correction: "Självklart __är det__ viktigt."
- \small translation: "Of course it is important."
## Example 0
\setlength{\unitlength}{0.20mm}
\begin{picture}(406.0,150.0)
\put(0.0,0.0){Självklart}
\put(100.0,0.0){\bfseries att}
\put(155.0,0.0){\bfseries det}
\put(201.0,0.0){\bfseries är}
\put(238.0,0.0){viktigt}
\put(311.0,0.0){.}
\put(0.0,15.0){{\tiny ADV}}
\put(100.0,15.0){{\tiny SCONJ}}
\put(155.0,15.0){{\tiny PRON}}
\put(201.0,15.0){{\tiny AUX}}
\put(238.0,15.0){{\tiny ADJ}}
\put(311.0,15.0){{\tiny PUNCT}}
\put(0.0,-11.0){{\scriptsize {\slshape of.course}}}
\put(100.0,-11.0){{\scriptsize {\slshape that}}}
\put(155.0,-11.0){{\scriptsize {\slshape it}}}
\put(201.0,-11.0){{\scriptsize {\slshape is}}}
\put(238.0,-11.0){{\scriptsize {\slshape important}}}
\put(311.0,-11.0){{\scriptsize {\slshape .}}}
\put(15.0,150.0){\vector(0,-1){120.0}}
\put(20.0,140.0){{\tiny root}}
\put(179.0,30.0){\oval(135.82608695652175,100.0)[t]}
\put(111.08695652173913,35.0){\vector(0,-1){5.0}}
\put(170.0,83.0){{\tiny mark}}
\put(206.5,30.0){\oval(79.3855421686747,66.66666666666667)[t]}
\put(166.80722891566265,35.0){\vector(0,-1){5.0}}
\put(195.25,66.33333333333334){{\tiny nsubj}}
\put(229.5,30.0){\oval(28.89189189189189,33.333333333333336)[t]}
\put(215.05405405405406,35.0){\vector(0,-1){5.0}}
\put(222.75,49.66666666666667){{\tiny cop}}
\put(139.0,30.0){\oval(236.73949579831933,133.33333333333334)[t]}
\put(257.3697478991597,35.0){\vector(0,-1){5.0}}
\put(127.75,99.66666666666667){{\tiny csubj}}
\put(175.5,30.0){\oval(310.0353697749196,166.66666666666666)[t]}
\put(330.51768488745984,35.0){\vector(0,-1){5.0}}
\put(164.25,116.33333333333333){{\tiny punct}}
\end{picture}
\bigskip
- \small correction: "Självklart __är det__ viktigt."
- \small translation: "Of course it is important."
## Example 1
\setlength{\unitlength}{0.23mm}
\begin{picture}(409.0,130.0)
\put(0.0,0.0){Jag}
\put(46.0,0.0){hade}
\put(92.0,0.0){\bfseries emotskänslor}
\put(200.0,0.0){fast}
\put(270.0,0.0){jag}
\put(311.0,0.0){\bfseries var}
\put(348.0,0.0){\bfseries vänta}
\put(403.0,0.0){det}
\put(0.0,-11.0){{\scriptsize {\slshape I}}}
\put(46.0,-11.0){{\scriptsize {\slshape had}}}
\put(92.0,-11.0){{\scriptsize {\slshape againstfeelings}}}
\put(200.0,-11.0){{\scriptsize {\slshape although}}}
\put(270.0,-11.0){{\scriptsize {\slshape I}}}
\put(311.0,-11.0){{\scriptsize {\slshape was}}}
\put(348.0,-11.0){{\scriptsize {\slshape wait}}}
\put(403.0,-11.0){{\scriptsize {\slshape that}}}
\end{picture}
\bigskip
- \small correction: "Jag hade __motstridiga känslor__ fast jag __hade väntat mig__ det"
- \small translation: "I had mixed feelings although I was expecting that"
## Example 1
\setlength{\unitlength}{0.23mm}
\begin{picture}(409.0,130.0)
\put(0.0,0.0){Jag}
\put(46.0,0.0){hade}
\put(92.0,0.0){\bfseries emotskänslor}
\put(200.0,0.0){fast}
\put(270.0,0.0){jag}
\put(311.0,0.0){\bfseries var}
\put(348.0,0.0){\bfseries vänta}
\put(403.0,0.0){det}
\put(0.0,15.0){{\tiny PRON}}
\put(46.0,15.0){{\tiny VERB}}
\put(92.0,15.0){{\tiny NOUN}}
\put(200.0,15.0){{\tiny SCONJ}}
\put(270.0,15.0){{\tiny PRON}}
\put(311.0,15.0){{\tiny AUX}}
\put(348.0,15.0){{\tiny VERB}}
\put(403.0,15.0){{\tiny PRON}}
\put(0.0,-11.0){{\scriptsize {\slshape I}}}
\put(46.0,-11.0){{\scriptsize {\slshape had}}}
\put(92.0,-11.0){{\scriptsize {\slshape againstfeelings}}}
\put(200.0,-11.0){{\scriptsize {\slshape although}}}
\put(270.0,-11.0){{\scriptsize {\slshape I}}}
\put(311.0,-11.0){{\scriptsize {\slshape was}}}
\put(348.0,-11.0){{\scriptsize {\slshape wait}}}
\put(403.0,-11.0){{\scriptsize {\slshape that}}}
\end{picture}
\bigskip
- \small correction: "Jag hade __motstridiga känslor__ fast jag __hade väntat mig__ det"
- \small translation: "I had mixed feelings although I was expecting that"
## Example 1
\setlength{\unitlength}{0.23mm}
\begin{picture}(409.0,130.0)
\put(0.0,0.0){Jag}
\put(46.0,0.0){hade}
\put(92.0,0.0){\bfseries emotskänslor}
\put(200.0,0.0){fast}
\put(270.0,0.0){jag}
\put(311.0,0.0){\bfseries var}
\put(348.0,0.0){\bfseries vänta}
\put(403.0,0.0){det}
\put(0.0,15.0){{\tiny PRON}}
\put(46.0,15.0){{\tiny VERB}}
\put(92.0,15.0){{\tiny NOUN}}
\put(200.0,15.0){{\tiny SCONJ}}
\put(270.0,15.0){{\tiny PRON}}
\put(311.0,15.0){{\tiny AUX}}
\put(348.0,15.0){{\tiny VERB}}
\put(403.0,15.0){{\tiny PRON}}
\put(0.0,-11.0){{\scriptsize {\slshape I}}}
\put(46.0,-11.0){{\scriptsize {\slshape had}}}
\put(92.0,-11.0){{\scriptsize {\slshape againstfeelings}}}
\put(200.0,-11.0){{\scriptsize {\slshape although}}}
\put(270.0,-11.0){{\scriptsize {\slshape I}}}
\put(311.0,-11.0){{\scriptsize {\slshape was}}}
\put(348.0,-11.0){{\scriptsize {\slshape wait}}}
\put(403.0,-11.0){{\scriptsize {\slshape that}}}
\put(33.0,30.0){\oval(39.47826086956522,33.333333333333336)[t]}
\put(13.26086956521739,35.0){\vector(0,-1){5.0}}
\put(21.75,49.66666666666667){{\tiny nsubj}}
\put(61.0,130.0){\vector(0,-1){100.0}}
\put(66.0,120.0){{\tiny root}}
\put(89.0,30.0){\oval(39.47826086956522,33.333333333333336)[t]}
\put(108.73913043478261,35.0){\vector(0,-1){5.0}}
\put(82.25,49.66666666666667){{\tiny obj}}
\put(289.0,30.0){\oval(135.82608695652175,100.0)[t]}
\put(221.08695652173913,35.0){\vector(0,-1){5.0}}
\put(280.0,83.0){{\tiny mark}}
\put(316.5,30.0){\oval(79.3855421686747,66.66666666666667)[t]}
\put(276.8072289156627,35.0){\vector(0,-1){5.0}}
\put(305.25,66.33333333333334){{\tiny nsubj}}
\put(339.5,30.0){\oval(28.89189189189189,33.333333333333336)[t]}
\put(325.05405405405406,35.0){\vector(0,-1){5.0}}
\put(332.75,49.66666666666667){{\tiny \bfseries ?}}
\put(217.0,30.0){\oval(301.0066225165563,133.33333333333334)[t]}
\put(367.50331125827813,35.0){\vector(0,-1){5.0}}
\put(205.75,99.66666666666667){{\tiny advcl}}
\put(395.5,30.0){\oval(49.54545454545455,33.333333333333336)[t]}
\put(420.27272727272725,35.0){\vector(0,-1){5.0}}
\put(388.75,49.66666666666667){{\tiny obj}}
\end{picture}
\bigskip
- \small correction: "Jag hade __motstridiga känslor__ fast jag __hade väntat mig__ det"
- \small translation: "I had mixed feelings although I was expecting that"
## Example 1
\setlength{\unitlength}{0.23mm}
\begin{picture}(409.0,130.0)
\put(0.0,0.0){Jag}
\put(46.0,0.0){hade}
\put(92.0,0.0){\bfseries emotskänslor}
\put(200.0,0.0){fast}
\put(270.0,0.0){jag}
\put(311.0,0.0){\bfseries var}
\put(348.0,0.0){\bfseries vänta}
\put(403.0,0.0){det}
\put(0.0,15.0){{\tiny PRON}}
\put(46.0,15.0){{\tiny VERB}}
\put(92.0,15.0){{\tiny NOUN}}
\put(200.0,15.0){{\tiny SCONJ}}
\put(270.0,15.0){{\tiny PRON}}
\put(311.0,15.0){{\tiny AUX}}
\put(348.0,15.0){{\tiny VERB}}
\put(403.0,15.0){{\tiny PRON}}
\put(0.0,-11.0){{\scriptsize {\slshape I}}}
\put(46.0,-11.0){{\scriptsize {\slshape had}}}
\put(92.0,-11.0){{\scriptsize {\slshape againstfeelings}}}
\put(200.0,-11.0){{\scriptsize {\slshape although}}}
\put(270.0,-11.0){{\scriptsize {\slshape I}}}
\put(311.0,-11.0){{\scriptsize {\slshape was}}}
\put(348.0,-11.0){{\scriptsize {\slshape wait}}}
\put(403.0,-11.0){{\scriptsize {\slshape that}}}
\put(33.0,30.0){\oval(39.47826086956522,33.333333333333336)[t]}
\put(13.26086956521739,35.0){\vector(0,-1){5.0}}
\put(21.75,49.66666666666667){{\tiny nsubj}}
\put(61.0,130.0){\vector(0,-1){100.0}}
\put(66.0,120.0){{\tiny root}}
\put(89.0,30.0){\oval(39.47826086956522,33.333333333333336)[t]}
\put(108.73913043478261,35.0){\vector(0,-1){5.0}}
\put(82.25,49.66666666666667){{\tiny obj}}
\put(289.0,30.0){\oval(135.82608695652175,100.0)[t]}
\put(221.08695652173913,35.0){\vector(0,-1){5.0}}
\put(280.0,83.0){{\tiny mark}}
\put(316.5,30.0){\oval(79.3855421686747,66.66666666666667)[t]}
\put(276.8072289156627,35.0){\vector(0,-1){5.0}}
\put(305.25,66.33333333333334){{\tiny nsubj}}
\put(339.5,30.0){\oval(28.89189189189189,33.333333333333336)[t]}
\put(325.05405405405406,35.0){\vector(0,-1){5.0}}
\put(332.75,49.66666666666667){{\tiny \bfseries aux:*}}
\put(217.0,30.0){\oval(301.0066225165563,133.33333333333334)[t]}
\put(367.50331125827813,35.0){\vector(0,-1){5.0}}
\put(205.75,99.66666666666667){{\tiny advcl}}
\put(395.5,30.0){\oval(49.54545454545455,33.333333333333336)[t]}
\put(420.27272727272725,35.0){\vector(0,-1){5.0}}
\put(388.75,49.66666666666667){{\tiny obj}}
\end{picture}
\bigskip
- \small correction: "Jag hade __motstridiga känslor__ fast jag __hade väntat mig__ det"
- \small translation: "I had mixed feelings although I was expecting that"
## Example 2
\setlength{\unitlength}{0.23mm}
\begin{picture}(195.0,110.0)
\put(0.0,0.0){en}
\put(37.0,0.0){lång}
\put(83.0,0.0){\bfseries bus}
\put(129.0,0.0){\bfseries resa}
\put(0.0,-13.0){{\scriptsize {\slshape a}}}
\put(37.0,-13.0){{\scriptsize {\slshape long}}}
\put(83.0,-13.0){{\scriptsize {\slshape bus}}}
\put(129.0,-13.0){{\scriptsize {\slshape trip}}}
\end{picture}
\bigskip
- \small correction: "en lång __bussresa__"
- \small translation: "a long bus trip"
## Example 2
\setlength{\unitlength}{0.23mm}
\begin{picture}(195.0,110.0)
\put(0.0,0.0){en}
\put(37.0,0.0){lång}
\put(83.0,0.0){\bfseries bus}
\put(129.0,0.0){\bfseries resa}
\put(0.0,15.0){{\tiny DET}}
\put(37.0,15.0){{\tiny ADJ}}
\put(83.0,15.0){{\tiny NOUN}}
\put(129.0,15.0){{\tiny NOUN}}
\put(0.0,-13.0){{\scriptsize {\slshape a}}}
\put(37.0,-13.0){{\scriptsize {\slshape long}}}
\put(83.0,-13.0){{\scriptsize {\slshape bus}}}
\put(129.0,-13.0){{\scriptsize {\slshape trip}}}
\end{picture}
\bigskip
- \small correction: "en lång __bussresa__"
- \small translation: "a long bus trip"
## Example 2
\setlength{\unitlength}{0.23mm}
\begin{picture}(195.0,110.0)
\put(0.0,0.0){en}
\put(37.0,0.0){lång}
\put(83.0,0.0){\bfseries bus}
\put(129.0,0.0){\bfseries resa}
\put(0.0,15.0){{\tiny DET}}
\put(37.0,15.0){{\tiny ADJ}}
\put(83.0,15.0){{\tiny NOUN}}
\put(129.0,15.0){{\tiny NOUN}}
\put(0.0,-13.0){{\scriptsize {\slshape a}}}
\put(37.0,-13.0){{\scriptsize {\slshape long}}}
\put(83.0,-13.0){{\scriptsize {\slshape bus}}}
\put(129.0,-13.0){{\scriptsize {\slshape trip}}}
\put(74.5,30.0){\oval(126.67441860465117,100.0)[t]}
\put(11.162790697674417,35.0){\vector(0,-1){5.0}}
\put(67.75,83.0){{\tiny det}}
\put(93.0,30.0){\oval(88.73913043478261,66.66666666666667)[t]}
\put(48.630434782608695,35.0){\vector(0,-1){5.0}}
\put(84.0,66.33333333333334){{\tiny amod}}
\put(116.0,30.0){\oval(39.47826086956522,33.333333333333336)[t]}
\put(96.26086956521739,35.0){\vector(0,-1){5.0}}
\put(75.5,49.66666666666667){{\tiny compound:*}}
\put(144.0,110.0){\vector(0,-1){80.0}}
\put(149.0,100.0){{\tiny root}}
\end{picture}
\bigskip
- \small correction: "en lång __bussresa__"
- \small translation: "a long bus trip"
## Example 3
\setlength{\unitlength}{0.23mm}
\begin{picture}(531.0,110.0)
\small
\put(0.0,0.0){\bfseries den}
\put(46.0,0.0){\bfseries är}
\put(83.0,0.0){\bfseries smog}
\put(129.0,0.0){salt}
\put(175.0,0.0){och}
\put(230.0,0.0){det}
\put(276.0,0.0){bra}
\put(313.0,0.0){för}
\put(350.0,0.0){\bfseries all}
\put(387.0,0.0){\bfseries kropen}
\put(0.0,15.0){{\tiny PRON}}
\put(46.0,15.0){{\tiny AUX}}
\put(83.0,15.0){{\tiny NOUN}}
\put(129.0,15.0){{\tiny NOUN}}
\put(175.0,15.0){{\tiny CCONJ}}
\put(230.0,15.0){{\tiny PRON}}
\put(276.0,15.0){{\tiny ADJ}}
\put(313.0,15.0){{\tiny ADP}}
\put(350.0,15.0){{\tiny DET}}
\put(387.0,15.0){{\tiny NOUN}}
\put(0.0,-13.0){{\scriptsize {\slshape it}}}
\put(46.0,-13.0){{\scriptsize {\slshape is}}}
\put(83.0,-13.0){{\scriptsize {\slshape taste?}}}
\put(129.0,-13.0){{\scriptsize {\slshape salt}}}
\put(175.0,-13.0){{\scriptsize {\slshape and}}}
\put(230.0,-13.0){{\scriptsize {\slshape it}}}
\put(276.0,-13.0){{\scriptsize {\slshape good}}}
\put(313.0,-13.0){{\scriptsize {\slshape for}}}
\put(350.0,-13.0){{\scriptsize {\slshape all}}}
\put(387.0,-13.0){{\scriptsize {\slshape the.body}}}
\put(51.5,30.0){\oval(79.3855421686747,66.66666666666667)[t]}
\put(11.807228915662648,35.0){\vector(0,-1){5.0}}
\put(40.25,66.33333333333334){{\tiny nsubj}}
\put(74.5,30.0){\oval(28.89189189189189,33.333333333333336)[t]}
\put(60.054054054054056,35.0){\vector(0,-1){5.0}}
\put(67.75,49.66666666666667){{\tiny cop}}
\put(98.0,110.0){\vector(0,-1){80.0}}
\put(103.0,100.0){{\tiny root}}
\put(126.0,30.0){\oval(39.47826086956522,33.333333333333336)[t]}
\put(145.73913043478262,35.0){\vector(0,-1){5.0}}
\put(117.0,49.66666666666667){{\tiny nmod}}
\put(235.5,30.0){\oval(98.02970297029702,66.66666666666667)[t]}
\put(186.4851485148515,35.0){\vector(0,-1){5.0}}
\put(231.0,66.33333333333334){{\tiny cc}}
\put(263.0,30.0){\oval(39.47826086956522,33.333333333333336)[t]}
\put(243.26086956521738,35.0){\vector(0,-1){5.0}}
\put(251.75,49.66666666666667){{\tiny nsubj}}
\put(199.5,30.0){\oval(191.4455958549223,100.0)[t]}
\put(295.22279792746116,35.0){\vector(0,-1){5.0}}
\put(190.5,83.0){{\tiny conj}}
\put(360.0,30.0){\oval(69.94594594594595,66.66666666666667)[t]}
\put(325.02702702702703,35.0){\vector(0,-1){5.0}}
\put(351.0,66.33333333333334){{\tiny case}}
\put(378.5,30.0){\oval(28.89189189189189,33.333333333333336)[t]}
\put(364.05405405405406,35.0){\vector(0,-1){5.0}}
\put(371.75,49.66666666666667){{\tiny det}}
\put(351.5,30.0){\oval(108.29729729729729,100.0)[t]}
\put(405.64864864864865,35.0){\vector(0,-1){5.0}}
\put(344.75,83.0){{\tiny obl}}
\end{picture}
\bigskip
- \small correction: "__Det smakar__ salt och det __är__ bra för __hela kroppen__"
- \small translation: "it tastes salt and it's good for the whole body"
## Example 3: parser output
\setlength{\unitlength}{0.23mm}
\begin{picture}(531.0,110.0)
\put(0.0,0.0){\bfseries den}
\put(46.0,0.0){\bfseries är}
\put(83.0,0.0){\bfseries smog}
\put(129.0,0.0){salt}
\put(175.0,0.0){och}
\put(230.0,0.0){det}
\put(276.0,0.0){bra}
\put(313.0,0.0){för}
\put(350.0,0.0){\bfseries all}
\put(387.0,0.0){\bfseries kropen}
\put(0.0,15.0){{\tiny PRON}}
\put(46.0,15.0){{\tiny AUX}}
\put(83.0,15.0){{\tiny \color{SecondaryColor} ADV}}
\put(129.0,15.0){{\tiny \color{SecondaryColor} ADJ}}
\put(175.0,15.0){{\tiny CCONJ}}
\put(230.0,15.0){{\tiny PRON}}
\put(276.0,15.0){{\tiny ADJ}}
\put(313.0,15.0){{\tiny ADP}}
\put(350.0,15.0){{\tiny DET}}
\put(387.0,15.0){{\tiny NOUN}}
\put(74.5,30.0){\oval(126.67441860465117,100.0)[t]}
\put(11.162790697674417,35.0){\vector(0,-1){5.0}}
\put(63.25,83.0){{\tiny nsubj}}
\put(97.5,30.0){\oval(79.3855421686747,66.66666666666667)[t]}
\put(57.80722891566265,35.0){\vector(0,-1){5.0}}
\put(90.75,66.33333333333334){{\tiny cop}}
\put(116.0,30.0){\color{SecondaryColor} \oval(39.47826086956522,33.333333333333336)[t]}
\put(96.26086956521739,35.0){\color{SecondaryColor} \vector(0,-1){5.0}}
\put(102.5,49.66666666666667){{\tiny \color{SecondaryColor} advmod}}
\put(144.0,110.0){\color{SecondaryColor} \vector(0,-1){80.0}}
\put(149.0,100.0){{\tiny \color{SecondaryColor} root}}
\put(235.5,30.0){\oval(98.02970297029702,66.66666666666667)[t]}
\put(186.4851485148515,35.0){\vector(0,-1){5.0}}
\put(231.0,66.33333333333334){{\tiny cc}}
\put(263.0,30.0){\oval(39.47826086956522,33.333333333333336)[t]}
\put(243.26086956521738,35.0){\vector(0,-1){5.0}}
\put(251.75,49.66666666666667){{\tiny nsubj}}
\put(222.5,30.0){\oval(144.9591836734694,100.0)[t]}
\put(294.9795918367347,35.0){\vector(0,-1){5.0}}
\put(213.5,83.0){{\tiny conj}}
\put(360.0,30.0){\oval(69.94594594594595,66.66666666666667)[t]}
\put(325.02702702702703,35.0){\vector(0,-1){5.0}}
\put(351.0,66.33333333333334){{\tiny case}}
\put(378.5,30.0){\oval(28.89189189189189,33.333333333333336)[t]}
\put(364.05405405405406,35.0){\vector(0,-1){5.0}}
\put(371.75,49.66666666666667){{\tiny det}}
\put(351.5,30.0){\oval(108.29729729729729,100.0)[t]}
\put(405.64864864864865,35.0){\vector(0,-1){5.0}}
\put(344.75,83.0){{\tiny obl}}
\end{picture}
\bigskip \bigskip
\footnotesize (obtained with the UDPipe 2 Talbanken 2.15 model)
<!--and this is without FEATS and LEMMA!-->
## Our principles
- the validator is a tool, not a goal:
- __*literal* criteria at the token level__
- __*distributional* criteria at the syntax level__
- __borrow from L1__ guidelines when necessary
- __correction-aware annotation__: the annotation of learner sentences should be consistent with the semantics of the correction hypothesis
## Status
- guidelines and test set (200/500 sentences) WIP
- remaining 5000 + 500 sentences TODO \pause
- you are welcome to __participate__!
- you do _not_ have to be a native speaker
(in fact, none of the current annotators is)
- you _might_ be able to do this as a course project
# Exploring parallel learner treebanks with STUnD
## STUnD
- _Sökverktyg för Tvåspråkiga Universal Dependencies-trädbanker_, or
- Search Tool for (parallel) Universal Dependencies Treebanks
- available at `demo.spraakbanken.gu.se/stund` (hopefully)
## Under the hood
1. identify subtree alignments
2. run the query on the LHS treebanks, looking for matching subtrees
3. find the corresponding RHS subtree (and check whether it matches the RHS-specific patterns)
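The three steps above can be sketched in a few lines of Python. This is a toy illustration, not STUnD's actual implementation: `Tree`, `matches`, and `search` are hypothetical names, and a "pattern" is simplified here to a set of dependency relations that must occur on a subtree's head or its direct children.

```python
from dataclasses import dataclass, field

@dataclass
class Tree:
    """A minimal dependency subtree: a head token and its direct dependents."""
    form: str
    deprel: str
    children: list = field(default_factory=list)

def matches(tree: Tree, pattern: set) -> bool:
    """Simplified pattern matching: every relation in the pattern must occur
    on the subtree's head or on one of its direct children."""
    rels = {tree.deprel} | {c.deprel for c in tree.children}
    return pattern <= rels

def search(alignments, lhs_pattern, rhs_pattern=None):
    """Steps 2-3: scan aligned (L1, L2) subtree pairs, keeping pairs whose LHS
    subtree matches the query and, optionally, whose RHS matches as well."""
    for lhs, rhs in alignments:
        if matches(lhs, lhs_pattern):
            if rhs_pattern is None or matches(rhs, rhs_pattern):
                yield lhs, rhs

# Toy aligned pair from Example 2: learner "en lång bus resa" vs. correction
# "en lång bussresa" (step 1, the alignment, is given here by hand).
learner = Tree("resa", "root",
               [Tree("en", "det"), Tree("lång", "amod"), Tree("bus", "compound")])
correction = Tree("bussresa", "root",
                  [Tree("en", "det"), Tree("lång", "amod")])

# Query for split compounds: LHS subtrees containing a `compound` dependent.
hits = list(search([(learner, correction)], {"compound"}))
```

Running the query yields the learner/correction pair, since only the learner tree has a `compound` dependent; real queries would of course constrain word forms, features, and deeper structure.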
## Use cases
- error retrieval: patterns (queries) $\to$ trees
- pattern extraction: trees $\to$ patterns
- feedback comment generation: patterns $\to$ natural language comments <!--maybe goto mittsem slides-->
# Sources
## In order of appearance
- \small John Lee, Keying Li, and Herman Leung. _L1-L2 parallel dependency treebank as learner corpus_. In Proceedings of the 15th International Conference on Parsing Technologies, pages 44-49, Pisa, Italy, September 2017. Association for Computational Linguistics
- \small John Lee, Herman Leung, and Keying Li. _Towards Universal Dependencies for learner Chinese_. In Marie-Catherine de Marneffe, Joakim Nivre, and Sebastian Schuster, editors, Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), pages 67-71, Gothenburg, Sweden, May 2017. Association for Computational Linguistics
## In order of appearance
- \small Yevgeni Berzak, Jessica Kenney, Carolyn Spadine, Jing Xian Wang, Lucia Lam, Keiko Sophie Mori, Sebastian Garza, and Boris Katz. _Universal Dependencies for learner English_. In Katrin Erk and Noah A. Smith, editors, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 737-746, Berlin, Germany, August 2016. Association for Computational Linguistics.
- \small Elisa Di Nuovo, Manuela Sanguinetti, Alessandro Mazzei, Elisa Corino, and Cristina Bosco. _VALICO-UD: Treebanking an Italian learner corpus in Universal Dependencies_. IJCoL. Italian Journal of Computational Linguistics, 8(8-1), 2022
## In order of appearance
- \small Hakyung Sung and Gyu-Ho Shin. _Constructing a dependency treebank for second language learners of Korean_. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 3747-3758, Torino, Italy, May 2024. ELRA and ICCL
- \small Hakyung Sung and Gyu-Ho Shin. _Second language Korean Universal Dependency treebank v1.2: Focus on data augmentation and annotation scheme refinement_. In Špela Arhar Holdt, Nikolai Ilinykh, Barbara Scalvini, Micaella Bruton, Iben Nyholm Debess, and Crina Madalina Tudor, editors, Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025), pages 13-19, Tallinn, Estonia, March 2025. University of Tartu Library, Estonia
## In order of appearance
- \small Alla Rozovskaya. _Universal Dependencies for learner Russian_. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 17112-17119, Torino, Italy, May 2024. ELRA and ICCL
- \small Elena Volodina, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, Lisa Rudebeck, Carl-Johan Schenström, Gunlög Sundberg, et al. _The SweLL language learner corpus: From design to annotation_. Northern European Journal of Language Technology, 6:67-104, 2019
- \small Arianna Masciolini. _A query engine for L1-L2 parallel dependency treebanks_. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 574-587, Tórshavn, Faroe Islands, May 2023. University of Tartu Library
## In order of appearance
- \small Arianna Masciolini, Elena Volodina, and Dana Dannélls. _Towards automatically extracting morphosyntactical error patterns from L1-L2 parallel dependency treebanks_. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 585-597, Toronto, Canada, July 2023. Association for Computational Linguistics
- \small Arianna Masciolini and Márton A Tóth. _STUnD: ett Sökverktyg för Tvåspråkiga Universal Dependencies-trädbanker_. In Proceedings of the Huminfra Conference, pages 95-109, Gothenburg, Sweden, 2024
## To appear
- \small Arianna Masciolini, Herbert Lange and Márton A Tóth. _Exploring parallel corpora with STUnD: a Search Tool for Universal Dependencies_. In the upcoming Huminfra Handbook, Gothenburg, Sweden, __most likely__ 2025
- \small a paper about harmonization of UD guidelines for L2 treebanks (under review)