diff --git a/lectures/lecture-n-1/img/argmining.png b/lectures/lecture-n-1/img/argmining.png
new file mode 100644
index 0000000..9ed9b65
Binary files /dev/null and b/lectures/lecture-n-1/img/argmining.png differ
diff --git a/lectures/lecture-n-1/img/gfast.png b/lectures/lecture-n-1/img/gfast.png
new file mode 100644
index 0000000..7909973
Binary files /dev/null and b/lectures/lecture-n-1/img/gfast.png differ
diff --git a/lectures/lecture-n-1/img/machamp.png b/lectures/lecture-n-1/img/machamp.png
new file mode 100644
index 0000000..bbb7d34
Binary files /dev/null and b/lectures/lecture-n-1/img/machamp.png differ
diff --git a/lectures/lecture-n-1/img/sets.png b/lectures/lecture-n-1/img/sets.png
new file mode 100644
index 0000000..8306374
Binary files /dev/null and b/lectures/lecture-n-1/img/sets.png differ
diff --git a/lectures/lecture-n-1/img/ud.conllu b/lectures/lecture-n-1/img/ud.conllu
new file mode 100644
index 0000000..d44193f
--- /dev/null
+++ b/lectures/lecture-n-1/img/ud.conllu
@@ -0,0 +1,6 @@
+1 the the DET DT Definite=Def|PronType=Art 3 det _ TokenRange=0:3
+2 black black ADJ JJ Degree=Pos 3 amod _ TokenRange=4:9
+3 cat cat NOUN NN Number=Sing 4 nsubj _ TokenRange=10:13
+4 sees see VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ TokenRange=14:18
+5 us we PRON PRP Case=Acc|Number=Plur|Person=1|PronType=Prs 4 obj _ TokenRange=19:21
+6 now now ADV RB PronType=Dem 4 advmod _ SpaceAfter=No|TokenRange=22:25
\ No newline at end of file
diff --git a/lectures/lecture-n-1/img/ud.svg b/lectures/lecture-n-1/img/ud.svg
new file mode 100644
index 0000000..7fcd7e0
--- /dev/null
+++ b/lectures/lecture-n-1/img/ud.svg
@@ -0,0 +1,51 @@
+[SVG markup omitted: dependency tree for "the black cat sees us now", with UPOS tags DET ADJ NOUN VERB PRON ADV and arcs det, amod, nsubj, root, obj, advmod]
diff --git a/lectures/lecture-n-1/sentence.txt b/lectures/lecture-n-1/sentence.txt
deleted file mode 100644
index e69de29..0000000
diff --git a/lectures/lecture-n-1/slides.md b/lectures/lecture-n-1/slides.md
index ab84462..c8afd58 100644
--- a/lectures/lecture-n-1/slides.md
+++ b/lectures/lecture-n-1/slides.md
@@ -1,6 +1,6 @@
---
-title: "Training and evaluating UD parsers"
-subtitle: "by popular demand"
+title: "Training and evaluating \\newline dependency parsers"
+subtitle: "(added to the course by popular demand)"
author: "Arianna Masciolini"
theme: "lucid"
logo: "gu.png"
@@ -8,7 +8,173 @@ date: "VT25"
institute: "LT2214 Computational Syntax"
---

-# Basics of dependency parsing
+## Today's topic
\bigskip \bigskip
![](img/sets.png)

-## Today's focus
-
\ No newline at end of file
+# Parsing
+
+## A structured prediction task
Sequence $\to$ structure, e.g.

- natural language sentence $\to$ syntax tree
- code $\to$ AST
- argumentative essay $\to$ argumentative structure

## Example (argmining)

> Språkbanken has better fika than CLASP: every fika, someone bakes. Sure, CLASP has a better coffee machine. On the other hand, there are more important things than coffee. In fact, most people drink tea in the afternoon.

## Example (argmining)
![](img/argmining.png)

\footnotesize From "A gentle introduction to argumentation mining" (Lindahl et al., 2022)

# Syntactic parsing

## From sentence to tree
From Jurafsky & Martin.
_Speech and Language Processing_, chapter 18 (January 2024 draft):

> Syntactic parsing is the task of assigning a syntactic structure to a sentence

- the structure is usually a _syntax tree_
- two main classes of approaches:
  - constituency parsing (e.g. GF)
  - dependency parsing (e.g. UD)

## Example (GF)
```
MicroLang> i MicroLangEng.gf
linking ... OK

Languages: MicroLangEng
7 msec
MicroLang> p "the black cat sees us now"
PredVPS (DetCN the_Det (AdjCN (PositA black_A)
(UseN cat_N))) (AdvVP (ComplV2 see_V2 (UsePron
we_Pron)) now_Adv)
```

## Example (GF)
```haskell
PredVPS (
  DetCN
    the_Det
    (AdjCN (PositA black_A) (UseN cat_N))
  )
  (AdvVP
    (ComplV2 see_V2 (UsePron we_Pron))
    now_Adv
  )
```

## Example (GF)
![](img/gfast.png)

# Dependency parsing

## Example (UD)
![](img/ud.svg)

\small
```
1 the _ DET _ _ 3 det _ _
2 black _ ADJ _ _ 3 amod _ _
3 cat _ NOUN _ _ 4 nsubj _ _
4 sees _ VERB _ _ 0 root _ _
5 us _ PRON _ _ 4 obj _ _
6 now _ ADV _ _ 4 advmod _ _
```

## Two paradigms
- __graph-based algorithms__: find the optimal tree in the set of all possible candidate trees, or in a subset of it
- __transition-based algorithms__: incrementally build a tree by solving a sequence of classification problems

## Graph-based approaches
$$\hat{t} = \underset{t \in T(s)}{argmax}\, score(s,t)$$

- $t$: candidate tree
- $\hat{t}$: predicted tree
- $s$: input sentence
- $T(s)$: set of candidate trees for $s$

## Complexity
Complexity depends on:

- the choice of $T(s)$ (upper bound on its size: $n^{n-1}$, where $n$ is the number of words in $s$)
- the scoring function: in the __arc-factored model__, the score of a tree is the sum of the scores of its edges, each scored individually by a NN, which results in $O(n^3)$ complexity (a toy sketch follows a few slides ahead)

## Transition-based approaches
- trees are built through a sequence of steps, called _transitions_
- training requires:
  - a gold-standard treebank (as for graph-based approaches)
  - an _oracle_, i.e. an algorithm that converts each tree into a gold-standard sequence of transitions
- much more efficient: $O(n)$

## Evaluation
Two main metrics (a small computation sketch follows a few slides ahead):

- __UAS__ (Unlabelled Attachment Score): what fraction of nodes is attached to the correct dependency head?
- __LAS__ (Labelled Attachment Score): what fraction of nodes is attached to the correct dependency head _with an arc labelled with the correct relation type_[^1]?

[^1]: in UD, the `DEPREL` column

# Specifics of UD parsing

## Not just parsing per se
UD "parsers" typically do a lot more than just dependency parsing:

- lemmatization (`LEMMA` column)
- POS tagging (`UPOS` + `XPOS`)
- morphological tagging (`FEATS`)
- ...
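
## Toy sketch: graph-based parsing
A brute-force illustration of the $argmax$ from the graph-based slides, using an arc-factored score. This is not how real parsers work (they use e.g. the Chu-Liu/Edmonds maximum spanning tree algorithm and a trained scorer); all names below are illustrative and the scorer is random.

\footnotesize
```python
import itertools, random

def is_tree(heads):
    # heads[i - 1] is the head of token i; 0 stands for the root
    if heads.count(0) != 1:                # exactly one root
        return False
    for i in range(1, len(heads) + 1):
        seen, h = set(), i
        while h != 0:                      # follow heads up to the root
            if h in seen:
                return False               # cycle
            seen.add(h)
            h = heads[h - 1]
    return True

def argmax_tree(n, arc_score):
    # arc-factored model: score(s, t) = sum of the scores of t's arcs
    trees = (hs for hs in itertools.product(range(n + 1), repeat=n) if is_tree(hs))
    return max(trees, key=lambda hs: sum(arc_score(h, d) for d, h in enumerate(hs, start=1)))

# dummy arc scorer standing in for a trained neural network
random.seed(0)
scores = {(h, d): random.random() for h in range(5) for d in range(1, 5)}
print(argmax_tree(4, lambda h, d: scores[h, d]))  # best head sequence for a 4-word toy sentence
```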
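
## Toy sketch: computing UAS and LAS
A minimal way to compute the two attachment scores defined earlier from a gold and a predicted CoNLL-U file, assuming identical tokenisation (file and function names are only illustrative; the official evaluation script, whose output is shown a couple of slides ahead, also handles tokenisation mismatches and the UD-specific metrics).

\footnotesize
```python
def heads_and_deprels(path):
    # one (HEAD, DEPREL) pair per token; comments, token ranges and empty nodes are skipped
    pairs = []
    with open(path, encoding="utf-8") as conllu:
        for line in conllu:
            cols = line.rstrip("\n").split("\t")
            if len(cols) == 10 and cols[0].isdigit():
                pairs.append((cols[6], cols[7].split(":")[0]))  # ignore relation subtypes
    return pairs

def uas_las(gold_path, pred_path):
    gold, pred = heads_and_deprels(gold_path), heads_and_deprels(pred_path)
    assert len(gold) == len(pred), "this sketch assumes identical tokenisation"
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las

# usage: uas, las = uas_las("gold.conllu", "predicted.conllu")
```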

## Evaluation (UD-specific)
Some more specific metrics:

- CLAS (Content-word LAS): LAS limited to content words
- MLAS (Morphology-Aware LAS): CLAS that also uses the `FEATS` column
- BLEX (Bi-Lexical dependency score): CLAS that also uses the `LEMMA` column

## Evaluation script output
\small
```
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |    100.00 |    100.00 |    100.00 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |    100.00 |    100.00 |    100.00 |
UPOS       |     98.36 |     98.36 |     98.36 |     98.36
XPOS       |    100.00 |    100.00 |    100.00 |    100.00
UFeats     |    100.00 |    100.00 |    100.00 |    100.00
AllTags    |     98.36 |     98.36 |     98.36 |     98.36
Lemmas     |    100.00 |    100.00 |    100.00 |    100.00
UAS        |     92.73 |     92.73 |     92.73 |     92.73
LAS        |     90.30 |     90.30 |     90.30 |     90.30
CLAS       |     88.50 |     88.34 |     88.42 |     88.34
MLAS       |     86.72 |     86.56 |     86.64 |     86.56
BLEX       |     88.50 |     88.34 |     88.42 |     88.34
```

## Three generations of parsers
1. __MaltParser__ (Nivre et al., 2006): "classic" transition-based parser, data-driven but not NN-based
2. __UDPipe__: neural transition-based parser; personal favorite
   - version 1 (Straka et al., 2016): solid and fast software, available anywhere
   - version 2 (Straka et al., 2018): much better performance, but slower and only available through an API
3. __MaChAmp__ (van der Goot et al., 2021): transformer-based toolkit for multi-task learning; works on all CoNLL-like data, gets close to the SOTA, and is relatively easy to install and train

## Your task (lab 3)
![](img/machamp.png)

1. annotate a small treebank for your language of choice (already started)
2. train a parser-tagger with MaChAmp on a reference UD treebank (tomorrow: installation)
3. evaluate it on your treebank

## Sources/further reading
- chapters 18-19 of the January 2024 draft of _Speech and Language Processing_ (Jurafsky & Martin) (full text available [__here__](https://web.stanford.edu/~jurafsky/slp3/))
- unit 3-2 of Johansson & Kuhlmann's course "Deep Learning for Natural Language Processing" (slides and videos available [__here__](https://liu-nlp.ai/dl4nlp/modules/module3/))
- section 10.9.2 on parser evaluation from Aarne's course notes (on Canvas or [__here__](https://www.cse.chalmers.se/~aarne/grammarbook.pdf))

## Papers describing the parsers
- _MaltParser: A Data-Driven Parser-Generator for Dependency Parsing_ (Nivre et al., 2006) (PDF [__here__](http://lrec-conf.org/proceedings/lrec2006/pdf/162_pdf.pdf))
- _UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing_ (Straka et al., 2016) (PDF [__here__](https://aclanthology.org/L16-1680.pdf))
- _UDPipe 2.0 Prototype at CoNLL 2018 UD Shared Task_ (Straka et al., 2018) (PDF [__here__](https://aclanthology.org/K18-2020.pdf))
- _Massive Choice, Ample Tasks (MACHAMP): A Toolkit for Multi-task Learning in NLP_ (van der Goot et al., 2021) (PDF [__here__](https://arxiv.org/pdf/2005.14672))
\ No newline at end of file
diff --git a/lectures/lecture-n-1/slides.pdf b/lectures/lecture-n-1/slides.pdf
index ff18785..bee586b 100644
Binary files a/lectures/lecture-n-1/slides.pdf and b/lectures/lecture-n-1/slides.pdf differ