forked from GitHub/comp-syntax-gu-mlt
Merge branch 'main' of https://github.com/harisont/comp-syntax-gu-mlt
@@ -30,15 +30,30 @@ trees by hand, on paper.
## Chapter 3: UD syntax analysis

In this lab, you will annotate a bilingual corpus, in English and your own language, with UD.
The English text is given in the file `comp-syntax-corpus-english.txt` in this directory.

Your task is to
1. write an English CoNLL file analysing this corpus
2. translate the corpus to your language
3. write a CoNLL file analysing your translation

The corpus is a combination of different sources, including the Parallel UD treebank (PUD).
If you want to cheat, or just check your own answer, you can look for those sentences in the official PUD. You can also compare your analyses with those of an automatic parser, such as [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/), which you can try directly from your browser. These automatic analyses must of course be taken with a grain of salt.

The first 12 sentences are POS-tagged, with each word having the form

`word:<POS>`

Hint: you can initialize the task by converting each word or `word:<POS>` to a simplified CoNLL line with a dummy head (0) and label (dep), and with the proper position number, of course.
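
The initialization described in the hint could be sketched in Python roughly as follows. This is only an illustration under some assumptions: tokens are separated by whitespace, a tagged token looks literally like `world:<NOUN>`, and untagged words get a placeholder tag (`_` here) that you fill in by hand. The function name and the placeholder are made up for this example.

```python
import re

# Matches a token of the form word:<POS>, e.g. "world:<NOUN>" (assumed format).
TAGGED = re.compile(r"^(?P<word>.+):<(?P<pos>[A-Z]+)>$")


def sentence_to_simplified_conll(sentence, placeholder_pos="_"):
    """Turn one whitespace-tokenized sentence into simplified CoNLL lines."""
    lines = []
    for position, token in enumerate(sentence.split(), start=1):
        match = TAGGED.match(token)
        if match:
            word, pos = match.group("word"), match.group("pos")
        else:
            # Untagged word: keep it as is and use the placeholder tag.
            word, pos = token, placeholder_pos
        # Dummy head 0 and label dep, to be corrected by hand afterwards.
        lines.append(f"{position} {word} {pos} 0 dep")
    return lines


if __name__ == "__main__":
    for line in sentence_to_simplified_conll("the:<DET> world:<NOUN>"):
        print(line)
    # prints:
    # 1 the DET 0 dep
    # 2 world NOUN 0 dep
```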
The UD annotation that you produce manually can be simplified CoNLL, with just the fields

`position word postag head label`

Make sure that each field is exactly one token, so that the whole line has exactly 5 tokens.
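
A quick way to catch lines that break the five-token rule is a small check like the one below. It is a sketch, not part of the course tools: the function name is hypothetical, and the extra checks that position and head are numbers are an assumption based on the field descriptions above.

```python
def check_simplified_line(line):
    """Return a list of problems found in one simplified CoNLL line."""
    fields = line.split()
    if len(fields) != 5:
        return [f"expected 5 tokens, found {len(fields)}"]
    position, _word, _postag, head, _label = fields
    problems = []
    # Positions are assumed to count from 1; head 0 is allowed (root or dummy).
    if not position.isdecimal() or int(position) < 1:
        problems.append("position must be a positive integer")
    if not head.isdecimal():
        problems.append("head must be a non-negative integer")
    return problems


if __name__ == "__main__":
    print(check_simplified_line("7 world NOUN 4 nmod"))      # prints []
    print(check_simplified_line("7 the world NOUN 4 nmod"))  # prints ['expected 5 tokens, found 6']
```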
This input can be automatically expanded to full CoNLL by adding underscores for the lemma, morphology, and other missing fields, as well as tabs between the fields (if you didn't use tabs already).

`position word _ postag _ _ head label _ _`
@@ -51,37 +66,16 @@ expands to
`7 world _ NOUN _ _ 4 nmod _ _`

(Unfortunately, the tabs are not visible in the md output.)
The conversion to full CoNLL can be done using Python or `gf-ud reduced2conll` (available on eduserv), or with [this script](https://gist.github.com/harisont/612a87d20f729aa3411041f873367fa2).
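
If you go the Python route, the expansion could look something like the sketch below. It is not the linked script and not how `gf-ud reduced2conll` is implemented; it simply follows the field layout `position word _ postag _ _ head label _ _` given above. The function names are made up, and passing blank lines and `#` comment lines through unchanged is an assumption about how your simplified file is organized.

```python
def expand_line(line):
    """Expand one 5-field simplified CoNLL line to a 10-field, tab-separated line."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 tokens, found {len(fields)}: {line!r}")
    position, word, postag, head, label = fields
    # Underscores stand in for lemma, language-specific tag, morphology,
    # enhanced dependencies and miscellaneous information.
    return "\t".join([position, word, "_", postag, "_", "_", head, label, "_", "_"])


def expand_text(text):
    """Expand a whole simplified file, keeping blank and comment lines as they are."""
    expanded = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            expanded.append(line)
        else:
            expanded.append(expand_line(line))
    return "\n".join(expanded)


if __name__ == "__main__":
    # Shows the tabs explicitly: '7\tworld\t_\tNOUN\t_\t_\t4\tnmod\t_\t_'
    print(repr(expand_line("7 world NOUN 4 nmod")))
```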
Once you have full CoNLL, you can use for instance the [gf-ud](https://github.com/GrammaticalFramework/gf-ud) tool or [the online CoNLL-U viewer](https://universaldependencies.org/conllu_viewer.html) to visualize it.

If you use the gf-ud tool, you will need to issue the command

`cat my-file.conllu | ./gf-ud conll2pdf`

It is possible that you won't be able to visualize the trees directly on eduserv.
Building gf-ud and running this command on your machine requires Haskell and the GF libraries, as well as LaTeX to show the pdf output.
## Chapter 4: phrase structure analysis