From 0f1826a8689a302d2babb5e6ded2b73739158760 Mon Sep 17 00:00:00 2001 From: aarne Date: Mon, 20 Jan 2014 08:32:09 +0000 Subject: [PATCH] subtitles in translation doc --- lib/doc/translation.html | 56 ++++++++++++++++++++++++++++------------ lib/doc/translation.txt | 55 +++++++++++++++++++++++++++------------ 2 files changed, 77 insertions(+), 34 deletions(-) diff --git a/lib/doc/translation.html b/lib/doc/translation.html index c9c744401..f16d00065 100644 --- a/lib/doc/translation.html +++ b/lib/doc/translation.html @@ -10,6 +10,9 @@ Work in progress, January 2014 + +

GF and the RGL

+

GF, Grammatical Framework, was originally designed for the purpose of multilingual controlled language systems, which would enable high-quality translation on limited domains. The abstract syntax of GF defines the semantic @@ -19,20 +22,24 @@ and thereby translation where the abstract syntax functions as an inte

As a bottle-neck of GF applications, it was soon realized that the definition of concrete syntax requires a lot -of manual work and linguistic skill, due to the complexities of natural language syntax and morphology. Some of +of manual work and linguistic skill, because of the complexities of natural language syntax and morphology. Some of the complexities can be ignored in a small system. For instance, in a mathematical system, it may be enough to use verbs in the present tense only. But very much the same linguistic problems must be solved again and again -in new applications: French verb inflection is much the same in mathematics as in a tourist phrasebook. To solve +in new applications: French verb inflection is the same in mathematics as in a tourist phrasebook. To solve this problem, the GF Resource Grammar Library (RGL) was developed, to take care of "low-level" linguistic rules such as inflection, agreement, and word order. This enables the authors of application grammars to focus on the semantics (when designing the abstract syntax) and on selecting RGL functions that produce the idioms they want. The RGL grew into an international open-source project, where more than 50 persons have contributed to -implementing it for 29 languages at the time of writing. +implementing it for 29 languages by the time of writing this.

+ +

Scaling up GF translation

+

The RGL was thus originally designed to be used just as its name says: as a library -for application grammars, which were the ones used as top-level grammars, i.e. for -parsing, generation, and translation at run time. Little attention was paid to the usability of RGL as a top-level +for application grammars. Only the latter were meant to be used as top-level grammars, i.e. for +parsing, generation, and translation at run time. Little attention was therefore +paid to the usability of RGL as a top-level grammar by itself. But when applications accumulated, ranging from technical text to spoken dialogue, the coverage of the RGL grew into a coverage that approximates a "complete grammar" of many of the languages. And recently, there has indeed been success in using the RGL as a wide-coverage translation grammar, @@ -46,23 +53,25 @@ to open-text processing. This success is a result of four lines of development: This development is also based on the work of Peter Ljunglöf on GF parsing and Lauri Alanko on the C runtime.

  • Large-scale dictionaries, both manually built and extracted from free sources, and linked into a multilingual - translation dictionary now covering 10k to 60k entries for eight languages. This work was started by Björn Bringert - porting the Oxford Advanced Learner's Dictionary for English to GF. + translation dictionary now covering 10k to 60k entries for eight languages. This work was started by Björn Bringert, + who ported the Oxford Advanced Learner's Dictionary of English to GF.

  • Probabilistic disambiguation, using a model trained from the Penn Treebank. Due to the common abstract syntax, the same model can be readily used for other languages as well, even though the adequacy of this transfer has not been systematically evaluated.

  • Robust parsing, which recovers from unknown words and syntax by introducing metavariables ("question marks") - and returning chunk-by-chunk translations; this leads to loss of quality, but fulfills the principle that + and returning chunk-by-chunk translations. This leads to loss of quality, but fulfills the principle that "something is better than nothing". +

    Remaining problems

    +

    -The result of this work is indeed a large-coverage translation system, which can be used in the same way as Google +The result of all this work is a wide-coverage translation system, which can be used in the same way as Google Translate, Bing, Systran, and Apertium - to "translate anything", albeit with a varying quality. At the moment of writing, the performance is not yet generally on the level with the best of the competition, but shows some promising -improvements in e.g. long-distance agreement and word order. In order to make these into absolute improvements, we +improvements in e.g. long-distance agreement and word order. To make these advantages into absolute improvements, we will need to fix problems that the other systems (or at least some of them) get right but where GF translation often fails:

    @@ -74,29 +83,33 @@ often fails:

  • Speed, which gets worse with long sentences and with more complex languages.

    -
  • Idiomacy, due to lack of idiomatic constructions that are not compositional in the RGL but which are - often correct in phrase-based SMT. +
  • Idiomacy, due to the lack of idiomatic constructions that are not compositional and therefore don't get right + in the RGL but are often correct in phrase-based SMT. +

    Advantages of GF translation

    +

    Given that these issues get resolved, the strengths of the GF approach can be made more visible:

      -
    • Grammaticality, in particular with the already mentioned agreement and word order. +
    • Grammaticality, in particular the already mentioned issues of agreement and word order.

      -
    • Predictability, in the sense that a local change in the input usually results in just a corresponding +
    • Predictability, in the sense that a local change in the input usually results in a corresponding local change in the output (unless otherwise required by idiomacy).

    • Feedback, i.e. the ease of showing the confidence level of the translation, alternative translations, and linguistic information.

    • Adaptability, i.e. the ease of fixing bugs, adapting the system to special domains, and personalizing it. - This can be done with great precision, e.g. fixing a bug without breaking anything else. + This can be done with great precision. For instance, a bug in a grammar can be fixed without + breaking anything else.

    • Light weight. The system runs on standard laptops and even on mobile phones; the size of the run-time - system for all pairs of 8 languages is under 20MB, and recompiling the whole system (e.g. after bug fixes or - domain adaptation) is a matter of a few minutes, where corresponding sizes for SMT systems are gigabytes of size + system for all pairs of 8 languages is under 20MB (on the Android platform), and recompiling the whole + system (e.g. after bug fixes or + domain adaptation) is a matter of a few minutes, where corresponding figures for SMT systems are gigabytes of size and days of retraining.

    • Multilinguality, in the sense that once the parsing of the input is settled, the output can be readily @@ -104,6 +117,8 @@ Given that these issues get resolved, the strengths of the GF approach can be ma and also in the sense that the GF model works equally well for any language pair.
    +

    Wanted: more work, new ideas

    +

    The recipes for improvement are, as always, more work and new ideas. Each of the four weaknesses mentioned above can be relieved by more work - in particular, lexical coverage by more work on the lexicon, since @@ -147,6 +162,8 @@ it has serious restrictions: Example: the category Adv. +

    What speaks for using RGL

    +

    Despite these problems, the RGL has shown to be a possible starting point for large-scale translation. It has a couple of advantages speaking for this: @@ -167,6 +184,9 @@ similar to the ResourceDemo with flatter and more semantic structures. But this the replication of many rules, even though it would be to a large extent doable by using a functor, that is, by just one set of rules covering all languages.

    + +

    The structure of the wide-coverage translation grammar

    +

    Thus the path chosen is a mixture of RGL and application grammar. In brief, the translation grammar consists of

    @@ -224,6 +244,8 @@ Here is a description of each of the modules:
  • RGLCategories stands for the type system of the standard RGL, the module named Cat. +

    Where and why the translation grammar differs from the RGL

    +

    A guiding principle is thus that the translation grammar preserves as much as possible of the RGL, so that duplicated work is avoided. But as the purposes of the two are different, not everything is possible. Two diff --git a/lib/doc/translation.txt b/lib/doc/translation.txt index 3d5281a81..1a2345e5b 100644 --- a/lib/doc/translation.txt +++ b/lib/doc/translation.txt @@ -3,6 +3,8 @@ Aarne Ranta et al. Work in progress, January 2014 +==GF and the RGL== + GF, Grammatical Framework, was originally designed for the purpose of **multilingual controlled language systems**, which would enable high-quality translation on limited domains. The **abstract syntax** of GF defines the semantic structures relevant for the domain, and the **concrete syntaxes** map these structures to grammatically correct @@ -10,19 +12,23 @@ and idiomatic text in each target language. The **reversibility** of GF enables and thereby **translation** where the abstract syntax functions as an **interlingua**. As a bottle-neck of GF applications, it was soon realized that the definition of concrete syntax requires a lot -of manual work and linguistic skill, due to the complexities of natural language syntax and morphology. Some of +of manual work and linguistic skill, because of the complexities of natural language syntax and morphology. Some of the complexities can be ignored in a small system. For instance, in a mathematical system, it may be enough to use verbs in the present tense only. But very much the same linguistic problems must be solved again and again -in new applications: French verb inflection is much the same in mathematics as in a tourist phrasebook. To solve +in new applications: French verb inflection is the same in mathematics as in a tourist phrasebook. To solve this problem, the **GF Resource Grammar Library** (RGL) was developed, to take care of "low-level" linguistic rules such as inflection, agreement, and word order. This enables the authors of **application grammars** to focus on the semantics (when designing the abstract syntax) and on selecting RGL functions that produce the idioms they want. The RGL grew into an international open-source project, where more than 50 persons have contributed to -implementing it for 29 languages at the time of writing. +implementing it for 29 languages by the time of writing this. + + +==Scaling up GF translation== The RGL was thus originally designed to be used just as its name says: as a library -for application grammars, which were the ones used as **top-level grammars**, i.e. for -parsing, generation, and translation at run time. Little attention was paid to the usability of RGL as a top-level +for application grammars. Only the latter were meant to be used as **top-level grammars**, i.e. for +parsing, generation, and translation at run time. Little attention was therefore +paid to the usability of RGL as a top-level grammar by itself. But when applications accumulated, ranging from technical text to spoken dialogue, the coverage of the RGL grew into a coverage that approximates a "complete grammar" of many of the languages. And recently, there has indeed been success in using the RGL as a wide-coverage translation grammar, @@ -34,22 +40,24 @@ to open-text processing. This success is a result of four lines of development: This development is also based on the work of Peter Ljunglöf on GF parsing and Lauri Alanko on the C runtime. - **Large-scale dictionaries**, both manually built and extracted from free sources, and linked into a multilingual - translation dictionary now covering 10k to 60k entries for eight languages. This work was started by Björn Bringert - porting the Oxford Advanced Learner's Dictionary for English to GF. + translation dictionary now covering 10k to 60k entries for eight languages. This work was started by Björn Bringert, + who ported the Oxford Advanced Learner's Dictionary of English to GF. - **Probabilistic disambiguation**, using a model trained from the Penn Treebank. Due to the common abstract syntax, the same model can be readily used for other languages as well, even though the adequacy of this transfer has not been systematically evaluated. - **Robust parsing**, which recovers from unknown words and syntax by introducing **metavariables** ("question marks") - and returning chunk-by-chunk translations; this leads to loss of quality, but fulfills the principle that + and returning chunk-by-chunk translations. This leads to loss of quality, but fulfills the principle that "something is better than nothing". -The result of this work is indeed a large-coverage translation system, which can be used in the same way as Google +==Remaining problems== + +The result of all this work is a wide-coverage translation system, which can be used in the same way as Google Translate, Bing, Systran, and Apertium - to "translate anything", albeit with a varying quality. At the moment of writing, the performance is not yet generally on the level with the best of the competition, but shows some promising -improvements in e.g. long-distance agreement and word order. In order to make these into absolute improvements, we +improvements in e.g. long-distance agreement and word order. To make these advantages into absolute improvements, we will need to fix problems that the other systems (or at least some of them) get right but where GF translation often fails: @@ -59,26 +67,30 @@ often fails: - **Speed**, which gets worse with long sentences and with more complex languages. -- **Idiomacy**, due to lack of idiomatic constructions that are not compositional in the RGL but which are - often correct in phrase-based SMT. +- **Idiomacy**, due to the lack of idiomatic constructions that are not compositional and therefore don't get right + in the RGL but are often correct in phrase-based SMT. +==Advantages of GF translation== + Given that these issues get resolved, the strengths of the GF approach can be made more visible: -- **Grammaticality**, in particular with the already mentioned agreement and word order. +- **Grammaticality**, in particular the already mentioned issues of agreement and word order. -- **Predictability**, in the sense that a local change in the input usually results in just a corresponding +- **Predictability**, in the sense that a local change in the input usually results in a corresponding local change in the output (unless otherwise required by idiomacy). - **Feedback**, i.e. the ease of showing the confidence level of the translation, alternative translations, and linguistic information. - **Adaptability**, i.e. the ease of fixing bugs, adapting the system to special domains, and personalizing it. - This can be done with great precision, e.g. fixing a bug without breaking anything else. + This can be done with great precision. For instance, a bug in a grammar can be fixed without + breaking anything else. - **Light weight**. The system runs on standard laptops and even on mobile phones; the size of the run-time - system for all pairs of 8 languages is under 20MB, and recompiling the whole system (e.g. after bug fixes or - domain adaptation) is a matter of a few minutes, where corresponding sizes for SMT systems are gigabytes of size + system for all pairs of 8 languages is under 20MB (on the Android platform), and recompiling the whole + system (e.g. after bug fixes or + domain adaptation) is a matter of a few minutes, where corresponding figures for SMT systems are gigabytes of size and days of retraining. - **Multilinguality**, in the sense that once the parsing of the input is settled, the output can be readily @@ -86,6 +98,8 @@ Given that these issues get resolved, the strengths of the GF approach can be ma and also in the sense that the GF model works equally well for any language pair. +==Wanted: more work, new ideas== + The recipes for improvement are, as always, **more work** and **new ideas**. Each of the four weaknesses mentioned above can be relieved by more work - in particular, lexical coverage by more work on the lexicon, since automatic extraction methods cannot really be trusted. As for disambiguation, new ideas about probabilistic @@ -125,6 +139,8 @@ it has serious restrictions: **Example**: the category ``Adv``. +==What speaks for using RGL== + Despite these problems, the RGL has shown to be a possible starting point for large-scale translation. It has a couple of advantages speaking for this: @@ -141,6 +157,9 @@ similar to the ResourceDemo with flatter and more semantic structures. But this the replication of many rules, even though it would be to a large extent doable by using a **functor**, that is, by just one set of rules covering all languages. + +==The structure of the wide-coverage translation grammar== + Thus the path chosen is a mixture of RGL and application grammar. In brief, the translation grammar consists of - **Selected RGL modules and functions**, as they are (using restricted inheritance); around 80% of the syntax. @@ -193,6 +212,8 @@ Here is a description of each of the modules: +==Where and why the translation grammar differs from the RGL== + A guiding principle is thus that the translation grammar preserves //as much as possible// of the RGL, so that duplicated work is avoided. But as the purposes of the two are different, not everything is possible. Two diverging principles have already been mentioned: