diff --git a/lib/doc/translation.dot b/lib/doc/translation.dot
index 17d48ff4a..00dcd3d20 100644
--- a/lib/doc/translation.dot
+++ b/lib/doc/translation.dot
@@ -6,7 +6,9 @@ graph {
 Translate -- RGLSyntax [style = dashed] ;
 Translate -- Extensions ;
 Translate -- Dictionary ;
+ Translate -- Chunk ;
 Extensions -- RGLCategories ;
+ Chunk -- RGLCategories ;
 RGLCategories ;
 RGLSyntax -- RGLCategories ;
 Dictionary -- RGLCategories ;
diff --git a/lib/doc/translation.html b/lib/doc/translation.html
index b26d4057d..64b621ea9 100644
--- a/lib/doc/translation.html
+++ b/lib/doc/translation.html
@@ -2,26 +2,35 @@
+ From Resource Grammar to Wide Coverage Translation with GF

From Resource Grammar to Wide Coverage Translation with GF

Aarne Ranta et al.
-Work in progress, January 2014
+January-May 2014
+

Scope

+ +

+Wide-coverage interlingual translator for
+Bulgarian, Chinese, Dutch, English, Finnish, French, German,
+Hindi, Italian, Spanish, Swedish.
+

+

How to use it

-This is a document about a wide-coverage translation system in GF. If you just want to try it before reading more,
-here are the main modes of getting started:
+If you just want to try it before reading more,
+here are the main ways to get started:

-1. Run on our server. Forthcoming.
+1. Run on our server. http://www.grammaticalframework.org/demos/translation.html

-2. Get an Android app. Forthcoming.
+2. Get an Android app. http://www.grammaticalframework.org/demos/app.html

3. Compile and run in the shell. Get the latest GF sources (with darcs or github) and then
@@ -34,27 +43,31 @@ here are the main modes of getting started:

     cd GF/lib/src
-    make Translate8.pgf
+    make -j Translate11.pgf
 
-This will take a long time (ten minutes or more) and will probably require at least 8GB of RAM.
+This will take a long time (fifteen minutes or more) and will probably require at least 8GB of RAM.

  • run the translator
    -    pgf-translate Translate8.pgf Phr TranslateEng TranslateSwe
    +    pgf-translate Translate11.pgf Phr TranslateEng TranslateSwe
     
    with obviously the possibility to vary the source and the target language. -

    + + +

    4. To modify the sources, work on the files in +

         GF/lib/src/translator/
     
    +

    It is these files that will be explained below. - +
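
Step 3 above shows one language pair. As a minimal sketch of varying the source and target language, the helper below only prints the corresponding `pgf-translate` command and does not run it. The function name `translate_cmd` is hypothetical. The three-letter codes and the `Translate<Code>` naming are assumptions extrapolated from `TranslateEng` and `TranslateSwe`, so check them against the concrete syntax names in your own PGF file.

```shell
#!/bin/sh
# Sketch only: print (do not run) the pgf-translate command for a language pair.
# Assumption: concrete syntaxes are named Translate<Code>, following the
# TranslateEng / TranslateSwe pattern used in step 3.

# Assumed three-letter codes for the eleven languages listed under Scope.
LANGS="Bul Chi Dut Eng Fin Fre Ger Hin Ita Spa Swe"

# translate_cmd SRC TGT [PGF]  (hypothetical helper, not part of GF)
translate_cmd() {
    src=$1
    tgt=$2
    pgf=${3:-Translate11.pgf}
    # Reject any code that is not in the assumed list.
    for code in "$src" "$tgt"; do
        case " $LANGS " in
            *" $code "*) ;;
            *) echo "unknown language code: $code" >&2; return 1 ;;
        esac
    done
    # Phr is the start category used in the command from step 3.
    echo "pgf-translate $pgf Phr Translate$src Translate$tgt"
}

translate_cmd Eng Swe
translate_cmd Fin Hin
```

Printing the command rather than executing it keeps the sketch usable without a compiled `Translate11.pgf`; to run a translation, evaluate the printed line in a shell where `pgf-translate` is installed.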

    GF and the RGL

    @@ -98,15 +111,15 @@ to open-text processing. This success is a result of four lines of development:
     This development is also based on the work of Peter Ljunglöf on GF parsing and Lauri Alanko on the C runtime.

  • Large-scale dictionaries, both manually built and extracted from free sources, and linked into a multilingual
- translation dictionary now covering 10k to 60k entries for eight languages. This work was started by Björn Bringert,
+ translation dictionary now covering 10k to 60k entries for eleven languages. This work was started by Björn Bringert,
 who ported the Oxford Advanced Learner's Dictionary of English to GF.

  • Probabilistic disambiguation, using a model trained from the Penn Treebank. Due to the common abstract syntax,
- the same model can be readily used for other languages as well, even though the adequacy of this transfer has not
+ the same model can be used for other languages as well, even though the adequacy of this transfer has not
 been systematically evaluated.

-  • Robust parsing, which recovers from unknown words and syntax by introducing metavariables ("question marks")
- and returning chunk-by-chunk translations. This leads to loss of quality, but fulfills the principle that
+  • Robust parsing, which recovers from unknown words and syntax
+ by using chunk-by-chunk translations. This leads to loss of quality, but fulfills the principle that
 "something is better than nothing".
@@ -152,7 +165,7 @@ Given that these issues get resolved, the strengths of the GF approach can be ma
 breaking anything else.

  • Light weight. The system runs on standard laptops and even on mobile phones; the size of the run-time
- system for all pairs of 8 languages is under 20MB (on the Android platform), and recompiling the whole
+ system for all pairs of 11 languages is under 25MB (on the Android platform), and recompiling the whole
 system (e.g. after bug fixes or domain adaptation) is a matter of a few minutes, where corresponding figures for SMT systems are gigabytes of size and days of retraining.
@@ -280,6 +293,10 @@ Here is a description of each of the modules:
 suffixed by categories and word sense information. This consists of the module named Dictionary.

  • RGLCategories stands for the type system of the standard RGL, the module named Cat.
+
+  • Chunk is the grammar defining what chunks (noun phrases, verbs,
+ adverbs, etc) can be used and how they are combined, when exact
+ syntactic combination fails.

    Where and why the translation grammar differs from the RGL

diff --git a/lib/doc/translation.png b/lib/doc/translation.png
index 1dcd9f5e9..be3216b22 100644
Binary files a/lib/doc/translation.png and b/lib/doc/translation.png differ
diff --git a/lib/doc/translation.txt b/lib/doc/translation.txt
index f70f40796..6c3e7545e 100644
--- a/lib/doc/translation.txt
+++ b/lib/doc/translation.txt
@@ -1,16 +1,25 @@
 From Resource Grammar to Wide Coverage Translation with GF
 Aarne Ranta et al.
-Work in progress, January 2014
+January-May 2014
+
+%!Encoding:utf8
+
+
+==Scope==
+
+Wide-coverage interlingual translator for
+Bulgarian, Chinese, Dutch, English, Finnish, French, German,
+Hindi, Italian, Spanish, Swedish.


 ==How to use it==

-This is a document about a wide-coverage translation system in GF. If you just want to try it before reading more,
-here are the main modes of getting started:
+If you just want to try it before reading more,
+here are the main ways to get started:

-1. **Run on our server.** Forthcoming.
+1. **Run on our server.** http://www.grammaticalframework.org/demos/translation.html

-2. **Get an Android app.** Forthcoming.
+2. **Get an Android app.** http://www.grammaticalframework.org/demos/app.html

 3. **Compile and run in the shell.** Get the latest GF sources (with darcs or github) and then
 - compile and install the GF compiler and library and the C runtime (``pgf-translate``).
@@ -18,13 +27,13 @@ here are the main modes of getting started:
 - compile the translator:
 ```
     cd GF/lib/src
-    make Translate8.pgf
+    make -j Translate11.pgf
 ```
-This will take a long time (ten minutes or more) and will probably require at least 8GB of RAM.
+This will take a long time (fifteen minutes or more) and will probably require at least 8GB of RAM.

 - run the translator
 ```
-    pgf-translate Translate8.pgf Phr TranslateEng TranslateSwe
+    pgf-translate Translate11.pgf Phr TranslateEng TranslateSwe
 ```
 with obviously the possibility to vary the source and the target language.

@@ -73,15 +82,15 @@ to open-text processing. This success is a result of four lines of development:
 This development is also based on the work of Peter Ljunglöf on GF parsing and Lauri Alanko on the C runtime.

 - **Large-scale dictionaries**, both manually built and extracted from free sources, and linked into a multilingual
- translation dictionary now covering 10k to 60k entries for eight languages. This work was started by Björn Bringert,
+ translation dictionary now covering 10k to 60k entries for eleven languages. This work was started by Björn Bringert,
 who ported the Oxford Advanced Learner's Dictionary of English to GF.

 - **Probabilistic disambiguation**, using a model trained from the Penn Treebank. Due to the common abstract syntax,
- the same model can be readily used for other languages as well, even though the adequacy of this transfer has not
+ the same model can be used for other languages as well, even though the adequacy of this transfer has not
 been systematically evaluated.

-- **Robust parsing**, which recovers from unknown words and syntax by introducing **metavariables** ("question marks")
- and returning chunk-by-chunk translations. This leads to loss of quality, but fulfills the principle that
+- **Robust parsing**, which recovers from unknown words and syntax
+ by using chunk-by-chunk translations. This leads to loss of quality, but fulfills the principle that
 "something is better than nothing".


@@ -121,7 +130,7 @@ Given that these issues get resolved, the strengths of the GF approach can be ma
 breaking anything else.

 - **Light weight**. The system runs on standard laptops and even on mobile phones; the size of the run-time
- system for all pairs of 8 languages is under 20MB (on the Android platform), and recompiling the whole
+ system for all pairs of 11 languages is under 25MB (on the Android platform), and recompiling the whole
 system (e.g. after bug fixes or domain adaptation) is a matter of a few minutes, where corresponding figures for SMT systems are gigabytes of size and days of retraining.

@@ -236,6 +245,9 @@ Here is a description of each of the modules:

 - **RGLCategories** stands for the type system of the standard RGL, the module named ``Cat``.
+- **Chunk** is the grammar defining what chunks (noun phrases, verbs,
+ adverbs, etc) can be used and how they are combined, when exact
+ syntactic combination fails.


 ==Where and why the translation grammar differs from the RGL==