some updates in lib/doc/translation.html
@@ -6,7 +6,9 @@ graph {
 Translate -- RGLSyntax [style = dashed] ;
 Translate -- Extensions ;
 Translate -- Dictionary ;
+Translate -- Chunk ;
 Extensions -- RGLCategories ;
+Chunk -- RGLCategories ;
 RGLCategories ;
 RGLSyntax -- RGLCategories ;
 Dictionary -- RGLCategories ;
@@ -2,26 +2,35 @@
 <HTML>
 <HEAD>
 <META NAME="generator" CONTENT="http://txt2tags.org">
+<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf8">
 <TITLE>From Resource Grammar to Wide Coverage Translation with GF</TITLE>
 </HEAD><BODY BGCOLOR="white" TEXT="black">
 <CENTER>
 <H1>From Resource Grammar to Wide Coverage Translation with GF</H1>
 <FONT SIZE="4"><I>Aarne Ranta et al.</I></FONT><BR>
-<FONT SIZE="4">Work in progress, January 2014</FONT>
+<FONT SIZE="4">January-May 2014</FONT>
 </CENTER>
 
 
+<H2>Scope</H2>
+
+<P>
+Wide-coverage interlingual translator for
+Bulgarian, Chinese, Dutch, English, Finnish, French, German,
+Hindi, Italian, Spanish, Swedish.
+</P>
+
 <H2>How to use it</H2>
 
 <P>
-This is a document about a wide-coverage translation system in GF. If you just want to try it before reading more,
-here are the main modes of getting started:
+If you just want to try it before reading more,
+here are the main ways to get started:
 </P>
 <P>
-1. <B>Run on our server.</B> Forthcoming.
+1. <B>Run on our server.</B> <A HREF="http://www.grammaticalframework.org/demos/translation.html">http://www.grammaticalframework.org/demos/translation.html</A>
 </P>
 <P>
-2. <B>Get an Android app.</B> Forthcoming.
+2. <B>Get an Android app.</B> <A HREF="http://www.grammaticalframework.org/demos/app.html">http://www.grammaticalframework.org/demos/app.html</A>
 </P>
 <P>
 3. <B>Compile and run in the shell.</B> Get the latest GF sources (with darcs or github) and then
@@ -34,27 +43,31 @@ here are the main modes of getting started:
 
 <PRE>
 cd GF/lib/src
-make Translate8.pgf
+make -j Translate11.pgf
 </PRE>
 
-This will take a long time (ten minutes or more) and will probably require at least 8GB of RAM.
+This will take a long time (fifteen minutes or more) and will probably require at least 8GB of RAM.
 <P></P>
 <LI>run the translator
 
 <PRE>
-pgf-translate Translate8.pgf Phr TranslateEng TranslateSwe
+pgf-translate Translate11.pgf Phr TranslateEng TranslateSwe
 </PRE>
 
 with obviously the possibility to vary the source and the target language.
-<P></P>
+</UL>
+
+<P>
 4. To modify the sources, work on the files in
+</P>
 
 <PRE>
 GF/lib/src/translator/
 </PRE>
 
+<P>
 It is these files that will be explained below.
-</UL>
+</P>
 
 <H2>GF and the RGL</H2>
 
@@ -98,15 +111,15 @@ to open-text processing. This success is a result of four lines of development:
 This development is also based on the work of Peter Ljunglöf on GF parsing and Lauri Alanko on the C runtime.
 <P></P>
 <LI><B>Large-scale dictionaries</B>, both manually built and extracted from free sources, and linked into a multilingual
-translation dictionary now covering 10k to 60k entries for eight languages. This work was started by Björn Bringert,
+translation dictionary now covering 10k to 60k entries for eleven languages. This work was started by Björn Bringert,
 who ported the Oxford Advanced Learner's Dictionary of English to GF.
 <P></P>
 <LI><B>Probabilistic disambiguation</B>, using a model trained from the Penn Treebank. Due to the common abstract syntax,
-the same model can be readily used for other languages as well, even though the adequacy of this transfer has not
+the same model can be used for other languages as well, even though the adequacy of this transfer has not
 been systematically evaluated.
 <P></P>
-<LI><B>Robust parsing</B>, which recovers from unknown words and syntax by introducing <B>metavariables</B> ("question marks")
-and returning chunk-by-chunk translations. This leads to loss of quality, but fulfills the principle that
+<LI><B>Robust parsing</B>, which recovers from unknown words and syntax
+by using chunk-by-chunk translations. This leads to loss of quality, but fulfills the principle that
 "something is better than nothing".
 </UL>
 
@@ -152,7 +165,7 @@ Given that these issues get resolved, the strengths of the GF approach can be ma
 breaking anything else.
 <P></P>
 <LI><B>Light weight</B>. The system runs on standard laptops and even on mobile phones; the size of the run-time
-system for all pairs of 8 languages is under 20MB (on the Android platform), and recompiling the whole
+system for all pairs of 11 languages is under 25MB (on the Android platform), and recompiling the whole
 system (e.g. after bug fixes or
 domain adaptation) is a matter of a few minutes, where corresponding figures for SMT systems are gigabytes of size
 and days of retraining.
@@ -280,6 +293,10 @@ Here is a description of each of the modules:
 suffixed by categories and word sense information. This consists of the module named <CODE>Dictionary</CODE>.
 <P></P>
 <LI><B>RGLCategories</B> stands for the type system of the standard RGL, the module named <CODE>Cat</CODE>.
+<P></P>
+<LI><B>Chunk</B> is the grammar defining what chunks (noun phrases, verbs,
+adverbs, etc) can be used and how they are combined, when exact
+syntactic combination fails.
 </UL>
 
 <H2>Where and why the translation grammar differs from the RGL</H2>
Binary file not shown (size before: 25 KiB; after: 31 KiB).
@@ -1,16 +1,25 @@
 From Resource Grammar to Wide Coverage Translation with GF
 Aarne Ranta et al.
-Work in progress, January 2014
+January-May 2014
 
+%!Encoding:utf8
 
+
+==Scope==
+
+Wide-coverage interlingual translator for
+Bulgarian, Chinese, Dutch, English, Finnish, French, German,
+Hindi, Italian, Spanish, Swedish.
+
+
 ==How to use it==
 
-This is a document about a wide-coverage translation system in GF. If you just want to try it before reading more,
-here are the main modes of getting started:
+If you just want to try it before reading more,
+here are the main ways to get started:
 
-1. **Run on our server.** Forthcoming.
+1. **Run on our server.** http://www.grammaticalframework.org/demos/translation.html
 
-2. **Get an Android app.** Forthcoming.
+2. **Get an Android app.** http://www.grammaticalframework.org/demos/app.html
 
 3. **Compile and run in the shell.** Get the latest GF sources (with darcs or github) and then
 - compile and install the GF compiler and library and the C runtime (``pgf-translate``).
@@ -18,13 +27,13 @@ here are the main modes of getting started:
 - compile the translator:
 ```
 cd GF/lib/src
-make Translate8.pgf
+make -j Translate11.pgf
 ```
-This will take a long time (ten minutes or more) and will probably require at least 8GB of RAM.
+This will take a long time (fifteen minutes or more) and will probably require at least 8GB of RAM.
 
 - run the translator
 ```
-pgf-translate Translate8.pgf Phr TranslateEng TranslateSwe
+pgf-translate Translate11.pgf Phr TranslateEng TranslateSwe
 ```
 with obviously the possibility to vary the source and the target language.
 
@@ -73,15 +82,15 @@ to open-text processing. This success is a result of four lines of development:
 This development is also based on the work of Peter Ljunglöf on GF parsing and Lauri Alanko on the C runtime.
 
 - **Large-scale dictionaries**, both manually built and extracted from free sources, and linked into a multilingual
-translation dictionary now covering 10k to 60k entries for eight languages. This work was started by Björn Bringert,
+translation dictionary now covering 10k to 60k entries for eleven languages. This work was started by Björn Bringert,
 who ported the Oxford Advanced Learner's Dictionary of English to GF.
 
 - **Probabilistic disambiguation**, using a model trained from the Penn Treebank. Due to the common abstract syntax,
-the same model can be readily used for other languages as well, even though the adequacy of this transfer has not
+the same model can be used for other languages as well, even though the adequacy of this transfer has not
 been systematically evaluated.
 
-- **Robust parsing**, which recovers from unknown words and syntax by introducing **metavariables** ("question marks")
-and returning chunk-by-chunk translations. This leads to loss of quality, but fulfills the principle that
+- **Robust parsing**, which recovers from unknown words and syntax
+by using chunk-by-chunk translations. This leads to loss of quality, but fulfills the principle that
 "something is better than nothing".
 
 
@@ -121,7 +130,7 @@ Given that these issues get resolved, the strengths of the GF approach can be ma
 breaking anything else.
 
 - **Light weight**. The system runs on standard laptops and even on mobile phones; the size of the run-time
-system for all pairs of 8 languages is under 20MB (on the Android platform), and recompiling the whole
+system for all pairs of 11 languages is under 25MB (on the Android platform), and recompiling the whole
 system (e.g. after bug fixes or
 domain adaptation) is a matter of a few minutes, where corresponding figures for SMT systems are gigabytes of size
 and days of retraining.
@@ -236,6 +245,9 @@ Here is a description of each of the modules:
 
 - **RGLCategories** stands for the type system of the standard RGL, the module named ``Cat``.
 
+- **Chunk** is the grammar defining what chunks (noun phrases, verbs,
+adverbs, etc) can be used and how they are combined, when exact
+syntactic combination fails.
 
 
 ==Where and why the translation grammar differs from the RGL==