diff --git a/lib/doc/translation.dot b/lib/doc/translation.dot
new file mode 100644
index 000000000..17d48ff4a
--- /dev/null
+++ b/lib/doc/translation.dot
@@ -0,0 +1,14 @@
+graph {
+ Translate ;
+ RGLSyntax ;
+ Extensions ;
+ Dictionary ;
+ Translate -- RGLSyntax [style = dashed] ;
+ Translate -- Extensions ;
+ Translate -- Dictionary ;
+ Extensions -- RGLCategories ;
+ RGLCategories ;
+ RGLSyntax -- RGLCategories ;
+ Dictionary -- RGLCategories ;
+}
+
\ No newline at end of file
diff --git a/lib/doc/translation.html b/lib/doc/translation.html
new file mode 100644
index 000000000..bbd2764a5
--- /dev/null
+++ b/lib/doc/translation.html
@@ -0,0 +1,229 @@
+
+
+
+
+From Resource Grammar to Wide Coverage Translation with GF
+
+
+

From Resource Grammar to Wide Coverage Translation with GF

+Aarne Ranta et al.
+Work in progress, January 2014
+
+ +

+GF, Grammatical Framework, was originally designed for the purpose of multilingual controlled language systems,
+which would enable high-quality translation on limited domains. The abstract syntax of GF defines the semantic
+structures relevant for the domain, and the concrete syntaxes map these structures to grammatically correct
+and idiomatic text in each target language. The reversibility of GF enables both generation and parsing,
+and thereby translation where the abstract syntax functions as an interlingua.
+

+

+It was soon realized that the definition of concrete syntax is a bottleneck of GF applications: it requires a lot
+of manual work and linguistic skill, due to the complexities of natural language syntax and morphology. Some of
+the complexities can be ignored in a small system. For instance, in a mathematical system, it may be enough to
+use verbs in the present tense only. But very much the same linguistic problems must be solved again and again
+in new applications: French verb inflection is much the same in mathematics as in a tourist phrasebook. To solve
+this problem, the GF Resource Grammar Library (RGL) was developed, to take care of "low-level" linguistic
+rules such as inflection, agreement, and word order. This enables the authors of application grammars to focus
+on the semantics (when designing the abstract syntax) and on selecting RGL functions that produce the idioms they
+want. The RGL grew into an international open-source project, where more than 50 people have contributed to
+implementing it for 29 languages at the time of writing.
+

+

+The RGL was thus originally designed to be used just as its name says: as a library
+for application grammars, which were the ones used as top-level grammars, i.e. for
+parsing, generation, and translation at run time. Little attention was paid to the usability of the RGL as a top-level
+grammar by itself. But as applications accumulated, ranging from technical text to spoken dialogue, the coverage
+of the RGL grew to approximate a "complete grammar" of many of the languages.
+And recently, there has indeed been success in using the RGL as a wide-coverage translation grammar,
+mainly due to Krasimir Angelov's efforts to scale up the size of GF applications from language fragments
+to open-text processing. This success is a result of four lines of development:
+

+ + + +

+The result of this work is indeed a large-coverage translation system, which can be used in the same way as Google
+Translate, Bing, Systran, and Apertium - to "translate anything", albeit with varying quality. At the moment of
+writing, the performance is not yet generally on a level with the best of the competition, but shows some promising
+improvements in e.g. long-distance agreement and word order. In order to make these into absolute improvements, we
+will need to fix problems that the other systems (or at least some of them) get right but where GF translation
+often fails:
+

+ + + +

+Given that these issues get resolved, the strengths of the GF approach can be made more visible:
+

+ + + +

+The recipes for improvement are, as always, more work and new ideas. Each of the four weaknesses mentioned
+above can be alleviated by more work - in particular, lexical coverage by more work on the lexicon, since
+automatic extraction methods cannot really be trusted. As for disambiguation, new ideas about probabilistic
+tree models are being discussed. As for speed, new ideas on parsing (in particular, the integration of disambiguation
+with parsing) would help, but the complexity of grammatical structures also plays a major role. As for idiomaticity,
+more work is being done in introducing constructions (non-compositional syntax rules, generalizing the notion of
+multiword expressions, in particular, phrases in SMT), but new ideas are also being discussed on how to
+extract such constructions from e.g. phrase tables.
+

+

+In the following, we will focus on describing the role of grammar in the GF translation system - in particular, how
+the RGL can be modified to become usable as a top-level grammar for translating open text.
+As the RGL was not meant to be used for parsing open text, but rather for the controlled language generation task,
+it has serious restrictions:
+

+ + + +

+Despite these problems, the RGL has proven to be a possible starting point for large-scale translation. It has a couple
+of advantages speaking for this:
+

+ + + +

+Of course, we are still left with the other
+option of addressing translation with an application grammar, something
+similar to the ResourceDemo with flatter and more semantic structures. But this would in turn require
+the replication of many rules, even though it would be to a large extent doable by using a functor, that is,
+by just one set of rules covering all languages.
+

+

+Thus the path chosen is a mixture of RGL and application grammar. In brief, the translation grammar consists of
+

+ + + +

+The following picture shows the principal module structure of the translation grammar.
+

+

+ +

+

+Notice: the current module structure and naming do not yet quite correspond to the description here.
+Thus currently the top module is "Parse" and contains both "Translate" and "Extensions".
+The Dictionary module is "Dict", and coincides in the case of English with the monolingual
+morphological dictionary. However, the more sense distinctions are introduced for the needs
+of translation, the less adequate it becomes to keep these two together.
+

+

+Here is a description of each of the modules:
+

+
+
+
+
+
+
diff --git a/lib/doc/translation.png b/lib/doc/translation.png
new file mode 100644
index 000000000..1dcd9f5e9
Binary files /dev/null and b/lib/doc/translation.png differ
diff --git a/lib/doc/translation.txt b/lib/doc/translation.txt
index a347fbb30..f8e626537 100644
--- a/lib/doc/translation.txt
+++ b/lib/doc/translation.txt
@@ -1,5 +1,6 @@
 From Resource Grammar to Wide Coverage Translation with GF
-Aarne Ranta
+Aarne Ranta et al.
+Work in progress, January 2014
 GF, Grammatical Framework, was originally designed for the purpose of **multilingual controlled language systems**,
@@ -73,6 +74,12 @@ Given that these issues get resolved, the strengths of the GF approach can be ma
 and linguistic information.
 - **Adaptability**, i.e. the ease of fixing bugs, adapting the system to special domains, and personalizing it.
+ This can be done with great precision, e.g. fixing a bug without breaking anything else.
+
+- **Light weight**. The system runs on standard laptops and even on mobile phones; the size of the run-time
+ system for all pairs of 8 languages is under 20MB, and recompiling the whole system (e.g. after bug fixes or
+ domain adaptation) is a matter of a few minutes, whereas the corresponding figures for SMT systems are gigabytes
+ in size and days of retraining.
 - **Multilinguality**, in the sense that once the parsing of the input is settled, the output can be readily
 rendered into all other languages,
@@ -153,4 +160,37 @@ Thus the path chosen is a mixture of RGL and application grammar. In brief, the
+The following picture shows the principal module structure of the translation grammar.
+
+[translation.png]
+
+
+//Notice: the current module structure and naming do not yet quite correspond to the description here.//
+//Thus currently the top module is "Parse" and contains both "Translate" and "Extensions".//
+//The Dictionary module is "Dict", and coincides in the case of English with the monolingual//
+//morphological dictionary. However, the more sense distinctions are introduced for the needs//
+//of translation, the less adequate it becomes to keep these two together.//
+
+Here is a description of each of the modules:
+
+- **Translate** is the top module, which combines the RGL syntax with syntax extensions and a dictionary.
+ The RGL syntax is not inherited in its entirety, which is indicated by a dashed line. The overridden abstract
+ syntax functions (common to all languages) are replaced by functions in the Extensions module, whereas the
+ overridden concrete syntax definitions (specific to each language) are defined in this Translate module.
+ This consists of the module named ``Translate``.
+
+- **RGLSyntax** stands for the standard RGL module for syntax, excluding the RGL test lexicon and
+ the language-specific extensions of it. This consists of the standard module named ``Grammar`` and
+ the emerging module named ``Construction``.
+
+- **Extensions** stands for the syntax extensions added to the RGL syntax. This consists of the module
+ named ``Extensions``.
+
+- **Dictionary** is a large-scale multilingual dictionary. Its abstract syntax uses as identifiers English words
+ suffixed by categories and word sense information. This consists of the module named ``Dictionary``.
+
+- **RGLCategories** stands for the type system of the standard RGL, the module named ``Cat``.
+
+
+
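
Review note: the module structure added in translation.dot can be sanity-checked with a few lines of script. The following is a minimal sketch in Python, not part of the commit. It assumes that each undirected edge in the graph can be read as "the module on the left uses the module on the right"; the names `EDGES`, `build_graph`, `reachable`, and `load_order` are hypothetical helpers introduced only for this illustration.

```python
from collections import defaultdict

# Edges copied from translation.dot. The direction (left uses right) is an
# assumption, since the DOT graph itself is undirected. The boolean marks the
# dashed edge, i.e. the RGL syntax that Translate inherits only in part.
EDGES = [
    ("Translate", "RGLSyntax", True),
    ("Translate", "Extensions", False),
    ("Translate", "Dictionary", False),
    ("Extensions", "RGLCategories", False),
    ("RGLSyntax", "RGLCategories", False),
    ("Dictionary", "RGLCategories", False),
]


def build_graph(edges):
    """Map each module to the list of modules it uses directly."""
    graph = defaultdict(list)
    for user, used, _dashed in edges:
        graph[user].append(used)
    return graph


def reachable(graph, start):
    """Return every module reachable from start, excluding start itself."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def load_order(graph, top):
    """Depth-first post-order: a module is listed after everything it uses."""
    order, done = [], set()

    def visit(node):
        if node in done:
            return
        done.add(node)
        for child in graph.get(node, []):
            visit(child)
        order.append(node)

    visit(top)
    return order


GRAPH = build_graph(EDGES)
PARTIAL = [(user, used) for user, used, dashed in EDGES if dashed]

if __name__ == "__main__":
    print(sorted(reachable(GRAPH, "Translate")))
    print(load_order(GRAPH, "Translate"))
    print(PARTIAL)
```

Under the stated reading of the edges, every module sits on top of RGLCategories (the ``Cat`` type system), and Translate is the only module that nothing else uses, which matches the description of it as the top module.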