From 57006b6296c271bff657be48962fafc5dd207c98 Mon Sep 17 00:00:00 2001
From: "prasanth.kolachina"
Date: Wed, 22 Apr 2015 13:14:26 +0000
Subject: [PATCH] README for Python translation pipeline

---
 src/runtime/python/examples/README | 167 +++++++++++++++++++++++++++++
 1 file changed, 167 insertions(+)
 create mode 100644 src/runtime/python/examples/README

diff --git a/src/runtime/python/examples/README b/src/runtime/python/examples/README
new file mode 100644
index 000000000..b6791a368
--- /dev/null
+++ b/src/runtime/python/examples/README
@@ -0,0 +1,167 @@
+~runtime/python/examples/README
+
+(c) Prasanth Kolachina, 22 April 2015
+
+======================
+TRANSLATION PIPELINE
+======================
+
+The module translation_pipeline.py is a Python replica of the
+translation pipeline used in the Wide-coverage Translation demo.
+The pipeline allows for
+ 1. simultaneous batch translation from one language into multiple languages
+ 2. K-best translations
+ 3. translation of both text files and sgm files.
+
+The module defines functions for the standard lexer used in the pipeline
+and the callbacks used in robust parsing to partially deal with unknown
+words, proper nouns etc.
+
+Basic example usage:
+> python translation_pipeline.py -g TranslateEngFin.pgf -s Eng -t Fin -i -e
+> python translation_pipeline.py -g TranslateEngFin.pgf -s Eng -t Fin -K 20 -i -e
+> python translation_pipeline.py -g Translate11.pgf -s Eng -t Fin Swe Ger -i -e
+> python translation_pipeline.py -g TranslateEngFin.pgf -s Eng -t Fin -f sgm -i -e
+
+The full list and description of options accepted by the translation_pipeline
+module can be seen using the -h option.
+
+> python translation_pipeline.py -h
+———
+usage: translation_pipeline.py [-h] -g PGFFILE [-s SRCLANG]
+                               [-t [TGTLANGS [TGTLANGS ...]]] [-i INPUT]
+                               [-e EXP_DIRECTORY] [-f {txt,sgm}]
+                               [-p PROPSFILE] [-K BESTK]
+
+Run the GF translation pipeline on standard test-sets
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -g PGFFILE, --pgf PGFFILE
+                        PGF grammar file to run the pipeline
+  -s SRCLANG, --source SRCLANG
+                        Source language of input sentences
+  -t [TGTLANGS [TGTLANGS ...]], --target [TGTLANGS [TGTLANGS ...]]
+                        Target languages to linearize (default is all other
+                        languages)
+  -i INPUT, --input INPUT
+                        input file (default will accept STDIN)
+  -e EXP_DIRECTORY, --exp EXP_DIRECTORY
+                        experiement directory to write translation files
+  -f {txt,sgm}, --format {txt,sgm}
+                        input file format (output files will be written in the
+                        same format)
+  -p PROPSFILE, --props PROPSFILE
+                        properties file for the translation pipeline (specify
+                        the above arguments in a file)
+  -K BESTK              K value for K-best translation
+
+
+======================
+PREREQUISITES
+======================
+In order to use the examples in this directory, the following components
+are required:
+ 1. GF C runtime (~runtime/c/)
+ 2. Python bindings to the C runtime (~runtime/python/)
+ 3. The path to the Python library, added to the PYTHONPATH environment variable
+    (Note: by default, setuptools installs the bindings to a location
+    available to everyone, so this step is only required if you have
+    done a custom installation of the Python bindings and you know what
+    you are doing)
+> export PYTHONPATH="$GF/src/runtime/python/build/lib.*:$PYTHONPATH"
+
+
+======================
+WEB GF PARSING
+======================
+NEW!!!
+In its current state, we carry out parsing of large web texts using
+GF grammars. The same functions described in gf_utils.py are used, but
+we make it faster through parallelization.
+The multiprocessing module in
+Python allows for trivial parallelization, where each batch of sentences
+is parsed by a different worker in the pool.
+
+We noticed one thing during these experiments: the GF parser can
+take an unusually long time for long and ambiguous sentences. Therefore,
+to avoid resource starvation, we use a `timeout' setting to raise a
+PgfParseError if parsing a single sentence takes more than 5 minutes.
+With this simple trick, we manage to parse large corpora (Europarl
+texts) in both English and Swedish. Please contact
+prasanth.kolachina@cse.gu.se
+if you have any questions about this.
+
+
+======================
+GENERIC GF UTILITIES
+======================
+
+The module gf_utils.py contains functions to carry out four
+basic tasks:
+1. 1-best parsing (parse)
+2. K-best parsing (kparse)
+3. 1-best linearization (linearize)
+4. K-best linearization (klinearize)
+
+> usage: gf_utils.py [-h] {parse,kparse,linearize,klinearize} ...
+
+Detailed arguments for each function can be found using the "-h" option.
+An exhaustive list of options for all the functions is given below. Options
+marked with (*) are required; the others are optional.
+
+(*) -g/--pgf PGF Grammar file
+(*) -s/--src-lang Source language name, i.e. the code used in GF (e.g. TranslateEng, TranslateFin). For parsing, the option specifies the language of the input sentences.
+(*) -t/--tgt-lang Target language name. For linearization, the option specifies the language into which the trees are linearized.
+(*) -K Prespecified K value for K-best parsing and linearization.
+-i/--input Input file name, either raw text sentences or abstract trees for linearization.
+-o/--output Output file name.
+-p/--start-sym Start symbol used for parsing
+
+Basic example usage:
+> python gf_utils.py parse -g TranslateEng.pgf -s TranslateEng -i -o [-p Phr]
+> python gf_utils.py kparse -g TranslateEng.pgf -s TranslateEng -K 20 -i -o [-p Phr]
+> python gf_utils.py linearize -g TranslateFin.pgf -t TranslateFin -i -o
+> python gf_utils.py klinearize -g TranslateFin.pgf -t TranslateFin -i -o
+
+======================
+File I/O formats
+======================
+
+1. One-sentence-per-line
+The input sentences to the parser/kparser are written one sentence
+per line. This is also the standard format used in the translation
+pipeline.
+
+2. SGM format
+The translation pipeline accepts SGM format files as both input
+and output. The format is specifically used by automatic
+evaluation metrics that measure the quality of MT systems. It
+is primarily used by the NIST evaluation and the WMT Shared
+Task evaluations.
+
+3. Parser output format
+The parser writes four columns, separated by the -character,
+for each sentence on a single line: the sentence index, the time taken
+by the parser, the tree probability value and the abstract syntax tree.
+
+4. K-best parser output format
+The k-best parser uses a representation that has come to be called
+CJ (Charniak-Johnson) format in the parsing community.
+ a. The output consists of parsed blocks for each sentence. Two blocks
+    are separated by an empty line.
+ b. The first line in the block contains two numbers: the number
+    of parses in that block, and an identifier for that sentence.
+ c. Each subsequent pair of lines contains the log probability of the
+    abstract tree in one line followed by the actual parse tree in the
+    next line.
+
+5. K-best translations output format
+The k-best linearizer and k-best translation use the same format as
+Moses and other SMT toolkits to write K-best translation lists.
+ a. The output consists of translation blocks for each sentence.
+ b.
+    Each block consists of several translations, one per line.
+ c. Each translation (or line) consists of four columns separated by
+    the '|||' string. The first column contains a sentence identifier,
+    the second column is the actual translation, followed by
+    word-alignment information between the input sentence and the
+    translation and the scores from statistical models used in parsing.
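
The K-best translations format described in item 5 can be read back with
a few lines of Python. The following is only a minimal sketch and is not
part of translation_pipeline.py or gf_utils.py: the function name
read_kbest and the sample lines are hypothetical, and it assumes exactly
four '|||'-separated columns per line, in the order listed in item 5.

```python
from collections import OrderedDict


def read_kbest(lines):
    """Group '|||'-separated K-best lines by sentence identifier.

    Assumption: every non-empty line has four columns in the order
    identifier, translation, word alignment, scores.
    """
    blocks = OrderedDict()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between blocks
        fields = [field.strip() for field in line.split("|||")]
        if len(fields) != 4:
            raise ValueError("expected 4 columns, got %d: %r"
                             % (len(fields), line))
        sent_id, translation, alignment, scores = fields
        # Columns are kept as strings; their inner syntax is not interpreted.
        blocks.setdefault(sent_id, []).append((translation, alignment, scores))
    return blocks


# Hypothetical sample lines, invented for illustration only.
sample = [
    "0 ||| se on talo ||| 0-0 1-1 2-2 ||| -12.5",
    "0 ||| talo on se ||| 0-2 1-1 2-0 ||| -13.1",
    "1 ||| kissa nukkuu ||| 0-0 1-1 ||| -8.4",
]
kbest = read_kbest(sample)
# First entry of each block, assuming the list is written best-first.
best = {sent_id: entries[0][0] for sent_id, entries in kbest.items()}
```

Grouping by the identifier in the first column rebuilds the per-sentence
blocks without relying on blank lines between them.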