~runtime/python/examples/README

(c) Prasanth Kolachina, 22 April 2015

======================
TRANSLATION PIPELINE
======================

The module translation_pipeline.py is a Python replica of the
translation pipeline used in the Wide-coverage Translation demo.
The pipeline supports
1. simultaneous batch translation from one language into multiple languages
2. K-best translations
3. translation of both text files and sgm files.

The module defines functions for the standard lexer used in the pipeline
and the callbacks used in robust parsing to partially deal with unknown words,
proper nouns, etc.

Basic example usage:
> python translation_pipeline.py -g TranslateEngFin.pgf -s Eng -t Fin -i <input-file> -e <exp-directory>
> python translation_pipeline.py -g TranslateEngFin.pgf -s Eng -t Fin -K 20 -i <input-file> -e <exp-directory>
> python translation_pipeline.py -g Translate11.pgf -s Eng -t Fin Swe Ger -i <input-file> -e <exp-directory>
> python translation_pipeline.py -g TranslateEngFin.pgf -s Eng -t Fin -f sgm -i <sgm-input-file> -e <exp-directory>

The full list and description of options accepted by the translation_pipeline
module can be seen using the -h option.

> python translation_pipeline.py -h
---
usage: translation_pipeline.py [-h] -g PGFFILE [-s SRCLANG]
                               [-t [TGTLANGS [TGTLANGS ...]]] [-i INPUT]
                               [-e EXP_DIRECTORY] [-f {txt,sgm}]
                               [-p PROPSFILE] [-K BESTK]

Run the GF translation pipeline on standard test-sets

optional arguments:
  -h, --help            show this help message and exit
  -g PGFFILE, --pgf PGFFILE
                        PGF grammar file to run the pipeline
  -s SRCLANG, --source SRCLANG
                        Source language of input sentences
  -t [TGTLANGS [TGTLANGS ...]], --target [TGTLANGS [TGTLANGS ...]]
                        Target languages to linearize (default is all other
                        languages)
  -i INPUT, --input INPUT
                        input file (default will accept STDIN)
  -e EXP_DIRECTORY, --exp EXP_DIRECTORY
                        experiement directory to write translation files
  -f {txt,sgm}, --format {txt,sgm}
                        input file format (output files will be written in the
                        same format)
  -p PROPSFILE, --props PROPSFILE
                        properties file for the translation pipeline (specify
                        the above arguments in a file)
  -K BESTK              K value for K-best translation

======================
PREREQUISITES
======================
In order to use the examples in this directory, the following components
are required:
1. GF C runtime (~runtime/c/)
2. Python bindings to the C runtime (~runtime/python/)
3. The path to the Python bindings added to the PYTHONPATH environment variable
   (Note: by default, setuptools installs the bindings to a location
   available to everyone, so this step is only required if you have
   done a custom installation of the Python bindings and you know what
   you are doing)
   > export PYTHONPATH="$GF/src/runtime/python/build/lib.*:$PYTHONPATH"
   (Replace lib.* with the actual name of the build directory: the shell
   does not expand wildcards inside double quotes.)

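Whether the bindings are visible to the interpreter can be checked before
running any of the examples. The following is only a minimal sketch: it
assumes Python 3 and that the bindings are importable under the module
name pgf. The helper name bindings_available is made up for illustration.

```python
import importlib.util


def bindings_available(module_name="pgf"):
    """Return True if the named top-level module can be found on sys.path."""
    # Assumption: the GF Python bindings are installed as a module named "pgf".
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if bindings_available():
        print("pgf bindings found")
    else:
        print("pgf bindings not found; check the installation or PYTHONPATH")
```

If the check fails after a custom installation, the PYTHONPATH setting in
step 3 is the first thing to look at.
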
======================
WEB GF PARSING
======================
NEW!!!
Currently, we carry out parsing of large web texts using
GF grammars. The same functions defined in gf_utils.py are used, but
we make them faster through parallelization. The multiprocessing module in
Python allows for trivial parallelization, where each batch of sentences
is parsed by a different worker in the pool.

We noticed one thing during these experiments: the GF parser can
take an unusually long time for long and ambiguous sentences. Therefore,
to avoid resource starvation, we use a `timeout' setting that raises a
PgfParseError if parsing a single sentence takes more than 5 minutes.
With this simple trick, we manage to parse large corpora (Europarl
texts) in both English and Swedish. Please contact
prasanth.kolachina@cse.gu.se
if you have any questions about this.

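The batching scheme described above can be sketched as follows. This is
not the code used in the experiments, only an illustration under stated
assumptions: parse_sentence stands for any function that parses one
sentence and raises an exception on failure or timeout, and the names
make_batches, parse_batch and parallel_parse are hypothetical. The sketch
uses the thread-based pool from multiprocessing.pool so that it runs with
any callable; multiprocessing.Pool offers the same map interface with
worker processes, but requires picklable functions.

```python
from multiprocessing.pool import ThreadPool


def make_batches(sentences, batch_size):
    """Split a list of sentences into consecutive batches."""
    return [sentences[i:i + batch_size]
            for i in range(0, len(sentences), batch_size)]


def parse_batch(batch, parse_sentence):
    """Parse every sentence of a batch; a failed parse is recorded as None."""
    results = []
    for sentence in batch:
        try:
            results.append(parse_sentence(sentence))
        except Exception:  # stand-in for the parse error raised on timeout
            results.append(None)
    return results


def parallel_parse(sentences, parse_sentence, batch_size=100, workers=4):
    """Parse batches in a worker pool and return results in input order."""
    batches = make_batches(sentences, batch_size)
    with ThreadPool(workers) as pool:
        parsed = pool.map(lambda b: parse_batch(b, parse_sentence), batches)
    # Flatten the per-batch result lists back into one list.
    return [result for batch in parsed for result in batch]
```

Recording a failed sentence as None keeps the output aligned with the
input, so one slow or unparsable sentence does not stop the whole batch.
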
======================
GENERIC GF UTILITIES
======================

The module gf_utils.py contains functions to carry out four
basic tasks:
1. 1-best parsing (parse)
2. K-best parsing (kparse)
3. 1-best linearization (linearize)
4. K-best linearization (klinearize)

> usage: gf_utils.py [-h] {parse,kparse,linearize,klinearize} ...

Detailed arguments for each function can be found using the "-h" option.
An exhaustive list of options for all the functions is given below. Options
marked with (*) are required, the others are optional.

(*) -g/--pgf <pgf-file>      PGF grammar file.
(*) -s/--src-lang <lang>     Source language name, i.e. the code used in GF
                             (e.g. TranslateEng, TranslateFin). For parsing,
                             the option specifies the language of the input
                             sentences.
(*) -t/--tgt-lang <lang>     Target language name. For linearization, the
                             option specifies the language into which the
                             trees are linearized.
(*) -K <int>                 Prespecified K value for K-best parsing and
                             linearization.
    -i/--input <filename>    Input file name, containing either raw text
                             sentences (for parsing) or abstract trees (for
                             linearization).
    -o/--output <filename>   Output file name.
    -p/--start-sym <sym>     Start symbol used for parsing.

Basic example usage:
> python gf_utils.py parse -g TranslateEng.pgf -s TranslateEng -i <input-file> -o <parse-output-file> [-p Phr]
> python gf_utils.py kparse -g TranslateEng.pgf -s TranslateEng -K 20 -i <input-file> -o <kparse-output-file> [-p Phr]
> python gf_utils.py linearize -g TranslateFin.pgf -t TranslateFin -i <parse-output-file> -o <output-file>
> python gf_utils.py klinearize -g TranslateFin.pgf -t TranslateFin -i <kparse-output-file> -o <output-file>

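For orientation, the parse and linearize tasks correspond roughly to the
following calls on a grammar object from the Python bindings. This is a
sketch and not the code of gf_utils.py: it assumes that a grammar loaded
with pgf.readPGF exposes a languages dictionary, that parse yields
(probability, tree) pairs with the first result taken as the best one,
and that linearize turns a tree into a string. The helper names
best_parse and linearize_tree are hypothetical.

```python
def best_parse(grammar, src_lang, sentence):
    """Return the first (probability, tree) pair for the sentence."""
    concrete = grammar.languages[src_lang]
    # Assumption: parse() yields (probability, tree) pairs, best first.
    return next(iter(concrete.parse(sentence)))


def linearize_tree(grammar, tgt_lang, tree):
    """Linearize an abstract syntax tree into the target language."""
    concrete = grammar.languages[tgt_lang]
    return concrete.linearize(tree)


# Intended use with the real bindings (not executed here):
#   import pgf
#   grammar = pgf.readPGF("TranslateEngFin.pgf")
#   prob, tree = best_parse(grammar, "TranslateEng", "this is a test")
#   print(linearize_tree(grammar, "TranslateFin", tree))
```

Passing the grammar object in as an argument keeps the two helpers usable
with any object that offers the same languages/parse/linearize interface.
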
======================
FILE I/O FORMATS
======================

1. One-sentence-per-line
   The input sentences to the parser/kparser are written one sentence
   per line. This is also the standard format used in the translation
   pipeline.

2. SGM format
   The translation pipeline accepts SGM files as input and writes SGM
   files as output. The format is used by the automatic evaluation
   metrics that measure the quality of MT systems, primarily in the
   NIST evaluation and the WMT Shared Task evaluations.

3. Parser output format
   For each sentence, the parser writes a single line with four columns
   separated by the <tab> character: the sentence index, the time taken
   by the parser, the tree probability value and the abstract syntax tree.

4. K-best parser output format
   The k-best parser uses a representation that has come to be called
   CJ (Charniak-Johnson) format in the parsing community.
   a. The output consists of one block of parses for each sentence. Two
      blocks are separated by an empty line.
   b. The first line in the block contains two numbers: the number
      of parses in that block, and an identifier for that sentence.
   c. Each subsequent pair of lines contains the log probability of the
      abstract tree in one line followed by the actual parse tree in the
      next line.

5. K-best translations output format
   The k-best linearizer and k-best translation use the same format as
   Moses and other SMT toolkits to write K-best translation lists.
   a. The output consists of one block of translations for each sentence.
   b. Each block consists of several translations, one per line.
   c. Each translation (or line) consists of four columns separated by
      the string '|||'. The first column contains a sentence identifier,
      the second column is the actual translation, the third column is
      the word-alignment information between the input sentence and the
      translation, and the fourth column holds the scores from the
      statistical models used in parsing.

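The formats above are simple enough to read back with a few lines of
Python. The readers below are a sketch based only on the descriptions in
this section, not on the code that writes the files; all function names
are made up for illustration. Two assumptions are made: SGM segments are
written as <seg id="..."> elements, as in the NIST/WMT test sets, and the
log probability lines of the k-best parser output are plain numbers.

```python
import re


def read_parse_line(line):
    """Split one line of parser output into its four tab-separated columns.

    Returns (index, time, probability, tree) as strings.
    """
    return tuple(line.rstrip("\n").split("\t", 3))


def read_kbest_parses(text):
    """Read CJ-style blocks: a header line, then (log prob, tree) line pairs."""
    blocks = []
    for chunk in re.split(r"\n\s*\n", text.strip()):
        lines = chunk.splitlines()
        if not lines:
            continue
        # Header: number of parses, then the sentence identifier.
        count, sent_id = lines[0].split()[:2]
        pairs = [(float(lines[i]), lines[i + 1])
                 for i in range(1, len(lines) - 1, 2)]
        blocks.append((sent_id, int(count), pairs))
    return blocks


def read_kbest_translation(line):
    """Split one '|||'-separated line of a K-best translation list."""
    return [column.strip() for column in line.split("|||")]


def read_sgm_segments(text):
    """Extract (id, text) pairs from the <seg id="..."> elements of an SGM file."""
    pattern = re.compile(r'<seg\s+id="([^"]*)"\s*>(.*?)</seg>', re.S | re.I)
    return [(seg_id, body.strip()) for seg_id, body in pattern.findall(text)]
```

The parser output columns are returned as strings on purpose, since the
exact number formatting of the time and probability columns is not
specified here; convert them as needed.
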