<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.sf.net">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>European Resource Grammar Summer School</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<P ALIGN="center"><CENTER><H1>European Resource Grammar Summer School</H1>
<FONT SIZE="4">
<I>Gothenburg, August 2009</I><BR>
Aarne Ranta (aarne at chalmers.se)
</FONT></CENTER>
<P>
<IMG ALIGN="middle" SRC="eu-langs.png" BORDER="0" ALT="">
</P>
<H3>Executive summary</H3>
<P>
We plan to organize a summer school with the goal of implementing the GF
resource grammar library for 15 new languages, so that the library will
cover all the 23 official EU languages of year 2009.
As a test application of the grammars, an extension of
the WebALT mathematical exercise translator will also be built for each
language.
</P>
<P>
Two students per language will be selected for the summer school, after a phase of
self-study and on the basis of assignments that consist of parts of the resource
grammars. Travel and accommodation will be paid for these participants.
If funding is arranged, the call for participation in the summer school will
be announced in February 2009, and the summer school itself will take place
in August 2009, in Gothenburg.
</P>
<H2>Introduction</H2>
<P>
Since 2007, the EU-27 has had 23 official languages, listed in the diagram at the top of this
document.
There is a growing need for translation between
these languages. The traditional language-to-language method requires 23*22 = 506
translators (humans or computer programs), one for each ordered pair of source
and target language, to cover all possible translation needs.
</P>
<P>
An alternative to language-to-language translation is the use of an <B>interlingua</B>:
a language-independent representation such that all translation problems can
be reduced to translating to and from the interlingua. With 23 languages,
only 2*23 = 46 translators are needed.
</P>
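<P>
The two counts above can be checked with a few lines of Python. This is only an
illustrative sketch: the function names are hypothetical and not part of any GF
tool, and each translator is assumed to work in one direction only.
</P>
<PRE>
```python
def pairwise_translators(n):
    """One translator per ordered (source, target) pair of distinct languages."""
    return n * (n - 1)


def interlingua_translators(n):
    """One translator into and one out of the interlingua per language."""
    return 2 * n


languages = 23
print(pairwise_translators(languages))     # 506
print(interlingua_translators(languages))  # 46
```
</PRE>
<P>
The pairwise count grows quadratically with the number of languages, whereas the
interlingua count grows linearly.
</P>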
<P>
Interlingua sounds too good to be true. In a sense, it is. All attempts to
create an interlingua that would solve all translation problems have failed.
However, interlinguas for restricted applications have shown more
success. For instance, mathematical texts and weather reports can be translated
by using interlinguas tailor-made for the domains of mathematics and weather reports,
respectively.
</P>
<P>
Two things are required of an interlingua:
</P>
<UL>
<LI>semantic accuracy: correspondence to what you want to say in the application
<LI>language-independence: abstraction from individual languages
</UL>
<P>
Thus, for instance, an interlingua for mathematical texts may be based on
mathematical logic, which at the same time gives semantic accuracy and
language independence. In other domains, something other than mathematical
logic may be needed; the <B>ontologies</B> defined within the semantic
web technology are often good starting points for interlinguas.
</P>
<H2>GF: a framework for multilingual grammars</H2>
<P>
The interlingua is just one part of a translation system. We also need
the mappings between the interlingua and the involved languages. As the
number of languages increases, this part grows while the interlingua remains
constant.
</P>
<P>
GF (Grammatical Framework,
<A HREF="http://gf.digitalgrammars.com"><CODE>gf.digitalgrammars.com</CODE></A>)
is a programming language designed to support interlingua-based translation.
A "program" in GF is a <B>multilingual grammar</B>, which consists of an
<B>abstract syntax</B> and a set of <B>concrete syntaxes</B>. A concrete
syntax is a mapping from the abstract syntax to a particular language.
These mappings are <B>reversible</B>, which means that they can be used for
translating in both directions. This means that creating an interlingua-based
translator for 23 languages just requires 1 + 23 = 24 grammar modules (the abstract
syntax and the concrete syntaxes).
</P>
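<P>
The division of labour between abstract and concrete syntax can be sketched in
Python. This is not GF code, and it does not show how GF works internally: the
tree encoding, the names <CODE>ENG</CODE> and <CODE>FRE</CODE>, and all function
names are assumptions made for illustration. In particular, parsing is done here
by brute-force search over a finite set of trees, which is feasible only for a
toy grammar.
</P>
<PRE>
```python
# Abstract syntax: trees are nested tuples, e.g. ("Even", ("Two",)).
# Each concrete syntax maps every abstract function to a string template.
# All names here (Even, Odd, Two, Three, ENG, FRE) are hypothetical.
ENG = {"Even": "{} is even", "Odd": "{} is odd", "Two": "two", "Three": "three"}
FRE = {"Even": "{} est pair", "Odd": "{} est impair", "Two": "deux", "Three": "trois"}

PROPS = ["Even", "Odd"]
NUMBERS = ["Two", "Three"]


def linearize(tree, concrete):
    """Map an abstract tree to a string in one concrete syntax."""
    head, *args = tree
    return concrete[head].format(*(linearize(a, concrete) for a in args))


def all_trees():
    """Enumerate every tree of this tiny abstract syntax."""
    return [(p, (n,)) for p in PROPS for n in NUMBERS]


def parse(sentence, concrete):
    """Invert linearize by exhaustive search over all trees."""
    return [t for t in all_trees() if linearize(t, concrete) == sentence]


def translate(sentence, source, target):
    """Translate via the abstract syntax, which acts as the interlingua."""
    return [linearize(t, target) for t in parse(sentence, source)]


print(translate("two is even", ENG, FRE))
print(translate("trois est impair", FRE, ENG))
```
</PRE>
<P>
The sketch has 1 + 2 = 3 components: one abstract syntax and two concrete
syntaxes. Adding a third language would mean adding one more dictionary, with
no change to the tree representation or to the existing languages.
</P>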
<P>
The diagram at the top of this document shows a system covering the 23 EU languages.
Languages marked in
red are of particular interest for the summer school, since they are those
on which the effort will be concentrated.
</P>
<H2>The GF resource grammar library</H2>
<P>
The GF resource grammar library is a set of grammars used as libraries when
building interlingua-based translation systems. The library currently covers
the 9 languages coloured in green in the diagram above; in addition,
Catalan, Norwegian, and Russian are covered, and there is ongoing work on
Arabic, Hindi/Urdu, and Thai.
</P>
<P>
The purpose of the resource grammar library is to define the "low-level" structure
of a language: inflection, word order, agreement. This structure belongs to what
linguists call morphology and syntax. It can be very complex and requires
a lot of knowledge. Yet, when translating from one language to another, knowing
morphology and syntax is but a part of what is needed. The translator (whether human
or machine) must understand the meaning of what is translated, and must also know
the idiomatic way to express the meaning in the target language. This knowledge
can be very domain-dependent and requires in general an expert in the field to
reach high quality: a mathematician in the field of mathematics, a meteorologist
in the field of weather reports, etc.
</P>
<P>
The problem is to find a person who is an expert both in the domain of translation
and in the low-level linguistic details. It is the rarity of this combination
that has made it difficult to build interlingua-based translation systems.
The GF resource grammar library has the mission of helping in this task. It encapsulates
the low-level linguistics in program modules accessed through easy-to-use interfaces.
Experts on different domains can build translation systems by using the library,
without knowing low-level linguistics. The idea is much the same as when a
programmer builds a graphical user interface (GUI) from high-level elements such as
buttons and menus, without having to care about pixels or geometrical forms.
</P>
<H3>Applications of the library</H3>
<P>
In addition to translation, the library is also useful in <B>localization</B>,
that is, porting a piece of software to new languages.
The GF resource grammar library has been used in three major projects that need
interlingua-based translation or localization of systems to new languages:
</P>
<UL>
<LI>in KeY,
<A HREF="http://www.key-project.org/"><CODE>http://www.key-project.org/</CODE></A>,
for writing formal and informal software specifications (3 languages)
<LI>in WebALT,
<A HREF="http://webalt.math.helsinki.fi/content/index_eng.html"><CODE>http://webalt.math.helsinki.fi/content/index_eng.html</CODE></A>,
for translating mathematical exercises to 7 languages
<LI>in TALK,
<A HREF="http://www.talk-project.org"><CODE>http://www.talk-project.org</CODE></A>,
where the library was used for localizing spoken dialogue systems to six languages
</UL>
<P>
The library is also a generic linguistic resource, which can be used for tasks
such as language teaching and information retrieval. The liberal license (GPL)
makes it usable for anyone and for any task. GF also has tools supporting the
use of grammars in programs written in other programming languages: C, C++, Haskell,
Java, JavaScript, and Prolog. In connection with the TALK project, support has also been
developed for translating GF grammars to language models used in speech
recognition.
</P>
<H3>The structure of the library</H3>
<P>
The library has the following main parts:
</P>
<UL>
<LI><B>Inflection paradigms</B>, covering the inflection of each language.
<LI><B>Common Syntax API</B>, covering a large set of syntax rules that
can be implemented for all languages involved.
<LI><B>Common Test Lexicon</B>, giving ca. 500 common words that can be used for
testing the library.
<LI><B>Language-Specific Syntax Extensions</B>, covering syntax rules that are
not implementable for all languages.
<LI><B>Language-Specific Lexica</B>, word lists for each language, with
accurate morphological and syntactic information.
</UL>
<P>
The goal of the summer school is to implement, for each language, at least
the first three components. The last two are more open-ended in character.
</P>
<H2>The summer school</H2>
<P>
The goal of the summer school is to extend the GF resource grammar library
to cover all 23 EU languages, which means we need 15 new languages.
</P>
<P>
The amount of work and skill required for one language lies between a Master's thesis and a PhD thesis.
The Russian implementation was made by Janna Khegai as a part of her
PhD thesis; the thesis contains other material, too.
The Arabic implementation was started by Ali El Dada in his Master's thesis,
but the thesis does not cover the whole API. The realistic amount of work is
somewhere around 8 person months, but this is very much language-dependent.
Dutch, for instance, can profit from previous implementations of German and
Scandinavian languages, and will probably require less work.
Latvian and Lithuanian are the first languages of the Baltic family to be implemented and
will probably require much more work.
</P>
<P>
In any case, the proposed allocation of effort is two participants per
language. They will have 6 months to work at home, followed
by 2 weeks of summer school. Who are these participants?
</P>
<H3>Selecting participants</H3>
<P>
After the call has been published, those interested in participating in
the project are expected to learn GF by self-study from the
<A HREF="http://www.cs.chalmers.se/Cs/Research/Language-technology/GF/doc/gf-tutorial.html">tutorial</A>.
This should take a couple of weeks.
</P>
<P>
After, and perhaps in parallel with,
working through the tutorial, the participants should go on to
implement selected parts of the resource grammar, following the advice from
the
<A HREF="http://www.cs.chalmers.se/Cs/Research/Language-technology/GF/doc/Resource-HOWTO.html">Resource-HOWTO document</A>.
Exactly which parts are selected will be announced later.
This work will take another couple of weeks.
</P>
<P>
This sample resource grammar fragment
will be submitted to the Summer School Committee at the beginning of May.
The Committee then decides who is invited to represent which language
in the summer school.
</P>
<P>
After the Committee decision, the participants have around three months
to work on their languages. The work is completed in the summer school itself. It is also
thoroughly tested by using it to add a new language to the WebALT mathematical
exercise translator.
</P>
<P>
Depending on the quality of submitted work, and on the demands of different
languages, the Committee may decide to select a number of participants other than 2
for a language. We will also consider accepting participants who want to
pay their own expenses.
</P>
<P>
Good proposals for non-EU languages will also be considered. Proponents of
such languages should contact the summer school organizers as early as possible.
</P>
<P>
To keep track of who is working on which language, we will establish a web page
(Wiki or similar) soon after the call is published. The participants are encouraged
to contact each other and even work in groups.
</P>
<H3>Who is qualified</H3>
<P>
Writing a resource grammar implementation requires good general programming
skills, and a good explicit knowledge of the grammar of the target language.
A typical participant could be
</P>
<UL>
<LI>native or fluent speaker of the target language
<LI>interested in languages on the theoretical level, and preferably familiar
with many languages (to be able to think about them on an abstract level)
<LI>familiar with functional programming languages such as ML or Haskell
(GF itself is a language similar to these)
<LI>at Master's or PhD level in linguistics, computer science, or mathematics
</UL>
<P>
But it is the quality of the assignment that is assessed, not any formal
requirements. The "typical participant" was described to give an idea of
who is likely to succeed in this.
</P>
<H3>Costs</H3>
<P>
Our aim is to make the summer school free of charge for the participants
who are selected on the basis of their assignments. And not only that:
we plan to cover their travel and accommodation costs, up to 1000 EUR
per person.
</P>
<P>
We want to get the funding question settled by mid-February 2009, and make
the final decision on the summer school then.
</P>
<H3>Teachers</H3>
<P>
Krasimir Angelov
</P>
<P>
?Olga Caprotti
</P>
<P>
?Lauri Carlson
</P>
<P>
?Robin Cooper
</P>
<P>
?Björn Bringert
</P>
<P>
Håkan Burden
</P>
<P>
?Elisabet Engdahl
</P>
<P>
?Markus Forsberg
</P>
<P>
?Janna Khegai
</P>
<P>
?Peter Ljunglöf
</P>
<P>
?Wanjiku Ng'ang'a
</P>
<P>
Aarne Ranta
</P>
<P>
?Jordi Saludes
</P>
<P>
In addition, we will look for consultants who can help to assess the results
for each language.
</P>
<H3>The Summer School Committee</H3>
<P>
This committee consists of a number of teachers and consultants,
who will select the participants.
</P>
<H3>Time and Place</H3>
<P>
The summer school will
be organized in Gothenburg in the latter half of August 2009.
</P>
<P>
Time schedule (2009):
</P>
<UL>
<LI>February: announcement of summer school and the grammar
writing contest to get participants
<LI>March-April: work on the contest assignment (ca 1 month)
<LI>May: submission deadline and notification of acceptance
<LI>June-July: more work on the grammars
<LI>August: summer school
</UL>
<H3>Dissemination and intellectual property</H3>
<P>
The new resource grammars will be released under the GPL just like
the current resource grammars,
with the copyright held by respective authors.
</P>
<P>
The grammars will be distributed via the GF web site.
</P>
<P>
The WebALT-specific grammars will have special licenses agreed between the
authors and WebALT Inc.
</P>
<H2>Why I should participate</H2>
<P>
Seven reasons:
</P>
<OL>
<LI>free trip and stay in Gothenburg (to be confirmed)
<LI>participation in pioneering language technology work in an enthusiastic atmosphere
<LI>work and fun with people from all over Europe
<LI>job opportunities and business ideas
<LI>credits: the school project will be established as a course worth
15 ECTS points per person, but extensions to a Master's thesis will
also be considered
<LI>merits: the resulting grammar can easily lead to a published paper
<LI>contribution to the multilingual and multicultural development of Europe
</OL>
<!-- html code generated by txt2tags 2.4 (http://txt2tags.sf.net) -->
<!-- cmdline: txt2tags gf-summerschool.txt -->
</BODY></HTML>