GF Resource Grammar Summer School
Gothenburg, 17-28 August 2009
Aarne Ranta (aarne at chalmers.se)

%!Encoding : iso-8859-1

%!target:html
%!postproc(html): #BECE <center>
%!postproc(html): #ENCE </center>
%!postproc(html): #GRAY <font color="green" size="-1">
%!postproc(html): #EGRAY </font>
%!postproc(html): #RED <font color="red">
%!postproc(html): #YELLOW <font color="orange">
%!postproc(html): #ERED </font>

#BECE
[school-langs.png]
#ENCE


//red=wanted, green=exists, orange=in-progress, solid=official-eu, dotted=non-eu//


==News==

An on-line course //GF for Resource Grammar Writers// will start on
Monday 20 April at 15.30 CEST. The slides and recordings of the five
45-minute lectures will be made available via this web page. If requested,
the course may be repeated at the beginning of the summer school.

==Executive summary==

The GF Resource Grammar Library is an open-source computational grammar resource
that currently covers 12 languages.
The Summer School is part of a collaborative effort to extend the library
to all 23 official EU languages. Other languages
chosen by the participants are also welcome.

The missing EU languages are:
Czech, Dutch, Estonian, Greek, Hungarian, Irish, Latvian, Lithuanian,
Maltese, Portuguese, Slovak, and Slovenian. There is also more work to
be done on Polish and Romanian.

The linguistic coverage of the library includes the inflectional morphology
and basic syntax of each language. It can be used in GF applications
and also ported to other formats. It can also be used for building other
linguistic resources, such as morphological lexica and parsers.
The library is licensed under LGPL.

In the summer school, each language will be implemented by one or two students
working together. A morphology implementation will be credited
as a Chalmers course worth 7.5 ECTS points; adding a syntax implementation
will be worth more. The estimated total work load is 1-2 months for the
morphology, and 3-6 months for the whole grammar.

Participation in the course is free. Registration is done via the course's
Google group, [``groups.google.com/group/gf-resource-school-2009/`` http://groups.google.com/group/gf-resource-school-2009/]. The registration deadline is 15 June 2009.

Some travel grants will be available. They are distributed on the basis of a
GF programming contest in April and May.

The summer school will be held on 17-28 August 2009, at the campus of
Chalmers University of Technology in Gothenburg, Sweden.


[align6.png]

//Word alignment produced by GF from the resource grammar in Bulgarian, English, Italian, German, Finnish, French, and Swedish.//

==Introduction==

Since 2007, the EU-27 has had 23 official languages, listed in the diagram at the top of this
document. There is a growing need for linguistic resources for these
languages, to help in tasks such as translation and information retrieval.
These resources should be **portable** and **freely accessible**.
Languages marked in red in the diagram are of particular interest for
the summer school, since they are the ones on which the effort will be concentrated.

GF (Grammatical Framework,
[``digitalgrammars.com/gf`` http://digitalgrammars.com/gf])
is a **functional programming language** designed for writing natural
language grammars. It provides an efficient platform for this task, due to
its modern characteristics:
- It is a functional programming language, similar to Haskell and ML.
- It has a static type system and type checker.
- It has a powerful module system supporting separate compilation
and data abstraction.
- It has an optimizing compiler to **Portable Grammar Format** (PGF).
- PGF can be further compiled to other formats, such as JavaScript and
speech recognition language models.
- GF has a **resource grammar library** giving access to the morphology and
basic syntax of 12 languages.


In addition to "ordinary" grammars for single languages, GF
supports **multilingual grammars**. A multilingual GF grammar consists of an
**abstract syntax** and a set of **concrete syntaxes**.
An abstract syntax is a system of **trees**, serving as a semantic
model or an ontology. A concrete syntax is a mapping from abstract syntax
trees to strings of a particular language.

The mappings defined in a concrete syntax are **reversible**: they
can be used both for **generating** strings from trees, and for
**parsing** strings into trees. Combinations of generation and
parsing can be used for **translation**, where the abstract
syntax works as an **interlingua**. Thus GF has been used as a
framework for building translation systems in several areas
of application and for large sets of languages.
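
The division of labour between abstract and concrete syntax can be illustrated with a minimal Python sketch. It is not GF code and not how GF is implemented: the toy grammar (``Pred``, ``Wine``, ``Cheese``, ``Good``, ``Fresh``), the table ``CONCRETE`` and all function names are assumptions made up for this illustration, and parsing is done by brute-force search, which only works because the toy grammar is finite.

```python
from itertools import product

# Abstract syntax: trees are nested tuples of the form (function, *children).
# The functions used here (Pred, Wine, Cheese, Good, Fresh) are a made-up toy grammar.
ITEMS = ["Wine", "Cheese"]
QUALITIES = ["Good", "Fresh"]

# Concrete syntaxes: one linearization rule per abstract function and language.
# Both Italian nouns are masculine, so plain strings suffice here; a real
# resource grammar needs inflection tables and agreement features instead.
CONCRETE = {
    "Eng": {
        "Pred": lambda item, quality: f"{item} is {quality}",
        "Wine": lambda: "this wine",
        "Cheese": lambda: "this cheese",
        "Good": lambda: "good",
        "Fresh": lambda: "fresh",
    },
    "Ita": {
        "Pred": lambda item, quality: f"{item} è {quality}",
        "Wine": lambda: "questo vino",
        "Cheese": lambda: "questo formaggio",
        "Good": lambda: "buono",
        "Fresh": lambda: "fresco",
    },
}


def all_trees():
    """Enumerate every tree of the (finite) toy abstract syntax."""
    return [("Pred", (item,), (quality,))
            for item, quality in product(ITEMS, QUALITIES)]


def linearize(tree, lang):
    """Map an abstract tree to a string by applying concrete rules bottom-up."""
    fun, *children = tree
    return CONCRETE[lang][fun](*(linearize(child, lang) for child in children))


def parse(sentence, lang):
    """Invert linearization by generate-and-test (a stand-in for a real parser)."""
    return [tree for tree in all_trees() if linearize(tree, lang) == sentence]


def translate(sentence, source, target):
    """Parse in the source language, then linearize each tree in the target."""
    return [linearize(tree, target) for tree in parse(sentence, source)]


print(translate("this wine is good", "Eng", "Ita"))
```

Running the sketch prints ``['questo vino è buono']``. Adding a third language only requires a new entry in ``CONCRETE``; the trees, and hence the interlingua, stay the same.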
==The GF resource grammar library==

The GF resource grammar library is a set of grammars usable as libraries when
building translation systems and other applications.
The library currently covers
the 9 languages coloured in green in the diagram above; in addition,
Catalan, Norwegian, and Russian are covered, and there is ongoing work on
Arabic, Hindi/Urdu, Polish, Romanian, and Thai.

The purpose of the resource grammar library is to define the "low-level" structure
of a language: inflection, word order, agreement. This structure belongs to what
linguists call morphology and syntax. It can be very complex and requires
a lot of knowledge. Yet, when translating from one language to
another, knowing morphology and syntax is but a part of what is needed.
The translator (whether human
or machine) must understand the meaning of what is translated, and must also know
the idiomatic way to express the meaning in the target language. This knowledge
can be very domain-dependent and in general requires an expert in the field to
reach high quality: a mathematician in the field of mathematics, a meteorologist
in the field of weather reports, etc.

The problem is to find a person who is an expert both in the domain of translation
and in the low-level linguistic details. It is the rarity of this combination
that has made it difficult to build interlingua-based translation systems.
The GF resource grammar library has the mission of helping in this task.
It encapsulates the low-level linguistics in program modules
accessed through easy-to-use interfaces.
Experts in different domains can build translation systems by using the library,
without knowing low-level linguistics. The idea is much the same as when a
programmer builds a graphical user interface (GUI) from high-level elements such as
buttons and menus, without having to care about pixels or geometrical forms.
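
The encapsulation idea can be pictured with a small Python sketch. It is only an analogy under stated assumptions: the names ``Noun``, ``Verb`` and ``predicate`` are hypothetical and do not correspond to the actual resource grammar API, and English number agreement stands in for the much richer low-level details a real library has to handle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Noun:
    sg: str  # singular form
    pl: str  # plural form


@dataclass(frozen=True)
class Verb:
    sg3: str  # third person singular present
    pl: str   # plural present


def predicate(noun: Noun, verb: Verb, plural: bool = False) -> str:
    """Library side: pick the forms so that subject and verb agree in number."""
    if plural:
        return f"these {noun.pl} {verb.pl}"
    return f"this {noun.sg} {verb.sg3}"


# Application side: the domain expert only combines high-level pieces
# and never mentions agreement explicitly.
train = Noun("train", "trains")
leave = Verb("leaves", "leave")

print(predicate(train, leave))
print(predicate(train, leave, plural=True))
```

The application code at the bottom plays the role of the GUI programmer in the comparison above: it asks for a sentence and leaves the choice of word forms to the library function.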
===Missing EU languages, by family===

Writing a grammar for a language is usually easier if other languages
from the same family already have grammars. The colours have the same
meaning as in the diagram above.

Baltic:
#RED Latvian #ERED
#RED Lithuanian #ERED

Celtic:
#RED Irish #ERED

Finno-Ugric:
#RED Estonian #ERED
#GRAY Finnish #EGRAY
#RED Hungarian #ERED

Germanic:
#GRAY Danish #EGRAY
#RED Dutch #ERED
#GRAY English #EGRAY
#GRAY German #EGRAY
#GRAY Swedish #EGRAY

Hellenic:
#RED Greek #ERED

Romance:
#GRAY French #EGRAY
#GRAY Italian #EGRAY
#RED Portuguese #ERED
#YELLOW Romanian #ERED
#GRAY Spanish #EGRAY

Semitic:
#RED Maltese #ERED

Slavonic:
#GRAY Bulgarian #EGRAY
#RED Czech #ERED
#YELLOW Polish #ERED
#RED Slovak #ERED
#RED Slovenian #ERED


===Applications of the library===

In addition to translation, the library is also useful in **localization**,
that is, porting a piece of software to new languages.
The GF resource grammar library has been used in three major projects that need
interlingua-based translation or localization of systems to new languages:
- in KeY,
[``http://www.key-project.org/`` http://www.key-project.org/],
for writing formal and informal software specifications (3 languages)
- in WebALT,
[``http://webalt.math.helsinki.fi/content/index_eng.html`` http://webalt.math.helsinki.fi/content/index_eng.html],
for translating mathematical exercises to 7 languages
- in TALK [``http://www.talk-project.org`` http://www.talk-project.org],
where the library was used for localizing spoken dialogue systems
to 6 languages


The library is also a generic **linguistic resource**,
which can be used for tasks
such as language teaching and information retrieval. The liberal license (LGPL)
makes it usable by anyone and for any task. GF also has tools supporting the
use of grammars in programs written in other
programming languages: C, C++, Haskell,
Java, JavaScript, and Prolog. In connection with the TALK project,
support has also been
developed for translating GF grammars to language models used in speech
recognition (GSL/Nuance, HTK/ATK, SRGS, JSGF).


===The structure of the library===

The library has the following main parts:
- **Inflection paradigms**, covering the inflection of each language.
- **Core Syntax**, covering a large set of syntax rules that
can be implemented for all languages involved.
- **Common Test Lexicon**, giving ca. 500 common words that can be used for
testing the library.
- **Language-Specific Syntax Extensions**, covering syntax rules that are
not implementable for all languages.
- **Language-Specific Lexica**, word lists for each language, with
accurate morphological and syntactic information.


The goal of the summer school is to implement, for each language, at least
the first three components. The latter two are more open-ended in character.
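
To give an idea of what an inflection paradigm does, here is a minimal Python sketch for regular English nouns. It is an illustration only, not code from the library: the function names ``regular_noun`` and ``noun`` and the dictionary representation are assumptions, and the spelling rules cover just a few common cases.

```python
VOWELS = "aeiou"


def regular_noun(singular: str) -> dict:
    """Guess the plural of a regular English noun from the ending of its singular."""
    if singular.endswith(("s", "x", "z", "sh", "ch")):
        plural = singular + "es"            # bus -> buses, church -> churches
    elif (singular.endswith("y") and len(singular) > 1
          and singular[-2] not in VOWELS):
        plural = singular[:-1] + "ies"      # fly -> flies (but boy -> boys)
    else:
        plural = singular + "s"             # dog -> dogs
    return {"sg": singular, "pl": plural}


def noun(singular: str, plural: str = None) -> dict:
    """Lexicon entry point: regular by default, explicit plural for irregular nouns."""
    if plural is None:
        return regular_noun(singular)
    return {"sg": singular, "pl": plural}


# A tiny test lexicon in the spirit of the Common Test Lexicon.
for entry in [noun("dog"), noun("bus"), noun("fly"), noun("boy"), noun("man", "men")]:
    print(entry["sg"], entry["pl"])
```

The point of such a paradigm is that most lexicon entries need only one form, while irregular words supply the extra forms explicitly. Languages with richer morphology need many more forms per word and more elaborate rules, which is where most of the morphology work goes.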
==The summer school==

The goal of the summer school is to extend the GF resource grammar library
to cover all 23 EU languages. Since 9 of them are already covered, this means
we need 14 new languages (the 12 missing ones plus Polish and Romanian, which are in progress).
We also welcome languages other than these 23,
if there are interested participants.

The amount of work and skill required is between a Master's thesis and a PhD thesis.
The Russian implementation was made by Janna Khegai as a part of her
PhD thesis; the thesis contains other material, too.
The Arabic implementation was started by Ali El Dada in his Master's thesis,
but the thesis does not cover the whole API. The realistic amount of work is
somewhere between 3 and 8 person months,
but this is very much language-dependent.
Dutch, for instance, can profit from previous implementations of German and
Scandinavian languages, and will probably require less work.
Latvian and Lithuanian are the first languages of the Baltic family and
will probably require more work.

In any case, the proposed allocation of work power is 2 participants per
language. They will do 1 month's worth of home work, followed
by 2 weeks of summer school, followed by 4 months' work at home.
Who are these participants?

===Selecting participants===

Persons interested in participating in the Summer School should sign up in
the **Google Group** of the course,

[``groups.google.com/group/gf-resource-school-2009/`` http://groups.google.com/group/gf-resource-school-2009/]

The registration deadline is 15 June 2009.

Notice: you can sign up in the Google
group even if you are not planning to attend the summer school, but are
just interested in the topic. There will be a separate registration for the
school itself later.

Participants are advised to learn GF in advance, by self-study from the
[tutorial http://digitalgrammars.com/gf/doc/gf-tutorial.html].
This should take a couple of weeks. An **on-line course** will be
arranged on 20-29 April to help in getting started with GF.

At the end of the on-line course, a **programming assignment** will be published.
This assignment will test skills required in resource grammar programming.
Work on the assignment will take a couple of weeks.
Those who are interested in getting a travel grant will submit
their sample resource grammar fragment
to the Summer School Committee by 12 May.
The Committee then decides who is given a travel grant of up to 1000 EUR.

Notice: you can participate in the summer school without following the on-line
course or participating in the contest. These things are required only if you
want a travel grant. If requested by sufficiently many participants, the lectures of
the on-line course will be repeated at the beginning of the summer school.

The summer school itself is devoted to working on resource grammars.
In addition to grammar writing itself, testing and evaluation are
performed. One way to do this is by adding new languages
to resource grammar applications - in particular, to the WebALT mathematical
exercise translator.

The resource grammars are expected to be completed by December 2009. They will
be published on the GF website and licensed under LGPL.

The participants are encouraged to contact each other and even work in groups.


===Who is qualified===

Writing a resource grammar implementation requires good general programming
skills, and a good explicit knowledge of the grammar of the target language.
A typical participant could be
- a native or fluent speaker of the target language
- interested in languages on the theoretical level, and preferably familiar
with many languages (to be able to think about them on an abstract level)
- familiar with functional programming languages such as ML or Haskell
(GF itself is a language similar to these)
- on Master's or PhD level in linguistics, computer science, or mathematics


But it is the quality of the assignment that is assessed, not any formal
requirements. The "typical participant" was described to give an idea of
who is likely to succeed in this.


===Costs===

The summer school is free of charge.

Some travel grants are given, on the basis of a programming contest,
to cover travel and accommodation costs up to 1000 EUR
per person.

The number of grants will be decided during Spring 2009, and the grant
holders will be notified before the beginning of June.

Special terms will apply to students in
[GSLT http://www.gslt.hum.gu.se/] and
[NGSLT http://ngslt.org/].


===Teachers===

A list of teachers will be published here later. Some of the local teachers
likely to be involved are the following:
- Krasimir Angelov
- Robin Cooper
- Håkan Burden
- Markus Forsberg
- Harald Hammarström
- Peter Ljunglöf
- Aarne Ranta


More teachers are welcome! If you are interested, please contact us so that
we can discuss your involvement and travel arrangements.

In addition to teachers, we will look for consultants who can help to assess
the results for each language. Please contact us!


===The Summer School Committee===

This committee consists of a number of teachers and informants,
who will select the participants. The committee itself will be appointed by April 2009.

===Time and Place===

The summer school will be organized at the campus of
Chalmers University of Technology in Gothenburg,
Sweden, on 17-28 August 2009.

Time schedule:
- February: announcement of summer school
- 20-29 April: on-line course
- 12 May: submission deadline for assignment work
- 31 May: review of assignments, notifications of acceptance
- 15 June: **registration deadline**
- 17-28 August: Summer School
- September-December: homework on resource grammars
- December: release of the extended Resource Grammar Library


===Dissemination and intellectual property===

The new resource grammars will be released under the LGPL just like
the current resource grammars,
with the copyright held by the respective authors.

The grammars will be distributed via the GF web site.


==Why I should participate==

Seven reasons:
+ participation in pioneering language technology work in an
enthusiastic atmosphere
+ work and fun with people from all over Europe and the world
+ job opportunities and business ideas
+ credits: the school project will be established as a course at Chalmers worth
7.5 or 15 ECTS points per person, depending on the work accomplished; also
extensions to Master's thesis will be considered (special credit arrangements
for [GSLT http://www.gslt.hum.gu.se/] and [NGSLT http://ngslt.org/])
+ merits: the resulting grammar can easily lead to a published paper (see below)
+ contribution to the multilingual and multicultural development of Europe and the
world
+ free trip and stay in Gothenburg (for travel grant students)


==More information==

[Course Google Group http://groups.google.com/group/gf-resource-school-2009/]

[GF web page http://digitalgrammars.com/gf/]

[GF tutorial http://digitalgrammars.com/gf/doc/gf-tutorial.html]

[GF resource synopsis http://digitalgrammars.com/gf/lib/resource/doc/synopsis.html]

[Resource-HOWTO document http://digitalgrammars.com/gf/doc/Resource-HOWTO.html]


===Contact===

Håkan Burden: burden at chalmers se

Aarne Ranta: aarne at chalmers se


===Selected publications from earlier resource grammar projects===

K. Angelov.
Type-Theoretical Bulgarian Grammar.
In B. Nordström and A. Ranta (eds),
//Advances in Natural Language Processing (GoTAL 2008)//,
LNCS/LNAI 5221, Springer,
2008.

B. Bringert.
//Programming Language Techniques for Natural Language Applications//.
PhD thesis, Computer Science, University of Gothenburg,
2008.

A. El Dada and A. Ranta.
Implementing an Open Source Arabic Resource Grammar in GF.
In M. Mughazy (ed),
//Perspectives on Arabic Linguistics XX. Papers from the Twentieth Annual Symposium on Arabic Linguistics, Kalamazoo, March 26//,
John Benjamins Publishing Company,
2007.

A. El Dada.
Implementation of the Arabic Numerals and their Syntax in GF.
Computational Approaches to Semitic Languages: Common Issues and Resources,
ACL-2007 Workshop,
June 28, 2007, Prague.
2007.

H. Hammarström and A. Ranta.
Cardinal Numerals Revisited in GF.
//Workshop on Numerals in the World's Languages//.
Dept. of Linguistics, Max Planck Institute for Evolutionary Anthropology, Leipzig,
2004.

M. Humayoun, H. Hammarström, and A. Ranta.
Urdu Morphology, Orthography and Lexicon Extraction.
//CAASL-2: The Second Workshop on Computational Approaches to Arabic Script-based Languages//,
July 21-22, 2007, LSA 2007 Linguistic Institute, Stanford University.
2007.

K. Johannisson.
//Formal and Informal Software Specifications.//
PhD thesis, Computer Science, University of Gothenburg,
2005.

J. Khegai.
GF parallel resource grammars and Russian.
In proceedings of ACL2006
(The joint conference of the International Committee on Computational
Linguistics and the Association for Computational Linguistics) (pp. 475-482),
Sydney, Australia, July 2006.

J. Khegai.
//Language engineering in Grammatical Framework (GF)//.
PhD thesis, Computer Science, Chalmers University of Technology,
2006.

W. Ng'ang'a.
Multilingual content development for eLearning in Africa.
eLearning Africa: 1st Pan-African Conference on ICT for Development,
Education and Training. 24-26 May 2006, Addis Ababa, Ethiopia.
2006.

N. Perera and A. Ranta.
Dialogue System Localization with the GF Resource Grammar Library.
//SPEECHGRAM 2007: ACL Workshop on Grammar-Based Approaches to Spoken Language Processing//,
June 29, 2007, Prague.
2007.

A. Ranta.
Modular Grammar Engineering in GF.
//Research on Language and Computation//,
5:133-158, 2007.

A. Ranta.
How predictable is Finnish morphology? An experiment on lexicon construction.
In J. Nivre, M. Dahllöf and B. Megyesi (eds),
//Resourceful Language Technology: Festschrift in Honor of Anna Sågvall Hein//,
University of Uppsala,
2008.

A. Ranta. Grammars as Software Libraries.
To appear in
Y. Bertot, G. Huet, J-J. Lévy, and G. Plotkin (eds.),
//From Semantics to Computer Science//,
Cambridge University Press, Cambridge, 2009.

A. Ranta and K. Angelov.
Implementing Controlled Languages in GF.
To appear in the proceedings of //CNL 2009//.