forked from GitHub/gf-core

getting into syntax in the grammar document

This commit is contained in:
aarne
2013-08-28 05:59:06 +00:00
parent b60cf09a9f
commit cc259fea43
5 changed files with 323 additions and 5 deletions


@@ -23,14 +23,16 @@ of the verb, but it may be separated from it by an object: //Please switch it of
| ``Person`` | person | first, second, third
| ``Case`` | case | nominative, genitive
| ``Degree`` | degree | positive, comparative, superlative
| ``AForm`` | adjective form | degrees, adverbial
| ``VForm`` | verb form | infinitive, present, past, past participle, present participle
| ``VVType`` | infinitive form (for a VV) | bare infinitive, //to// infinitive, //ing// form
The assignment of parameter types and the identification of the separate parts of categories defines
the **data structures** in which the words are stored in a lexicon.
This data structure is in GF called the **linearization type** of the category. From the computer's
point of view, it is important that the data structures are well defined for all words, even if this may
sound unnecessary for the human. For instance, since some verbs need a particle part, all verbs must uniformly have a
storage for this particle, even if it is empty most of the time. This property is guaranteed by
an operation called **type checking**. It is performed by GF as a part of **grammar compilation**, which
is the process in which the human-readable description of the grammar is converted to bits executable
@@ -42,7 +44,7 @@ by the computer.
|| GF name | text name | example | inflectional features | inherent features ||
| ``N`` | noun | //house// | number, case | (none)
| ``PN`` | proper name | //Paris// | case | (none)
| ``A`` | adjective | //blue// | adjective form | (none)
| ``V`` | verb | //sleep// | verb form | particle
| ``Adv`` | adverb | //here// | (none) | (none)
| ``V2`` | two-place verb | //love// | verb form | particle, preposition
@@ -56,3 +58,149 @@ but a string.
We have done the same with the preposition strings that define the complement features of verb
and other subcategories.
The "digital grammar" representations of these types are **records**, where for instance the ``VV``
record type is formally written
```
{s : VForm => Str ; p : Str ; i : VVType}
```
The record has **fields** for different types of data. In the record above, there are three fields:
- the field labelled ``s``, storing an **inflection table** that produces a **string** (``Str``) depending on verb form,
- the field labelled ``p``, storing a string representing the particle,
- the field labelled ``i``, storing an inherent feature for the infinitive form required.
Thus for instance the record for verb-complement verb //try// (//to do something//) in the lexicon looks as follows:
```
{s = table {
VInf => "try" ;
VPres => "tries" ;
VPast => "tried" ;
VPastPart => "tried" ;
VPresPart => "trying"
} ;
p = "" ;
i = VVInf
}
```
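For readers who prefer a programming-language analogy, the same record can be mirrored as an ordinary Python dictionary. This is a hypothetical illustration, not GF code; the field names ``s``, ``p``, ``i`` and the feature names are copied from the record above:

```
# A sketch of the GF lexicon record for "try" as a Python dictionary.
try_VV = {
    "s": {                       # inflection table: verb form -> string
        "VInf": "try",
        "VPres": "tries",
        "VPast": "tried",
        "VPastPart": "tried",
        "VPresPart": "trying",
    },
    "p": "",                     # particle field, empty for "try"
    "i": "VVInf",                # inherent feature: required infinitive form
}

# Looking up one form, analogous to selecting from a GF table:
present = try_VV["s"]["VPres"]
```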
We have not introduced the GF names of the features, as we will not make essential use of them: we will prefer
informal explanations for all rules. So these records are a little hint for those who want to understand the
whole chain, from the rules as we state them in natural language, down to machine-readable digital grammars,
which ultimately have the same structure as our statements.
++Inflection paradigms++
In many languages, the description of inflectional forms occupies a large part of grammar books. Words, in particular
verbs, can have dozens of forms, and there can be dozens of different ways of building those forms. Each type of
inflection is described in a **paradigm**, which is a table including all forms of an example verb. For other
verbs, it is enough to indicate the number of the paradigm, to say that this verb is inflected "in the same way"
as the model verb.
===Nouns===
Computationally, inflection paradigms are **functions** that take as their arguments **stems**, to which suffixes
(and sometimes prefixes) are added. Here is, for instance, the English **regular noun** paradigm:
|| form | singular | plural ||
| nominative | //dog// | //dogs//
| genitive | //dog's// | //dogs'//
As a function, it is interpreted as follows: the word //dog// is the stem to which endings are added. Replacing it
with //cat//, //donkey//, //rabbit//, etc, will yield the forms of these words.
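The "paradigm as function" idea can be sketched in Python as follows (the function name and the feature encoding are ours, not GF's):

```
def regular_noun(stem):
    """Regular English noun paradigm: maps a stem to all four forms."""
    return {
        ("Sg", "Nom"): stem,            # dog
        ("Pl", "Nom"): stem + "s",      # dogs
        ("Sg", "Gen"): stem + "'s",     # dog's
        ("Pl", "Gen"): stem + "s'",     # dogs'
    }

# Replacing the stem yields the forms of another regular noun:
cat_forms = regular_noun("cat")
```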
In addition to nouns that are inflected with exactly the same suffixes as //dog//, English has
inflection types such as //fly-flies//, //kiss-kisses//, //bush-bushes//, //echo-echoes//. Each of these inflection types
could be described by a paradigm of its own. However, it is more attractive to see these as variations of the regular
paradigm, which are predictable by studying the singular nominative. This leads to a generalization of paradigms which
in the RGL are called **smart paradigms**.
Here is the smart paradigm of English nouns. It tells how the plural nominative is formed from the singular; the
genitive forms are always formed by just adding //'s// in the singular and //'// in the plural.
- for nouns ending with //s//, //z//, //x//, //sh//, //ch//, the forms are like //kiss - kisses//
- for nouns ending with a vowel (one of //a//,//e//,//i//,//o//,//u//) followed by //y//, the forms are like //boy - boys//
- for all other nouns ending with //y//, the forms are like //baby - babies//
- for nouns ending with a vowel or //y// followed by //o//, the forms are like //embryo - embryos//
- for all other nouns ending with //o//, the forms are like //echo - echoes//
- for all other nouns, the forms are like //dog - dogs//
The same rules are expressed in GF by **regular expression pattern matching** which, although formal and machine-readable,
might in fact be a nice notation for humans to read as well:
```
"s" | "z" | "x" | "sh" | "ch" => <word, word + "es">
#vowel + "y" => <word, word + "s">
"y" => <word, init word + "ies">
(#vowel | "y") + "o" => <word, word + "s">
"o" => <word, word + "es">
_ => <word, word + "s">
```
In this notation, ``|`` means "or" and ``+`` means "followed by". The pattern that is matched is followed by
an arrow ``=>``, after which the two forms appear within angle brackets. The patterns are matched in the given
order, and ``_`` means "anything that was not matched before". Finally, the function ``init`` returns the
initial segment of a word (e.g. //happ// for //happy//), and the pattern ``#vowel`` is defined as
``"a" | "e" | "i" | "o" | "u"``.
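The same pattern matching is easy to mimic in an ordinary programming language. Here is a Python sketch of the smart paradigm for the plural nominative (the function name and structure are ours):

```
VOWELS = "aeiou"

def plural(word):
    """Smart paradigm for the English plural nominative.
    The cases are tried in the same order as the GF patterns above."""
    if word.endswith(("s", "z", "x", "sh", "ch")):
        return word + "es"                    # kiss - kisses
    if word.endswith("y") and len(word) > 1 and word[-2] in VOWELS:
        return word + "s"                     # boy - boys
    if word.endswith("y"):
        return word[:-1] + "ies"              # baby - babies
    if word.endswith("o") and len(word) > 1 and word[-2] in VOWELS + "y":
        return word + "s"                     # embryo - embryos
    if word.endswith("o"):
        return word + "es"                    # echo - echoes
    return word + "s"                         # dog - dogs
```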
In addition to regular and predictable nouns, English has **irregular nouns**, such as //man - men//,
//formula - formulae//, //ox - oxen//. These nouns have their plural genitive formed by //'s//: //men's//.
===Adjectives===
English adjectives inflect for degree, with three values, and also have an adverbial form in their linearization type.
Here are some regular variations:
- for adjectives ending with consonant + vowel + consonant: //dim, dimmer, dimmest, dimly//
- for adjectives ending with //y// not preceded by a vowel: //happy, happier, happiest, happily//
- for other adjectives: //quick, quicker, quickest, quickly//
The comparison forms only work for adjectives with at most two syllables. For longer ones,
they are formed syntactically: //expensive, more expensive, most expensive//. There are also
some irregular adjectives, the most extreme one being perhaps //good, better, best, well//.
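The regular variations can likewise be captured in smart-paradigm style. The following Python sketch (our own function, covering only the regular cases) returns the positive, comparative, superlative, and adverbial forms:

```
VOWELS = "aeiou"

def adjective_forms(a):
    """Smart paradigm for regular English adjectives."""
    # y not preceded by a vowel: happy - happier - happiest - happily
    if a.endswith("y") and len(a) > 1 and a[-2] not in VOWELS:
        stem = a[:-1]
        return (a, stem + "ier", stem + "iest", stem + "ily")
    # consonant + vowel + consonant: double the final consonant
    if (len(a) >= 3 and a[-1] not in VOWELS + "wxy"
            and a[-2] in VOWELS and a[-3] not in VOWELS):
        return (a, a + a[-1] + "er", a + a[-1] + "est", a + "ly")
    # default: quick - quicker - quickest - quickly
    return (a, a + "er", a + "est", a + "ly")
```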
===Verbs===
English verbs have five different forms, e.g.
//sing, sings, sang, sung, singing//. The exception is the verb //be//, which has some more forms.
But //be// is also special syntactically and semantically, and is in the RGL introduced
in the syntax rather than in the lexicon.
Two forms, the past (indicative) and the past participle are the same for the so-called **regular verbs**
(e.g. //play, plays, played, played, playing//). The regular verb paradigm thus looks as follows:
|| feature | form ||
| infinitive | //play//
| present | //plays//
| past | //played//
| past participle | //played//
| present participle | //playing//
The predictable variations are related to the ones we have seen in nouns and adjectives:
the present tense of verbs varies in the same way as the plural of nouns,
and the past varies in the same way as the comparative of adjectives. The most important variations are
- for verbs ending with //s//, //z//, //x//, //sh//, //ch//: //kiss, kisses, kissed, kissing//
- for verbs ending with consonant + vowel + consonant: //dim, dims, dimmed, dimming//
- for verbs ending with //y// not preceded by a vowel: //cry, cries, cried, crying//
- for verbs ending with //ee//: //free, frees, freed, freeing//
- for verbs ending with //ie//: //die, dies, died, dying//
- for other verbs ending with //e//: //use, uses, used, using//
- for other verbs: //walk, walks, walked, walking//
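These variations can again be sketched as a smart paradigm in Python (our own function, returning infinitive, present, past, past participle, and present participle; for regular verbs the past and the past participle coincide):

```
VOWELS = "aeiou"

def verb_forms(v):
    """Smart paradigm for regular English verbs, cases in the order above."""
    if v.endswith(("s", "z", "x", "sh", "ch")):             # kiss
        return (v, v + "es", v + "ed", v + "ed", v + "ing")
    if (len(v) >= 3 and v[-1] not in VOWELS + "wxy"
            and v[-2] in VOWELS and v[-3] not in VOWELS):   # dim: doubling
        c = v[-1]
        return (v, v + "s", v + c + "ed", v + c + "ed", v + c + "ing")
    if v.endswith("y") and len(v) > 1 and v[-2] not in VOWELS:  # cry
        stem = v[:-1]
        return (v, stem + "ies", stem + "ied", stem + "ied", v + "ing")
    if v.endswith("ee"):                                    # free
        return (v, v + "s", v + "d", v + "d", v + "ing")
    if v.endswith("ie"):                                    # die
        return (v, v + "s", v + "d", v + "d", v[:-2] + "ying")
    if v.endswith("e"):                                     # use
        return (v, v + "s", v + "d", v + "d", v[:-1] + "ing")
    return (v, v + "s", v + "ed", v + "ed", v + "ing")      # walk
```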
English also has a couple of hundred **irregular verbs**, whose infinitive, past, and past participle forms have to be stored
separately. These three forms determine the other forms in the same way as for regular verbs. Thus
- from //cut, cut, cut//, you also get //cuts, cutting//
- from //fly, flew, flown//, you also get //flies, flying//
- from //write, wrote, written//, you also get //writes, writing//
===Structural words===


@@ -3,6 +3,8 @@ Aarne Ranta
%!Encoding:utf8
%!style(html): ../revealpopup.css
%!postproc(tex) : "#BECE" "begin{center}"
%!postproc(html) : "#BECE" "<center>"
%!postproc(tex) : "#ENCE" "end{center}"
@@ -233,7 +235,7 @@ future additions of more subcategories for verbs.
===Table: subcategories of verbs===
|| GF name | text name | example | inherent complement features | semantics ||
| ``V2`` | two-place verb | //love// (//someone//) | case or preposition | ``e -> e -> t``
| ``V3`` | three-place verb | //give// (//something to someone//) | two cases or prepositions | ``e -> e -> e -> t``
| ``VV`` | verb-complement verb | //try// (//to do something//) | infinitive form | ``e -> v -> t``
| ``VS`` | sentence-complement verb | //know// (//that something happens//) | sentence mood | ``e -> t -> t``


@@ -2,4 +2,172 @@
+Syntax: general rules+
The rules of syntax specify how words are combined to **phrases**, and how phrases are combined to even longer phrases.
Phrases, just like words, belong to different categories, which are equipped with inflectional and inherent features and
with semantic types. Moreover, each syntactic rule has a corresponding **semantic rule**, which specifies how the meaning
of the new phrases is constructed from the meanings of its parts.
The RGL has around 30 categories of phrases, on top of the lexical categories. The widest category is ``Text``, which covers
entire texts consisting of sentences, questions, interjections, etc, with punctuation. The following picture shows all RGL
categories as a dependency tree, where ``Text`` is in the root (so it is an upside-down tree), and the lexical categories
in the leaves. Being above another category in the tree means that phrases of higher categories can have phrases of lower
categories as parts. But these dependencies can work in both directions: for instance, the noun phrase (``NP``)
//every man who owns a donkey// has as its part the relative clause (``RCl``), which in turn has as its part the noun phrase
//a donkey//.
===Figure: the principal dependencies of phrasal and lexical categories===
[../categories.png]
Lexical categories appear in boxes rather than ellipses, with several categories gathered in some of the boxes.
++The structure of a clause++
It is convenient to start from the middle of the RGL: from the structure of a **clause** (``Cl``). A clause is an application
of a verb to its arguments. For instance, //John paints the house yellow// is an application of the ``V2A`` verb //paint//
to the arguments //John//, //the house//, and //yellow//. Recalling the table of lexical categories from Chapter 1,
we can summarize the semantic types of these parts as follows:
```
paint : e -> e -> (e -> t) -> t
John : e
the house : e
yellow : e -> t
```
Hence the verb //paint// is a **predicate**, a function that can be applied to arguments to return a proposition.
In this case, we can build the application
```
paint John (the house) yellow : t
```
which is thus an object of type ``t``.
Applying verbs to arguments is how clauses work on the semantic level. However, the syntactic fine-structure is
a bit more complex. The predication process is hence divided into several steps, which involve intermediate categories.
Following these steps, a clause is built by adding one argument at a time. Doing it this way, rather than adding
all arguments at once, has two advantages:
- the grammar doesn't need to specify the same things again and again for different verb categories
- at each step of the construction, a rule other than argument addition can apply - for instance, adding an adverb
Here are the steps in which //John paints the house yellow// is constructed from its arguments in the RGL:
- //paints// and //yellow// are combined to a **verb phrase missing a noun phrase** (``VPSlash``)
- //paints - yellow// and //the house// are combined to a **verb phrase** (``VP``)
- //John// and //paints the house yellow// are combined to a **clause** (``Cl``)
The structure is shown by the following tree:
#BECE
[paint-abstract.png]
#ENCE
This tree is called the **abstract syntax tree** of the sentence. It shows the structural components from which the
sentence has been constructed. Its nodes show the GF names associated with syntax rules and internally used for building
structures. Thus for instance ``PredVP`` encodes the rule that combines a noun phrase and a verb phrase into a clause,
``UsePN`` converts a proper name to a noun phrase, and so on. Mathematically, these names
denote **functions** that build abstract syntax trees from other trees. Every tree belongs to some category.
The GF notation for the ``PredVP`` rule is
```
PredVP : NP -> VP -> Cl
```
in words, ``PredVP`` //is a function that takes a noun phrase and a verb phrase and returns a clause//.
The tree is thus in fact built by function applications. A computer-friendly notation for trees uses
parentheses rather than graphical trees:
```
PredVP
(UsePN john_PN)
(ComplSlash
(SlashV2A paint_V2A (PositA yellow_A))
(DetCN (DetQuant DefArt NumSg) (UseN house_N)))
```
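Such trees are simple nested data. As a hypothetical illustration (not part of GF), one can encode them as nested Python tuples and recover the parenthesized notation with a small printer:

```
# The abstract syntax tree as nested tuples: (function, argument, ...).
tree = ("PredVP",
        ("UsePN", "john_PN"),
        ("ComplSlash",
         ("SlashV2A", "paint_V2A", ("PositA", "yellow_A")),
         ("DetCN", ("DetQuant", "DefArt", "NumSg"), ("UseN", "house_N"))))

def show(t):
    """Render a tree in the parenthesized notation used above."""
    if isinstance(t, str):
        return t
    head, *args = t
    return " ".join([head] + ["(" + show(a) + ")" if isinstance(a, tuple) else a
                              for a in args])
```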
Before going to the details of phrasal categories and rules, let us compare the abstract syntax tree with
another tree, known as **parse tree** or **concrete syntax tree**:
#BECE
[paint-concrete.png]
#ENCE
This tree shows, on its leaves, the clause that results from the combination of categories. Each node
is labelled with the category to which the part of the clause under it belongs. As shown by the label
``VPSlash``, this part can consist of many separate groups of words, where words from constructions from
higher up are inserted.
As parse trees display the actual words of a particular language, in a language-specific
order, they are less interesting from the multilingual point of view than the abstract syntax trees.
A GF grammar is thus primarily specified by its abstract syntax functions, which are language-neutral,
and secondarily by the **linearization rules** that convert them to different languages.
Let us specify the phrasal categories that are used for making up predications. The lexical category ``V2A`` of
two-place adjective-complement verbs was explained in Chapter 1.
===Table: phrasal categories involved in predication===
|| GF name | text name | example | inflection features | inherent features | parts | semantics ||
| ``Cl`` | clause | //he paints it blue// | temporal, polarity | (none) | one | ``t``
| ``VP`` | verb phrase | //paints it blue// | temporal, polarity, agreement | subject case | verb, complement | ``e -> t``
| ``VPSlash`` | slash verb phrase | //paints - blue// | temporal, polarity, agreement | subject and complement case | verb, complement | ``e -> e -> t``
| ``NP`` | noun phrase | //the house// | case | agreement | one | ``(e -> t) -> t``
| ``AP`` | adjectival phrase | //very blue// | gender, number, case | position | one | ``a`` = ``e -> t``
TODO explain **agreement** and **temporal**.
TODO explain the semantic type of ``NP``.
The functions that build up the clause in our example tree are given in the following table, together with functions that
build the semantics of the constructed trees. The latter functions operate on variables belonging to the semantic types of
the arguments of the function.
===Table: abstract syntax functions involved in predication===
|| GF name | type | example | semantics ||
| ``PredVP`` | ``NP -> VP -> Cl`` | //he// + //paints the house blue// | ``np vp``
| ``ComplSlash`` | ``VPSlash -> NP -> VP`` | //paints - blue// + //the house// | ``\x -> np (\y -> vpslash x y)``
| ``SlashV2A`` | ``V2A -> AP -> VPSlash`` | //paints// + //blue// | ``\x,y -> v2a x y ap``
TODO explain lambda abstraction.
The semantics of the clause //John paints the house yellow// can now be computed from the assumed meanings
```
John* : e
paint* : e -> e -> (e -> t) -> t
the_house* : e
yellow* : e -> t
```
as follows:
```
(PredVP John (ComplSlash (SlashV2A paint yellow) the_house))*
= (ComplSlash (SlashV2A paint yellow) the_house)* John*
= (SlashV2A paint yellow)* John* the_house*
= paint* John* the_house* yellow*
```
for the moment ignoring the internal structure of noun phrases, which will be explained later.
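The computation can be replayed with higher-order functions. In the following Python sketch (all names are ours), entities are symbolic strings, the meaning of //paint// builds a logical-form string, and the AP meaning is simplified to a symbol; as in the derivation, the subject is applied directly, ignoring the internal NP structure:

```
# Assumed meanings, with the object NP type-raised: (e -> t) -> t.
paint = lambda x: lambda y: lambda p: f"paint({x},{y},{p})"
john = "John"
the_house = lambda p: p("the_house")
yellow = "yellow"

# SlashV2A:   \x,y -> v2a x y ap
vpslash = lambda x: lambda y: paint(x)(y)(yellow)
# ComplSlash: \x -> np (\y -> vpslash x y)
vp = lambda x: the_house(lambda y: vpslash(x)(y))
# PredVP: apply the verb phrase to the subject
clause = vp(john)   # "paint(John,the_house,yellow)"
```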
The linearization rules work very much in the same way as the semantic rules. They obey the definitions of
inflectional and inherent features and discontinuous parts, which together define linearization types of
the phrasal categories. These types are at this point schematic, because we don't assume any particular
language. But what we can read out from the category table above is as follows:
===Table: schematic linearization types===
|| GF name | text name | linearization type ||
| ``Cl`` | clause | ``{s : Temp => Pol => Str}``
| ``VP`` | verb phrase | ``{s : Temp => Pol => Agr => {verb,compl : Str} ; sc : Case}``
| ``VPSlash`` | slash verb phrase | ``{s : Temp => Pol => Agr => {verb,compl : Str} ; sc, cc : Case}``
| ``NP`` | noun phrase | ``{s : Case => Str ; a : Agr}``
| ``AP`` | adjectival phrase | ``{s : Gender => Number => Case => Str ; isPre : Bool}``
These types suggest the following linearization rules:
```
PredVP np vp = {s = \\t,p => let vps = vp.s ! t ! p ! np.a in np.s ! vp.sc ++ vps.verb ++ vps.compl}
```
TODO linearization of the example
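To see the string assembly concretely, here is a Python sketch of the ``PredVP`` rule over toy record values (all data hypothetical), with the inflection table of the verb phrase simplified to a function of tense, polarity, and agreement:

```
# Toy NP and VP records mirroring the schematic linearization types.
np = {"s": {"Nom": "John", "Gen": "John's"},      # s : Case => Str
      "a": "Sg3"}                                 # a : Agr
vp = {"s": lambda t, p, a: {"verb": "paints",     # s : Temp => Pol => Agr => ...
                            "compl": "the house yellow"},
      "sc": "Nom"}                                # sc : Case (subject case)

def PredVP(np, vp):
    """Combine a noun phrase and a verb phrase into a clause record."""
    def s(t, p):
        vps = vp["s"](t, p, np["a"])   # select forms by tense, polarity, agreement
        return " ".join([np["s"][vp["sc"]], vps["verb"], vps["compl"]])
    return {"s": s}

# PredVP(np, vp)["s"]("Pres", "Pos") yields "John paints the house yellow"
```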
Rules similar to those for ``V2A`` apply to all subcategories of verbs. The ``V2`` verbs are first made into ``VPSlash``
by giving the non-NP complement. ``V3`` verbs can take their two NP complements in either order, which
means that there are two ``VPSlash``-producing rules. This
makes it possible to form both the questions //what did she give him// and //whom did she give it//.
The other ``V`` categories are turned into ``VP`` without going through ``VPSlash``, since they have
no noun phrase complements.
