gf-core/lib/doc/languages/gf-english-1.txt


+Words: English-specific rules+

++Morphological features++

The first task when defining the language-specific rules for linguistic structures in the RGL is to give the
actual ranges of the features attached to the categories. We have to tell whether the language has the grammatical
number (as e.g. Chinese has not), and which values it takes (as many languages have two numbers but e.g. Arabic has three).
We have to do likewise for case, gender, person, tense - in other words, to specify the **parameter types** of
the language. Then we have to proceed to specifying which features belong to which lexical categories and how (i.e.
as inflectional or inherent features). In this process, we may also note that we need some special features that
are complex combinations of the "standard" features (as happens with English verbs: their forms are depend on tense,
number, and person, but not as a straightforward combination of them). We may also notice that a "words" in some
category may in fact consist of several words, which may even appear separated from each other. English verbs such as
//switch off//, called **particle verbs**, are a case in point. The particle contributes essentially to the meaning
of the verb, but it may be separated from it by an object: //Please switch it off!//


===Table: parameter types needed for content words in English===

|| GF name   | text name | values             ||
| ``Number`` | number    | singular, plural
| ``Person`` | person    | first, second, third
| ``Case``   | case      | nominative, genitive
| ``Degree`` | degree    | positive, comparative, superlative
| ``AForm``  | adjective form | degrees, adverbial
| ``VForm``  | verb form | infinitive, present, past, past participle, present participle
| ``VVType`` | infinitive form (for a VV) | bare infinitive, //to// infinitive, //ing// form


The assignment of parameter types and the identification of the separate parts of categories defines
the **data structures** in which the words are stored in a lexicon.
This data structure is in GF called the **linearization type** of the category. From the computer's
point of view, it is important that the data structures are well defined for all words, even if this may
sound unnecessary for the human. For instance, since some verbs need a particle part, all verbs must uniformly have a
storage for this particle, even if it is empty most of the time. This property is guaranteed by
an operation called **type checking**. It is performed by GF as a part of **grammar compilation**, which
is the process in which the human-readable description of the grammar is converted to bits executable
by the computer.


===Table: linearization types of English content words===

|| GF name  | text name        | example    | inflectional features       | inherent features  ||
| ``N``     | noun             | //house//  | number, case                | (none)
| ``PN``    | proper name      | //Paris//  | case                        | (none)
| ``A``     | adjective        | //blue//   | adjective form              | (none)
| ``V``     | verb             | //sleep//  | verb form                   | particle
| ``Adv``   | adverb           | //here//   | (none)                      | (none)
| ``V2``    | two-place verb   | //love//   | verb form                   | particle, preposition
| ``VV``    | verb-complement verb   | //try//   | verb form              | particle, infinitive form
| ``VS``    | sentence-complement verb   | //know//   | verb form              | particle
| ``VQ``    | question-complement verb   | //ask//    | verb form              | particle
| ``VA``    | adjective-complement verb  | //become// | verb form              | particle

Notice that we have placed the particle of verbs in the inherent feature column. It is not a parameter
but a string.
We have done the same with the preposition strings that define the complement features of verb
and other subcategories.

The "digital grammar" representations of these types are **records**, where for instance the ``VV``
record type is formally written
```
  {s : VForm => Str ; p : Str ; i : InfForm}
```
The record has **fields** for different types of data. In the record above, there are three fields:
- the field labelled ``s``, storing an **inflection table** that produces a **string** (``Str``) depending on verb form,
- the field labelled ``p``, storing a string representing the particle,
- the field labelled ``i``, storing an inherent feature for the infinitive form required


Thus for instance the record for verb-complement verb //try// (//to do something//) in the lexicon looks as follows:
```
  {s = table {
     VInf => "try" ;
     VPres => "tries" ;
     VPast => "tried" ;
     VPastPart => "tried" ;
     VPresPart => "trying"
     } ;
   p = "" ;
   i = VVInf
   }
```
We have not introduce the GF names of the features, as we will not make essential use of them: we will prefer
informal explanations for all rules. So these records are a little hint for those who want to understand the
whole chain, from the rules as we state them in natural language, down to machine-readable digital grammars,
which ultimately have the same structure as our statements.


++Inflection paradigms++

In many languages, the description of inflectional forms occupies a large part of grammar books. Words, in particular
verbs, can have dozens of forms, and there can be dozens of different ways of building those forms. Each type of
inflection is described in a **paradigm**, which is a table including all forms of an example verb. For other
verbs, it is enough to indicate the number of the paradigm, to say that this verb is inflected "in the same way"
as the model verb.


===Nouns===

Computationally, inflection paradigms are **functions** that take as their arguments **stems**, to which suffixes
(and sometime prefixes) are added. Here is, for instance, the English **regular noun** paradigm:

|| form      | singular  | plural  ||
| nominative | //dog//   | //dogs//
| genitive   | //dog's// | //dogs'//

As a function, it is interpreted as follows: the word //dog// is the stem to which endings are added. Replacing it
with //cat//, //donkey//, //rabbit//, etc, will yield the forms of these words.

In addition to nouns that are inflected with exactly the same suffixes as //dog//, English has
inflection types such as //fly-flies//, //kiss-kisses//, //bush-bushes//, //echo-echoes//. Each of these inflection types
could be described by a paradigm of its own. However, it is more attractive to see these as variations of the regular
paradigm, which are predictable by studying the singular nominative. This leads to a generalization of paradigms which
in the RGL are called **smart paradigms**.

Here is the smart paradigm of English nouns. It tells how the plural nominative is formed from the singular; the
genitive forms are always formed by just adding //'s// in the singular and //'// in the plural.
- for nouns ending with //s//, //z//, //x//, //sh//, //ch//, the forms are like //kiss - kisses//
- for nouns ending with a vowel (one of //a//,//e//,//i//,//o//,//u//) followed by //y//, the forms are like //boy - boys//
- for all other nouns ending with //y//, the forms are like //baby - babies//
- for nouns ending with a vowel or //y// and followed by //o//, the forms are like //embryo - embryos//
- for all other nouns ending with //o//, the forms are like //echo - echoes//
- for all other nouns, the forms are like //dog - dogs//


The same rules are in GF expressed by **regular expression pattern matching** which, although formal and machine-readable,
might in fact be a nice notation for humans to read as well:
```
  "s" | "z" | "x" | "sh" | "ch" => <word, word + "es">
  #vowel + "y"                  => <word, word + "s">
  "y"                           => <word, init word + "ies">
  (#vowel | "y") + "o"          => <word, word + "s">
  "o"                           => <word, word + "es">
  _                             => <word, word + "s">
```
In this notation, ``|`` means "or" and ``+`` means "followed by". The pattern that is matched is followed by
an arrow ``=>``, after which the two forms appear within angel brackets. The patterns are matched in the given
order, and ``_`` means "anything that was not matched before". Finally, the function ``init`` returns the
initial segment of a word (e.g. //happ// for //happy//), and the pattern ``#vowel`` is defined as
``"a" | "e" | "i" | "o" | "u".

In addition to regular and predictable nouns, English has **irregular nouns**, such as //man - men//,
//formula - formulae//, //ox - oxen//. These nouns have their plural genitive formed by //'s//: //men's//.


===Adjectives===

English adjectives inflect for degree, with three values, and also have an adverbial form in their linearization type.
Here are some regular variations:
- for adjectives ending with consonant + vowel + consonant: //dim, dimmer, dimmest, dimly//
- for adjectives ending with //y// not preceded by a vowel: //happy, happier, happier, happily//
- for other adjectives: //quick, quicker, quickest, quickly//


The comparison forms only work for adjectives with at most two syllables. For longer ones,
they are formed syntactically: //expensive, more expensive, most expensive//. There are also
some irregular adjectives, the most extreme one being perhaps //good, better, best, well//.


===Verbs===

English verbs have five different forms, except for the verb //be//, which has some more forms, e.g.
//sing, sings, sang, sung, singing//.
But //be// is also special syntactically and semantically, and is in the RGL introduced
in the syntax rather than in the lexicon.

Two forms, the past (indicative) and the past participle are the same for the so-called **regular verbs**
(e.g. //play, plays, played, played, playing//). The regular verb paradigm thus looks as follows:

|| feature   | form ||
| infinitive | //play//
| present    | //plays//
| past       | //played//
| past participle    | //played//
| present participle | //plays//

The predictable variables are related to the ones we have seen in nouns and adjectives:
the present tense of verbs varies in the same way as the plural of nouns,
and the past varies in the same way as the comparative of adjectives. The most important variations are
- for verbs ending with //s//, //z//, //x//, //sh//, //ch//: //kiss, kisses, kissed, kissing//
- for verbs ending with consonant + vowel + consonant: //dim, dims, dimmed, dimming//
- for verbs ending with //y// not preceded by a vowel: //cry, cries, cried, crying//
- for verbs ending with //ee//: //free, frees, freed, freeing//
- for verbs ending with //ie//: //die, dies, died, dying//
- for other verbs ending with //e//: //use, uses, used, using//
- for other verbs: //walk, walks, walked, walking//


English also has a couple of hundred **irregular verbs**, whose infinitive, past, and past participle forms have to stored
separately. These free forms determine the other forms in the same way as regular verbs. Thus
- from //cut, cut, cut//, you also get //cuts, cutting//
- from //fly, flew, flown//, you also get //flies, flying//
- from //write, wrote, written//, you also get //writes, writing//


===Structural words===