move the runtime documentation to the main doc folder

2026-06-22 17:56:14 -06:00 · 2021-10-11 08:59:28 +02:00
parent 95c81ec2b7
commit 8d075b1d57
5 changed files with 0 additions and 0 deletions
--- a/doc/hackers-guide/DESIDERATA.md
+++ b/doc/hackers-guide/DESIDERATA.md
@@ -0,0 +1,51 @@
+This is an experiment to develop **a majestic new GF runtime**.
+
+The reason is that there are several features that we want to have and they all require a major rewrite of the existing C runtime.
+Instead of beating the old code until it starts doing what we want, it is time to start from scratch.
+
+# New Features
+
+The features that we want are:
+
+- We want to support **even bigger grammars that don't fit in the main memory** anymore. Instead, they should reside on the disc and parts will be loaded on demand.
+The current design is that all memory allocated for the grammars should be from memory-mapped files. In this way the only limit for the grammar size will
+be the size of the virtual memory, i.e. 2^64 bytes. The swap file is completely circumvented, while all of the available RAM can be used as a cache for loading parts
+of the grammar.
+
+- We want to be able to **update grammars dynamically**. This is a highly desired feature since recompiling large grammars takes hours.
+Instead, dynamic updates should happen instantly.
+
+- We want to be able to **store additional information in the PGF**. For example that could be application specific semantic data.
+Another example is to store the source code of the different grammar rules, to allow the compiler to recompile individual rules.
+
+- We want to **allow a single file to contain slightly different versions of the grammar**. This will be a kind of a version control system,
+which will allow different users to store their own grammar extensions while still using the same core content.
+
+- We want to **avoid the exponential explosion in the size of PMCFG** for some grammars. This happens because PMCFG as a formalism is too low-level.
+By enriching it with light-weight variables, we can make it more powerful and hopefully avoid the exponential explosion.
+
+- We want to finally **ditch the old Haskell runtime** which has long outlived its time.
+
+There are also two bugs in the old C runtime whose fixes will require a lot of changes, so instead of fixing the old runtime we do it here:
+
+- **Integer literals in the C runtime** are implemented as 32-bit integers, while the Haskell runtime used unlimited integers.
+Python supports unlimited integers too, so it would be nice to support them in the new runtime as well.
+
+- The old C runtime assumed that **String literals are terminated with the NULL character**. None of the modern languages (Haskell, Python, Java, etc) make
+that assumption, so we should drop it too.
+
+# Consequences
+
+The desired features will have the following implementation consequences.
+
+- The switch from memory-based to disc-based runtime requires one big change. Before it was easy to just keep a pointer from one object to another.
+Unfortunately this doesn't work with memory-mapped files, since every time you map a file into memory it may end up at a different virtual address.
+Instead we must use file offsets. In order to make programming simpler, the new runtime will be **implemented in C++ instead of C**. This allows us to overload
+the arrow operator (`->`) which will dynamically convert file offsets to in-memory pointers.
+
+- The choice of C++ also allows us to ditch the old `libgu` library and **use STL** instead.
+
+- The content of the memory mapped files is platform-specific. For that reason there will be two grammar representations:
+     - **Native Grammar Format** (`.ngf`) - which will be instantly loadable by just mapping it to memory, but will be platform-dependent.
+     - **Portable Grammar Format** (`.pgf`) - which will take longer to load but will be more compact and platform independent.
+  The runtime will be able to load `.pgf` files and convert them to `.ngf`. Conversely `.pgf` can be exported from the current `.ngf`.
--- a/doc/hackers-guide/README.md
+++ b/doc/hackers-guide/README.md
@@ -0,0 +1,15 @@
+# The Hacker's Guide to GF
+
+This is the hacker's guide to GF. For the guide to the galaxy, see the full edition [here](https://en.wikipedia.org/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy).
+Here we will limit ourselves to the vastly narrower domain of the [GF](https://www.grammaticalframework.org) runtime. This means that we will not meet
+any [Vogons](https://en.wikipedia.org/wiki/Vogon), but we will touch upon topics like memory management, databases, transactions, compilers,
+functional programming, theorem proving and sometimes even languages. Subjects that no doubt would interest any curious hacker.
+
+So, **Don't Panic!** and keep reading. This is a live document and will develop together with the runtime itself.
+
+**TABLE OF CONTENTS**
+
+1. [Desiderata](DESIDERATA.md)
+2. [Memory Model](memory_model.md)
+3. [Abstract Expressions](abstract_expressions.md)
+4. [Transactions](transactions.md)
--- a/doc/hackers-guide/abstract_expressions.md
+++ b/doc/hackers-guide/abstract_expressions.md
@@ -0,0 +1,192 @@
+# Data Marshalling Strategies
+
+The runtime is designed to be used from a high-level programming language, which means that there are frequent foreign calls between the host language and C. This also implies that all the data must be frequently marshalled between the binary representations of the two languages. This is usually trivial and well supported for primitive types like numbers and strings but for complex data structures we need to design our own strategy.
+
+The most central data structure in GF is of course the abstract syntax expression. The other two secondary but closely related structures are types and literals. These are complex structures and no high-level programming language will let us manipulate them directly unless they are in a format that the runtime of the language understands. There are three main strategies to deal with complex data across a language boundary:
+
+1. Keep the data in the C world and provide only an opaque handle to the host language. This means that all operations over the data must be done in C via foreign calls.
+2. Design a native host-language representation. For each foreign call the data is copied from the host language to the C representation and vice versa. Copying is obviously bad, but not too bad if the data is small. The added benefit is that now both languages have first-class access to the data. As a bonus, the garbage collector of the host language now understands the data and can immediately release it if part of it becomes unreachable.
+3. Keep the data in the host language. The C code has only an indirect access via opaque handles and calls back to the host language. The program in the host language has first-class access and the garbage collector can work with the data. No copying is needed.
+
+The old C runtime used option 1. Obviously, this means that abstract expressions cannot be manipulated directly, but this is not the only problem. When the application constructs abstract expressions from different pieces, a whole lot of overhead is added. First, the design was such that data in C must always be allocated from a memory pool. This means that even if we want to make a simple function application, we first must allocate a pool, which adds memory overhead. In addition, the host language must allocate an object which wraps around the C structure. The net effect is that while the plain abstract function application requires the allocation of only two pointers, the actually allocated data may be several times bigger if the application builds the expression piece by piece. The situation is better if the expression is created entirely by the runtime and the application just needs to keep a reference to it.
+
+Another problem is that when the runtime has to create a whole bunch of expressions, for instance as a result of parsing or of random and exhaustive generation, all the expressions are allocated in the same memory pool. The application gets separate handles to each of the produced expressions, but the memory pool is released only after all of the handles become unreachable. Obviously the problem here is that different expressions share the same pool. Unfortunately this is hard to avoid since, although the expressions are different, they usually share common subexpressions. Identifying the shared parts would be expensive and in the end it might mean that each expression node must be allocated in its own pool.
+
+The path taken in the new runtime is a combination of strategies 2 and 3. The abstract expressions are stored in the heap of the host language and use a representation native to that language.
+
+# Abstract Expressions in Different Languages
+
+In Haskell, abstract expressions are represented with an algebraic data type:
+```Haskell
+data Expr =
+   EAbs BindType Var Expr
+ | EApp Expr Expr
+ | ELit Literal
+ | EMeta  MetaId
+ | EFun   Fun
+ | EVar   Int
+ | ETyped Expr Type
+ | EImplArg Expr
+```
+while in Python and all other object-oriented languages an expression is represented with objects of different classes:
+```Python
+class Expr: pass
+class ExprAbs(Expr): pass
+class ExprApp(Expr): pass
+class ExprLit(Expr): pass
+class ExprMeta(Expr): pass
+class ExprFun(Expr): pass
+class ExprVar(Expr): pass
+class ExprTyped(Expr): pass
+class ExprImplArg(Expr): pass
+```
+
+The runtime needs its own representation as well but only when an expression is stored in a .ngf file. This happens for instance with all types in the abstract syntax of the grammar. Since the type system allows dependent types, some type signatures might contain expressions too. Abstract expressions also appear in function definitions, i.e. in the def rules.
+
+Expressions in the runtime are represented with C structures which in turn may contain tagged references to other structures. The lowest four bits of each reference encode the type of structure that it points to, while the rest contain the file offset in the memory mapped file. For example, function application is represented as:
+```C++
+struct PgfExprApp {
+  static const uint8_t tag = 1;
+
+  PgfExpr fun;
+  PgfExpr arg;
+};
+```
+Here the constant `tag` says that any reference to a PgfExprApp structure must contain the value 1 in its lowest four bits. The fields `fun` and `arg` refer to the function and the argument for that application. The type PgfExpr is defined as:
+```C++
+typedef uintptr_t object;
+typedef object PgfExpr;
+```
+In order to dereference an expression, we first need to pattern match and then obtain a `ref<>` object:
+```C++
+switch (ref<PgfExpr>::get_tag(e)) {
+...
+case PgfExprApp::tag: {
+    auto eapp = ref<PgfExprApp>::untagged(e);
+    // do something with eapp->fun and eapp->arg
+    ...
+    break;
+}
+...
+}
+```
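The bit manipulation behind tagged references can be sketched in a few lines of Python. This is only an illustration of the idea, not the runtime's code: the helper names `tagged`, `get_tag` and `untagged` mirror the C++ snippet above but are defined here from scratch, and the sketch assumes that offsets are 16-byte aligned so that their lowest four bits are free to hold the tag.

```python
TAG_BITS = 4
TAG_MASK = (1 << TAG_BITS) - 1   # 0b1111, the lowest four bits

def tagged(offset, tag):
    """Combine an offset and a tag (assumes a 16-byte aligned offset)."""
    if offset & TAG_MASK:
        raise ValueError("offset must leave the lowest four bits free")
    if not 0 <= tag <= TAG_MASK:
        raise ValueError("tag must fit in four bits")
    return offset | tag

def get_tag(ref):
    """Return the kind of structure the reference points to."""
    return ref & TAG_MASK

def untagged(ref):
    """Return the plain offset with the tag bits cleared."""
    return ref & ~TAG_MASK

EXPR_APP_TAG = 1                 # same value as PgfExprApp::tag in the text
e = tagged(0x1230, EXPR_APP_TAG)
```

With these definitions `get_tag(e)` selects the branch of the `switch`, and `untagged(e)` gives the offset that a `ref<PgfExprApp>` would wrap.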
+
+The representation in the runtime is internal and should never be exposed to the host language. Moreover, these structures live in the memory mapped file and, as we discussed in Section "[Memory Model](memory_model.md)", accessing them requires special care. This also means that occasionally the runtime must make a copy from the native representation to the host representation and vice versa. For example, the function:
+```Haskell
+functionType :: PGF -> Fun -> Maybe Type
+```
+must look up the type of an abstract syntax function in the .ngf file and return its type. The type, however, is in the native representation and it must first be copied in the host representation. The converse also happens. When the compiler wants to add a new abstract function to the grammar, it creates its type in the Haskell heap, which the runtime later copies to the native representation in the .ngf file. This is not much different from any other database. The database file usually uses a different data representation than what the host language has.
+
+In most other runtime operations, copying is not necessary. The only thing that the runtime needs to know is how to create new expressions in the heap of the host and how to pattern match on them. For that it calls back to code implemented differently for each host language. For example in:
+```Haskell
+readExpr :: String -> Maybe Expr
+```
+the runtime knows how to read an abstract syntax expression, while for the construction of the actual value it calls back to Haskell. Similarly:
+```Haskell
+showExpr :: [Var] -> Expr -> String
+```
+uses code implemented in Haskell to pattern match on the different algebraic constructors, while the text generation itself happens inside the runtime.
+
+# Marshaller and Unmarshaller
+
+The marshaller and the unmarshaller are the two key data structures which bridge together the different representation realms for abstract expressions and types. The structures have two equivalent definitions, one in C++:
+```C++
+struct PgfMarshaller {
+    virtual object match_lit(PgfUnmarshaller *u, PgfLiteral lit)=0;
+    virtual object match_expr(PgfUnmarshaller *u, PgfExpr expr)=0;
+    virtual object match_type(PgfUnmarshaller *u, PgfType ty)=0;
+};
+
+struct PgfUnmarshaller {
+    virtual PgfExpr eabs(PgfBindType btype, PgfText *name, PgfExpr body)=0;
+    virtual PgfExpr eapp(PgfExpr fun, PgfExpr arg)=0;
+    virtual PgfExpr elit(PgfLiteral lit)=0;
+    virtual PgfExpr emeta(PgfMetaId meta)=0;
+    virtual PgfExpr efun(PgfText *name)=0;
+    virtual PgfExpr evar(int index)=0;
+    virtual PgfExpr etyped(PgfExpr expr, PgfType typ)=0;
+    virtual PgfExpr eimplarg(PgfExpr expr)=0;
+    virtual PgfLiteral lint(size_t size, uintmax_t *v)=0;
+    virtual PgfLiteral lflt(double v)=0;
+    virtual PgfLiteral lstr(PgfText *v)=0;
+    virtual PgfType dtyp(int n_hypos, PgfTypeHypo *hypos,
+                         PgfText *cat,
+                         int n_exprs, PgfExpr *exprs)=0;
+    virtual void free_ref(object x)=0;
+};
+```
+and one in C:
+```C
+typedef struct PgfMarshaller PgfMarshaller;
+typedef struct PgfMarshallerVtbl PgfMarshallerVtbl;
+struct PgfMarshallerVtbl {
+    object (*match_lit)(PgfUnmarshaller *u, PgfLiteral lit);
+    object (*match_expr)(PgfUnmarshaller *u, PgfExpr expr);
+    object (*match_type)(PgfUnmarshaller *u, PgfType ty);
+};
+struct PgfMarshaller {
+    PgfMarshallerVtbl *vtbl;
+};
+
+typedef struct PgfUnmarshaller PgfUnmarshaller;
+typedef struct PgfUnmarshallerVtbl PgfUnmarshallerVtbl;
+struct PgfUnmarshallerVtbl {
+    PgfExpr (*eabs)(PgfUnmarshaller *this, PgfBindType btype, PgfText *name, PgfExpr body);
+    PgfExpr (*eapp)(PgfUnmarshaller *this, PgfExpr fun, PgfExpr arg);
+    PgfExpr (*elit)(PgfUnmarshaller *this, PgfLiteral lit);
+    PgfExpr (*emeta)(PgfUnmarshaller *this, PgfMetaId meta);
+    PgfExpr (*efun)(PgfUnmarshaller *this, PgfText *name);
+    PgfExpr (*evar)(PgfUnmarshaller *this, int index);
+    PgfExpr (*etyped)(PgfUnmarshaller *this, PgfExpr expr, PgfType typ);
+    PgfExpr (*eimplarg)(PgfUnmarshaller *this, PgfExpr expr);
+    PgfLiteral (*lint)(PgfUnmarshaller *this, size_t size, uintmax_t *v);
+    PgfLiteral (*lflt)(PgfUnmarshaller *this, double v);
+    PgfLiteral (*lstr)(PgfUnmarshaller *this, PgfText *v);
+    PgfType (*dtyp)(PgfUnmarshaller *this,
+                    int n_hypos, PgfTypeHypo *hypos,
+                    PgfText *cat,
+                    int n_exprs, PgfExpr *exprs);
+    void (*free_ref)(PgfUnmarshaller *this, object x);
+};
+struct PgfUnmarshaller {
+    PgfUnmarshallerVtbl *vtbl;
+};
+```
+Which one you get depends on whether you include `pgf/pgf.h` from C or C++.
+
+As we can see, most of the arguments for the different methods are of type `PgfExpr`, `PgfType` or `PgfLiteral`. These are all just type synonyms for the type `object`, which in turn is nothing but a number with enough bits to hold an address if necessary. The interpretation of the number depends on the realm in which the object lives. The following table shows the interpretations for four languages as well as the one used internally in the .ngf files:
+|          | PgfExpr        | PgfLiteral        | PgfType        |
+|----------|----------------|-------------------|----------------|
+| Haskell  | StablePtr Expr | StablePtr Literal | StablePtr Type |
+| Python   | ExprObject *   | PyObject *        | TypeObject *   |
+| Java     | jobject        | jobject           | jobject        |
+| .NET     | GCHandle       | GCHandle          | GCHandle       |
+| internal | file offset    | file offset       | file offset    |
+
+The marshaller is the structure that lets the runtime pattern match on an expression. When one of the match methods is executed, it checks the kind of expression, literal or type and calls the corresponding method from the unmarshaller which it gets as an argument. That method in turn receives as arguments the corresponding sub-expressions and attributes.
+
+Generally the role of an unmarshaller is to construct things. For example, the variable `unmarshaller` in `PGF2.FFI` is an object which can construct new expressions in the Haskell heap from the already created children. Function `readExpr`, for instance, passes that one to the runtime to instruct it that the result must be in the Haskell realm.
+
+Constructing objects is not the only use of an unmarshaller. The implementation of `showExpr` passes to `pgf_print_expr` an abstract expression in Haskell and the `marshaller` defined in `PGF2.FFI`. That marshaller knows how to pattern match on Haskell expressions and calls the right methods from whatever unmarshaller is given to it. What it will get in that particular case is a special unmarshaller which does not produce new representations of abstract expressions, but generates a string.
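The division of labour between the two structures can be illustrated with a small Python sketch. Everything in it is hypothetical: the classes `EFun` and `EApp` stand in for host expressions, and `Marshaller` and `PrintingUnmarshaller` only imitate the shape of `match_expr`, `efun` and `eapp` from the definitions above. The marshaller inspects one node and hands its parts to the unmarshaller, and the unmarshaller decides what to build from them, here a string.

```python
class EFun:
    def __init__(self, name):
        self.name = name

class EApp:
    def __init__(self, fun, arg):
        self.fun = fun
        self.arg = arg

class Marshaller:
    """Pattern matches on one host expression and reports it to `u`."""
    def match_expr(self, u, expr):
        if isinstance(expr, EApp):
            return u.eapp(expr.fun, expr.arg)
        if isinstance(expr, EFun):
            return u.efun(expr.name)
        raise TypeError("unsupported expression")

class PrintingUnmarshaller:
    """Produces strings instead of new expression objects."""
    def __init__(self, marshaller):
        self.m = marshaller   # used to look inside the sub-expressions

    def efun(self, name):
        return name

    def eapp(self, fun, arg):
        # fun and arg are still host expressions, so match on them again
        return "(" + self.m.match_expr(self, fun) + " " + self.m.match_expr(self, arg) + ")"

marshaller = Marshaller()
expr = EApp(EApp(EFun("f"), EFun("x")), EFun("y"))
text = marshaller.match_expr(PrintingUnmarshaller(marshaller), expr)
```

Swapping `PrintingUnmarshaller` for one whose methods return `EFun` and `EApp` objects would turn the same traversal into a copy operation, which is the sense in which one marshaller can serve several purposes.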
+
+
+# Literals
+
+Finally, a few remarks are in order about how values of the literal types `String`, `Int` and `Float` are represented in the runtime.
+
+`String` is represented as the structure:
+```C
+typedef struct {
+    size_t size;
+    char text[];
+} PgfText;
+```
+Here the first field is the size of the string in number of bytes. The second field is the string itself, encoded in UTF-8. Just like in most modern languages, the string may contain the zero character and that is not an indication for end of string. This means that functions like `strlen` and `strcat` should never be used when working with PgfText. Although the text is not zero-terminated, the runtime always allocates one extra byte at the end of the text content and sets it to zero. That last byte is not included when calculating the field `size`. The purpose is that with that last zero byte the GDB debugger knows how to show the string properly. Most of the time, this doesn't incur any memory overhead either since `malloc` always allocates memory in sizes divisible by the size of two machine words. The consequence is that usually there are some bytes left unused at the end of every string anyway.
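A Python sketch of this layout may make the consequences concrete. The functions `pack_text` and `unpack_text` are hypothetical, and the sketch assumes an 8-byte little-endian `size_t`, which is a property of the platform rather than something the text fixes.

```python
import struct

SIZE_FMT = "<Q"                       # assumption: 8-byte little-endian size_t
SIZE_LEN = struct.calcsize(SIZE_FMT)

def pack_text(s):
    """Lay out a string as a size field, the UTF-8 bytes and one zero byte."""
    data = s.encode("utf-8")
    # size counts only the text bytes, not the extra trailing zero
    return struct.pack(SIZE_FMT, len(data)) + data + b"\x00"

def unpack_text(buf):
    """Read the text back by trusting the size field, never a terminator."""
    (size,) = struct.unpack_from(SIZE_FMT, buf, 0)
    return buf[SIZE_LEN:SIZE_LEN + size].decode("utf-8")

packed = pack_text("a\x00b")          # a string with an embedded zero character
first_zero = packed.index(b"\x00", SIZE_LEN) - SIZE_LEN
```

Here `first_zero` is 1, which is where a `strlen`-style scan would stop, while `unpack_text(packed)` recovers all three characters.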
+
+`Int` is like the integers in Haskell and Python and can have arbitrarily many digits. In the runtime, the value is represented as an array of `uintmax_t` values. Each of these values contains as many decimal digits as it is possible to fit in `uintmax_t`. For example on a 64-bit machine,
+the maximal value that fits is 18446744073709551615. However, the left-most digit here is at most 1, which means that if we want to represent an arbitrary sequence of digits, the length of the sequence must be at most 19. Similarly on a 32-bit machine each value in the array will store 9 decimal digits. Finally the sign of the number is stored as the sign of the first number in the array, which is always treated as `intmax_t`.
+
+Just to have an example, the number `-774763251095801167872` is represented as the array `{-77, 4763251095801167872}`. Note that this representation is not at all suitable for implementing integer arithmetic, but it is very simple for us to use since the runtime only needs to parse and linearize numbers.
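The splitting into 19-digit groups can be checked with a short Python sketch. The functions `to_chunks` and `from_chunks` are hypothetical helpers written for this illustration. Python integers are unbounded, so the sketch does not model the width of `uintmax_t` or `intmax_t`, only the grouping of decimal digits and the placement of the sign.

```python
DIGITS_PER_CHUNK = 19   # decimal digits per array element on a 64-bit machine

def to_chunks(n):
    """Split an integer into groups of decimal digits, most significant first."""
    digits = str(abs(n))
    # Split from the right so that every chunk except the first has 19 digits.
    first_len = len(digits) % DIGITS_PER_CHUNK or DIGITS_PER_CHUNK
    chunks = [int(digits[:first_len])]
    for i in range(first_len, len(digits), DIGITS_PER_CHUNK):
        chunks.append(int(digits[i:i + DIGITS_PER_CHUNK]))
    if n < 0:
        chunks[0] = -chunks[0]   # the sign lives in the first element
    return chunks

def from_chunks(chunks):
    """Rebuild the integer by concatenating the zero-padded digit groups."""
    sign = -1 if chunks[0] < 0 else 1
    digits = str(abs(chunks[0]))
    for c in chunks[1:]:
        digits += str(c).zfill(DIGITS_PER_CHUNK)
    return sign * int(digits)

example = to_chunks(-774763251095801167872)
```

The value of `example` is `[-77, 4763251095801167872]`, matching the array given above.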
+
+`Float` is trivial and is just represented as the type `double` in C/C++. This can also be seen in the type of the method `lflt` in the unmarshaller.
+
--- a/doc/hackers-guide/memory_model.md
+++ b/doc/hackers-guide/memory_model.md
@@ -0,0 +1,136 @@
+# The different storage files
+
+The purpose of the `.ngf` files is to be used as on-disk databases that store grammars. Their format is platform-dependent and they should not be copied from
+one platform to another. In contrast the `.pgf` files are platform-independent and can be moved around. The runtime can import a `.pgf` file and create an `.ngf` file.
+Conversely a `.pgf` file can be exported from an already existing `.ngf` file.
+
+The internal relation between the two files is more interesting. The runtime uses its own memory allocator which always allocates memory from a memory mapped file.
+The file may be explicit or an anonymous one. The `.ngf` is simply a memory image saved in a file. This means that loading the file is always immediate.
+You just create a new mapping and the kernel will load memory pages on demand.
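To show what creating a mapping looks like in practice, here is a small Python sketch using the standard `mmap` module. It is not the runtime's loader: the file name `example.img` and the helper `read_at` are made up for the illustration, and the file is just a stand-in for a memory image.

```python
import mmap
import os
import tempfile

def read_at(path, offset, length):
    """Map `path` read-only and return `length` bytes starting at `offset`."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the kernel brings pages in
        # only when they are touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[offset:offset + length]

# Build a small stand-in image file (hypothetical name) and read from it.
tmpdir = tempfile.mkdtemp()
image_path = os.path.join(tmpdir, "example.img")
with open(image_path, "wb") as f:
    f.write(b"\x00" * 4096 + b"grammar")

chunk = read_at(image_path, 4096, 7)
```

Creating the mapping does not read the file up front, which is why opening an image can be fast regardless of its size.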
+
+On the other hand a `.pgf` file is a version of the grammar serialized in a platform-independent format. This means that loading this type of file is always slower.
+Fortunately, you can always create an `.ngf` file from it to speed up later reloads.
+
+The runtime has three ways to load a grammar:
+
+#### 1. Loading a `.pgf`
+```Haskell
+readPGF :: FilePath -> IO PGF
+```
+This loads the `.pgf` into an anonymous memory-mapped file. In practice, this means that instead of allocating memory from an explicit file, the runtime will still
+use the normal swap file.
+
+#### 2. Loading a `.pgf` and booting a new `.ngf`
+```Haskell
+bootPGF :: FilePath -> FilePath -> IO PGF
+```
+The grammar is loaded from a `.pgf` (the first argument) and the memory is mapped to an explicit `.ngf` (second argument). The `.ngf` file is created by the function
+and a file with the same name should not exist before the call.
+
+#### 3. Loading an existing memory image
+```Haskell
+readNGF :: FilePath -> IO PGF
+```
+Once an `.ngf` file exists, it can be mapped back to memory by using this function. This call is always guaranteed to be fast. The same function can also
+create new empty `.ngf` files. If the file does not exist, then a new one will be created which contains an empty grammar. The grammar could then be extended
+by dynamically adding functions and categories.
+
+# The content of an `.ngf` file
+
+The `.ngf` file is a memory image but this is not the end of the story. The problem is that there is no way to control at which address the memory image would be
+mapped. On Posix systems, `mmap` takes as hint the mapping address but the kernel may choose to ignore it. There is also the flag `MAP_FIXED`, which makes the hint
+into a constraint, but then the kernel may fail to satisfy the constraint. For example that address may already be used for something else. Furthermore, if the
+same file is mapped from several processes (if they all load the same grammar), it would be difficult to find an address which is free in all of them.
+Last but not least using `MAP_FIXED` is considered a security risk.
+
+Since the start address of the mapping can change, using traditional memory pointers within the mapped area is not possible. The only option is to use offsets
+relative to the beginning of the area. In other words, if normally we would have written `p->x`, now we have the offset `o` which we must use like this:
+```C++
+((A*) (current_base+o))->x
+```
+
+Writing the explicit pointer arithmetic and typecasts each time we dereference a pointer is no better than Vogon poetry, and it
+becomes worse when using a chain of arrow operators. The solution is to use operator overloading in C++.
+There is the type `ref<A>` which wraps around a file offset to a data item of type `A`. The operators `->` and `*`
+are overloaded for the type and they do the necessary pointer arithmetic and type casts.
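The reason offsets survive a change of mapping address can be shown with a Python sketch. The class `Ref` below is a hypothetical stand-in for `ref<A>`, with a class attribute playing the role of `current_base`; two separate buffers simulate the same image mapped at two different addresses.

```python
import struct

class Ref:
    """Hypothetical stand-in for ref<A>: an offset resolved on every access."""
    current_base = None   # the mapped area currently in use (bytes-like)

    def __init__(self, offset):
        self.offset = offset

    def load_u32(self, field_offset=0):
        # Same idea as ((A*) (current_base+o))->x for a 32-bit field x.
        return struct.unpack_from("<I", Ref.current_base,
                                  self.offset + field_offset)[0]

# One image with two 32-bit fields, placed in two separate buffers.
image = struct.pack("<II", 7, 42)
mapping_1 = bytearray(image)
mapping_2 = bytearray(image)

r = Ref(4)                      # offset of the second field
Ref.current_base = mapping_1
first = r.load_u32()
Ref.current_base = mapping_2
second = r.load_u32()
```

The reference `r` is never updated, yet it yields the same value under both mappings, which is what a raw pointer could not do.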
+
+This solves the problem with code readability but creates another problem. How do `->` and `*` know the address of the memory mapped area? Obviously,
+`current_base` must be a global variable and there must be a way to initialize it. More specifically it must be thread-local to allow different threads to
+work without collisions.
+
+A database (a memory-mapped file) in the runtime is represented by the type `DB`. Before any of the data in the database is accessed, the database must
+be brought into scope. Bringing into scope means that `current_base` is initialized to point to the mapping area for that database. After that any dereferencing
+of a reference will be done relative to the corresponding database. This is how scopes are defined:
+```C++
+{
+    DB_scope scope(db, READER_SCOPE);
+    ...
+}
+```
+Here `DB_scope` is a helper type and `db` is a pointer to the database that you want to bring into scope. The constructor for `DB_scope` saves the old value
+for `current_base` and then sets it to point to the area of the given database. Conversely, the destructor restores the previous value.
+
+The use of `DB_scope` is reentrant, i.e. you can do this:
+```C++
+{
+    DB_scope scope(db1, READER_SCOPE);
+    ...
+    {
+        DB_scope scope(db2, READER_SCOPE);
+        ...
+    }
+    ...
+}
+```
+What you can't do is to have more than one database in scope simultaneously. Fortunately, that is not needed. All API functions start a scope
+and the internals of the runtime always work with the current database in scope.
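The save-and-restore behaviour of `DB_scope` can be imitated with a Python context manager. This is a sketch of the scoping discipline only: `db_scope` and `current_base` are hypothetical Python names, the databases are plain strings, and the reader/writer flag is left out.

```python
import threading
from contextlib import contextmanager

_state = threading.local()   # plays the role of the thread-local current_base

def current_base():
    """Return the database in scope for this thread, or None."""
    return getattr(_state, "base", None)

@contextmanager
def db_scope(db):
    """Bring `db` into scope and restore the previous one on exit."""
    saved = current_base()       # what the constructor would save
    _state.base = db
    try:
        yield
    finally:
        _state.base = saved      # what the destructor would restore

seen = []
with db_scope("db1"):
    seen.append(current_base())
    with db_scope("db2"):
        seen.append(current_base())
    seen.append(current_base())
seen.append(current_base())
```

After the run, `seen` records `db1`, `db2`, `db1` and finally `None`: only one database is in scope at any moment, and leaving the inner scope brings the outer one back.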
+
+Note the flag `READER_SCOPE`. You can use either `READER_SCOPE` or `WRITER_SCOPE`. In addition to selecting the database, the `DB_scope` also enforces
+the single writer/multiple readers policy. The main problem is that a writer may have to enlarge the current file, which consequently may mean
+that the kernel should relocate the mapping area to a new address. If there are readers at the same time, they may break since they expect that the mapped
+area is at a particular location.
+
+# Developing writers
+
+There is one important complication when developing procedures modifying the database. Every call to `DB::malloc` may potentially have to enlarge the mapped area
+which sometimes leads to changing `current_base`. That would not have been a problem if GCC was not sometimes caching variables in registers. Look at the following code:
+```C++
+p->r = foo();
+```
+Here `p` is a reference which is used to access another reference `r`. On the other hand, `foo()` is a procedure which directly or indirectly calls `DB::malloc`.
+GCC compiles assignments by first computing the address to modify, and then it evaluates the right hand side. This means that while `foo()` is being evaluated the address computed on the left-hand side is saved in a register or somewhere in the stack. But now, if it happens that the allocation in `foo()` has changed
+`current_base`, then the saved address is no longer valid.
+
+That first problem is solved by overloading the assignment operator for `ref<A>`:
+```C++
+ref<A>& operator= (const ref<A>& r) {
+    offset = r.offset;
+    return *this;
+}
+```
+On first sight, nothing special happens here and it looks like the overloading is redundant. However, now the assignments are compiled in a very different way.
+The overloaded operator is inlined, so there is no real method call and we don't get any overhead. The real difference is that now, whatever is on the left-hand side of the assignment becomes the value of the `this` pointer, and `this` is always the last thing to be evaluated in a method call. This solves the problem.
+`foo()` is evaluated first and if it changes `current_base`, the change will be taken into account when computing the left-hand side of the assignment.
+
+Unfortunately, this is not the only problem. A similar thing happens when the arguments of a function are calls to other functions. See this:
+```C++
+foo(p->r,bar(),q->r)
+```
+Where now `bar()` is the function that performs allocation. The compiler is free to keep in a register the value of `current_base` that it needs for the evaluation of
+`p->r`, while it evaluates `bar()`. But if `current_base` has changed, then the saved value would be invalid while computing `q->r`. There doesn't seem to be
+a workaround for this. The only solution is to:
+
+**Never call a function that allocates as an argument to another function**
+
+Instead we call allocating functions on a separate line and we save the result in a temporary variable.
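The hazard does not depend on C++ as such, and a toy Python model can reproduce it. In the sketch below the class `DB` is hypothetical: its buffer is replaced whenever it has to grow, which imitates a mapping that moves to a new address. Python evaluates expressions in a fixed order, so the stale base is captured explicitly in a variable to mimic what a compiler may do with a register.

```python
class DB:
    """Toy database whose buffer may be replaced when it has to grow."""
    def __init__(self, size=16):
        self.base = bytearray(size)
        self.top = 0

    def malloc(self, n):
        offset = self.top
        self.top += n
        if self.top > len(self.base):
            # Growing moves the data to a new buffer, like a remapped file.
            grown = bytearray(max(self.top, 2 * len(self.base)))
            grown[:len(self.base)] = self.base
            self.base = grown
        return offset

db = DB()
p = db.malloc(8)

# Unsafe pattern: the base is captured before the allocating call runs.
stale_base = db.base
value = db.malloc(64)          # forces the buffer to be replaced
stale_base[p] = value          # lands in the old, abandoned buffer
lost = db.base[p]

# Safe pattern: allocate first, keep the result in a temporary, then store.
tmp = db.malloc(64)
db.base[p] = tmp
kept = db.base[p]
```

In the unsafe half the write succeeds but is invisible in the live buffer, so `lost` is still 0. In the safe half the base is looked up only after the allocation has finished, and `kept` holds the stored offset.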
+
+
+# Thread-local variables
+
+A final remark concerns the compilation of thread-local variables. When a thread-local variable is compiled in position-dependent code, i.e. in executables, it is
+compiled efficiently by using the `fs` register which points to the thread-local segment. Unfortunately, that is not the case by default for shared
+libraries like our runtime. In that case, GCC applies the global-dynamic model which means that access to a thread local variable is internally implemented
+with a call to the function `__tls_get_addr`. Since `current_base` is used all the time, this adds overhead.
+
+The solution is to define the variable with the attribute `__attribute__((tls_model("initial-exec")))` which says that it should be treated as if it is defined
+in an executable. This removes the overhead, but adds the limitation that the runtime should not be loaded with `dlopen`.
--- a/doc/hackers-guide/transactions.md
+++ b/doc/hackers-guide/transactions.md
@@ -0,0 +1,131 @@
+# Transactions
+
+The `.ngf` files that the runtime creates are actual databases which are used to get quick access to the grammars. Like in any database, we also make it possible to dynamically change the data. In our case this means that we can add and remove functions and categories at any time. Moreover, any changes happen in transactions which ensure that changes are not visible until the transaction is committed. The rest of the document describes how the transactions are implemented.
+
+# Databases and Functional Languages
+
+The database model of the runtime is specifically designed to be friendly towards pure functional languages like Haskell. In a usual database, updates happen constantly and therefore executing one and the same query at different times would yield different results. In our grammar databases, queries correspond to operations like parsing, linearization and generation. This means that if we had used the usual database model, all these operations would have to be bound to the IO monad. Consider this example:
+```Haskell
+main = do
+  gr <- readNGF "Example.ngf"
+  functionType gr "f" >>= print
+  -- modify the grammar gr
+  functionType gr "f" >>= print
+```
+Here we ask for the type of a function before and after an arbitrary update in the grammar `gr`. Obviously if we allow that then `functionType` would have to be in the IO monad, e.g.:
+
+```Haskell
+functionType :: PGF -> Fun -> IO Type
+```
+
+Although this is a possible way to go, it would mean that the programmer would have to do all grammar related work in the IO monad. This is not nice and against the spirit of functional programming. Moreover, all previous implementations of the runtime have assumed that most operations are pure. If we go along that path then this will cause a major breaking change.
+
+Fortunately there is an alternative. Read-only operations remain pure functions, but any update should create a new revision of the database rather than modifying the existing one. Compare this example with the previous:
+```Haskell
+main = do
+  gr <- readNGF "Example.ngf"
+  print (functionType gr "f")
+  gr2 <- modifyPGF gr $ do
+           -- do all updates here
+  print (functionType gr2 "f")
+```
+Here `modifyPGF` allows us to do updates but the updates are performed on a freshly created clone of the grammar `gr`. The original grammar is never ever modified. After the changes the variable `gr2` is a reference to the new revision. While the transaction is in progress we cannot see the currently changing revision, and therefore all read-only operations can remain pure. Only after the transaction is complete do we get to use `gr2`, which will not change anymore.
+
+Note also that above `functionType` is used with its usual pure type:
+```Haskell
+functionType :: PGF -> Fun -> Type
+```
+This is safe since the API never exposes database revisions which are not complete. Furthermore, the programmer is free to keep several revisions of the same database simultaneously. In this example:
+```Haskell
+main = do
+  gr <- readNGF "Example.ngf"
+  gr2 <- modifyPGF gr $ do
+           -- do all updates here
+  print (functionType gr "f", functionType gr2 "f")
+```
+The last line prints the type of function `"f"` in both the old and the new revision. Both are still available.
+
+The API as described so far would have been complete if all updates were happening in a single thread. In reality we can expect that there might be several threads or processes modifying the database. The database ensures a multiple readers/single writer exclusion but this doesn't mean that another process/thread cannot modify the database while the current one is reading an old revision. In a parallel setting, `modifyPGF` first merges the revision which the process is using with the latest revision in the database. On top of that the specified updates are performed. The final revision after the updates is returned as a result.
+
+**TODO: Interprocess synchronization is still not implemented.**
+
+**TODO: Merges are still not implemented.**
+
+The process can also ask for the latest revision by calling `checkoutPGF`, see below.
+
+# Databases and Imperative Languages
+
+In imperative languages, the state of the program constantly changes and the considerations in the last section do not apply. All read-only operations always work with the latest revision. Below is the previous example translated to Python:
+```Python
+gr = readNGF("Example.ngf")
+print(functionType(gr,"f"))
+with gr.transaction() as t:
+  # do all updates here by using t
+  pass
+print(functionType(gr,"f"))
+```
+Here the first call to `functionType` returns the old type of `"f"`, while the second call retrieves the type after the updates. The transaction itself is initiated by the `with` statement. Inside the `with` statement, `gr` still refers to the old revision since the new one is not complete yet. If the `with` statement finishes without exceptions, then `gr` is updated to point to the new revision. If an exception occurs, then the new revision is discarded, which corresponds to a transaction rollback. Inside the `with` block, the object `t` of type `Transaction` provides methods for modifying the data.
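+
+The commit-or-rollback behaviour of the `with` statement can be modelled with a small context manager. The sketch below is a toy model and not the real Python binding: the class `ToyGrammar`, its dictionary of function types, and the use of a plain dictionary as the transaction object are all assumptions made for illustration.
+```Python
+from contextlib import contextmanager
+
+class ToyGrammar:
+    """Hypothetical stand-in for a grammar handle (not the real binding)."""
+
+    def __init__(self, funs):
+        self._funs = dict(funs)          # the current, complete revision
+
+    def function_type(self, name):
+        return self._funs[name]
+
+    @contextmanager
+    def transaction(self):
+        draft = dict(self._funs)         # clone of the current revision
+        yield draft                      # the block updates only the draft
+        self._funs = draft               # reached only if the block did not raise
+
+gr = ToyGrammar({"f": "A -> B"})
+
+with gr.transaction() as t:
+    t["f"] = "A -> C"
+    inside = gr.function_type("f")       # still the old revision
+after_commit = gr.function_type("f")     # the new revision
+
+try:
+    with gr.transaction() as t:
+        t["f"] = "X -> Y"
+        raise RuntimeError("abort")      # forces a rollback
+except RuntimeError:
+    pass
+after_rollback = gr.function_type("f")   # unchanged by the failed transaction
+
+print(inside, "|", after_commit, "|", after_rollback)
+```
+Because the assignment to `self._funs` comes after the `yield`, an exception raised inside the `with` block skips it, and the draft is simply dropped.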
+
+# Branches
+
+Since the database already supports revisions, it is a simple step to support branches as well. A branch is just a revision with a name. When you open a database with `readNGF`, the runtime looks up and returns the revision (branch) with name `master`. There might be other branches as well. You can retrieve a specific branch by calling:
+```Haskell
+checkoutPGF :: PGF -> String -> IO (Maybe PGF)
+```
+Here the string is the branch name. New branches can be created by using:
+```Haskell
+branchPGF :: PGF -> String -> Transaction a -> IO PGF
+```
+Here we start with an existing revision, apply a transaction and store the result in a new branch with the given name.
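+
+The idea that a branch is just a named revision can be illustrated with a toy model. Everything in the sketch below (`ToyDB`, its methods, and the dictionary representation of a revision) is hypothetical and only mirrors the shape of `checkoutPGF` and `branchPGF`; it is not the runtime's API.
+```Python
+class ToyDB:
+    """Hypothetical model: a database maps branch names to revisions."""
+
+    def __init__(self, master):
+        self.branches = {"master": dict(master)}
+
+    def checkout(self, name):
+        # Mirrors the Maybe result of checkoutPGF: None if no such branch.
+        return self.branches.get(name)
+
+    def branch(self, base, name, update):
+        rev = dict(base)                 # start from an existing revision
+        update(rev)                      # apply the transaction to the copy
+        self.branches[name] = rev        # store the result under the new name
+        return rev
+
+db = ToyDB({"f": "A -> B"})
+master = db.checkout("master")
+db.branch(master, "experimental", lambda rev: rev.update(g="B -> C"))
+
+print(sorted(db.checkout("experimental")))
+print(sorted(db.checkout("master")))
+print(db.checkout("missing"))
+```
+The `master` branch is left untouched, and the new function `g` is visible only in the `experimental` branch.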
+
+# Implementation
+
+The low-level API for transactions consists of only four functions:
+```C
+PgfRevision pgf_clone_revision(PgfDB *db, PgfRevision revision,
+                               PgfText *name,
+                               PgfExn *err);
+
+void pgf_free_revision(PgfDB *pgf, PgfRevision revision);
+
+void pgf_commit_revision(PgfDB *db, PgfRevision revision,
+                         PgfExn *err);
+
+PgfRevision pgf_checkout_revision(PgfDB *db, PgfText *name,
+                                  PgfExn *err);
+```
+Here `pgf_clone_revision` makes a copy of an existing revision and, if `name` is not `NULL`, changes its name. The new revision is transient and exists only until it is released with `pgf_free_revision`. Transient revisions can be updated with the API for adding functions and categories. To make a revision persistent, call `pgf_commit_revision`. After the revision is made persistent, it will stay in the database even after you call `pgf_free_revision`. Moreover, it will replace the last persistent revision with the same name. The old revision will then become transient and will exist only until all clients call `pgf_free_revision` for it.
+
+Persistent revisions can never be updated. Instead, you clone a persistent revision to create a new transient one. The transient revision is updated and, once committed, replaces the existing persistent revision.
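+
+The lifecycle of transient and persistent revisions can be summarized in a toy model. The sketch below is written in Python for brevity and is not a binding to the C functions above. The class `ToyRevisionStore` and its methods are hypothetical, a simple counter stands in for "all clients have called `pgf_free_revision`", and the data is copied in full instead of being shared.
+```Python
+class ToyRevisionStore:
+    """Hypothetical model of the revision lifecycle, not the real C API."""
+
+    def __init__(self):
+        self.revisions = {}       # revision id -> {"name", "data", "refs"}
+        self.persistent = {}      # name -> id of the persistent revision
+        self._next_id = 0
+
+    def create(self, name, data):
+        rid = self._next_id
+        self._next_id += 1
+        self.revisions[rid] = {"name": name, "data": dict(data), "refs": 1}
+        return rid
+
+    def clone(self, rid, name=None):
+        old = self.revisions[rid]
+        # The copy is transient until it is committed.
+        return self.create(name or old["name"], old["data"])
+
+    def commit(self, rid):
+        name = self.revisions[rid]["name"]
+        old = self.persistent.get(name)
+        self.persistent[name] = rid          # replaces the old revision
+        # The replaced revision is transient now; drop it if nobody uses it.
+        if old is not None and old != rid and self.revisions[old]["refs"] == 0:
+            del self.revisions[old]
+
+    def checkout(self, name):
+        rid = self.persistent[name]
+        self.revisions[rid]["refs"] += 1
+        return rid
+
+    def free(self, rid):
+        rev = self.revisions[rid]
+        rev["refs"] -= 1
+        if rev["refs"] == 0 and self.persistent.get(rev["name"]) != rid:
+            del self.revisions[rid]          # transient and unused
+
+store = ToyRevisionStore()
+v1 = store.create("master", {"f": "A -> B"})
+store.commit(v1)
+
+v2 = store.clone(v1)                         # transient working copy
+store.revisions[v2]["data"]["f"] = "A -> C"
+store.commit(v2)                             # v2 replaces v1 as "master"
+
+store.free(v1)     # old revision: transient and now unused, so it is removed
+store.free(v2)     # persistent: stays in the database
+print(sorted(store.revisions), store.persistent)
+```
+After the last two calls only the committed revision `v2` remains, even though its client has released it.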
+
+This design for transactions may sound unusual, but it is just another way to present the copy-on-write strategy. There, instead of keeping transaction logs, each change to the data is written in a new place and the result is made available only after all changes are in place. This is, for instance, what [LMDB](http://www.lmdb.tech/doc/) (Lightning Memory-Mapped Database) does, and it has also served as an inspiration for us.
+
+## Functional Data Structures
+
+From an imperative point of view, it may sound wasteful that a new copy of the grammar is created for each transaction. Functional programmers on the other hand know that with a functional data structure, you can make a copy which shares as much of the data with the original as possible. Each new version copies only those bits that are different from the old one. For example the main data structure that we use to represent the abstract syntax of a grammar is a size-balanced binary tree as described by:
+
+- Stephen Adams, "Efficient sets: a balancing act", Journal of Functional Programming 3(4):553-562, October 1993, http://www.swiss.ai.mit.edu/~adams/BB/.
+
+- J. Nievergelt and E.M. Reingold, "Binary search trees of bounded balance", SIAM Journal on Computing 2(1), March 1973.
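+
+The sharing between versions can be seen in a minimal sketch of path copying. The code below is a plain, unbalanced binary search tree in Python. It leaves out the size-based rebalancing from the papers above and is not the runtime's actual data structure. The names `Node`, `insert` and `lookup` are chosen for illustration.
+```Python
+from collections import namedtuple
+
+# An immutable tree node; None is the empty tree.
+Node = namedtuple("Node", "key value left right")
+
+def insert(tree, key, value):
+    """Return a new tree; only the nodes on the search path are copied."""
+    if tree is None:
+        return Node(key, value, None, None)
+    if key < tree.key:
+        return tree._replace(left=insert(tree.left, key, value))
+    if key > tree.key:
+        return tree._replace(right=insert(tree.right, key, value))
+    return tree._replace(value=value)
+
+def lookup(tree, key):
+    while tree is not None:
+        if key == tree.key:
+            return tree.value
+        tree = tree.left if key < tree.key else tree.right
+    return None
+
+old = None
+for name in ["m", "f", "t", "b", "h"]:
+    old = insert(old, name, name.upper())
+
+new = insert(old, "a", "A")              # copies only the path m -> f -> b
+
+print(lookup(old, "a"), lookup(new, "a"))
+print(new.right is old.right)            # the untouched subtree is shared
+```
+Both `old` and `new` remain valid trees. The insertion allocates one new leaf and copies three nodes, while the subtrees rooted at `t` and `h` are the very same objects in both versions.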
+
+
+## Garbage Collection
+
+We use reference counting to keep track of which objects should be kept alive. For instance, `pgf_free_revision` knows that a transient revision should be removed only when its reference count reaches zero. This means that there is no process or thread using it. The function also checks whether the revision is persistent. Persistent revisions are never removed since they can always be retrieved with `checkoutPGF`.
+
+Clients are supposed to correctly use `pgf_free_revision` to indicate that they don't need a revision any more. Unfortunately, this is not always possible to guarantee. For example many languages with garbage collection will call `pgf_free_revision` from a finalizer method. In some languages, however, the finalizer is not guaranteed to be executed if the process terminates before the garbage collection is done. Haskell is one of those languages. Even in languages with reference counting like Python, the process may get killed by the operating system and then the finalizer may still not be executed.
+
+The solution is that we count on the database clients to correctly report when a revision is not needed. However, on a fresh database restart we explicitly clean up all leftover transient revisions. This means that even if a client is killed or if it does not correctly release its revisions, the worst that can happen is a memory leak until the next restart.
+
+
+## Atomicity
+
+The transactions serve two goals. First they make it possible to isolate readers from seeing unfinished changes from writers. Second, they ensure atomicity. A database change should be either completely done or not done at all. The use of transient revisions ensures the isolation but the atomicity is only partly taken care of.
+
+Think about what happens when a writer starts updating a transient revision. All the data is allocated in a memory-mapped file. From the point of view of the runtime, all changes happen in memory. When all is done, the runtime calls `msync`, which tells the kernel to flush all dirty pages to disk. The problem is that the kernel is also free to flush pages at any time. For instance, if there is not enough memory, it may decide to swap out pages earlier and reuse the released physical space to swap in other virtual pages. This would be fine if the transaction eventually succeeds. However, if the transaction fails, then the image in the file has already been changed.
+
+We can avoid the situation by calling [mlock](https://man7.org/linux/man-pages/man2/mlock.2.html) and telling the kernel that certain pages should not be swapped out. The question is which pages to lock. We can lock them all, but this is too much. That would mean that as soon as a page is touched it will never leave the physical memory. Instead, it would have been nice to tell the kernel -- feel free to swap out clean pages but, as soon as they get dirty, keep them in memory until further notice. Unfortunately there is no way to do that directly.
+
+The workaround is to first use [mprotect](https://man7.org/linux/man-pages/man2/mprotect.2.html) and keep all pages as read-only. Any attempt to change a page will cause a segmentation fault, which we can capture. If the change happens during a transaction, then we can immediately lock the page and add it to the list of modified pages. When a transaction is successful, we sync all modified pages. If an attempt to change a page happens outside of a transaction, then this is either a bug in the runtime or the client is trying to change an address which it should not change. In any case this prevents unintended changes in the data.
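+
+The bookkeeping behind this workaround can be simulated without any system calls. The sketch below is a pure Python model under strong assumptions: two byte arrays stand in for the file image and the mapped memory, an exception stands in for the captured segmentation fault, and a set of page numbers stands in for the pages locked with `mlock`. The class `ToyPagedStore` is hypothetical, and restoring pages on rollback is an assumption of the sketch, not something the design above prescribes.
+```Python
+PAGE_SIZE = 4096
+
+class ToyPagedStore:
+    """Hypothetical model of write-protected pages; no real mprotect/mlock."""
+
+    def __init__(self, n_pages):
+        self.disk = bytearray(n_pages * PAGE_SIZE)     # the file image
+        self.memory = bytearray(self.disk)             # the mapped view
+        self.dirty = set()                             # pages "locked" in memory
+        self.in_transaction = False
+
+    def begin(self):
+        self.in_transaction = True
+
+    def write(self, offset, data):
+        if not self.in_transaction:
+            # Models the fault on a read-only page outside a transaction.
+            raise PermissionError("write outside of a transaction")
+        first = offset // PAGE_SIZE
+        last = (offset + len(data) - 1) // PAGE_SIZE
+        self.dirty.update(range(first, last + 1))      # remember touched pages
+        self.memory[offset:offset + len(data)] = data
+
+    def commit(self):
+        for page in self.dirty:                        # sync modified pages only
+            lo, hi = page * PAGE_SIZE, (page + 1) * PAGE_SIZE
+            self.disk[lo:hi] = self.memory[lo:hi]
+        self._end()
+
+    def rollback(self):
+        for page in self.dirty:                        # assumption: restore from file
+            lo, hi = page * PAGE_SIZE, (page + 1) * PAGE_SIZE
+            self.memory[lo:hi] = self.disk[lo:hi]
+        self._end()
+
+    def _end(self):
+        self.dirty.clear()
+        self.in_transaction = False
+
+store = ToyPagedStore(4)
+store.begin()
+store.write(PAGE_SIZE - 2, b"abcd")                    # straddles pages 0 and 1
+touched = sorted(store.dirty)
+store.commit()
+
+store.begin()
+store.write(0, b"zz")
+store.rollback()                                       # the file image is unchanged
+print(touched, bytes(store.disk[PAGE_SIZE - 2:PAGE_SIZE + 2]), bytes(store.memory[:2]))
+```
+In this model the file image changes only inside `commit`, which is the property that the combination of `mprotect` and `mlock` is meant to guarantee for the real memory-mapped file.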
+
+
+**TODO: atomicity is not implemented yet**