From 72c51f4bf925dad571288334801d0320daf21d5d Mon Sep 17 00:00:00 2001
From: Krasimir Angelov <kr.angelov@gmail.com>
Date: Sat, 7 Aug 2021 09:44:50 +0200
Subject: [PATCH 1/5] Create memory_model.md

---
 src/runtime/c/doc/memory_model.md | 37 +++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)
 create mode 100644 src/runtime/c/doc/memory_model.md

diff --git a/src/runtime/c/doc/memory_model.md b/src/runtime/c/doc/memory_model.md
new file mode 100644
index 000000000..4113d4924
--- /dev/null
+++ b/src/runtime/c/doc/memory_model.md
@@ -0,0 +1,37 @@
+# The different storage files
+
+The purpose of the .ngf files is to serve as on-disk databases that store grammars. Their format is platform dependent and they should not be copied from
+one platform to another. In contrast, the .pgf files are platform independent and can be moved around. The runtime can import a .pgf file and create an .ngf file.
+Conversely, a .pgf file can be exported from an already existing .ngf file.
+
+The internal relation between the two files is more interesting. The runtime uses its own memory allocator, which always allocates memory from a memory-mapped file.
+The file may be an explicit or an anonymous one. The .ngf is simply a memory image saved in a file. This means that loading the file is always immediate.
+You just create a new mapping and the kernel will load memory pages on demand.
+
+On the other hand, a .pgf file is a version of the grammar serialized in a platform-independent format. This means that loading this type of file is always slower.
+Fortunately, you can always create an .ngf file from it to speed up later reloads.
+
+The runtime has three ways to load a grammar:
+
+* loading a .pgf:
+```Haskell
+readPGF :: FilePath -> IO PGF
+```
+This loads the .pgf into an anonymous memory-mapped file. In practice, this means that instead of allocating memory from an explicit file, the runtime will still
+use the normal swap file.
+
+* loading a .pgf and booting a new .ngf:
+```Haskell
+bootPGF :: FilePath -> FilePath -> IO PGF
+```
+The grammar is loaded from a .pgf (the first argument) and the memory is mapped to an explicit .ngf (second argument). The .ngf file is created by the function
+and a file with the same name should not exist before the call.
+
+* loading an existing memory image:
+```Haskell
+readNGF :: FilePath -> IO PGF
+```
+Once an .ngf file exists, it can be mapped back to memory by using this function. This call is always guaranteed to be fast. The same function can also
+create new empty .ngf files. If the file does not exist, then a new one will be created which contains an empty grammar. The grammar could then be extended
+by dynamically adding functions and categories.
+

From cc8db24a46b9aaf352b836a6b04d932f5b939544 Mon Sep 17 00:00:00 2001
From: Krasimir Angelov <kr.angelov@gmail.com>
Date: Sat, 7 Aug 2021 10:36:34 +0200
Subject: [PATCH 2/5] Update memory_model.md

---
 src/runtime/c/doc/memory_model.md | 50 +++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/src/runtime/c/doc/memory_model.md b/src/runtime/c/doc/memory_model.md
index 4113d4924..a844c648f 100644
--- a/src/runtime/c/doc/memory_model.md
+++ b/src/runtime/c/doc/memory_model.md
@@ -35,3 +35,53 @@ Once an .ngf file exists, it can be mapped back to memory by using this function
 create new empty .ngf files. If the file does not exist, then a new one will be created which contains an empty grammar. The grammar could then be extended
 by dynamically adding functions and categories.
 
+# The content of an .ngf file
+
+The .ngf file is a memory image, but this is not the end of the story. The problem is that there is no way to control at which address the memory image will be
+mapped. On POSIX systems, `mmap` takes the mapping address as a hint, but the kernel may choose to ignore it. There is also the flag `MAP_FIXED`, which turns the hint
+into a constraint, but then the kernel may fail to satisfy the constraint. For example, that address may already be used for something else. Furthermore, if the
+same file is mapped from several processes (if they all load the same grammar), it would be difficult to find an address which is free in all of them. 
+Last but not least using `MAP_FIXED` is considered a security risk.
+
+Since the start address of the mapping can change, using traditional memory pointers withing the mapped area is not possible. The only option is to use offsets
+relative to the beginning of the area. On other words, if normally we would have had a pointer `p`, now we have the offset `o` which is converted to a
+pointer by using `current_base+o`.
+
+Writing explicitly the pointer arithmetics and the corresponding typecasts, each time when we dereference a pointer, is too tedious and verbose. Instead,
+we use the operator overloading in C++. There is the type `ref<A>` which wraps around a file offset to a data item of type `A`. The operators `->` and `*`
+are overloaded for that type and they do the necessary pointer arithmetics and type casts.
+
+This solves the problem with code readability but creates another problem. How do `->` and `*` know the address of the memory mapped area? Obviously,
+`current_base` must be a static variable and there must be a way to initialize that variable.
+
+A database (a memory mapped file) in the runtime is represented by the type `DB`. Before any of the data in the database is accessed, the database must
+be brought into scope. Bringing into scope means that `current_base` is initialized to point to the mapping area for that database. After that any dereferencing
+of a reference will be done relative to the corresponding database. This is how scopes are defined:
+```C++
+{
+    DB_scope scope(db, READER_SCOPE);
+    ...
+}
+```
+Here `DB_scope` is a helper type and `db` is a pointer to the database that you want to bring into scope. The constructor for `DB_scope` saves the old value
+of `current_base` and then sets it to point to the area of the given database. Conversely, the destructor restores the previous value.
+
+The use of `DB_scope` is reentrant, i.e. you can do this:
+```C++
+{
+    DB_scope scope(db1, READER_SCOPE);
+    ...
+    {
+        DB_scope scope(db2, READER_SCOPE);
+        ...
+    }
+    ...
+}
+```
+What you can't do is to have more than one database in scope simultaneously. Fortunately, that is not needed. All API functions start a scope
+and the internals of the runtime always work with the current database in scope.
+
+Note the flag `READER_SCOPE`. You can use either `READER_SCOPE` or `WRITER_SCOPE`. In addition to selecting the database, `DB_scope` also enforces
+the single writer, multiple readers policy. The main problem is that a writer may have to enlarge the current file, which consequently may mean
+that the kernel should relocate the mapping area to a new address. If there are readers at the same time, they way break since they expect that the mapped
+area is at a particular location.
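
A minimal sketch of the `ref<A>` and scope mechanism described in this patch, placed after the hunk so the patch itself is unchanged. It assumes a much simplified model: `DB` here is only a base address, `ScopeSketch` is a hypothetical stand-in for `DB_scope` without the `READER_SCOPE`/`WRITER_SCOPE` flag or any locking, and `Item`, `read_after_remap` and `nested_scopes_restore` are made up for illustration. The runtime's real types will differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>

// Base address of the database currently in scope (a static, as in the text).
static unsigned char* current_base = nullptr;

// Offset-based reference: stores an offset and resolves it on every access.
template <class A>
struct ref {
    std::size_t offset;
    A* operator->() const { return reinterpret_cast<A*>(current_base + offset); }
    A& operator*()  const { return *reinterpret_cast<A*>(current_base + offset); }
};

// Hypothetical stand-in for the runtime's DB type: only a base address.
struct DB { unsigned char* base; };

// Simplified scope guard (hypothetical name). It only saves and restores
// current_base; the reader/writer policy of the real DB_scope is omitted.
class ScopeSketch {
    unsigned char* saved;
public:
    explicit ScopeSketch(DB* db) : saved(current_base) {
        assert(db != nullptr);
        current_base = db->base;
    }
    ~ScopeSketch() { current_base = saved; }
    ScopeSketch(const ScopeSketch&) = delete;
    ScopeSketch& operator=(const ScopeSketch&) = delete;
};

struct Item { int x; };  // illustrative payload type

// Stores an Item at an offset in one area, copies the bytes to another
// address, and reads the same offset back through the new base.
int read_after_remap() {
    alignas(Item) unsigned char area1[64] = {};
    alignas(Item) unsigned char area2[64] = {};
    DB db1{area1}, db2{area2};
    ref<Item> r{16};
    {
        ScopeSketch scope(&db1);
        new (current_base + r.offset) Item{42};
    }
    std::memcpy(area2, area1, sizeof area1);  // same image, different address
    ScopeSketch scope(&db2);
    return r->x;
}

// Nested scopes restore the previous base when the inner scope ends.
bool nested_scopes_restore() {
    unsigned char a[8] = {}, b[8] = {};
    DB db1{a}, db2{b};
    ScopeSketch outer(&db1);
    {
        ScopeSketch inner(&db2);
        if (current_base != b) return false;
    }
    return current_base == a;
}
```

The point of the sketch is that `r` holds the same offset in both areas; only `current_base` changes, which is why the image can be mapped at any address.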

From 78d6282da2f4daaaadbd54b104a3142054f5f95a Mon Sep 17 00:00:00 2001
From: Krasimir Angelov <kr.angelov@gmail.com>
Date: Sat, 7 Aug 2021 18:29:31 +0200
Subject: [PATCH 3/5] Create README.md

---
 src/runtime/c/doc/README.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
 create mode 100644 src/runtime/c/doc/README.md

diff --git a/src/runtime/c/doc/README.md b/src/runtime/c/doc/README.md
new file mode 100644
index 000000000..8d0ae7bda
--- /dev/null
+++ b/src/runtime/c/doc/README.md
@@ -0,0 +1,14 @@
+# The Hacker's Guide to GF
+
+This is the hacker's guide to GF; for the guide to the galaxy, see the full edition [here](https://en.wikipedia.org/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy).
+Here we will limit ourselves to the vastly narrower domain of the [GF](https://www.grammaticalframework.org) runtime. This means that we will not meet
+any [Vogons](https://en.wikipedia.org/wiki/Vogon), but we will touch upon topics like memory management, databases, transactions, compilers,
+functional programming, theorem proving and sometimes even languages. Subjects that no doubt would interest any curious hacker.
+
+So, **Don't Panic!** and keep reading. This is a living document and will develop together with the runtime itself.
+
+**TABLE OF CONTENTS**
+
+1. [DESIDERATA](DESIDERATA.md)
+2. [Memory Model](memory_model.md)
+

From bfd839b7b01d0ad68bb69abd0a2520c57b2335c8 Mon Sep 17 00:00:00 2001
From: Krasimir Angelov <kr.angelov@gmail.com>
Date: Sat, 7 Aug 2021 18:29:59 +0200
Subject: [PATCH 4/5] Update README.md

---
 src/runtime/c/doc/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/runtime/c/doc/README.md b/src/runtime/c/doc/README.md
index 8d0ae7bda..298647f5a 100644
--- a/src/runtime/c/doc/README.md
+++ b/src/runtime/c/doc/README.md
@@ -9,6 +9,6 @@ So, **Don't Panick!** and keep reading. This is a live document and will develop
 
 **TABLE OF CONTENTS**
 
-1. [DESIDERATA](DESIDERATA.md)
+1. [Desiderata](DESIDERATA.md)
 2. [Memory Model](memory_model.md)
 

From db8843c8bf5e3500fbc2d146649ec6a0c019d484 Mon Sep 17 00:00:00 2001
From: Krasimir Angelov <kr.angelov@gmail.com>
Date: Sat, 7 Aug 2021 20:39:09 +0200
Subject: [PATCH 5/5] Update memory_model.md

---
 src/runtime/c/doc/memory_model.md | 49 +++++++++++++++++++++++++++----
 1 file changed, 44 insertions(+), 5 deletions(-)

diff --git a/src/runtime/c/doc/memory_model.md b/src/runtime/c/doc/memory_model.md
index a844c648f..33eb96ae3 100644
--- a/src/runtime/c/doc/memory_model.md
+++ b/src/runtime/c/doc/memory_model.md
@@ -44,12 +44,15 @@ same file is mapped from several processes (if they all load the same grammar),
 Last but not least using `MAP_FIXED` is considered a security risk.
 
 Since the start address of the mapping can change, using traditional memory pointers withing the mapped area is not possible. The only option is to use offsets
-relative to the beginning of the area. On other words, if normally we would have had a pointer `p`, now we have the offset `o` which is converted to a
-pointer by using `current_base+o`.
+relative to the beginning of the area. In other words, if normally we would have written `p->x`, now we have the offset `o` which we must use like this:
+```C++
+((A*) (current_base+o))->x
+```
 
-Writing explicitly the pointer arithmetics and the corresponding typecasts, each time when we dereference a pointer, is too tedious and verbose. Instead,
-we use the operator overloading in C++. There is the type `ref<A>` which wraps around a file offset to a data item of type `A`. The operators `->` and `*`
-are overloaded for that type and they do the necessary pointer arithmetics and type casts.
+Writing the explicit pointer arithmetic and typecasts each time we dereference a pointer is no better than Vogon poetry, and it
+becomes worse when using a chain of arrow operators. The solution is to use operator overloading in C++.
+There is the type `ref<A>` which wraps around a file offset to a data item of type `A`. The operators `->` and `*`
+are overloaded for the type and they do the necessary pointer arithmetic and type casts.
 
 This solves the problem with code readability but creates another problem. How do `->` and `*` know the address of the memory mapped area? Obviously,
 `current_base` must be a static variable and there must be a way to initialize that variable.
@@ -85,3 +88,39 @@ Note the flag `READER_SCOPE`. You can use either `READER_SCOPE` or `WRITER_SCOPE
 the single writer, multiple readers policy. The main problem is that a writer may have to enlarge the current file, which consequently may mean
 that the kernel should relocate the mapping area to a new address. If there are readers at the same time, they way break since they expect that the mapped
 area is at a particular location.
+
+# Developing Writers
+
+There is one important complication when developing procedures that modify the database. Every call to `DB::malloc` may potentially have to enlarge the mapped area,
+which sometimes leads to changing `current_base`. That would not have been a problem if GCC were not sometimes caching variables in registers. Look at the following code:
+```C++
+p->r = foo();
+```
+Here `p` is a reference which is used to access another reference `r`. On the other hand, `foo()` is a procedure which directly or indirectly calls `DB::malloc`.
+GCC compiles assignments by first computing the address to modify, and then it evaluates the right-hand side. This means that while `foo()` is being evaluated, the address computed on the left-hand side is saved in a register or somewhere on the stack. But now, if it happens that the allocation in `foo()` has changed
+`current_base`, then the saved address is no longer valid.
+
+This first problem is solved by overloading the assignment operator for `ref<A>`:
+```C++
+ref<A>& operator= (const ref<A>& r) {
+    offset = r.offset;
+    return *this;
+}
+```
+At first sight, nothing special happens here and it looks like the overloading is redundant. However, now the assignments are compiled in a very different way.
+The overloaded operator is inlined, so there is no real method call and we don't get any overhead. The real difference is that now, whatever is on the left-hand side of the assignment becomes the value of the `this` pointer, and `this` is always the last thing to be evaluated in a method call. This solves the problem:
+`foo()` is evaluated first and if it changes `current_base`, the change will be taken into account when computing the left-hand side of the assignment.
+    
+Unfortunately, this is not the only problem. A similar thing happens when the arguments of a function are calls to other functions. See this:
+```C++
+foo(p->r,bar(),q->r)
+```
+Here `bar()` is the function that does the allocation. The compiler is free to keep in a register the value of `current_base` that it needs for the evaluation of
+`p->r`, while it evaluates `bar()`. But if `current_base` has changed, then the saved value would be invalid while computing `q->r`. There doesn't seem to be
+a workaround for this. The only solution is to:
+    
+**Never call a function that allocates as an argument to another function**
+    
+Instead, we call allocating functions on a separate line and save the result in a temporary variable.
+
+
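
A minimal sketch of the "allocate on a separate line" rule from the last hunk, placed after the patch so the hunk counts are untouched. Everything here is an assumption for illustration: the arena is a `std::vector` standing in for the mapped file, `sketch_malloc` is a hypothetical allocator (not `DB::malloc`) that may move the arena when it grows, and `Node`, `make_node` and `build_and_sum` are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

static std::vector<unsigned char> arena;        // stand-in for the mapped file
static unsigned char* current_base = nullptr;   // base of the current mapping

// Offset-based reference, resolved against current_base on every access.
template <class A>
struct ref {
    std::size_t offset;
    A* operator->() const { return reinterpret_cast<A*>(current_base + offset); }
};

// Hypothetical allocator: growing the vector may relocate its storage,
// so current_base is refreshed after every allocation.
std::size_t sketch_malloc(std::size_t n) {
    std::size_t off = arena.size();
    arena.resize(off + n);
    current_base = arena.data();
    return off;
}

struct Node { ref<Node> next; long value; };  // offset 0 plays the role of null

ref<Node> make_node(long v) {
    ref<Node> n{sketch_malloc(sizeof(Node))};             // allocate first
    new (current_base + n.offset) Node{ref<Node>{0}, v};  // then touch memory
    return n;
}

// Builds a list 1..count and returns the sum of its values.
long build_and_sum(long count) {
    assert(count >= 1);
    arena.clear();
    sketch_malloc(sizeof(Node));          // reserve offset 0 so it is never a node
    ref<Node> head = make_node(1);
    ref<Node> tail = head;
    for (long v = 2; v <= count; ++v) {
        ref<Node> fresh = make_node(v);   // allocating call in its own statement
        tail->next = fresh;               // current_base is read after the call
        tail = fresh;
    }
    long sum = 0;
    for (ref<Node> p = head; p.offset != 0; p = p->next) sum += p->value;
    return sum;
}
```

Writing `tail->next = make_node(v);` instead would bring back the hazard described above, because the address of `tail->next` could be computed before the arena moves.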