Update memory_model.md

2026-05-24 18:28:55 -06:00 · 2021-08-07 10:36:34 +02:00
parent 72c51f4bf9
commit cc8db24a46
1 changed files with 50 additions and 0 deletions
--- a/src/runtime/c/doc/memory_model.md
+++ b/src/runtime/c/doc/memory_model.md
@@ -35,3 +35,53 @@ Once an .ngf file exists, it can be mapped back to memory by using this function
 create new empty .ngf files. If the file does not exist, then a new one will be created which contains an empty grammar. The grammar could then be extended
 by dynamically adding functions and categories.

+# The content of an .ngf file
+
+The .ngf file is a memory image, but this is not the end of the story. The problem is that there is no way to control at which address the memory image will be
+mapped. On POSIX systems, `mmap` takes the mapping address as a hint, but the kernel may choose to ignore it. There is also the flag `MAP_FIXED`, which turns the hint
+into a constraint, but then the call may fail if the kernel cannot satisfy it, and if the range is already used for something else, that mapping is silently replaced. Furthermore, if the
+same file is mapped from several processes (if they all load the same grammar), it would be difficult to find an address which is free in all of them.
+Last but not least, using `MAP_FIXED` is considered a security risk.
+
+Since the start address of the mapping can change, using traditional memory pointers within the mapped area is not possible. The only option is to use offsets
+relative to the beginning of the area. In other words, where we would normally have had a pointer `p`, we now have an offset `o` which is converted to a
+pointer by using `current_base+o`.
+
+Explicitly writing out the pointer arithmetic and the corresponding typecasts each time we dereference a pointer is too tedious and verbose. Instead,
+we use operator overloading in C++. There is the type `ref<A>`, which wraps a file offset to a data item of type `A`. The operators `->` and `*`
+are overloaded for that type, and they do the necessary pointer arithmetic and typecasts.
+
+This solves the problem with code readability but creates another problem. How do `->` and `*` know the address of the memory mapped area? Obviously,
+`current_base` must be a static variable and there must be a way to initialize that variable.
+
+A database (a memory mapped file) in the runtime is represented by the type `DB`. Before any of the data in the database is accessed, the database must
+be brought into scope. Bringing into scope means that `current_base` is initialized to point to the mapping area for that database. After that, any dereferencing
+of a reference is done relative to the corresponding database. This is how scopes are defined:
+```C++
+{
+    DB_scope scope(db, READER_SCOPE);
+    ...
+}
+```
+Here `DB_scope` is a helper type and `db` is a pointer to the database that you want to bring into scope. The constructor for `DB_scope` saves the old value
+of `current_base` and then sets it to point to the area of the given database. Conversely, the destructor restores the previous value.
+
+The use of `DB_scope` is reentrant, i.e. you can do this:
+```C++
+{
+    DB_scope scope(db1, READER_SCOPE);
+    ...
+    {
+        DB_scope scope(db2, READER_SCOPE);
+        ...
+    }
+    ...
+}
+```
+What you can't do is have more than one database in scope simultaneously. Fortunately, that is not needed. All API functions start a scope,
+and the internals of the runtime always work with the database currently in scope.
+
+Note the flag `READER_SCOPE`. You can use either `READER_SCOPE` or `WRITER_SCOPE`. In addition to selecting the database, `DB_scope` also enforces
+the single writer, multiple readers policy. The main problem is that a writer may have to enlarge the current file, which in turn may mean
+that the kernel has to relocate the mapping area to a new address. If there are readers at the same time, they may break since they expect the mapped
+area to be at a particular location.
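
The offset-based references and scopes described in this section can be sketched roughly as follows. This is a minimal illustration and not the runtime's actual code: the names `ref<A>`, `DB`, `DB_scope`, `current_base`, `READER_SCOPE` and `WRITER_SCOPE` come from the text, but their definitions here are assumptions, and `Item` and `read_value` are hypothetical helpers. The sketch ignores the single writer, multiple readers locking entirely.

```C++
#include <cassert>
#include <cstddef>
#include <new>

// Assumed representation: base address of the database currently in scope.
static unsigned char* current_base = nullptr;

// Hypothetical offset wrapper: stores an offset and resolves it against
// current_base every time it is dereferenced.
template <class A>
struct ref {
    std::size_t offset;

    A* operator->() const {
        return reinterpret_cast<A*>(current_base + offset);
    }
    A& operator*() const {
        return *reinterpret_cast<A*>(current_base + offset);
    }
};

// Hypothetical database handle: only remembers where the file is mapped.
struct DB {
    unsigned char* base;
};

enum DB_scope_mode { READER_SCOPE, WRITER_SCOPE };

// RAII helper: the constructor saves the old base and installs the new one,
// the destructor restores the saved value, so scopes can be nested.
class DB_scope {
public:
    DB_scope(DB* db, DB_scope_mode /*mode is ignored in this sketch*/)
        : saved_base(current_base) {
        current_base = db->base;
    }
    ~DB_scope() { current_base = saved_base; }

    DB_scope(const DB_scope&) = delete;
    DB_scope& operator=(const DB_scope&) = delete;

private:
    unsigned char* saved_base;
};

// Hypothetical data item stored inside the mapped area.
struct Item {
    int value;
};

// Reads through a reference while the given database is in scope.
int read_value(DB* db, ref<Item> r) {
    DB_scope scope(db, READER_SCOPE);
    return r->value;
}
```

The same `ref<Item>` value resolves to different addresses depending on which database is in scope, which is why a reference stays valid even if the file is mapped at a different address the next time.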