diff --git a/src/runtime/c/doc/memory_model.md b/src/runtime/c/doc/memory_model.md
index 4113d4924..a844c648f 100644
--- a/src/runtime/c/doc/memory_model.md
+++ b/src/runtime/c/doc/memory_model.md
@@ -35,3 +35,53 @@ Once an .ngf file exists, it can be mapped back to memory by using this function
 create new empty .ngf files. If the file does not exist, then a new one will be created which contains an empty grammar. The grammar could then be extended by dynamically adding functions and categories.
+# The content of an .ngf file
+
+The .ngf file is a memory image, but this is not the end of the story. The problem is that there is no way to control at which address the memory image will be
+mapped. On POSIX systems, `mmap` takes the mapping address as a hint, but the kernel may choose to ignore it. There is also the flag `MAP_FIXED`, which turns the hint
+into a constraint, but then the call may fail if the kernel cannot satisfy the constraint, or, if the address is already used for something else, the existing mapping there is silently replaced. Furthermore, if the
+same file is mapped from several processes (if they all load the same grammar), it would be difficult to find an address which is free in all of them.
+Last but not least, using `MAP_FIXED` is considered a security risk.
+
+Since the start address of the mapping can change, using traditional memory pointers within the mapped area is not possible. The only option is to use offsets
+relative to the beginning of the area. In other words, where normally we would have had a pointer `p`, now we have the offset `o`, which is converted to a
+pointer by computing `current_base+o`.
+
+Writing out the pointer arithmetic and the corresponding type casts each time we dereference a pointer is too tedious and verbose. Instead,
+we use operator overloading in C++. There is the type `ref<A>` which wraps around a file offset to a data item of type `A`. The operators `->` and `*`
+are overloaded for that type and they do the necessary pointer arithmetic and type casts.
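
The offset-wrapping reference type described above can be illustrated with a minimal sketch. This is not the runtime's actual definition: only the names `ref` and `current_base` come from the text, while the `std::size_t` offset representation, the `Point` payload, and the `relocation_demo` helper are assumptions made for the example. Two in-process buffers stand in for the same file mapped at two different addresses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>

// Stand-in for the runtime's static base address (the name comes from the
// text; the type and definition here are assumptions).
static unsigned char *current_base = nullptr;

// Illustrative offset-based reference, not the runtime's real `ref`.
template <class A>
class ref {
    std::size_t offset;  // byte offset from the start of the mapped area
public:
    explicit ref(std::size_t o) : offset(o) {}

    // Both operators resolve the offset against whatever base is current.
    A *operator->() const { return reinterpret_cast<A *>(current_base + offset); }
    A &operator*() const { return *reinterpret_cast<A *>(current_base + offset); }
};

struct Point { int x; int y; };  // hypothetical payload type

// Writes a Point at offset 16 of one buffer, copies the raw bytes to a second
// buffer (simulating the same file mapped at another address), and reads the
// value back through the same ref after switching the base.
int relocation_demo() {
    alignas(Point) static unsigned char area1[64];
    alignas(Point) static unsigned char area2[64];

    ref<Point> p(16);

    current_base = area1;
    new (area1 + 16) Point{3, 4};
    assert(p->x == 3 && (*p).y == 4);

    std::memcpy(area2, area1, sizeof area1);
    current_base = area2;       // "remap" at a different address
    int sum = p->x + (*p).y;    // same offset, new base

    current_base = nullptr;
    return sum;
}
```

Because `p` stores only the offset 16, it remains valid after the base changes, which is the property a raw pointer into the mapped area would not have.
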
+
+This solves the problem with code readability but creates another one: how do `->` and `*` know the address of the memory-mapped area? Obviously,
+`current_base` must be a static variable and there must be a way to initialize that variable.
+
+A database (a memory-mapped file) in the runtime is represented by the type `DB`. Before any of the data in the database is accessed, the database must
+be brought into scope. Bringing into scope means that `current_base` is initialized to point to the mapping area for that database. After that, any dereferencing
+of a reference will be done relative to the corresponding database. This is how scopes are defined:
+```C++
+{
+    DB_scope scope(db, READER_SCOPE);
+    ...
+}
+```
+Here `DB_scope` is a helper type and `db` is a pointer to the database that you want to bring into scope. The constructor for `DB_scope` saves the old value
+of `current_base` and then sets it to point to the area of the given database. Conversely, the destructor restores the previous value.
+
+The use of `DB_scope` is reentrant, i.e. you can do this:
+```C++
+{
+    DB_scope scope(db1, READER_SCOPE);
+    ...
+    {
+        DB_scope scope(db2, READER_SCOPE);
+        ...
+    }
+    ...
+}
+```
+What you can't do is have more than one database in scope simultaneously. Fortunately, that is not needed. All API functions start a scope
+and the internals of the runtime always work with the current database in scope.
+
+Note the flag `READER_SCOPE`. You can use either `READER_SCOPE` or `WRITER_SCOPE`. In addition to selecting the database, `DB_scope` also enforces
+the single-writer, multiple-readers policy. The main problem is that a writer may have to enlarge the current file, which in turn may mean
+that the kernel has to relocate the mapping area to a new address. If there are readers at the same time, they may break since they expect that the mapped
+area is at a particular location.
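
The save-and-restore behaviour of the scope helper can be sketched as a small RAII class. This is a simplified illustration under stated assumptions, not the runtime's `DB_scope`: `ToyDB`, `ToyScope`, `ScopeMode`, and `nesting_restores_base` are hypothetical names, and the reader/writer locking is deliberately left out, so the mode argument is accepted but ignored.

```cpp
#include <cassert>

// Stand-in for the runtime's static base address (assumed definition).
static unsigned char *current_base = nullptr;

// Hypothetical database handle: only the base of its mapped area.
struct ToyDB { unsigned char *base; };

// Flag names follow the text; the enum itself is an assumption.
enum ScopeMode { READER_SCOPE, WRITER_SCOPE };

// Simplified scope guard: saves the current base on entry and restores it on
// exit. Locking for the single-writer, multiple-readers policy is omitted.
class ToyScope {
    unsigned char *saved_base;
public:
    ToyScope(ToyDB *db, ScopeMode /*mode, unused in this sketch*/)
        : saved_base(current_base) {
        current_base = db->base;
    }
    ~ToyScope() { current_base = saved_base; }

    ToyScope(const ToyScope &) = delete;
    ToyScope &operator=(const ToyScope &) = delete;
};

// Nests two scopes and checks that each exit restores the previous base.
bool nesting_restores_base() {
    static unsigned char area1[16], area2[16];
    ToyDB db1{area1}, db2{area2};

    assert(current_base == nullptr);
    bool ok = true;
    {
        ToyScope outer(&db1, READER_SCOPE);
        ok = ok && (current_base == area1);
        {
            ToyScope inner(&db2, READER_SCOPE);
            ok = ok && (current_base == area2);
        }
        ok = ok && (current_base == area1);  // inner scope undone
    }
    return ok && (current_base == nullptr);  // outer scope undone
}
```

Tying the restore to the destructor means the previous database comes back into scope on every exit path, including early returns and exceptions, which is what makes the nested form safe.
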