docs :3

2023-11-21 21:48:26 -07:00
parent d65ac970b1
commit 8d7020d5f4
3 changed files with 108 additions and 41 deletions
--- a/docs/Makefile
+++ b/docs/Makefile
@@ -18,3 +18,4 @@ help:
 # "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
 %: Makefile
 	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
--- a/docs/src/commentary/layout-lexing.rst
+++ b/docs/src/commentary/layout-lexing.rst
@@ -92,9 +92,10 @@ styles often seen in Haskell are somewhat esoteric compared to Python's
       }
   -- another formatting oddity
-   -- note that this is not line contiation!
+   -- note that this is not a single
-   -- `look at`, `this`, and `alignment`
+   -- continued line! `look at`,
-   -- are all separate expressions!
+   -- `this`, and `alignment` are all
   -- separate expressions!
   anotherThing = do look at
                     this
                     alignment
@@ -109,56 +110,120 @@ Thankfully for us, our entry point is quite clear; layouts only appear after a
 select few keywords, (with a minor exception; TODO: elaborate) being :code:`let`
 (followed by supercombinators), :code:`where` (followed by supercombinators),
 :code:`do` (followed by expressions), and :code:`of` (followed by alternatives)
-(TODO: all of these terms need linked glossary entries). Under this assumption,
+(TODO: all of these terms need linked glossary entries). In order to manage the
-we give the following rule:
+cascade of layout contexts, our lexer will record a stack for which each element
-
+is either :math:`\varnothing`, denoting an explicit layout written with braces
-1. If a :code:`let`, :code:`where`, :code:`do`, or :code:`of` keyword is not
+and semicolons, or a :math:`\langle n \rangle`, denoting an implicitly laid-out
-   followed by the lexeme :code:`{`, the token :math:`\{n\}` is inserted after
+layout where the start of each item belonging to the layout is indented
-   the keyword, where :math:`n` is the indentation of the next lexeme if there
+:math:`n` columns.
   is one, or 0 if the end of file has been reached.
 Henceforth :math:`\{n\}` will denote the token representing the beginning of a
 layout; similar in function to a brace, but it stores the indentation level for
 subsequent lines to compare with. We must introduce an additional input to the
 function handling layouts. Obviously, such a function would require the input
 string, but a helpful book-keeping tool which we will make good use of is a
 stack of "layout contexts", describing the current cascade of layouts. Each
 element is either a :code:`NoLayout`, indicating an explicit layout (i.e. the
 programmer inserted semicolons and braces herself) or a :code:`Layout n` where
 :code:`n` is a non-negative integer representing the indentation level of the
 enclosing context.
 .. code-block:: haskell
-    f x -- layout stack: []
+    -- layout stack: []
-        = let -- layout keyword; remember indentation of next token
+    module M where -- layout stack: [∅]
-              y = w * w -- layout stack: [Layout 10]
+
    f x = let -- layout keyword; remember indentation of next token
              y = w * w -- layout stack: [∅, <10>]
              w = x + x
              -- layout ends here
          in do -- layout keyword; next token is a brace!
-              { -- layout stack: [NoLayout]
+              { -- layout stack: [∅]
-              pure }
+                  print y;
                  print x;
              }
-In the code seen above, notice that :code:`let` allows for multiple definitions,
+Finally, we also need the concept of "virtual" brace tokens, which as far as
-separated by a newline. We accomate for this with a token :math:`\langle n
+we're concerned at this moment are exactly like normal brace tokens, except
-\rangle` which compliments :math:`\{n\}` in how it functions as a closing brace
+implicitly inserted by the compiler. With the presented ideas in mind, we may
-that stores indentation. We give a rule to describe the source of such a token:
+begin to introduce a small set of informal rules describing the lexer's handling
 of layouts, the first being:
-2. When the first lexeme on a line is preceeded by only whitespace a
+1. If a layout keyword is followed by the token '{', push :math:`\varnothing`
-   :math:`\langle n \rangle` token is inserted before the lexeme, where
+   onto the layout context stack. Otherwise, push :math:`\langle n \rangle` onto
-   :math:`n` is the indentation of the lexeme, provided that it is not, as a
+   the layout context stack where :math:`n` is the indentation of the token
-   consequence of rule 1 or rule 3 (as we'll see), preceded by {n}. 
+   following the layout keyword. Additionally, the lexer is to insert a virtual
   opening brace after the token representing the layout keyword.
-Lastly, to handle the top level we will initialise the stack with a
+Consider the following observations from that previous code sample:
 :math:`\{n\}` where :math:`n` is the indentation of the first lexeme.
-3. If the first lexeme of a module is not '{' or :code:`module`, then it is
+* Function definitions should belong to a layout, each of which may start at
-   preceded by :math:`\{n\}` where :math:`n` is the indentation of the lexeme. 
+  column 1.
 * A layout can enclose multiple bodies, as seen in the :code:`let`-bindings and
  the :code:`do`-expression.
 * Semicolons should *terminate* items, rather than *separate* them.
 Our current focus is the semicolons. In an implicit layout, items are on
 separate lines each aligned with the previous. A naïve implementation would be
 to insert the semicolon token when the EOL is reached, but this proves less than ideal
 when you consider the alignment requirement. In our implementation, our lexer
 will wait until the first token on a new line is reached, then compare
 indentation and insert a semicolon if appropriate. This comparison -- the
 nondescript measurement of "more, less, or equal indentation" rather than a
 numeric value -- is referred to as *offside*, both by me internally and by the
 Haskell Report's description of layouts. We informally formalise this rule as follows:
 2. When the first token on a line is preceded only by whitespace, and the
   token's first grapheme resides on a column number :math:`m` equal to the
   indentation level :math:`n` of the enclosing context -- i.e. the :math:`\langle n
   \rangle` on top of the layout stack -- the lexer shall insert a semicolon.
   Should no such context exist on the stack, assume :math:`m > n`.
 We have an idea of how to begin layouts, delimit the enclosed items, and last
 we'll need to end layouts. This is where the distinction between virtual and
 non-virtual brace tokens comes into play. The lexer needs only partial concern
 towards closing layouts; the complete responsibility is shared with the parser.
 This will be elaborated on in the next section. For now, we will be content with
 naïvely inserting a virtual closing brace when a token is indented left of the
 layout.
 3. Under the same conditions as rule 2., when :math:`m < n` the lexer shall
   insert a virtual closing brace and pop the layout stack.
 This rule covers some cases, including the top level. However, consider
 tokenising the :code:`in` in a :code:`let`-expression. If our lexical analysis
 framework only allows for lexing a single token at a time, we cannot return both
 a virtual right-brace and an :code:`in`. Under this model, the lexer may simply
 pop the layout stack and return the :code:`in` token. As we'll see in the next
 section, as long as the lexer keeps track of its own context (i.e. the stack),
 the parser will cope just fine without the virtual end-brace.
 Parsing Lonely Braces
 *********************
 When viewed in the abstract, parsing and tokenising are near-identical tasks yet
 the two are very often decomposed into discrete systems with very different
 implementations. Lexers operate on streams of text and tokens, while parsers
 are typically far less linear, using a parse stack or recursing top-down. A
 big reason for this separation is state management: the parser aims to be as
 context-free as possible, while the lexer tends to shoulder the necessary
 statefulness. Still, the nature of a stream-oriented lexer makes backtracking
 difficult and quite inelegant.
 However, simply declaring a parse error to be not an error at all
 counterintuitively proves to be an elegant solution to our layout problem which
 minimises backtracking and state in both the lexer and the parser. Consider the
 following definitions found in rlp's BNF:
 .. productionlist:: rlp
   VOpen   : `vopen`
   VClose  : `vclose` | `error`
 A parse error is recovered and treated as a closing brace. Another point of note
 in the BNF is the difference between virtual and non-virtual braces (TODO: i
 don't like that the BNF is formatted without newlines :/):
 .. productionlist:: rlp
   LetExpr : `let` VOpen Bindings VClose `in` Expr | `let` `{` Bindings `}` `in` Expr
 This ensures that non-virtual braces are closed explicitly.
 This set of rules is adequate to satisfy our basic concerns about line
 continuations and layout lists. For a more pedantic description of the layout
 system, see `chapter 10
 <https://www.haskell.org/onlinereport/haskell2010/haskellch10.html>`_ of the
-2010 Haskell Report, which I **heavily** referenced here.
+2010 Haskell Report, which I heavily referenced here.
 References
 ----------
@@ -166,5 +231,5 @@ References
 * `Python's lexical analysis
  <https://docs.python.org/3/reference/lexical_analysis.html>`_
-* `Haskell Syntax Reference
+* `Haskell syntax reference
  <https://www.haskell.org/onlinereport/haskell2010/haskellch10.html>`_
--- a/src/Core/Parse.y
+++ b/src/Core/Parse.y
@@ -58,7 +58,8 @@ Eof             : eof           { () }
                | error         { () }
 Program         :: { Program }
-Program         : VOpen ScDefs VClose              { Program $2 }
+Program         : VOpen ScDefs VClose           { Program $2 }
                | '{'   ScDefs '}'              { Program $2 }
 VOpen           :: { () }
 VOpen           : vl                            { () }