From 48aa05caad5c6281f525668f0f9bc71494d596da Mon Sep 17 00:00:00 2001
From: crumbtoo
Date: Mon, 20 Nov 2023 11:17:27 -0700
Subject: [PATCH] some writing on layouts

---
 docs/src/commentary/layout-lexing.rst | 169 +++++++++++++++++++++++++-
 docs/src/glossary.rst                 |  15 +++
 docs/src/index.rst                    |   6 +
 3 files changed, 188 insertions(+), 2 deletions(-)
 create mode 100644 docs/src/glossary.rst

diff --git a/docs/src/commentary/layout-lexing.rst b/docs/src/commentary/layout-lexing.rst
index 5a1f8ab..4c14735 100644
--- a/docs/src/commentary/layout-lexing.rst
+++ b/docs/src/commentary/layout-lexing.rst
@@ -1,3 +1,168 @@
-Parsing and the Layout System
-=============================
+Lexing, Parsing, and Layouts
+============================

The C-style languages of my previous experience have all had quite trivial
lexical analysis stages, peaking in complexity when I streamed tokens lazily in
C. The task of tokenising a C-style language is very simple to describe: ignore
all whitespace and point out what you recognise. If you don't recognise
something, check whether it's a literal or an identifier. Should it be neither,
return an error.

On paper, both lexing and parsing a Haskell-like language seem to pose somewhat
greater challenges. Listed by ascending intimidation factor, some of the
potential roadblocks on my mind before making an attempt were:

* Operators: Haskell has not only user-defined infix operators, but
  user-defined precedence levels and associativities. I recall using an
  algorithm that looked infix, prefix, postfix, and even mixfix operators up in
  a global table in order to call the appropriate parser (provided the
  operator's precedence, also stored in the table, permitted it). I never
  modified the table at runtime; however, doing so could be a very nice
  solution for Haskell.

* Context-sensitive keywords: Haskell allows some words to be used as
  identifiers in appropriate contexts, such as :code:`family`, :code:`role`,
  and :code:`as`. A note_ found in `GHC's lexer `_ suggests that keywords are
  only considered in contexts where their use is relevant, e.g. :code:`family`
  and :code:`role` in type declarations, :code:`as` in import declarations,
  and so on.

* Whitespace sensitivity: while I was comfortable with the idea of a system
  similar to Python's INDENT/DEDENT tokens, Haskell seemed to use whitespace
  to section code in a way that *felt* different.

.. _note: https://gitlab.haskell.org/ghc/ghc/-/wikis/commentary/coding-style#2-using-notes

After a bit of thought and research, whitespace sensitivity in the form of
*layouts*, as Haskell and I will call them, is easily the scariest thing on
this list -- however, it is achievable!

A Lexical Primer: Python
************************

We will compare and contrast with Python's lexical analysis. Much to my dismay,
Python uses newlines and indentation to separate statements and resolve scope
instead of the traditional semicolons and braces found in C-style languages (we
may generally refer to these C-style languages as *explicitly-sectioned*).
Internally, when the Python lexer begins a new line during tokenisation, it
compares the indentation of the new line with that of the previous one and
applies the following rules:

1. If the new line has greater indentation than the previous, insert an INDENT
   token and push the new line's indentation level onto the indentation stack
   (the stack is initialised with an indentation level of zero).

2. If the new line has lesser indentation than the previous, pop the stack
   until the top of the stack is no greater than the new line's indentation
   level; the remaining top must then match the new indentation exactly, or
   the indentation is inconsistent. A DEDENT token is inserted for each level
   popped.

3. If the indentation is equal, insert a NEWLINE token to terminate the
   previous line, and leave it at that!
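To make the bookkeeping concrete, here is a minimal sketch of that indentation
stack, written in Haskell (the language used for examples throughout this
commentary). It is only an illustration of the three rules above, not CPython's
actual tokeniser; the names :code:`IndentToken` and :code:`onNewLine` are my
own, and blank lines, comments, and tabs are ignored entirely.

.. code-block:: haskell

   -- Tokens produced purely by indentation bookkeeping.
   data IndentToken = Indent | Dedent | Newline
       deriving (Show, Eq)

   -- Given the indentation stack (initialised to [0]) and the indentation of
   -- a fresh line, emit the appropriate tokens and return the updated stack.
   onNewLine :: [Int] -> Int -> Either String ([IndentToken], [Int])
   onNewLine stack col = case stack of
       (top : _)
           | col > top  -> Right ([Indent], col : stack)   -- rule 1
           | col == top -> Right ([Newline], stack)        -- rule 3
       _ -> case span (> col) stack of                     -- rule 2
           (popped, rest@(top : _))
               | top == col -> Right (map (const Dedent) popped, rest)
           _ -> Left "inconsistent indentation"

For example, :code:`onNewLine [4, 0] 8` yields
:code:`Right ([Indent], [8, 4, 0])`, while :code:`onNewLine [8, 4, 0] 0`
unwinds two blocks at once and yields :code:`Right ([Dedent, Dedent], [0])`.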
Parsing Python with the INDENT, DEDENT, and NEWLINE tokens is identical to
parsing a language with braces and semicolons. This solution is very much in
line with Python's philosophy of the "one correct answer" (TODO: this needs a
source). In developing our *layout* rules, we will follow the same pattern:
translate the whitespace-sensitive source language into an
explicitly-sectioned one.

But What About Haskell?
***********************

We saw that Python, the most notable example of an implicitly-sectioned
language, is pretty simple to lex. Why then am I so afraid of Haskell's
layouts? To be frank, I'm far less scared after asking myself this -- however,
there are certainly some new complexities that Python needn't concern itself
with. Haskell has implicit line *continuation*, where a single form is written
over multiple lines, and the indentation styles often seen in Haskell are
somewhat esoteric compared to Python's "s/[{};]//".

.. code-block:: haskell

   -- line continuation
   something = this is a
       single expression

   -- an extremely common style found in haskell
   data Python = Users
       { are :: Crying
       , right :: About
       , now :: Sorry
       }

   -- another formatting oddity
   -- note that this is not line continuation!
   -- `look at`, `this`, and `alignment`
   -- are all separate expressions!
   anotherThing = do look at
                     this
                     alignment

But enough fear, let's actually think about implementation. Firstly, some
formality: what do we mean when we say *layout*? We will define layout as the
rules we apply to an implicitly-sectioned language in order to yield one that
is explicitly-sectioned. We will also define the *indentation* of a lexeme as
the column number of its first character.

Thankfully for us, our entry point is quite clear; layouts only appear after a
select few keywords (with a minor exception; TODO: elaborate), namely
:code:`let` (followed by supercombinators), :code:`where` (followed by
supercombinators), :code:`do` (followed by expressions), and :code:`of`
(followed by alternatives) (TODO: all of these terms need linked glossary
entries). Under this assumption, we give the following rule:

1. If a :code:`let`, :code:`where`, :code:`do`, or :code:`of` keyword is not
   followed by the lexeme :code:`{`, the token :math:`\{n\}` is inserted after
   the keyword, where :math:`n` is the indentation of the next lexeme if there
   is one, or 0 if the end of file has been reached.

Henceforth :math:`\{n\}` will denote the token representing the beginning of a
layout; it functions like an opening brace, but it also records the
indentation level for subsequent lines to compare against. We must introduce
an additional input to the function handling layouts. Obviously, such a
function requires the input string, but a helpful book-keeping tool which we
will make good use of is a stack of "layout contexts", describing the current
cascade of layouts. Each element is either :code:`NoLayout`, indicating an
explicit layout (i.e. the programmer inserted semicolons and braces herself),
or :code:`Layout n`, where :code:`n` is a non-negative integer representing
the indentation level of the enclosing context.
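In code, such a stack might be represented by a small sum type. The following
is only a sketch, with names chosen for illustration rather than taken from
the real lexer:

.. code-block:: haskell

   -- One entry per enclosing layout; the whole stack describes the current
   -- cascade of layouts.
   data LayoutContext
       = NoLayout      -- explicit layout: the programmer wrote the braces herself
       | Layout Int    -- implicit layout opened at this indentation level
       deriving (Show, Eq)

   type LayoutStack = [LayoutContext]

The comments in the next example trace how this stack evolves as layout
keywords are encountered.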
.. code-block:: haskell

   f x                      -- layout stack: []
       = let                -- layout keyword; remember indentation of next token
            y = w * w       -- layout stack: [Layout 10]
            w = x + x
         in do              -- layout keyword; next token is a brace!
         {                  -- layout stack: [NoLayout]
           pure }

In the code seen above, notice that :code:`let` allows for multiple
definitions, separated by a newline. We accommodate this with a token
:math:`\langle n \rangle`, which complements :math:`\{n\}`: it marks the first
lexeme of a new line and records that line's indentation, so the layout
algorithm can later decide whether the line continues the previous item,
starts a new one, or closes the enclosing layout. We give a rule to describe
the source of such a token:

2. When the first lexeme on a line is preceded by only whitespace, a
   :math:`\langle n \rangle` token is inserted before the lexeme, where
   :math:`n` is the indentation of the lexeme, provided that it is not, as a
   consequence of rule 1 or rule 3 (as we'll see), preceded by :math:`\{n\}`.

Lastly, to handle the top level, we open a layout around the module itself,
initialising the stack with the indentation of the first lexeme.

3. If the first lexeme of a module is not :code:`{` or :code:`module`, then it
   is preceded by :math:`\{n\}`, where :math:`n` is the indentation of the
   lexeme.

For a more pedantic description of the layout system, see `chapter 10 `_ of
the 2010 Haskell Report, which I **heavily** referenced here.

References
----------

* `Python's lexical analysis `_

* `Haskell Syntax Reference `_
diff --git a/docs/src/glossary.rst b/docs/src/glossary.rst
new file mode 100644
index 0000000..4cb9aaf
--- /dev/null
+++ b/docs/src/glossary.rst
@@ -0,0 +1,15 @@
+Glossary
+========
+
+Haskell and its culture are infamous for using scary mathematical terms for
+simple ideas. Please excuse us, it's really fun :3.
+
+.. glossary::
+
+   supercombinator
+      An expression with no free variables. For most purposes, just think of a
+      top-level definition.
+
+   case alternative
+      A possible match in a case expression (TODO: example)
+
diff --git a/docs/src/index.rst b/docs/src/index.rst
index a0ab572..707ee37 100644
--- a/docs/src/index.rst
+++ b/docs/src/index.rst
@@ -6,6 +6,12 @@ Contents
 .. toctree::
    :maxdepth: 2
+   :caption: Index
+
+   glossary.rst
+
+.. toctree::
+   :maxdepth: 1
    :caption: Commentary
    :glob: