# Lexing, Parsing, and Layouts
The C-style languages of my previous experiences have all had quite trivial lexical analysis stages, peaking in complexity when I streamed tokens lazily in C. The task of tokenising a C-style language is very simple in description: you ignore all whitespace and point out what you recognise. If you don’t recognise something, check if it’s a literal or an identifier. Should it be neither, return an error.
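To make that description concrete, here is a toy sketch in Haskell (the language used throughout this post); the `Tok` type and the single recognised operator are inventions for illustration, not any real lexer:

```haskell
import Data.Char (isAlpha, isDigit, isSpace)

-- A hypothetical token type for a tiny C-style language.
data Tok = TPlus | TNum String | TIdent String
  deriving Show

lexC :: String -> Either String [Tok]
lexC [] = Right []
lexC (c:cs)
  | isSpace c = lexC cs                           -- ignore all whitespace
  | c == '+'  = (TPlus :) <$> lexC cs             -- point out what we recognise
  | isDigit c = let (ds, rest) = span isDigit cs  -- is it a literal?
                in (TNum (c:ds) :) <$> lexC rest
  | isAlpha c = let (ws, rest) = span isAlpha cs  -- is it an identifier?
                in (TIdent (c:ws) :) <$> lexC rest
  | otherwise = Left ("unrecognised character: " ++ [c])
```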
On paper, both lexing and parsing a Haskell-like language seem to pose somewhat greater challenges. Listed by ascending intimidation factor, some of the potential roadblocks on my mind before making an attempt were:
- Operators; Haskell has not only user-defined infix operators, but user-defined precedence levels and associativities. I recall using an algorithm that looked infix, prefix, postfix, and even mixfix operators up in a global table to call their appropriate parser (provided their precedence, also stored in the table, was appropriate). I never modified the table at runtime; however, this could be a very nice solution for Haskell (see the sketch just after this list).
- Context-sensitive keywords; Haskell allows for some words to be used as identifiers in appropriate contexts, such as `family`, `role`, and `as`. Reading a note found in GHC's lexer, it appears that keywords are only considered in bodies for which their use is relevant, e.g. `family` and `role` in type declarations, `as` after `case`; `if`, `then`, and `else` in expressions, etc.
- Whitespace sensitivity; While I was comfortable with the idea of a system similar to Python's INDENT/DEDENT tokens, Haskell seemed to use whitespace to section code in a way that felt different.
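To sketch the table-driven idea from the first bullet (all names here are my own, not GHC's): a fixity table maps operator names to an associativity and a precedence level, and user-written fixity declarations would simply insert into it.

```haskell
import qualified Data.Map as M

data Assoc = LeftAssoc | RightAssoc | NonAssoc
  deriving Show

-- Operator name -> (associativity, precedence level)
type FixityTable = M.Map String (Assoc, Int)

-- A few of Haskell's standard fixities.
defaultFixities :: FixityTable
defaultFixities = M.fromList
  [ ("+", (LeftAssoc,  6))
  , ("*", (LeftAssoc,  7))
  , ("$", (RightAssoc, 0))
  ]

-- The expression parser consults the table to decide how tightly an
-- operator binds; operators without a declaration default to infixl 9.
lookupFixity :: FixityTable -> String -> (Assoc, Int)
lookupFixity table op = M.findWithDefault (LeftAssoc, 9) op table
```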
After a bit of thought and research, whitespace sensitivity in the form of layouts, as both Haskell and I will refer to them, is easily the scariest thing on this list – however, it is achievable!
## A Lexical Primer: Python
We will compare and contrast with Python's lexical analysis. Much to my dismay, Python uses newlines and indentation to separate statements and resolve scope, instead of the traditional semicolons and braces found in C-style languages (we may generally refer to these C-style languages as explicitly-sectioned). Internally, when the Python lexer begins a new line during tokenisation, it compares the indentation of the new line with that of the previous line and applies the following rules (sketched in code just after this list):
- If the new line has greater indentation than the previous, insert an INDENT token and push the new line's indentation level onto the indentation stack (the stack is initialised with an indentation level of zero).
- If the new line has lesser indentation than the previous, pop the stack until the top of the stack is no greater than the new line's indentation level. A DEDENT token is inserted for each level popped.
- If the indentation is equal, insert a NEWLINE token to terminate the previous line, and leave it at that!
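Here's a minimal sketch of those three rules in Haskell (the token constructors are my naming; CPython's actual tokeniser is written in C, but the logic is the same):

```haskell
data PyTok = Indent | Dedent | Newline
  deriving Show

-- Given the column of a new line and the indentation stack (initialised
-- to [0]), produce the tokens to emit and the updated stack.
onNewLine :: Int -> [Int] -> ([PyTok], [Int])
onNewLine col stack@(top:_)
  | col > top = ([Indent], col : stack)             -- deeper: open a block
  | col < top = let (popped, rest) = span (> col) stack
                in (Dedent <$ popped, rest)         -- shallower: close blocks
  | otherwise = ([Newline], stack)                  -- same level: end statement
onNewLine _ [] = error "indentation stack is never empty"
```

(Real Python additionally rejects a dedent that doesn't land exactly on a remembered level, but that detail isn't important here.)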
Parsing Python with the INDENT, DEDENT, and NEWLINE tokens is identical to parsing a language with braces and semicolons. This solution is quite in line with Python's philosophy that there should be one – and preferably only one – obvious way to do it (the Zen of Python, PEP 20). In developing our layout rules, we will follow the same pattern, translating the whitespace-sensitive source language into an explicitly-sectioned one.
## But What About Haskell?
We saw that Python, the most notable example of an implicitly-sectioned language, is pretty simple to lex. Why then am I so afraid of Haskell's layouts? To be frank, I'm far less scared after asking myself this – however, there are certainly some new complexities that Python needn't concern itself with. Haskell has implicit line continuation, where a single form may be written over multiple lines, and the indentation styles often seen in Haskell are somewhat esoteric compared to Python's `s/[{};]//`.
```haskell
-- line continuation
something = this is a
    single expression

-- an extremely common style found in haskell
data Python = Users
    { are   :: Crying
    , right :: About
    , now   :: Sorry
    }

-- another formatting oddity
-- note that this is not a single
-- continued line! `look at`,
-- `this`, and `alignment` are all
-- separate expressions!
anotherThing = do look at
                  this
                  alignment
```
But enough fear, let's actually think about implementation. Firstly, some formality: what do we mean when we say layout? We will define layout as the rules we apply to an implicitly-sectioned language in order to yield one that is explicitly-sectioned. We will also define the indentation of a lexeme as the column number of its first character.
Thankfully for us, our entry point is quite clear; layouts only appear after a select few keywords (with a minor exception; TODO: elaborate), these being `let` (followed by supercombinators), `where` (followed by supercombinators), `do` (followed by expressions), and `of` (followed by alternatives) (TODO: all of these terms need linked glossary entries). In order to manage the cascade of layout contexts, our lexer will record a stack for which each element is either ∅, denoting an explicit layout written with braces and semicolons, or ⟨n⟩, denoting an implicitly laid-out layout where the start of each item belonging to the layout is indented n columns.
```haskell
-- layout stack: []
module M where           -- layout stack: [∅]
f x = let                -- layout keyword; remember indentation of next token
         y = w * w       -- layout stack: [∅, <10>]
         w = x + x
      -- layout ends here
      in do              -- layout keyword; next token is a brace!
      {                  -- layout stack: [∅]
        print y;
        print x;
      }
```
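To make the stack concrete, here is one possible representation in Haskell; the constructor names are mine:

```haskell
-- ∅ is an explicit layout opened by a real '{'; ⟨n⟩ is an implicit
-- layout whose items each begin at column n.
data LayoutContext
  = Explicit      -- ∅
  | Implicit Int  -- ⟨n⟩
  deriving (Eq, Show)

type LayoutStack = [LayoutContext]
```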
Finally, we also need the concept of “virtual” brace tokens, which as far as we’re concerned at this moment are exactly like normal brace tokens, except implicitly inserted by the compiler. With the presented ideas in mind, we may begin to introduce a small set of informal rules describing the lexer’s handling of layouts, the first being:
> **Rule 1.** If a layout keyword is followed by the token `{`, push ∅ onto the layout context stack. Otherwise, push ⟨n⟩ onto the layout context stack, where n is the indentation of the token following the layout keyword; additionally, the lexer shall insert a virtual opening brace after the token representing the layout keyword.
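A sketch of rule 1, reusing the `LayoutContext` type above; the `Token` type and the way the lexer peeks at the token following the keyword are stand-ins for whatever the real implementation does:

```haskell
data Token = TVOpen | TVClose | TVSemi  -- virtual '{', '}', and ';'
  deriving Show

-- Called after lexing a layout keyword.
afterLayoutKeyword :: Bool         -- is the next token a '{'?
                   -> Int          -- indentation of the next token
                   -> LayoutStack
                   -> ([Token], LayoutStack)
afterLayoutKeyword isBrace col stack
  | isBrace   = ([], Explicit : stack)            -- explicit: push ∅
  | otherwise = ([TVOpen], Implicit col : stack)  -- push ⟨n⟩, emit virtual '{'
```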
Consider the following observations from that previous code sample:
- Function definitions should belong to a layout, each of which may start at column 1.
- A layout can enclose multiple bodies, as seen in the `let`-bindings and the `do`-expression.
- Semicolons should terminate items, rather than separate them.
Our current focus is the semicolons. In an implicit layout, items sit on separate lines, each aligned with the previous. A naïve implementation would be to insert the semicolon token when the end of a line is reached, but this proves less than ideal once you consider the alignment requirement. In our implementation, the lexer will wait until the first token on a new line is reached, then compare indentation and insert a semicolon if appropriate. This comparison – the relative measurement of "more, less, or equal indentation" rather than a numeric value – is referred to as offside, both in my own head and in the Haskell Report's description of layouts. We informally formalise this rule as follows:
> **Rule 2.** When the first token on a line is preceded only by whitespace, and that token's first grapheme resides on a column equal to the indentation level of the enclosing context – i.e. the ⟨n⟩ on top of the layout stack – the lexer shall insert a semicolon. Should no such context exist on the stack, assume ∅ and insert nothing.
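In the running sketch, rule 2 is a single comparison against the top of the stack:

```haskell
-- Called when the first token on a line, at column `col`, is reached.
offside :: Int -> LayoutStack -> [Token]
offside col (Implicit n : _)
  | col == n = [TVSemi]  -- aligned with the layout: terminate the item
offside _ _  = []        -- explicit context or empty stack: insert nothing
```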
We have an idea of how to begin layouts and how to delimit the enclosed items; lastly, we'll need to end layouts. This is where the distinction between virtual and non-virtual brace tokens comes into play. The lexer bears only partial responsibility for closing layouts; the rest is shared with the parser, as will be elaborated in the next section. For now, we will be content with naïvely inserting a virtual closing brace when a token is indented left of the enclosing layout.
> **Rule 3.** Under the same conditions as rule 2, when the token resides on a column less than the indentation level of the enclosing context, the lexer shall insert a virtual closing brace and pop the layout stack.
This rule covers some cases, including the top level; however, consider tokenising the `in` of a let-expression. If our lexical analysis framework only allows for lexing a single token at a time, we cannot return both a virtual right-brace and an `in`. Under this model, the lexer may simply pop the layout stack and return the `in` token alone. As we'll see in the next section, as long as the lexer keeps track of its own context (i.e. the stack), the parser will cope just fine without the virtual end-brace.
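Putting rules 2 and 3 together in the sketch, the start-of-line handler ends up recursive, since one token may close several nested layouts at once:

```haskell
-- Called with the column of the first token on a new line.
onLineStart :: Int -> LayoutStack -> ([Token], LayoutStack)
onLineStart col stack@(Implicit n : rest)
  | col == n = ([TVSemi], stack)             -- rule 2: a new aligned item
  | col <  n = let (toks, stack') = onLineStart col rest
               in (TVClose : toks, stack')   -- rule 3: close this layout
onLineStart _ stack = ([], stack)            -- continuation line, or explicit
```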
## Parsing Lonely Braces
When viewed in the abstract, parsing and tokenising are near-identical tasks, yet the two are very often decomposed into discrete systems with very different implementations. Lexers operate on streams of text and tokens, while parsers are typically far less linear, using a parse stack or recursing top-down. A big reason for this separation is state management: the parser aims to be as context-free as possible, while the lexer tends to shoulder the necessary statefulness. Still, the nature of a stream-oriented lexer makes backtracking difficult and quite inelegant.
However, simply declaring a parse error to be not an error at all counterintuitively proves to be an elegant solution to our layout problem, one which minimises backtracking and state in both the lexer and the parser. Consider the following definitions found in rlp's BNF:
```
VOpen  ::= vopen
VClose ::= vclose
         | error
```
A parse error is recovered and treated as a closing brace. Another point of note in the BNF is the difference between virtual and non-virtual braces:
```
LetExpr ::= let VOpen Bindings VClose in Expr
          | let '{' Bindings '}' in Expr
```
This ensures that non-virtual braces are closed explicitly.
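To illustrate the recovery trick outside of any particular parser generator, here is a toy hand-rolled parser; this is not rlp's actual machinery, just the `VClose ::= vclose | error` idea in miniature:

```haskell
data BraceTok = VOpenT | VCloseT | SemiT | BindingT
  deriving (Eq, Show)

newtype P a = P { runP :: [BraceTok] -> Maybe (a, [BraceTok]) }

exact :: BraceTok -> P ()
exact t = P go
  where go (t':ts) | t == t' = Just ((), ts)
        go _                 = Nothing

-- Accept a real virtual close brace, or recover by consuming nothing
-- and pretending we saw one – the `error` alternative.
vClose :: P ()
vClose = P $ \ts -> case runP (exact VCloseT) ts of
  Just r  -> Just r
  Nothing -> Just ((), ts)
```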
This set of rules is adequate to satisfy our basic concerns about line continuations and layout lists. For a more pedantic description of the layout system, see chapter 10 of the 2010 Haskell Report, which I heavily referenced here.