Happy parse lex #1

Merged
msydneyslaga merged 7 commits from happy-parse-lex into main 2023-11-20 14:09:33 -07:00
3 changed files with 188 additions and 2 deletions
Showing only changes of commit 48aa05caad - Show all commits

View File

@@ -1,3 +1,168 @@
Parsing and the Layout System
=============================
Lexing, Parsing, and Layouts
============================
The C-style languages of my previous experiences have all had quite trivial
lexical analysis stages, peaking in complexity when I streamed tokens lazily in
C. The task of tokenising a C-style language is very simple in description: you
ignore all whitespace and point out what you recognise. If you don't recognise
something, check if it's a literal or an identifier. Should it be neither,
return an error.
On paper, both lexing and parsing a Haskell-like language seem to pose a few
greater challenges. Listed by ascending intimidation factor, some of the
potential roadblocks on my mind before making an attempt were:
* Operators; Haskell has not only user-defined infix operators, but user-defined
precedence levels and associativities. I recall using an algorithm that looked
up infix, prefix, postfix, and even mixfix operators up in a global table to
call their appropriate parser (if their precedence was appropriate, also
stored in the table). I never modified the table at runtime, however this
could be a very nice solution for Haskell.
* Context-sensitive keywords; Haskell allows for some words to be used as identifiers in
appropriate contexts, such as :code:`family`, :code:`role`, :code:`as`.
Reading a note_ found in `GHC's lexer
<https://gitlab.haskell.org/ghc/ghc/-/blob/master/compiler/GHC/Parser/Lexer.x#L1133>`_,
it appears that keywords are only considered in bodies for which their use is
relevant, e.g. :code:`family` and :code:`role` in type declarations,
:code:`as` after :code:`case`; :code:`if`, :code:`then`, and :code:`else` in
expressions, etc.
* Whitespace sensitivity; While I was comfortable with the idea of a system
similar to Python's INDENT/DEDENT tokens, Haskell seemed to use whitespace to
section code in a way that *felt* different.
.. _note: https://gitlab.haskell.org/ghc/ghc/-/wikis/commentary/coding-style#2-using-notes
After a bit of thought and research, whitespace sensitivity in the form of
*layouts* as Haskell and I will refer to them as, are easily the scariest thing
on this list -- however they are achievable!
A Lexical Primer: Python
************************
We will compare and contrast with Python's lexical analysis. Much to my dismay,
Python uses newlines and indentation to separate statements and resolve scope
instead of the traditional semicolons and braces found in C-style languages (we
may generally refer to these C-style languages as *explicitly-sectioned*).
Internally during tokenisation, when the Python lexer begins a new line, they
compare the indentation of the new line with that of the previous and apply the
following rules:
1. If the new line has greater indentation than the previous, insert an INDENT
token and push the new line's indentation level onto the indentation stack
(the stack is initialised with an indentation level of zero).
2. If the new line has lesser indentation than the previous, pop the stack until
the top of the stack is greater than the new line's indentation level. A
DEDENT token is inserted for each level popped.
3. If the indentation is equal, insert a NEWLINE token to terminate the previous
line, and leave it at that!
Parsing Python with the INDENT, DEDENT, and NEWLINE tokens is identical to
parsing a language with braces and semicolons. This is a solution pretty in line
with Python's philosophy of the "one correct answer" (TODO: this needs a
source). In developing our *layout* rules, we will follow in the pattern of
translating the whitespace-sensitive source language to an explicitly sectioned
language.
But What About Haskell?
***********************
We saw that Python, the most notable example of an implicitly sectioned
language, is pretty simple to lex. Why then am I so afraid of Haskell's layouts?
To be frank, I'm far less scared after asking myself this -- however there are
certainly some new complexities that Python needn't concern. Haskell has
implicit line *continuation*: forms written over multiple lines; indentation
styles often seen in Haskell are somewhat esoteric compared to Python's
"s/[{};]//".
.. code-block:: haskell
-- line continuation
something = this is a
single expression
-- an extremely common style found in haskell
data Python = Users
{ are :: Crying
, right :: About
, now :: Sorry
}
-- another formatting oddity
-- note that this is not line contiation!
-- `look at`, `this`, and `alignment`
-- are all separate expressions!
anotherThing = do look at
this
alignment
But enough fear, lets actually think about implementation. Firstly, some
formality: what do we mean when we say layout? We will define layout as the
rules we apply to an implicitly-sectioned language in order to yield one that is
explicitly-sectioned. We will also define indentation of a lexeme as the column
number of its first character.
Thankfully for us, our entry point is quite clear; layouts only appear after a
select few keywords, (with a minor exception; TODO: elaborate) being :code:`let`
(followed by supercombinators), :code:`where` (followed by supercombinators),
:code:`do` (followed by expressions), and :code:`of` (followed by alternatives)
(TODO: all of these terms need linked glossary entries). Under this assumption,
we give the following rule:
1. If a :code:`let`, :code:`where`, :code:`do`, or :code:`of` keyword is not
followed by the lexeme :code:`{`, the token :math:`\{n\}` is inserted after
the keyword, where :math:`n` is the indentation of the next lexeme if there
is one, or 0 if the end of file has been reached.
Henceforth :math:`\{n\}` will denote the token representing the begining of a
layout; similar in function to a brace, but it stores the indentation level for
subsequent lines to compare with. We must introduce an additional input to the
function handling layouts. Obviously, such a function would require the input
string, but a helpful book-keeping tool which we will make good use of is a
stack of "layout contexts", describing the current cascade of layouts. Each
element is either a :code:`NoLayout`, indicating an explicit layout (i.e. the
programmer inserted semicolons and braces herself) or a :code:`Layout n` where
:code:`n` is a non-negative integer representing the indentation level of the
enclosing context.
.. code-block:: haskell
f x -- layout stack: []
= let -- layout keyword; remember indentation of next token
y = w * w -- layout stack: [Layout 10]
w = x + x
in do -- layout keyword; next token is a brace!
{ -- layout stack: [NoLayout]
pure }
In the code seen above, notice that :code:`let` allows for multiple definitions,
separated by a newline. We accomate for this with a token :math:`\langle n
\rangle` which compliments :math:`\{n\}` in how it functions as a closing brace
that stores indentation. We give a rule to describe the source of such a token:
2. When the first lexeme on a line is preceeded by only whitespace a
:math:`\langle n \rangle` token is inserted before the lexeme, where
:math:`n` is the indentation of the lexeme, provided that it is not, as a
consequence of rule 1 or rule 3 (as we'll see), preceded by {n}.
Lastly, to handle the top level we will initialise the stack with a
:math:`\{n\}` where :math:`n` is the indentation of the first lexeme.
3. If the first lexeme of a module is not '{' or :code:`module`, then it is
preceded by :math:`\{n\}` where :math:`n` is the indentation of the lexeme.
For a more pedantic description of the layout system, see `chapter 10
<https://www.haskell.org/onlinereport/haskell2010/haskellch10.html>`_ of the
2010 Haskell Report, which I **heavily** referenced here.
References
----------
* `Python's lexical analysis
<https://docs.python.org/3/reference/lexical_analysis.html>`_
* `Haskell Syntax Reference
<https://www.haskell.org/onlinereport/haskell2010/haskellch10.html>`_

15
docs/src/glossary.rst Normal file
View File

@@ -0,0 +1,15 @@
Glossary
========
Haskell and Haskell culture is infamous for using scary mathematical terms for
simple ideas. Please excuse us, it's really fun :3.
.. glossary::
supercombinator
An expression with no free variables. For most purposes, just think of a
top-level definition.
case alternative
An possible match in a case expression (TODO: example)

View File

@@ -6,6 +6,12 @@ Contents
.. toctree::
:maxdepth: 2
:caption: Index
glossary.rst
.. toctree::
:maxdepth: 1
:caption: Commentary
:glob: