We use megaparsec for text parsing. There are several good tutorials out there, and those are just the top 3 Google results. Note: FOSSA employees can also talk to the analysis team directly; we can walk you through this.
Some of this advice is given directly in the first tutorial, and is repeated here for emphasis. PLEASE READ THE TUTORIALS.
Usually, you should add `type Parser = Parsec Void Text` to your parsing module, so that your parser functions can be of type `Parser a`, rather than the long-form `Parsec` version, or worse, the fully-verbose `ParsecT` version.
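As a minimal sketch of the alias in use (assuming the `megaparsec` and `text` packages; `digits` is a hypothetical parser made up for illustration):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec (Parsec, parseMaybe, some)
import Text.Megaparsec.Char (digitChar)

-- Void: no custom error component. Text: the input stream type.
type Parser = Parsec Void Text

-- Hypothetical example parser: one or more decimal digits.
-- Without the alias, this signature would read: Parsec Void Text String
digits :: Parser String
digits = some digitChar

-- Example use in GHCi: parseMaybe digits "123"
```

The alias changes nothing at runtime; it only keeps signatures short and consistent across the module.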
When writing a non-trivial text parser, you should create the following helper functions:
- `sc`/`scn` - Space Consumer (or Space Consumer with Newlines)
  - Use `Lexer.space` to create your whitespace consumer.
    - READ THE DOCS FOR THAT FUNCTION IF YOU HAVEN'T YET! There's a lot of useful info there, and you'll need to know it.
  - You don't always need to create both `sc` and `scn`, but you'll almost always need at least one.
- `symbol` - Parser for verbatim text strings.
  - Use `Lexer.symbol` to create this helper.
  - Use your `sc` or `scn` function for the whitespace consumer.
  - You can use `Lexer.symbol'` if your text is case-insensitive.
- `lexeme` - Parser for any basic unit of the language. While `symbol` is for verbatim text parsing, `lexeme` is used with any parser that should consume space after finishing. For example, `Lexer.symbol` is implemented via `Lexer.lexeme`.
  - Use `Lexer.lexeme` to define this helper.
  - Use your `sc` or `scn` function for the whitespace consumer.
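The helpers described above could be sketched roughly as follows. This assumes a language with no comment syntax (hence `empty` for both comment parsers); `identifier` and `assignment` are hypothetical parsers added only to show the helpers in use.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative (empty, some, (<|>))
import Data.Char (isAlphaNum)
import Data.Functor (void)
import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec (Parsec, parseMaybe, takeWhile1P)
import Text.Megaparsec.Char (char, space1)
import qualified Text.Megaparsec.Char.Lexer as Lexer

type Parser = Parsec Void Text

-- Space consumer: spaces and tabs only, never newlines.
-- The first argument to Lexer.space must consume at least one character.
sc :: Parser ()
sc = Lexer.space (void (some (char ' ' <|> char '\t'))) empty empty

-- Space consumer with newlines: space1 also accepts line breaks.
scn :: Parser ()
scn = Lexer.space space1 empty empty

-- Verbatim text, followed by optional whitespace.
symbol :: Text -> Parser Text
symbol = Lexer.symbol sc

-- Wrap any parser so it consumes trailing whitespace.
lexeme :: Parser a -> Parser a
lexeme = Lexer.lexeme sc

-- Hypothetical example: an alphanumeric identifier.
identifier :: Parser Text
identifier = lexeme (takeWhile1P (Just "identifier character") isAlphaNum)

-- Hypothetical example: key = value, with arbitrary spacing around '='.
assignment :: Parser (Text, Text)
assignment = do
  key <- identifier
  _ <- symbol "="
  value <- identifier
  pure (key, value)
```

Because `sc` does not accept newlines, `assignment` stays on one line; swapping in `scn` would let the same parser span line breaks.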
Don't use space parsers directly, since space is usually consumed by the helpers listed above: `lexeme` and `symbol` automatically consume all whitespace after their parser. You can parse whitespace directly if absolutely necessary, but it rarely is.
There are some known exceptions, like HTTP messages, which require exactly two consecutive newlines to separate the header section from the body. Parsing arbitrary amounts of whitespace would be incorrect for an HTTP message parser. In this case, you should directly parse the newlines.
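A rough sketch of parsing newlines directly, using a deliberately simplified, hypothetical message format (header lines, one blank line, then the body) rather than real HTTP:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec (Parsec, parseMaybe, some, takeRest, takeWhile1P)
import Text.Megaparsec.Char (eol)

type Parser = Parsec Void Text

-- One non-empty header line, terminated by exactly one line ending.
headerLine :: Parser Text
headerLine =
  takeWhile1P (Just "header character") (\c -> c /= '\r' && c /= '\n') <* eol

-- Headers, then exactly one more line ending (the blank line), then the body.
-- No space consumer is involved, so whitespace in the body is preserved.
message :: Parser ([Text], Text)
message = do
  headers <- some headerLine
  _ <- eol
  body <- takeRest
  pure (headers, body)
```

Here `eol` is used directly, so the separator is matched exactly once and any leading whitespace in the body is left untouched.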
Every parser must consume something, or fail. Successfully parsing after consuming no input commonly leads to infinite parsing loops. Using `pure` in an applicative parser is a sign that you may be consuming nothing, but indicating success.
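To illustrate the consume-or-fail rule, here is a small sketch contrasting a parser that can succeed on empty input with one that cannot. All parser names are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Char (isAlpha)
import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec (Parsec, many, parseMaybe, takeWhile1P, takeWhileP)
import Text.Megaparsec.Char (char)

type Parser = Parsec Void Text

-- Risky: takeWhileP accepts zero characters, so this succeeds on any input,
-- possibly without consuming anything (much like a bare `pure ""`).
maybeEmptyWord :: Parser Text
maybeEmptyWord = takeWhileP (Just "letter") isAlpha

-- Safer: takeWhile1P needs at least one character, so it consumes or fails.
word :: Parser Text
word = takeWhile1P (Just "letter") isAlpha

-- Terminates: `word` fails once no letters remain, which ends `many`.
-- Using `maybeEmptyWord` here instead would keep succeeding at the same
-- position, so `many` would be expected to never finish.
wordsBySpaces :: Parser [Text]
wordsBySpaces = many (word <* many (char ' '))
```

The inner `many (char ' ')` is fine because it always follows `word`, which has already consumed input in that iteration.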
Usually, you're writing a parser which must consume an entire file. In this case, you should terminate the top-level parser with `eof`, which succeeds only at the end of input, so the parse fails if any unparsed input remains.
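A minimal sketch of the difference `eof` makes, assuming a hypothetical file format that contains a single decimal number:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec (Parsec, eof, errorBundlePretty, parse)
import qualified Text.Megaparsec.Char.Lexer as Lexer

type Parser = Parsec Void Text

-- Parses a leading decimal number and stops; trailing input is ignored.
number :: Parser Int
number = Lexer.decimal

-- Top-level parser: the number must be followed by the end of input.
numberFile :: Parser Int
numberFile = number <* eof

-- Hypothetical entry point that renders parse errors as a String.
parseNumberFile :: Text -> Either String Int
parseNumberFile input =
  either (Left . errorBundlePretty) Right (parse numberFile "<input>" input)
```

With `parse`, `number` accepts `"42abc"` and silently drops `"abc"`, while `numberFile` rejects it. Note that `parseMaybe` already requires the whole input to be consumed, so the difference only shows up with `parse` and similar runners.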