
Tutorial 6.1: Syntax definition API

Ilya Lakhin edited this page Nov 11, 2013 · 4 revisions

API

To define a syntax parser, the developer decomposes the programming language's syntax into named rules. Each rule parses a specific code construct and is responsible for building the appropriate AST node for it. Together these rules form the language's grammar, which belongs to the class of Parsing Expression Grammars (PEG).

Rules are defined using a method of a Syntax class instance: .rule(ruleName: String)(ruleBody: => Rule). A rule builds nodes whose kind is the same as its ruleName. The rule that represents the top-level node of the AST hierarchy should be defined with the mainRule method instead.

The reference returned by the rule definition method can be used to refer to this rule from other rules, so the rules can refer to each other and form a graph. Note that rule bodies are evaluated lazily, so it is safe to define circular references between them. In the JSON example jsonObject refers to objectEntry, objectEntry refers to jsonValue, and jsonValue refers back to jsonObject.
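Why lazy rule bodies make circular references safe can be illustrated with a minimal, self-contained sketch. It does not use the library itself: the Rule class and the rule helper below are hypothetical stand-ins that only mimic the shape of rule(ruleName)(ruleBody), with the body passed by name so that it is evaluated on first use rather than at definition time.

```scala
// Toy model, NOT the library's implementation: Rule and rule() are
// hypothetical names used only to illustrate by-name rule bodies.
object LazyRulesSketch {
  final class Rule(val name: String, body: => Seq[Rule]) {
    // The by-name body is evaluated on first access, i.e. after all
    // rule vals below have been initialized.
    lazy val references: Seq[Rule] = body
  }

  def rule(name: String)(body: => Seq[Rule]): Rule = new Rule(name, body)

  // Circular references: object -> entry -> value -> object.
  val jsonObject: Rule = rule("object") { Seq(objectEntry) }
  val objectEntry: Rule = rule("entry") { Seq(jsonValue) }
  val jsonValue: Rule = rule("value") { Seq(jsonObject) }

  def main(args: Array[String]): Unit = {
    // Following three references leads back to the starting rule.
    val cycle = jsonObject.references.head.references.head.references.head
    println(cycle.name)
  }
}
```

If the body were an ordinary (eager) parameter, jsonObject would capture objectEntry before it had been initialized, which is why eagerly defined rules cannot form cycles.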

Remember that syntax rules are applied only to tokens that are not marked as skipped. Thus whitespace, line breaks and comments are ignored by default, in accordance with the language's lexical specification. This approach simplifies syntax rule definitions, as the developer deals with meaningful tokens only.
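The effect of skipped tokens can be pictured with a small standalone sketch. The Token case class and the meaningful helper are hypothetical and exist only for this illustration; the library's own token model may differ.

```scala
// Toy model, NOT the library's token class: a token with a "skipped" flag.
object SkippedTokensSketch {
  final case class Token(kind: String, value: String, skipped: Boolean = false)

  // Syntax rules only ever see the tokens that are not marked as skipped.
  def meaningful(tokens: Seq[Token]): Seq[Token] = tokens.filterNot(_.skipped)

  def main(args: Array[String]): Unit = {
    val tokens = Seq(
      Token("[", "["),
      Token("whitespace", " ", skipped = true),
      Token("number", "1"),
      Token(",", ","),
      Token("lineBreak", "\n", skipped = true),
      Token("number", "2"),
      Token("]", "]")
    )
    // Only the five meaningful token kinds remain, in source order.
    println(meaningful(tokens).map(_.kind).mkString(" "))
  }
}
```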

An example of a syntax rule for parsing the Array construct of the JSON language:

...
    // "array" is the name of the rule and the kind of the target Node.
    val jsonArray = rule("array") {
      // Consists of three sequential parts: the "[" token, a series of nested
      // elements separated by "," tokens, and the closing "]" token.
      sequence(
        token("["), // Matches the "[" token.
        zeroOrMore( // Repetition of nested elements.
          // The result of each element parsed by the jsonValue rule is
          // saved to the branches multimap of the node under construction,
          // with the tag "value".
          branch("value", jsonValue),
          separator =
            // Matches the "," separator token between nested elements.
            // If the separator is missing, an error message is produced,
            // but parsing continues anyway.
            recover(token(","), "array entries must be separated with , sign")
        ),
        // Matches the closing "]" token. If it is missing, an error message
        // is produced, but the whole construct still counts as successfully
        // parsed, and the appropriate AST node is constructed anyway.
        recover(token("]"), "array must end with ] sign")
      )
    }
...

Full example can be found here: JSON parser.

Operators

Like lexical rules, syntax rule bodies are defined by composing built-in primitive rules. These primitive rules in fact represent PEG operators.

Token matcher operators

| Operator | Description |
| --- | --- |
| token(x) | Matches token of x kind. |
| tokensUntil(x) | Matches all tokens starting from the current position until the x token is met. x is included. |

Composition operators

| Operator | Description |
| --- | --- |
| optional(x) | Matches subrule x zero or one time. |
| zeroOrMore(x, separator) | Matches subrule x zero or more times. The pieces of code matched by x may be separated by another piece of code, defined by the optional separator parameter. |
| oneOrMore(x, separator) | Matches subrule x one or more times. x and separator have the same meaning as in zeroOrMore. |
| repeat(x, y) | Matches x repeated exactly y times. |
| sequence(x, y, z) | Matches a sequence of rules. |
| choice(x, y, z) | Ordered choice between rules. The next subrule is tried if and only if the preceding one has failed. |

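The behaviour of the composition operators can be sketched with a few toy PEG combinators. This is an illustrative model only, not the library's implementation: here a rule is just a function from a vector of token kinds and a position to the position after the match, and all names below merely echo the operator names from the table.

```scala
// Toy PEG combinators, NOT the library's implementation.
object PegOperatorsSketch {
  // Returns the position after a successful match, or None on failure.
  type Rule = (Vector[String], Int) => Option[Int]

  def token(kind: String): Rule =
    (in, pos) => if (pos < in.length && in(pos) == kind) Some(pos + 1) else None

  def sequence(rules: Rule*): Rule =
    (in, pos) => rules.foldLeft(Option(pos))((p, r) => p.flatMap(r(in, _)))

  // Ordered choice: an alternative is tried only if all previous ones failed.
  def choice(rules: Rule*): Rule =
    (in, pos) => rules.iterator.map(_(in, pos)).collectFirst { case Some(p) => p }

  def optional(x: Rule): Rule = (in, pos) => x(in, pos).orElse(Some(pos))

  // Never fails. A separator is consumed only if another x follows it.
  def zeroOrMore(x: Rule, separator: Rule): Rule = (in, pos) => {
    var end = pos
    var next = x(in, pos)
    while (next.exists(_ > end)) { // stop on failure or on an empty match
      end = next.get
      next = separator(in, end).flatMap(x(in, _))
    }
    Some(end)
  }

  def oneOrMore(x: Rule, separator: Rule): Rule =
    (in, pos) => x(in, pos).flatMap(_ => zeroOrMore(x, separator)(in, pos))

  val array: Rule =
    sequence(token("["), zeroOrMore(token("number"), token(",")), token("]"))

  def main(args: Array[String]): Unit = {
    println(array(Vector("[", "number", ",", "number", "]"), 0)) // Some(5)
    println(array(Vector("[", "number", ",", "]"), 0))           // None
  }
}
```

Note that this toy array rule simply rejects a trailing comma, whereas the jsonArray example above wraps its separator and closing bracket in recover so that such input still produces a node.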
Node construction operators

| Operator | Description |
| --- | --- |
| capture(tag, x) | Matches x and writes the token matched by x to the node's reference tagged with the tag string. |
| branch(tag, x) | Forces subrules of x to write their resulting nodes to the current node's branches multimap, using tag as the multimap key. |

Note that if capture, branch, or any rule that includes them fails, the references and branches multimaps of the node under construction are reverted to their initial state, as if those rules had not been applied at all.
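The revert-on-failure behaviour can be pictured with the following standalone sketch. It is a simplified model under stated assumptions, not the library's code: NodeBuilder is a hypothetical stand-in for the node under construction, and capture here takes a token value directly instead of a nested rule.

```scala
// Toy model, NOT the library's implementation: captures are rolled back
// when the enclosing rule fails.
object NodeRollbackSketch {
  // Hypothetical node under construction: tag -> captured token values.
  final class NodeBuilder {
    var references: Map[String, List[String]] = Map.empty
  }

  type Rule = (Vector[String], Int, NodeBuilder) => Option[Int]

  def token(value: String): Rule =
    (in, pos, _) => if (pos < in.length && in(pos) == value) Some(pos + 1) else None

  // Matches a single token and records it under the given tag.
  def capture(tag: String, value: String): Rule = (in, pos, node) =>
    token(value)(in, pos, node).map { next =>
      val captured = node.references.getOrElse(tag, Nil) :+ value
      node.references = node.references.updated(tag, captured)
      next
    }

  // Runs the rules in order; on failure restores the snapshot taken on entry.
  def sequence(rules: Rule*): Rule = (in, pos, node) => {
    val snapshot = node.references
    val result = rules.foldLeft(Option(pos))((p, r) => p.flatMap(r(in, _, node)))
    if (result.isEmpty) node.references = snapshot
    result
  }

  def main(args: Array[String]): Unit = {
    val node = new NodeBuilder
    val entry = sequence(capture("key", "a"), token("b"))
    println(entry(Vector("a", "c"), 0, node)) // fails on the second token
    println(node.references.isEmpty)          // the capture of "a" was reverted
  }
}
```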

Miscellaneous

| Operator | Description |
| --- | --- |
| name(label, x) | A convenient way to reuse a composite syntax rule. Useful for referring to the same rule multiple times instead of duplicating its code. In contrast to the .rule() and .mainRule() methods, .name() does not define a rule constructor; it simply returns the nested rule's result. Note also that, unlike with rule and mainRule, x is evaluated eagerly, so named rules cannot refer to each other cyclically. |
| recover(x, exception) | Allows parsing to recover when subrule x fails: the error message exception is reported and parsing continues. |
| expression(tag, atom) | Defines an expression parsing rule constructor. tag is an optional parameter that binds the top-level node of the expression's node subtree to the current node's branches multimap. |
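The idea behind recover can be sketched in isolation. The model below is an assumption-laden illustration rather than the library's implementation: ErrorLog is a hypothetical error collector, and a recovered rule is modelled as succeeding without consuming any input.

```scala
// Toy model, NOT the library's implementation: a failed subrule is turned
// into a reported error and parsing continues from the same position.
object RecoverSketch {
  // Hypothetical error collector.
  final class ErrorLog { var messages: Vector[String] = Vector.empty }

  type Rule = (Vector[String], Int, ErrorLog) => Option[Int]

  def token(kind: String): Rule =
    (in, pos, _) => if (pos < in.length && in(pos) == kind) Some(pos + 1) else None

  def sequence(rules: Rule*): Rule = (in, pos, log) =>
    rules.foldLeft(Option(pos))((p, r) => p.flatMap(r(in, _, log)))

  // If x fails, record the message and succeed at the current position.
  def recover(x: Rule, message: String): Rule = (in, pos, log) =>
    x(in, pos, log).orElse {
      log.messages = log.messages :+ message
      Some(pos)
    }

  def main(args: Array[String]): Unit = {
    val log = new ErrorLog
    val array = sequence(
      token("["),
      token("number"),
      recover(token("]"), "array must end with ] sign")
    )
    // The closing bracket is missing, yet the rule as a whole succeeds.
    println(array(Vector("[", "number"), 0, log))
    println(log.messages.mkString("; "))
  }
}
```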