Tutorial 6.1: Syntax definition API

API

To define a syntax parser, the developer decomposes the programming language's syntax into named rules. Each rule parses a specific code construct and is responsible for building the corresponding AST node. Together these rules form the language's grammar, which belongs to the Parsing Expression Grammar (PEG) class.

Rules are defined with a method of a Syntax class instance: .rule(ruleName: String)(ruleBody: => Rule). Each such rule builds nodes whose kind is the same as ruleName. The rule that represents the top-level node of the AST hierarchy should be defined with the mainRule method instead.

The reference returned by the rule definition method can be used to refer to this rule from other rules. Hence rules can refer to each other, forming a graph of rules. Note that rule bodies are defined lazily, so it is safe to define circular references between them. In the JSON example, jsonObject refers to objectEntry, objectEntry refers to jsonValue, and jsonValue refers to jsonObject again.
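To illustrate why lazily defined bodies make such cycles safe, here is a minimal sketch. It is a toy model, not Papa Carlo's implementation: the `Ref`, `Token`, `Sequence` and `Choice` types and the `LazyRules` object are hypothetical names introduced for this example only. The only thing borrowed from the API above is the shape of `rule(ruleName)(ruleBody: => Rule)`, where the body is a by-name parameter.

```scala
// Toy model (assumption): rule bodies are by-name parameters that are
// evaluated on first use, not at definition time.
object LazyRules {
  sealed trait Rule
  final case class Token(kind: String) extends Rule
  final case class Sequence(parts: List[Rule]) extends Rule
  final case class Choice(cases: List[Rule]) extends Rule

  // A named rule. The body is not evaluated until `resolved` is first read.
  final class Ref(val name: String, body: => Rule) extends Rule {
    lazy val resolved: Rule = body
  }

  def rule(name: String)(body: => Rule): Ref = new Ref(name, body)

  // Cycle: jsonObject -> objectEntry -> jsonValue -> jsonObject.
  // Forward references are safe because no body runs during initialization.
  val jsonObject: Ref = rule("object") {
    Sequence(List(Token("{"), objectEntry, Token("}")))
  }

  val objectEntry: Ref = rule("entry") {
    Sequence(List(Token("string"), Token(":"), jsonValue))
  }

  val jsonValue: Ref = rule("value") {
    Choice(List(Token("string"), jsonObject))
  }
}
```

By the time any body is evaluated, all three references already exist, so the cycle resolves to the same objects rather than to uninitialized values.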

Remember that syntax rules are applied only to tokens that are not marked as skipped. Thus whitespace, line breaks and comments are ignored by default, according to the language's lexical specification. This approach simplifies syntax rule definitions, because the developer deals with meaningful tokens only.

An example of a syntax rule that parses the array construct of JSON:

...
    // "array" is both the name of the rule and the kind of the target node.
    val jsonArray = rule("array") {
      // Consists of three sequential parts: the "[" token, a series of nested
      // elements separated by "," tokens, and the closing "]" token.
      sequence(
        token("["), // Matches "[" token.
        zeroOrMore( // Repetition of nested elements
          // The result of each element parsed by the jsonValue rule is
          // saved to the branches multimap of the node under construction
          // with the tag "value".
          branch("value", jsonValue),
          separator =
            // Matches the "," separator token between nested elements.
            // If the separator is missing, an error message is produced, but
            // parsing continues anyway.
            recover(token(","), "array entries must be separated with , sign")
        ),
        // Matches the closing "]" token. If it is missing, an error message is
        // produced, but the whole construct still counts as parsed successfully
        // and the corresponding AST node is constructed anyway.
        recover(token("]"), "array must end with ] sign")
      )
    }
...

The full example can be found here: JSON parser.

Operators

Like lexical rules, syntax rule bodies are defined by composing built-in primitive rules. These primitive rules represent PEG operators.

Token matcher operators

Operator	Description
`token(x)`	Matches a token of kind x.
`tokensUntil(x)`	Matches all tokens from the current position up to the first x token. The x token itself is included.

Composition operators

Operator	Description
`optional(x)`	Matches subrule x zero or one time.
`zeroOrMore(x, separator)`	Matches subrule x zero or more times. The pieces of code matched by subrule x may be separated by another piece of code, defined by the optional separator parameter.
`oneOrMore(x, separator)`	Matches subrule x one or more times. x and separator have the same meaning as in the `.zeroOrMore` method.
`repeat(x, y)`	Matches x repeated exactly y times.
`sequence(x, y, z)`	Matches a sequence of rules.
`choice(x, y, z)`	Ordered choice between rules. The next subrule is tried if and only if the preceding one has failed.
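The behaviour of the composition operators can be sketched with a few plain functions over a token vector. This is a simplified illustration, not the library's code: the `MiniPeg` object, the `Rule` type alias and the `Option`-typed separator are assumptions made to keep the example self-contained. Here a rule takes the tokens and a position, and returns the new position on success.

```scala
// Toy PEG combinators (assumption: a rule maps a position to Some(new position)
// on success or None on failure; tokens are represented by their kind names).
object MiniPeg {
  type Rule = (Vector[String], Int) => Option[Int]

  def token(kind: String): Rule =
    (ts, i) => if (i < ts.length && ts(i) == kind) Some(i + 1) else None

  // Every part must match, each starting where the previous one stopped.
  def sequence(rules: Rule*): Rule =
    (ts, i) => rules.foldLeft(Option(i))((pos, r) => pos.flatMap(r(ts, _)))

  // Ordered choice: an alternative is tried only if all previous ones failed.
  def choice(rules: Rule*): Rule =
    (ts, i) => rules.iterator.map(_(ts, i)).collectFirst { case Some(p) => p }

  // Zero or one occurrence: failure of x is turned into an empty match.
  def optional(x: Rule): Rule =
    (ts, i) => x(ts, i).orElse(Some(i))

  // One x, then as many "separator x" pairs (or bare x) as possible.
  def oneOrMore(x: Rule, separator: Option[Rule] = None): Rule = (ts, i) => {
    val next = separator.fold(x)(sep => sequence(sep, x))
    @annotation.tailrec
    def loop(pos: Int): Int = next(ts, pos) match {
      case Some(p) if p > pos => loop(p) // progress made, keep repeating
      case _                  => pos     // stop at the last good position
    }
    x(ts, i).map(loop)
  }

  def zeroOrMore(x: Rule, separator: Option[Rule] = None): Rule =
    optional(oneOrMore(x, separator))
}
```

With these definitions, `sequence(token("["), zeroOrMore(token("n"), Some(token(","))), token("]"))` accepts both an empty and a comma-separated list, while the same rule built with `oneOrMore` rejects the empty one.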

Node construction operators

Operator	Description
`capture(tag, x)`	Matches x and writes the tokens matched by x to the node's reference tagged with the `tag` string.
`branch(tag, x)`	Forces the subrules of x to write their resulting nodes to the current node's branches multimap, using `tag` as the multimap key.

Note that if capture, branch, or any rule that contains them fails, the references and branches multimaps of the node under construction are reverted to their initial state, as if these rules had never been applied.
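The rollback described above can be pictured with a small sketch. It only models the idea and is not how the library is implemented: `RollbackSketch`, `State` and the snapshot-and-restore strategy are assumptions, and only a references map is modelled (branches would be handled the same way).

```scala
// Toy model (assumption): captures are undone by restoring a snapshot taken
// before the enclosing rule started.
object RollbackSketch {
  import scala.collection.mutable

  // Input tokens, current position, and references of the node being built.
  final class State(val tokens: Vector[String]) {
    var pos: Int = 0
    val references: mutable.Map[String, List[String]] = mutable.Map.empty
  }

  type Rule = State => Boolean

  def token(kind: String): Rule = s => {
    val ok = s.pos < s.tokens.length && s.tokens(s.pos) == kind
    if (ok) s.pos += 1
    ok
  }

  // Runs x; on success records the tokens it matched under `tag`.
  def capture(tag: String, x: Rule): Rule = s => {
    val start = s.pos
    val ok = x(s)
    if (ok) {
      val matched = s.tokens.slice(start, s.pos).toList
      s.references(tag) = s.references.getOrElse(tag, Nil) ++ matched
    }
    ok
  }

  // Runs the rules in order; if any fails, restores position and references.
  def sequence(rules: Rule*): Rule = s => {
    val savedPos = s.pos
    val savedRefs = s.references.toMap
    val ok = rules.forall(_(s))
    if (!ok) {
      s.pos = savedPos
      s.references.clear()
      s.references ++= savedRefs
    }
    ok
  }
}
```

In this sketch, a rule such as `sequence(capture("key", token("k")), token(":"), capture("value", token("v")))` leaves the references empty when the last part fails, even though the first capture had already succeeded.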

Miscellaneous

Operator	Description
`name(label, x)`	A convenient way to reuse a composite syntax rule. It is useful for referring to the same rule multiple times instead of duplicating its code. In contrast to the `.rule()` and `.mainRule()` methods, `.name()` does not define a rule constructor. It simply returns the nested rule's result. Note that, also in contrast to `rule` and `mainRule`, x is defined eagerly. Therefore named rules cannot refer to each other cyclically.
`recover(x, exception)`	Allows parsing to recover when subrule x fails to match: the error message exception is reported and parsing continues.
`expression(tag, atom)`	Defines an expression-parsing rule constructor. tag is an optional parameter that binds the top-level node of the expression's node subtree to the current node's `branches` multimap.
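As a rough illustration of the recover semantics used in the array example (a missing token produces an error message, but the construct still counts as parsed), consider the following sketch. It is a simplified model under stated assumptions, not the library's recovery mechanism: `RecoverSketch`, `State` and the `errors` list are hypothetical names.

```scala
// Toy model (assumption): recover turns a failed match into a logged error
// and reports success without consuming any input.
object RecoverSketch {
  final class State(val tokens: Vector[String]) {
    var pos: Int = 0
    var errors: Vector[String] = Vector.empty
  }

  type Rule = State => Boolean

  def token(kind: String): Rule = s => {
    val ok = s.pos < s.tokens.length && s.tokens(s.pos) == kind
    if (ok) s.pos += 1
    ok
  }

  // Stops at the first failing part; no rollback is modelled here.
  def sequence(rules: Rule*): Rule = s => rules.forall(_(s))

  // If x matches, nothing else happens. If it fails, the message is
  // recorded and the rule still succeeds.
  def recover(x: Rule, message: String): Rule = s => {
    val ok = x(s)
    if (!ok) s.errors = s.errors :+ message
    true
  }

  // A cut-down array rule: "[" then one "n" element then a recoverable "]".
  val array: Rule =
    sequence(token("["), token("n"), recover(token("]"), "array must end with ] sign"))
}
```

Applied to the tokens `[`, `n` with the closing bracket missing, `array` still succeeds in this model and records a single error message.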
