
More About Rules

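In the earlier page about rules (Rule Parsers), I described rules as being analogous to functions. Rules are, at base, organizational. Here are the common use cases for rules. Use a rule if you want to:

- fix the attribute type produced by a parser;

- create a parser that produces better diagnostic text;

- create a recursive rule;

- create a set of mutually-recursive parsers; or

- do callback parsing.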

Let's look at the use cases in detail.

Fixing the attribute type

We saw in the previous section how parse() is flexible in what types it will accept as attribute out-parameters. Here's another example.

namespace bp = boost::parser;
bp::parse(input, bp::int_ % ',', result);

result can be one of many different types. It could be std::vector<int>. It could be std::set<long long>. It could be a lot of things. Often, this is a very useful property; if you had to rewrite all of your parser logic because you changed the desired container in some part of your attribute from a std::vector to a std::deque, that would be annoying. However, that flexibility comes at the cost of type checking. If you want to write a parser that always produces exactly a std::vector<unsigned int> and no other type, you also probably want a compilation error if you accidentally pass that parser a std::set<unsigned int> attribute instead. There is no way with a plain parser to enforce that its attribute type may only ever be a single, fixed type.
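
The trade-off can be seen in a plain C++ analogue. The sketch below is not Boost.Parser code; parse_int_list is a hypothetical hand-written function, templated on its output container in roughly the way parse() is generic over its attribute out-parameter. It accepts any container that supports insert(end(), value), so nothing stops a caller from passing a container type the author never intended.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a flexible parser: parses "1,2,3" into any
// container that supports insert(end(), value).
template<typename Container>
bool parse_int_list(std::string_view input, Container & out)
{
    std::size_t i = 0;
    while (true) {
        // An integer is required at this position.
        if (i == input.size() ||
            !std::isdigit(static_cast<unsigned char>(input[i])))
            return false;
        typename Container::value_type value = 0;
        while (i < input.size() &&
               std::isdigit(static_cast<unsigned char>(input[i])))
            value = value * 10 + (input[i++] - '0');
        out.insert(out.end(), value);
        if (i == input.size())
            return true; // consumed everything
        if (input[i++] != ',')
            return false; // anything other than a comma is an error
    }
}
```

Both a std::vector<int> and a std::set<long long> compile here. A non-template overload taking only std::vector<unsigned int> & would reject the set at compile time, which is the kind of checking the rest of this section gets from rules.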

Fortunately, rules allow you to write a parser that has a fixed attribute type. Every rule has a specific attribute type, provided as a template parameter. If one is not specified, the rule has no attribute. The fact that the attribute is a specific type allows you to remove attribute flexibility. For instance, say we have a rule defined like this:

bp::rule<struct doubles, std::vector<double>> doubles = "doubles";
auto const doubles_def = bp::double_ % ',';
BOOST_PARSER_DEFINE_RULES(doubles);

You can then use it in a call to parse(), and parse() will return a std::optional<std::vector<double>>:

auto const result = bp::parse(input, doubles, bp::ws);

If you call parse() with an attribute out-parameter, it must be exactly std::vector<double>:

std::vector<double> vec_result;
bp::parse(input, doubles, bp::ws, vec_result); // Ok.
std::deque<double> deque_result;
bp::parse(input, doubles, bp::ws, deque_result); // Ill-formed!

If we wanted to use a std::deque<double> as the attribute type of our rule:

// Attribute changed to std::deque<double>.
bp::rule<struct doubles, std::deque<double>> doubles = "doubles";
auto const doubles_def = bp::double_ % ',';
BOOST_PARSER_DEFINE_RULES(doubles);

int main()
{
    std::deque<double> deque_result;
    bp::parse(input, doubles, bp::ws, deque_result); // Ok.
}

The take-away here is that the attribute flexibility is still available, but only within the rule — the parser bp::double_ % ',' can parse into a std::vector<double> or a std::deque<double>, but the rule doubles must parse into only the exact attribute it was declared to generate.

The reason for this is that, inside the rule parsing implementation, there is code something like this:

using attr_t = ATTR(doubles_def);
attr_t attr;
parse(first, last, parser, attr);
attribute_out_param = std::move(attr);

Where attribute_out_param is the attribute out-parameter we pass to parse(). If that final move assignment is ill-formed, the call to parse() is too.
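
The effect of that final move assignment can be checked in isolation with standard type traits. This is a minimal sketch in plain standard C++, not Boost.Parser's actual implementation; accepts_out_param is a hypothetical name used only for illustration.

```cpp
#include <deque>
#include <type_traits>
#include <vector>

// Hypothetical helper: true if `OutParam & = Attr &&` is well-formed, which
// is the condition the final move assignment above imposes.
template<typename OutParam, typename Attr>
constexpr bool accepts_out_param =
    std::is_assignable_v<OutParam &, Attr &&>;

// A rule whose attribute is std::vector<double> can fill a
// std::vector<double> out-parameter ...
static_assert(accepts_out_param<std::vector<double>, std::vector<double>>);

// ... but not a std::deque<double>, because std::deque has no assignment
// operator that takes a std::vector.
static_assert(!accepts_out_param<std::deque<double>, std::vector<double>>);
```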

You can also use rules to exploit attribute flexibility. Even though a rule reduces the flexibility of attributes it can generate, the fact that it is so easy to write a new rule means that we can use rules themselves to get the attribute flexibility we want across our code:

namespace bp = boost::parser;

// We only need to write the definition once...
auto const generic_doubles_def = bp::double_ % ',';

bp::rule<struct vec_doubles, std::vector<double>> vec_doubles = "vec_doubles";
auto const & vec_doubles_def = generic_doubles_def; // ... and re-use it,
BOOST_PARSER_DEFINE_RULES(vec_doubles);

// Attribute changed to std::deque<double>.
bp::rule<struct deque_doubles, std::deque<double>> deque_doubles = "deque_doubles";
auto const & deque_doubles_def = generic_doubles_def; // ... and re-use it again.
BOOST_PARSER_DEFINE_RULES(deque_doubles);

Now we have one of each, and we did not have to copy any parsing logic that would have to be maintained in two places.

Sometimes, you need to create a rule to enforce a certain attribute type, but the rule's attribute is not constructible from its parser's attribute. When that happens, you'll need to write a semantic action.

struct type_t
{
    type_t() = default;
    explicit type_t(double x) : x_(x) {}
    // etc.

    double x_;
};

namespace bp = boost::parser;

auto doubles_to_type = [](auto & ctx) {
    using namespace bp::literals;
    _val(ctx) = type_t(_attr(ctx)[0_c] * _attr(ctx)[1_c]);
};

bp::rule<struct type_tag, type_t> type = "type";
auto const type_def = (bp::double_ >> bp::double_)[doubles_to_type];
BOOST_PARSER_DEFINE_RULES(type);

For a rule R and its parser P, we do not need to write such a semantic action if:

- ATTR(R) is an aggregate, and ATTR(P) is a compatible tuple;

- ATTR(R) is a tuple, and ATTR(P) is a compatible aggregate;

- ATTR(R) is a non-aggregate class type C, and ATTR(P) is a tuple whose elements can be used to construct C; or

- ATTR(R) and ATTR(P) are compatible types.

The notion of "compatible" is defined in The parse() API.
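
To see why type_t above needs a semantic action, consider the third condition using standard library tools. The sketch below is an approximation in plain C++, not the checks Boost.Parser performs: std::tuple stands in for boost::parser::tuple, and point_t, constructible_from_elements, and to_point are hypothetical names.

```cpp
#include <tuple>
#include <type_traits>
#include <utility>

// The class from the example above: its only non-default constructor takes
// a single double.
struct type_t
{
    type_t() = default;
    explicit type_t(double x) : x_(x) {}
    double x_;
};

// Hypothetical non-aggregate class that can be built from two doubles.
struct point_t
{
    point_t() = default;
    point_t(double x, double y) : x_(x), y_(y) {}
    double x_ = 0.0;
    double y_ = 0.0;
};

// Hypothetical trait: can C be constructed from the elements of Tuple?
template<typename C, typename Tuple>
struct constructible_from_elements : std::false_type
{};
template<typename C, typename... Ts>
struct constructible_from_elements<C, std::tuple<Ts...>>
    : std::is_constructible<C, Ts...>
{};

// Stands in for ATTR(bp::double_ >> bp::double_).
using two_doubles = std::tuple<double, double>;

// type_t cannot be made from two doubles, so a semantic action is needed.
static_assert(!constructible_from_elements<type_t, two_doubles>::value);
// point_t can, so the conversion could be done automatically.
static_assert(constructible_from_elements<point_t, two_doubles>::value);

// When the trait holds, std::make_from_tuple performs the conversion.
inline point_t to_point(two_doubles t)
{
    return std::make_from_tuple<point_t>(std::move(t));
}
```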

Creating a parser for better diagnostics

Each rule has associated diagnostic text that Boost.Parser can use for failures of that rule. This is useful when the parse reaches a parse failure at an expectation point (see Expectation points). Let's say you have the following code defined somewhere.

namespace bp = boost::parser;

bp::rule<struct value_tag> value =
    "an integer, or a list of integers in braces";

auto const ints = '{' > (value % ',') > '}';
auto const value_def = bp::int_ | ints;

BOOST_PARSER_DEFINE_RULES(value);

Notice the two expectation points. One before (value % ','), one before the final '}'. Later, you call parse on some input:

bp::parse("{ 4, 5 a", value, bp::ws);

This runs afoul of the second expectation point, and produces output like this:

1:7: error: Expected '}' here:
{ 4, 5 a
       ^

That's a pretty good error message. Here's what it looks like if we violate the earlier expectation:

bp::parse("{ }", value, bp::ws);

1:2: error: Expected an integer, or a list of integers in braces % ',' here:
{ }
  ^

Not nearly as nice. The problem is that the expectation is on (value % ','). So, even though we gave value reasonable diagnostic text, we put the text on the wrong thing. We can introduce a new rule to put the diagnostic text in the right place.

namespace bp = boost::parser;

bp::rule<struct value_tag> value =
    "an integer, or a list of integers in braces";
bp::rule<struct comma_values_tag> comma_values =
    "a comma-delimited list of integers";

auto const ints = '{' > comma_values > '}';
auto const value_def = bp::int_ | ints;
auto const comma_values_def = (value % ',');

BOOST_PARSER_DEFINE_RULES(value, comma_values);

Now when we call bp::parse("{ }", value, bp::ws) we get a much better message:

1:2: error: Expected a comma-delimited list of integers here:
{ }
  ^

The rule value might be useful elsewhere in our code, perhaps in another parser. Its diagnostic text is appropriate for those other potential uses.

Recursive rules

It's pretty common to see grammars that include recursive rules. Consider this EBNF rule for balanced parentheses:

<parens> ::= "" | ( "(" <parens> ")" )

We can try to write this using Boost.Parser like this:

namespace bp = boost::parser;
auto const parens = '(' >> parens >> ')' | bp::eps;

We had to put the bp::eps second, because Boost.Parser's parsing algorithm is greedy. Otherwise, it's just a straight transliteration. Unfortunately, it does not work. The code is ill-formed because you can't define a variable in terms of itself. Well you can, but nothing good comes of it. If we instead make the parser in terms of a forward-declared rule, it works.

namespace bp = boost::parser;
bp::rule<struct parens_tag> parens = "matched parentheses";
auto const parens_def = '(' >> parens > ')' | bp::eps;
BOOST_PARSER_DEFINE_RULES(parens);

Later, if we use it to parse, it does what we want.

assert(bp::parse("(((())))", parens, bp::ws));

When it fails, it even produces nice diagnostics.

bp::parse("(((()))", parens, bp::ws);

1:7: error: Expected ')' here (end of input):
(((()))
       ^
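
The reason the forward-declared rule works where the self-referential variable did not is the same reason recursive functions work in C++: the name is declared before its definition refers to it. As a plain C++ analogy only (this is not how Boost.Parser is implemented, and parens and matches_parens here are hypothetical functions), the same grammar can be recognized like this:

```cpp
#include <string_view>

// Declared first, in the way the rule `parens` is declared before
// `parens_def` uses it.
bool parens(std::string_view & input);

// parens ::= '(' parens ')' | <empty>
bool parens(std::string_view & input)
{
    if (input.empty() || input.front() != '(')
        return true; // the <empty> alternative always matches
    input.remove_prefix(1); // consume '('
    if (!parens(input))
        return false;
    if (input.empty() || input.front() != ')')
        return false; // corresponds to the expectation on ')'
    input.remove_prefix(1); // consume ')'
    return true;
}

// A full match requires that the whole input was consumed.
bool matches_parens(std::string_view input)
{
    return parens(input) && input.empty();
}
```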

Recursive rules work differently from other parsers in one way: when re-entering the rule recursively, only the attribute variable (_attr(ctx) in your semantic actions) is unique to that instance of the rule. All the other state of the uppermost instance of that rule is shared. This includes the value of the rule (_val(ctx)), and the locals and parameters to the rule. In other words, _val(ctx) returns a reference to the same object in every instance of a recursive rule. This is because each instance of the rule needs a place to put the attribute it generates from its parse. However, we only want a single return value for the uppermost rule; if each instance had a separate value in _val(ctx), then it would be impossible to build up the result of a recursive rule step by step during the evaluation of the recursive instantiations.
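
The sharing described above can be pictured with a hand-written recursive function. This is only an analogy in plain C++, under the assumption of a whitespace-separated list of non-negative integers; parse_ints is a hypothetical function, val plays the role of _val(ctx), and attr plays the role of _attr(ctx).

```cpp
#include <cctype>
#include <string_view>
#include <vector>

// Hand-written analogue of a recursive rule `ints ::= int ints | <empty>`.
// `val` is one object shared by every level of the recursion.
inline void parse_ints(std::string_view & input, std::vector<int> & val)
{
    // Skip whitespace.
    while (!input.empty() &&
           std::isspace(static_cast<unsigned char>(input.front())))
        input.remove_prefix(1);

    // The <empty> alternative: no digit here, so this level matches nothing.
    if (input.empty() ||
        !std::isdigit(static_cast<unsigned char>(input.front())))
        return;

    int attr = 0; // unique to this level of the recursion
    while (!input.empty() &&
           std::isdigit(static_cast<unsigned char>(input.front()))) {
        attr = attr * 10 + (input.front() - '0');
        input.remove_prefix(1);
    }

    val.push_back(attr);    // build up the shared result step by step
    parse_ints(input, val); // recurse with the same `val`
}
```

If each level had its own val, the innermost call would fill a vector that the outer levels never see, and the pieces would have to be merged on the way back out.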

Also, consider this rule:

namespace bp = boost::parser;
bp::rule<struct ints_tag, std::vector<int>> ints = "ints";
auto const ints_def = bp::int_ >> ints | bp::eps;

What is the default attribute type for ints_def? It sure looks like std::optional<std::vector<int>>. Inside the evaluation of ints, Boost.Parser must evaluate ints_def, and then produce a std::vector<int> — the return type of ints — from it. How? How do you turn a std::optional<std::vector<int>> into a std::vector<int>? To a human, it seems obvious, but the metaprogramming that properly handles this simple example and the general case is certainly beyond me.

Boost.Parser has a specific semantic for what constitutes a recursive rule. Each rule has a tag type associated with it, and if Boost.Parser enters a rule with a certain tag Tag, and the currently-evaluating rule (if there is one) also has the tag Tag, then the rule instance being entered is considered to be a recursion. No other situations are considered recursion. In particular, if you have rules Ra and Rb, and Ra uses Rb, which in turn uses Ra, the second use of Ra is not considered recursion. Ra and Rb are of course mutually recursive, but neither is considered a "recursive rule" for purposes of getting a unique value, locals, and parameters.
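
The tag comparison can be modeled at compile time. The following is a hypothetical model of the test described above, not Boost.Parser's code; no_rule, ra_tag, rb_tag, and is_recursion are names invented for this sketch.

```cpp
#include <type_traits>

struct no_rule {}; // hypothetical marker: no rule is currently being evaluated
struct ra_tag {};  // tag of rule Ra
struct rb_tag {};  // tag of rule Rb

// Entering a rule tagged `Entering` is a recursion only if the rule being
// evaluated right now has the same tag.
template<typename Entering, typename Current>
constexpr bool is_recursion = std::is_same_v<Entering, Current>;

static_assert(!is_recursion<ra_tag, no_rule>); // top-level use of Ra
static_assert(is_recursion<ra_tag, ra_tag>);   // Ra used directly inside Ra
static_assert(!is_recursion<rb_tag, ra_tag>);  // Rb used inside Ra
static_assert(!is_recursion<ra_tag, rb_tag>);  // Ra re-entered through Rb
```

Only the immediately enclosing rule is consulted, which is why the Ra, Rb, Ra chain in the last line does not count.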

Mutually-recursive rules

One of the advantages of using rules is that you can declare all your rules up front and then use them immediately afterward. This lets you make rules that use each other without introducing cycles:

namespace bp = boost::parser;
using namespace bp::literals; // For the _l literal suffix used below.

// Assume we have some polymorphic type that can be an object/dictionary,
// array, string, or int, called `value_type`.

bp::rule<class string, std::string> const string = "string";
bp::rule<class object_element, bp::tuple<std::string, value_type>> const object_element = "object-element";
bp::rule<class object, value_type> const object = "object";
bp::rule<class array, value_type> const array = "array";
bp::rule<class value_tag, value_type> const value = "value";

auto const string_def = bp::lexeme['"' >> *(bp::char_ - '"') > '"'];
auto const object_element_def = string > ':' > value;
auto const object_def = '{'_l >> -(object_element % ',') > '}';
auto const array_def = '['_l >> -(value % ',') > ']';
auto const value_def = bp::int_ | bp::bool_ | string | array | object;

BOOST_PARSER_DEFINE_RULES(string, object_element, object, array, value);

Here we have a parser for a JavaScript-value-like type value_type. value_type may be an array, which itself may contain other arrays, objects, strings, etc. Since we need to be able to parse objects within arrays and vice versa, we need each of those two parsers to be able to refer to each other.

Callback parsing

Only rules can be callback parsers, so if you want to get attributes supplied to you via callbacks instead of somewhere in the middle of a giant attribute that represents the whole parse result, you need to use rules. See Parsing JSON With Callbacks for an extended example of callback parsing.

Accessors available in semantic actions on rules

_val()

Inside all of a rule's semantic actions, the expression _val(ctx) is a reference to the attribute that the rule generates. This can be useful when you want subparsers to build up the attribute in a specific way:

namespace bp = boost::parser;
using namespace bp::literals;

bp::rule<class ints, std::vector<int>> const ints = "ints";
auto twenty_zeros = [](auto & ctx) { _val(ctx).resize(20, 0); };
auto push_back = [](auto & ctx) { _val(ctx).push_back(_attr(ctx)); };
auto const ints_def = "20-zeros"_l[twenty_zeros] | +bp::int_[push_back];
BOOST_PARSER_DEFINE_RULES(ints);

Tip

That's just an example. It's almost always better to do things without using semantic actions. We could have instead written ints_def as "20-zeros" >> bp::attr(std::vector<int>(20)) | +bp::int_, which has the same semantics, is a lot easier to read, and is a lot less code.

Locals

The rule template takes another template parameter we have not discussed yet. You can pass a third parameter LocalState to rule, which will be default-constructed by the rule, and made available within semantic actions used in the rule as _locals(ctx). This gives your rule some local state, if it needs it. The type of LocalState can be anything regular. It could be a single value, a struct containing multiple values, or a tuple, among others.

struct foo_locals
{
    char first_value = 0;
};

namespace bp = boost::parser;

bp::rule<class foo, int, foo_locals> const foo = "foo";

auto record_first = [](auto & ctx) { _locals(ctx).first_value = _attr(ctx); };
auto check_against_first = [](auto & ctx) {
    char const first = _locals(ctx).first_value;
    char const attr = _attr(ctx);
    if (attr == first)
        _pass(ctx) = false;
    _val(ctx) = (int(first) << 8) | int(attr);
};

auto const foo_def = bp::cu[record_first] >> bp::cu[check_against_first];
BOOST_PARSER_DEFINE_RULES(foo);

foo matches the input if it can match two elements of the input in a row, but only if they are not the same value. Without locals, it's a lot harder to write parsers that have to track state as they parse.

Parameters

Sometimes, it is convenient to parameterize parsers. Consider these parsing rules from the YAML 1.2 spec:

[80]
s-separate(n,BLOCK-OUT) ::= s-separate-lines(n)
s-separate(n,BLOCK-IN)  ::= s-separate-lines(n)
s-separate(n,FLOW-OUT)  ::= s-separate-lines(n)
s-separate(n,FLOW-IN)   ::= s-separate-lines(n)
s-separate(n,BLOCK-KEY) ::= s-separate-in-line
s-separate(n,FLOW-KEY)  ::= s-separate-in-line

[136]
in-flow(n,FLOW-OUT)  ::= ns-s-flow-seq-entries(n,FLOW-IN)
in-flow(n,FLOW-IN)   ::= ns-s-flow-seq-entries(n,FLOW-IN)
in-flow(n,BLOCK-KEY) ::= ns-s-flow-seq-entries(n,FLOW-KEY)
in-flow(n,FLOW-KEY)  ::= ns-s-flow-seq-entries(n,FLOW-KEY)

[137]
c-flow-sequence(n,c) ::= “[” s-separate(n,c)? in-flow(n,c)? “]”

YAML [137] says that the parsing should proceed into two YAML subrules, both of which have these n and c parameters. It is certainly possible to transliterate these YAML parsing rules to something that uses unparameterized Boost.Parser rules, but it is quite painful to do so. It is better to use a parameterized rule.

You give parameters to a rule by calling its with() member. The values you pass to with() are used to create a boost::parser::tuple that is available in semantic actions attached to the rule, using _params(ctx).

Passing parameters to rules like this allows you to easily write parsers that change the way they parse depending on contextual data that they have already parsed.

Here is an implementation of YAML [137]. It also implements the two YAML rules used directly by [137], rules [136] and [80]. The rules that those rules use are also represented below, but are implemented using only eps, so that I don't have to repeat too much of the (very large) YAML spec.

namespace bp = boost::parser;

// A type to represent the YAML parse context.
enum class context {
    block_in,
    block_out,
    block_key,
    flow_in,
    flow_out,
    flow_key
};

// A YAML value; no need to fill it in for this example.
struct value
{
    // ...
};

// YAML [66], just stubbed in here.
auto const s_separate_in_line = bp::eps;

// YAML [137].
bp::rule<struct c_flow_seq_tag, value> c_flow_sequence = "c-flow-sequence";
// YAML [80].
bp::rule<struct s_separate_tag> s_separate = "s-separate";
// YAML [136].
bp::rule<struct in_flow_tag, value> in_flow = "in-flow";
// YAML [138]; just eps below.
bp::rule<struct ns_s_flow_seq_entries_tag, value> ns_s_flow_seq_entries =
    "ns-s-flow-seq-entries";
// YAML [81]; just eps below.
bp::rule<struct s_separate_lines_tag> s_separate_lines = "s-separate-lines";

// Parser for YAML [137].
auto const c_flow_sequence_def =
    '[' >>
    -s_separate.with(bp::_p<0>, bp::_p<1>) >>
    -in_flow.with(bp::_p<0>, bp::_p<1>) >>
    ']';
// Parser for YAML [80].
auto const s_separate_def = bp::switch_(bp::_p<1>)
    (context::block_out, s_separate_lines.with(bp::_p<0>))
    (context::block_in, s_separate_lines.with(bp::_p<0>))
    (context::flow_out, s_separate_lines.with(bp::_p<0>))
    (context::flow_in, s_separate_lines.with(bp::_p<0>))
    (context::block_key, s_separate_in_line)
    (context::flow_key, s_separate_in_line);
// Parser for YAML [136].
auto const in_flow_def = bp::switch_(bp::_p<1>)
    (context::flow_out, ns_s_flow_seq_entries.with(bp::_p<0>, context::flow_in))
    (context::flow_in, ns_s_flow_seq_entries.with(bp::_p<0>, context::flow_in))
    (context::block_key, ns_s_flow_seq_entries.with(bp::_p<0>, context::flow_key))
    (context::flow_key, ns_s_flow_seq_entries.with(bp::_p<0>, context::flow_key));

auto const ns_s_flow_seq_entries_def = bp::eps;
auto const s_separate_lines_def = bp::eps;

BOOST_PARSER_DEFINE_RULES(
    c_flow_sequence,
    s_separate,
    in_flow,
    ns_s_flow_seq_entries,
    s_separate_lines);

YAML [137] (c_flow_sequence) parses a list. The list may be empty, and must be surrounded by brackets, as you see here. But, depending on the current YAML context (the c parameter to [137]), we may require certain spacing to be matched by s-separate, and how sub-parser in-flow behaves also depends on the current context.

In s_separate above, we parse differently based on the value of c. This is done above by using the value of the second parameter to s_separate in a switch-parser. The second parameter is looked up by using _p as a parse argument.

in_flow does something similar. Note that in_flow calls its subrule by passing its first parameter, but using a fixed value for the second. s_separate only passes its n parameter conditionally. The point is that a rule can be used with and without .with(), and that you can pass constants or parse arguments to .with().

With those rules defined, we could write a unit test for YAML [137] like this:

auto const test_parser = c_flow_sequence.with(4, context::block_out);
auto result = bp::parse("[]", test_parser);
assert(result);

You could extend this with tests for different values of n and c. Obviously, in real tests, you would parse actual contents inside the "[]", if the other rules, like [138], were implemented.

The _p variable template

Getting at one of a rule's arguments and passing it as an argument to another parser can be very verbose. _p is a variable template that allows you to refer to the nth argument to the current rule, so that you can, in turn, pass it to one of the rule's subparsers. For example, the definition of a rule foo that takes a count as its first argument, and matches that many spaces, can be written as:

auto const foo_def = bp::repeat(bp::_p<0>)[' '_l];

Using _p can prevent you from having to write a bunch of lambdas that each get an argument out of the parse context using _params(ctx)[0_c] or similar.

Note that _p is a parse argument (see The Parsers And Their Uses), meaning that it is an invocable that takes the context as its only parameter. If you want to use it inside a semantic action, you have to call it.
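
To make "an invocable that takes the context" concrete, here is a small model in plain standard C++. It is a sketch of the idea only, not Boost.Parser's definition of _p: fake_context, param_fn, param, and doubled_first_param are hypothetical names, and std::tuple stands in for boost::parser::tuple.

```cpp
#include <cstddef>
#include <tuple>

// Hypothetical stand-in for a parse context holding a rule's parameters.
template<typename... Params>
struct fake_context
{
    std::tuple<Params...> params;
};

// Hypothetical model of a parse argument: an invocable that takes the
// context and returns the I-th parameter.
template<std::size_t I>
struct param_fn
{
    template<typename Context>
    constexpr decltype(auto) operator()(Context const & ctx) const
    {
        return std::get<I>(ctx.params);
    }
};
template<std::size_t I>
inline constexpr param_fn<I> param{};

// Inside an action, the parse argument must be called with the context to
// get at the value it refers to.
template<typename Context>
constexpr int doubled_first_param(Context const & ctx)
{
    return 2 * param<0>(ctx);
}
```

Passing param<0> itself to a subparser defers the lookup until parse time; calling param<0>(ctx) performs the lookup immediately.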

Special forms of semantic actions usable within a rule

Semantic actions in this tutorial are usually of the signature void (auto & ctx). That is, they take a context by reference, and return nothing. If they were to return something, that something would just get dropped on the floor.

It is a pretty common pattern to create a rule in order to get a certain kind of value out of a parser, when you don't normally get it automatically. If I want to parse an int, int_ does that, and the thing that I parsed is also the desired attribute. If I parse an int followed by a double, I get a boost::parser::tuple containing one of each. But what if I don't want those two values, but some function of those two values? I probably write something like this.

struct obj_t { /* ... */ };
obj_t to_obj(int i, double d) { /* ... */ }

namespace bp = boost::parser;
bp::rule<struct obj_tag, obj_t> obj = "obj";
auto make_obj = [](auto & ctx) {
    using namespace bp::literals;
    _val(ctx) = to_obj(_attr(ctx)[0_c], _attr(ctx)[1_c]);
};
constexpr auto obj_def = (bp::int_ >> bp::double_)[make_obj];

That's fine, if a little verbose. However, you can also do this instead:

namespace bp = boost::parser;
bp::rule<struct obj_tag, obj_t> obj = "obj";
auto make_obj = [](auto & ctx) {
    using namespace bp::literals;
    return to_obj(_attr(ctx)[0_c], _attr(ctx)[1_c]);
};
constexpr auto obj_def = (bp::int_ >> bp::double_)[make_obj];

Above, we return the value from a semantic action, and the returned value gets assigned to _val(ctx).

Finally, you can provide a function that takes the individual elements of the attribute (if it's a tuple), and returns the value to assign to _val(ctx):

namespace bp = boost::parser;
bp::rule<struct obj_tag, obj_t> obj = "obj";
constexpr auto obj_def = (bp::int_ >> bp::double_)[to_obj];

More formally, within a rule, the use of a semantic action is determined as follows. Assume we have a function APPLY that calls a function with the elements of a tuple, like std::apply. For some context ctx, semantic action action, and attribute attr, action is used like this:

- _val(ctx) = APPLY(action, std::move(attr)), if that is well-formed, and attr is a tuple of size 2 or larger;

- otherwise, _val(ctx) = action(ctx), if that is well-formed;

- otherwise, action(ctx).

The first case does not pass the context to the action at all. The last case is the normal use of semantic actions outside of rules.
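
The three-way selection above can be sketched with if constexpr. This is an approximation in plain standard C++ and not Boost.Parser's implementation: std::tuple and std::apply stand in for boost::parser::tuple and APPLY, "well-formed" is approximated with std::is_invocable_r, and fake_context, applies_to, run_action, and the three sample actions are hypothetical.

```cpp
#include <tuple>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for the parse context: the rule's value and the
// attribute of the parser the action is attached to.
template<typename Val, typename Attr>
struct fake_context
{
    Val val;
    Attr attr;
};

// True if Attr is a tuple of size 2 or larger whose elements can be passed
// to Action, yielding something convertible to Val.
template<typename Val, typename Action, typename Attr>
struct applies_to : std::false_type
{};
template<typename Val, typename Action, typename... Ts>
struct applies_to<Val, Action, std::tuple<Ts...>>
    : std::bool_constant<
          (sizeof...(Ts) >= 2) && std::is_invocable_r_v<Val, Action &, Ts...>>
{};

// Runs `action` using the first of the three forms that is well-formed, and
// returns which form was used (1, 2, or 3).
template<typename Action, typename Val, typename Attr>
int run_action(Action & action, fake_context<Val, Attr> & ctx)
{
    if constexpr (applies_to<Val, Action, Attr>::value) {
        ctx.val = std::apply(action, std::move(ctx.attr)); // no context passed
        return 1;
    } else if constexpr (std::is_invocable_r_v<
                             Val, Action &, fake_context<Val, Attr> &>) {
        ctx.val = action(ctx); // the returned value is assigned
        return 2;
    } else {
        action(ctx); // ordinary action
        return 3;
    }
}

// Form 1: takes the tuple's elements directly.
inline int to_sum(int i, double d) { return i + static_cast<int>(d); }
// Form 2: takes the context and returns the value.
inline auto const returns_value = [](auto & ctx) {
    return std::get<0>(ctx.attr) * 10;
};
// Form 3: takes the context and assigns the value itself.
inline auto const assigns_value = [](auto & ctx) { ctx.val = -1; };
```

The order of the branches matters: an action that takes the tuple elements is never also tried with the context, and an action that returns nothing falls through to the last form.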

