Parsing In Detail

Now that you've seen some examples, let's see how parsing works in a bit more detail. Consider this example.

namespace bp = boost::parser;
auto int_pair = bp::int_ >> bp::int_;         // Attribute: tuple<int, int>
auto int_pairs_plus = +int_pair >> bp::int_;  // Attribute: tuple<std::vector<tuple<int, int>>, int>

int_pairs_plus must match a pair of ints (using int_pair) one or more times, and then must match an additional int. In other words, it matches any odd number (greater than 1) of ints in the input. Let's look at how this parse proceeds.

auto result = bp::parse("1 2 3", int_pairs_plus, bp::ws);

At the beginning of the parse, the top level parser uses its first subparser (if any) to start parsing. So, int_pairs_plus, being a sequence parser, would pass control to its first parser +int_pair. Then +int_pair would use int_pair to do its parsing, which would in turn use bp::int_. This creates a stack of parsers, each one using a particular subparser.

Step 1) The input is "1 2 3", and the stack of active parsers is int_pairs_plus -> +int_pair -> int_pair -> bp::int_. (Read "->" as "uses".) This parses "1", and the whitespace after is skipped by bp::ws. Control passes to the second bp::int_ parser in int_pair.

Step 2) The input is "2 3" and the stack of parsers looks the same, except the active parser is the second bp::int_ from int_pair. This parser consumes "2" and then bp::ws skips the subsequent space. Since we've finished with int_pair's match, its boost::parser::tuple<int, int> attribute is complete. Its parent is +int_pair, so this tuple attribute is pushed onto the back of +int_pair's attribute, which is a std::vector<boost::parser::tuple<int, int>>. Control passes up to the parent of int_pair, +int_pair. Since +int_pair is a one-or-more parser, it starts a new iteration; control passes to int_pair again.

Step 3) The input is "3" and the stack of parsers looks the same, except the active parser is the first bp::int_ from int_pair again, and we're in the second iteration of +int_pair. This parser consumes "3". Since this is the end of the input, the second bp::int_ of int_pair does not match. This partial match of "3" should not count, since it was not part of a full match. So, int_pair indicates its failure, and +int_pair stops iterating. Since it did match once, +int_pair does not fail; it is a one-or-more parser, so failure of its subparser after the first success does not cause it to fail. Control passes to the next parser in sequence within int_pairs_plus.

Step 4) The input is "3" again, and the stack of parsers is int_pairs_plus -> bp::int_. This parses the "3", and the parse reaches the end of input. Control passes to int_pairs_plus, which has just successfully matched with all the parsers in its sequence. It then produces its attribute, a boost::parser::tuple<std::vector<boost::parser::tuple<int, int>>, int>, which gets returned from bp::parse().

Something to take note of between Steps #3 and #4: at the beginning of #4, the input position had returned to where it was at the beginning of #3. This kind of backtracking happens in alternative parsers when an alternative fails. The next page has more details on the semantics of backtracking.
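
To make the position reset between Steps #3 and #4 concrete, here is a hand-rolled sketch of the same grammar that does not use Boost.Parser at all. It is not how the library is implemented. Every name in it (skip_ws, parse_int, parse_int_pair, pairs_plus_result, parse_int_pairs_plus) is hypothetical, and it handles only non-negative ints. The point is the saved start position in parse_int_pair: when the second int fails to match, the position is restored, so the partially matched "3" is available to the next parser.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string_view>
#include <utility>
#include <vector>

// Advances pos past any whitespace.
inline void skip_ws(std::string_view in, std::size_t & pos)
{
    while (pos < in.size() && std::isspace(static_cast<unsigned char>(in[pos])))
        ++pos;
}

// Matches one non-negative int, then skips trailing whitespace.
// On failure, pos is left unchanged.
inline std::optional<int> parse_int(std::string_view in, std::size_t & pos)
{
    std::size_t p = pos;
    if (p == in.size() || !std::isdigit(static_cast<unsigned char>(in[p])))
        return std::nullopt;
    int value = 0;
    while (p < in.size() && std::isdigit(static_cast<unsigned char>(in[p])))
        value = value * 10 + (in[p++] - '0');
    skip_ws(in, p);
    pos = p;
    return value;
}

// Matches two ints in sequence. If the second int fails, pos is restored
// to where the pair started, discarding the partial match.
inline std::optional<std::pair<int, int>>
parse_int_pair(std::string_view in, std::size_t & pos)
{
    std::size_t const start = pos;
    auto first = parse_int(in, pos);
    if (!first)
        return std::nullopt;
    auto second = parse_int(in, pos);
    if (!second) {
        pos = start; // backtrack
        return std::nullopt;
    }
    return std::make_pair(*first, *second);
}

struct pairs_plus_result
{
    std::vector<std::pair<int, int>> pairs;
    int last;
};

// One or more pairs, then one more int, then end of input.
inline std::optional<pairs_plus_result> parse_int_pairs_plus(std::string_view in)
{
    std::size_t pos = 0;
    skip_ws(in, pos);
    pairs_plus_result result{};
    while (auto pair = parse_int_pair(in, pos))
        result.pairs.push_back(*pair);
    if (result.pairs.empty())
        return std::nullopt; // one-or-more needs at least one match
    auto last = parse_int(in, pos);
    if (!last || pos != in.size())
        return std::nullopt;
    result.last = *last;
    return result;
}
```

With the input "1 2 3", the second call to parse_int_pair consumes "3", fails on the missing second int, and resets pos, which is why the final parse_int can still match "3".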

Parsers in detail

So far, parsers have been presented as somewhat abstract entities. You may want more detail. A Boost.Parser parser P is an invocable object with a pair of call operator overloads. The two functions are very similar, and in many parsers one is implemented in terms of the other. The first function does the parsing and returns the default attribute for the parser. The second function does exactly the same parsing, but takes an out-param into which it writes the attribute for the parser. The out-param does not need to be the same type as the default attribute, but they need to be compatible.
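
The shape of such a pair of overloads can be sketched with a toy type. This is an illustration only: toy_uint_parser is a hypothetical name, and its parameter lists are simplified assumptions that do not match the signatures Boost.Parser actually uses. It shows the first overload returning the default attribute by delegating to the second, and the second writing into any out-param type that can be assigned from the default attribute.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical toy parser for a run of decimal digits. The name and the
// parameter lists are illustrative only and are not Boost.Parser's.
struct toy_uint_parser
{
    // Overload 1: parses and returns the default attribute (unsigned).
    // Implemented in terms of overload 2.
    unsigned operator()(std::string_view in, std::size_t & pos, bool & success) const
    {
        unsigned attr = 0;
        (*this)(in, pos, success, attr);
        return attr;
    }

    // Overload 2: does the same parsing, but writes the attribute into an
    // out-param. Attr only needs to be assignable from unsigned.
    template<typename Attr>
    void operator()(std::string_view in, std::size_t & pos, bool & success, Attr & out) const
    {
        std::size_t p = pos;
        unsigned value = 0;
        while (p < in.size() && std::isdigit(static_cast<unsigned char>(in[p])))
            value = value * 10 + static_cast<unsigned>(in[p++] - '0');
        success = p != pos; // at least one digit must be consumed
        if (success) {
            out = value;
            pos = p;
        }
    }
};
```

Here the out-param could be an unsigned, a long long, or a double; all three are compatible with the default attribute in the direct-assignment sense.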

Compatibility means that the default attribute is assignable to the out-param in some fashion. This usually means direct assignment, but it may also mean a tuple -> aggregate or aggregate -> tuple conversion. For sequence types, compatibility means that the sequence type has insert or push_back with the usual semantics. This means that the parser +boost::parser::int_ can fill a std::set<int> just as well as a std::vector<int>.

Some parsers also have additional state that is required to perform a match. For instance, char_ parsers can be parameterized with a single code point to match; the exact value of that code point is stored in the parser object.

No parser has direct support for all the operations defined on parsers (operator|, operator>>, etc.). Instead, there is a template called parser_interface that supports all of these operations. parser_interface wraps each parser, storing it as a data member, adapting it for general use. You should only ever see parser_interface in the debugger, or possibly in some of the reference documentation. You should never have to write it in your own code.