Best Practices

Parse Unicode from the start

If you want to parse ASCII, using the Unicode parsing API will not actually cost you anything. Your input will be parsed, char by char, and compared to values that are Unicode code points (which are char32_ts). One caveat is that there may be an extra branch on each char, if the input is UTF-8. If your performance requirements can tolerate this, your life will be much easier if you just start with Unicode and stick with it.

Starting with Unicode support and UTF-8 input will allow you to properly handle unexpected input, like non-ASCII languages (that's most of them), with no additional effort on your part.

Write rules, and test them in isolation

Treat rules as the unit of work in your parser. Write a rule, test its corners, and then use it to build larger rules or parsers. This allows you to get better coverage with less work, since exercising all the code paths of your rules, one by one, keeps the combinatorial number of paths through your code manageable.

Prefer auto-generated attributes to semantic actions

There are multiple ways to get attributes out of a parser. You can:

- use whatever attribute the parser generates;
- provide an attribute out-argument to parse() for the parser to fill in;
- use one or more semantic actions to assign attributes from the parser to variables outside the parser;
- use callback parsing to provide attributes via callback calls.

All of these are fairly similar in how much effort they require, except for the semantic action method. For the semantic action approach, you need to have values to fill in from your parser, and keep them in scope for the duration of the parse.

It is much more straightforward, and leads to more reusable parsers, to have the parsers produce the attributes of the parse directly as a result of the parse.

This does not mean that you should never use semantic actions. They are sometimes necessary. However, you should default to one of the other methods, and only use semantic actions when you have a good reason.

If your parser takes end-user input, give rules names that you would want an end-user to see

A typical error message produced by Boost.Parser will say something like, "Expected FOO here", where FOO is some rule or parser. Give your rules names that will read well in error messages like this. For instance, the JSON examples have these rules:

bp::rule<class escape_seq, uint32_t> const escape_seq =
    "\\uXXXX hexadecimal escape sequence";
bp::rule<class escape_double_seq, uint32_t, double_escape_locals> const
    escape_double_seq = "\\uXXXX hexadecimal escape sequence";
bp::rule<class single_escaped_char, uint32_t> const single_escaped_char =
    "'\"', '\\', '/', 'b', 'f', 'n', 'r', or 't'";

Some things to note:

- escape_seq and escape_double_seq have the same name-string. To an end-user who is trying to figure out why their input failed to parse, it doesn't matter which kind of result a parser rule generates. They just want to know how to fix their input. For either rule, the fix is the same: put a hexadecimal escape sequence there.

- single_escaped_char has a terrible-looking name. However, it's not really used as a name anywhere per se. In error messages, it works nicely, though. The error will be "Expected '"', '\', '/', 'b', 'f', 'n', 'r', or 't' here", which is pretty helpful.

Have a simple test that you can run to find ill-formed-code-as-asserts

Most of these assertion failures occur at parser construction time, so no actual parsing is even necessary. For instance, a test case might look like this:

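TEST(my_parser_tests, my_rule_test) {
    // Constructing the parser is enough to trigger construction-time asserts;
    // nothing is parsed here.
    my_rule r;
}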