If you want to parse ASCII, using the Unicode parsing API will not actually
cost you anything. Your input will be parsed, char by char, and compared to
values that are Unicode code points (which are char32_ts).
One caveat is that there may be an extra branch on each char, if the input
is UTF-8. If your performance requirements can tolerate this, your life will
be much easier if you just start with Unicode and stick with it.
Starting with Unicode support and UTF-8 input will allow you to properly handle unexpected input, like non-ASCII languages (that's most of them), with no additional effort on your part.
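To make the "extra branch on each char" concrete, here is a minimal hand-written sketch of UTF-8 decoding. It is not Boost.Parser's implementation; decode_utf8 and count_code_point are hypothetical helpers invented for illustration. The sketch replaces malformed sequences with U+FFFD and, for brevity, does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of Boost.Parser): decode UTF-8 bytes into
// Unicode code points. Malformed or truncated sequences become U+FFFD.
// Overlong encodings and surrogates are not rejected in this sketch.
std::vector<char32_t> decode_utf8(std::string_view in)
{
    char32_t const replacement = 0xFFFD;
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char const b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) {
            // The extra branch: an ASCII byte is already a code point.
            out.push_back(b0);
            ++i;
            continue;
        }
        std::size_t len = 0;
        char32_t cp = 0;
        if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else { out.push_back(replacement); ++i; continue; } // invalid lead byte
        if (i + len > in.size()) { out.push_back(replacement); break; }
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char const b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (b & 0x3F); // append 6 payload bits
        }
        if (!ok) { out.push_back(replacement); ++i; continue; }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical helper: compare each decoded code point against a char32_t,
// which is how the text above describes matching in Unicode mode.
std::size_t count_code_point(std::string_view in, char32_t target)
{
    std::size_t n = 0;
    for (char32_t cp : decode_utf8(in)) {
        if (cp == target)
            ++n;
    }
    return n;
}
```

For pure ASCII input only the first branch is ever taken, which is why the cost over a byte-by-byte comparison is small.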
Treat rules as the unit of work in your parser. Write a rule, test its corners, and then use it to build larger rules or parsers. This allows you to get better coverage with less work, since exercising all the code paths of your rules, one by one, keeps the combinatorial number of paths through your code manageable.
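There are multiple ways to get attributes out of a parser. You can:
- use whatever attribute the parser generates;
- provide an attribute out-argument to parse() for the parser to fill in;
- use one or more semantic actions to assign attributes from the parser to variables outside the parser;
- use callback parsing to provide attributes via callback calls.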
All of these are fairly similar in how much effort they require, except for the semantic action method. For the semantic action approach, you need to have values to fill in from your parser, and keep them in scope for the duration of the parse.
It is much more straightforward, and leads to more reusable parsers, to have the parsers produce the attributes of the parse directly as a result of the parse.
This does not mean that you should never use semantic actions. They are sometimes necessary. However, you should default to using the other, non-semantic-action methods, and only use semantic actions when you have a good reason.
A typical error message produced by Boost.Parser will say something like, "Expected FOO here", where FOO is some rule or parser. Give your rules names that will read well in error messages like this. For instance, the JSON examples have these rules:
    bp::rule<class escape_seq, uint32_t> const escape_seq =
        "\\uXXXX hexadecimal escape sequence";
    bp::rule<class escape_double_seq, uint32_t, double_escape_locals> const
        escape_double_seq = "\\uXXXX hexadecimal escape sequence";
    bp::rule<class single_escaped_char, uint32_t> const single_escaped_char =
        "'\"', '\\', '/', 'b', 'f', 'n', 'r', or 't'";
Some things to note:
- escape_seq and escape_double_seq have the same name-string. To an end-user
who is trying to figure out why their input failed to parse, it doesn't
matter which kind of result a parser rule generates. They just want to know
how to fix their input. For either rule, the fix is the same: put a
hexadecimal escape sequence there.
- single_escaped_char has a terrible-looking name. However, it's not really
used as a name anywhere per se. In error messages, it works nicely, though.
The error will be "Expected '"', '\', '/', 'b', 'f', 'n', 'r', or 't' here",
which is pretty helpful.
Most of these errors are found at parser construction time, so no actual parsing is even necessary. For instance, a test case might look like this:
    TEST(my_parser_tests, my_rule_test)
    {
        my_rule r;
    }