When writing a parser, it often comes up that there is a set of strings that, when parsed, are associated with a set of values one-to-one. It is tedious to write parsers that recognize all the possible input strings when you have to associate each one with an attribute via a semantic action. Instead, we can use a symbol table.
Say we want to parse Roman numerals, one of the most common work-related parsing problems. We want to recognize numbers that start with any number of "M"s, representing thousands, followed by the hundreds, the tens, and the ones. Any of these may be absent from the input, but not all. Here are three Boost.Parser symbol tables that we can use to recognize ones, tens, and hundreds values, respectively:
bp::symbols<int> const ones = {
    {"I", 1}, {"II", 2}, {"III", 3}, {"IV", 4}, {"V", 5},
    {"VI", 6}, {"VII", 7}, {"VIII", 8}, {"IX", 9}};

bp::symbols<int> const tens = {
    {"X", 10}, {"XX", 20}, {"XXX", 30}, {"XL", 40}, {"L", 50},
    {"LX", 60}, {"LXX", 70}, {"LXXX", 80}, {"XC", 90}};

bp::symbols<int> const hundreds = {
    {"C", 100}, {"CC", 200}, {"CCC", 300}, {"CD", 400}, {"D", 500},
    {"DC", 600}, {"DCC", 700}, {"DCCC", 800}, {"CM", 900}};
A symbols table maps strings of char to their associated attributes. The type of the attribute must be specified as a template parameter to symbols (in this case, int).
Any "M"s we encounter should add 1000 to the result, and all other values come from the symbol tables. Here are the semantic actions we'll need to do that:
int result = 0;
auto const add_1000 = [&result](auto & ctx) { result += 1000; };
auto const add = [&result](auto & ctx) { result += _attr(ctx); };
add_1000 just adds 1000 to result. add adds whatever attribute is produced by its parser to result.
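To see how the two actions accumulate a value, here is a minimal sketch in plain C++ that does not use Boost.Parser at all. The toy::context type and the toy::_attr() function are hypothetical stand-ins for the library's parse context and attribute accessor; they exist only so that the two lambdas from the text can be called by hand, in the order they would fire for the input "MCCXII". The unqualified call to _attr(ctx) resolves here through argument-dependent lookup, and the sketch assumes the real accessor is found in a similar way.

```cpp
namespace toy {
    // Hypothetical stand-in for a parse context; not a Boost.Parser type.
    struct context
    {
        int attr; // the attribute produced by the parser the action is attached to
    };

    // Hypothetical stand-in for the library's attribute accessor.
    inline int _attr(context const & ctx) { return ctx.attr; }
}

// Calls the actions by hand in the order they would fire for "MCCXII".
int simulate_mccxii()
{
    int result = 0;
    auto const add_1000 = [&result](auto &) { result += 1000; };
    auto const add = [&result](auto & ctx) { result += _attr(ctx); };

    toy::context m{0};          // 'M' produces no attribute that we use
    toy::context hundreds{200}; // "CC"
    toy::context tens{10};      // "X"
    toy::context ones{2};       // "II"

    add_1000(m);
    add(hundreds);
    add(tens);
    add(ones);
    return result; // 1000 + 200 + 10 + 2
}
```

Because both lambdas capture result by reference, the total is available in the enclosing scope once the last action has run.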
Now we just need to put the pieces together to make a parser:
using namespace bp::literals;
auto const parser =
    *'M'_l[add_1000] >> -hundreds[add] >> -tens[add] >> -ones[add];
We've got a few new bits in play here, so let's break it down. 'M'_l is a literal parser. That is, it is a parser that parses a literal char, code point, or string. In this case, a char 'M' is being parsed. The _l bit at the end is a UDL (user-defined literal) suffix that you can put after any char, char32_t, or char const * to form a literal parser. You can also make a literal parser by writing lit(), passing an argument of one of the previously mentioned types.
Why do we need any of this, considering that we just used a literal ',' in our previous example? The reason is that 'M' is not used in an expression with another Boost.Parser parser. It is used within *'M'_l[add_1000]. If we'd written *'M'[add_1000], clearly that would be ill-formed; char has no operator*, nor an operator[], associated with it.
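The language rule behind this can be checked with a small sketch in plain C++, with no Boost.Parser involved. Everything in namespace toy (lit_parser, zero_or_more, lit(), and the _l suffix) is a hypothetical stand-in written for this illustration, and has_unary_star is a helper trait, not a library facility. The sketch shows only that a bare char cannot take unary *, while a class-type wrapper can; operator[] is omitted for brevity.

```cpp
#include <type_traits>
#include <utility>

namespace toy {
    // Hypothetical stand-ins; none of these are Boost.Parser types.
    struct lit_parser { char c; };           // wraps the character to match
    struct zero_or_more { lit_parser sub; }; // result of applying unary *

    inline lit_parser lit(char c) { return {c}; }
    inline lit_parser operator""_l(char c) { return {c}; }

    // An operator can be overloaded only if an operand has class or
    // enumeration type, which is why the wrapper is needed.
    inline zero_or_more operator*(lit_parser p) { return {p}; }
}

// Detects whether unary * can be applied to a value of type T.
template<typename T, typename = void>
struct has_unary_star : std::false_type {};

template<typename T>
struct has_unary_star<T, std::void_t<decltype(*std::declval<T>())>>
    : std::true_type {};

static_assert(!has_unary_star<char>::value, "a bare char has no operator*");
static_assert(has_unary_star<toy::lit_parser>::value, "the wrapper does");
```

With a using namespace toy; in scope, both *'M'_l and *lit('M') compile and yield a zero_or_more, whereas *'M' is rejected by the compiler.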
Tip | |
---|---|
Any time you want to use a |
On to the next bit: -hundreds[add]. By now, the use of the index operator should be pretty familiar; it associates the semantic action add with the parser hundreds. The unary operator- at the beginning is new. It means that the parser it is applied to is optional. You can read it as "zero or one". So, if hundreds is not successfully parsed after *'M'_l[add_1000], nothing happens, because hundreds is allowed to be missing; it's optional. If hundreds is parsed successfully, say by matching "CC", the resulting attribute, 200, is added to result inside add.
Here is the full listing of the program. Notice that it would have been inappropriate to use a whitespace skipper here, since the entire parse is a single number, so it was removed.
#include <boost/parser/parser.hpp>

#include <iostream>
#include <string>

namespace bp = boost::parser;

int main()
{
    std::cout << "Enter a number using Roman numerals. ";
    std::string input;
    std::getline(std::cin, input);

    bp::symbols<int> const ones = {
        {"I", 1}, {"II", 2}, {"III", 3}, {"IV", 4}, {"V", 5},
        {"VI", 6}, {"VII", 7}, {"VIII", 8}, {"IX", 9}};

    bp::symbols<int> const tens = {
        {"X", 10}, {"XX", 20}, {"XXX", 30}, {"XL", 40}, {"L", 50},
        {"LX", 60}, {"LXX", 70}, {"LXXX", 80}, {"XC", 90}};

    bp::symbols<int> const hundreds = {
        {"C", 100}, {"CC", 200}, {"CCC", 300}, {"CD", 400}, {"D", 500},
        {"DC", 600}, {"DCC", 700}, {"DCCC", 800}, {"CM", 900}};

    int result = 0;
    auto const add_1000 = [&result](auto & ctx) { result += 1000; };
    auto const add = [&result](auto & ctx) { result += _attr(ctx); };

    using namespace bp::literals;
    auto const parser =
        *'M'_l[add_1000] >> -hundreds[add] >> -tens[add] >> -ones[add];

    if (bp::parse(input, parser) && result != 0)
        std::cout << "That's " << result << " in Arabic numerals.\n";
    else
        std::cout << "That's not a Roman number.\n";
}
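As a cross-check on what the grammar accepts, here is a hand-written sketch of the same steps in plain C++, without Boost.Parser. The names entry, take_optional, and roman_to_int are hypothetical and exist only for this illustration. The sketch assumes that a table lookup prefers the longest matching key (so that "III" is read as 3 rather than as "I" followed by leftover input), and that the parse succeeds only when the whole input is consumed.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical table entry: a key and the value associated with it.
struct entry
{
    std::string_view key;
    int value;
};

entry const ones[] = {{"I", 1}, {"II", 2}, {"III", 3}, {"IV", 4}, {"V", 5},
                      {"VI", 6}, {"VII", 7}, {"VIII", 8}, {"IX", 9}};
entry const tens[] = {{"X", 10}, {"XX", 20}, {"XXX", 30}, {"XL", 40}, {"L", 50},
                      {"LX", 60}, {"LXX", 70}, {"LXXX", 80}, {"XC", 90}};
entry const hundreds[] = {{"C", 100}, {"CC", 200}, {"CCC", 300}, {"CD", 400},
                          {"D", 500}, {"DC", 600}, {"DCC", 700}, {"DCCC", 800},
                          {"CM", 900}};

// Consumes the longest key in `table` that is a prefix of `input` and returns
// its value. Returns 0 and consumes nothing if no key matches, which mirrors
// the optional step written as -table[add] in the grammar.
template<std::size_t N>
int take_optional(entry const (&table)[N], std::string_view & input)
{
    entry const * best = nullptr;
    for (entry const & e : table) {
        bool const is_prefix = input.substr(0, e.key.size()) == e.key;
        if (is_prefix && (!best || e.key.size() > best->key.size()))
            best = &e;
    }
    if (!best)
        return 0;
    input.remove_prefix(best->key.size());
    return best->value;
}

// Returns the value of a Roman numeral, or nullopt if `input` is not one.
std::optional<int> roman_to_int(std::string_view input)
{
    int result = 0;
    // Zero or more 'M's, each worth 1000.
    while (!input.empty() && input.front() == 'M') {
        result += 1000;
        input.remove_prefix(1);
    }
    result += take_optional(hundreds, input);
    result += take_optional(tens, input);
    result += take_optional(ones, input);
    // Leftover input, or an input that matched nothing, is not a number.
    if (!input.empty() || result == 0)
        return std::nullopt;
    return result;
}
```

The final check plays the same role as the result != 0 test in the listing: every step of the grammar is allowed to match nothing, so an empty input would otherwise count as a successful parse.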
Just like with a rule, you can give a symbols a bit of diagnostic text that will be used in error messages generated by Boost.Parser when the parse fails at an expectation point, as described in Error Handling and Debugging. See the symbols constructors for details.