
The parse() API

There are multiple top-level parse functions. They have some things in common:

[Note] Note

wchar_t is an accepted value type for the input. Please note that this is interpreted as UTF-16 on MSVC, and UTF-32 everywhere else.

The overloads

There are eight overloads of parse() and prefix_parse() combined, because there are three either/or options in how you call them.

Iterator/sentinel versus range

You can call prefix_parse() with an iterator and sentinel that delimit a range of character values. For example:

namespace bp = boost::parser;
auto const p = /* some parser ... */;

char const * str_1 = /* ... */;
// Using null_sentinel, str_1 can point to three billion characters, and
// we can call prefix_parse() without having to find the end of the string first.
auto result_1 = bp::prefix_parse(str_1, bp::null_sentinel, p, bp::ws);

char str_2[] = /* ... */;
// prefix_parse() advances the iterator it is given, so pass an lvalue.
auto first = std::begin(str_2);
auto result_2 = bp::prefix_parse(first, std::end(str_2), p, bp::ws);

The iterator/sentinel overloads can parse successfully without matching the entire input. You can tell if the entire input was matched by checking if first == last is true after prefix_parse() returns.
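The idea behind a null-terminator sentinel can be sketched with the standard library alone. In the sketch below, zero_sentinel and match_digits_prefix() are hypothetical names invented for illustration; they are not part of Boost.Parser, and the library's own null_sentinel may well be implemented differently. The function advances first past whatever it matches, so the caller can compare first against the sentinel afterwards, much as one would after prefix_parse() returns.

```cpp
#include <cassert>
#include <cctype>

// Hypothetical stand-in for a null-terminator sentinel (not Boost.Parser's type).
// Comparing a pointer against it asks "does the pointer sit on the 0 terminator?"
struct zero_sentinel
{
    friend bool operator==(char const * it, zero_sentinel) { return *it == '\0'; }
    friend bool operator!=(char const * it, zero_sentinel) { return *it != '\0'; }
};

// Consumes leading decimal digits and advances `first` past them.
// Returns true if at least one digit was matched.
template<typename Sentinel>
bool match_digits_prefix(char const *& first, Sentinel last)
{
    char const * const start = first;
    while (first != last && std::isdigit(static_cast<unsigned char>(*first)))
        ++first;
    return first != start;
}

// Demonstrates telling a prefix match apart from a full match.
inline void sentinel_demo()
{
    char const * partial = "123abc";
    assert(match_digits_prefix(partial, zero_sentinel{}));
    assert(partial != zero_sentinel{}); // "abc" was left unparsed

    char const * full = "123";
    assert(match_digits_prefix(full, zero_sentinel{}));
    assert(full == zero_sentinel{}); // the entire input was matched
}
```

Note that the end of the string is never computed up front; it is discovered only when the scan reaches it.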

By contrast, you call parse() with a range of character values. When the range is a reference to an array of characters, any terminating 0 is ignored; this allows calls like parse("str", p) to work naturally.

namespace bp = boost::parser;
auto const p = /* some parser ... */;

std::u8string str_1 = u8"str";
auto result_1 = bp::parse(str_1, p, bp::ws);

// The null terminator is ignored.  This call parses s-t-r, not s-t-r-0.
auto result_2 = bp::parse(U"str", p, bp::ws);

char const * str_3 = "str";
auto result_3 = bp::parse(bp::null_term(str_3) | bp::as_utf16, p, bp::ws);

Since the range (non-iterator/sentinel) overloads of parse() have no way to report that p matched only a prefix of the input, they indicate failure unless the entire input is matched.

With or without an attribute out-parameter

namespace bp = boost::parser;
auto const p = '"' >> *(bp::char_ - '"') >> '"';
char const * str = "\"two words\"" ;

std::string result_1;
bool const success = bp::parse(str, p, result_1);   // success is true; result_1 is "two words"
auto result_2 = bp::parse(str, p);                  // !!result_2 is true; *result_2 is "two words"

When you call parse() with an attribute out-parameter and parser p, the expected type is something like ATTR(p). It doesn't have to be exactly that; I'll explain in a bit. The return type is bool.

When you call parse() without an attribute out-parameter and parser p, the return type is std::optional<ATTR(p)>. Note that when ATTR(p) is itself an optional, the return type is std::optional<std::optional<...>>. Each of those optionals tells you something different. The outer one tells you whether the parse succeeded. If it did, the parser still generates an attribute that is itself an optional; that is the inner one.

With or without a skipper

namespace bp = boost::parser;
auto const p = '"' >> *(bp::char_ - '"') >> '"';
char const * str = "\"two words\"" ;

auto result_1 = bp::parse(str, p);         // !!result_1 is true; *result_1 is "two words"
auto result_2 = bp::parse(str, p, bp::ws); // !!result_2 is true; *result_2 is "twowords"

Compatibility of attribute out-parameters

For any call to parse() that takes an attribute out-parameter, like parse("str", p, bp::ws, out), the call is well-formed for a number of possible types of out; decltype(out) does not need to be exactly ATTR(p).

For instance, this is well-formed code that does not abort (remember that the attribute type of string() is std::string):

namespace bp = boost::parser;
auto const p = bp::string("foo");

std::vector<char> result;
bool const success = bp::parse("foo", p, result);
assert(success && result == std::vector<char>({'f', 'o', 'o'}));

Even though p generates a std::string attribute, when it actually takes the data it generates and writes it into an attribute, it only assumes that the attribute is a container (see Concepts), not that it is some particular container type. It will happily insert() into a std::string or a std::vector<char> all the same. std::string and std::vector<char> are both containers of char, but it will also insert into a container with a different element type. p just needs to be able to insert the elements it produces into the attribute-container. As long as an implicit conversion allows that to work, everything is fine:

namespace bp = boost::parser;
auto const p = bp::string("foo");

std::deque<int> result;
bool const success = bp::parse("foo", p, result);
assert(success && result == std::deque<int>({'f', 'o', 'o'}));

This works, too, even though it requires inserting elements from a generated sequence of char32_t into a container of char (remember that the attribute type of +cp is std::vector<char32_t>):

namespace bp = boost::parser;
auto const p = +bp::cp;

std::string result;
bool const success = bp::parse("foo", p, result);
assert(success && result == "foo");

This next example works as well, even though the change to a container is not at the top level. It is an element of the result tuple:

namespace bp = boost::parser;
auto const p = +(bp::cp - ' ') >> ' ' >> bp::string("foo");

using attr_type = decltype(bp::parse(u8"", p));
static_assert(std::is_same_v<
              attr_type,
              std::optional<bp::tuple<std::string, std::string>>>);

using namespace bp::literals;

{
    // This is similar to attr_type, with the first std::string changed to a std::vector<int>.
    bp::tuple<std::vector<int>, std::string> result;
    bool const success = bp::parse(u8"rôle foo" | bp::as_utf8, p, result);
    assert(success);
    assert(bp::get(result, 0_c) == std::vector<int>({'r', U'ô', 'l', 'e'}));
    assert(bp::get(result, 1_c) == "foo");
}
{
    // This time, we have a std::vector<char> instead of a std::vector<int>.
    bp::tuple<std::vector<char>, std::string> result;
    bool const success = bp::parse(u8"rôle foo" | bp::as_utf8, p, result);
    assert(success);
    // The 4 code points "rôle" get transcoded to 5 UTF-8 code units to fit in the std::vector<char>.
    assert(bp::get(result, 0_c) == std::vector<char>({'r', (char)0xc3, (char)0xb4, 'l', 'e'}));
    assert(bp::get(result, 1_c) == "foo");
}

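As indicated in the inline comments, there are two things to take away from this example. First, a container that is an element of the attribute tuple can be replaced with a different container type, here std::vector<int> or std::vector<char> in place of std::string. Second, when code points are written into a container of char, they are transcoded to UTF-8, so the container may end up with more elements than there were code points.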
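The UTF-8 bytes asserted in the second block of the example can be reproduced by hand. The following is a standard-library-only sketch of UTF-8 encoding; append_utf8() and to_utf8() are hypothetical helpers written for this illustration, not Boost.Parser functions, and they assume every input value is a valid Unicode scalar value.

```cpp
#include <string>
#include <vector>

// Appends the UTF-8 encoding of one code point to `out`.
// Hypothetical helper; assumes `cp` is a valid Unicode scalar value.
inline void append_utf8(char32_t cp, std::vector<char> & out)
{
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Encodes a whole sequence of code points as UTF-8 code units.
inline std::vector<char> to_utf8(std::u32string const & code_points)
{
    std::vector<char> out;
    for (char32_t cp : code_points)
        append_utf8(cp, out);
    return out;
}
```

The code point U+00F4 falls in the two-byte range, which is why four code points become five chars.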

Let's look at a case where another simple-seeming type replacement does not work. First, the case that works:

namespace bp = boost::parser;
auto parser = -(bp::char_ % ',');
std::vector<int> result;
auto b = bp::parse("a, b", parser, bp::ws, result);

ATTR(parser) is std::optional<std::string>. Even though we pass a std::vector<int>, everything is fine. However, if we modify this case only slightly, so that the std::optional<std::string> is nested within the attribute, the code becomes ill-formed.

struct S
{
    std::vector<int> chars;
    int i;
};
namespace bp = boost::parser;
auto parser = -(bp::char_ % ',') >> bp::int_;
S result;
auto b = bp::parse("a, b 42", parser, bp::ws, result);

If we change chars to a std::vector<char>, the code is still ill-formed. Same if we change chars to a std::string. We must actually use std::optional<std::string> exactly to make the code well-formed again.

The reason the same looseness from the top-level parser does not apply to a nested parser is that, at some point in the code, the parser -(bp::char_ % ',') would try to assign a std::optional<std::string> (the element type of the attribute type it normally generates) to chars. If there's no implicit conversion there, the code is ill-formed.

The take-away for this last example is that the ability to arbitrarily swap out data types within the type of the attribute you pass to parse() is very flexible, but is also limited to structurally simple cases. When we discuss rules in the next section, we'll see how this flexibility in the types of attributes can help when writing complicated parsers.

Those were examples of swapping out one container type for another. They make good examples because that kind of swap is more likely to be surprising, and so it's getting lots of coverage here. You can also do much simpler things, like parsing with uint_ and writing its attribute into a double. In general, you can swap any type T out of the attribute, as long as the swap would not result in some ill-formed assignment within the parse.

Here is another example that also produces surprising results, for a different reason.

namespace bp = boost::parser;
constexpr auto parser = bp::char_('a') >> bp::char_('b') >> bp::char_('c') |
                        bp::char_('x') >> bp::char_('y') >> bp::char_('z');
std::string str = "abc";
bp::tuple<char, char, char> chars;
bool b = bp::parse(str, parser, chars);
assert(b);
assert(chars == bp::tuple('c', '\0', '\0'));

This looks wrong, but is expected behavior. At every stage of the parse that produces an attribute, Boost.Parser tries to assign that attribute to some part of the out-param attribute provided to parse(), if there is one. Note that ATTR(parser) is std::string, because each sequence parser is three char_ parsers in a row, which forms a std::string; there are two such alternatives, so the overall attribute is also std::string. During the parse, when the first parser bp::char_('a') matches the input, it produces the attribute 'a' and needs to assign it to its destination. Some logic inside the sequence parser indicates that this 'a' contributes to the value in the 0th position in the result tuple, if the result is being written into a tuple. Here, we passed a bp::tuple<char, char, char>, so it writes 'a' into the first element. Each subsequent char_ parser does the same thing, and writes over the first element. If we had passed a std::string as the out-param instead, the logic would have seen that the out-param attribute is a string, and would have appended 'a' to it. Then each subsequent parser would have appended to the string.

Boost.Parser never looks at the arity of the tuple passed to parse() to see if there are too many or too few elements in it, compared to the expected attribute for the parser. In this case, there are two extra elements that are never touched. If there had been too few elements in the tuple, you would have seen a compilation error. The reason that Boost.Parser never does this kind of type-checking up front is that the loose assignment logic is spread out among the individual parsers; the top-level parse can determine what the expected attribute is, but not whether a passed attribute of another type is a suitable stand-in.

Compatibility of variant attribute out-parameters

The use of a variant in an out-param is compatible if the default attribute can be assigned to the variant. No other work is done to make the assignment compatible. For instance, this will work as you'd expect:

namespace bp = boost::parser;
std::variant<int, double> v;
auto b = bp::parse("42", bp::int_, v);
assert(b);
assert(v.index() == 0);
assert(std::get<0>(v) == 42);

Again, this works because v = 42 is well-formed. However, other kinds of substitutions will not work. In particular, the boost::parser::tuple to aggregate or aggregate to boost::parser::tuple transformations will not work. Here's an example.

struct key_value
{
    int key;
    double value;
};

namespace bp = boost::parser;
std::variant<key_value, double> kv_or_d;
key_value kv;
// bp::ws is passed so that the space between the two numbers is skipped.
bp::parse("42 13.0", bp::int_ >> bp::double_, bp::ws, kv);      // Ok.
bp::parse("42 13.0", bp::int_ >> bp::double_, bp::ws, kv_or_d); // Error: ill-formed!

In this case, it would be easy for Boost.Parser to look at the alternative types covered by the variant, and do a conversion. However, there are many cases in which there is no obviously correct variant alternative type, or in which the user might expect one variant alternative type and get another. Consider a couple of cases.

struct i_d { int i; double d; };
struct d_i { double d; int i; };
using v1 = std::variant<i_d, d_i>;

struct i_s { int i; short s; };
struct d_d { double d1; double d2; };
using v2 = std::variant<i_s, d_d>;

using tup_t = boost::parser::tuple<short, short>;

If we have a parser that produces a tup_t, and we have a v1 attribute out-param, the correct variant alternative type clearly does not exist — this case is ambiguous, and anyone can see that neither variant alternative is a better match. If we were assigning a tup_t to v2, it's even worse. The same ambiguity exists, but to the user, i_s is clearly "closer" than d_d.

So, Boost.Parser only does assignment. If some parser P generates a default attribute that is not assignable to a variant alternative that you want to assign it to, you can just create a rule that creates either an exact variant alternative type, or the variant itself, and use P as your rule's parser.
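The assignment-only rule can be checked at compile time with a type trait. The sketch below uses only the standard library; std::tuple stands in for boost::parser::tuple, which is an assumption made to keep the example self-contained, and index_after_assigning_int() is a hypothetical helper.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <variant>

struct key_value
{
    int key;
    double value;
};

// v = 42 is well-formed, so an int attribute can be written into this variant.
static_assert(std::is_assignable_v<std::variant<int, double> &, int>);

// std::tuple stands in here for boost::parser::tuple (an assumption).
// Neither alternative can be produced from the tuple by plain assignment,
// so the assignment is ill-formed.
static_assert(!std::is_assignable_v<
              std::variant<key_value, double> &,
              std::tuple<int, double>>);

// Assigning an exact alternative type is fine, which is what a rule with a
// key_value attribute would provide.
static_assert(std::is_assignable_v<std::variant<key_value, double> &, key_value>);

// Runtime view of the first case: the int alternative is selected.
inline std::size_t index_after_assigning_int()
{
    std::variant<int, double> v;
    v = 42;
    return v.index();
}
```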

Unicode versus non-Unicode parsing

A call to parse() either considers the entire input to be in a UTF format (UTF-8, UTF-16, or UTF-32), or it considers the entire input to be in some unknown encoding. Here is how it deduces which case the call falls under:

[Tip] Tip

If you want to parse in ASCII-only mode, or in some other non-Unicode encoding, use only sequences of char, like std::string or char const *.

[Tip] Tip

If you want to ensure all input is parsed as Unicode, pass the input range r as r | boost::parser::as_utf32 — that's the first thing that happens to it inside parse() in the Unicode parsing path anyway.

[Note] Note

Since passing boost::parser::utfN_view is a special case, and since a sequence of chars r is otherwise considered an unknown encoding, boost::parser::parse(r | boost::parser::as_utf8, p) treats r as UTF-8, whereas boost::parser::parse(r, p) does not.

The trace_mode parameter to parse()

Debugging parsers is notoriously difficult once they reach a certain size. To get a verbose trace of your parse, pass boost::parser::trace::on as the final parameter to parse(). It will show you the current parser being matched, the next few characters to be parsed, and any attributes generated. See the Error Handling and Debugging section of the tutorial for details.

Globals and error handlers

Each call to parse() can optionally have a globals object associated with it. To use a particular globals object with your parser, you call with_globals() to create a new parser with the globals object in it:

struct globals_t
{
    int foo;
    std::string bar;
};
auto const parser = /* ... */;
globals_t globals{42, "yay"};
auto result = boost::parser::parse("str", boost::parser::with_globals(parser, globals));

Every semantic action within that call to parse() can access the same globals_t object using _globals(ctx).

The default error handler is great for most needs, but if you want to change it, you can do so by creating a new parser with a call to with_error_handler():

auto const parser = /* ... */;
my_error_handler error_handler;
auto result = boost::parser::parse("str", boost::parser::with_error_handler(parser, error_handler));

[Tip] Tip

If your parsing environment does not allow you to report errors to a terminal, you may want to use callback_error_handler instead of the default error handler.

[Important] Important

Globals and the error handler are ignored, if present, on any parser except the top-level parser.

