
Attribute Generation

So far, we've seen several different types of attributes that come from different parsers: int for int_, boost::parser::tuple<char, int> for boost::parser::char_ >> boost::parser::int_, and so on. Let's get into how this works with more rigor.

Note

Some parsers have no attribute at all. In the tables below, the type of the attribute is listed as "None." There is a non-void type that is returned from each parser that lacks an attribute. This keeps the logic simple; having to handle the two cases, void or non-void, would make the library significantly more complicated. The type of this non-void attribute is an implementation detail; it comes from the boost::parser::detail namespace and is pretty useless. You should never see this type in practice. Within semantic actions, asking for the attribute of a non-attribute-producing parser (using _attr(ctx)) will yield a value of the special type boost::parser::none. When you call parse() in a form that returns the parsed attribute, and there is no attribute, it simply returns bool; this indicates the success or failure of the parse.

Warning

Boost.Parser assumes that all attributes are semi-regular (see std::semiregular). Within the Boost.Parser code, attributes are assigned, moved, copied, and default constructed. There is no support for move-only or non-default-constructible types.
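If you want to find out early whether a type you plan to use as an attribute meets this requirement, a compile-time probe can help. The following is only a sketch: usable_as_attribute_v and needs_argument are hypothetical names, not part of Boost.Parser, and the probe approximates std::semiregular with C++17 type traits rather than using the C++20 concept itself.

```cpp
#include <memory>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical helper (not a Boost.Parser name): approximates std::semiregular
// with C++17 traits, i.e. default constructible, copyable, and movable.
template<typename T>
inline constexpr bool usable_as_attribute_v =
    std::is_default_constructible_v<T> &&
    std::is_copy_constructible_v<T> && std::is_copy_assignable_v<T> &&
    std::is_move_constructible_v<T> && std::is_move_assignable_v<T>;

// A type with no default constructor, for illustration only.
struct needs_argument
{
    explicit needs_argument(int) {}
};

static_assert(usable_as_attribute_v<std::string>);
static_assert(usable_as_attribute_v<std::vector<int>>);
static_assert(!usable_as_attribute_v<std::unique_ptr<int>>); // move-only
static_assert(!usable_as_attribute_v<needs_argument>);       // not default constructible
```

In C++20 and later, std::semiregular<T> from <concepts> expresses the same idea directly.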

The attribute type trait, attribute

You can use attribute (and the associated alias, attribute_t) to determine the attribute a parser would have if it were passed to parse(). Since at least one parser (char_) has a polymorphic attribute type, attribute also takes the type of the range being parsed. If a parser produces no attribute, attribute will produce none, not void.

If you want to feed an iterator/sentinel pair to attribute, create a range from it like so:

constexpr auto parser = /* ... */;
auto first = /* ... */;
auto const last = /* ... */;

namespace bp = boost::parser;
// You can of course use std::ranges::subrange directly in C++20 and later.
using attr_type = bp::attribute_t<decltype(BOOST_PARSER_SUBRANGE(first, last)), decltype(parser)>;

There is no single attribute type for any parser, since a parser can be placed within omit[], which makes its attribute type none. Therefore, attribute cannot tell you what attribute your parser will produce under all circumstances; it only tells you what it would produce if it were passed to parse().

Parser attributes

This table summarizes the attributes generated for all Boost.Parser parsers. In the table below:

Table 1.8. Parsers and Their Attributes

Parser      Attribute Type                                                                        Notes
----------  ----------------------------------------------------------------------------------  -----
eps         None.
eol         None.
eoi         None.
attr(x)     decltype(RESOLVE(x))
char_       The code point type in Unicode parsing, or char in non-Unicode parsing; see below.  Includes all the _p UDLs that take a single character, and all character class parsers like control and lower.
cp          char32_t
cu          char
lit(x)      None.                                                                               Includes all the _l UDLs.
string(x)   std::string                                                                         Includes all the _p UDLs that take a string.
bool_       bool
bin         unsigned int
oct         unsigned int
hex         unsigned int
ushort_     unsigned short
uint_       unsigned int
ulong_      unsigned long
ulong_long  unsigned long long
short_      short
int_        int
long_       long
long_long   long long
float_      float
double_     double
symbols<T>  T


char_ is a bit odd, since its attribute type is polymorphic. When you use char_ to parse text in the non-Unicode code path (i.e. a string of char), the attribute is char. When you use the exact same char_ to parse in the Unicode-aware code path, all matching is code point based, and so the attribute type is the type used to represent code points, char32_t. All parsing of UTF-8 falls under this case.

Here, we're parsing plain chars, meaning that the parsing is in the non-Unicode code path, so the attribute of char_ is char:

auto result = parse("some text", boost::parser::char_);
static_assert(std::is_same_v<decltype(result), std::optional<char>>);

When you parse UTF-8, the matching is done on a code point basis, so the attribute type is char32_t:

auto result = parse("some text" | boost::parser::as_utf8, boost::parser::char_);
static_assert(std::is_same_v<decltype(result), std::optional<char32_t>>);

The good news is that usually you don't parse characters individually. When you parse with char_, you usually parse repetitions of them, which will produce a std::string, regardless of whether you're in Unicode parsing mode or not. If you do need to parse individual characters, and want to lock down their attribute type, you can use cp and/or cu to enforce a non-polymorphic attribute type.

Combining operation attributes

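Combining operations of course affect the generation of attributes. In the tables below, ATTR(p) denotes the attribute type of the parser p; p, p1, p2, ... are parsers; and a is a semantic action.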

Table 1.9. Combining Operations and Their Attributes

Parser                                Attribute Type
------------------------------------  --------------
!p                                    None.
&p                                    None.
*p                                    std::string if ATTR(p) is char or char32_t, otherwise std::vector<ATTR(p)>
+p                                    std::string if ATTR(p) is char or char32_t, otherwise std::vector<ATTR(p)>
+*p                                   std::string if ATTR(p) is char or char32_t, otherwise std::vector<ATTR(p)>
*+p                                   std::string if ATTR(p) is char or char32_t, otherwise std::vector<ATTR(p)>
-p                                    std::optional<ATTR(p)>
p1 >> p2                              boost::parser::tuple<ATTR(p1), ATTR(p2)>
p1 > p2                               boost::parser::tuple<ATTR(p1), ATTR(p2)>
p1 >> p2 >> p3                        boost::parser::tuple<ATTR(p1), ATTR(p2), ATTR(p3)>
p1 > p2 >> p3                         boost::parser::tuple<ATTR(p1), ATTR(p2), ATTR(p3)>
p1 >> p2 > p3                         boost::parser::tuple<ATTR(p1), ATTR(p2), ATTR(p3)>
p1 > p2 > p3                          boost::parser::tuple<ATTR(p1), ATTR(p2), ATTR(p3)>
p1 | p2                               std::variant<ATTR(p1), ATTR(p2)>
p1 | p2 | p3                          std::variant<ATTR(p1), ATTR(p2), ATTR(p3)>
p1 || p2                              boost::parser::tuple<ATTR(p1), ATTR(p2)>
p1 || p2 || p3                        boost::parser::tuple<ATTR(p1), ATTR(p2), ATTR(p3)>
p1 % p2                               std::string if ATTR(p1) is char or char32_t, otherwise std::vector<ATTR(p1)>
p[a]                                  None.
repeat(arg0)[p]                       std::string if ATTR(p) is char or char32_t, otherwise std::vector<ATTR(p)>
repeat(arg0, arg1)[p]                 std::string if ATTR(p) is char or char32_t, otherwise std::vector<ATTR(p)>
if_(pred)[p]                          std::optional<ATTR(p)>
switch_(arg0)(arg1, p1)(arg2, p2)...  std::variant<ATTR(p1), ATTR(p2), ...>


Important

All the character parsers, like char_, cp and cu, produce either char or char32_t attributes. So when you see "std::string if ATTR(p) is char or char32_t, otherwise std::vector<ATTR(p)>" in the table above, that effectively means that every sequence of character attributes gets turned into a std::string. The only time this does not happen is when you introduce your own rules with attributes using another character type (or use attribute to do so).
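The "std::string if character, otherwise std::vector" rule from the table can be written down as a small type function. This is a toy model for illustration only; repeated_attr_t is a hypothetical name and not something Boost.Parser provides.

```cpp
#include <string>
#include <type_traits>
#include <vector>

// Toy stand-in for the attribute of *p, +p, repeat(n)[p], and so on,
// given Attr = ATTR(p).
template<typename Attr>
using repeated_attr_t = std::conditional_t<
    std::is_same_v<Attr, char> || std::is_same_v<Attr, char32_t>,
    std::string,
    std::vector<Attr>>;

static_assert(std::is_same_v<repeated_attr_t<char>, std::string>);
static_assert(std::is_same_v<repeated_attr_t<char32_t>, std::string>);
static_assert(std::is_same_v<repeated_attr_t<int>, std::vector<int>>);
// Any other character type falls into the "otherwise" branch of the table.
static_assert(std::is_same_v<repeated_attr_t<wchar_t>, std::vector<wchar_t>>);
```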

Important

In case you did not notice it above, adding a semantic action to a parser erases the parser's attribute. The attribute is still available inside the semantic action as _attr(ctx).

There are a relatively small number of rules that define how sequence parsers and alternative parsers' attributes are generated. (Don't worry, there are examples below.)

Sequence parser attribute rules

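The attribute generation behavior of sequence parsers is conceptually pretty simple: the attributes of the subparsers are collected, in order, into a boost::parser::tuple; subparsers that produce no attribute contribute nothing; a container and an adjacent value of its element type (possibly optional) combine into the container; and if the result is boost::parser::tuple<T>, the attribute becomes T.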

More formally, the attribute generation algorithm works like this. For a sequence parser p, let the list of attribute types for the subparsers of p be a0, a1, a2, ..., an.

We get the attribute of p by evaluating a compile-time left fold operation, left-fold({a1, a2, ..., an}, tuple<a0>, OP). OP is the combining operation that takes the current attribute type (initially boost::parser::tuple<a0>) and the next attribute type, and returns the new current attribute type. The current attribute type at the end of the fold operation is the attribute type for p.

OP attempts to apply a series of rules, one at a time. The rules are noted as X >> Y -> Z, where X is the type of the current attribute, Y is the type of the next attribute, and Z is the new current attribute type. In these rules, C<T> is a container of T; none is a special type that indicates that there is no attribute; T is a type; CHAR is a character type, either char or char32_t; and Ts... is a parameter pack of one or more types. Note that T may be the special type none. The current attribute is always a tuple (call it Tup), so the "current attribute X" refers to the last element of Tup, not Tup itself, except for those rules that explicitly mention boost::parser::tuple<> as part of X's type.

The rules that combine containers with (possibly optional) adjacent values (e.g. C<T> >> optional<T> -> C<T>) have a special case for strings. If C<T> is exactly std::string, and T is either char or char32_t, the combination yields a std::string.

Again, if the final result is that the attribute is boost::parser::tuple<T>, the attribute becomes T.
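The fold can be modeled with ordinary template metaprogramming. The sketch below is a toy model, not Boost.Parser's implementation: it substitutes std::tuple for boost::parser::tuple, all of its names (none, absorbs, acc, step, fold, finish, seq_attr_t) are hypothetical, and it covers only behaviors that appear in the examples table later in this section (dropping none, merging a container with an adjacent value, and unwrapping a one-element tuple). Any rules specific to adjacent character attributes are not modeled.

```cpp
#include <optional>
#include <string>
#include <tuple>
#include <type_traits>
#include <vector>

struct none {}; // stand-in for "no attribute"

// absorbs<C, V>: can container C take in an adjacent value of type V?
template<typename C, typename V> struct absorbs : std::false_type {};
template<typename T> struct absorbs<std::vector<T>, T> : std::true_type {};
template<typename T> struct absorbs<std::vector<T>, std::optional<T>> : std::true_type {};
// String special case: std::string takes char or char32_t, optional or not.
template<> struct absorbs<std::string, char> : std::true_type {};
template<> struct absorbs<std::string, char32_t> : std::true_type {};
template<> struct absorbs<std::string, std::optional<char>> : std::true_type {};
template<> struct absorbs<std::string, std::optional<char32_t>> : std::true_type {};

// Fold state: the finished tuple elements, plus the last element, which is
// still open to merging with whatever comes next.
template<typename Done, typename Cur> struct acc {};

// One application of OP: combine the current state with the next attribute Y.
template<typename Acc, typename Y> struct step;
template<typename... Done, typename Cur, typename Y>
struct step<acc<std::tuple<Done...>, Cur>, Y>
{
    using same = acc<std::tuple<Done...>, Cur>;        // Y contributes nothing new
    using replaced = acc<std::tuple<Done...>, Y>;      // Y takes over the last slot
    using appended = acc<std::tuple<Done..., Cur>, Y>; // Y becomes a new element
    using type = std::conditional_t<
        std::is_same_v<Y, none>, same,
        std::conditional_t<
            std::is_same_v<Cur, none>, replaced,
            std::conditional_t<
                absorbs<Cur, Y>::value, same,
                std::conditional_t<absorbs<Y, Cur>::value, replaced, appended>>>>;
};

// Left fold of step over the remaining attribute types.
template<typename Acc, typename... Ys> struct fold { using type = Acc; };
template<typename Acc, typename Y, typename... Ys>
struct fold<Acc, Y, Ys...> : fold<typename step<Acc, Y>::type, Ys...> {};

// Final step: a one-element result is unwrapped; otherwise build the tuple.
template<typename Acc> struct finish;
template<typename Cur>
struct finish<acc<std::tuple<>, Cur>> { using type = Cur; };
template<typename D, typename... Done, typename Cur>
struct finish<acc<std::tuple<D, Done...>, Cur>> { using type = std::tuple<D, Done..., Cur>; };

template<typename A0, typename... As>
using seq_attr_t =
    typename finish<typename fold<acc<std::tuple<>, A0>, As...>::type>::type;

using ints = std::vector<int>;
static_assert(std::is_same_v<seq_attr_t<none, none>, none>);               // eps >> eps
static_assert(std::is_same_v<seq_attr_t<int, none>, int>);                 // p >> eps
static_assert(std::is_same_v<seq_attr_t<char, std::string>, std::string>); // cu >> string("str")
static_assert(std::is_same_v<seq_attr_t<std::string, std::string>,
                             std::tuple<std::string, std::string>>);       // *cu >> string("str")
static_assert(std::is_same_v<seq_attr_t<ints, std::optional<int>>, ints>); // *p >> -p
static_assert(std::is_same_v<seq_attr_t<int, int>, std::tuple<int, int>>); // p >> p
```

Keeping the last element separate from the finished ones is a design choice of this sketch; it makes "the last element of Tup" directly available to each step without having to take the tuple apart.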

Note

What constitutes a container in the rules above is determined by the container concept:

template<typename T>
concept container = std::ranges::common_range<T> && requires(T t) {
    { t.insert(t.begin(), *t.begin()) }
        -> std::same_as<std::ranges::iterator_t<T>>;
};
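For code that cannot use C++20 concepts, roughly the same check can be approximated in C++17 with the detection idiom. This is a sketch under that assumption; is_container_like and its helper aliases are hypothetical names, and the trait only approximates the concept above (it looks at member begin(), end(), and insert()).

```cpp
#include <array>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

template<typename T>
using begin_t = decltype(std::declval<T &>().begin());
template<typename T>
using end_t = decltype(std::declval<T &>().end());
// The type of t.insert(t.begin(), *t.begin()), if that expression is valid.
template<typename T>
using insert_result_t = decltype(std::declval<T &>().insert(
    std::declval<T &>().begin(), *std::declval<T &>().begin()));

// Primary template: chosen when the insert expression or end() is not valid.
template<typename T, typename = void>
struct is_container_like : std::false_type {};

template<typename T>
struct is_container_like<T, std::void_t<insert_result_t<T>, end_t<T>>>
    : std::bool_constant<
          // begin() and end() have the same type (the common_range part) ...
          std::is_same_v<begin_t<T>, end_t<T>> &&
          // ... and insert() returns the container's iterator type.
          std::is_same_v<insert_result_t<T>, begin_t<T>>>
{};

static_assert(is_container_like<std::vector<int>>::value);
static_assert(is_container_like<std::string>::value);
static_assert(!is_container_like<std::array<int, 3>>::value); // no insert()
static_assert(!is_container_like<int>::value);
```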

Alternative parser attribute rules

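The rules for alternative parsers are much simpler. For an alternative parser p, let the list of attribute types for the subparsers of p be a0, a1, a2, ..., an. The attribute of p is std::variant<a0, a1, a2, ..., an>, with the following adjustments, as the examples in Table 1.10 illustrate: none attributes are left out, and if any were left out, the result is wrapped in a std::optional; duplicate types are removed; a variant with a single remaining type T becomes just T; and if no types remain, the result is none.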
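The behavior shown for alternatives in Table 1.10 can be modeled as a small type computation. The sketch below is inferred from those table rows and is not Boost.Parser's implementation; none, types, unique_non_none, collapse, and alt_attr_t are hypothetical names.

```cpp
#include <optional>
#include <type_traits>
#include <variant>

struct none {};                            // stand-in for "no attribute"
template<typename... Ts> struct types {};  // a plain list of types

// Copy Ts... into the Seen list, skipping none and anything already seen.
template<typename Seen, typename... Ts> struct unique_non_none;
template<typename... Us>
struct unique_non_none<types<Us...>> { using type = types<Us...>; };
template<typename... Us, typename T, typename... Ts>
struct unique_non_none<types<Us...>, T, Ts...>
    : std::conditional_t<
          std::is_same_v<T, none> || (std::is_same_v<T, Us> || ...),
          unique_non_none<types<Us...>, Ts...>,
          unique_non_none<types<Us..., T>, Ts...>>
{};

// No types left -> none; one type -> that type; otherwise a variant.
template<typename List> struct collapse;
template<> struct collapse<types<>> { using type = none; };
template<typename T> struct collapse<types<T>> { using type = T; };
template<typename T0, typename T1, typename... Ts>
struct collapse<types<T0, T1, Ts...>> { using type = std::variant<T0, T1, Ts...>; };

template<typename... As>
struct alt_attr
{
    using core = typename collapse<typename unique_non_none<types<>, As...>::type>::type;
    static constexpr bool dropped_none = (std::is_same_v<As, none> || ...);
    // Wrap in optional only if some alternative had no attribute and
    // something other than none is left.
    using type = std::conditional_t<
        dropped_none && !std::is_same_v<core, none>, std::optional<core>, core>;
};
template<typename... As>
using alt_attr_t = typename alt_attr<As...>::type;

static_assert(std::is_same_v<alt_attr_t<none, none>, none>);                     // !p1 | p2[a]
static_assert(std::is_same_v<alt_attr_t<int, int>, int>);                        // p | p
static_assert(std::is_same_v<alt_attr_t<int, double>, std::variant<int, double>>); // p1 | p2
static_assert(std::is_same_v<alt_attr_t<int, none>, std::optional<int>>);        // p | eps
static_assert(std::is_same_v<alt_attr_t<int, none, double>,
                             std::optional<std::variant<int, double>>>);         // p1 | p2[a] | p3
```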

Formation of containers in attributes

The rule for forming containers from non-containers is simple. You get a vector from any of the repeating parsers, like +p, *p, repeat(3)[p], etc. The value type of the vector is ATTR(p).

Another rule for sequence containers is that a value x and a container c containing elements of x's type will form a single container. However, x's type must be exactly the same as the elements in c. There is an exception to this in the special case for strings and characters noted above. For instance, consider the attribute of char_ >> string("str"). In the non-Unicode code path, char_'s attribute type is guaranteed to be char, so ATTR(char_ >> string("str")) is std::string. If you are parsing UTF-8 in the Unicode code path, char_'s attribute type is char32_t, and the special rule makes it also produce a std::string. Without that special rule, ATTR(char_ >> string("str")) would be boost::parser::tuple<char32_t, std::string>.

Again, there are no special rules for combining values and containers. Every combination results from an exact match, or falls into the string+character special case.

Another special case: std::string assignment

std::string can be assigned from a char. This is dumb. But, we're stuck with it. When you write a parser with a char attribute, and you try to parse it into a std::string, you've almost certainly made a mistake. More importantly, if you write this:

namespace bp = boost::parser;
std::string result;
auto b = bp::parse("3", bp::int_, bp::ws, result);

... you are even more likely to have made a mistake. Though this should work, because the assignment in std::string s; s = 3; is well-formed, Boost.Parser forbids it. If you write parsing code like the snippet above, you will get a static assertion failure. If you really do want to assign a float or whatever to a std::string, do it in a semantic action.
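The language-level fact behind this special case can be checked with type traits. The guard in the second half of the sketch is purely hypothetical (sensible_assignment_v is not a Boost.Parser name, and this is not how the library implements its static assertion); it only illustrates the kind of check that rejects a well-formed but suspicious assignment. It leaves char alone; whether the library also rejects char is not modeled here.

```cpp
#include <string>
#include <type_traits>

// The language allows both of these: std::string has operator=(char), and
// arithmetic types convert implicitly to char.
static_assert(std::is_assignable_v<std::string &, char>);
static_assert(std::is_assignable_v<std::string &, int>);

// Hypothetical guard: accept an assignment only if it is well-formed AND it is
// not a number (other than char) being assigned to a std::string.
template<typename Target, typename Attr>
inline constexpr bool sensible_assignment_v =
    std::is_assignable_v<Target &, Attr> &&
    !(std::is_same_v<Target, std::string> && std::is_arithmetic_v<Attr> &&
      !std::is_same_v<Attr, char>);

static_assert(sensible_assignment_v<int, int>);
static_assert(sensible_assignment_v<std::string, std::string>);
static_assert(!sensible_assignment_v<std::string, int>);
static_assert(!sensible_assignment_v<std::string, double>);
```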

Examples of attributes generated by sequence and alternative parsers

In the table: a is a semantic action; and p, p1, p2, ... are parsers that generate attributes. Note that only >> is used here; > has the exact same attribute generation rules.

Table 1.10. Sequence and Alternative Combining Operations and Their Attributes

Expression            Attribute Type
--------------------  --------------
eps >> eps            None.
p >> eps              ATTR(p)
eps >> p              ATTR(p)
cu >> string("str")   std::string
string("str") >> cu   std::string
*cu >> string("str")  boost::parser::tuple<std::string, std::string>
string("str") >> *cu  boost::parser::tuple<std::string, std::string>
p >> p                boost::parser::tuple<ATTR(p), ATTR(p)>
*p >> p               std::string if ATTR(p) is char or char32_t, otherwise std::vector<ATTR(p)>
p >> *p               std::string if ATTR(p) is char or char32_t, otherwise std::vector<ATTR(p)>
*p >> -p              std::string if ATTR(p) is char or char32_t, otherwise std::vector<ATTR(p)>
-p >> *p              std::string if ATTR(p) is char or char32_t, otherwise std::vector<ATTR(p)>
string("str") >> -cu  std::string
-cu >> string("str")  std::string
!p1 | p2[a]           None.
p | p                 ATTR(p)
p1 | p2               std::variant<ATTR(p1), ATTR(p2)>
p | eps               std::optional<ATTR(p)>
p1 | p2 | eps         std::optional<std::variant<ATTR(p1), ATTR(p2)>>
p1 | p2[a] | p3       std::optional<std::variant<ATTR(p1), ATTR(p3)>>


Controlling attribute generation with merge[] and separate[]

As we saw in the previous Parsing into structs and classes section, if you parse two strings in a row, you get two separate strings in the resulting attribute. The parser from that example was this:

namespace bp = boost::parser;
auto employee_parser = bp::lit("employee")
    >> '{'
    >> bp::int_ >> ','
    >> quoted_string >> ','
    >> quoted_string >> ','
    >> bp::double_
    >> '}';

employee_parser's attribute is boost::parser::tuple<int, std::string, std::string, double>. The two quoted_string parsers produce std::string attributes, and those attributes are not combined. That is the default behavior, and it is just what we want for this case; we don't want the first and last name fields to be jammed together such that we can't tell where one name ends and the other begins. What if we were parsing some string that consisted of a prefix and a suffix, and the prefix and suffix were defined separately for reuse elsewhere?

namespace bp = boost::parser;
auto prefix = /* ... */;
auto suffix = /* ... */;
auto special_string = prefix >> suffix;
// Continue to use prefix and suffix to make other parsers....

In this case, we might want to use these separate parsers, but want special_string to produce a single std::string for its attribute. merge[] exists for this purpose.

namespace bp = boost::parser;
auto prefix = /* ... */;
auto suffix = /* ... */;
auto special_string = bp::merge[prefix >> suffix];

merge[] only applies to sequence parsers (like p1 >> p2), and forces all subparsers in the sequence parser to use the same variable for their attribute.

Another directive, separate[], also applies only to sequence parsers, but does the opposite of merge[]. It forces all the attributes produced by the subparsers of the sequence parser to stay separate, even if they would have combined. For instance, consider this parser.

namespace bp = boost::parser;
auto string_and_char = +bp::char_('a') >> ' ' >> bp::cp;

string_and_char matches one or more 'a's, followed by a space and then some other character. As written above, string_and_char produces a std::string, and the final character is appended to the string, after all the 'a's. However, if you wanted to store the final character as a separate value, you would use separate[].

namespace bp = boost::parser;
auto string_and_char = bp::separate[+bp::char_('a') >> ' ' >> bp::cp];

With this change, string_and_char produces the attribute boost::parser::tuple<std::string, char32_t>.

merge[] and separate[] in more detail

As mentioned previously, merge[] applies only to sequence parsers. All subparsers must have the same attribute, or produce no attribute at all. At least one subparser must produce an attribute. When you use merge[], you create a combining group. Every parser in a combining group uses the same variable for its attribute. No parser in a combining group interacts with the attributes of any parsers outside of its combining group. Combining groups are disjoint; merge[/*...*/] >> merge[/*...*/] will produce a tuple of two attributes, not one.

separate[] also applies only to sequence parsers. When you use separate[], you disable interaction of all the subparsers' attributes with adjacent attributes, whether they are inside or outside the separate[] directive; you force each subparser to have a separate attribute.

The rules for merge[] and separate[] overrule the steps of the algorithm described above for combining the attributes of a sequence parser. Consider an example.

namespace bp = boost::parser;
constexpr auto parser =
    bp::char_ >> bp::merge[(bp::string("abc") >> bp::char_ >> bp::char_) >> bp::string("ghi")];

You might think that ATTR(parser) would be bp::tuple<char, std::string>. It is not. The parser above does not even compile. Since we created a merge group above, we disabled the default behavior in which the char_ parsers would have collapsed into the string parser that preceded them. Since they are all treated as separate entities, and since they have different attribute types, the use of merge[] is an error.

Many directives create a new parser out of the parser they are given. merge[] and separate[] do not. Since they operate only on sequence parsers, all they do is create a copy of the sequence parser they are given. The seq_parser template has a template parameter CombiningGroups, and all merge[] and separate[] do is take a given seq_parser and create a copy of it with a different CombiningGroups template parameter. This means that merge[] and separate[] can be ignored in operator>> expressions much like parentheses are. Consider an example.

namespace bp = boost::parser;
constexpr auto parser1 = bp::separate[bp::int_ >> bp::int_] >> bp::int_;
constexpr auto parser2 = bp::lexeme[bp::int_ >> ' ' >> bp::int_] >> bp::int_;

Note that separate[] is a no-op here; it's only being used this way for this example. These parsers have different attribute types. ATTR(parser1) is boost::parser::tuple<int, int, int>. ATTR(parser2) is boost::parser::tuple<boost::parser::tuple<int, int>, int>. This is because bp::lexeme[] wraps its given parser in a new parser; separate[], like merge[], does not. That's why, even though parser1 and parser2 look so structurally similar, they have different attributes.

transform(f)[]

transform(f)[] is a directive that transforms the attribute of a parser using the given function f. For example:

auto str_sum = [&](std::string const & s) {
    int retval = 0;
    for (auto ch : s) {
        retval += ch - '0';
    }
    return retval;
};

namespace bp = boost::parser;
constexpr auto parser = +bp::char_;
std::string str = "012345";

auto result = bp::parse(str, bp::transform(str_sum)[parser]);
assert(result);
assert(*result == 15);
static_assert(std::is_same_v<decltype(result), std::optional<int>>);

Here, we have a function str_sum that we use for f. It assumes each character in the given std::string s is a digit, and returns the sum of all the digits in s. Our parser, parser, would normally return a std::string. However, since str_sum returns a different type, int, that is the attribute type of the full parser, bp::transform(str_sum)[parser], as you can see from the static_assert.

As is the case with attributes all throughout Boost.Parser, the attribute passed to f will be moved. You can take it by const &, &&, or by value.

No distinction is made between parsers with and without an attribute, because there is a Regular special no-attribute type that is generated by parsers with no attribute. You may therefore write something like transform(f)[eps], and Boost.Parser will happily call f with this special no-attribute type.
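The relationship between f and the resulting attribute type can be expressed with standard type traits. This is a toy model that does not use Boost.Parser itself: transformed_attr_t, by_value, by_rvalue, and by_lvalue are hypothetical names, and str_sum is rewritten here as a plain function so that the block is self-contained.

```cpp
#include <string>
#include <type_traits>

// Toy model: the attribute of transform(f)[p] is whatever f returns when it is
// called with an rvalue of ATTR(p), since the attribute is moved into f.
template<typename F, typename Attr>
using transformed_attr_t = std::invoke_result_t<F &, Attr &&>;

// Same digit-summing logic as the lambda in the example above.
inline int str_sum(std::string const & s)
{
    int retval = 0;
    for (auto ch : s) {
        retval += ch - '0';
    }
    return retval;
}

// const &, by value, and && all accept a moved attribute ...
inline int by_value(std::string s) { return static_cast<int>(s.size()); }
inline int by_rvalue(std::string && s) { return static_cast<int>(s.size()); }
// ... but a non-const lvalue reference cannot bind to an rvalue.
inline int by_lvalue(std::string & s) { return static_cast<int>(s.size()); }

static_assert(std::is_same_v<transformed_attr_t<decltype(str_sum), std::string>, int>);
static_assert(std::is_invocable_v<decltype(by_value) &, std::string &&>);
static_assert(std::is_invocable_v<decltype(by_rvalue) &, std::string &&>);
static_assert(!std::is_invocable_v<decltype(by_lvalue) &, std::string &&>);
```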

Other directives that affect attribute generation

omit[p] disables attribute generation for the parser p. raw[p] changes the attribute from ATTR(p) to a view that indicates the subrange of the input that was matched by p. string_view[p] is just like raw[p], except that it produces std::basic_string_views. See Directives for details.

