
The Parsers And Their Uses

Boost.Parser comes with all the parsers most parsing tasks will ever need. Each one is a constexpr object, or a constexpr function. Some of the non-functions are also callable, such as char_, which may be used directly, or with arguments, as in char_('a', 'z'). Any parser that can be called, whether a function or callable object, will be called a callable parser from now on. Note that there are no nullary callable parsers; they each take one or more arguments.

Each callable parser takes one or more parse arguments. A parse argument may be a value or an invocable object that accepts a reference to the parse context. The reference parameter may be mutable or constant. For example:

struct get_attribute
{
    template<typename Context>
    auto operator()(Context & ctx)
    {
        return _attr(ctx);
    }
};

This can also be a lambda. For example:

[](auto const & ctx) { return _attr(ctx); }

The operation that produces a value from a parse argument, which may be a value or a callable taking a parse context argument, is referred to as resolving the parse argument. If a parse argument arg can be called with the current context, then the resolved value of arg is arg(ctx); otherwise, the resolved value is just arg.
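The resolution rule can be illustrated without the library at all. The following is a minimal sketch, not Boost.Parser's actual implementation: fake_context, resolve(), resolve_value(), and resolve_callable() are hypothetical names introduced here, and the real parse context is an opaque library type.

```cpp
#include <type_traits>

// Hypothetical stand-in for the parse context; the real context is an
// opaque library type.
struct fake_context
{
    int attr;
};

// Resolves arg against ctx: calls arg(ctx) when that call is well-formed,
// and otherwise returns arg unchanged.
template<typename Arg, typename Context>
constexpr auto resolve(Arg const & arg, Context & ctx)
{
    if constexpr (std::is_invocable_v<Arg const &, Context &>) {
        return arg(ctx);
    } else {
        return arg;
    }
}

// A plain value resolves to itself.
inline int resolve_value(fake_context & ctx) { return resolve(3, ctx); }

// A callable resolves to the result of calling it with the context.
inline int resolve_callable(fake_context & ctx)
{
    return resolve([](auto & c) { return c.attr + 1; }, ctx);
}
```

The choice between the two branches is made at compile time, so a value argument costs nothing extra compared to hard-coding it.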

Some callable parsers take a parse predicate. A parse predicate is not quite the same as a parse argument, because it must be a callable object, and cannot be a value. A parse predicate's return type must be contextually convertible to bool. For example:

struct equals_three
{
    template<typename Context>
    bool operator()(Context const & ctx)
    {
        return _attr(ctx) == 3;
    }
};

This may of course be a lambda:

[](auto & ctx) { return _attr(ctx) == 3; }

The notional macro RESOLVE() expands to the result of resolving a parse argument or parse predicate. You'll see it used in the rest of the documentation.

An example of how parse arguments are used:

namespace bp = boost::parser;
// This parser matches one code point that is at least 'a', and at most
// the value of last_char, which comes from the globals.
auto last_char = [](auto & ctx) { return _globals(ctx).last_char; };
auto subparser = bp::char_('a', last_char);

Don't worry for now about what the globals are; the take-away is that you can make any argument you pass to a parser depend on the current state of the parse, by using the parse context:

namespace bp = boost::parser;
// This parser parses two code points.  For the parse to succeed, the
// second one must be >= 'a' and <= the first one.
auto set_last_char = [](auto & ctx) { _globals(ctx).last_char = _attr(ctx); };
auto parser = bp::char_[set_last_char] >> subparser;

Each callable parser returns a new parser, parameterized using the arguments given in the invocation.

This table lists all the Boost.Parser parsers. For the callable parsers, a separate entry exists for each possible arity of arguments. For a parser p, if there is no entry for p without arguments, p is a function, and cannot itself be used as a parser; it must be called. In the table below:

[Note] Note

The definition of parsable_range is:

[parsable_range_concept

]

[Note] Note

Some of the parsers in this table consume no input. All parsers consume the input they match unless otherwise stated in the table below.

Table 1.6. Parsers and Their Semantics

Parser

Semantics

Attribute Type

Notes

eps

Matches epsilon, the empty string. Always matches, and consumes no input.

None.

Matching eps an unlimited number of times creates an infinite loop, which is undefined behavior in C++. Boost.Parser will assert in debug mode when it encounters *eps, +eps, etc. (this applies to unconditional eps only).

eps(pred)

Fails to match the input if RESOLVE(pred) == false. Otherwise, the semantics are those of eps.

None.

ws

Matches a single whitespace code point (see note), according to the Unicode White_Space property.

None.

For more info, see the Unicode properties. ws may consume one code point or two. It only consumes two code points when it matches "\r\n".

eol

Matches a single newline (see note), following the "hard" line breaks in the Unicode line breaking algorithm.

None.

For more info, see the Unicode Line Breaking Algorithm. eol may consume one code point or two. It only consumes two code points when it matches "\r\n".

eoi

Matches only at the end of input, and consumes no input.

None.

attr(arg0)

Always matches, and consumes no input. Generates the attribute RESOLVE(arg0).

decltype(RESOLVE(arg0)).

An important use case for attr is to provide a default attribute value as a trailing alternative. For instance, an optional comma-delimited list is: int_ % ',' | attr(std::vector<int>{}). Without the "| attr(...)", at least one int_ match would be required.

char_

Matches any single code point.

The code point type in Unicode parsing, or char in non-Unicode parsing. See Attribute Generation.

char_(arg0)

Matches exactly the code point RESOLVE(arg0).

The code point type in Unicode parsing, or char in non-Unicode parsing. See Attribute Generation.

char_(arg0, arg1)

Matches the next code point n in the input, if RESOLVE(arg0) <= n && n <= RESOLVE(arg1).

The code point type in Unicode parsing, or char in non-Unicode parsing. See Attribute Generation.

char_(r)

Matches the next code point n in the input, if n is one of the code points in r.

The code point type in Unicode parsing, or char in non-Unicode parsing. See Attribute Generation.

r is taken to be in a UTF encoding. The exact UTF used depends on r's element type. If you do not pass UTF encoded ranges for r, the behavior of char_ is undefined. Note that ASCII is a subset of UTF-8, so ASCII is fine. EBCDIC is not. r is not copied; a reference to it is taken. The lifetime of char_(r) must be within the lifetime of r. This overload of char_ does not take parse arguments.

cp

Matches a single code point.

char32_t

Similar to char_, but with a fixed char32_t attribute type; cp has all the same call operator overloads as char_, though they are not repeated here, for brevity.

cu

Matches a single code point.

char

Similar to char_, but with a fixed char attribute type; cu has all the same call operator overloads as char_, though they are not repeated here, for brevity. Even though the name "cu" suggests that this parser matches at the code unit level, it does not. The name refers to the attribute type generated, much like the names int_ versus uint_.

blank

Equivalent to ws - eol.

The code point type in Unicode parsing, or char in non-Unicode parsing. See the entry for char_.

control

Matches a single control-character code point.

The code point type in Unicode parsing, or char in non-Unicode parsing. See the entry for char_.

digit

Matches a single decimal digit code point.

The code point type in Unicode parsing, or char in non-Unicode parsing. See the entry for char_.

punct

Matches a single punctuation code point.

The code point type in Unicode parsing, or char in non-Unicode parsing. See the entry for char_.

hex_digit

Matches a single hexadecimal digit code point.

The code point type in Unicode parsing, or char in non-Unicode parsing. See the entry for char_.

lower

Matches a single lower-case code point.

The code point type in Unicode parsing, or char in non-Unicode parsing. See the entry for char_.

upper

Matches a single upper-case code point.

The code point type in Unicode parsing, or char in non-Unicode parsing. See the entry for char_.

lit(c)

Matches exactly the given code point c.

None.

lit() does not take parse arguments.

c_l

Matches exactly the given code point c.

None.

This is a UDL that represents lit(c), for example 'F'_l.

lit(r)

Matches exactly the given string r.

None.

lit() does not take parse arguments.

str_l

Matches exactly the given string str.

None.

This is a UDL that represents lit(str), for example "a string"_l.

string(r)

Matches exactly r, and generates the match as an attribute.

std::string

string() does not take parse arguments.

str_p

Matches exactly str, and generates the match as an attribute.

std::string

This is a UDL that represents string(str), for example "a string"_p.

bool_

Matches "true" or "false".

bool

bin

Matches a binary unsigned integral value.

unsigned int

For example, bin would match "101", and generate an attribute of 5u.

bin(arg0)

Matches exactly the binary unsigned integral value RESOLVE(arg0).

unsigned int

oct

Matches an octal unsigned integral value.

unsigned int

For example, oct would match "31", and generate an attribute of 25u.

oct(arg0)

Matches exactly the octal unsigned integral value RESOLVE(arg0).

unsigned int

hex

Matches a hexadecimal unsigned integral value.

unsigned int

For example, hex would match "ff", and generate an attribute of 255u.

hex(arg0)

Matches exactly the hexadecimal unsigned integral value RESOLVE(arg0).

unsigned int

ushort_

Matches an unsigned integral value.

unsigned short

ushort_(arg0)

Matches exactly the unsigned integral value RESOLVE(arg0).

unsigned short

uint_

Matches an unsigned integral value.

unsigned int

uint_(arg0)

Matches exactly the unsigned integral value RESOLVE(arg0).

unsigned int

ulong_

Matches an unsigned integral value.

unsigned long

ulong_(arg0)

Matches exactly the unsigned integral value RESOLVE(arg0).

unsigned long

ulong_long

Matches an unsigned integral value.

unsigned long long

ulong_long(arg0)

Matches exactly the unsigned integral value RESOLVE(arg0).

unsigned long long

short_

Matches a signed integral value.

short

short_(arg0)

Matches exactly the signed integral value RESOLVE(arg0).

short

int_

Matches a signed integral value.

int

int_(arg0)

Matches exactly the signed integral value RESOLVE(arg0).

int

long_

Matches a signed integral value.

long

long_(arg0)

Matches exactly the signed integral value RESOLVE(arg0).

long

long_long

Matches a signed integral value.

long long

long_long(arg0)

Matches exactly the signed integral value RESOLVE(arg0).

long long

float_

Matches a floating-point number. float_ uses parsing implementation details from Boost.Spirit. The specifics of what formats are accepted can be found in their real number parsers. Note that only the default RealPolicies is supported by float_.

float

double_

Matches a floating-point number. double_ uses parsing implementation details from Boost.Spirit. The specifics of what formats are accepted can be found in their real number parsers. Note that only the default RealPolicies is supported by double_.

double

repeat(arg0)[p]

Matches iff p matches exactly RESOLVE(arg0) times.

std::string if ATTR(p) is char or char32_t, otherwise std::vector<ATTR(p)>

The special value Inf may be used; it indicates unlimited repetition. decltype(RESOLVE(arg0)) must be implicitly convertible to int64_t. Matching eps an unlimited number of times creates an infinite loop, which is undefined behavior in C++. Boost.Parser will assert in debug mode when it encounters repeat(Inf)[eps] (this applies to unconditional eps only).

repeat(arg0, arg1)[p]

Matches iff p matches between RESOLVE(arg0) and RESOLVE(arg1) times, inclusively.

std::string if ATTR(p) is char or char32_t, otherwise std::vector<ATTR(p)>

The special value Inf may be used for the upper bound; it indicates unlimited repetition. decltype(RESOLVE(arg0)) and decltype(RESOLVE(arg1)) each must be implicitly convertible to int64_t. Matching eps an unlimited number of times creates an infinite loop, which is undefined behavior in C++. Boost.Parser will assert in debug mode when it encounters repeat(n, Inf)[eps] (this applies to unconditional eps only).

if_(pred)[p]

Equivalent to eps(pred) >> p.

std::optional<ATTR(p)>

It is an error to write if_(pred). That is, it is an error to omit the conditionally matched parser p.

switch_(arg0)(arg1, p1)(arg2, p2) ...

Equivalent to p1 when RESOLVE(arg0) == RESOLVE(arg1), p2 when RESOLVE(arg0) == RESOLVE(arg2), etc. If there is no such argN, the behavior of switch_() is undefined.

std::variant<ATTR(p1), ATTR(p2), ...>

It is an error to write switch_(arg0). That is, it is an error to omit the conditionally matched parsers p1, p2, ....

symbols<T>

symbols is an associative container of key, value pairs. Each key is a std::string and each value has type T. In the Unicode parsing path, the strings are considered to be UTF-8 encoded; in the non-Unicode path, no encoding is assumed. symbols matches the longest prefix pre of the input that is equal to one of the keys k. If the length len of pre is zero, and there is no zero-length key, it does not match the input. If len is positive, the generated attribute is the value associated with k.

T

Unlike the other entries in this table, symbols is a type, not an object.

quoted_string

Matches '"', followed by zero or more characters, followed by '"'.

std::string

The result does not include the quotes. A quote within the string can be written by escaping it with a backslash. A backslash within the string can be written by writing two consecutive backslashes. Any other use of a backslash will fail the parse. Skipping is disabled while parsing the entire string, as if using lexeme[].

quoted_string(c)

Matches c, followed by zero or more characters, followed by c.

std::string

The result does not include the c quotes. A c within the string can be written by escaping it with a backslash. A backslash within the string can be written by writing two consecutive backslashes. Any other use of a backslash will fail the parse. Skipping is disabled while parsing the entire string, as if using lexeme[].

quoted_string(r)

Matches some character Q in r, followed by zero or more characters, followed by Q.

std::string

The result does not include the Q quotes. A Q within the string can be written by escaping it with a backslash. A backslash within the string can be written by writing two consecutive backslashes. Any other use of a backslash will fail the parse. Skipping is disabled while parsing the entire string, as if using lexeme[].

quoted_string(c, symbols)

Matches c, followed by zero or more characters, followed by c.

std::string

The result does not include the c quotes. A c within the string can be written by escaping it with a backslash. A backslash within the string can be written by writing two consecutive backslashes. A backslash followed by a successful match using symbols will be interpreted as the corresponding value produced by symbols. Any other use of a backslash will fail the parse. Skipping is disabled while parsing the entire string, as if using lexeme[].

quoted_string(r, symbols)

Matches some character Q in r, followed by zero or more characters, followed by Q.

std::string

The result does not include the Q quotes. A Q within the string can be written by escaping it with a backslash. A backslash within the string can be written by writing two consecutive backslashes. A backslash followed by a successful match using symbols will be interpreted as the corresponding value produced by symbols. Any other use of a backslash will fail the parse. Skipping is disabled while parsing the entire string, as if using lexeme[].


[Important] Important

All the character parsers, like char_, cp, and cu, produce either char or char32_t attributes. So when you see "std::string if ATTR(p) is char or char32_t, otherwise std::vector<ATTR(p)>" in the table above, that effectively means that every sequence of character attributes gets turned into a std::string. The only time this does not happen is when you introduce your own rules with attributes using another character type (or use attribute to do so).

[Note] Note

A slightly more complete description of the attributes generated by these parsers is in a subsequent section. The attributes are repeated here so you can see all the properties of the parsers in one place.

If you have an integral type IntType that is not covered by any of the Boost.Parser parsers, you can use a more verbose declaration to declare a parser for IntType. If IntType were unsigned, you would use uint_parser. If it were signed, you would use int_parser. For example:

constexpr parser_interface<int_parser<IntType>> hex_int;

uint_parser and int_parser accept three more non-type template parameters after the type parameter. They are Radix, MinDigits, and MaxDigits. Radix defaults to 10, MinDigits to 1, and MaxDigits to -1, which is a sentinel value meaning that there is no max number of digits.

So, if you wanted to parse exactly eight hexadecimal digits in a row in order to recognize Unicode character literals like C++ has (e.g. \Udeadbeef), you could use this parser for the digits at the end:

constexpr parser_interface<uint_parser<unsigned int, 16, 8, 8>> hex_int;
