
Unicode Support

Boost.Parser was designed from the start to be Unicode friendly. There are numerous references to the "Unicode code path" and the "non-Unicode code path" in the Boost.Parser documentation. Though there are in fact two code paths, the code in them is not very different, because both are written generically. The only difference is that the Unicode code path parses the input as a range of code points, and the non-Unicode path does not. In effect, this means that, in the Unicode code path, when you call parse(r, p) for some input range r and some parser p, the parse happens as if you had called parse(r | boost::parser::as_utf32, p) instead. (Of course, it does not matter whether r is a proper range or an iterator/sentinel pair; both work fine with boost::parser::as_utf32.)

Matching "characters" within Boost.Parser's parsers is assumed to be a code point match. In the Unicode path there is a code point from the input that is matched to each char_ parser. In the non-Unicode path, the encoding is unknown, and so each element of the input is considered to be a whole "character" in the input encoding, analogous to a code point. From this point on, I will therefore refer to a single element of the input exclusively as a code point.

So, let's say we write this parser:

constexpr auto char8_parser = boost::parser::char_('\xcc');

For any char_ parser that should match a value or values, the type of the value to match is retained. So char8_parser contains a char that it will use for matching. If we had written:

constexpr auto char32_parser = boost::parser::char_(U'\xcc');

char32_parser would instead contain a char32_t that it would use for matching.

So, at any point during the parse, if char8_parser were being used to match a code point next_cp from the input, we would see the moral equivalent of next_cp == '\xcc', and if char32_parser were being used to match next_cp, we'd see the equivalent of next_cp == U'\xcc'. The take-away here is that you can write char_ parsers that match specific values without worrying whether the input is Unicode, because, under the covers, what takes place is a simple comparison of two integral values.

Note

Boost.Parser actually promotes any two values to a common type using std::common_type before comparing them. This almost always works because the input and any parameter passed to char_ must be character types.
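To illustrate the idea of promoting both sides to a common type before comparing, here is a minimal stand-alone sketch. The helper name equal_as_common is hypothetical and is not part of Boost.Parser; it only mirrors the std::common_type step described in the note. Whether a plain char holding a value above 0x7f is negative depends on the platform's choice of char signedness, so the sketch uses unsigned char for such values.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical helper (not Boost.Parser API): convert both operands to
// their std::common_type, then compare the converted values.
template <typename T, typename U>
constexpr bool equal_as_common(T a, U b)
{
    using common = std::common_type_t<T, U>;
    return static_cast<common>(a) == static_cast<common>(b);
}
```

With this helper, equal_as_common('a', U'a') is true even though the two operands have different character types, which is the reason a char_ parser built from a narrow literal can still be compared against a wider input element.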

Since matches are always done at a code point level (remember, a "code point" in the non-Unicode path is assumed to be a single char), you get different results trying to match UTF-8 input in the Unicode and non-Unicode code paths:

#include <boost/parser/parser.hpp>
#include <cassert>
#include <string>

namespace bp = boost::parser;

{
    std::string str = (char const *)u8"\xcc\x80"; // encodes the code point U+0300
    auto first = str.begin();

    // Since we've done nothing to indicate that we want to do Unicode
    // parsing, and we've passed a range of char to parse(), this will do
    // non-Unicode parsing.
    std::string chars;
    assert(bp::parse(first, str.end(), *bp::char_('\xcc'), chars));

    // Finds one match of the *char* 0xcc, because the value in the parser
    // (0xcc) was matched against the two code points in the input (0xcc and
    // 0x80), and the first one was a match.
    assert(chars == "\xcc");
}
{
    std::u8string str = u8"\xcc\x80"; // encodes the code point U+0300
    auto first = str.begin();

    // Since the input is a range of char8_t, this will do Unicode
    // parsing.  The same thing would have happened if we passed
    // str | boost::parser::as_utf32 or even str | boost::parser::as_utf8.
    std::string chars;
    assert(bp::parse(first, str.end(), *bp::char_('\xcc'), chars));

    // Finds zero matches of the *code point* 0xcc, because the value in
    // the parser (0xcc) was matched against the single code point in the
    // input, 0x0300.
    assert(chars == "");
}
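
The same contrast can be reproduced without Boost.Parser. The sketch below is only an illustration under stated assumptions: decode_utf8, code_units_as_elements, and leading_matches are hypothetical helpers of my own, the decoder assumes well-formed UTF-8 and does no error handling, and leading_matches is a hand-written stand-in for what *char_(value) does at the start of the input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes UTF-8 into code points. Assumes the input is well-formed.
std::u32string decode_utf8(std::string const & s)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char const lead = static_cast<unsigned char>(s[i]);
        // Sequence length implied by the lead byte.
        std::size_t const len =
            lead < 0x80 ? 1 : lead < 0xe0 ? 2 : lead < 0xf0 ? 3 : 4;
        // Payload bits of the lead byte.
        char32_t cp = len == 1 ? lead : lead & (0xff >> (len + 1));
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3f);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Non-Unicode view: every char is treated as one whole element.
std::u32string code_units_as_elements(std::string const & s)
{
    std::u32string out;
    for (unsigned char c : s)
        out.push_back(c);
    return out;
}

// Counts how many elements at the start of the input equal value.
std::size_t leading_matches(std::u32string const & input, char32_t value)
{
    std::size_t n = 0;
    while (n < input.size() && input[n] == value)
        ++n;
    return n;
}
```

For the two-byte input "\xcc\x80", the code-unit view yields the elements 0xcc and 0x80, so one leading element matches 0xcc. The decoded view yields the single code point U+0300, so nothing matches.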

Implicit transcoding

Additionally, it is expected that most programs will use UTF-8 for the encoding of Unicode strings. Boost.Parser is written with this typical case in mind. This means that if you are parsing 32-bit code points (as you always are in the Unicode path), and you want to catch the result in a container C of char or char8_t values, Boost.Parser will silently transcode from UTF-32 to UTF-8 and write the attribute into C. This means that std::string, std::u8string, etc. are fine to use as attribute out-parameters for *char_, and the result will be UTF-8.
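
As a rough picture of what transcoding from UTF-32 to UTF-8 involves, here is a stand-alone sketch. It is not Boost.Parser's implementation; append_utf8 and to_utf8 are hypothetical names, and the code assumes every input value is a valid Unicode scalar value.

```cpp
#include <cassert>
#include <string>

// Appends the UTF-8 encoding of one code point to out.
// Assumes cp is a valid Unicode scalar value.
void append_utf8(std::string & out, char32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xc0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xe0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
        out += static_cast<char>(0x80 | (cp & 0x3f));
    } else {
        out += static_cast<char>(0xf0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3f));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
        out += static_cast<char>(0x80 | (cp & 0x3f));
    }
}

// Encodes a whole sequence of code points as UTF-8.
std::string to_utf8(std::u32string const & code_points)
{
    std::string out;
    for (char32_t cp : code_points)
        append_utf8(out, cp);
    return out;
}
```

A parse that matched the code point U+00F6 would therefore leave the two chars 0xc3 0xb6 in a std::string attribute, which is why a std::string out-parameter ends up holding UTF-8.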

Note

UTF-16 strings as attributes are not supported directly. If you want UTF-16 strings as attributes, you may need to transcode a UTF-8 or UTF-32 attribute to UTF-16 within a semantic action. You can do this by using boost::parser::as_utf16.

The treatment of strings as UTF-8 is nearly ubiquitous within Boost.Parser. For instance, though the entire interface of symbols uses std::string or std::string_view, UTF-32 comparisons are used internally.

Explicit transcoding

I mentioned above that the use of boost::parser::utf*_view as the range to parse opts you in to Unicode parsing. Here's a bit more about these views and how best to use them.

If you want to do Unicode parsing, you're always going to be comparing code points at each step of the parse. As such, you're going to implicitly convert any parse input to UTF-32, if needed. This is what all the parse API functions do internally.

However, there are times when you have parse input that is a sequence of UTF-8-encoded chars, and you want to do Unicode-aware parsing. As mentioned previously, Boost.Parser has a special case for char inputs, and it will not assume that char sequences are UTF-8. If you want to tell the parse API to do Unicode processing on them anyway, you can use the as_utf32 range adaptor. (Note that you can use any of the as_utf* adaptors and the semantics will not differ from the semantics below.)

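#include <boost/parser/parser.hpp>
#include <boost/parser/transcode_view.hpp> // for as_utf32
#include <cassert>

namespace bp = boost::parser;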

auto const p = '"' >> *(bp::char_ - '"' - 0xb6) >> '"';
char const * str = "\"two wörds\""; // ö is two code units, 0xc3 0xb6

auto result_1 = bp::parse(str, p);                // Treat each char as a code point (typically ASCII).
assert(!result_1);
auto result_2 = bp::parse(str | bp::as_utf32, p); // Unicode-aware parsing on code points.
assert(result_2);

The first call to parse() treats each char as a code point, and since "ö" is the pair of code units 0xc3 0xb6, the parse matches the second code unit against the - 0xb6 part of the parser above, causing the parse to fail. This happens because each code unit/char in str is treated as an independent code point.

The second call to parse() succeeds because, when the parse gets to the code point for 'ö', it is 0xf6 (U+00F6), which does not match the - 0xb6 part of the parser.
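
The two outcomes can be traced by hand with a small stand-alone sketch. Everything here is hypothetical illustration rather than Boost.Parser code: matches_quoted is a hand-written stand-in for the parser p above (an opening quote, any run of elements that are neither a quote nor 0xb6, a closing quote, and then end of input), and code_units and code_points simply present the same text as two different element sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for  '"' >> *(char_ - '"' - 0xb6) >> '"'  that also requires
// the whole input to be consumed.
bool matches_quoted(std::vector<std::uint32_t> const & input)
{
    std::size_t i = 0;
    if (i == input.size() || input[i] != U'"')
        return false;
    ++i;
    // Consume elements until a quote or the excluded value 0xb6 appears.
    while (i < input.size() && input[i] != U'"' && input[i] != 0xb6)
        ++i;
    if (i == input.size() || input[i] != U'"')
        return false;
    ++i;
    return i == input.size();
}

// Non-Unicode view: one element per char.
std::vector<std::uint32_t> code_units(std::string const & s)
{
    std::vector<std::uint32_t> out;
    for (unsigned char c : s)
        out.push_back(c);
    return out;
}

// Unicode view: one element per code point.
std::vector<std::uint32_t> code_points(std::u32string const & s)
{
    return std::vector<std::uint32_t>(s.begin(), s.end());
}
```

Calling matches_quoted(code_units("\"two w\xc3\xb6rds\"")) returns false because the loop stops at the 0xb6 code unit and the next element is not a closing quote. Calling matches_quoted(code_points(U"\"two w\u00f6rds\"")) returns true because the only element standing for 'ö' is 0xf6.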

The other adaptors, as_utf8 and as_utf16, are also provided for completeness. Each can transcode any sequence of character types.

Important

The as_utfN adaptors are optional, so they don't come with parser.hpp. To get access to them, #include <boost/parser/transcode_view.hpp>.

(Lack of) normalization

One thing that Boost.Parser does not handle for you is normalization; Boost.Parser is completely normalization-agnostic. Since all the parsers do their matching using equality comparisons of code points, you should make sure that your parsed range and your parsers all use the same normalization form.
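
To see why this matters, consider the letter 'ö', which can be written either as one precomposed code point or as a base letter followed by a combining mark. The stand-alone sketch below only compares the two spellings as code point sequences; the variable and function names are hypothetical, and no normalization library is involved.

```cpp
#include <cassert>
#include <string>

// U+00F6 LATIN SMALL LETTER O WITH DIAERESIS (the NFC spelling).
std::u32string const precomposed = U"\u00f6";

// U+006F 'o' followed by U+0308 COMBINING DIAERESIS (the NFD spelling).
std::u32string const decomposed = U"o\u0308";

// Code-point-wise equality, the only kind of comparison a parser that
// matches code points can perform.
bool same_code_points(std::u32string const & a, std::u32string const & b)
{
    return a == b;
}
```

Both strings render as the same character, yet same_code_points(precomposed, decomposed) is false. A parser written with one spelling will not match input that uses the other unless you normalize both to the same form first.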
