Boost.Parser was designed from the start to be Unicode friendly. There are
numerous references to the "Unicode code path" and the "non-Unicode
code path" in the Boost.Parser documentation. Though there are in fact
two code paths for Unicode and non-Unicode parsing, the code is not very
different in the two code paths, as they are written generically. The only
difference is that the Unicode code path parses the input as a range of code
points, and the non-Unicode path does not. In effect, this means that, in
the Unicode code path, when you call `parse(r, p)` for some input range `r`
and some parser `p`, the parse happens as if you called
`parse(r | boost::parser::as_utf32, p)` instead. (Of course, it does not
matter if `r` is a proper range, or an iterator/sentinel pair; those both
work fine with `boost::parser::as_utf32`.)
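To make "a range of code points" concrete, here is a minimal standard-library sketch of what decoding UTF-8 into UTF-32 produces. The helper `decode_utf8` is hypothetical and is not part of Boost.Parser, nor is it how `as_utf32` is implemented; it assumes well-formed input and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper for illustration only (not a Boost.Parser API).
// Assumes well-formed UTF-8; malformed input is not detected.
std::u32string decode_utf8(std::string_view in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char const lead = static_cast<unsigned char>(in[i]);
        ++i;

        // The lead byte tells us how many continuation bytes follow and
        // which of its low bits belong to the code point.
        int extra = 0;
        char32_t cp = lead;
        if (lead >= 0xf0) {
            extra = 3;
            cp = lead & 0x07;
        } else if (lead >= 0xe0) {
            extra = 2;
            cp = lead & 0x0f;
        } else if (lead >= 0xc0) {
            extra = 1;
            cp = lead & 0x1f;
        }

        // Each continuation byte contributes its low six bits.
        for (; extra > 0 && i < in.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3f);
        }
        out.push_back(cp);
    }
    return out;
}
```

Under these assumptions, the two `char` values `0xcc 0x80` decode to the single code point U+0300, which is the value a parser sees in the Unicode code path.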
Matching "characters" within Boost.Parser's parsers is assumed
to be a code point match. In the Unicode path there is a code point from
the input that is matched to each `char_` parser. In the non-Unicode
path, the encoding is unknown, and so each element of the input is considered
to be a whole "character" in the input encoding, analogous to a
code point. From this point on, I will therefore refer to a single element
of the input exclusively as a code point.
So, let's say we write this parser:

    constexpr auto char8_parser = boost::parser::char_('\xcc');

For any `char_` parser that should match a value or values, the type of the
value to match is retained. So `char8_parser` contains a `char` that it will
use for matching. If we had written:

    constexpr auto char32_parser = boost::parser::char_(U'\xcc');

`char32_parser` would instead contain a `char32_t` that it would use for
matching.
So, at any point during the parse, if `char8_parser` were being used to match
a code point `next_cp` from the input, we would see the moral equivalent of
`next_cp == '\xcc'`, and if `char32_parser` were being used to match
`next_cp`, we'd see the equivalent of `next_cp == U'\xcc'`. The take-away
here is that you can write `char_` parsers that match specific values,
without worrying if the input is Unicode or not because, under the covers,
what takes place is a simple comparison of two integral values.
Note | |
---|---|
Boost.Parser actually promotes any two values to a common type using |
Since matches are always done at a code point level (remember, a "code
point" in the non-Unicode path is assumed to be a single `char`), you get
different results trying to match UTF-8 input in the Unicode and non-Unicode
code paths:

    namespace bp = boost::parser;

    {
        std::string str = (char const *)u8"\xcc\x80"; // encodes the code point U+0300
        auto first = str.begin();

        // Since we've done nothing to indicate that we want to do Unicode
        // parsing, and we've passed a range of char to parse(), this will do
        // non-Unicode parsing.
        std::string chars;
        assert(bp::parse(first, str.end(), *bp::char_('\xcc'), chars));

        // Finds one match of the *char* 0xcc, because the value in the parser
        // (0xcc) was matched against the two code points in the input (0xcc and
        // 0x80), and the first one was a match.
        assert(chars == "\xcc");
    }
    {
        std::u8string str = u8"\xcc\x80"; // encodes the code point U+0300
        auto first = str.begin();

        // Since the input is a range of char8_t, this will do Unicode
        // parsing. The same thing would have happened if we passed
        // str | boost::parser::as_utf32 or even str | boost::parser::as_utf8.
        std::string chars;
        assert(bp::parse(first, str.end(), *bp::char_('\xcc'), chars));

        // Finds zero matches of the *code point* 0xcc, because the value in
        // the parser (0xcc) was matched against the single code point in the
        // input, 0x0300.
        assert(chars == "");
    }

Additionally, it is expected that most programs will use UTF-8 for the encoding
of Unicode strings. Boost.Parser is written with this typical case in mind.
This means that if you are parsing 32-bit code points (as you always are
in the Unicode path), and you want to catch the result in a container `C`
of `char` or `char8_t` values, Boost.Parser will silently transcode from
UTF-32 to UTF-8 and write the attribute into `C`. This means that
`std::string`, `std::u8string`, etc. are fine to use as attribute
out-parameters for `*char_`, and the result will be UTF-8.
Note | |
---|---|
UTF-16 strings as attributes are not supported directly. If you want to
use UTF-16 strings as attributes, you may need to do so by transcoding
a UTF-8 or UTF-32 attribute to UTF-16 within a semantic action. You can
do this by using |
The treatment of strings as UTF-8 is nearly ubiquitous within Boost.Parser.
For instance, though the entire interface of `symbols` uses `std::string`
or `std::string_view`, UTF-32 comparisons are used internally.
I mentioned above that the use of `boost::parser::utf*_view` as the range to
parse opts you in to Unicode parsing. Here's a bit more about these views and
how best to use them.
If you want to do Unicode parsing, you're always going to be comparing code points at each step of the parse. As such, you're going to implicitly convert any parse input to UTF-32, if needed. This is what all the parse API functions do internally.
However, there are times when you have parse input that is a sequence of
UTF-8-encoded `char`s, and you want to do Unicode-aware parsing. As mentioned
previously, Boost.Parser has a special case for `char` inputs, and it will
not assume that `char` sequences are UTF-8. If you want to tell the parse API
to do Unicode processing on them anyway, you can use the `as_utf32` range
adaptor. (Note that you can use any of the `as_utf*` adaptors and the
semantics will not differ from the semantics below.)

    namespace bp = boost::parser;

    auto const p = '"' >> *(bp::char_ - '"' - 0xb6) >> '"';
    char const * str = "\"two wörds\""; // ö is two code units, 0xc3 0xb6

    auto result_1 = bp::parse(str, p); // Treat each char as a code point (typically ASCII).
    assert(!result_1);
    auto result_2 = bp::parse(str | bp::as_utf32, p); // Unicode-aware parsing on code points.
    assert(result_2);

The first call to `parse()` treats each `char` as a code point, and since
"ö" is the pair of code units `0xc3` `0xb6`, the parse matches the second
code unit against the `- 0xb6` part of the parser above, causing the parse to
fail. This happens because each code unit/`char` in `str` is treated as an
independent code point.
The second call to `parse()` succeeds because, when the parse gets to the
code point for 'ö', it is `0xf6` (U+00F6), which does not match the `- 0xb6`
part of the parser.
The other adaptors `as_utf8` and `as_utf16` are also provided for
completeness, if you want to use them. They each can transcode any sequence
of character types.
Important | |
---|---|
The |
One thing that Boost.Parser does not handle for you is normalization; Boost.Parser is completely normalization-agnostic. Since all the parsers do their matching using equality comparisons of code points, you should make sure that your parsed range and your parsers all use the same normalization form.
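To see why this matters, consider two canonically equivalent spellings of "é". The standard-library sketch below uses no Boost.Parser calls at all; it only shows that the two spellings are different code point sequences, so a comparison done code point by code point will not treat them as equal. The names are made up for this example.

```cpp
#include <cassert>
#include <string>

// Two canonically equivalent spellings of the same user-perceived character:
//   NFC: the single precomposed code point U+00E9
//   NFD: U+0065 (e) followed by U+0301 (combining acute accent)
std::u32string const e_acute_nfc = U"\u00e9";
std::u32string const e_acute_nfd = U"e\u0301";

// Plain code-point-wise equality, the kind of comparison described above.
bool same_code_points(std::u32string const & a, std::u32string const & b)
{
    return a == b;
}
```

A parser written against the NFC spelling would therefore not match input that arrives in NFD, and vice versa, unless one side is normalized first.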