Boost.Text provides conversions among the UTF-8, UTF-16, and UTF-32 encodings, via converting iterators. The conversion among UTF encoding formats is referred to as transcoding. The converting iterators are utf_8_to_16_iterator, utf_8_to_32_iterator, utf_16_to_8_iterator, utf_16_to_32_iterator, utf_32_to_8_iterator, and utf_32_to_16_iterator.
There are three make-functions that create these transcoding iterators: utf8_iterator(), utf16_iterator(), and utf32_iterator().
A make-function is provided for each to-encoding; utf16_iterator(a, b, c) returns: a utf_8_to_16_iterator if a, b, and c are iterators to UTF-8 code units; a utf_32_to_16_iterator if a, b, and c are iterators to UTF-32 code units (i.e. code points); and b otherwise. See the section on iterator unpacking for details.
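For example (a minimal sketch; the defaulted template parameters of utf_8_to_16_iterator are assumed here), calling utf16_iterator() on pointers to UTF-8 code units yields a utf_8_to_16_iterator:

```cpp
// Minimal sketch of the make-function deduction described above.  Because
// the arguments are pointers to UTF-8 code units, utf16_iterator() returns
// a utf_8_to_16_iterator (defaulted template parameters assumed).
char const * c_str = "text";
auto it = boost::text::utf16_iterator(c_str, c_str, c_str + 4);
static_assert(
    std::is_same<
        decltype(it),
        boost::text::utf_8_to_16_iterator<char const *>>::value, "");
```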
By default, the transcoding iterators produce the Unicode replacement character 0xFFFD when encountering an invalid encoding. The exact error handling behavior can be controlled via the ErrorHandler template parameter.
Note: The error handling strategy of producing replacement characters is used exclusively within Boost.Text when performing conversions.
The Unicode standard is flexible with respect to where, in an incoming stream, encoding errors are reported. However, the standard provides recommendations for where within a stream, and how frequently within that stream, errors should be reported. Boost.Text's converting iterators follow the Unicode recommendations. See Unicode, "Best Practices for Using U+FFFD" and Table 3-8.
The converting iterators are pretty straightforward, but there is an important caveat. Because each of these converting iterators does a substantial amount of work in increment and decrement operations, including in some cases caching the result of reading several bytes of a multi-byte encoding, post-increment and post-decrement can be quite a bit more expensive than pre-increment and pre-decrement.
To use a converting iterator, you must provide it with an underlying iterator. An example of use:
```cpp
uint32_t const utf32[] = {0x004d, 0x0430, 0x4e8c, 0x10302};
char const utf8[] = {
    0x4d,
    char(0xd0), char(0xb0),
    char(0xe4), char(0xba), char(0x8c),
    char(0xf0), char(0x90), char(0x8c), char(0x82)};

int i = 0;
for (auto it = boost::text::utf32_iterator(
         std::begin(utf8), std::begin(utf8), std::end(utf8)),
          end = boost::text::utf32_iterator(
              std::begin(utf8), std::end(utf8), std::end(utf8));
     it != end;
     ++it) {
    uint32_t cp = *it;
    assert(cp == utf32[i++]);
}
```
That's a lot of typing, so there's also a much terser range-based form using utf32_view:
```cpp
uint32_t const utf32[] = {0x004d, 0x0430, 0x4e8c, 0x10302};
char const utf8[] = {
    0x4d,
    char(0xd0), char(0xb0),
    char(0xe4), char(0xba), char(0x8c),
    char(0xf0), char(0x90), char(0x8c), char(0x82)};

int i = 0;
for (auto cp : boost::text::as_utf32(utf8)) {
    assert(cp == utf32[i++]);
}
```
All the transcoding iterators are bidirectional. When each one is dereferenced, it may read multiple values from the underlying input range. For instance, a utf_8_to_32_iterator it has to read 1-4 code units to produce a code point when performing the operation *it. Therefore, each of these iterators requires the endpoints of the underlying range, to ensure that reads outside the underlying range generate replacement characters instead of undefined behavior.
All the transcoding iterators still exhibit undefined behavior when incrementing past the end of, or decrementing before the beginning of, the underlying range. The generation of replacement characters only happens in cases where the number of code units is variable and there is an incomplete sequence of code units right at the beginning or end.
For instance, if you have a sequence of UTF-8 that consists of a single broken code point sequence:
```cpp
// Note that the full UTF-8 encoding for U+8D00 is 0xe8, 0xb4, 0x80.
char const broken_utf8_for_0x8d00[] = {
    char(0xe8), char(0xb4) /* Incomplete! */};

auto it = boost::text::utf32_iterator(
    std::begin(broken_utf8_for_0x8d00),
    std::begin(broken_utf8_for_0x8d00),
    std::end(broken_utf8_for_0x8d00));

assert(*it == boost::text::replacement_character());

// --it; // Undefined behavior!
++it;    // Ok.  Incrementing through a broken sequence is defined.
// ++it; // Undefined behavior!  A subsequent increment past the end is not defined.
```
then *it will produce a replacement character, and the behavior of std::next(it) is defined. However, the behaviors of std::prev(it) and std::next(it, 2) are each undefined.
In keeping with the ranges and algorithms from std::ranges, the transcoding iterators accept an underlying input range that may be delimited by different types: the range is represented as an iterator-sentinel pair, instead of a pair of iterators. Boost.Text provides a null-terminated string sentinel, null_sentinel_t, which can be used to represent a null-terminated string as a pointer/null_sentinel_t pair.
Each transcoding iterator contains three elements: the first and last of the underlying range, and the current position within that range. In the worst case, a transcoding iterator is the size of three pointers. If the underlying range is end-delimited by a null_sentinel_t, it may be a bit smaller.
There are output iterator adapters for each of the iterators above. For each transcoding iterator utf_N_to_M_iterator, there are these iterators (and their associated functions):

utf_N_to_M_out_iterator - This is an adapting iterator that accepts UTF-N values and writes values of type UTF-M to an underlying output iterator. Each write to this iterator may write multiple values to the underlying iterator, depending on N and M. You can create one using utf_N_to_M_out.

utf_N_to_M_insert_iterator - This is analogous to std::insert_iterator, but also does UTF-N to UTF-M transcoding. You can create one using from_utfN_inserter; M is deduced from the size of the value_type of the container.

utf_N_to_M_front_insert_iterator - This is analogous to std::front_insert_iterator, but also does UTF-N to UTF-M transcoding. You can create one using from_utfN_front_inserter; M is deduced from the size of the value_type of the container.

utf_N_to_M_back_insert_iterator - This is analogous to std::back_insert_iterator, but also does UTF-N to UTF-M transcoding. You can create one using from_utfN_back_inserter; M is deduced from the size of the value_type of the container (a sketch follows this list).
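For example, here is a minimal sketch of writing code points into a UTF-8 std::string with from_utf32_back_inserter(); it relies on the deduction rule just described (std::string's 1-byte value_type selects UTF-8 output):

```cpp
uint32_t const code_points[] = {0x004d, 0x0430, 0x4e8c, 0x10302};

std::string utf8;
// from_utf32_back_inserter() deduces UTF-8 output from std::string's
// 1-byte value_type; each code point written may append several chars.
std::copy(
    std::begin(code_points), std::end(code_points),
    boost::text::from_utf32_back_inserter(utf8));
// utf8 now holds the same bytes as the utf8[] arrays in the earlier examples.
```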
The transcoding iterators are available for flexibility. In particular, they can be used with the standard algorithms. However, this flexibility comes at a cost. When doing a bulk-transcoding operation, using these iterators can be substantially slower than using the transcoding algorithms.
Using transcoding iterators directly is a bit verbose and tedious. Also, when you use two transcoding iterators as an iterator pair, the pair may be the size of 6 underlying iterators!
So, both for convenience and for size optimization, Boost.Text provides view templates for UTF-8, UTF-16, and UTF-32: utf8_view, utf16_view, and utf32_view, respectively. Just as with the transcoding iterators, each view can be constructed from a pair of iterators or as an iterator-sentinel pair.
Make-functions as_utf8(), as_utf16(), and as_utf32() are also provided. The as_utfN() functions are overloaded to take an iterator and sentinel, or a range:
```cpp
std::string str = "some text";

// These each represent the same range:
auto const utf32_range_the_hard_way =
    boost::text::utf32_view<std::string::iterator>(str.begin(), str.end());
auto const utf32_range_from_utf8_range = boost::text::as_utf32(str);
auto const utf32_range_from_utf8_iters =
    boost::text::as_utf32(str.begin(), str.end());
auto const utf32_range_from_utf8_iter_and_sentinel =
    boost::text::as_utf32(str.c_str(), boost::text::null_sentinel);
```
These functions also support null-terminated strings; you can just pass a pointer to as_utfN():
```cpp
auto const range_1 = boost::text::as_utf32("text");         // char const *
auto const range_2 = boost::text::as_utf32(u8"more text");  // char8_t const *, in C++20 and later
auto const range_3 = boost::text::as_utf32(u16"more text"); // char16_t const *
auto const range_4 = boost::text::as_utf32(u32"more text"); // char32_t const *
```
You can pass any range of 1-, 2-, or 4-byte integral values to as_utfN(), and you will get a range that transcodes from UTF-8, UTF-16, or UTF-32, respectively, to UTF-N.
The views are all streamable. They transcode to UTF-8 when streamed to a std::basic_ostream<char>, and they transcode to UTF-16 when streamed to a std::basic_ostream<wchar_t> (Windows only).
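For instance, this minimal sketch writes UTF-8 to std::cout, per the rule just described:

```cpp
std::u16string const s = u"some text";
// Streaming a utf16_view to std::cout (a std::basic_ostream<char>)
// transcodes the text to UTF-8 on the way out.
std::cout << boost::text::as_utf16(s) << '\n';
```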
A simple way to represent a transcoding view is as a pair of transcoding iterators. However, there is a problem with that approach: a utf32_view<utf_8_to_32_iterator<char const *>> would be a range the size of 6 pointers. Worse yet, a utf32_view<utf_16_to_32_iterator<utf_8_to_16_iterator<char const *>>> would be the size of 18 pointers! Further, such a view would do a UTF-8 to UTF-16 to UTF-32 conversion, when it could have done a direct UTF-8 to UTF-32 conversion instead.
To solve these kinds of problems, as_utfN() unpacks the iterators it is given, so that only the bottom-most underlying pointer or iterator is stored:
```cpp
std::string str = "some text";

auto to_16_first = boost::text::utf_8_to_16_iterator<std::string::iterator>(
    str.begin(), str.begin(), str.end());
auto to_16_last = boost::text::utf_8_to_16_iterator<std::string::iterator>(
    str.begin(), str.end(), str.end());

auto to_32_first = boost::text::utf_16_to_32_iterator<
    boost::text::utf_8_to_16_iterator<std::string::iterator>
>(to_16_first, to_16_first, to_16_last);
auto to_32_last = boost::text::utf_16_to_32_iterator<
    boost::text::utf_8_to_16_iterator<std::string::iterator>
>(to_16_first, to_16_last, to_16_last);

auto range = boost::text::as_utf8(to_32_first, to_32_last);

static_assert(
    std::is_same<
        decltype(range),
        boost::text::utf8_view<std::string::iterator>>::value, "");
```
Each of these views stores only the unpacked iterator and sentinel, so each view is typically the size of two pointers, and possibly smaller if a sentinel is used.
The same unpacking logic is used in utfN_iterator(), from_utfN_inserter(), the transcoding algorithms, and the normalization algorithms. This allows you to write boost::text::as_utf32(first, last) in a generic context, without caring whether first and last are iterators to a sequence of UTF-8, UTF-16, or UTF-32. You also do not need to care about whether first and last are raw pointers, some other kind of iterator, or transcoding iterators. For example, if first is a utf_32_to_8_iterator, the resulting view will use first.base() for its begin-iterator.
When you only need to transcode from one UTF encoding to another, use the transcoding algorithms instead of the transcoding iterators. The algorithms are quite a bit faster in most cases (transcode_to_utf32() is particularly fast when given UTF-8 input, as it uses SIMD instructions when available).
There are three of these, just as there are three make-functions for transcoding iterators: transcode_to_utf8(), transcode_to_utf16(), and transcode_to_utf32().
Like the default behavior of the transcoding iterators, these algorithms produce the Unicode replacement character 0xFFFD when encountering an invalid encoding. Unlike the iterators, the algorithms cannot be configured to handle errors in any other way.
These are fully generic algorithms, and there are overloads that take iterator-sentinel pairs as well as ones that take ranges. Also, the transcode algorithms unpack the iterators that they are given, which can be a large optimization. See the iterator unpacking section for details.
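For example, a bulk UTF-8 to UTF-32 conversion might look like the following minimal sketch; the output-iterator overload shown here is an assumption based on the description above:

```cpp
std::string const utf8 = "some text";

std::vector<uint32_t> utf32;
// Bulk-transcode the whole UTF-8 string to UTF-32 in one call.  For UTF-8
// input this is the case that can use SIMD instructions when available.
boost::text::transcode_to_utf32(
    utf8.begin(), utf8.end(), std::back_inserter(utf32));
```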
Important: Though these algorithms are generic, some of the optimizations in them only apply when both the input and output iterators are pointers. In performance-critical code paths, stick to pointers.
Since there are multiple ways to perform transcoding, how do you pick one? Here are some guidelines: when you want to transcode in bulk, up front, use the transcoding algorithms; when a lazily transcoded range created with a single as_utfN() function call will do, use the transcoding views; and reach for the transcoding iterators themselves only when you need their extra flexibility, such as using them directly with standard algorithms.
When using utf_8_to_32_iterator or utf32_view, it is often desirable to get access to the underlying sequence of chars (e.g. for copying into a buffer or constructing a std::string). utf_8_to_32_iterator exposes, as in fact all the converting iterators do, the iterator it is parameterized with, via the member function base(). You can always get at the sequence of chars underlying the code point sequence exposed by utf_8_to_32_iterator like this:
```cpp
boost::text::utf_8_to_32_iterator first = /* ... */;
boost::text::utf_8_to_32_iterator last = /* ... */;

// Copy [first, last) as code points.
std::vector<uint32_t> cp_vec;
std::copy(first, last, std::back_inserter(cp_vec));

// Copy [first, last) as chars.
std::vector<char> char_vec;
std::copy(first.base(), last.base(), std::back_inserter(char_vec));
```
See the iterator unpacking section for more detail about how this is used within Boost.Text's interfaces.
Unicode text often contains sequences in which a noncombining code point (e.g. 'A') is followed by one or more combining code points (e.g. some number of umlauts). It is valid to have an 'A' followed by 100 million umlauts. This is valid but not useful. Unicode specifies something called the Stream-Safe Format. This format inserts extra code points between combiners to ensure that there are never more than 30 combiners in a row. In practice, you should never need anywhere near 30.
Boost.Text provides an API for putting text in a Stream-Safe Format:
```cpp
std::vector<uint32_t> code_points = /* ... */;

auto it = boost::text::stream_safe(code_points);
code_points.erase(it.base(), code_points.end());
assert(boost::text::is_stream_safe(code_points));
```
There is also a view API:
```cpp
std::vector<uint32_t> code_points = /* ... */;

auto const stream_safe_view = boost::text::as_stream_safe(code_points);
```
These operations do not implement the Stream-Safe Format algorithm described on the Unicode website. Instead, they take the much simpler approach of allowing at most 8 combiners after any noncombiner; the rest are truncated.
Important: Long sequences of combining characters create a problem for algorithms like normalization or grapheme breaking; the grapheme breaking algorithm may be required to look ahead a very long way in order to determine how to handle the current grapheme. To address this, Unicode allows a conforming implementation to assume that a sequence of code points contains graphemes of at most 32 code points. This is known as the Stream-Safe Format assumption; Boost.Text makes this assumption. If you give Boost.Text algorithms a code point sequence with graphemes longer than 32 code points, you will get undefined behavior. This poses a security problem. To address this, use the stream-safe API described above.
Boost.Text provides algorithms for all four Unicode normalization forms: NFD, NFKD, NFC, and NFKC. In addition, it provides an unofficial fifth normalization form called FCC that is described in Unicode Technical Note #5. FCC is just as compact as the most compact official form, NFC, except in a few degenerate cases. FCC is particularly useful when doing collation — the collation algorithm requires its inputs to be normalized NFD or FCC, and FCC is much faster to normalize to.
The algorithm is invoked as normalize<X>(), where X is one of the enumerators of nf ("Normalization Form"). Range- and iterator-based overloads are provided. The iterator overloads require iterators that model code_point_iter, and the range overloads require ranges that model code_point_range.
There are also algorithms that can check whether a code point sequence is in a certain normalization form.
```cpp
// 쨰◌̴ᆮ HANGUL SYLLABLE JJYAE, COMBINING TILDE OVERLAY, HANGUL JONGSEONG TIKEUT
std::array<uint32_t, 4> const nfd = {{ 0x110D, 0x1164, 0x0334, 0x11AE }};

// Iterator interface.
assert(boost::text::normalized<boost::text::nf::d>(nfd.begin(), nfd.end()));

{
    std::vector<uint32_t> nfc;
    // Iterator interface.
    boost::text::normalize<boost::text::nf::c>(
        nfd.begin(), nfd.end(), std::back_inserter(nfc));
    // Range interface.
    assert(boost::text::normalized<boost::text::nf::c>(nfc));
}

{
    std::vector<uint32_t> nfc;
    // Range interface.
    boost::text::normalize<boost::text::nf::c>(nfd, std::back_inserter(nfc));
    // Range interface.
    assert(boost::text::normalized<boost::text::nf::c>(nfc));
}
```
There are std::string-specific in-place normalization functions as well, in normalize_string.hpp.
There is also an API, normalize_append(), for normalizing a code point sequence and appending the result to a container.
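Here is a hedged sketch of what a call looks like; the exact overload shown (code point range in, container appended to) is an assumption for illustration, with the output UTF deduced from the container's value_type as described below:

```cpp
// 쨰◌̴ᆮ HANGUL SYLLABLE JJYAE, COMBINING TILDE OVERLAY, HANGUL JONGSEONG TIKEUT
std::array<uint32_t, 4> const nfd = {{ 0x110D, 0x1164, 0x0334, 0x11AE }};

// Hedged sketch: normalize_append()'s exact signature is assumed here.
// The normalized result is appended to a UTF-8 container (std::string);
// the output UTF is deduced from the container's 1-byte value_type.
std::string nfc_utf8;
boost::text::normalize_append<boost::text::nf::c>(nfd, nfc_utf8);
```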
This is much more performant than the normalize() function, because the output iterator used by normalize() slows things down quite a bit; normalize() can be factors slower than normalize_append().
normalize_append() will only append to UTF-8 and UTF-16 containers (UTF-32-encoded containers are very uncommon, and so are not supported).
Note: When you're working entirely within a UTF-8 encoding (on both sides of the normalization operation), there is a version of the normalization API that is more efficient still.
If you need to insert text into a std::string or other STL-compatible container, you can use the erase/insert/replace API found in normalize_algorithm.hpp. There are iterator and range overloads of each. Each one performs the requested erasure, insertion, or replacement, and then re-normalizes the code points around the edited region. This last step is necessary because insertions and erasures may create situations in which code points that may combine are now next to each other, when they were not before.
This API is like the normalize_append() overloads in that it may operate on UTF-8 or UTF-16 containers, and it deduces the UTF from the size of the mutated container's value_type.