
Encoding and Normalization

Transcoding Iterators

Boost.Text provides conversions among the UTF-8, UTF-16, and UTF-32 encodings via converting iterators. Conversion among the UTF encoding formats is referred to as transcoding. The converting iterators are utf_8_to_16_iterator, utf_16_to_8_iterator, utf_8_to_32_iterator, utf_32_to_8_iterator, utf_16_to_32_iterator, and utf_32_to_16_iterator; there is one for each ordered pair of UTF formats.

There are three make-functions that create these transcoding iterators: utf8_iterator(), utf16_iterator(), and utf32_iterator().

A make-function is provided for each to-encoding. For example, utf16_iterator(a, b, c) returns: a utf_8_to_16_iterator if a, b, and c are iterators to UTF-8 code units; a utf_32_to_16_iterator if a, b, and c are iterators to UTF-32 code units (i.e. code points); and b itself if a, b, and c are already iterators to UTF-16 code units. See the section on iterator unpacking for details.
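
For example (a sketch; the static_assert assumes the defaulted Sentinel and ErrorHandler template parameters of utf_8_to_16_iterator):

char const * str = "text";
// UTF-8 code unit iterators in, so utf16_iterator() selects utf_8_to_16_iterator.
auto it = boost::text::utf16_iterator(str, str, str + 4);
static_assert(
    std::is_same<
        decltype(it),
        boost::text::utf_8_to_16_iterator<char const *>>::value,
    "");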

By default, the transcoding iterators produce the Unicode replacement character 0xFFFD when encountering an invalid encoding. The exact error handling behavior can be controlled via the ErrorHandler template parameter.

[Note] Note

Producing replacement characters is the only error handling strategy that Boost.Text itself uses when it performs conversions.

The Unicode standard is flexible with respect to where, in an incoming stream, encoding errors are reported. However, the standard provides recommendations for where within a stream, and how frequently within that stream, errors should be reported. Boost.Text's converting iterators follow the Unicode recommendations. See Unicode, "Best Practices for Using U+FFFD" and Table 3-8.

The converting iterators are pretty straightforward, but there is an important caveat. Because each of these converting iterators does a substantial amount of work in increment and decrement operations, including in some cases caching the result of reading several bytes of a multi-byte encoding, post-increment and post-decrement can be quite a bit more expensive than pre-increment and pre-decrement.

To use a converting iterator, you must provide it with an underlying iterator. An example of use:

uint32_t const utf32[] = {0x004d, 0x0430, 0x4e8c, 0x10302};
char const utf8[] = {
    0x4d,
    char(0xd0),
    char(0xb0),
    char(0xe4),
    char(0xba),
    char(0x8c),
    char(0xf0),
    char(0x90),
    char(0x8c),
    char(0x82)};
int i = 0;
for (auto it = boost::text::utf32_iterator(
              std::begin(utf8), std::begin(utf8), std::end(utf8)),
          end = boost::text::utf32_iterator(
              std::begin(utf8), std::end(utf8), std::end(utf8));
     it != end;
     ++it) {
    uint32_t cp = *it;
    assert(cp == utf32[i++]);
}

That's a lot of typing, so there's also a much terser range-based form using utf32_view:

uint32_t const utf32[] = {0x004d, 0x0430, 0x4e8c, 0x10302};
char const utf8[] = {
    0x4d,
    char(0xd0),
    char(0xb0),
    char(0xe4),
    char(0xba),
    char(0x8c),
    char(0xf0),
    char(0x90),
    char(0x8c),
    char(0x82)};
int i = 0;
for (auto cp : boost::text::as_utf32(utf8)) {
    assert(cp == utf32[i++]);
}

All the transcoding iterators are bidirectional. When one is dereferenced, it may read multiple values from the underlying input range. For instance, a utf_8_to_32_iterator it must read 1-4 UTF-8 code units to produce a single code point when evaluating *it. Therefore, each of these iterators requires the endpoints of the underlying range, so that reads outside that range generate replacement characters instead of undefined behavior.

All the transcoding iterators still exhibit undefined behavior when incrementing past the end of, or decrementing before the beginning of, the underlying range. The generation of replacement characters only happens in cases where the number of code units is variable and there is an incomplete sequence of code units right at the beginning or end.

For instance, if you have a sequence of UTF-8 that consists of a single broken code point sequence:

// Note that the full UTF-8 encoding for U+8D00 is 0xe8, 0xb4, 0x80.
char const broken_utf8_for_0x8d00[] = {
    char(0xe8), char(0xb4) /* Incomplete! */};
auto it = boost::text::utf32_iterator(
    std::begin(broken_utf8_for_0x8d00),
    std::begin(broken_utf8_for_0x8d00),
    std::end(broken_utf8_for_0x8d00));

assert(*it == boost::text::replacement_character());

// --it; // Undefined behavior!
++it; // Ok.  Incrementing through a broken sequence is defined.
// ++it; // Undefined behavior!  A subsequent increment past the end is not defined.


then *it will produce a replacement character, and the behavior of std::next(it) is defined. However, the behaviors of std::prev(it) and std::next(it, 2) are each undefined.

In keeping with the ranges and algorithms from std::ranges, the transcoding iterators accept an underlying input range that may be delimited by different types: the range is represented as an iterator-sentinel pair, instead of a pair of iterators. Boost.Text provides a null-terminated string sentinel, null_sentinel_t, which can be used to represent a null-terminated string as a pointer-null_sentinel_t pair.
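
For example, a null-terminated UTF-8 string can be walked without first computing its length (a sketch, assuming the make-function accepts a sentinel in place of the end iterator, and that the resulting iterator is comparable with null_sentinel):

char const * c_str = "yet more text";
auto it = boost::text::utf32_iterator(
    c_str, c_str, boost::text::null_sentinel);
while (it != boost::text::null_sentinel) {
    uint32_t cp = *it; // One code point at a time, stopping at the terminating 0.
    (void)cp;
    ++it;
}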

Each transcoding iterator contains three elements: the [first, last) of the underlying range, and the current position within that range. In the worst case, a transcoding iterator is the size of three pointers. If the underlying range is end-delimited by a null_sentinel_t, it may be a bit smaller.

Transcoding Output Iterators

There are output iterator adapters for each of the iterators above. For each transcoding iterator utf_N_to_M_iterator, there are corresponding output iterators (and their associated make-functions, such as the from_utfN_inserter() functions mentioned below).
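
For example, code points can be written through such an adapter with an ordinary standard algorithm (a sketch; it assumes from_utf32_inserter() takes the container and an insertion position, and deduces the output encoding from the container's value_type):

uint32_t const utf32[] = {0x004d, 0x0430, 0x4e8c, 0x10302};
std::string utf8;
// Each code point written to the inserter is transcoded to UTF-8 and appended.
std::copy(
    std::begin(utf32), std::end(utf32),
    boost::text::from_utf32_inserter(utf8, utf8.end()));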

Transcoding Iterator Performance

The transcoding iterators are available for flexibility. In particular, they can be used with the standard algorithms. However, this flexibility comes at a cost. When doing a bulk-transcoding operation, using these iterators can be substantially slower than using the transcoding algorithms.

Transcoding Views

Using transcoding iterators directly is a bit verbose and tedious. Also, when you use two transcoding iterators as an iterator pair, the pair may be the size of 6 underlying iterators!

So, both for convenience and for size optimization, Boost.Text provides view templates for UTF-8, UTF-16, and UTF-32: utf8_view, utf16_view, and utf32_view, respectively. Just as with the transcoding iterators, each view can be constructed from a pair of iterators or as an iterator-sentinel pair.

Make-functions as_utf8(), as_utf16(), and as_utf32() are also provided. The as_utfN() functions are overloaded to take an iterator and sentinel, or a range:

std::string str = "some text";

// These each represent the same range:

auto const utf32_range_the_hard_way =
    boost::text::utf32_view<std::string::iterator>(str.begin(), str.end());

auto const utf32_range_from_utf8_range =
    boost::text::as_utf32(str);

auto const utf32_range_from_utf8_iters =
    boost::text::as_utf32(str.begin(), str.end());

auto const utf32_range_from_utf8_iter_and_sentinel =
    boost::text::as_utf32(str.c_str(), boost::text::null_sentinel);

These functions also support null-terminated strings; you can just pass a pointer to as_utfN():

auto const range_1 = boost::text::as_utf32("text");         // char const *
auto const range_2 = boost::text::as_utf32(u8"more text");  // char8_t const *, in C++20 and later
auto const range_3 = boost::text::as_utf32(u16"more text"); // char16_t const *
auto const range_4 = boost::text::as_utf32(u32"more text"); // char32_t const *

You can pass any range of 1-, 2-, or 4-byte integral values to as_utfN(), and you will get a range that transcodes from UTF-8, UTF-16, or UTF-32 to UTF-N, respectively.
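
For example, a std::u16string can be viewed as UTF-8 directly:

std::u16string u16 = u"some text";
// The char16_t elements are treated as UTF-16 and transcoded on the fly.
auto u8 = boost::text::as_utf8(u16);
assert(std::distance(u8.begin(), u8.end()) == 9); // "some text" is 9 UTF-8 code units.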

The views are all streamable. They transcode to UTF-8 when streamed to a std::basic_ostream<char>, and they transcode to UTF-16 when streamed to a std::basic_ostream<wchar_t> (Windows only).
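
For example (assuming the stream's destination expects UTF-8):

std::u16string u16 = u"résumé";
// The UTF-16 view is transcoded to UTF-8 on the way out.
std::cout << boost::text::as_utf16(u16) << '\n';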

[Note] Note

Though it does not produce a utfN_view, there is a related function to_string(), which takes two UTF-32 iterators and returns a std::string containing the given sequence, UTF-8-encoded.
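
For example (a small sketch of the documented two-iterator overload):

uint32_t const cps[] = {0x48, 0x69, 0x21}; // U+0048 U+0069 U+0021, i.e. "Hi!"
std::string s = boost::text::to_string(std::begin(cps), std::end(cps));
assert(s == "Hi!");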

[Important] Important

utf8_view, utf16_view, and utf32_view are all implemented in terms of C++20's std::ranges::view_interface. Boost.Text uses a pre-C++20-friendly implementation of this from Boost.STLInterfaces for pre-C++20 builds. The implication of using view_interface is that utf8_view, utf16_view, and utf32_view all have the fullest interface possible, based on the iterator and/or sentinel template parameters used to instantiate them. For instance, if you use a random access iterator to instantiate utf32_view, it will have an operator[]. If you use a bidirectional iterator instead, it will not have operator[]. See [view.interface] in the standard for more details.

Iterator "Unpacking"

A simple way to represent a transcoding view is as a pair of transcoding iterators. However, there is a problem with that approach, since a utf32_view<utf_8_to_32_iterator<char const *>> would be a range the size of 6 pointers. Worse yet, a utf32_view<utf_16_to_32_iterator<utf_8_to_16_iterator<char const *>>> would be the size of 18 pointers! Further, such a view would do a UTF-8 to UTF-16 to UTF-32 conversion, when it could have done a direct UTF-8 to UTF-32 conversion instead.

To solve these kinds of problems, as_utfN() unpacks the iterators it is given, so that only the bottom-most underlying pointer or iterator is stored:

std::string str = "some text";

auto to_16_first = boost::text::utf_8_to_16_iterator<std::string::iterator>(
    str.begin(), str.begin(), str.end());
auto to_16_last = boost::text::utf_8_to_16_iterator<std::string::iterator>(
    str.begin(), str.end(), str.end());

auto to_32_first = boost::text::utf_16_to_32_iterator<
    boost::text::utf_8_to_16_iterator<std::string::iterator>
>(to_16_first, to_16_first, to_16_last);
auto to_32_last = boost::text::utf_16_to_32_iterator<
    boost::text::utf_8_to_16_iterator<std::string::iterator>
>(to_16_first, to_16_last, to_16_last);

auto range = boost::text::as_utf8(to_32_first, to_32_last);
static_assert(std::is_same<decltype(range),
                           boost::text::utf8_view<std::string::iterator>>::value, "");

Each of these views stores only the unpacked iterator and sentinel, so each view is typically the size of two pointers, and possibly smaller if a sentinel is used.

The same unpacking logic is used in utfN_iterator(), from_utfN_inserter(), the transcoding algorithms, and the normalization algorithms. This allows you to write boost::text::as_utf32(first, last) in a generic context, without caring whether first and last are iterators to a sequence of UTF-8, UTF-16, or UTF-32. You also do not need to care about whether first and last are raw pointers, some other kind of iterator, or transcoding iterators. For example, if first is a utf_32_to_8_iterator, the resulting view will use first.base() for its begin-iterator.
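
For example, a function template can count code points without caring what kind of UTF sequence, or what kind of iterators, it is handed (a small sketch):

template<typename Iter, typename Sentinel>
std::ptrdiff_t count_code_points(Iter first, Sentinel last)
{
    std::ptrdiff_t count = 0;
    // as_utf32() unpacks any transcoding iterators and dispatches on the
    // underlying encoding, so this works for UTF-8, UTF-16, or UTF-32 input.
    for (auto cp : boost::text::as_utf32(first, last)) {
        (void)cp;
        ++count;
    }
    return count;
}

// E.g. count_code_points(std::begin(utf8), std::end(utf8)) == 4 for the utf8
// array from the earlier examples, and the same call works unchanged with
// to_32_first and to_32_last from the example just above.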

Transcoding Algorithms

When you only need to transcode from one UTF encoding to another, use the transcoding algorithms instead of the transcoding iterators. The algorithms are quite a bit faster in most cases (transcode_to_utf32() is particularly fast when given UTF-8 input, as it uses SIMD instructions when available).

There are three of these, just as there are three make-functions for transcoding iterators: transcode_to_utf8(), transcode_to_utf16(), and transcode_to_utf32().

Like the default behavior of the transcoding iterators, these algorithms produce the Unicode replacement character 0xFFFD when encountering an invalid encoding. Unlike the iterators, the algorithms are not configurable to handle errors in any other way.

These are fully generic algorithms, and there are overloads that take iterator-sentinel pairs as well as ones that take ranges. Also, the transcode algorithms unpack the iterators that they are given, which can be a large optimization. See the iterator unpacking section for details.
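
For example (a sketch; it assumes the iterator/output-iterator and range/output-iterator overloads shown here, and ignores the algorithms' return values):

std::string utf8 = "some text";

// Iterator-pair overload.
std::u16string utf16;
boost::text::transcode_to_utf16(
    utf8.begin(), utf8.end(), std::back_inserter(utf16));

// Range overload.
std::u32string utf32;
boost::text::transcode_to_utf32(utf8, std::back_inserter(utf32));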

[Important] Important

Though these algorithms are generic, some of their optimizations apply only when both the input and output iterators are pointers. In performance-critical code paths, stick to pointers.

Choosing a Transcoding Mechanism

Since there are multiple ways to perform transcoding, how do you pick one? Here are some guidelines:

Accessing the Underlying UTF-8 chars

When using utf_8_to_32_iterator or utf32_view, it is often desirable to get access to the underlying sequence of chars (e.g. for copying into a buffer or constructing a std::string).

All the converting iterators, including utf_8_to_32_iterator, expose the iterator they are parameterized with via the member function base(). So you can always get at the sequence of char underlying the code point sequence produced by a utf_8_to_32_iterator like this:

boost::text::utf_8_to_32_iterator first = /* ... */;
boost::text::utf_8_to_32_iterator last = /* ... */;

// Copy [first, last) as code points.
std::vector<uint32_t> cp_vec;
std::copy(first, last, std::back_inserter(cp_vec));

// Copy [first, last) as chars.
std::vector<char> char_vec;
std::copy(first.base(), last.base(), std::back_inserter(char_vec));

See the iterator unpacking section for more detail about how this is used within Boost.Text's interfaces.

The Stream-Safe Format

Unicode text often contains sequences in which a noncombining code point (e.g. 'A') is followed by one or more combining code points (e.g. some number of umlauts). It is valid to have an 'A' followed by 100 million umlauts, but it is not useful. Unicode therefore specifies the Stream-Safe Format, which inserts extra code points between combiners to ensure that there are never more than 30 combiners in a row. In practice, you should never need anywhere near 30.

Boost.Text provides an API for putting text in a Stream-Safe Format:

std::vector<uint32_t> code_points = /* ... */;
auto it = boost::text::stream_safe(code_points);
code_points.erase(it.base(), code_points.end());
assert(boost::text::is_stream_safe(code_points));

There is also a view API:

std::vector<uint32_t> code_points = /* ... */;
auto const stream_safe_view = boost::text::as_stream_safe(code_points);

These operations do not implement the Stream-Safe Format algorithm described on the Unicode web site. Instead, they take the much simpler approach of allowing at most 8 combiners after any noncombiner; the rest are truncated.

[Important] Important

Long sequences of combining characters create a problem for algorithms like normalization or grapheme breaking; the grapheme breaking algorithm may be required to look ahead a very long way in order to determine how to handle the current grapheme. To address this, Unicode allows a conforming implementation to assume that a sequence of code points contains graphemes of at most 32 code points. This is known as the Stream-Safe Format assumption; Boost.Text makes this assumption.

If you give Boost.Text algorithms a code point sequence with graphemes longer than 32 code points, you will get undefined behavior. This can be a security problem, especially when processing untrusted input. To address it, use the stream-safe API (e.g. boost::text::as_stream_safe()), or use the container-modifying free functions to normalize, render stream-safe, and erase/insert/replace in one call. See the Container-Modifying Normalization API section for details.
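
For example, untrusted input can be made stream-safe lazily before it is normalized (a sketch combining as_stream_safe() with the range overload of normalize() shown in the next section):

std::vector<uint32_t> untrusted = /* ... */;
std::vector<uint32_t> nfc;
// The stream-safe view caps runs of combining code points, so the
// normalization algorithm's grapheme-length assumption holds.
boost::text::normalize<boost::text::nf::c>(
    boost::text::as_stream_safe(untrusted), std::back_inserter(nfc));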

Normalization

Boost.Text provides algorithms for all four Unicode normalization forms: NFD, NFKD, NFC, and NFKC. In addition, it provides an unofficial fifth normalization form called FCC that is described in Unicode Technical Note #5. FCC is just as compact as the most compact official form, NFC, except in a few degenerate cases. FCC is particularly useful when doing collation — the collation algorithm requires its inputs to be normalized NFD or FCC, and FCC is much faster to normalize to.

The algorithm is invoked as normalize<X>(), where X is one of the enumerators of nf ("Normalization Form"). Range and iterator-based overloads are provided; the iterator overloads require iterators that model code_point_iter, and the range overloads require ranges that model code_point_range. There are also algorithms that check whether a code point sequence is in a given normalization form.

// 쨰◌̴ᆮ HANGUL SYLLABLE JJYAE, COMBINING TILDE OVERLAY, HANGUL JONGSEONG TIKEUT
std::array<uint32_t, 4> const nfd = {{ 0x110D, 0x1164, 0x0334, 0x11AE }};
// Iterator interface.
assert(boost::text::normalized<boost::text::nf::d>(nfd.begin(), nfd.end()));

{
    std::vector<uint32_t> nfc;
    // Iterator interface.
    boost::text::normalize<boost::text::nf::c>(nfd.begin(), nfd.end(), std::back_inserter(nfc));
    // Range interface.
    assert(boost::text::normalized<boost::text::nf::c>(nfc));
}

{
    std::vector<uint32_t> nfc;
    // Range interface.
    boost::text::normalize<boost::text::nf::c>(nfd, std::back_inserter(nfc));
    // Range interface.
    assert(boost::text::normalized<boost::text::nf::c>(nfc));
}

There are std::string-specific in-place normalization functions as well, in normalize_string.hpp.

There is also an API for normalizing a code point sequence and appending the result to a container:
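
A minimal sketch of that usage, assuming normalize_append() takes the normalization form as a template parameter, a code point range (or iterator pair), and the container to append to:

// 쨰◌̴ᆮ HANGUL SYLLABLE JJYAE, COMBINING TILDE OVERLAY, HANGUL JONGSEONG TIKEUT
std::array<uint32_t, 4> const nfd = {{ 0x110D, 0x1164, 0x0334, 0x11AE }};

std::string nfc_utf8;
// Appends the NFC form of nfd, UTF-8-encoded, to nfc_utf8; the output UTF is
// deduced from the container's value_type (char here).
boost::text::normalize_append<boost::text::nf::c>(nfd, nfc_utf8);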


This is much more performant than the normalize() function, because the output iterator used by normalize() slows things down considerably; normalize() can be several times slower than normalize_append(). Note that normalize_append() will only append to UTF-8 and UTF-16 containers (UTF-32-encoded containers are very uncommon, and so are not supported).

[Note] Note

When you're working entirely within a UTF-8 encoding (on both sides of the normalization operation), the most efficient version of the normalization API is normalize_append(a, b, str), where a and b are iterators over an underlying sequence of UTF-8, and str is a container whose value_type is an integral type T with sizeof(T) == 1.

Container-Modifying Normalization API

If you need to insert text into a std::string or other STL-compatible container, you can use the erase/insert/replace API, found in normalize_algorithm.hpp:

There are iterator and range overloads of each. Each one performs the requested erasure, insertion, or replacement; keeps the result in stream-safe format; and re-normalizes the text surrounding the edit. This last step is necessary because an insertion or erasure may place code points that can combine with each other next to one another, when they were not adjacent before.

This API is like the normalize_append() overloads in that it may operate on UTF-8 or UTF-16 containers, and deduces the UTF from the size of the mutated container's value_type.
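
For example (a sketch; normalize_insert() is assumed here as one of the functions declared in normalize_algorithm.hpp, taking the normalization form, the container, the insertion position, and the code point sequence to insert):

std::string s = "cafe";                            // UTF-8, already NFC.
std::array<uint32_t, 1> const cps = {{ 0x0301 }};  // COMBINING ACUTE ACCENT

// Hypothetical call: transcodes the code point to UTF-8, splices it in at the
// end, and re-normalizes the affected region, yielding the NFC string "café".
boost::text::normalize_insert<boost::text::nf::c>(
    s, s.end(), cps.begin(), cps.end());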

