
The Unicode Layer

Encoding and Normalization
Text Segmentation
Case Mapping
Collation
Searching
Bidirectional Text
String Algorithms and Utilities

"Unicode is hard."

-- Everyone

Unicode is hard to implement; the algorithms are crazy. Even as a user of Unicode, you may find it difficult to understand how you are supposed to use it correctly. The text layer types do much of what is described in this section, but keep it nicely out of view. Unless you need to work with multiple normalization and/or encoding forms, feel free to skip the portions of this section that cover them.

A primary design goal of the Unicode layer of Boost.Text is usability. To that end, the data model is as simple as possible.

A Quick Unicode Primer

Unicode defines three encoding forms: UTF-8, UTF-16, and UTF-32. A code unit is the lowest-level unit of storage in your Unicode data. Examples are a char in UTF-8 and a uint32_t in UTF-32. A code point is a 32-bit integral value that represents a single Unicode value. Examples are U+0041 "A" "LATIN CAPITAL LETTER A" and U+0308 "¨" "COMBINING DIAERESIS".
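To make the distinction concrete, here is a small standard C++ program (no Boost.Text required) showing that the same code point occupies a different number of code units in each encoding form:

    #include <cstdio>

    int main()
    {
        // One code point, U+00C4 "Ä", in each of the three encoding forms.
        char32_t const utf32[] = U"\u00C4"; // 1 32-bit code unit
        char16_t const utf16[] = u"\u00C4"; // 1 16-bit code unit
        char const utf8[] = "\xC3\x84";     // 2 8-bit code units

        std::printf("UTF-32: %zu code unit(s)\n", sizeof(utf32) / sizeof(char32_t) - 1);
        std::printf("UTF-16: %zu code unit(s)\n", sizeof(utf16) / sizeof(char16_t) - 1);
        std::printf("UTF-8:  %zu code unit(s)\n", sizeof(utf8) - 1);
    }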

There are four Unicode normalization forms: NFC, NFD, NFKC, and NFKD. Normalization is necessary because Unicode requires that certain combinations of code points be considered identical. For instance, the two code points mentioned above, U+0041 U+0308, appear like this: "Ä", and the single code point U+00C4 appears like this: "Ä". Since these two sequences are visually indistinguishable, all the algorithms must treat them as the same thing. Therefore, u8"\U00000041\U00000308" == u8"\U000000C4" must evaluate to true for the purposes of Unicode. Normalization forms exist to put sequences of code points into canonical forms that can be compared bitwise.
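The following standard C++ snippet (no Boost.Text calls; the byte values are simply the UTF-8 encodings of the two sequences above) demonstrates why normalization is needed: the two canonically equivalent sequences are not bitwise equal.

    #include <cstdio>
    #include <cstring>

    int main()
    {
        char const decomposed[] = "\x41\xCC\x88"; // U+0041 U+0308 in UTF-8
        char const precomposed[] = "\xC3\x84";    // U+00C4 in UTF-8

        // Canonically equivalent, but not bitwise equal:
        std::printf("bitwise equal: %s\n",
            std::strcmp(decomposed, precomposed) == 0 ? "yes" : "no"); // prints "no"

        // Normalizing both sequences to the same form (e.g. NFC) makes them
        // bitwise equal; that is exactly what normalization is for.
    }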

An extended grapheme cluster, or just grapheme, is a sequence of code points that appears to the end user to be a single character. For example, the code points (U+0041 U+0308) form a grapheme, since when rendered they appear as the single character "Ä".
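As a hedged sketch of the same example in code: the text layer's text type iterates over graphemes, so the two-code-point sequence above counts as a single element. The construction and iteration details shown here are assumptions about the text interface; check them against your version of Boost.Text.

    #include <boost/text/text.hpp>
    #include <iostream>
    #include <iterator>

    int main()
    {
        // Decomposed "Ä": two code points (U+0041 U+0308), one grapheme.
        boost::text::text t = "\x41\xCC\x88";

        // text's iterators are grapheme iterators, so this counts graphemes.
        std::cout << std::distance(t.begin(), t.end()) << "\n"; // expect 1
    }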

Unicode Versions

There are multiple versions of Unicode, and Boost.Text only supports one at a time. There are large volumes of data required to implement the Unicode algorithms, and adding data for N versions of Unicode would make an already large library larger by a factor of N.

Most Unicode data used in Boost.Text come straight from the published Unicode data files, but the collation data are taken from CLDR, with language-specific tailoring data taken from LDML (a part of the CLDR project).

To find out what versions of Unicode and CLDR were used to generate Boost.Text's data, call unicode_version or cldr_version, respectively.

Unicode Layer Parameter Conventions

Most of the Unicode layer algorithms are written as typical C++ standard algorithms: they take iterators as input and produce output via an out-iterator. Since ranges are the future, range overloads are also provided for the algorithms that take iterator pairs. The Unicode algorithms all operate on code points, so the iterator overloads take code_point_iter parameters, and the range overloads take code_point_range parameters. For convenience, many of the Unicode layer algorithms also have overloads that take grapheme_range and grapheme_iter parameters; these make the algorithms easy to use with the text layer types, like text and rope.
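The shape of such an overload set looks roughly like the following sketch. some_algo is a made-up name, not a real Boost.Text algorithm, and the constraints that distinguish a code point range from a grapheme range are elided:

    // Hypothetical overload set illustrating the conventions above.
    template<typename CPIter, typename Sentinel, typename OutIter>
    OutIter some_algo(CPIter first, Sentinel last, OutIter out);
    // ^ CPIter must model code_point_iter.

    template<typename CPRange, typename OutIter>
    OutIter some_algo(CPRange const & range, OutIter out);
    // ^ CPRange must model code_point_range; a grapheme_range overload has
    //   the same shape, constrained differently.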

Many of the algorithms and other functions accept null-terminated pointers and treat them as ranges. This makes it easier to call them with string literals and other null-terminated strings. A null-terminated string s is isomorphic to the range [s, boost::text::null_sentinel), and that is how Boost.Text's functions treat null-terminated strings internally.
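The mechanism is easy to picture with a minimal, self-contained sketch (this is illustrative code, not Boost.Text's actual null_sentinel implementation):

    #include <cstdio>

    struct null_sentinel_t {};

    // A char pointer compares equal to the sentinel when it reaches '\0'.
    bool operator==(char const * p, null_sentinel_t) { return *p == '\0'; }
    bool operator!=(char const * p, null_sentinel_t s) { return !(p == s); }

    int main()
    {
        char const * s = "hello";
        int count = 0;
        for (char const * it = s; it != null_sentinel_t{}; ++it)
            ++count; // traverses [s, null_sentinel) with no up-front strlen()
        std::printf("%d\n", count); // prints 5
    }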

Memory Allocations

Many of the Unicode algorithms accumulate intermediate results in side buffers during their operation, so many of the algorithms in this section may allocate memory. However, Boost.Text makes extensive use of boost::container::small_vector<> for these side buffers. The end result is that although these algorithms may allocate memory, in practice they seldom, if ever, do.
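To see why allocation is rare in practice, consider how small_vector behaves; the element type and buffer size below are arbitrary choices for illustration, not Boost.Text's actual parameters:

    #include <boost/container/small_vector.hpp>

    int main()
    {
        // Up to 64 elements live inside the object itself; the heap is
        // touched only if the buffer grows past that capacity.
        boost::container::small_vector<char32_t, 64> side_buffer;
        for (char32_t cp = U'a'; cp <= U'z'; ++cp)
            side_buffer.push_back(cp); // 26 elements: no allocation
    }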

What About Code Point Properties?

You might notice that there are no interfaces in this layer that provide properties of code points, such as whether a particular code point is whitespace or punctuation. The reason is that such properties are complicated in Unicode.

Unicode defines properties like space and punctuation, but it defines them in a highly context-sensitive way; each algorithm has its own set of properties it associates with code points. For instance, the word breaking algorithm is concerned with single quotes and double quotes, and so has a property for each, but other punctuation is spread out among its other properties. It has no property that maps to something like "punctuation" in a general sense.

So if you want to know whether a code point is whitespace, you may have to look it up on a Unicode reference website (and implement a function yourself), or check whether the whitespace code points covered by one of the Unicode algorithms fit your needs and use that algorithm's *_prop() function (e.g. word_prop), as sketched below.
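Here is a hedged sketch of such a user-written whitespace check built on word_prop(). The header name and the word_property enumerator spellings below mirror the UAX #29 word break properties (CR, LF, Newline, WSegSpace) and are assumptions; check them against your version of Boost.Text:

    #include <boost/text/word_break.hpp>

    // Whitespace as the word break algorithm sees it; other algorithms
    // partition the space and punctuation code points differently.
    bool is_whitespace(char32_t cp)
    {
        auto const prop = boost::text::word_prop(cp);
        return prop == boost::text::word_property::CR ||
               prop == boost::text::word_property::LF ||
               prop == boost::text::word_property::Newline ||
               prop == boost::text::word_property::WSegSpace;
    }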

