"Unicode is hard."
-- Everyone
Unicode is hard to implement; the algorithms are crazy. Even as a user of Unicode, it can be difficult to understand how one is supposed to use Unicode correctly. The text layer types do much of what is described in this section, but nicely out of view. Unless you need to use many different normalization and/or encoding forms, feel free to skip those portions of this section.
A primary design goal of the Unicode layer of Boost.Text is usability. To that end, the data model is as simple as possible.
There are three encoding forms defined by Unicode: UTF-8, UTF-16, and UTF-32. A code unit is the lowest-level datum type in your Unicode data. Examples are a char in UTF-8 and a uint32_t in UTF-32. A code point is a 32-bit integral value that represents a single Unicode value. Examples are U+0041 "A" "LATIN CAPITAL LETTER A" and U+0308 "¨" "COMBINING DIAERESIS".
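For example, here is the character "Ä" (U+00C4) viewed at both levels; this sketch uses only standard C++, no Boost.Text:

    #include <cassert>
    #include <string>

    int main()
    {
        // U+00C4 "LATIN CAPITAL LETTER A WITH DIAERESIS", encoded as UTF-8.
        std::string const utf8 = "\xc3\x84";
        // The same code point, encoded as UTF-32.
        std::u32string const utf32 = U"\u00C4";

        // One code point, but two UTF-8 code units (chars) versus a
        // single UTF-32 code unit (char32_t).
        assert(utf8.size() == 2);
        assert(utf32.size() == 1);
    }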
There are four different Unicode normalization forms. Normalization is necessary because Unicode requires that certain combinations of code points be considered identical. For instance, the two code points mentioned above, U+0041 U+0308, appear like this: "Ä", and the single code point U+00C4 appears like this: "Ä". Since these two sequences are not visually distinct, all the algorithms must treat them as the same thing. Therefore, u8"\U00000041\U00000308" == u8"\U000000C4" must evaluate to true for the purposes of Unicode. Normalization forms exist to put strings of code points into canonical forms that can be bitwise compared.
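To see why normalization is needed, compare the raw bytes; this is a minimal sketch in standard C++ (Boost.Text's normalization functions would be used to bring both strings to a common form first, but their exact spellings vary by version, so they are omitted here):

    #include <cassert>
    #include <string>

    int main()
    {
        // The two canonically equivalent spellings of "Ä", as UTF-8 bytes.
        std::string const decomposed = "\x41\xcc\x88"; // U+0041 U+0308
        std::string const composed = "\xc3\x84";       // U+00C4

        // Bitwise, they differ; only after normalizing both to the same
        // form (e.g. NFC) will a bitwise comparison report them equal.
        assert(decomposed != composed);
    }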
An extended grapheme cluster, or just grapheme, is a sequence of code points that appears to the end-user to be a single character. For example, the code points (U+0041 U+0308) form a grapheme, since they appear when rendered to be the single character "Ä".
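Here is a sketch of that example using Boost.Text's grapheme-break interface; the header and the next_grapheme_break() spelling are assumptions to check against the reference for your version:

    #include <boost/text/grapheme_break.hpp>
    #include <cassert>
    #include <cstdint>
    #include <vector>

    int main()
    {
        // U+0041 U+0308: two code points, one user-perceived character.
        std::vector<uint32_t> const cps = {0x0041, 0x0308};

        // Starting from a grapheme boundary, one step over the sequence
        // should consume both code points, since they form one grapheme.
        auto const it =
            boost::text::next_grapheme_break(cps.begin(), cps.end());
        assert(it == cps.end());
    }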
There are multiple versions of Unicode, and Boost.Text only supports one at a time. There are large volumes of data required to implement the Unicode algorithms, and adding data for N versions of Unicode would make an already large library larger by a factor of N.
Most Unicode data used in Boost.Text come straight from the published Unicode data files, but the collation data are taken from CLDR, with language-specific tailoring data taken from LDML (a part of the CLDR project).
To find out what versions of Unicode and CLDR
were used to generate Boost.Text's data, call unicode_version
or cldr_version
,
respectively.
Most of the Unicode layer algorithms are written as typical C++ standard algorithms: they take iterators as input and produce output via an out-iterator. Since ranges are the future, there are also range overloads of the algorithms, which take a single range object in place of a pair of iterators. The Unicode algorithms all operate on code points, so the iterator overloads take code_point_iter parameters, and the range overloads take code_point_range parameters. For convenience, many of the Unicode layer algorithms also have overloads that take grapheme_range and grapheme_iter parameters. This provides convenient compatibility with the text layer types, like text and rope.
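As a sketch of these calling conventions, here is word breaking invoked both ways; next_word_break() and words() follow Boost.Text's segmentation interfaces, but treat the exact signatures as assumptions and check the reference:

    #include <boost/text/word_break.hpp>
    #include <cstdint>
    #include <vector>

    int main()
    {
        std::vector<uint32_t> const cps = {
            't', 'w', 'o', ' ', 'w', 'o', 'r', 'd', 's'};

        // Iterator interface: find the break after the word at the front.
        auto const brk =
            boost::text::next_word_break(cps.begin(), cps.end());
        (void)brk;

        // Range interface: walk all the words; each element is itself a
        // code point range.
        for (auto word : boost::text::words(cps)) {
            (void)word;
        }
    }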
Many of the algorithms and other functions accept null-terminated pointers and treat them as ranges. This is done to make calling them with string literals and other null-terminated strings easier. A null-terminated string s is isomorphic with the range [s, boost::text::null_sentinel), and that is how Boost.Text's functions treat null-terminated strings internally.
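The idea, sketched with a hand-rolled sentinel standing in for boost::text::null_sentinel (whose header and exact spelling vary across versions):

    #include <cassert>

    // A stand-in for boost::text::null_sentinel: it marks the end of a
    // null-terminated string without requiring its length up front.
    struct null_sentinel_t {};

    bool operator==(char const * p, null_sentinel_t) { return *p == '\0'; }
    bool operator!=(char const * p, null_sentinel_t s) { return !(p == s); }

    int main()
    {
        char const * s = "hello";

        // [s, null_sentinel_t{}) is the range of characters before the
        // terminator; algorithms can iterate it like any other range.
        int count = 0;
        for (auto it = s; it != null_sentinel_t{}; ++it)
            ++count;
        assert(count == 5);
    }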
Many of the Unicode algorithms require that intermediate results be accumulated in side buffers at various times during their operation. Therefore, many of the algorithms in this section may allocate memory. However, Boost.Text makes extensive use of boost::container::small_vector<> for these side buffers. The end result is that though these algorithms may allocate memory, in practice they seldom, if ever, do.
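small_vector keeps a fixed number of elements inline and only touches the heap when that capacity is exceeded; for instance:

    #include <boost/container/small_vector.hpp>

    int main()
    {
        // Room for 16 elements inline; no heap allocation occurs until
        // a 17th element is pushed.
        boost::container::small_vector<int, 16> buf;
        for (int i = 0; i != 16; ++i)
            buf.push_back(i); // uses the inline storage
        buf.push_back(16);    // spills to the heap
    }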
You might notice that there are no interfaces in this layer that provide properties of code points, like whether a particular code point is space or punctuation. The reason for this is that such properties are complicated in Unicode.
Unicode defines properties like space and punctuation, but it defines them in a highly context-sensitive way; each algorithm has its own set of properties it associates with code points. For instance, the word breaking algorithm is concerned with single quotes and double quotes, and so has a property for each, but other punctuation is spread out among its other properties. It has no property that maps to something like "punctuation" in a general sense.
So if you want to know whether a code point is whitespace, you might have to look it up on a Unicode reference website (and implement a function yourself), or see whether the whitespace code points covered by one of the Unicode algorithms fit your needs and use that algorithm's *_prop() function (e.g. word_prop).
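For instance, a do-it-yourself whitespace test built on the word-breaking algorithm's properties might look like the sketch below; the word_property enumerator WSegSpace follows UAX #29's property names, but treat it (and the header) as assumptions to verify against the reference:

    #include <boost/text/word_break.hpp>
    #include <cassert>
    #include <cstdint>

    // Whitespace as the word-breaking algorithm sees it: code points
    // with the WSegSpace word property. (This does not cover every code
    // point you might consider whitespace in other contexts.)
    bool is_word_segmenting_space(uint32_t cp)
    {
        return boost::text::word_prop(cp) ==
               boost::text::word_property::WSegSpace;
    }

    int main()
    {
        assert(is_word_segmenting_space(U' '));
        assert(!is_word_segmenting_space(U'a'));
    }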