An Overview of Boost.Text

Boost.Text is composed of two main layers:

The Unicode layer
The text layer

There are also some assorted bits that were necessary or useful to have around when implementing various parts of Boost.Text: segmented_vector, unencoded_rope, unencoded_rope_view, and trie/trie_map/trie_set.

The Unicode Layer

The Unicode layer provides a few Unicode-related utility types, but consists primarily of the Unicode algorithms. These algorithms are written in the style of the standard algorithms, with range-friendly interfaces. For each of the Unicode algorithms there is a corresponding view. There are algorithms for these Unicode operations:

Transcoding among UTF-8, UTF-16, and UTF-32
Normalization, including the four Unicode normalization forms, plus the FCC form from Unicode Technical Note #5
Text segmentation (line breaks, word breaks, etc.)
Case mapping (to_upper(), is_lower(), etc.)
Collation, including tailoring using the LDML syntax and serialization of collation tables
Collation-aware searching, including caseless searching
The Unicode Bidirectional Algorithm, for laying out text that includes both left-to-right and right-to-left text

These algorithms are independent of the text layer; it is possible to use Boost.Text as a Unicode library without using the text layer at all.
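To make "in the style of the standard algorithms" concrete, here is a minimal sketch of a UTF-8 to UTF-32 transcoder shaped like std::copy: it takes an iterator pair and an output iterator, and returns the advanced output iterator. The names utf8_to_utf32_sketch and to_utf32_sketch are hypothetical and are not Boost.Text's API. The sketch replaces malformed or truncated sequences with U+FFFD, and it does not reject overlong encodings or surrogate code points, which a conforming transcoder would have to do.

```cpp
#include <iterator>
#include <string>

// Hypothetical helper, NOT part of Boost.Text. Decodes UTF-8 from
// [first, last) and writes one char32_t per decoded sequence to `out`.
// Like std::copy, it returns the output iterator past the last write.
template<typename InputIt, typename OutputIt>
OutputIt utf8_to_utf32_sketch(InputIt first, InputIt last, OutputIt out)
{
    char32_t const replacement = 0xFFFD;
    while (first != last) {
        unsigned char const lead = static_cast<unsigned char>(*first++);
        char32_t cp = 0;
        int remaining = 0; // continuation bytes still expected

        if (lead < 0x80) {
            cp = lead; // single-byte (ASCII) sequence
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F;
            remaining = 1;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F;
            remaining = 2;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07;
            remaining = 3;
        } else {
            cp = replacement; // stray continuation byte or invalid lead
        }

        // Fold in the continuation bytes, six payload bits at a time.
        for (; remaining > 0 && first != last; --remaining) {
            unsigned char const c = static_cast<unsigned char>(*first);
            if ((c & 0xC0) != 0x80)
                break; // not a continuation byte; leave it for the next pass
            cp = (cp << 6) | (c & 0x3F);
            ++first;
        }
        if (remaining > 0)
            cp = replacement; // sequence was malformed or cut short

        *out++ = cp;
    }
    return out;
}

// Convenience wrapper (also hypothetical) for whole strings.
inline std::u32string to_utf32_sketch(std::string const & utf8)
{
    std::u32string result;
    utf8_to_utf32_sketch(utf8.begin(), utf8.end(), std::back_inserter(result));
    return result;
}
```

Because the function is written against iterators rather than a particular container, the same call works with a std::string, a std::vector<char>, or a plain array as input, and with any output iterator as the destination.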

The Text Layer

The text layer is built on top of the Unicode layer. Its types encode text as UTF-8, and maintain normalization. Much of their implementation is done in terms of the algorithms from the Unicode layer. The types in this layer are: text, text_view, rope, and rope_view. This layer contains templates that can be instantiated with different UTF formats, normalization forms, and/or underlying storage.

The Assorted Bits

Finally, there are some items that I wrote in the process of implementing everything else that rise to the level of general utility.

First is segmented_vector. This is a discontiguous sequence of T, for which insertions anywhere in the sequence are cheap, with very cheap copies provided via a copy-on-write mechanism. It is a generalization of unencoded_rope for arbitrary T.
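The idea of a discontiguous, copy-on-write sequence can be sketched in a few lines. The class below, cow_segments_sketch, is a hypothetical illustration and is not how segmented_vector is implemented or what its interface looks like. It assumes a flat list of immutable, reference-counted segments, so locating a position is linear in the number of segments; a real implementation would presumably use a tree to keep that cheap. What the sketch does show is why copies are cheap (only pointers are copied) and why an insertion into one copy never disturbs another (a segment is split into new segments rather than modified).

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical illustration only; NOT Boost.Text's segmented_vector.
template<typename T>
class cow_segments_sketch
{
    // Segments are immutable once created, so they can be shared freely.
    using segment_ptr = std::shared_ptr<std::vector<T> const>;

public:
    cow_segments_sketch() = default;

    explicit cow_segments_sketch(std::vector<T> init) : size_(init.size())
    {
        if (!init.empty())
            segments_.push_back(
                std::make_shared<std::vector<T>>(std::move(init)));
    }

    std::size_t size() const { return size_; }

    T const & operator[](std::size_t i) const
    {
        for (auto const & seg : segments_) {
            if (i < seg->size())
                return (*seg)[i];
            i -= seg->size();
        }
        throw std::out_of_range("cow_segments_sketch: index out of range");
    }

    // Inserts `value` before position `pos`. No existing segment is
    // modified, so copies that share segments are unaffected.
    void insert(std::size_t pos, T value)
    {
        if (pos > size_)
            throw std::out_of_range("cow_segments_sketch: bad position");

        segment_ptr fresh = std::make_shared<std::vector<T>>(
            std::size_t{1}, std::move(value));

        // Find the segment containing `pos`; `pos` becomes segment-local.
        std::size_t i = 0;
        while (i < segments_.size() && pos >= segments_[i]->size()) {
            pos -= segments_[i]->size();
            ++i;
        }
        auto const offset = static_cast<std::ptrdiff_t>(i);

        if (i == segments_.size() || pos == 0) {
            // On a segment boundary: just add a pointer to the new segment.
            segments_.insert(segments_.begin() + offset, fresh);
        } else {
            // Inside a segment: replace it with left half, new, right half.
            segment_ptr const old = segments_[i];
            auto const mid = old->begin() + static_cast<std::ptrdiff_t>(pos);
            segment_ptr left =
                std::make_shared<std::vector<T>>(old->begin(), mid);
            segment_ptr right =
                std::make_shared<std::vector<T>>(mid, old->end());
            segments_[i] = left;
            segments_.insert(segments_.begin() + offset + 1, {fresh, right});
        }
        ++size_;
    }

private:
    std::vector<segment_ptr> segments_;
    std::size_t size_ = 0;
};
```

Copying a cow_segments_sketch copies only the vector of shared pointers, not the elements. After the copy, inserting into either object leaves the other one seeing the original sequence.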

The remaining assorted types are trie, trie_map, and trie_set. The first of these is a trie that is not a valid C++ container. The latter two are analogous to std::map and std::set, respectively, just built on a trie instead of a binary tree.
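For readers unfamiliar with tries, the following is a minimal sketch of a map built on one. The class trie_map_sketch and its member names are hypothetical; this is not Boost.Text's trie_map interface, and it is restricted to string keys for brevity. Each node holds an optional value and its children keyed by the next character, so a lookup walks one node per key element. The longest_match member illustrates a prefix query that falls out naturally from this layout.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <optional>
#include <string_view>
#include <utility>

// Hypothetical illustration only; NOT Boost.Text's trie_map.
template<typename Value>
class trie_map_sketch
{
    struct node
    {
        std::optional<Value> value; // engaged only if a key ends here
        std::map<char, std::unique_ptr<node>> children;
    };

public:
    std::size_t size() const { return size_; }

    // Like std::map::insert, does nothing if the key is already present.
    // Returns true if the key was newly inserted.
    bool insert(std::string_view key, Value value)
    {
        node * n = &root_;
        for (char const c : key) {
            auto & child = n->children[c];
            if (!child)
                child = std::make_unique<node>();
            n = child.get();
        }
        if (n->value)
            return false;
        n->value = std::move(value);
        ++size_;
        return true;
    }

    // Returns a pointer to the mapped value, or nullptr if `key` is absent.
    Value const * find(std::string_view key) const
    {
        node const * n = &root_;
        for (char const c : key) {
            auto const it = n->children.find(c);
            if (it == n->children.end())
                return nullptr;
            n = it->second.get();
        }
        return n->value ? &*n->value : nullptr;
    }

    // Returns the length of the longest non-empty stored key that is a
    // prefix of `text`, or 0 if there is none.
    std::size_t longest_match(std::string_view text) const
    {
        node const * n = &root_;
        std::size_t best = 0;
        for (std::size_t i = 0; i < text.size(); ++i) {
            auto const it = n->children.find(text[i]);
            if (it == n->children.end())
                break;
            n = it->second.get();
            if (n->value)
                best = i + 1;
        }
        return best;
    }

private:
    node root_;
    std::size_t size_ = 0;
};
```

With the keys "car" and "cart" inserted, find("ca") reports no match because no key ends at that node, while longest_match("carton") reports a match of length 4.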