
The Text Layer

The text layer of Boost.Text consists mostly of four types: text, text_view, rope, and rope_view. These are directly analogous to std::string, std::string_view, unencoded_rope, and unencoded_rope_view, respectively. In fact, each of the text layer types has a single data member — an object of the corresponding unencoded type.

Each of these four types is a typedef of a general-purpose template. For instance, text is a specialization of basic_text that is normalized NFC, with a code unit type of char, and using std::string as the underlying storage.

The basic_ templates are basic_text, basic_text_view, basic_rope, and basic_rope_view.

Each template can be configured: to maintain one of the five supported normalization forms; with a particular underlying code unit type; and, for those that have underlying storage (basic_text and basic_rope), with the type of storage container to use. The UTF format each template uses is deduced from the code unit type. For instance, char and char8_t imply UTF-8, whereas uint16_t and char16_t imply UTF-16. These templates do not support UTF-32.
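As a rough illustration of how a UTF format could be deduced from a code unit type, here is a minimal sketch. The names utf_format and deduce_format are hypothetical, and choosing the format from sizeof is an assumption made for this example; it is not necessarily how Boost.Text does it.

```cpp
#include <cstdint>

// Hypothetical names for illustration; not Boost.Text's actual machinery.
enum class utf_format { utf8, utf16 };

// Assumption: the format is chosen purely from the size of the code unit
// type.  Four-byte code unit types are rejected, mirroring the statement
// that UTF-32 is not supported.
template<typename CodeUnit>
constexpr utf_format deduce_format()
{
    static_assert(
        sizeof(CodeUnit) == 1 || sizeof(CodeUnit) == 2,
        "only UTF-8 and UTF-16 code unit types are supported");
    return sizeof(CodeUnit) == 1 ? utf_format::utf8 : utf_format::utf16;
}

// Checked at compile time.
static_assert(deduce_format<char>() == utf_format::utf8, "");
static_assert(deduce_format<std::uint16_t>() == utf_format::utf16, "");
static_assert(deduce_format<char16_t>() == utf_format::utf16, "");
```

Instantiating deduce_format<char32_t>() in this sketch fails to compile, which is one way a template could refuse a UTF-32 code unit type.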

For the rest of the documentation, we will focus on the four specializations of these templates that Boost.Text provides: text, text_view, rope, and rope_view.

Let's consider one of these specializations in particular, text. Its underlying storage is a std::string, as mentioned before. The chars in the std::string are UTF-8 encoded. The code point view of those UTF-8 encoded chars is available by using utf_8_to_32_iterator. Since utf_8_to_32_iterator generates replacement characters when it encounters a broken encoding, no runtime checks ever need to be performed on the encoding of the chars in a text's std::string data member.
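To make the replacement-character idea concrete, here is a simplified, standalone decoder. It is not utf_8_to_32_iterator and may differ from it in how many replacement characters it produces for a given broken sequence; decode_utf8_lossy is a hypothetical name. The point it shows is that a decoder which substitutes U+FFFD for bad input never needs the stored chars to be validated up front.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper, not part of Boost.Text.  Decodes UTF-8 into code
// points.  Whenever no valid sequence starts at the current position, it
// emits U+FFFD and skips one byte.
std::vector<std::uint32_t> decode_utf8_lossy(std::string const & s)
{
    std::uint32_t const replacement = 0xFFFD;
    // Smallest code point that legitimately needs a sequence of each length.
    static constexpr std::uint32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};

    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char const b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (b0 < 0x80) { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }

        // The sequence must fit in the input and use continuation bytes.
        bool ok = len != 0 && i + len <= s.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            unsigned char const b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80)
                ok = false;
            else
                cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and out-of-range values.
        if (ok && (cp < min_for_len[len] || cp > 0x10FFFF ||
                   (0xD800 <= cp && cp <= 0xDFFF)))
            ok = false;

        if (ok) {
            out.push_back(cp);
            i += len;
        } else {
            out.push_back(replacement);
            i += 1;
        }
    }
    return out;
}

// Usage sketch: "a", then a stray 0xFF byte, then "z".
void decode_utf8_lossy_demo()
{
    auto const cps = decode_utf8_lossy("a\xFFz");
    assert(cps.size() == 3 && cps[1] == 0xFFFD);
}
```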

All the text layer types are kept normalized at all times; the views may be initialized only from known-normalized types. There is a runtime cost associated with normalizing text that is inserted into a text layer type.

text and rope are also kept in Stream-Safe Format; text_view and rope_view may be initialized only from known-stream-safe types.

The text layer types are centered around graphemes instead of chars. Since graphemes are variable-length, this means that all indexing is done using iterators instead of integral indices; there is no random access to the graphemes in the text layer types.
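The same constraint already appears one level down, with code points stored as UTF-8: because elements vary in length, finding the n-th one means walking from the start. The sketch below uses only the standard library and a hypothetical helper name, offset_of_code_point. It is an analogy for why grapheme access is iterator-based, not code from Boost.Text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper for illustration.  Returns the byte offset of the
// n-th code point (0-based) in valid UTF-8, or utf8.size() if the string
// holds fewer than n + 1 code points.  The cost is linear in the offset,
// because code points are variable-length.
std::size_t offset_of_code_point(std::string const & utf8, std::size_t n)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        // Continuation bytes (10xxxxxx) belong to the previous code point.
        bool const is_continuation =
            (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80;
        if (is_continuation)
            continue;
        if (seen == n)
            return i;
        ++seen;
    }
    return utf8.size();
}

// Usage sketch: in "séance", 'é' takes two bytes, so 'a' starts at byte 3.
void offset_of_code_point_demo()
{
    std::string const s = std::string("s\xC3\xA9") + "ance";
    assert(offset_of_code_point(s, 2) == 3);
}
```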

Graphemes

When you dereference an iterator that you get from one of the text layer types, you get a grapheme_ref. This is a non-owning range of code points.

boost::text::text t = "A bit of text.";

// grapheme_refs should be declared as values in range-based for loops, since
// they are small value types.
for (auto grapheme : t) {
    std::cout << grapheme; // grapheme_ref is directly streamable.
    // grapheme_ref is also a code point range, of course
    for (auto cp : grapheme) {
        // Do something with code point cp here....
    }
}
std::cout << "\n";

If you want an owning set of code points, you can construct a grapheme (either from a grapheme_ref, or by some other means).

// These functions don't do anything interesting; just pay attention to the
// interfaces.

// This returns a view into a temporary.  Don't do this.
boost::text::grapheme_ref<boost::text::text::iterator::iterator>
find_first_dot_bad(boost::text::text t)
{
    for (auto grapheme : t) {
        uint32_t target[1] = {'.'};
        // Using the one from Boost.Algorithm, so we get the 4-param version
        // even if we're not using a C++14 compiler.
        if (boost::algorithm::equal(
                grapheme.begin(), grapheme.end(), target, target + 1)) {
            // This line compiles without warnings.  The compiler doesn't know
            // how to tell me I'm returning a dangling reference, because it's
            // a view, not a builtin reference.
            return grapheme;
        }
    }
    return boost::text::grapheme_ref<boost::text::text::iterator::iterator>();
}

// This returns a grapheme that owns its code point storage, so it cannot
// dangle.  Do this.
boost::text::grapheme find_first_dot_good(boost::text::text t)
{
    for (auto grapheme : t) {
        uint32_t target[1] = {'.'};
        if (boost::algorithm::equal(
                grapheme.begin(), grapheme.end(), target, target + 1)) {
            return boost::text::grapheme(grapheme);
        }
    }
    return boost::text::grapheme();
}

[Note] Note

grapheme has a small-buffer optimization so that using it seldom involves allocations. This also implies that an array of graphemes is likely to be wasteful of storage. If you want a bunch of graphemes, you should probably be using a text.

An individual grapheme, in the form of a grapheme or grapheme_ref, can be inserted into a text or rope.

[Note] Note

The grapheme is the unit of work within the text layer. When using text layer types, you should always use the grapheme_range overloads of the Unicode algorithms.

Grapheme Iterators

Since the grapheme is the unit of work within the text layer, it's natural that the text layer types return grapheme_iterators from begin() and end(). But what about the times when you actually want to deal with sequences of code points or chars? This comes up often when you need to use legacy interfaces.

Fortunately, that's really easy. grapheme_iterator has a base() member that returns its underlying code point iterator. The code point iterator used in the text layer types is always utf_8_to_32_iterator, which also has a base() member that returns its underlying char iterator:

// t is a GraphemeRange.
boost::text::text t = "This is a short sentence.";

// This is a code point range that contains all the same code points as t.
boost::text::utf32_view<boost::text::utf_8_to_32_iterator<char const *>> cps(
    t.begin().base(), t.end().base());

// This is a char range that contains all the same code units as t, though it
// is not null-terminated like t's underlying storage is.
std::vector<char> chars(t.begin().base().base(), t.end().base().base());

Since text's iterators are all declared in terms of char * or char const *, you have convenient access to the underlying null-terminated sequence of characters:

boost::text::text t = "This is a short sentence.";

assert(strlen(t.begin().base().base()) == 25);

t = "This is a short séance."; // é occupies two UTF-8 code units.
assert(strlen(t.begin().base().base()) == 24);

A Shortcoming

There is a problem with basing all your Unicode operations on graphemes. Graphemes do not always line up with the text segmentation algorithms. Specifically, lines are not guaranteed to end exactly on a grapheme break (the other text breaking algorithms all happen to end on grapheme breaks).

There are two things to note here:

  1. The cases where a line does not end on a grapheme break are obscure corner cases that do not often come up in practice.
  2. Graphemes are context-sensitive. If you break a line in the middle of a grapheme G, G's code points before the line break form their own grapheme, and the rest of G's code points form another grapheme. In other words, breaking graphemes is pretty benign; it's not like breaking encoding or normalization.

The Text Layer Types

As mentioned before, this layer has four types (text, text_view, rope, and rope_view) that are directly analogous to the four major unencoded types that they use for storage (std::string, std::string_view, unencoded_rope, and unencoded_rope_view). The high-level semantics of these underlying types is the same:

Many other high-level semantics apply as well, such as the size-type used for each. How about the differences?

An Odd Thing About Erase/Insert/Replace

When you insert into a text or rope, you get back a view that points to the inserted text. This is not what the standard C++ containers do.

Imagine that insert() returned a single iterator, as the standard containers do. So, I would get an iterator that points to the start of the insertion:

boost::text::rope r("e");
auto it = r.insert(r.end(), 'f');
assert(r.distance() == 2);           // We added a grapheme.
assert(it == std::next(r.begin()));  // This one, in fact.

Not so fast. If I insert a sequence starting with one or more combining code points, the beginning of the insertion might get combined with the preceding code point (due to normalization) or grapheme (because combining marks at the end of a grapheme are just part of that grapheme). Just as importantly, the same thing can happen at the end of the insertion, too.

For these reasons, text returns a view that indicates the full extent of its graphemes that were changed by erase/insert/replace.

char const * combining_diaeresis = u8"\u0308";
boost::text::text t("e");

auto v = t.insert(t.end(), combining_diaeresis);
assert(t.distance() == 1);    // No new grapheme!  We now have the single code point 'ë'.
assert(v.begin() == t.begin());

v = t.insert(t.end(), combining_diaeresis);
assert(t.distance() == 1);    // Still no new grapheme.  We now have a grapheme with the code points 'ë' and '¨'.
assert(v.begin() == t.begin());

This behavior is correct, but a bit surprising when you first see it.
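To see why the grapheme count does not change, consider a drastically simplified model of segmentation in which only code points from the Combining Diacritical Marks block (U+0300 to U+036F) attach to the preceding grapheme. This is an assumption made for illustration; real grapheme breaking follows many more rules, and approx_grapheme_count is a hypothetical helper, not a Boost.Text function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified stand-in for grapheme segmentation: only the Combining
// Diacritical Marks block (U+0300..U+036F) extends the previous grapheme.
bool is_combining_mark(char32_t cp)
{
    return 0x0300 <= cp && cp <= 0x036F;
}

// Counts graphemes under the simplified rule above.
std::size_t approx_grapheme_count(std::u32string const & cps)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < cps.size(); ++i) {
        // A mark with nothing before it still forms a grapheme by itself.
        if (i == 0 || !is_combining_mark(cps[i]))
            ++count;
    }
    return count;
}

// Usage sketch: appending U+0308 to "e" adds a code point, not a grapheme.
void approx_grapheme_count_demo()
{
    std::u32string cps = U"e";
    cps += U'\u0308';
    assert(cps.size() == 2);
    assert(approx_grapheme_count(cps) == 1);
}
```

Under this model, appending a letter such as 'f' does start a new grapheme, while appending a combining mark only grows the last one. That is the difference between the rope and text snippets above.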

Since you may want to use text with some standard facilities that expect erase/insert/replace to return a single iterator, the view returned is implicitly convertible to a single iterator; the view's begin() is used for this.
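The implicit conversion can be pictured with a small stand-in type. The sketch below uses std::vector<int> in place of a text, and the names changed_range and insert_range are hypothetical. It only illustrates the design of returning a view that also converts to its begin() iterator; it does not reproduce the type Boost.Text returns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical view type for illustration only.
template<typename Iter>
struct changed_range
{
    Iter first;
    Iter last;

    Iter begin() const { return first; }
    Iter end() const { return last; }

    // Implicit conversion to a single iterator yields begin(), so code
    // written against the standard containers' interface still compiles.
    operator Iter() const { return first; }
};

// Inserts values before pos and reports the full extent of the insertion.
changed_range<std::vector<int>::iterator> insert_range(
    std::vector<int> & v,
    std::vector<int>::iterator pos,
    std::vector<int> const & values)
{
    // Remember the position as an offset; insert() may reallocate.
    auto const offset = pos - v.begin();
    v.insert(pos, values.begin(), values.end());
    auto const first = v.begin() + offset;
    return {first, first + static_cast<std::ptrdiff_t>(values.size())};
}

// Usage sketch: the result works as a range and as a single iterator.
void insert_range_demo()
{
    std::vector<int> v{1, 4};
    auto const r = insert_range(v, v.begin() + 1, {2, 3});
    std::vector<int>::iterator it = r; // Uses the implicit conversion.
    assert(*it == 2);
    assert(r.end() - r.begin() == 2);
}
```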

Picking the Right Text Type

Table 1.7. Picking the Right Text Type

| If I want ... | ... my text type is: |
|---|---|
| a reference to some text that will outlive the reference, without allocating | text_view |
| mutable text with efficient mutation at the end of the string | text |
| mutable text with efficient mutation at any point in the string | rope |
| text with contiguous storage | text_view or text |
| a null-terminated underlying string | text |
| mutable text the size of a single pointer | rope |
| thread-safe text | rope |
| text with the small-object optimization | text |
| text with copy-on-write semantics | rope |
| function parameters that may bind to text_views or texts | text_view |
| function parameters that may bind to text_views, texts, ropes, or rope_views | rope_view |


