The text layer of Boost.Text consists mostly of four types: text, text_view, rope, and rope_view. These are directly analogous to std::string, std::string_view, unencoded_rope, and unencoded_rope_view, respectively. In fact, each of the text layer types has a single data member: an object of the corresponding unencoded type.
Each of these four types is a typedef of a general-purpose template. For instance, text is a specialization of basic_text that is normalized to NFC, has a code unit type of char, and uses std::string as its underlying storage.
The basic_ templates are:
Each template can be configured: to maintain one of the five supported normalization forms; with a particular underlying code unit type; and, for those that have underlying storage (basic_text and basic_rope), with the type of storage container to use. The UTF format each template uses is deduced from the code unit type. For instance, char and char8_t imply UTF-8, whereas uint16_t and char16_t imply UTF-16. These templates do not support UTF-32.
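As a rough illustration of that deduction, the following sketch maps a code unit type to a UTF format purely by its size. The names utf_format and deduce_utf_format are hypothetical and are not part of Boost.Text; how the library itself performs the deduction is not shown here.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical enumeration of the formats this sketch can deduce.
enum class utf_format { utf8, utf16 };

// Hypothetical helper: deduces the UTF format from the size of the code unit
// type. One-byte code units imply UTF-8; two-byte code units imply UTF-16.
template<typename CodeUnit>
constexpr utf_format deduce_utf_format()
{
    static_assert(
        std::is_integral<CodeUnit>::value,
        "code units must be integral types");
    static_assert(
        sizeof(CodeUnit) == 1 || sizeof(CodeUnit) == 2,
        "only UTF-8 and UTF-16 code unit types are supported in this sketch");
    return sizeof(CodeUnit) == 1 ? utf_format::utf8 : utf_format::utf16;
}
```

With this sketch, deduce_utf_format<char>() yields utf_format::utf8 and deduce_utf_format<char16_t>() yields utf_format::utf16, while a four-byte code unit type such as char32_t fails to compile, which mirrors the lack of UTF-32 support.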
For the rest of the documentation, we will focus on the four specializations of these templates that Boost.Text provides: text, text_view, rope, and rope_view.
Let's consider one of these specializations in particular, text. Its underlying storage is a std::string, as mentioned before. The chars in the std::string are UTF-8 encoded. The code point view of those UTF-8 encoded chars is available by using utf_8_to_32_iterator. Since utf_8_to_32_iterator generates replacement characters when it encounters a broken encoding, no runtime checks ever need to be performed on the encoding of the chars in a text's std::string data member.
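To make the replacement-character behavior concrete, here is a simplified, standard-library-only sketch of lossy UTF-8 decoding. It is not the implementation of utf_8_to_32_iterator, and the names decoded, decode_one, and replacement_character are hypothetical. In particular, the number of replacement characters produced for a given broken sequence may differ from what Boost.Text produces.

```cpp
#include <cstddef>
#include <string_view>

// Result of decoding one code point: its value, and the index of the next
// unread code unit.
struct decoded
{
    char32_t code_point;
    std::size_t next;
};

constexpr char32_t replacement_character = 0xFFFD;

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
constexpr bool is_continuation(unsigned char c) { return (c & 0xC0) == 0x80; }

// Decodes the code point starting at s[i]. Requires i < s.size().
// Any malformed sequence yields U+FFFD instead of an error.
constexpr decoded decode_one(std::string_view s, std::size_t i)
{
    unsigned char const lead = static_cast<unsigned char>(s[i]);
    if (lead < 0x80)
        return {lead, i + 1}; // ASCII

    std::size_t length = 0;
    char32_t cp = 0;
    char32_t min_value = 0; // smallest value that needs this many code units
    if ((lead & 0xE0) == 0xC0) {
        length = 2; cp = static_cast<char32_t>(lead & 0x1F); min_value = 0x80;
    } else if ((lead & 0xF0) == 0xE0) {
        length = 3; cp = static_cast<char32_t>(lead & 0x0F); min_value = 0x800;
    } else if ((lead & 0xF8) == 0xF0) {
        length = 4; cp = static_cast<char32_t>(lead & 0x07); min_value = 0x10000;
    } else {
        // Stray continuation byte or invalid lead byte.
        return {replacement_character, i + 1};
    }

    for (std::size_t k = 1; k < length; ++k) {
        if (i + k >= s.size() ||
            !is_continuation(static_cast<unsigned char>(s[i + k]))) {
            // Truncated sequence; resume decoding at the offending code unit.
            return {replacement_character, i + k};
        }
        cp = static_cast<char32_t>(
            (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F));
    }

    // Reject overlong encodings, surrogates, and out-of-range values.
    if (cp < min_value || cp > 0x10FFFF || (0xD800 <= cp && cp <= 0xDFFF))
        return {replacement_character, i + length};
    return {cp, i + length};
}
```

Because every possible byte sequence decodes to some sequence of code points, a consumer of a decoder like this never has to validate its input first; malformed input simply shows up as U+FFFD.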
All the text layer types are kept normalized at all times; the views may be initialized only from known-normalized types. There is a runtime cost associated with normalizing text that is inserted into a text layer type.
text and rope are also kept in Stream-Safe Format; text_view and rope_view may be initialized only from known-stream-safe types.
The text layer types are centered around graphemes instead of chars. Since graphemes are variable-length, this means that all indexing is done using iterators instead of integral indices; there is no random access to the graphemes in the text layer types.
When you dereference an iterator that you get from one of the text layer types, you get a grapheme_ref. This is a non-owning range of code points.

    boost::text::text t = "A bit of text.";

    // grapheme_refs should be declared as values in range-based for loops, since
    // they are small value types.
    for (auto grapheme : t) {
        std::cout << grapheme; // grapheme_ref is directly streamable.
        // grapheme_ref is also a code point range, of course.
        for (auto cp : grapheme) {
            // Do something with code point cp here....
        }
    }
    std::cout << "\n";

If you want an owning set of code points, you can construct a grapheme (either from a grapheme_ref, or by some other means).

    // These functions don't do anything interesting; just pay attention to the
    // interfaces.

    // This returns a view into a temporary.  Don't do this.
    boost::text::grapheme_ref<boost::text::text::iterator::iterator>
    find_first_dot_bad(boost::text::text t)
    {
        for (auto grapheme : t) {
            uint32_t target[1] = {'.'};
            // Using the one from Boost.Algorithm, so we get the 4-param version
            // even if we're not using a C++14 compiler.
            if (boost::algorithm::equal(
                    grapheme.begin(), grapheme.end(), target, target + 1)) {
                // This line compiles without warnings.  The compiler doesn't know
                // how to tell me I'm returning a dangling reference, because it's
                // a view, not a builtin reference.
                return grapheme;
            }
        }
        return boost::text::grapheme_ref<boost::text::text::iterator::iterator>();
    }

    // This returns a grapheme that owns its code point storage, so it cannot
    // dangle.  Do this.
    boost::text::grapheme find_first_dot_good(boost::text::text t)
    {
        for (auto grapheme : t) {
            uint32_t target[1] = {'.'};
            if (boost::algorithm::equal(
                    grapheme.begin(), grapheme.end(), target, target + 1)) {
                return boost::text::grapheme(grapheme);
            }
        }
        return boost::text::grapheme();
    }

An individual grapheme, in the form of a grapheme or grapheme_ref, can be inserted into a text or rope.
Note | |
---|---|
The grapheme is the unit of work within the text layer. When using text layer
types, you should always use the |
Since the grapheme is the unit of work within the text layer, it's natural that the text layer types return grapheme_iterators from begin() and end().
But what about the times when you actually want to deal with sequences of code points or chars? This comes up often when you need to use legacy interfaces.
Fortunately, that's really easy. grapheme_iterator has a base() member that returns its underlying code point iterator. The code point iterator used in the text layer types is always utf_8_to_32_iterator, which also has a base() member that returns its underlying char iterator:

    // t is a GraphemeRange.
    boost::text::text t = "This is a short sentence.";

    // This is a code point range that contains all the same code points as t.
    boost::text::utf32_view<boost::text::utf_8_to_32_iterator<char const *>> cps(
        t.begin().base(), t.end().base());

    // This is a char range that contains all the same code units as t, though it
    // is not null-terminated like t's underlying storage is.
    std::vector<char> chars(t.begin().base().base(), t.end().base().base());

Since text's iterators are all declared in terms of char * or char const *, you have convenient access to the underlying null-terminated sequence of characters:

    boost::text::text t = "This is a short sentence.";
    assert(strlen(t.begin().base().base()) == 25);

    t = "This is a short séance.";
    // é occupies two UTF-8 code units.
    assert(strlen(t.begin().base().base()) == 24);

There is a problem with basing all your Unicode operations on graphemes. Graphemes do not always line up with the text segmentation algorithms. Specifically, lines are not guaranteed to end exactly on a grapheme break (the other text breaking algorithms all happen to end on grapheme breaks).
There are two things to note here:
If a line break falls in the middle of some grapheme G, G's code points before the line break form their own grapheme, and the rest of G's code points form another grapheme. In other words, breaking graphemes is pretty benign; it's not like breaking encoding or normalization.
As mentioned before, this layer has four types (text, text_view, rope, and rope_view) that are directly analogous to the four major unencoded types that they use for storage (std::string, std::string_view, unencoded_rope, and unencoded_rope_view).
Each text layer type shares the high-level semantics of its underlying type:
std::string and text are owning and contiguous.
std::string_view and text_view are non-owning and contiguous.
unencoded_rope and rope are owning and discontiguous.
unencoded_rope_view and rope_view are non-owning and discontiguous.
Many other high-level semantics apply as well, such as the size-type used for each. How about the differences?
chars.
The text layer types do not provide the relational operators (<, <=, >, and >=). Providing them would lead to implicit ordering based on the type's binary representation, and not on some collation. This is wrong in some, but not all, contexts, making it a bad candidate for implicit semantics.
The text layer types do provide the equality operators (== and !=), because they are fundamental.
The text layer types have no size() members. Counting graphemes is an O(N) operation, and so size() would be a very misleading name for it. There is instead a distance() function that returns the number of graphemes in O(N) time. There is an O(1) member storage_code_points() that returns the size (in chars) of the underlying storage.
The text layer types have no max_size() either. There is instead a max_code_points().
text has a capacity_code_points() instead of a capacity() member.
text and rope). extract() moves the underlying container out, and an overload of replace() moves it in. replace() requires that the moved-in container is normalized and in Stream-Safe Format.
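Two of the differences above, the missing relational operators and the missing size(), can be motivated using only standard-library types. The sketch below is an illustration, not Boost.Text code: code_unit_less and count_code_points are hypothetical helpers, and the second one counts code points rather than graphemes (grapheme segmentation needs Unicode data tables). It does, however, show the same kind of linear scan that makes a grapheme count O(N).

```cpp
#include <cstddef>
#include <string_view>

// Orders two strings by their raw code units, which is the ordering that
// std::string's operator< provides. No collation is involved.
constexpr bool code_unit_less(std::string_view a, std::string_view b)
{
    return a < b;
}

// Counts the code points in a UTF-8 string by skipping continuation bytes
// (bit pattern 10xxxxxx). Assumes well-formed UTF-8. Every code unit must be
// visited, so this is O(N) in the size of the storage.
constexpr std::size_t count_code_points(std::string_view utf8)
{
    std::size_t n = 0;
    for (char c : utf8) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

Here code_unit_less("Z", "a") is true, because 'Z' (0x5A) precedes 'a' (0x61) as a code unit, even though many collations sort "a" before "Z". Likewise, the UTF-8 string "s\xC3\xA9" "ance" (séance) occupies 7 code units, which is known in O(1) from the storage, while its 6 code points can only be found by scanning.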
When you insert into a text or rope, you get back a view that points to the inserted text. This is not what the standard C++ containers do.
Imagine that insert() returned a single iterator, as the standard containers do. So, I would get an iterator that points to the start of the insertion:

    boost::text::rope r("e");
    auto it = r.insert(r.end(), 'f');
    assert(r.distance() == 2);          // We added a grapheme.
    assert(it == std::next(r.begin())); // This one, in fact.

Not so fast. If I insert a sequence starting with one or more combining code points, the beginning of the insertion might get combined with the preceding code point (due to normalization) or grapheme (because combining marks at the end of a grapheme are just part of that grapheme). Just as importantly, the same thing can happen at the end of the insertion, too.
For these reasons, text returns a view that indicates the full extent of its graphemes that were changed by erase/insert/replace.

    char const * combining_diaeresis = u8"\u0308";

    boost::text::text t("e");
    auto v = t.insert(t.end(), combining_diaeresis);
    assert(t.distance() == 1); // No new grapheme!  We now have the single code point 'ë'.
    assert(v.begin() == t.begin());

    // Reuse v for the second insertion; the view again starts at t.begin().
    v = t.insert(t.end(), combining_diaeresis);
    assert(t.distance() == 1); // Still no new grapheme.  We now have a grapheme with the code points 'ë' and '¨'.
    assert(v.begin() == t.begin());

This behavior is correct, but a bit surprising when you first see it.
Since you may want to use text with some standard facilities that expect erase/insert/replace to return a single iterator, the view returned is implicitly convertible to a single iterator; the view's begin() is used for this.
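The conversion described above can be sketched with a small class template. This is only an illustration of the idea; changed_range, first_char, and sample are hypothetical names, and the actual view type returned by the text layer is not shown in this section.

```cpp
// Hypothetical stand-in for the view returned by erase/insert/replace.
template<typename Iterator>
struct changed_range
{
    Iterator first;
    Iterator last;

    constexpr Iterator begin() const { return first; }
    constexpr Iterator end() const { return last; }

    // Implicit conversion to a single iterator; begin() is used for this.
    constexpr operator Iterator() const { return first; }
};

// A function written in the style of code that targets the standard
// containers: it expects the result of an insertion to be a single iterator.
constexpr char first_char(char const * it) { return *it; }

// Sample storage for the usage described below.
constexpr char sample[] = "abc";
```

For example, first_char(changed_range<char const *>{sample + 1, sample + 3}) compiles and yields 'b', because the view converts to its begin() iterator at the call site.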
Table 1.7. Picking the Right Text Type
If I want ... | ... my text type is: |
---|---|
a reference to some text that will outlive the reference, without allocating | |
mutable text with efficient mutation at the end of the string | |
mutable text with efficient mutation at any point in the string | |
text with contiguous storage | |
a null-terminated underlying string | |
mutable text the size of a single pointer | |
thread-safe text | |
text with the small-object optimization | |
text with copy-on-write semantics | |
function parameters that may bind to | |
function parameters that may bind to | |