Unicode provides algorithms for breaking code point sequences into graphemes, words, sentences, and lines. The Unicode Bidirectional Algorithm requires paragraph breaking too, so paragraph breaking is included as well, even though it is not an official Unicode text segmentation algorithm.
All the kinds of text breaking have a common pattern. Each kind of break
X
(where X
is word, sentence, etc.) provides at least these functions:
template <typename CPIter, typename Sentinel> CPIter prev_X_break(CPIter first, CPIter it, Sentinel last) noexcept;
prev_X_break()
returns it
if it
is already at a break, or the break
before it
otherwise. There
is one exception to this — even though there is always an implicit
break at the end of a sequence of code points, if it
== last
, the previous break
is still returned, if any.
This behavior allows us to do two convenient things with prev_X_break()
. First, we can use prev_X_break(first, it,
last)
== it
as a predicate that it
is
at a break. Second, we can use prev_X_break()
followed by next_X_break()
to find the nearest breaks around it
.
Note that prev_X_break()
requires last
because in
the general case, the algorithm needs to know context after it
to determine where the breaks are at
or before it
.
template <typename CPIter, typename Sentinel> CPIter next_X_break(CPIter first, Sentinel last) noexcept;
next_X_break()
returns the next break after first
.
It has a precondition that first
is already at a break; the results are otherwise undefined.
template<typename CPIter, typename Sentinel> utf32_view<CPIter> X(CPIter first, CPIter it, Sentinel last) noexcept;
X()
returns smallest range of code points that comprise an X
(word, line, etc.) in which it
is found.
template<typename CPIter, typename Sentinel> auto Xs(CPIter first, Sentinel last) noexcept;
Xs()
returns a view of subranges of [first, last)
.
Each subrange is an X
. Xs
is a view adaptor, and can be used in
pipe expressions — without parameters — as in r | Xs
.
Note | |
---|---|
Since all the text segmentation operations can be done in in terms of next
and previous steps, |
And of course there are code_point_range
overloads as well:
template<typename CPRange, typename CPIter> auto prev_X_break(CPRange & range, CPIter it) noexcept; template<typename CPRange, typename CPIter> auto next_X_break(CPRange & range, CPIter it) noexcept; template<typename CPRange, typename CPIter> auto X(CPRange & range, CPIter it) noexcept; template<typename CPRange> auto Xs(CPRange & range) noexcept;
For all kinds of breaks besides grapheme breaks, there are range overloads
that accept grapheme_range
ranges grapheme_iter
iterators instead.
These provide convenient support for using the Unicode layer algorithms with
the text layer types like text
and rope
.
template<typename GraphemeRange, typename GraphemeIter> auto prev_X_break(GraphemeRange const & range, GraphemeIter it) noexcept; template<typename GraphemeRange, typename GraphemeIter> auto next_X_break(GraphemeRange const & range, GraphemeIter it) noexcept; template<typename GraphemeRange, typename GraphemeIter> auto X(GraphemeRange const & range, GraphemeIter it) noexcept; template<typename GraphemeRange> auto Xs(GraphemeRange const & range) noexcept;
Unicode allows for tailoring of the segmentation algorithms, to produce customized results that are necessary or useful for a particular application, or to produce correct results in cases that the Unicode algorithms do not handle. Some of the break algorithms below are tailorable. Each section below indicates whether a certain kind of break is tailorable, and if so, how.
// U+0308 COMBINING ACUTE ACCENT std::array<uint32_t, 3> cps = {{'a', 0x0308, 'b'}}; auto const first = cps.begin(); auto const last = cps.end(); auto at_or_before_1 = boost::text::prev_grapheme_break(first, first + 1, last); assert(at_or_before_1 == first + 0); auto at_or_before_2 = boost::text::prev_grapheme_break(cps, first + 2); assert(at_or_before_2 == first + 2); auto at_or_before_3 = boost::text::prev_grapheme_break(first, first + 3, last); assert(at_or_before_3 == first + 2); auto after_0 = boost::text::next_grapheme_break(first, last); assert(after_0 == first + 2); // Prints "[0, 2) [2, 3)". for (auto range : boost::text::as_graphemes(cps)) { std::cout << '[' << (range.begin() - first) << ", " << (range.end() - first) << ") "; } std::cout << "\n"; // Prints "[2, 3) [0, 2)". You should use std::views::reverse if you can. for (auto range : boost::text::as_graphemes(cps) | boost::text::reverse) { std::cout << '[' << (range.begin() - first) << ", " << (range.end() - first) << ") "; } std::cout << "\n";
Boost.Text does not support tailoring of grapheme breaking, because graphemes
are the fundamental unit of work for the text
layer of the library.
All code must have the same notion of what a grapheme is for that to work.
The overload set for producing a view of graphemes, and the associated view
adaptor, are called as_graphemes()
,
not graphemes()
.
Important | |
---|---|
grapheme_view<Iter1> v1(/*...*/); grapheme_view<Iter2> v2(/*...*/); std::ranges::search(v1, v2);
may not compile. The reasons is that the default comparator used by
As a workaround, you can pass a more permissive comparator, such as
grapheme_view<Iter1> v1(/*...*/); grapheme_view<Iter2> v2(/*...*/); std::ranges::search(v1, v2, std::equal_to<>{});
|
Word breaks occur where you'd expect — at the beginnings and ends
of words — but they also occur where you might not expect —
at the beginnings and ends of the code point sequences between
words. Here is an example of word breaks taken from Unicode
Text Segmention. The string "The
quick (“brown”) fox can’t jump 32.3 feet, right?"
is broken up into words like this:
Table 1.1. Word Break Example
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Note that many of those "words" are not what most people would consider to be words. You may need to do some additional processing to find only the "real" words, if that matters in your use case.
The word breaking API can be used just as the grapheme break API, except
that it also has grapheme_range
overloads. Here are
some example calls using only the grapheme_range
overloads, with a
text
as the grapheme_range
:
// Using GraphemeRange/GraphemeIterator overloads... boost::text::text cps("The quick (\"brown\") fox can’t jump 32.3 feet, right?"); auto const first = cps.begin(); auto at_or_before_1 = boost::text::prev_word_break(cps, std::next(first, 1)); assert(at_or_before_1 == std::next(first, 0)); auto at_or_before_3 = boost::text::prev_word_break(cps, std::next(first, 3)); assert(at_or_before_3 == std::next(first, 3)); auto after_0 = boost::text::next_word_break(cps, first); assert(after_0 == std::next(first, 3)); auto around_7 = boost::text::word(cps, std::next(first, 7)); assert(around_7.begin() == std::next(first, 4)); assert(around_7.end() == std::next(first, 9)); // Prints the indices of the words from the table above. for (auto range : boost::text::words(cps)) { std::cout << '[' << std::distance(first, range.begin()) << ", " << std::distance(first, range.end()) << ") "; } std::cout << "\n"; // Prints the indices of the words from the table above, backward. for (auto range : boost::text::words(cps) | boost::text::reverse) { std::cout << '[' << std::distance(first, range.begin()) << ", " << std::distance(first, range.end()) << ") "; } std::cout << "\n";
This algorithm does not work for all languages. From Unicode Text Segmention:
For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.
French and Italian words are not meant to be broken after an apostrophe,
but the default algorithm finds "l’objectif"
to be a single word.
Breaking on dashes is the default. For example, the default algorithm finds
"out-of-the-box"
to
be seven words.
There are other rarer failure cases in that document you might want to look at too.
Fortunately, unlike grapheme breaking, word breaking is tailorable. There are two ways to do so.
Each break algorithm is defined in terms of code point properties; each code point is a letter, digit, punctuation, etc. All the word break functions accept an optional word-property lookup function to replace the default one.
For example, here I've made a custom word property lookup function that treats
a regular dash '-'
as a MidLetter
. MidLetter
is a property that repesents code points that are part of a word as long
as it can reach at least one letter on either side, before reaching a word
break first:
boost::text::text cps("out-of-the-box"); // Prints "out - of - the - box". for (auto range : boost::text::words(cps)) { std::cout << boost::text::text_view(range.begin(), range.end()) << " "; } std::cout << "\n"; auto const custom_word_prop = [](uint32_t cp) { if (cp == '-') return boost::text::word_property::MidLetter; // '-' becomes a MidLetter return boost::text::word_prop(cp); // Otherwise, just use the default implementation. }; // Prints "out-of-the-box". for (auto range : boost::text::words(cps, custom_word_prop)) { std::cout << boost::text::text_view(range.begin(), range.end()) << " "; } std::cout << "\n";
From Unicode Text Segmention,
here are some other code points you might want to treat as MidLetter
, depending on your language and
use case:
Table 1.2. `MidLetter` Candidates
Code Point |
---|
U+002D ( - ) HYPHEN-MINUS |
U+055A ( ՚ ) ARMENIAN APOSTROPHE |
U+058A ( ֊ ) ARMENIAN HYPHEN |
U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG |
U+1806 ( ᠆ ) MONGOLIAN TODO SOFT HYPHEN |
U+2010 ( ‐ ) HYPHEN |
U+2011 ( ‑ ) NON-BREAKING HYPHEN |
U+201B ( ‛ ) SINGLE HIGH-REVERSED-9 QUOTATION MARK |
U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN |
U+30FB ( ・ ) KATAKANA MIDDLE DOT |
U+FE63 ( ﹣ ) SMALL HYPHEN-MINUS |
U+FF0D ( - ) FULLWIDTH HYPHEN-MINUS |
Another example from Unicode
Text Segmention is to treat spaces as MidNum
to support languages that use spaces as thousands separators, as in "€1 234,56"
. MidNum
is like MidLetter
, but for
the interior code points of numbers instead of words containing letters.
Here are the space code points you might want to do that with:
Table 1.3. `MidNum` Candidates
Code Point |
---|
U+0020 SPACE |
U+00A0 NO-BREAK SPACE |
U+2007 FIGURE SPACE |
U+2008 PUNCTUATION SPACE |
U+2009 THIN SPACE |
U+202F NARROW NO-BREAK SPACE |
Tailoring the properties for each code point works for some cases, but using
tailorings of the meanings of MidLetter
and MidNum
can only add to
the sizes of words; it cannot decrease their sizes. The word break functions
take a second optional parameter that allows you to pick arbitrary word breaks
based on limited context.
The Boost.Text implementation of the word break algorithm uses the current code point, plus two code points before and two code points after, to determine whether a word break exists at the current code point. Therefore, the signature of the custom word break function is this:
bool custom_break(uint32_t prev_prev, uint32_t prev, uint32_t curr, uint32_t next, uint32_t next_next);
Returning true
indicates that
[prev, curr]
straddles a word break — prev
is the last code point of one word,
and curr
is the first code
point of the next. If provided, this custom break function is evaluated before
any of the Unicode word break rules.
boost::text::text cps("snake_case camelCase"); // Prints "snake_case camelCase". for (auto range : boost::text::words(cps)) { std::cout << boost::text::text_view(range.begin(), range.end()) << " "; } std::cout << "\n"; // Break up words into chunks as if they were parts of identifiers in a // popular programming language. auto const identifier_break = [](uint32_t prev_prev, uint32_t prev, uint32_t curr, uint32_t next, uint32_t next_next) { if ((prev == '_') != (curr == '_')) return true; if (0x61 <= prev && prev <= 0x7a && 0x41 <= curr && curr <= 0x5a) return true; return false; }; // Prints "snake _ case camel Case". for (auto range : boost::text::words(cps, boost::text::word_prop, identifier_break)) { std::cout << boost::text::text_view(range.begin(), range.end()) << " "; } std::cout << "\n";
The sentence breaking API is the same as the word breaking API, without the extra tailoring parameters.
The paragraph breaking API is the same as the sentence and word breaking APIs, without the extra tailoring parameters.
Unicode does not list paragraph breaks as a specific kind of text segmentation, but it can be useful in some cases. In particular, paragraph detection is part of the Unicode bidirectional algorithm. One way of tailoring the behavior of the bidirectional algorithm is to process some paragraphs separately from others; having an API for detecting paragraph breaks makes that simpler.
The Unicode line breaking algorithm differs from the other break algorithms
in that there are multiple kinds of line breaks. Some line breaks are required,
as after a newline (e.g. "\n"
or "\r\n"
). These are
known as hard line breaks.
The line breaking algorithm produces many more line breaks, but all non-hard line breaks are places where it is possible to break the line — though it is not necessary to do so. These are known as allowed line breaks. Higher-level program logic must determine which of these allowed breaks is to be used, for example to fit in available horizontal space.
Note | |
---|---|
Boost.Text only generates hard line breaks where they are indicated in the Unicode line breaking rules and there could be an allowed line break. Line breaks always occur at the beginning and end of any sequence, but Boost.Text does not report those as hard breaks — the fact that they are hard breaks is implicit. |
The next_*_break()
and prev_*_break()
functions for line breaking come in two flavors. There are hard_line
versions and allowed_line
versions. For the allowed
overloads, you may need to know, once you have the break position, whether
it was a hard line break. The allowed
overloads therefore return a struct, line_break_result
. It has an
.iter
member to indicate the location, and a .hard_break
member to indicate whether that
location is a hard line break. Overloads of operator==()
and operator!=()
are defined between line_break_result
and its iterator
type so that you can treat it as an iterator in generic code if you don't
care about the hard line break information:
std::array<uint32_t, 5> cps = {{'a', ' ', 'b', '\n', 'c'}}; auto const first = cps.begin(); auto const last = cps.end(); // prev_/next_hard_line_break() returns an iterator. auto at_or_before_2_hard = boost::text::prev_hard_line_break(first, first + 2, last); assert(at_or_before_2_hard == first + 0); auto after_0_hard = boost::text::next_hard_line_break(first, last); assert(after_0_hard == first + 4); // prev_/next_allowed_line_break() returns a line_break_result<CPIter>. auto at_or_before_2_allowed = boost::text::prev_allowed_line_break(first, first + 2, last); assert(at_or_before_2_allowed.iter == first + 2); assert(!at_or_before_2_allowed.hard_break); auto after_0_allowed = boost::text::next_allowed_line_break(first, last); assert(after_0_allowed.iter == first + 2); assert(!after_0_allowed.hard_break); // operator==() and operator!=() are defined between line_break_result<CPIter> // and CPIter. assert(at_or_before_2_allowed == first + 2); assert(first + 2 == after_0_allowed);
The hard
naming is only present
in these low-level functions; the rest of the line breaking API uses line
for the hard break version, and allowed_line
for the other. The rest of
the line breaking API should be familiar by now; it parallels the other breaking
APIs, but with the hard vs. allowed overloads.
Just as the low-level prev
and next
functions for allowed
beaks returned extra information, the other allowed-break functions do as
well. The rest of the API produces code point or grapheme ranges, and the
allowed-break versions produce line_break_cp_views
or line_break_grapheme_views
instead, respectively:
std::array<uint32_t, 5> cps = {{'a', ' ', 'b', '\n', 'c'}}; /* Prints: "c" "b ""a " */ for (auto line : boost::text::lines(cps, boost::text::allowed_breaks) | boost::text::reverse) { std::cout << '"' << boost::text::to_string(line.begin(), line.end()) << '"'; // Don't add \n to a hard line break; it already has one! if (!line.hard_break()) std::cout << "\n"; }
Additionally, there are overloads that make it convenient to write simple
code that selects an allowed break based on available space. The available
space, and the amount of space taken up by each chunk of code points, is
user-configurable. There is an overload of lines()
that takes the amount of space and a callable
that determines the space used by some sequence of code points. Each chunk
contains the code points between allowed breaks. If a chunk would exceed
available space, the allowed break before that chunk is used:
boost::text::text const cps = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod " "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim " "veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea " "commodo consequat. Duis aute irure dolor in reprehenderit in voluptate " "velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint " "occaecat cupidatat non proident, sunt in culpa qui officia deserunt " "mollit anim id est laborum."; /* Prints: ************************************************************ Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. ************************************************************ */ std::cout << "************************************************************\n"; for (auto line : boost::text::lines( cps, 60, [](boost::text::text::const_iterator::iterator first, boost::text::text::const_iterator::iterator last) -> std::ptrdiff_t { // estimated_width_of_graphemes() uses the same table-based width // determination that std::format() uses. You can use anything // here, even the width of individual code points in a particular // font. return boost::text::estimated_width_of_graphemes(first, last); })) { std::cout << boost::text::text_view(line.begin(), line.end()); if (!line.hard_break()) std::cout << "\n"; } std::cout << "************************************************************\n";