Text Segmentation

Unicode provides algorithms for breaking code point sequences into graphemes, words, sentences, and lines. The Unicode Bidirectional Algorithm requires paragraph breaking too, so paragraph breaking is included as well, even though it is not an official Unicode text segmentation algorithm.

Conventions

All the kinds of text breaking have a common pattern. Each kind of break X (where X is word, sentence, etc.) provides at least these functions:

template <typename CPIter, typename Sentinel>
CPIter prev_X_break(CPIter first, CPIter it, Sentinel last) noexcept;

prev_X_break() returns it if it is already at a break, or the break before it otherwise. There is one exception to this — even though there is always an implicit break at the end of a sequence of code points, if it == last, the previous break is still returned, if any.

This behavior allows us to do two convenient things with prev_X_break(). First, we can use prev_X_break(first, it, last) == it as a predicate that it is at a break. Second, we can use prev_X_break() followed by next_X_break() to find the nearest breaks around it.

Note that prev_X_break() requires last because in the general case, the algorithm needs to know context after it to determine where the breaks are at or before it.

template <typename CPIter, typename Sentinel>
CPIter next_X_break(CPIter first, Sentinel last) noexcept;

next_X_break() returns the next break after first. It has a precondition that first is already at a break; the results are otherwise undefined.

template<typename CPIter, typename Sentinel>
utf32_view<CPIter> X(CPIter first, CPIter it, Sentinel last) noexcept;

X() returns smallest range of code points that comprise an X (word, line, etc.) in which it is found.

template<typename CPIter, typename Sentinel>
auto Xs(CPIter first, Sentinel last) noexcept;

Xs() returns a view of subranges of [first, last). Each subrange is an X. Xs is a view adaptor, and can be used in pipe expressions — without parameters — as in r | Xs.

	Note
	Since all the text segmentation operations can be done in in terms of next and previous steps, `Xs` can be reversed. Use `boost::text::reverse` or `std::views::reverse` to do so, as in `r \| Xs \| std::views::reverse`

And of course there are code_point_range overloads as well:

template<typename CPRange, typename CPIter>
auto prev_X_break(CPRange & range, CPIter it) noexcept;
template<typename CPRange, typename CPIter>
auto next_X_break(CPRange & range, CPIter it) noexcept;
template<typename CPRange, typename CPIter>
auto X(CPRange & range, CPIter it) noexcept;
template<typename CPRange>
auto Xs(CPRange & range) noexcept;

For all kinds of breaks besides grapheme breaks, there are range overloads that accept grapheme_range ranges grapheme_iter iterators instead. These provide convenient support for using the Unicode layer algorithms with the text layer types like text and rope.

template<typename GraphemeRange, typename GraphemeIter>
auto prev_X_break(GraphemeRange const & range, GraphemeIter it) noexcept;
template<typename GraphemeRange, typename GraphemeIter>
auto next_X_break(GraphemeRange const & range, GraphemeIter it) noexcept;
template<typename GraphemeRange, typename GraphemeIter>
auto X(GraphemeRange const & range, GraphemeIter it) noexcept;
template<typename GraphemeRange>
auto Xs(GraphemeRange const & range) noexcept;

Tailoring

Unicode allows for tailoring of the segmentation algorithms, to produce customized results that are necessary or useful for a particular application, or to produce correct results in cases that the Unicode algorithms do not handle. Some of the break algorithms below are tailorable. Each section below indicates whether a certain kind of break is tailorable, and if so, how.

Graphemes

// U+0308 COMBINING ACUTE ACCENT
std::array<uint32_t, 3> cps = {{'a', 0x0308, 'b'}};

auto const first = cps.begin();
auto const last = cps.end();

auto at_or_before_1 = boost::text::prev_grapheme_break(first, first + 1, last);
assert(at_or_before_1 == first + 0);

auto at_or_before_2 = boost::text::prev_grapheme_break(cps, first + 2);
assert(at_or_before_2 == first + 2);

auto at_or_before_3 = boost::text::prev_grapheme_break(first, first + 3, last);
assert(at_or_before_3 == first + 2);

auto after_0 = boost::text::next_grapheme_break(first, last);
assert(after_0 == first + 2);

// Prints "[0, 2) [2, 3)".
for (auto range : boost::text::as_graphemes(cps)) {
    std::cout << '[' << (range.begin() - first) << ", " << (range.end() - first)
              << ") ";
}
std::cout << "\n";

// Prints "[2, 3) [0, 2)".  You should use std::views::reverse if you can.
for (auto range : boost::text::as_graphemes(cps) | boost::text::reverse) {
    std::cout << '[' << (range.begin() - first) << ", " << (range.end() - first)
              << ") ";
}
std::cout << "\n";

Boost.Text does not support tailoring of grapheme breaking, because graphemes are the fundamental unit of work for the text layer of the library. All code must have the same notion of what a grapheme is for that to work.

Alternate Grapheme API

The overload set for producing a view of graphemes, and the associated view adaptor, are called as_graphemes(), not graphemes().

Important

	Important
`as_graphemes()` produces `grapheme_view<Iter>`s (where `Iter` is some code point iterator). If you call a standard algorithm in `std::ranges` with two `grapheme_view`s of different type, there is a chance that the call will be ill-formed. For example, this: grapheme_view<Iter1> v1(/.../); grapheme_view<Iter2> v2(/.../); std::ranges::search(v1, v2); may not compile. The reasons is that the default comparator used by `std::ranges::search()` expects the elements of both ranges to be equality-comparable. This means that they have to have a common type that both types of elements may be converted to. If the underlying values iterated over in `grapheme_view<Iter1>` and `grapheme_view<Iter2>` are different, a common type probably does not exist for the elements of those two views. As a workaround, you can pass a more permissive comparator, such as `std::equal_to<>`: grapheme_view<Iter1> v1(/.../); grapheme_view<Iter2> v2(/.../); std::ranges::search(v1, v2, std::equal_to<>{});

as_graphemes() produces grapheme_view<Iter>s (where Iter is some code point iterator). If you call a standard algorithm in std::ranges with two grapheme_views of different type, there is a chance that the call will be ill-formed. For example, this:

grapheme_view<Iter1> v1(/*...*/);
grapheme_view<Iter2> v2(/*...*/);
std::ranges::search(v1, v2);

may not compile. The reasons is that the default comparator used by std::ranges::search() expects the elements of both ranges to be equality-comparable. This means that they have to have a common type that both types of elements may be converted to. If the underlying values iterated over in grapheme_view<Iter1> and grapheme_view<Iter2> are different, a common type probably does not exist for the elements of those two views.

As a workaround, you can pass a more permissive comparator, such as std::equal_to<>:

grapheme_view<Iter1> v1(/*...*/);
grapheme_view<Iter2> v2(/*...*/);
std::ranges::search(v1, v2, std::equal_to<>{});

Words

Word breaks occur where you'd expect — at the beginnings and ends of words — but they also occur where you might not expect — at the beginnings and ends of the code point sequences between words. Here is an example of word breaks taken from Unicode Text Segmention. The string "The quick (“brown”) fox can’t jump 32.3 feet, right?" is broken up into words like this:

Table 1.1. Word Break Example

"The"

" "

"quick"

" "

"("

"“"

"brown"

"”"

")"

" "

"fox"

" "

"can’t"

" "

"jump"

" "

"32.3"

" "

"feet"

","

" "

"right"

"?"

Note that many of those "words" are not what most people would consider to be words. You may need to do some additional processing to find only the "real" words, if that matters in your use case.

The word breaking API can be used just as the grapheme break API, except that it also has grapheme_range overloads. Here are some example calls using only the grapheme_range overloads, with a text as the grapheme_range:

// Using GraphemeRange/GraphemeIterator overloads...
boost::text::text cps("The quick (\"brown\") fox can’t jump 32.3 feet, right?");

auto const first = cps.begin();

auto at_or_before_1 = boost::text::prev_word_break(cps, std::next(first, 1));
assert(at_or_before_1 == std::next(first, 0));

auto at_or_before_3 = boost::text::prev_word_break(cps, std::next(first, 3));
assert(at_or_before_3 == std::next(first, 3));

auto after_0 = boost::text::next_word_break(cps, first);
assert(after_0 == std::next(first, 3));

auto around_7 = boost::text::word(cps, std::next(first, 7));
assert(around_7.begin() == std::next(first, 4));
assert(around_7.end() == std::next(first, 9));

// Prints the indices of the words from the table above.
for (auto range : boost::text::words(cps)) {
    std::cout << '[' << std::distance(first, range.begin()) << ", "
              << std::distance(first, range.end()) << ") ";
}
std::cout << "\n";

// Prints the indices of the words from the table above, backward.
for (auto range : boost::text::words(cps) | boost::text::reverse) {
    std::cout << '[' << std::distance(first, range.begin()) << ", "
              << std::distance(first, range.end()) << ") ";
}
std::cout << "\n";

Limitations of Word Breaking

This algorithm does not work for all languages. From Unicode Text Segmention:

For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.

French and Italian words are not meant to be broken after an apostrophe, but the default algorithm finds "l’objectif" to be a single word.

Breaking on dashes is the default. For example, the default algorithm finds "out-of-the-box" to be seven words.

There are other rarer failure cases in that document you might want to look at too.

Word Break Tailoring

Fortunately, unlike grapheme breaking, word breaking is tailorable. There are two ways to do so.

Each break algorithm is defined in terms of code point properties; each code point is a letter, digit, punctuation, etc. All the word break functions accept an optional word-property lookup function to replace the default one.

For example, here I've made a custom word property lookup function that treats a regular dash '-' as a MidLetter. MidLetter is a property that repesents code points that are part of a word as long as it can reach at least one letter on either side, before reaching a word break first:

boost::text::text cps("out-of-the-box");

// Prints "out - of - the - box".
for (auto range : boost::text::words(cps)) {
    std::cout << boost::text::text_view(range.begin(), range.end()) << " ";
}
std::cout << "\n";

auto const custom_word_prop = [](uint32_t cp) {
    if (cp == '-')
        return boost::text::word_property::MidLetter; // '-' becomes a MidLetter
    return boost::text::word_prop(cp); // Otherwise, just use the default implementation.
};

// Prints "out-of-the-box".
for (auto range : boost::text::words(cps, custom_word_prop)) {
    std::cout << boost::text::text_view(range.begin(), range.end()) << " ";
}
std::cout << "\n";

From Unicode Text Segmention, here are some other code points you might want to treat as MidLetter, depending on your language and use case:

Table 1.2. `MidLetter` Candidates

Code Point
U+002D ( - ) HYPHEN-MINUS
U+055A ( ՚ ) ARMENIAN APOSTROPHE
U+058A ( ֊ ) ARMENIAN HYPHEN
U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG
U+1806 ( ᠆ ) MONGOLIAN TODO SOFT HYPHEN
U+2010 ( ‐ ) HYPHEN
U+2011 ( ‑ ) NON-BREAKING HYPHEN
U+201B ( ‛ ) SINGLE HIGH-REVERSED-9 QUOTATION MARK
U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN
U+30FB ( ・ ) KATAKANA MIDDLE DOT
U+FE63 ( ﹣ ) SMALL HYPHEN-MINUS
U+FF0D ( － ) FULLWIDTH HYPHEN-MINUS

Another example from Unicode Text Segmention is to treat spaces as MidNum to support languages that use spaces as thousands separators, as in "€1 234,56". MidNum is like MidLetter, but for the interior code points of numbers instead of words containing letters. Here are the space code points you might want to do that with:

Table 1.3. `MidNum` Candidates

Code Point
U+0020 SPACE
U+00A0 NO-BREAK SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+202F NARROW NO-BREAK SPACE

Tailoring the properties for each code point works for some cases, but using tailorings of the meanings of MidLetter and MidNum can only add to the sizes of words; it cannot decrease their sizes. The word break functions take a second optional parameter that allows you to pick arbitrary word breaks based on limited context.

The Boost.Text implementation of the word break algorithm uses the current code point, plus two code points before and two code points after, to determine whether a word break exists at the current code point. Therefore, the signature of the custom word break function is this:

bool custom_break(uint32_t prev_prev,
                  uint32_t prev,
                  uint32_t curr,
                  uint32_t next,
                  uint32_t next_next);

Returning true indicates that [prev, curr] straddles a word break — prev is the last code point of one word, and curr is the first code point of the next. If provided, this custom break function is evaluated before any of the Unicode word break rules.

boost::text::text cps("snake_case camelCase");

// Prints "snake_case   camelCase".
for (auto range : boost::text::words(cps)) {
    std::cout << boost::text::text_view(range.begin(), range.end()) << " ";
}
std::cout << "\n";

// Break up words into chunks as if they were parts of identifiers in a
// popular programming language.
auto const identifier_break = [](uint32_t prev_prev,
                                 uint32_t prev,
                                 uint32_t curr,
                                 uint32_t next,
                                 uint32_t next_next) {
    if ((prev == '_') != (curr == '_'))
        return true;
    if (0x61 <= prev && prev <= 0x7a && 0x41 <= curr && curr <= 0x5a)
        return true;
    return false;
};

// Prints "snake _ case   camel Case".
for (auto range :
     boost::text::words(cps, boost::text::word_prop, identifier_break)) {
    std::cout << boost::text::text_view(range.begin(), range.end()) << " ";
}
std::cout << "\n";

Sentences

The sentence breaking API is the same as the word breaking API, without the extra tailoring parameters.

Paragraphs

The paragraph breaking API is the same as the sentence and word breaking APIs, without the extra tailoring parameters.

Unicode does not list paragraph breaks as a specific kind of text segmentation, but it can be useful in some cases. In particular, paragraph detection is part of the Unicode bidirectional algorithm. One way of tailoring the behavior of the bidirectional algorithm is to process some paragraphs separately from others; having an API for detecting paragraph breaks makes that simpler.

Lines

The Unicode line breaking algorithm differs from the other break algorithms in that there are multiple kinds of line breaks. Some line breaks are required, as after a newline (e.g. "\n" or "\r\n"). These are known as hard line breaks.

The line breaking algorithm produces many more line breaks, but all non-hard line breaks are places where it is possible to break the line — though it is not necessary to do so. These are known as allowed line breaks. Higher-level program logic must determine which of these allowed breaks is to be used, for example to fit in available horizontal space.

	Note
	Boost.Text only generates hard line breaks where they are indicated in the Unicode line breaking rules and there could be an allowed line break. Line breaks always occur at the beginning and end of any sequence, but Boost.Text does not report those as hard breaks — the fact that they are hard breaks is implicit.

The next_*_break() and prev_*_break() functions for line breaking come in two flavors. There are hard_line versions and allowed_line versions. For the allowed overloads, you may need to know, once you have the break position, whether it was a hard line break. The allowed overloads therefore return a struct, line_break_result. It has an .iter member to indicate the location, and a .hard_break member to indicate whether that location is a hard line break. Overloads of operator==() and operator!=() are defined between line_break_result and its iterator type so that you can treat it as an iterator in generic code if you don't care about the hard line break information:

std::array<uint32_t, 5> cps = {{'a', ' ', 'b', '\n', 'c'}};

auto const first = cps.begin();
auto const last = cps.end();

// prev_/next_hard_line_break() returns an iterator.
auto at_or_before_2_hard =
    boost::text::prev_hard_line_break(first, first + 2, last);
assert(at_or_before_2_hard == first + 0);

auto after_0_hard = boost::text::next_hard_line_break(first, last);
assert(after_0_hard == first + 4);

// prev_/next_allowed_line_break() returns a line_break_result<CPIter>.
auto at_or_before_2_allowed =
    boost::text::prev_allowed_line_break(first, first + 2, last);
assert(at_or_before_2_allowed.iter == first + 2);
assert(!at_or_before_2_allowed.hard_break);

auto after_0_allowed = boost::text::next_allowed_line_break(first, last);
assert(after_0_allowed.iter == first + 2);
assert(!after_0_allowed.hard_break);

// operator==() and operator!=() are defined between line_break_result<CPIter>
// and CPIter.
assert(at_or_before_2_allowed == first + 2);
assert(first + 2 == after_0_allowed);

The hard naming is only present in these low-level functions; the rest of the line breaking API uses line for the hard break version, and allowed_line for the other. The rest of the line breaking API should be familiar by now; it parallels the other breaking APIs, but with the hard vs. allowed overloads.

Just as the low-level prev and next functions for allowed beaks returned extra information, the other allowed-break functions do as well. The rest of the API produces code point or grapheme ranges, and the allowed-break versions produce line_break_cp_views or line_break_grapheme_views instead, respectively:

std::array<uint32_t, 5> cps = {{'a', ' ', 'b', '\n', 'c'}};

/* Prints:
"c"
"b
""a "
*/
for (auto line : boost::text::lines(cps, boost::text::allowed_breaks) |
                     boost::text::reverse) {
    std::cout << '"' << boost::text::to_string(line.begin(), line.end()) << '"';
    // Don't add \n to a hard line break; it already has one!
    if (!line.hard_break())
        std::cout << "\n";
}

Additionally, there are overloads that make it convenient to write simple code that selects an allowed break based on available space. The available space, and the amount of space taken up by each chunk of code points, is user-configurable. There is an overload of lines() that takes the amount of space and a callable that determines the space used by some sequence of code points. Each chunk contains the code points between allowed breaks. If a chunk would exceed available space, the allowed break before that chunk is used:

boost::text::text const cps =
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod "
    "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim "
    "veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea "
    "commodo consequat. Duis aute irure dolor in reprehenderit in voluptate "
    "velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint "
    "occaecat cupidatat non proident, sunt in culpa qui officia deserunt "
    "mollit anim id est laborum.";

/* Prints:
************************************************************
Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna
aliqua. Ut enim ad minim veniam, quis nostrud exercitation
ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit
esse cillum dolore eu fugiat nulla pariatur. Excepteur sint
occaecat cupidatat non proident, sunt in culpa qui officia
deserunt mollit anim id est laborum.
************************************************************
*/
std::cout << "************************************************************\n";
for (auto line : boost::text::lines(
         cps,
         60,
         [](boost::text::text::const_iterator::iterator first,
            boost::text::text::const_iterator::iterator last)
             -> std::ptrdiff_t {
             // estimated_width_of_graphemes() uses the same table-based width
             // determination that std::format() uses.  You can use anything
             // here, even the width of individual code points in a particular
             // font.
             return boost::text::estimated_width_of_graphemes(first, last);
         })) {
    std::cout << boost::text::text_view(line.begin(), line.end());
    if (!line.hard_break())
        std::cout << "\n";
}
std::cout << "************************************************************\n";