Peculiarities of Boost.Text Iterators and Ranges

Boost.Text Iterators Are Constructed from More Than One Underlying Iterator

To iterate correctly in many text-handling contexts, an iterator needs to know both the beginning and the end of the range it is iterating over. Note that this is not a safety issue, but a correctness one.

For example, say we have a string s of UTF-8 code units that we would like to iterate over to produce UTF-32 code points. If the last code unit in s is 0xe0, we should expect two more code units to follow. They are not present, though, because 0xe0 is the last code unit. Now consider how you would implement operator++() for an iterator it that transcodes from UTF-8 to UTF-32. If each call to operator++() advances far enough to get the next UTF-32 code point, you may run off the end of s when you find 0xe0 and try to read two more code units. Note that it does not matter that it probably comes from a range with an end-iterator or sentinel as its mate; inside its operator++() this is no help. The iterator it must therefore have the end-iterator or sentinel as a data member. The same logic applies to the other end of the range if it is bidirectional: it must also have the iterator to the start of the underlying range as a data member.
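To make the 0xe0 case concrete, here is a minimal sketch of a forward-only transcoding iterator that carries the end of the underlying range as a data member. This is not Boost.Text's implementation; the names decoded, decode_one, utf8_to_32_iter, replacement_character, and to_utf32 are hypothetical, and the decoder is deliberately simplified (it does not reject overlong encodings or surrogates). A truncated sequence is assumed to produce U+FFFD.

```cpp
#include <cassert>
#include <string>

// U+FFFD, used here for any ill-formed or truncated sequence (an assumption
// of this sketch).
constexpr char32_t replacement_character = 0xFFFD;

// Result of decoding one code point: the value, and the position just past it.
struct decoded
{
    char32_t cp;
    std::string::const_iterator next;
};

// Decodes the code point that starts at `it`. Precondition: it != last.
// Simplified: overlong encodings and surrogates are not rejected.
inline decoded decode_one(
    std::string::const_iterator it, std::string::const_iterator last)
{
    unsigned char const lead = static_cast<unsigned char>(*it++);
    int continuation = 0;
    char32_t cp = 0;
    if (lead < 0x80) {
        return {static_cast<char32_t>(lead), it};
    } else if ((lead & 0xE0) == 0xC0) {
        continuation = 1;
        cp = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
        continuation = 2;
        cp = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
        continuation = 3;
        cp = lead & 0x07;
    } else {
        return {replacement_character, it}; // stray continuation or bad lead
    }
    for (int i = 0; i < continuation; ++i) {
        // This check is why the end of the range must be available here.
        if (it == last)
            return {replacement_character, it};
        unsigned char const c = static_cast<unsigned char>(*it);
        if ((c & 0xC0) != 0x80)
            return {replacement_character, it};
        cp = (cp << 6) | (c & 0x3F);
        ++it;
    }
    return {cp, it};
}

// Hypothetical forward-only UTF-8 -> UTF-32 iterator.
struct utf8_to_32_iter
{
    std::string::const_iterator cur_;  // current position
    std::string::const_iterator last_; // end of the underlying range

    char32_t operator*() const { return decode_one(cur_, last_).cp; }
    utf8_to_32_iter & operator++()
    {
        cur_ = decode_one(cur_, last_).next;
        return *this;
    }
    friend bool operator==(utf8_to_32_iter a, utf8_to_32_iter b)
    {
        return a.cur_ == b.cur_;
    }
    friend bool operator!=(utf8_to_32_iter a, utf8_to_32_iter b)
    {
        return a.cur_ != b.cur_;
    }
};

// Transcodes all of s by walking it with the iterator above.
inline std::u32string to_utf32(std::string const & s)
{
    std::u32string out;
    utf8_to_32_iter it{s.begin(), s.end()};
    utf8_to_32_iter const end{s.end(), s.end()};
    for (; it != end; ++it)
        out.push_back(*it);
    return out;
}
```

Without last_, operator++() would have no way to notice that the two code units expected after 0xe0 are missing. A bidirectional version would need a third member for the start of the range, so that operator--() could stop there in the same way.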

This unfortunate reality comes up over and over in the Boost.Text iterators, not just the ones that are UTF transcoding iterators. This is why iterators in Boost.Text usually consist of three underlying iterators: the start of the underlying range, the current position, and the end of the underlying range.

Often, Iterators from Subranges Can Only Be Compared to Each Other

This follows directly from the effect described above. Say you get a subrange from one iteration of a text segmentation algorithm:

char const * c_str = /* ... */;
auto const lines =
    c_str | boost::text::as_graphemes |
    boost::text::lines(boost::text::allowed_breaks);

int line_index = 0;
for (auto line : lines) {
    auto first = lines.begin()->begin();
    std::cout << "line " << line_index++ << " offsets: "
              << std::ranges::distance(first, line.begin())
              << " - "
              << std::ranges::distance(first, line.end()) // Oops.
              << "\n";
}

This code does not halt. When the line marked with "Oops." executes in the second loop iteration, it counts forever. This happens because first is constructed from the iterators delimiting the first line, *lines.begin() (let's call that line l for brevity). first has three underlying iterators: l.begin().base(), its lower bound, which points to the first code point in l; l.begin().base() again, which is the current position of first within l; and l.end().base(), its upper bound, which points one past the last code point in l.

When evaluating std::ranges::distance(first, line.end()), first must be advanced until it is equal to line.end(). However, there is an upper bound on how far we can advance first. It cannot advance past its underlying upper bound iterator, which is equal to l.end().base() (which is lines.begin()->end().base()). This upper bound always comes before the position of line.end(), so first can never become equal to line.end(). Remember, the line in line.end() is the line in the second iteration of the loop, but the line l (== *lines.begin()) is the line in the first iteration of the loop.

I know all of that was complicated. To keep things simple, follow this rule:

	Important
	When Boost.Text gives you a subrange `s`, comparisons of `s.begin()` to `s.end()` are fine, and so is iteration between `s.begin()` and `s.end()`. However, iteration between either `s.begin()` or `s.end()` and any other iterator may result in undefined behavior.
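One way to get offsets while following this rule is to iterate only within each subrange and keep a running total. The sketch below uses only the standard library; line_offsets is a hypothetical helper, not part of Boost.Text. It assumes only that the outer range works with a range-based for loop and that each subrange has begin() and end(). Whether a particular Boost.Text view satisfies that is an assumption to verify.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Returns the [first, last) offsets of each subrange, measured in elements
// from the start of the first subrange. Iterators from different subranges
// are never compared or mixed.
template<typename Lines>
std::vector<std::pair<std::ptrdiff_t, std::ptrdiff_t>>
line_offsets(Lines const & lines)
{
    std::vector<std::pair<std::ptrdiff_t, std::ptrdiff_t>> result;
    std::ptrdiff_t offset = 0;
    for (auto const & line : lines) {
        // Count only between line.begin() and line.end(), which is safe.
        std::ptrdiff_t length = 0;
        auto it = line.begin();
        auto const last = line.end();
        for (; it != last; ++it)
            ++length;
        result.push_back({offset, offset + length});
        offset += length;
    }
    return result;
}
```

For example, calling line_offsets() on a std::vector<std::string> holding "ab" and "cde" yields the pairs (0, 2) and (2, 5). The offsets are counted in whatever elements the subranges contain, so for a range of graphemes they are grapheme offsets.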

Sentinels Break Pre-C++17 Range-Based `for` Loops

Prior to C++17, range-based for loops require that the begin-iterator and the end-iterator have the same type. This means that any range consisting of an iterator/sentinel pair will not work with pre-C++17 range-based for loops. Writing the loop out the long way, with explicit begin and end variables, still works, of course, and in C++17 and later modes, everything just works.
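Here is a small sketch of writing the loop out the long way. The types cstr_sentinel and cstr_range and the function count_chars are hypothetical stand-ins for illustration, not Boost.Text types. The sentinel compares equal to a pointer that has reached the terminating null character.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sentinel marking the terminating '\0' of a C string.
struct cstr_sentinel
{};

inline bool operator==(char const * p, cstr_sentinel) { return *p == '\0'; }
inline bool operator!=(char const * p, cstr_sentinel) { return *p != '\0'; }

// Hypothetical range whose begin() and end() have different types.
struct cstr_range
{
    char const * first_;

    char const * begin() const { return first_; }
    cstr_sentinel end() const { return {}; }
};

// Counts the chars in r using an explicit iterator/sentinel loop, which
// does not require begin() and end() to have the same type.
inline std::size_t count_chars(cstr_range r)
{
    std::size_t n = 0;
    auto it = r.begin();
    auto const last = r.end();
    for (; it != last; ++it)
        ++n;
    return n;
}
```

The explicit loop only needs it != last to compile, so it works in any language mode, including those where a range-based for loop over cstr_range would be rejected.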