To iterate correctly in many text-handling contexts, you need to know both the beginning and the end of the range you are iterating over. Note that this is not a safety issue, but a correctness one.
For example, say we have a string `s` of UTF-8 code units that we would like to iterate over to produce UTF-32 code points. If the last code unit in `s` is `0xe0`, we should expect two more code units to follow. They are not present, though, because `0xe0` is the last code unit. Now consider how you would implement `operator++()` for an iterator `it` that transcodes from UTF-8 to UTF-32. If you advance far enough to get the next UTF-32 code point in each call to `operator++()`, you may run off the end of `s` when you find `0xe0` and try to read two more code units. Note that it does not matter that `it` probably comes from a range with an end-iterator or sentinel as its mate; inside `it`'s `operator++()` this is no help. `it` must therefore have the end-iterator or sentinel as a data member. The same logic applies to the other end of the range if `it` is bidirectional: `it` must also have the iterator to the start of the underlying range as a data member.
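
The point about storing the end of the range can be illustrated with a minimal sketch. This is not Boost.Text's implementation; the names `utf8_to_32_sketch` and `decode_all` are hypothetical, the iterator is forward-only, and validation is simplified (overlong forms and surrogates are not rejected). It assumes that a truncated or malformed sequence should produce U+FFFD.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch, not Boost.Text's implementation: a forward-only
// UTF-8 to UTF-32 decoder that keeps the end of the input as a data member.
struct utf8_to_32_sketch
{
    char const * cur_;  // current position in the UTF-8 input
    char const * last_; // end of the UTF-8 input

    static constexpr char32_t replacement = 0xFFFD;

    // Number of code units implied by a lead byte (0 if not a valid lead).
    static int length_from_lead(unsigned char b)
    {
        if (b < 0x80) return 1;
        if (0xC2 <= b && b <= 0xDF) return 2;
        if (0xE0 <= b && b <= 0xEF) return 3;
        if (0xF0 <= b && b <= 0xF4) return 4;
        return 0;
    }

    // Decodes the code point at cur_ and reports how many code units it used.
    char32_t decode(int & consumed) const
    {
        assert(cur_ != last_); // precondition: not at the end
        unsigned char const lead = static_cast<unsigned char>(*cur_);
        int const n = length_from_lead(lead);
        if (n == 0) { consumed = 1; return replacement; }
        if (n == 1) { consumed = 1; return lead; }
        char32_t cp = lead & (0xFF >> (n + 1)); // payload bits of the lead byte
        for (int i = 1; i < n; ++i) {
            // Without last_, this read could run off the end of the input.
            if (cur_ + i == last_) { consumed = i; return replacement; }
            unsigned char const b = static_cast<unsigned char>(cur_[i]);
            if ((b & 0xC0) != 0x80) { consumed = i; return replacement; }
            cp = (cp << 6) | (b & 0x3F);
        }
        consumed = n;
        return cp;
    }

    char32_t operator*() const { int n; return decode(n); }
    utf8_to_32_sketch & operator++() { int n; decode(n); cur_ += n; return *this; }
    bool at_end() const { return cur_ == last_; }
};

// Decodes a whole string, for demonstration purposes.
std::vector<char32_t> decode_all(std::string const & s)
{
    std::vector<char32_t> out;
    utf8_to_32_sketch it{s.data(), s.data() + s.size()};
    for (; !it.at_end(); ++it) out.push_back(*it);
    return out;
}
```

With this sketch, decoding a string that ends in a lone `0xe0` stops at the end of the input instead of reading two code units that are not there. A bidirectional version would need the start of the input as a third member for the same reason.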
This unfortunate reality comes up over and over in the Boost.Text iterators, not just the ones that are UTF transcoding iterators. This is why iterators in Boost.Text usually consist of three underlying iterators.
This follows directly from the effect described above. Say you get a subrange from one iteration of a text segmentation algorithm:

    char const * c_str = /* ... */;
    auto const lines = c_str | boost::text::as_graphemes |
                       boost::text::lines(boost::text::allowed_breaks);

    int line_index = 0;
    for (auto line : lines) {
        auto first = lines.begin()->begin();
        std::cout << "line " << line_index++ << " offsets: "
                  << std::ranges::distance(first, line.begin()) << " - "
                  << std::ranges::distance(first, line.end()) // Oops.
                  << "\n";
    }

This code does not halt. The line marked with "Oops." will continue to count forever when it is executed in the second loop iteration. This happens because `first` is constructed from the iterators delimiting the first line, `*lines.begin()` (let's call that line `l` for brevity). `first`'s underlying iterators are: `l.begin().base()`, `first`'s lower bound, which points to the first code point in `l`; `l.begin().base()`, which is the current position of `first` within `l`; and `l.end().base()`, `first`'s upper bound, which points one past the last code point in `l`.
When evaluating `std::ranges::distance(first, line.end())`, `first` must be advanced until it is equal to `line.end()`. However, there is an upper bound on how far we can advance `first`. It cannot advance past its underlying upper bound iterator, which is equal to `l.end().base()` (which is `lines.begin()->end().base()`). This upper bound will always be less than `line.end()`. Remember, the `line` in `line.end()` is the line in the second iteration of the loop, but the line `l` (`== *lines.begin()`) is the line in the first iteration of the loop.
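
The effect can be reproduced in miniature with a toy model. The sketch below is not Boost.Text code: `bounded_iter` and `capped_distance` are hypothetical names, and the model assumes that an iterator simply stops moving once it reaches its own upper bound. The step cap exists only so the demonstration terminates.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a "fat" iterator: a current position plus the
// bounds of the subrange it was created from.
struct bounded_iter
{
    char const * lo_;  // lower bound of the originating subrange
    char const * cur_; // current position
    char const * hi_;  // upper bound of the originating subrange

    bounded_iter & operator++()
    {
        if (cur_ != hi_) ++cur_; // never moves past its own upper bound
        return *this;
    }

    // Equality looks only at the current position.
    friend bool operator==(bounded_iter a, bounded_iter b) { return a.cur_ == b.cur_; }
    friend bool operator!=(bounded_iter a, bounded_iter b) { return !(a == b); }
};

// Counts the steps from first to last, giving up (returning -1) after
// max_steps so that a mismatched pair does not loop forever.
std::ptrdiff_t capped_distance(bounded_iter first, bounded_iter last,
                               std::ptrdiff_t max_steps)
{
    assert(max_steps >= 0);
    std::ptrdiff_t n = 0;
    while (first != last) {
        if (n == max_steps) return -1;
        ++first;
        ++n;
    }
    return n;
}
```

For the text `"ab\ncd\n"`, an iterator bounded by the first line (offsets 0 to 3) can reach the beginning of the second line at offset 3, but it can never reach the end of the second line at offset 6, so `capped_distance` returns -1 for that pair. Measuring between two iterators that come from the same line works as expected.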
I know all of that was complicated. To keep things simple, follow this rule:
Important | |
---|---|
When Boost.Text gives you a subrange |
`for` Loops
Prior to C++17, range-based `for` loops require that the begin-iterator and the end-iterator have the same type. This means that any range consisting of an iterator/sentinel pair will not work with pre-C++17 range-based `for` loops. Writing it out the long way still works, of course, and in C++20 and later modes, everything just works.
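
As an illustration of writing the loop out the long way, here is a small sketch. The types `null_sentinel` and `c_str_range` are hypothetical stand-ins defined only for this example and are not Boost.Text's own types. Because the begin-iterator and the sentinel are declared separately, their types are free to differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sentinel marking the end of a null-terminated string.
struct null_sentinel {};

inline bool operator==(char const * p, null_sentinel) { return *p == '\0'; }
inline bool operator!=(char const * p, null_sentinel) { return *p != '\0'; }

// Hypothetical range whose begin() and end() return different types.
struct c_str_range
{
    char const * first_;

    char const * begin() const { return first_; }
    null_sentinel end() const { return {}; }
};

// Counts the code units in the range using a hand-written loop.
std::size_t count_units(c_str_range r)
{
    assert(r.first_ != nullptr);
    std::size_t n = 0;
    auto const last = r.end(); // sentinel type, not an iterator type
    for (auto it = r.begin(); it != last; ++it)
        ++n;
    return n;
}
```

The loop in `count_units` never requires the iterator and sentinel types to match, so it compiles in language modes where a range-based `for` over the same range would be rejected.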