
Bidirectional Text

Most scripts supported by Unicode are written left-to-right (LTR). Some, most notably Arabic and Hebrew, are written right-to-left (RTL). Let's say you have some text that is mostly LTR, but which contains some RTL text. To present this text to the user, say by printing it to a terminal or showing it in a GUI, you need to produce an ordering of the code points that reverses the RTL subsequences, so that they appear in their proper reading order rather than in memory order. For example, if the capital letters in the string "car means CAR." were from an RTL script, the end user should see "car means RAC."
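To make the "car means CAR." example concrete, here is a toy sketch in plain C++. It does not use Boost.Text and it is not the Unicode bidirectional algorithm; it only assumes, as the example does, that ASCII capital letters stand in for RTL characters, and it reverses each maximal run of them. The name toy_visual_order is made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Toy model only (hypothetical helper, not part of Boost.Text): ASCII
// capitals stand in for RTL characters, and each maximal run of them is
// reversed.  The real algorithm also deals with neutrals, numbers,
// nesting, and mirroring.
std::string toy_visual_order(std::string const & memory_order)
{
    auto const is_rtl = [](char c) {
        return std::isupper(static_cast<unsigned char>(c)) != 0;
    };

    std::string result = memory_order;
    auto it = result.begin();
    while (it != result.end()) {
        // Find the next run of "RTL" characters and reverse it in place.
        auto const run_first = std::find_if(it, result.end(), is_rtl);
        auto const run_last =
            std::find_if_not(run_first, result.end(), is_rtl);
        std::reverse(run_first, run_last);
        it = run_last;
    }

    // Reordering permutes characters; it never adds or removes any.
    assert(result.size() == memory_order.size());
    return result;
}
```

Under these assumptions, toy_visual_order("car means CAR.") yields "car means RAC.", while text with no capitals comes back unchanged.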

There are also times when you find RTL nested within LTR nested within RTL nested within LTR. Unicode is hard.

Boost.Text has an implementation of the Unicode bidirectional algorithm. The interface for simple uses is straightforward; you just call the algorithm with a sequence of code points, and it returns a view of subranges. Each subrange is a bidirectional_cp_subrange (or bidirectional_grapheme_subrange for the grapheme_range overload):

// This is the Arabic text as it appears in Google translate, already
// right-to-left.  This is how we expect it to appear in our output below,
// after the bidirectional algorithm processes it.
boost::text::text const RTL_order_arabic = "مرحبا ، عالم ثنائي الاتجاه";

boost::text::text const memory_order_text =
    (char const *)
    u8"When I type \"Hello, bidirectional world\" into Google translate "
    u8"English->Arabic, it produces \"هاجتالا يئانث ملاع ، ابحرم\".  I have no "
    u8"idea if it's correct.\n";

/* Prints:
When I type "Hello, bidirectional world" into Google translate English->Arabic, it produces "مرحبا ، عالم ثنائي الاتجاه".  I have no idea if it's correct.
*/
boost::text::rope bidirectional_text;
for (auto range : memory_order_text | boost::text::bidirectional_subranges) {
    for (auto grapheme : range) {
        // We can take each grapheme and print it out directly ...
        std::cout << grapheme;
        // ... or we can insert it into a text to be used elsewhere.
        bidirectional_text.insert(bidirectional_text.end(), grapheme);
    }
}

// Prints the same output as the loop above.
std::cout << bidirectional_text;

Note

By default, bidirectional_subranges() auto-detects the overall direction of the text, LTR or RTL. This overall direction is expressed as the paragraph embedding level; even embedding levels are LTR, and odd embedding levels are RTL. Even levels above 0 are found nested within RTL text. Odd levels above 1 are found nested within LTR text. This is what the embedding part of embedding level means.

If you want to force a particular embedding level, you can specify the optional paragraph_embedding_level parameter to bidirectional_subranges(). For example, say you're laying out text in an HTML table, so the text you're processing is separate from the surrounding text, but you know that the surrounding text is all RTL; in that case you may want to force an RTL level.

Most users can leave this parameter unchanged.
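The even/odd rule in the note above can be sketched in a few lines of plain C++. The helpers below are hypothetical and not part of Boost.Text; the two "nested" functions assume the convention the Unicode bidirectional algorithm uses for explicit embeddings, where nested RTL text gets the least odd level above the current one and nested LTR text gets the least even level above it.

```cpp
#include <cassert>

// Hypothetical helpers for illustration; none of these names are part of
// Boost.Text.
enum class direction { left_to_right, right_to_left };

// Even embedding levels are LTR; odd embedding levels are RTL.
inline direction direction_of(int level)
{
    assert(level >= 0);
    return level % 2 == 0 ? direction::left_to_right
                          : direction::right_to_left;
}

// Level given to RTL text nested at `level`: the least odd level above it.
inline int nested_rtl_level(int level)
{
    assert(level >= 0);
    return (level + 1) | 1;
}

// Level given to LTR text nested at `level`: the least even level above it.
inline int nested_ltr_level(int level)
{
    assert(level >= 0);
    return (level + 2) & ~1;
}
```

Under these assumptions, RTL nested within LTR nested within RTL nested within LTR walks through levels 0, 1, 2, and 3: nested_rtl_level(0) is 1, nested_ltr_level(1) is 2, and nested_rtl_level(2) is 3.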

There is another form of bidirectional_subranges(). This second form takes a line width extent and a callable to provide the extent of a subsequence of code points. These parameters are the same as the ones passed to some of the line break overloads.

This form is necessary because the locations of line breaks within the text can affect the output of the bidirectional algorithm. Therefore, the bidirectional algorithm needs to know the positions of line breaks in order to operate correctly. For example:

boost::text::text const memory_order_text =
    (char const *)
    u8"When I type \"Hello, bidirectional world\" into Google translate "
    u8"English->Arabic, it produces \"هاجتالا يئانث ملاع ، ابحرم\".  I have no "
    u8"idea if it's correct.\n";

#ifdef BOOST_NO_CXX14_GENERIC_LAMBDAS
// This is an out-of-line callable with an operator() member template.  It's
// just like the lambda below.
extent_callable extent;
#else
// The extent callable used in the line-breaking example took parameters of
// fixed type.  The bidirectional algorithm needs to call this using some of
// its own internal iterators, so the extent-callable it expects should be
// written as a template or a generic lambda.
auto const extent = [](auto first, auto last) {
    boost::text::grapheme_view<decltype(first)> range(first, last);
    return std::distance(range.begin(), range.end());
};
#endif

/* Prints:
************************************************************
When I type "Hello, bidirectional world" into Google
translate English->Arabic, it produces "ثنائي الاتجاه
مرحبا ، عالم".  I have no idea if it's correct.
************************************************************
*/
std::cout << "************************************************************\n";
for (auto range :
     memory_order_text | boost::text::bidirectional_subranges(60, extent)) {
    for (auto grapheme : range) {
        std::cout << grapheme;
    }
    // With the example for line breaking, our predicate in this spot was
    // !hard_break().  In this case though, there are some subranges that have
    // no line breaks at all.  Since we don't want double line breaks, we
    // still don't add a line break after a hard_break() (because those will
    // come after a sequence that already causes a line break, like "\r\n").
    // That leaves the need to break only after allowed breaks.
    if (range.allowed_break())
        std::cout << "\n";
}
std::cout << "************************************************************\n";
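Why line breaks matter can be seen in a toy model, sketched here in plain C++ without Boost.Text. It assumes the entire input is a single RTL run of space-separated words; the function name toy_rtl_lines and the greedy breaking strategy are inventions for this illustration, not how Boost.Text does it. The point is the order of operations: lines are chosen in memory order first, and only then is each line reversed for display.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Toy model (hypothetical helper): the whole input is treated as one RTL
// run of space-separated words.  Break greedily in memory order, then
// reverse each line separately for display.
std::vector<std::string>
toy_rtl_lines(std::string const & memory_order, std::size_t width)
{
    assert(width > 0);

    std::vector<std::string> lines;
    std::istringstream words(memory_order);
    std::string word;
    std::string line;
    while (words >> word) {
        // Start a new line if this word would not fit on the current one.
        if (!line.empty() && line.size() + 1 + word.size() > width) {
            lines.push_back(line);
            line.clear();
        }
        if (!line.empty())
            line += ' ';
        line += word;
    }
    if (!line.empty())
        lines.push_back(line);

    // Reordering happens per line, after the breaks are known.
    for (auto & l : lines)
        std::reverse(l.begin(), l.end());
    return lines;
}
```

With the input "ABC DEF GHI" and a width of 7, this produces the lines "FED CBA" and "IHG", so the words that come first in memory stay on the first line. Reversing the whole string before breaking would instead give "IHG FED" and "CBA", which puts the last word in reading order at the top.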

Important

The extent-callable used in the line breaking API takes two parameters of a single, exact type. The extent-callable used above must instead accept two parameters of any type that models code_point_iter. This is because the callable is invoked with some internal iterator types that do not match the code_point_iters in the top-level range that you pass to bidirectional_subranges().
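The difference between a fixed-type and a generic callable can be sketched with standard library pieces alone. Everything below is hypothetical: measure_through_adaptor stands in for an algorithm that wraps your iterators in its own iterator type before calling back, and std::reverse_iterator plays the role of that internal type. None of these names come from Boost.Text.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Fixed-type callable: usable only with exactly this iterator type.
struct fixed_extent
{
    using iterator = std::vector<char32_t>::const_iterator;
    std::ptrdiff_t operator()(iterator first, iterator last) const
    {
        return std::distance(first, last);
    }
};

// Generic callable: operator() is a template, so any iterator type works.
struct generic_extent
{
    template<typename Iter>
    std::ptrdiff_t operator()(Iter first, Iter last) const
    {
        return std::distance(first, last);
    }
};

// Stand-in for an algorithm that calls the extent callable with its own
// iterator type rather than the one the user passed in.
template<typename Iter, typename Extent>
std::ptrdiff_t measure_through_adaptor(Iter first, Iter last, Extent extent)
{
    return extent(
        std::reverse_iterator<Iter>(last),
        std::reverse_iterator<Iter>(first));
}

inline void extent_demo()
{
    std::vector<char32_t> const cps = {U'a', U'b', U'c'};

    // The fixed-type callable works when called directly ...
    assert(fixed_extent{}(cps.cbegin(), cps.cend()) == 3);

    // ... but only the generic one can be used through the adaptor.
    // Passing fixed_extent{} here would fail to compile.
    assert(
        measure_through_adaptor(cps.cbegin(), cps.cend(), generic_extent{}) ==
        3);
}
```

A C++14 generic lambda, like the one in the example above, is shorthand for a type such as generic_extent.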

As with most of the other Unicode layer algorithms, overloads of the two forms above exist for grapheme_range, code_point_range, and code_point_iter/sentinel inputs.

