Most scripts supported by Unicode are written left-to-right (LTR). Some, most notably Arabic and Hebrew, are written right-to-left (RTL). Let's say you have some text that is mostly LTR, but which contains some RTL text. If you want to present this text to the user, say by printing it to a terminal or showing it in a GUI, it becomes necessary to produce an ordering on the code points in the text that reverses the RTL subsequences, so they appear in their proper readable order, not in memory-order. For example, if the capital letters in the string "car means CAR." were from an RTL script, the end-user should see "car means RAC."
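The "car means CAR." example can be sketched with a toy model. This is not the Unicode bidirectional algorithm and it does not use Boost.Text; `reverse_rtl_runs` is a made-up helper that assumes ASCII capitals stand in for RTL characters, and it ignores neutrals, numbers, and nesting.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Toy model only: treats ASCII capitals as if they were RTL characters and
// reverses each maximal run of them, leaving everything else in place.
std::string reverse_rtl_runs(std::string text)
{
    auto is_rtl = [](char c) {
        return std::isupper(static_cast<unsigned char>(c)) != 0;
    };
    auto it = text.begin();
    while (it != text.end()) {
        // Find the next run of "RTL" characters ...
        auto run_first = std::find_if(it, text.end(), is_rtl);
        auto run_last = std::find_if_not(run_first, text.end(), is_rtl);
        // ... and flip it so it reads correctly when drawn left-to-right.
        std::reverse(run_first, run_last);
        it = run_last;
    }
    return text;
}
```

Calling `reverse_rtl_runs("car means CAR.")` yields `"car means RAC."`, the string the end-user should see.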
There are also times when you find RTL nested within LTR nested within RTL nested within LTR. Unicode is hard.
Boost.Text has an implementation of the Unicode bidirectional algorithm. The interface for simple uses is straightforward: you call the algorithm with a sequence of code points, and it returns a view of subranges. Each subrange is a `bidirectional_cp_subrange` (or a `bidirectional_grapheme_subrange` for the `grapheme_range` overload):
```cpp
// This is the Arabic text as it appears in Google translate, already
// right-to-left. This is how we expect it to appear in our output below,
// after the bidirectional algorithm processes it.
boost::text::text const RTL_order_arabic = "مرحبا ، عالم ثنائي الاتجاه";

boost::text::text const memory_order_text =
    (char const *)
    u8"When I type \"Hello, bidirectional world\" into Google translate "
    u8"English->Arabic, it produces \"هاجتالا يئانث ملاع ، ابحرم\". I have no "
    u8"idea if it's correct.\n";

/* Prints:
When I type "Hello, bidirectional world" into Google translate English->Arabic, it produces "مرحبا ، عالم ثنائي الاتجاه". I have no idea if it's correct.
*/
boost::text::rope bidirectional_text;
for (auto range : memory_order_text | boost::text::bidirectional_subranges) {
    for (auto grapheme : range) {
        // We can take each grapheme and print it out directly ...
        std::cout << grapheme;
        // ... or we can insert it into a text to be used elsewhere.
        bidirectional_text.insert(bidirectional_text.end(), grapheme);
    }
}

// Prints same as the loop above.
std::cout << bidirectional_text;
```
Note | |
---|---|
By default,
If you want to force a particular embedding level (say you're laying out text in an HTML table, so the text you're processing is separate from the surrounding text, but you know that the surrounding text is all RTL), you may want to specify the optional embedding-level parameter. Most users can leave this parameter unchanged. |
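
To get a feel for what forcing an embedding level changes, here is a toy sketch. It is not the Unicode bidirectional algorithm and does not call Boost.Text; `toy_visual_order` and its `paragraph_level` argument are names invented for this illustration, and the model assumes lowercase ASCII is LTR, capitals stand in for RTL, and everything else is neutral.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only. paragraph_level is a hypothetical parameter and must be
// 0 (LTR paragraph) or 1 (RTL paragraph).
std::string toy_visual_order(std::string const & text, int paragraph_level)
{
    enum class dir { ltr, rtl, neutral };
    auto classify = [](char c) -> dir {
        auto const u = static_cast<unsigned char>(c);
        if (std::islower(u))
            return dir::ltr;
        if (std::isupper(u))
            return dir::rtl;
        return dir::neutral;
    };
    dir const paragraph_dir = paragraph_level == 0 ? dir::ltr : dir::rtl;
    // RTL text sits at level 1; LTR text sits at the lowest even level that
    // is not below the paragraph level.
    auto level_of = [&](dir d) -> int {
        return d == dir::rtl ? 1 : (paragraph_level == 0 ? 0 : 2);
    };

    std::size_t const n = text.size();
    std::vector<int> levels(n);
    for (std::size_t i = 0; i < n; ++i) {
        dir const d = classify(text[i]);
        if (d != dir::neutral) {
            levels[i] = level_of(d);
            continue;
        }
        // A neutral takes the direction of its nearest strong neighbors when
        // they agree, and the paragraph direction otherwise. The ends of the
        // text count as having the paragraph direction.
        dir before = paragraph_dir;
        for (std::size_t j = i; j-- > 0;) {
            if (classify(text[j]) != dir::neutral) {
                before = classify(text[j]);
                break;
            }
        }
        dir after = paragraph_dir;
        for (std::size_t j = i + 1; j < n; ++j) {
            if (classify(text[j]) != dir::neutral) {
                after = classify(text[j]);
                break;
            }
        }
        levels[i] = before == after ? level_of(before) : paragraph_level;
    }

    // Reverse every run at or above each level, highest level first. The
    // levels move with their characters so later passes see the new order.
    std::string visual = text;
    for (int level = 2; level >= 1; --level) {
        std::size_t i = 0;
        while (i < n) {
            if (levels[i] < level) {
                ++i;
                continue;
            }
            std::size_t j = i;
            while (j < n && levels[j] >= level)
                ++j;
            std::reverse(visual.begin() + i, visual.begin() + j);
            std::reverse(levels.begin() + i, levels.begin() + j);
            i = j;
        }
    }
    return visual;
}
```

With `paragraph_level` 0, `"car means CAR."` comes out as `"car means RAC."`. With `paragraph_level` 1, the same input comes out as `".RAC car means"`, because the LTR words are now the embedded run inside an RTL line.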
There is another form of `bidirectional_subranges()`. This second form takes a line width extent and a callable that provides the extent of a subsequence of code points. These parameters are the same as the ones passed to some of the line break overloads.
This form is necessary because the locations of line breaks within the text can affect the output of the bidirectional algorithm. Therefore, the bidirectional algorithm needs to know the positions of line breaks in order to operate correctly. For example:
```cpp
boost::text::text const memory_order_text =
    (char const *)
    u8"When I type \"Hello, bidirectional world\" into Google translate "
    u8"English->Arabic, it produces \"هاجتالا يئانث ملاع ، ابحرم\". I have no "
    u8"idea if it's correct.\n";

#ifdef BOOST_NO_CXX14_GENERIC_LAMBDAS
// This is an out-of-line callable with an operator() member template. It's
// just like the lambda below.
extent_callable extent;
#else
// The extent callable used in the line-breaking example took parameters of
// fixed type. The bidirectional algorithm needs to call this using some of
// its own internal iterators, so the extent-callable it expects should be
// written as a template or a generic lambda.
auto const extent = [](auto first, auto last) {
    boost::text::grapheme_view<decltype(first)> range(first, last);
    return std::distance(range.begin(), range.end());
};
#endif

/* Prints:
************************************************************
When I type "Hello, bidirectional world" into Google translate English->Arabic, it produces "ثنائي الاتجاه مرحبا ، عالم". I have no idea if it's correct.
************************************************************
*/
std::cout << "************************************************************\n";
for (auto range :
     memory_order_text | boost::text::bidirectional_subranges(60, extent)) {
    for (auto grapheme : range) {
        std::cout << grapheme;
    }
    // With the example for line breaking, our predicate in this spot was
    // !hard_break(). In this case though, there are some subranges that have
    // no line breaks at all. Since we don't want double line breaks, we
    // still don't add a line break after a hard_break() (because those will
    // come after a sequence that already causes a line break, like "\r\n").
    // That leaves the need to break only after allowed breaks.
    if (range.allowed_break())
        std::cout << "\n";
}
std::cout << "************************************************************\n";
```
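
To see why the break positions matter, consider a toy sketch that does not use Boost.Text at all. It assumes ASCII capitals stand in for RTL characters and non-letters are neutral; `toy_reorder_line`, `break_then_reorder`, and `reorder_then_break` are hypothetical helpers, and the break position is passed in by hand rather than computed.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only, not the Unicode bidirectional algorithm. Reverses each run
// of capitals, including neutrals that sit between capitals.
std::string toy_reorder_line(std::string line)
{
    auto is_rtl = [](char c) {
        return std::isupper(static_cast<unsigned char>(c)) != 0;
    };
    auto is_ltr = [](char c) {
        return std::islower(static_cast<unsigned char>(c)) != 0;
    };
    auto it = line.begin();
    while (it != line.end()) {
        auto run_first = std::find_if(it, line.end(), is_rtl);
        // Extend the run up to the next LTR letter ...
        auto run_last = std::find_if(run_first, line.end(), is_ltr);
        // ... then give trailing neutrals back to the surrounding LTR text.
        while (run_last != run_first && !is_rtl(*(run_last - 1)))
            --run_last;
        std::reverse(run_first, run_last);
        it = run_last;
    }
    return line;
}

// Break first, then reorder each line on its own.
std::vector<std::string>
break_then_reorder(std::string const & text, std::size_t break_pos)
{
    return {toy_reorder_line(text.substr(0, break_pos)),
            toy_reorder_line(text.substr(break_pos))};
}

// Reorder the whole paragraph, then break. The RTL words end up on the
// wrong lines.
std::vector<std::string>
reorder_then_break(std::string const & text, std::size_t break_pos)
{
    std::string const visual = toy_reorder_line(text);
    return {visual.substr(0, break_pos), visual.substr(break_pos)};
}
```

For the input `"abc DEF GHI xyz"` broken at offset 8, `break_then_reorder` gives `"abc FED "` and `"IHG xyz"`, so a reader meets DEF before GHI. `reorder_then_break` gives `"abc IHG "` and `"FED xyz"`, which puts the second RTL word on the first line. That is the kind of error that reordering without knowledge of the line breaks would produce.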
Important | |
---|---|
The extent-callable used in the line breaking API takes two parameters
of exact type. The extent-callable used above must take two parameters
of any type that models |
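
The difference between a fixed-type callable and a generic one can be sketched without Boost.Text. `counting_extent` below is a hypothetical stand-in: it counts elements between two iterators instead of building a `grapheme_view`, so it only shows the shape of an out-of-line callable with an `operator()` member template, not the library's own `extent_callable`.

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical stand-in for an out-of-line extent callable. The member
// template lets one object accept whatever iterator and sentinel types the
// caller passes. This sketch treats every element as having extent 1.
struct counting_extent
{
    template<typename Iter, typename Sentinel>
    std::ptrdiff_t operator()(Iter first, Sentinel last) const
    {
        std::ptrdiff_t n = 0;
        for (; first != last; ++first)
            ++n;
        return n;
    }
};

// Example inputs held in containers with unrelated iterator types.
std::vector<char32_t> const cps_in_vector = {U'a', U'b', U'c'};
std::list<char32_t> const cps_in_list = {U'x', U'y'};
```

The same `counting_extent` object can be called with the iterators of `cps_in_vector` and of `cps_in_list`. A callable whose parameters were fixed to one iterator type could serve only one of them.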
As with most of the other Unicode layer algorithms, overloads of the two forms above exist for `grapheme_range`, `code_point_range`, and `code_point_iter`/sentinel inputs.