Collation is the relative ordering of sequences of code points for purposes of sorting or searching. The Unicode collation algorithm takes a sequence of code points and produces a sequence of numbers (a "sort key") that can be lexicographically compared to another sequence's sort key.
Why can't we just lexicographically compare two Unicode strings? Because Unicode. Consider two code points `A` and `B`. There may be some languages for which the proper ordering is `A < B`. There may also be other languages for which `B < A` is the proper order. There may be yet other languages for which `A == B` is the proper order. As a concrete example, in Swedish `z < ö`, whereas in German `ö < z`.
So, if I want to implement a simple function like this:
```cpp
// Compare two code points for dictionary ordering.
bool impossible_less(uint32_t lhs, uint32_t rhs)
{
    return /* What goes here? */;
}
```
I can't, because I don't know what language we're using these code points in.
Sadly, collation is even more complicated than this. Collation must also handle the different ordering priorities of different characteristics of code points within a single language or context. For instance, I may want capitals first, implying that `G < g`, or I may want capitals last, implying `g < G`. Some languages sort based on accents, and some do not; collation must know whether `o < ô` or `o == ô`.
To handle this correctly, collation sort keys are created with support for four levels of comparison. The primary level (L1) represents differences in the base letter or symbol being represented; the secondary level (L2) represents differences in accent; the tertiary level (L3) represents differences in case, or variants of symbols; and the quaternary level (L4) represents differences in punctuation.
From the Unicode documentation:
Table 1.4. Comparison Levels

| Level | Description | Examples |
|---|---|---|
| L1 | Base characters | role < roles < rule |
| L2 | Accents | role < rôle < roles |
| L3 | Case/Variants | role < Role < rôle |
| L4 | Punctuation | role < “role” < Role |
When forming a sort key, all the L1 weights come first — for all code points in the sequence — then all the L2 weights, then the L3 weights, etc. This means that any L1 difference is treated as more important than any L2 difference, and any L2 difference trumps any L3 difference, etc.
It is possible to consider only a subset of levels (L1 through LN) when comparing sequences; this is known as the collation strength. For example, L1 strength means "Ignore accents, case, and punctuation", and L2 strength means "Ignore case and punctuation".
There are also parameters you can provide to the collation algorithm that create variations such as "Ignore accents, but do consider case". See `collation_flags` for details.
Since there exists no unique mapping of code points to collation weights that works for every language and context, users of the Unicode collation algorithm need a means of tailoring collation to a particular language or use case. Boost.Text supports the LDML format for specifying collation tailorings; see the LDML specification for details. An example of this super simple and convenient format is:
```
[normalization on]
[reorder Grek]
&N<ñ<<<Ñ
&C<ch<<<Ch<<<CH
&l<ll<<<Ll<<<LL
```
You should never need to write one of these tailoring scripts, but if you do, there's a full parser of the tailoring scripting language built into Boost.Text. For most users, it should be sufficient to use one of the canned tailoring scripts in the `boost/text/data/` directory. These come from CLDR, and the files use the CLDR locale naming scheme. Just rummage about until you find the language you're looking for. Note that many of the languages have multiple variants. There is also a default table that can be used for languages with no tailorings.
Note: Collation tailoring is quite expensive for some languages, typically the CJK (Chinese, Japanese, and Korean) language tailorings, which can take as much as multiple seconds. There is serialization of collation tables to/from a buffer or to/from a …
The collation-related functions all require a collation table. There are two ways to collate two code point sequences relative to one another. The first is to just call `collate()`. This can be very expensive to call in a loop or other hot code path, because the sort keys are not kept for re-use. The other way is to create the sort keys for the sequences, and then compare the keys yourself. Here is a simple example using the default table:
```cpp
boost::text::collation_table default_table =
    boost::text::default_collation_table();

// The Danish town of Aarhus changed the town's name in 2010.  Go figure.
boost::text::text const aarhus_old = "Århus";
boost::text::text const aarhus_new = "Aarhus";

// This is fine for one-off comparisons.  Beware though that creating sort
// keys is pretty expensive, and collate() just throws them away after it
// computes its result.
int const collation =
    boost::text::collate(aarhus_old, aarhus_new, default_table);

// "Aarhus" < "Århus" using the default collation, because accents are
// considered after the primary weights of the characters.
assert(0 < collation);

// If you know the strings are very long-lived, it may make more sense to
// generate keys for your two sequences, keep them somewhere, and compare them
// later.
boost::text::text_sort_key aarhus_new_key =
    boost::text::collation_sort_key(aarhus_new, default_table);
boost::text::text_sort_key aarhus_old_key =
    boost::text::collation_sort_key(aarhus_old, default_table);
assert(aarhus_old_key > aarhus_new_key);
assert(boost::text::compare(aarhus_old_key, aarhus_new_key) > 0);
```
Here's a similar example, this time using the Danish collation tailorings. Note that we're using two overloads that respectively take `code_point_range`s and `grapheme_range`s.
```cpp
boost::text::collation_table da_table = boost::text::tailored_collation_table(
    boost::text::data::da::standard_collation_tailoring());

boost::text::text const aarhus_old = "Århus";
std::array<uint32_t, 6> const aarhus_new = {{'A', 'a', 'r', 'h', 'u', 's'}};

boost::text::text_sort_key aarhus_new_key =
    boost::text::collation_sort_key(aarhus_new, da_table);
boost::text::text_sort_key aarhus_old_key =
    boost::text::collation_sort_key(aarhus_old, da_table);

// Under Danish collation with a default configuration, "Århus" < "Aarhus".
assert(aarhus_old_key < aarhus_new_key);
assert(boost::text::compare(aarhus_old_key, aarhus_new_key) < 0);
```
I've made every effort to hide the messy details of collation configuration from you, because they are super confusing. Instead, Boost.Text uses a set of `collation_flags` that have more recognizable semantics. The flags map directly onto the low-level configuration settings. Combinations of the flags that map to incompatible low-level settings will not compile. The low-level configuration is still available for those already familiar with Unicode collation.
When generating a sort key, the default configuration is to use tertiary strength (consider everything but punctuation), with no other options enabled. Other options can be specified to get different collations with the same table:
```cpp
boost::text::collation_table default_table =
    boost::text::default_collation_table();

// For the boost::text::text "_t" UDL.
using namespace boost::text::literals;

int result = 0;

// No configuration, which implies tertiary strength.
result = boost::text::collate("resume"_t, "RÉSUMÉ"_t, default_table);
assert(result < 0);

// Ignore everything but the letters themselves.
result = boost::text::collate(
    "resume"_t,
    "RÉSUMÉ"_t,
    default_table,
    boost::text::collation_flags::ignore_accents |
        boost::text::collation_flags::ignore_case |
        boost::text::collation_flags::ignore_punctuation);
assert(result == 0);

// Ignore accents; case is still considered of course.
result = boost::text::collate(
    "resume"_t,
    "Résumé"_t,
    default_table,
    boost::text::collation_flags::ignore_accents);
assert(result < 0);

result = boost::text::collate(
    "resume"_t,
    "résumé"_t,
    default_table,
    boost::text::collation_flags::ignore_accents);
assert(result == 0);

// Ignore case; accents are still considered of course.
result = boost::text::collate(
    "résumé"_t,
    "Résumé"_t,
    default_table,
    boost::text::collation_flags::ignore_case);
assert(result == 0);

result = boost::text::collate(
    "résumé"_t,
    "Resume"_t,
    default_table,
    boost::text::collation_flags::ignore_case);
assert(result > 0);

// Say you want to put one case explicitly before or after the other:
result = boost::text::collate(
    "Resume"_t,
    "resume"_t,
    default_table,
    boost::text::collation_flags::upper_case_first);
assert(result < 0);

result = boost::text::collate(
    "Resume"_t,
    "resume"_t,
    default_table,
    boost::text::collation_flags::lower_case_first);
assert(result > 0);

// You can also completely ignore punctuation.
result = boost::text::collate(
    "ellipsis"_t,
    "ellips...is"_t,
    default_table,
    boost::text::collation_flags::ignore_punctuation);
assert(result == 0);
```
Important: The default Unicode collation algorithm requires the NFD normalization form, one of the least compact normalization forms. NFC is the most widely used normalization form (most European keyboards produce NFC, and the W3C recommends NFC for all web text). This means either keeping your data in the most interoperable normalization form (NFC) and paying the cost of normalization every time you collate, or keeping it in the less interoperable NFD and not having to renormalize when collating. FCC is very similar to NFC, except that it is less compact in a few cases. Boost.Text's collation implementation relies on the inputs being in either FCC or NFD, so the collation functions require inputs to be normalized to FCC or NFD. This happens automatically when using the collation API specific to the text layer types (…
One thing people use associative containers for is to make it easy to look for elements, while keeping the elements sorted. Let's see how that goes in the world of Unicode:
```cpp
boost::text::collation_table da_table = boost::text::tailored_collation_table(
    boost::text::data::da::standard_collation_tailoring());

boost::text::text const aarhus_old = "Århus";
boost::text::text const aarhus_new = "Aarhus";

std::multiset<boost::text::text> set1; // So far so good, ...
// set1.insert(aarhus_old);            // Nope!  There's no operator<.
```
Hm. Not well. Ok, we can fix this by making a callable to use for comparisons:
```cpp
struct text_cmp
{
    bool operator()(
        boost::text::text const & lhs,
        boost::text::text const & rhs) const noexcept
    {
        // Binary comparison of code point values.
        return std::lexicographical_compare(
            lhs.begin().base(), lhs.end().base(),
            rhs.begin().base(), rhs.end().base());
    }
};
```
Then we can make a set using that:
```cpp
std::multiset<boost::text::text, text_cmp> set2;
set2.insert(aarhus_old); // Yay!
set2.insert(aarhus_new);

// Prints "Aarhus Århus", which is not collation-correct.
for (auto const & t : set2) {
    std::cout << t << " ";
}
std::cout << "\n";
```
But wait, I want to do a bunch of stuff in Danish, and I care about collation order. In fact, I'm building a directory of old and new town names, and I want to print an index of them for end-users. Ah, I'll just collate the values instead of binary-comparing them.
First, my new callable:
```cpp
struct text_coll_cmp
{
    bool operator()(
        boost::text::text const & lhs,
        boost::text::text const & rhs) const noexcept
    {
        return boost::text::collate(lhs, rhs, table) < 0;
    }

    boost::text::collation_table table; // Cheap to copy.
};
```
And then the using code:
```cpp
std::multiset<boost::text::text, text_coll_cmp> set3(text_coll_cmp{da_table});
set3.insert(aarhus_old);
set3.insert(aarhus_new);

// Prints "Århus Aarhus".  Cool!
for (auto const & t : set3) {
    std::cout << t << " ";
}
std::cout << "\n";
```
The problem is that we're doing the very expensive operation of creating two sort keys for each comparison, and immediately throwing them away.
Note: Ordering sequences of code points creates expectations about their order that can be misleading and/or confusing. Some collation-incorrect sort orders are fine, because end-users will never see them; some sort orders must be collation-correct, because end-users eventually will see them. Boost.Text therefore does not allow implicit ordering of sequences of code points.
Let's look at how much overhead we've incurred by doing all this repetitive collation. Below is the result of running a pair of perf tests. The first test is insertion of N code point sequences into a `std::multiset` that just does binary comparison, like `set2` in the example above. The second inserts the same N sequences into a `std::multiset` that does collation and throws away the keys, like `set3` in the other example above.
Table 1.5. Code Point Binary Comparison Vs. Unretained Key Collation
| N | Code Point Binary Comparison | Full Collation | Code Point Binary Comparison / Full Collation |
|---|---|---|---|
| 16 | 4968 ns | 8680 ns | 0.572 |
| 64 | 53410 ns | 82759 ns | 0.645 |
| 512 | 791259 ns | 1232150 ns | 0.642 |
| 4096 | 9450019 ns | 14795779 ns | 0.639 |
| 8192 | 20675781 ns | 32193859 ns | 0.642 |
So the collation is a significant cost. But wait, isn't all that transcoding from UTF-8 to UTF-32 taking up a lot of time? We can do better than the code point binary comparison above by doing `char` binary comparison instead:
```cpp
struct text_cmp_2
{
    bool operator()(
        boost::text::text const & lhs,
        boost::text::text const & rhs) const noexcept
    {
        // Binary comparison of char values.
        return std::lexicographical_compare(
            lhs.begin().base().base(), lhs.end().base().base(),
            rhs.begin().base().base(), rhs.end().base().base());
    }
};
```
```cpp
std::multiset<boost::text::text, text_cmp_2> set4;
set4.insert(aarhus_old);
set4.insert(aarhus_new);

// Prints "Århus Aarhus", which is collation-correct, but only by accident.
// The UTF-8 representation happens to compare that way.
for (auto const & t : set4) {
    std::cout << t << " ";
}
std::cout << "\n";
```
And here are the two binary comparison approaches compared to each other:
Table 1.6. Code Point Binary Comparison Vs. `char` Binary Comparison (Tree Set)
| N | Code Point Binary Comparison | `char` Binary Comparison | Code Point Binary Comparison / `char` Binary Comparison |
|---|---|---|---|
| 16 | 4968 ns | 4743 ns | 1.05 |
| 64 | 53410 ns | 51537 ns | 1.04 |
| 512 | 791259 ns | 759093 ns | 1.04 |
| 4096 | 9450019 ns | 8932070 ns | 1.06 |
| 8192 | 20675781 ns | 19541320 ns | 1.06 |
Interestingly, getting rid of UTF-8 -> UTF-32 transcoding only amounts to a few percent.
The cost of generating a collation sort key that can be used in comparison after comparison is very large. It is roughly an order of magnitude slower than comparing two strings.
In all these tests, I've intentionally chosen to use multisets and multimaps. Maps and sets have an unfair advantage, or a bug, depending on your use case.
If you use a map instead of a multimap, you will not insert a value that has the same sort key as an element already in the map, even if that code point sequence itself is not in the map. That is, two different code point sequences can collate to the same sort key. If you want to consider two code point sequences equivalent when they collate the same, it is a feature, and a possibly large optimization, not to insert one of them. If you want to keep all unique code point sequences, regardless of their associated keys, this is a bug.
As with everything else associated with Unicode, it's complicated: you can shave a bit off the binary comparison by comparing the underlying `char`s directly, though it's not a large effect.
Associative containers are not the best choice when the key-type is a sequence of things, because comparing sequences is inherently expensive. Why not use the unordered associative containers then?
```cpp
boost::text::collation_table da_table = boost::text::tailored_collation_table(
    boost::text::data::da::standard_collation_tailoring());

boost::text::text const aarhus_old = "Århus";
boost::text::text const aarhus_new = "Aarhus";

// This works.
std::unordered_multiset<boost::text::text> set;
set.insert(aarhus_old);
set.insert(aarhus_new);
assert(set.size() == 2);

// So does this.
std::unordered_multimap<boost::text::text_sort_key, boost::text::text> map;
map.emplace(boost::text::collation_sort_key(aarhus_old, da_table), aarhus_old);
map.emplace(boost::text::collation_sort_key(aarhus_new, da_table), aarhus_new);
assert(map.size() == 2);

// In fact, hashing support is built in for all the text layer types.
std::unordered_multiset<boost::text::rope> rope_set;
rope_set.insert(boost::text::rope(aarhus_old));
rope_set.insert(boost::text::rope(aarhus_new));
assert(rope_set.size() == 2);

std::unordered_multiset<boost::text::rope_view> rope_view_set;
rope_view_set.insert(aarhus_old);
rope_view_set.insert(aarhus_new);
assert(rope_view_set.size() == 2);

std::unordered_multiset<boost::text::text_view> text_view_set;
text_view_set.insert(aarhus_old);
text_view_set.insert(aarhus_new);
assert(text_view_set.size() == 2);
```
Hashing is your friend. Hashing containers sidestep the question of misleading orderings ("Is this order collation-correct?") completely, because they are unordered.