Collation is the relative ordering of sequences of code points for purposes of sorting or searching. The Unicode collation algorithm takes a sequence of code points and produces a sequence of numbers (a "sort key") that can be lexicographically compared to another sequence's sort key.
Why can't we just lexicographically compare two Unicode strings? Because Unicode. Consider two code points `A` and `B`. There may be some languages for which the proper ordering is `A < B`. There may also be other languages for which `B < A` is the proper order. There may be yet other languages for which `A == B` is the proper order. As a concrete example, in Swedish `z < ö`, whereas in German `ö < z`.
So, if I want to implement a simple function like this:
```cpp
// Compare two code points for dictionary ordering.
bool impossible_less(uint32_t lhs, uint32_t rhs)
{
    return /* What goes here? */;
}
```
I can't, because I don't know what language we're using these code points in.
Sadly, collation is even more complicated than this. Collation must also handle the different ordering priorities of different characteristics of code points within a single language or context. For instance, I may want capitals first, implying that `G < g`, or I may want capitals last, implying `g < G`. Some languages sort based on accents, and some do not; collation must know whether `o < ô` or `o == ô`.
To handle this correctly, collation sort keys are created with support for four levels of comparison. The primary level (L1) represents differences in the base letter or symbol being represented; the secondary level (L2) represents differences in accent; the tertiary level (L3) represents differences in case, or variants of symbols; and the quaternary level (L4) represents differences in punctuation.
From the Unicode documentation:
Table 1.4. Comparison Levels

| Level | Description | Examples |
|---|---|---|
| L1 | Base characters | role < roles < rule |
| L2 | Accents | role < rôle < roles |
| L3 | Case/Variants | role < Role < rôle |
| L4 | Punctuation | role < “role” < Role |
When forming a sort key, all the L1 weights come first — for all code points in the sequence — then all the L2 weights, then the L3 weights, etc. This means that any L1 difference is treated as more important than any L2 difference, and any L2 difference trumps any L3 difference, etc.
It is possible to consider only a subset of levels (L1 through LN) when comparing sequences; this is known as the collation strength. For example, L1 strength means "Ignore accents, case, and punctuation", and L2 strength means "Ignore case and punctuation".
There are also parameters you can provide to the collation algorithm that create variations such as "Ignore accents, but do consider case". See `collation_flags` for details.
Since there exists no unique mapping of code points to collation weights that works for every language and context, users of the Unicode collation algorithm need a means of tailoring collation to a particular language or use case. Boost.Text supports the LDML format for specifying collation tailorings; see the LDML specification for details. An example of this super simple and convenient format is:
```
[normalization on]
[reorder Grek]
&N<ñ<<<Ñ
&C<ch<<<Ch<<<CH
&l<ll<<<Ll<<<LL
```
You should never need to write one of these tailoring scripts, but if you do, there's a full parser of the tailoring scripting language built into Boost.Text. For most users, it should be sufficient to use one of the canned tailoring scripts in the `boost/text/data/` directory. These come from CLDR, and the files use the CLDR locale naming scheme. Just rummage about until you find the language you're looking for. Note that many of the languages have multiple variants. There is also a default table that can be used for languages with no tailorings.
Note: Collation tailoring is quite expensive for some languages, typically the CJK (Chinese, Japanese, and Korean) language tailorings, which can take as much as multiple seconds. There is serialization of collation tables to/from a buffer or to/from a …
The collation-related functions all require a collation table. There are two ways to collate two code point sequences relative to one another. The first is to just call `collate()`. This can be very expensive to call in a loop or other hot code path, because the sort keys are not kept for re-use. The other way is to create the sort keys for the sequences, and then compare the keys yourself. Here is a simple example using the default table:
```cpp
boost::text::collation_table default_table =
    boost::text::default_collation_table();

// The Danish town of Aarhus changed the town's name in 2010.  Go figure.
boost::text::text const aarhus_old = "Århus";
boost::text::text const aarhus_new = "Aarhus";

// This is fine for one-off comparisons.  Beware though that creating sort
// keys is pretty expensive, and collate() just throws them away after it
// computes its result.
int const collation =
    boost::text::collate(aarhus_old, aarhus_new, default_table);

// "Aarhus" < "Århus" using the default collation, because accents are
// considered after the primary weights of the characters.
assert(0 < collation);

// If you know the strings are very long-lived, it may make more sense to
// generate keys for your two sequences, keep them somewhere, and compare them
// later.
boost::text::text_sort_key aarhus_new_key =
    boost::text::collation_sort_key(aarhus_new, default_table);
boost::text::text_sort_key aarhus_old_key =
    boost::text::collation_sort_key(aarhus_old, default_table);
assert(aarhus_old_key > aarhus_new_key);
assert(boost::text::compare(aarhus_old_key, aarhus_new_key) > 0);
```
Here's a similar example, this time using the Danish collation tailorings. Note that we're using two overloads that respectively take `code_point_range`s and `grapheme_range`s.
```cpp
boost::text::collation_table da_table = boost::text::tailored_collation_table(
    boost::text::data::da::standard_collation_tailoring());

boost::text::text const aarhus_old = "Århus";
std::array<uint32_t, 6> const aarhus_new = {{'A', 'a', 'r', 'h', 'u', 's'}};

boost::text::text_sort_key aarhus_new_key =
    boost::text::collation_sort_key(aarhus_new, da_table);
boost::text::text_sort_key aarhus_old_key =
    boost::text::collation_sort_key(aarhus_old, da_table);

// Under Danish collation with a default configuration, "Århus" < "Aarhus".
assert(aarhus_old_key < aarhus_new_key);
assert(boost::text::compare(aarhus_old_key, aarhus_new_key) < 0);
```
I've made every effort to hide the messy details of collation configuration from you, because they are super confusing. Instead, Boost.Text uses a set of `collation_flags` that have more recognizable semantics. The flags map directly onto the low-level configuration settings. Combinations of the flags that map to incompatible low-level settings will not compile. The low-level configuration is still available for those already familiar with Unicode collation.
When generating a sort key, the default configuration is to use tertiary strength (consider everything but punctuation), with no other options enabled. Other options can be specified to get different collations with the same table:
```cpp
boost::text::collation_table default_table =
    boost::text::default_collation_table();

// For the boost::text::text "_t" UDL.
using namespace boost::text::literals;

int result = 0;

// No configuration, which implies tertiary strength.
result = boost::text::collate("resume"_t, "RÉSUMÉ"_t, default_table);
assert(result < 0);

// Ignore everything but the letters themselves.
result = boost::text::collate(
    "resume"_t,
    "RÉSUMÉ"_t,
    default_table,
    boost::text::collation_flags::ignore_accents |
        boost::text::collation_flags::ignore_case |
        boost::text::collation_flags::ignore_punctuation);
assert(result == 0);

// Ignore accents; case is still considered of course.
result = boost::text::collate(
    "resume"_t,
    "Résumé"_t,
    default_table,
    boost::text::collation_flags::ignore_accents);
assert(result < 0);

result = boost::text::collate(
    "resume"_t,
    "résumé"_t,
    default_table,
    boost::text::collation_flags::ignore_accents);
assert(result == 0);

// Ignore case; accents are still considered of course.
result = boost::text::collate(
    "résumé"_t,
    "Résumé"_t,
    default_table,
    boost::text::collation_flags::ignore_case);
assert(result == 0);

result = boost::text::collate(
    "résumé"_t,
    "Resume"_t,
    default_table,
    boost::text::collation_flags::ignore_case);
assert(result > 0);

// Say you want to put one case explicitly before or after the other:
result = boost::text::collate(
    "Resume"_t,
    "resume"_t,
    default_table,
    boost::text::collation_flags::upper_case_first);
assert(result < 0);

result = boost::text::collate(
    "Resume"_t,
    "resume"_t,
    default_table,
    boost::text::collation_flags::lower_case_first);
assert(result > 0);

// You can also completely ignore punctuation.
result = boost::text::collate(
    "ellipsis"_t,
    "ellips...is"_t,
    default_table,
    boost::text::collation_flags::ignore_punctuation);
assert(result == 0);
```
Important: The default Unicode collation algorithm requires the NFD normalization form, one of the least compact normalization forms. NFC is the most widely used normalization form (most European keyboards produce NFC, and the W3C recommends NFC for all web text). This means either keeping your data in the most interoperable normalization form (NFC) and paying the cost of normalization every time you collate, or keeping it in the less interoperable NFD and not having to renormalize when collating. FCC is very similar to NFC, except that it is less compact in a few cases. Boost.Text's collation implementation relies on the inputs being in either FCC or NFD, so the collation functions require inputs to be normalized to FCC or NFD. This happens automatically when using the collation API specific to the text layer types (…
One thing people use associative containers for is to make it easy to look for elements, while keeping the elements sorted. Let's see how that goes in the world of Unicode:
```cpp
boost::text::collation_table da_table = boost::text::tailored_collation_table(
    boost::text::data::da::standard_collation_tailoring());

boost::text::text const aarhus_old = "Århus";
boost::text::text const aarhus_new = "Aarhus";

std::multiset<boost::text::text> set1; // So far so good, ...
// set1.insert(aarhus_old);            // Nope!  There's no operator<.
```
Hm. Not well. Ok, we can fix this by making a callable to use for comparisons:
```cpp
struct text_cmp
{
    bool operator()(
        boost::text::text const & lhs,
        boost::text::text const & rhs) const noexcept
    {
        // Binary comparison of code point values.
        return std::lexicographical_compare(
            lhs.begin().base(), lhs.end().base(),
            rhs.begin().base(), rhs.end().base());
    }
};
```
Then we can make a set using that:
```cpp
std::multiset<boost::text::text, text_cmp> set2;
set2.insert(aarhus_old); // Yay!
set2.insert(aarhus_new);

// Prints "Aarhus Århus", which is not collation-correct.
for (auto const & t : set2) {
    std::cout << t << " ";
}
std::cout << "\n";
```
But wait, I want to do a bunch of stuff in Danish, and I care about collation order. In fact, I'm building a directory of old and new town names, and I want to print an index of them for end-users. Ah, I'll just collate the values instead of binary-comparing them.
First, my new callable:
```cpp
struct text_coll_cmp
{
    bool operator()(
        boost::text::text const & lhs,
        boost::text::text const & rhs) const noexcept
    {
        return boost::text::collate(lhs, rhs, table) < 0;
    }

    boost::text::collation_table table; // Cheap to copy.
};
```
And then the using code:
```cpp
std::multiset<boost::text::text, text_coll_cmp> set3(text_coll_cmp{da_table});
set3.insert(aarhus_old);
set3.insert(aarhus_new);

// Prints "Århus Aarhus".  Cool!
for (auto const & t : set3) {
    std::cout << t << " ";
}
std::cout << "\n";
```
The problem is that we're doing the very expensive operation of creating two sort keys for each comparison, and immediately throwing them away.
Note: Ordering sequences of code points creates expectations about their order that can be misleading and/or confusing. Some collation-incorrect sort orders are fine, because end-users will never see them; some sort orders must be collation-correct, because end-users eventually will see them. Boost.Text therefore does not allow implicit ordering of sequences of code points.
Let's look at how much overhead we've incurred by doing all this repetitive collation. Below is the result of running a pair of perf tests. The first test is insertion of N code point sequences into a `std::multiset` that just does binary comparison, like `set2` in the example above. The second inserts the same N sequences into a `std::multiset` that does collation and throws away the keys, like `set3` in the other example above.
Table 1.5. Code Point Binary Comparison Vs. Unretained Key Collation
| N | Code Point Binary Comparison | Full Collation | Code Point Binary Comparison / Full Collation |
|---|---|---|---|
| 16 | 4968 ns | 8680 ns | 0.572 |
| 64 | 53410 ns | 82759 ns | 0.645 |
| 512 | 791259 ns | 1232150 ns | 0.642 |
| 4096 | 9450019 ns | 14795779 ns | 0.639 |
| 8192 | 20675781 ns | 32193859 ns | 0.642 |
So the collation is a significant cost. But wait, isn't all that transcoding from UTF-8 to UTF-32 taking up a lot of time? We can do better than the code point binary comparison above by doing `char` binary comparison instead:
```cpp
struct text_cmp_2
{
    bool operator()(
        boost::text::text const & lhs,
        boost::text::text const & rhs) const noexcept
    {
        // Binary comparison of char values.
        return std::lexicographical_compare(
            lhs.begin().base().base(), lhs.end().base().base(),
            rhs.begin().base().base(), rhs.end().base().base());
    }
};
```
```cpp
std::multiset<boost::text::text, text_cmp_2> set4;
set4.insert(aarhus_old);
set4.insert(aarhus_new);

// Prints "Århus Aarhus", which is collation-correct, but only by accident.
// The UTF-8 representation happens to compare that way.
for (auto const & t : set4) {
    std::cout << t << " ";
}
std::cout << "\n";
```
And here are the two binary comparison approaches compared to each other:
Table 1.6. Code Point Binary Comparison Vs. `char` Binary Comparison (Tree Set)
| N | Code Point Binary Comparison | `char` Binary Comparison | Code Point Binary Comparison / `char` Binary Comparison |
|---|---|---|---|
| 16 | 4968 ns | 4743 ns | 1.05 |
| 64 | 53410 ns | 51537 ns | 1.04 |
| 512 | 791259 ns | 759093 ns | 1.04 |
| 4096 | 9450019 ns | 8932070 ns | 1.06 |
| 8192 | 20675781 ns | 19541320 ns | 1.06 |
Interestingly, getting rid of UTF-8 -> UTF-32 transcoding only amounts to a few percent.
The cost of generating a collation sort key that can be used in comparison after comparison is very large. It is roughly an order of magnitude slower than comparing two strings.
In all these tests, I've intentionally chosen to use multisets and multimaps. Maps and sets have an unfair advantage, or a bug, depending on your use case.
If you use a map instead of a multimap, you will not insert a value that has the same sort key as an element already in the map, even if that code point sequence itself is not in the map. That is, two different code point sequences can collate to the same sort key. If you want to consider two code point sequences equivalent when they collate the same, it is a feature, and a possibly large optimization, not to insert one of them. If you want to keep all unique code point sequences, regardless of their associated keys, this is a bug.
As with everything else associated with Unicode, it's complicated: you can shave a bit off the binary comparison by comparing the underlying `char`s directly, though it's not a large effect.
Associative containers are not the best choice when the key-type is a sequence of things, because comparing sequences is inherently expensive. Why not use the unordered associative containers then?
```cpp
boost::text::collation_table da_table = boost::text::tailored_collation_table(
    boost::text::data::da::standard_collation_tailoring());

boost::text::text const aarhus_old = "Århus";
boost::text::text const aarhus_new = "Aarhus";

// This works.
std::unordered_multiset<boost::text::text> set;
set.insert(aarhus_old);
set.insert(aarhus_new);
assert(set.size() == 2);

// So does this.
std::unordered_multimap<boost::text::text_sort_key, boost::text::text> map;
map.emplace(boost::text::collation_sort_key(aarhus_old, da_table), aarhus_old);
map.emplace(boost::text::collation_sort_key(aarhus_new, da_table), aarhus_new);
assert(map.size() == 2);

// In fact, hashing support is built in for all the text layer types.
std::unordered_multiset<boost::text::rope> rope_set;
rope_set.insert(boost::text::rope(aarhus_old));
rope_set.insert(boost::text::rope(aarhus_new));
assert(rope_set.size() == 2);

std::unordered_multiset<boost::text::rope_view> rope_view_set;
rope_view_set.insert(aarhus_old);
rope_view_set.insert(aarhus_new);
assert(rope_view_set.size() == 2);

std::unordered_multiset<boost::text::text_view> text_view_set;
text_view_set.insert(aarhus_old);
text_view_set.insert(aarhus_new);
assert(text_view_set.size() == 2);
```
Hashing is your friend. Hashing containers sidestep the question of misleading orderings ("Is this order collation-correct?") completely, because they are unordered.