Searching

This section concerns itself with collation-aware searching. That is, given a string S and a pattern to search for P, the search operation uses collation comparison — not binary comparison — to find P within S.

You can search any range in a variety of ways, using the standard algorithms and the string algorithms in Boost.StringAlgo.

However, many of these algorithms do simple element equality using the default operator==, which is obviously not suitable for collation comparison. Some of them have overloads that take a comparator, though in most cases the algorithm expects the comparator to be stateless; this is also not suitable for collation comparison, which is inherently stateful.

Boost.Text includes a collation-aware search API. The benefit of using the general-purpose collation mechanism for search is that we get all the collation matching functionality included: tailoring plus the collation configury (consider case or not, accents or not, etc.).

Our examples will all use this string to search within:

boost::text::text const string =
    (char const *)
    u8"Århus changed the way they spell the name of their town, which has had "
    u8"the same name for centuries.  What's going on in those city council "
    u8"meetings?";

Below is the simplest way to do a search using this API. Here, we use a convenience function that does a brute-force search (much like a call to the non-searcher overloads of std::search()):

boost::text::collation_table default_table =
    boost::text::default_collation_table();

boost::text::text const pattern = "What";

auto const result =
    boost::text::collation_search(string, pattern, default_table);

// Prints "Found 'What' at [100, 104): What".
std::cout << "Found '" << pattern << "' at ["
          << std::distance(string.begin(), result.begin()) << ", "
          << std::distance(string.begin(), result.end())
          << "): " << boost::text::text(result) << "\n";

That convenience function is equivalent to the code below. Here, we're creating a searcher and passing it to collation_search(). The rest of the collation search API uses this interface.

boost::text::collation_table default_table =
    boost::text::default_collation_table();

boost::text::text const pattern = "What";

auto const searcher =
    boost::text::make_simple_collation_searcher(pattern, default_table);
auto const result = boost::text::collation_search(string, searcher);

// Prints "Found 'What' at [100, 104): What".
std::cout << "Found '" << pattern << "' at ["
          << std::distance(string.begin(), result.begin()) << ", "
          << std::distance(string.begin(), result.end())
          << "): " << boost::text::text(result) << "\n";

If we want to use a different searching algorithm, we can try that instead:

boost::text::collation_table default_table =
    boost::text::default_collation_table();

boost::text::text const pattern = "What";

auto const searcher = boost::text::make_boyer_moore_horspool_collation_searcher(
    pattern, default_table);
auto const result = boost::text::collation_search(string, searcher);

// Prints "Found 'What' at [100, 104): What".
std::cout << "Found '" << pattern << "' at ["
          << std::distance(string.begin(), result.begin()) << ", "
          << std::distance(string.begin(), result.end())
          << "): " << boost::text::text(result) << "\n";

And of course we can configure the search in the same way as we can configure collation; here we see a case-ignoring search:

boost::text::collation_table default_table =
    boost::text::default_collation_table();

boost::text::text const pattern = "what";

auto const searcher = boost::text::make_boyer_moore_horspool_collation_searcher(
    pattern, default_table, boost::text::collation_flags::ignore_case);
auto const result = boost::text::collation_search(string, searcher);

// Prints "Found 'what' at [100, 104): What".
std::cout << "Found '" << pattern << "' at ["
          << std::distance(string.begin(), result.begin()) << ", "
          << std::distance(string.begin(), result.end())
          << "): " << boost::text::text(result) << "\n";

One more example — one that uses a non-default collation:

boost::text::collation_table da_table = boost::text::tailored_collation_table(
    boost::text::data::da::standard_collation_tailoring());

boost::text::text const pattern = "Aarhus";
assert(pattern.distance() == 6); // 6 graphemes.

// The tailoring for Danish creates a tertiary-difference between Å and Aa;
// this implies that they are the same at secondary and primary strengths.  By
// ignoring case, we ensure that we only use secondary strength or higher.
auto const result = boost::text::collation_search(
    string, pattern, da_table, boost::text::collation_flags::ignore_case);

// We found what we were looking for, but it is not what we started with.
// Collation matches can be longer or shorter than the pattern matched, so we
// return a range instead of an iterator from all the collation search
// functions.
assert(std::distance(result.begin(), result.end()) == 5); // 5 graphemes!

// Prints "Found 'Aarhus' at [0, 5), but it looks like this: Århus".
std::cout << "Found '" << pattern << "' at ["
          << std::distance(string.begin(), result.begin()) << ", "
          << std::distance(string.begin(), result.end())
          << "), but it looks like this: " << boost::text::text(result) << "\n";

I've only shown the grapheme_range overloads here for brevity, but there are overloads that take a code_point_range or a pair of code_point_iters as well, just like the other APIs we've already seen.

Important

	Important
The searching API does not require inputs aligned to grapheme boundaries. You can get correct but suspicious-seeming results when you try to match using code points that may be part of longer graphemes in the text being searched. Consider the NFD string `"a\u0300\u0301"` — an `'a'` followed by a grave accent followed by an acute accent. This string will not always be matched correctly by pattern `"a\u0300"` (or `"a\u0301"`), due to the arbitrary ordering of combining marks with the same canonical combining class (CCC). This does not matter in the whole-grapheme case, which handles this correctly. It will only matter in code point level searches, because you may search for partial graphemes, as in `collation_search("a\u0300", "a\u0300\u0301")` for example. If none of that made sense to you, don't worry. Just use the text layer types, which always deal in whole graphemes, and you'll never have to consider this effect.

The searching API does not require inputs aligned to grapheme boundaries. You can get correct but suspicious-seeming results when you try to match using code points that may be part of longer graphemes in the text being searched.

Consider the NFD string "a\u0300\u0301" — an 'a' followed by a grave accent followed by an acute accent. This string will not always be matched correctly by pattern "a\u0300" (or "a\u0301"), due to the arbitrary ordering of combining marks with the same canonical combining class (CCC). This does not matter in the whole-grapheme case, which handles this correctly. It will only matter in code point level searches, because you may search for partial graphemes, as in collation_search("a\u0300", "a\u0300\u0301") for example.

If none of that made sense to you, don't worry. Just use the text layer types, which always deal in whole graphemes, and you'll never have to consider this effect.