This section concerns itself with collation-aware searching. That is, given a string S and a pattern P to search for, the search operation uses collation comparison, not binary comparison, to find P within S.
You can search any range in a variety of ways, using the standard algorithms and the string algorithms in Boost.StringAlgo.
However, many of these algorithms do simple element equality using the default operator==, which is obviously not suitable for collation comparison. Some of them have overloads that take a comparator, though in most cases the algorithm expects the comparator to be stateless; this is also not suitable for collation comparison, which is inherently stateful.
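To make the problem concrete, here is a minimal sketch that uses only the standard library (no Boost.Text at all). The helper names binary_contains and ascii_caseless_contains are hypothetical, introduced just for this illustration. It shows two things: the default operator== treats two canonically equivalent spellings of Å as different, and a stateless per-element predicate can fold ASCII case but still compares exactly one element against one element.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: true if `pattern` occurs in `text` under the default
// element-wise operator== (that is, a binary comparison of UTF-8 bytes).
bool binary_contains(std::string const & text, std::string const & pattern)
{
    return std::search(text.begin(), text.end(),
                       pattern.begin(), pattern.end()) != text.end();
}

// Hypothetical helper: the same search, but with a stateless predicate that
// folds ASCII case. It still compares one element against one element, so it
// cannot match sequences of different lengths.
bool ascii_caseless_contains(std::string const & text,
                             std::string const & pattern)
{
    auto const same_letter = [](unsigned char a, unsigned char b) {
        return std::tolower(a) == std::tolower(b);
    };
    return std::search(text.begin(), text.end(),
                       pattern.begin(), pattern.end(),
                       same_letter) != text.end();
}

// "Århus" spelled with the precomposed U+00C5 (UTF-8 bytes C3 85).
std::string const precomposed = "\xC3\x85" "rhus";
// "Århus" spelled as 'A' followed by U+030A COMBINING RING ABOVE (CC 8A).
std::string const decomposed = "A" "\xCC\x8A" "rhus";
```

With these definitions, binary_contains(precomposed, decomposed) is false even though both strings render as "Århus". The case-folding predicate finds "what" inside "What's going on", but it does no better on the two spellings of Å, because no element-by-element predicate can equate two bytes with three.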
Boost.Text includes a collation-aware search API. The benefit of using the general-purpose collation mechanism for search is that all of the collation matching functionality comes with it: tailoring, plus the collation configuration options (whether to consider case, accents, and so on).
Our examples will all use this string to search within:
    boost::text::text const string = (char const *)
        u8"Århus changed the way they spell the name of their town, which has had "
        u8"the same name for centuries. What's going on in those city council "
        u8"meetings?";
Below is the simplest way to do a search using this API. Here, we use a convenience function that does a brute-force search (much like a call to the non-searcher overloads of std::search()):
    boost::text::collation_table default_table =
        boost::text::default_collation_table();

    boost::text::text const pattern = "What";
    auto const result =
        boost::text::collation_search(string, pattern, default_table);

    // Prints "Found 'What' at [100, 104): What".
    std::cout << "Found '" << pattern << "' at ["
              << std::distance(string.begin(), result.begin()) << ", "
              << std::distance(string.begin(), result.end())
              << "): " << boost::text::text(result) << "\n";
That convenience function is equivalent to the code below. Here, we're creating a searcher and passing it to collation_search(). The rest of the collation search API uses this interface.
    boost::text::collation_table default_table =
        boost::text::default_collation_table();

    boost::text::text const pattern = "What";
    auto const searcher =
        boost::text::make_simple_collation_searcher(pattern, default_table);
    auto const result = boost::text::collation_search(string, searcher);

    // Prints "Found 'What' at [100, 104): What".
    std::cout << "Found '" << pattern << "' at ["
              << std::distance(string.begin(), result.begin()) << ", "
              << std::distance(string.begin(), result.end())
              << "): " << boost::text::text(result) << "\n";
If we want to use a different searching algorithm, we can try that instead:
    boost::text::collation_table default_table =
        boost::text::default_collation_table();

    boost::text::text const pattern = "What";
    auto const searcher =
        boost::text::make_boyer_moore_horspool_collation_searcher(
            pattern, default_table);
    auto const result = boost::text::collation_search(string, searcher);

    // Prints "Found 'What' at [100, 104): What".
    std::cout << "Found '" << pattern << "' at ["
              << std::distance(string.begin(), result.begin()) << ", "
              << std::distance(string.begin(), result.end())
              << "): " << boost::text::text(result) << "\n";
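If the searcher pattern is unfamiliar, the C++17 standard library has the same shape: std::search() has a plain overload that takes a pattern, and an overload that takes a searcher object. The sketch below uses only the standard library and compares bytes, not collation elements; the function names find_plain, find_with_default_searcher and find_with_bmh are hypothetical. Each returns the byte offset of the first match, or text.size() if there is none.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>

// Non-searcher overload: a brute-force scan for `pattern`.
std::size_t find_plain(std::string const & text, std::string const & pattern)
{
    auto const it = std::search(text.begin(), text.end(),
                                pattern.begin(), pattern.end());
    return static_cast<std::size_t>(it - text.begin());
}

// The same brute-force scan, expressed as a searcher object.
std::size_t find_with_default_searcher(std::string const & text,
                                       std::string const & pattern)
{
    auto const it = std::search(
        text.begin(), text.end(),
        std::default_searcher(pattern.begin(), pattern.end()));
    return static_cast<std::size_t>(it - text.begin());
}

// A different algorithm behind the same interface: Boyer-Moore-Horspool.
std::size_t find_with_bmh(std::string const & text,
                          std::string const & pattern)
{
    auto const it = std::search(
        text.begin(), text.end(),
        std::boyer_moore_horspool_searcher(pattern.begin(), pattern.end()));
    return static_cast<std::size_t>(it - text.begin());
}
```

All three agree on where "What" starts in a UTF-8 copy of the example string. Note that these are byte offsets, so they will not line up with the grapheme offsets [100, 104) printed above: Å occupies more than one byte in UTF-8.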
And of course we can configure the search in the same way as we can configure collation; here we see a case-ignoring search:
    boost::text::collation_table default_table =
        boost::text::default_collation_table();

    boost::text::text const pattern = "what";
    auto const searcher =
        boost::text::make_boyer_moore_horspool_collation_searcher(
            pattern, default_table, boost::text::collation_flags::ignore_case);
    auto const result = boost::text::collation_search(string, searcher);

    // Prints "Found 'what' at [100, 104): What".
    std::cout << "Found '" << pattern << "' at ["
              << std::distance(string.begin(), result.begin()) << ", "
              << std::distance(string.begin(), result.end())
              << "): " << boost::text::text(result) << "\n";
One more example — one that uses a non-default collation:
    boost::text::collation_table da_table =
        boost::text::tailored_collation_table(
            boost::text::data::da::standard_collation_tailoring());

    boost::text::text const pattern = "Aarhus";
    assert(pattern.distance() == 6); // 6 graphemes.

    // The tailoring for Danish creates a tertiary-difference between Å and Aa;
    // this implies that they are the same at secondary and primary strengths. By
    // ignoring case, we ensure that we only use secondary strength or higher.
    auto const result = boost::text::collation_search(
        string, pattern, da_table, boost::text::collation_flags::ignore_case);

    // We found what we were looking for, but it is not what we started with.
    // Collation matches can be longer or shorter than the pattern matched, so we
    // return a range instead of an iterator from all the collation search
    // functions.
    assert(std::distance(result.begin(), result.end()) == 5); // 5 graphemes!

    // Prints "Found 'Aarhus' at [0, 5), but it looks like this: Århus".
    std::cout << "Found '" << pattern << "' at ["
              << std::distance(string.begin(), result.begin()) << ", "
              << std::distance(string.begin(), result.end())
              << "), but it looks like this: " << boost::text::text(result)
              << "\n";
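To see why a range is the right return type, consider a toy model of the same situation. This is a sketch under loud assumptions: toy_key and toy_search are hypothetical names, the "key" is a crude stand-in for a real collation key (it folds ASCII case and expands Å/å to "aa"), and the quadratic brute-force loop is not how Boost.Text searches. The point is only that, once matching happens on keys, the matched span of the text need not have the same length as the pattern.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Toy stand-in for a collation key: fold ASCII case, and expand
// U+00C5 (Å) and U+00E5 (å) to "aa". Not a real collation key.
std::u32string toy_key(std::u32string const & s)
{
    std::u32string key;
    for (char32_t c : s) {
        if (c == 0x00C5 || c == 0x00E5) {
            key += U"aa";
        } else if (U'A' <= c && c <= U'Z') {
            key += static_cast<char32_t>(c - U'A' + U'a');
        } else {
            key += c;
        }
    }
    return key;
}

// Returns the code point offsets [first, last) of the first span of `text`
// whose toy key equals the toy key of `pattern`, or {npos, npos} if there is
// no such span. An empty pattern matches the empty span at offset 0.
std::pair<std::size_t, std::size_t>
toy_search(std::u32string const & text, std::u32string const & pattern)
{
    std::u32string const pattern_key = toy_key(pattern);
    if (pattern_key.empty())
        return {0, 0};
    for (std::size_t first = 0; first < text.size(); ++first) {
        for (std::size_t last = first + 1; last <= text.size(); ++last) {
            std::u32string const key =
                toy_key(text.substr(first, last - first));
            if (key == pattern_key)
                return {first, last};
            if (key.size() > pattern_key.size())
                break; // Keys only grow; no longer span can match.
        }
    }
    return {std::u32string::npos, std::u32string::npos};
}
```

Searching U"Århus changed" for the six-code-point pattern U"Aarhus" yields the span [0, 5), and searching U"Visit Aarhus" for the five-code-point pattern U"Århus" yields [6, 12). A single iterator plus the pattern length could not describe either result.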
I've only shown the grapheme_range overloads here for brevity, but there are overloads that take a code_point_range or a pair of code_point_iters as well, just like the other APIs we've already seen.
Important | |
---|---|
The searching API does not require inputs aligned to grapheme boundaries. You can get correct but suspicious-seeming results when you try to match using code points that may be part of longer graphemes in the text being searched.
Consider the NFD string […]. If none of that made sense to you, don't worry. Just use the text layer types, which always deal in whole graphemes, and you'll never have to consider this effect. |
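As a rough illustration of what "not aligned to grapheme boundaries" means, here is a sketch that uses a plain binary code point search from the standard library, not the collation search API. The names is_combining_mark and toy_grapheme_aligned are hypothetical, and the boundary check is a deliberate oversimplification: it only looks at the Combining Diacritical Marks block (U+0300..U+036F), whereas real grapheme segmentation follows Unicode's rules and covers far more.

```cpp
#include <cstddef>
#include <string>

// Toy check: true for code points in the Combining Diacritical Marks block.
// Real grapheme breaking is much more involved than this.
bool is_combining_mark(char32_t c)
{
    return 0x0300 <= c && c <= 0x036F;
}

// Toy check: true if neither end of the span [first, last) of `text` falls
// immediately before a combining mark, that is, if the span does not split a
// base character from the marks attached to it.
bool toy_grapheme_aligned(std::u32string const & text,
                          std::size_t first,
                          std::size_t last)
{
    bool const starts_inside =
        first < text.size() && is_combining_mark(text[first]);
    bool const ends_inside =
        last < text.size() && is_combining_mark(text[last]);
    return !starts_inside && !ends_inside;
}
```

Take the NFD text U"cafe\u0301 au lait", in which "é" is the two code points 'e' and U+0301. A code point search for U"e" reports a match at offset 3, which is correct at the code point level, but toy_grapheme_aligned(text, 3, 4) is false: the match covers only the first half of the grapheme "é". A search for U"au" matches at offset 6, and that span is aligned.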