Matching prefixes in Swift strings

How do you determine if one string starts with another, in Swift?

Surely that’s exactly what the hasPrefix(_:) method is for:

let greeting = "Hello"
let sentence = "hello, this is dog."


if sentence.hasPrefix(greeting) {
    print("Hi!  Nice to meet you!")
} else {
    print("No can haz etiquette?")
}

No can haz etiquette?

Wot?

The problem is that hasPrefix is not meant for general use with human text; it’s barely better than a byte-wise comparison. It only guarantees that it won’t be fooled by mere differences in Unicode encoding, which is a good start, but not remotely sufficient for general use.
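To make that one guarantee concrete, here's a small sketch (the strings are made up for illustration): the same accented letter encoded two different ways still matches, but a mere difference in case does not.

```swift
// "é" as a single code point (U+00E9).
let precomposed = "caf\u{E9} au lait"
// "e" followed by a combining acute accent (U+0301).
let decomposed = "cafe\u{301}"

// Canonically equivalent text matches, however it happens to be encoded.
let sameAccent = precomposed.hasPrefix(decomposed)

// A difference in case alone is enough to defeat it.
let differentCase = precomposed.hasPrefix("Caf\u{E9}")

print(sameAccent, differentCase)  // true false
```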

Let’s step back a bit, and first consider the slightly simpler case of just comparing two whole strings. We can worry about the prefix-matching aspect later.

NSString (from Foundation) provides Swift Strings with a variety of more powerful comparison methods, such as caseInsensitiveCompare(_:) (which is really just an alias for compare(_:options:range:locale:) with the caseInsensitive option).

import Foundation

let a = "hello"
let b = "Hello"

if .orderedSame == a.caseInsensitiveCompare(b) {
    print("Hello indeed.")
} else {
    print("Hmph… rude.")
}

Hello indeed.

So that works. For case sensitivity. But what about other situations?

let plain = "cafe"
let fancy = "café"

if .orderedSame == plain.caseInsensitiveCompare(fancy) {
    print("Right, either way it's a shop that sells coffee.")
} else {
    print("So… no coffee, then?")
}

So… no coffee, then?

Well, shit.

It may vary in other languages, but in English “café” is just an alternative spelling of “cafe”, and you almost always want to consider them equal. In fact, in English it’s basically never really required that you observe accents on letters – some are technically required, such as blasé, but English speakers are very blase about such things. Unlike e.g. Spanish, with n vs ñ, accented letters are not considered distinct letters in English.

But, letter accents may creep into English text anyway (just like spoken accents). Some people prefer them, for any of numerous reasons, like:

  • In proper nouns out of respect for the so-named.
  • To honour words’ roots in other languages.
  • For technical correctness of pronunciation.
  • Just aesthetically.

So you do need to support them, which means accepting and preserving them, but (usually) otherwise ignoring them.

Ah, but wait, the documentation for caseInsensitiveCompare(_:) has a footnote which surely addresses exactly this problem, albeit obliquely:

Important

When working with text that’s presented to the user, use the localizedCaseInsensitiveCompare(_:) method instead.

No worries – we’ll just use that instead:

let plain = "cafe"
let fancy = "café"

if .orderedSame == plain.localizedCaseInsensitiveCompare(fancy) {
    print("Finally!")
} else {
    print("Oh come on!")
}

Oh come on!

It turns out this mistake is made by most of the String / NSString methods of similar ilk. And the discrepancies are inscrutable – e.g. localizedStandardCompare(_:) doesn’t handle accents correctly but localizedStandardRange(of:) does.

Long story short, you need to base most (if not all) your string comparison on compare(_:options:range:locale:) or its sibling range(of:options:range:locale:), because the other string methods don’t work properly.

So, with compare(…) you can do e.g.:

let plain = "cafe"
let fancy = "Café"

if .orderedSame == plain.compare(fancy,
                                 options: [.caseInsensitive,
                                           .diacriticInsensitive],
                                 locale: .current) {
    print("Finally!")
} else {
    print("😀")
}

Finally!

But there are two other options you should almost always use, which are easy to overlook:

  • widthInsensitive.

    In all the Latin-alphabet languages of which I’m aware, there is no notion of “width” and therefore no issues with width [in]sensitivity. It seems it most often comes up in Japanese, where for historical reasons there are multiple versions of the same character that differ merely in their visual dimensions, e.g. “カ” and “ｶ”. They are semantically the exact same character, even more so than “a” is to “A”.

    Even if the locale uses a Latin alphabet, there may still be mixed character sets and languages in the text your app processes – e.g. someone writing mostly in English but including Japanese names.
  • numeric.

    There are more numeric systems than just the modern Arabic numerals as used in English. e.g. “٤٢” is 42 in Eastern Arabic. What matters is usually their meaning (i.e. numeric value), not their representation, just like the other factors we’ve already covered.
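As an aside, the numeric option is also what makes runs of digits compare by value rather than character by character, which is its more familiar use. A minimal sketch, with made-up strings:

```swift
import Foundation

let older = "Report 9"
let newer = "Report 10"

// Character by character, "1" sorts before "9", so "Report 10" lands first.
let literalOrder = newer.compare(older)

// With .numeric, the runs of digits are compared as the numbers 10 and 9.
let numericOrder = newer.compare(older, options: .numeric)

print(literalOrder == .orderedAscending,
      numericOrder == .orderedDescending)  // true true
```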

So, incorporating all that, the magic incantation required to correctly compare two pieces of human-written English text, in Swift, is:

let plain = "cafe カ 42"
let fancy = "Café ｶ ٤٢"

if .orderedSame == plain.compare(fancy,
                                 options: [.caseInsensitive,
                                           .diacriticInsensitive,
                                           .numeric,
                                           .widthInsensitive],
                                 locale: .current) {
    print("Actually equivalent.")
} else {
    print("Not equivalent")
}

Actually equivalent.
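Since that's a lot to type every time, you might wrap it up. Here's one way that could look; isLooselyEqual is a hypothetical name of my own choosing, not something Foundation provides, and the choice of defaults is just an assumption about what you'd usually want.

```swift
import Foundation

/// Hypothetical helper (not part of Foundation): compares two strings while
/// ignoring case, diacritics, width and numeral system.
func isLooselyEqual(_ lhs: String,
                    _ rhs: String,
                    locale: Locale? = Locale.current) -> Bool {
    return .orderedSame == lhs.compare(rhs,
                                       options: [.caseInsensitive,
                                                 .diacriticInsensitive,
                                                 .numeric,
                                                 .widthInsensitive],
                                       locale: locale)
}

print(isLooselyEqual("cafe", "Caf\u{E9}"))
print(isLooselyEqual("cafe", "cake"))
```

The locale is left as a parameter (defaulting to the current one) so that you can pass nil on the occasions you'd rather not have locale-specific behaviour.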

Note that the explicit locale argument may be important for some use-cases, for two reasons:

  • Generally it seems to turn on some – but not all – locale-appropriate options, in addition to any you specify explicitly.

    While that may be redundant when you’re explicitly turning them on anyway, it’s possible it will have additional effects that aren’t expressible with the options parameter. You’ll probably want those too, as if they exist they’ll be things like special handling of unusual cases and exceptions to the rules.
  • Sometimes it turns hidden options off, such as whether to consider superscripts & subscripts equivalent. This might be a reason to not use it sometimes, if you don’t like the end result.

It’s less clear, even just considering English, what the correct default behaviour is regarding superscripts and subscripts, or “baseline sensitivity”. It’s quite conceivable that a user might intend to match a superscript or subscript even though they entered a plain digit, because most people don’t know how to actually type superscripts and subscripts (it’s not easy on most computers, at least not without 3rd party utilities like Rocket, and practically impossible on mobile devices).

And plenty of programs – particularly those not written in Swift, that might not handle Unicode correctly even at the most basic levels – erroneously devolve superscripts and subscripts to plain digits, which ideally wouldn’t prevent subsequent tools from still working with them (e.g. still finding “but1” when looking for “but¹”).

Yet in some contexts the differences very much do matter – e.g. in mathematical notation, x² is very different to x₂.

For reference, here’s a breakdown of the behaviour of some key String / NSString methods (as tested in the en_AU locale), to help you decide what specific incantation you need in a given situation:

| Method | Case insensitive (“Hello” vs “hello”) | Diacritic insensitive (“cafe” vs “café”) | Width insensitive (“カ” vs “ｶ”) | Numerals insensitive (“42” vs “٤٢”) | Baseline insensitive (“but1” vs “but¹”) |
|---|---|---|---|---|---|
| == | ❌ | ❌ | ❌ | ❌ | ❌ |
| hasPrefix | ❌ | ❌ | ❌ | ❌ | ❌ |
| commonPrefix(…, options: .caseInsensitive) | ✅ | ❌ | ❌ | ❌ | ❌ |
| commonPrefix(…, options: .diacriticInsensitive) | ❌ | ✅ | ❌ | ❌ | ❌ |
| commonPrefix(…, options: .widthInsensitive) | ❌ | ❌ | ✅ | ❌ | ❌ |
| commonPrefix(…, options: .numeric) | ❌ | ❌ | ❌ | ❌ [1] | ✅ |
| localizedCompare | ❌ | ❌ | ✅ | ✅ | ❌ |
| localizedCaseInsensitiveCompare | ✅ | ❌ | ✅ | ✅ | ❌ |
| localizedStandardCompare | ❌ | ❌ | ❌ | ❌ | ❌ |
| localizedStandardRange(of:) | ✅ | ✅ | ❌ | ❌ | ❌ |
| compare(…, options: .caseInsensitive) | ✅ | ❌ | ❌ | ❌ | ❌ |
| compare(…, options: .caseInsensitive, locale: .current) | ✅ | ❌ | ✅ | ✅ | ✅ |
| compare(…, options: .diacriticInsensitive) | ❌ | ✅ | ❌ | ❌ | ❌ |
| compare(…, options: .diacriticInsensitive, locale: .current) | ❌ | ✅ | ✅ | ✅ | ✅ |
| compare(…, options: .widthInsensitive) | ❌ | ❌ | ✅ | ❌ | ❌ |
| compare(…, options: .widthInsensitive, locale: .current) | ❌ | ❌ | ✅ | ✅ | ✅ |
| compare(…, options: .numeric) | ❌ | ❌ | ❌ | ✅ | ✅ |
| compare(…, options: .numeric, locale: .current) | ❌ | ❌ | ✅ | ✅ | ❌ [2] |
| compare(…, options: [.caseInsensitive, .diacriticInsensitive, .numeric, .widthInsensitive]) | ✅ | ✅ | ✅ | ✅ | ✅ |
| compare(…, options: [.caseInsensitive, .diacriticInsensitive, .numeric, .widthInsensitive], locale: .current) | ✅ | ✅ | ✅ | ✅ | ❌ |

Now, the real challenge is making code that works across all locales. In a nutshell, that’s practically impossible with Swift’s standard libraries today – they just don’t support it. To do it right, you’d have to determine what the appropriate comparison options are for every possible locale, manually, and bundle that database with your app.

But, given it’s usually better anyway to err on the side of matching rather than not matching, you can get pretty far by just assuming insensitivity to the above five factors.

Even in cases where this does cause mistakes – e.g. conflating “Maßen” (in moderation) with “Massen” (en masse) in German – it’s potentially unavoidable without deeper, context-specific knowledge anyway (since “ß” is normally equivalent to “ss” in German, just not regarding those specific two words – you can read more about this unpleasant situation on e.g. Wikipedia).

So, back to prefixes…

Fortunately, compare(_:options:range:locale:) and range(of:options:range:locale:) have a couple of additional options which make them easier to apply to situations other than just comparing whole strings.

Checking for a specific prefix

There is an anchored option which is perfect for this – it restricts the match to the start of the receiving string. e.g.:

let reaction = "😁👍"

if nil != reaction.range(of: "😁",
                         options: [.anchored,
                                   .caseInsensitive,
                                   .diacriticInsensitive,
                                   .numeric,
                                   .widthInsensitive],
                         locale: .current) {
    print("Happy!")
} else {
    print("Sad¡")
}

Happy!

Note that you must use the range(of:…) variant, not compare(…), because the latter essentially requires that the two strings fully match, not merely that one is a prefix of the other (more on that later, in case you’re not convinced).
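Applied to the opening example, that might look like the sketch below. hasLoosePrefix is a hypothetical helper of my own naming, not a Foundation API. One wrinkle worth handling explicitly: range(of:…) returns nil for an empty search string, whereas hasPrefix treats the empty string as a prefix of everything.

```swift
import Foundation

/// Hypothetical helper (not part of Foundation): like hasPrefix, but ignoring
/// case, diacritics and width.
func hasLoosePrefix(_ text: String,
                    _ prefix: String,
                    locale: Locale? = Locale.current) -> Bool {
    // range(of:) never finds an empty string, so mirror hasPrefix's
    // behaviour for that case by hand.
    if prefix.isEmpty { return true }

    return nil != text.range(of: prefix,
                             options: [.anchored,
                                       .caseInsensitive,
                                       .diacriticInsensitive,
                                       .numeric,
                                       .widthInsensitive],
                             locale: locale)
}

let greeting = "Hello"
let sentence = "hello, this is dog."

if hasLoosePrefix(sentence, greeting) {
    print("Hi!  Nice to meet you!")
} else {
    print("No can haz etiquette?")
}
```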

Finding common prefixes

Fortunately, there’s a convenience method for exactly this, commonPrefix(with:options:):

let happy = "😁👍"
let party = "😁🎉"

print("Similarities:", happy.commonPrefix(with: party,
                                          options: [.caseInsensitive,
                                                    .diacriticInsensitive,
                                                    .numeric,
                                                    .widthInsensitive]))

Similarities: 😁

Do not use this if you merely want to see if they share a specific prefix, because:

  • It’s more efficient to just check that directly on each string separately (rather than allocating and returning an intermediary string).
  • You still have to use compare(…) with the full set of options to check the result.

Note also that it does not have a locale parameter, so you cannot opt in to any system-default options defined for the current locale; you must explicitly specify every option you need.

⚠️ Beware: it doesn’t honour the numeric option.

Working with suffixes

You can of course reverse both strings and then compare what are now their prefixes, but this is expensive and awkward, since the result of String’s reversed() method is a ReversedCollection<String>, not a String or even a Substring, and it does not have the necessary comparison methods, so you have to convert it to a real String first.

Far easier and more efficient is to make use of the backwards option to range(of:…). e.g.:

let word = "doing"

if nil != word.range(of: "ing",
                     options: [.anchored,
                               .backwards,
                               .caseInsensitive,
                               .diacriticInsensitive,
                               .numeric,
                               .widthInsensitive],
                     locale: .current) {
    print("It's an 'ing' word.")
} else {
    print("Nyet.")
}

It's an 'ing' word.

Note how – conveniently – it does not require reversal of the argument string (“ing” in the above example).
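The returned range is useful for more than a yes/no answer, since it tells you exactly which part of the receiver matched. For instance, a sketch of stripping a suffix (strippingLooseSuffix is a hypothetical helper, named here purely for illustration):

```swift
import Foundation

/// Hypothetical helper (not part of Foundation): returns `word` minus
/// `suffix` if it loosely ends with that suffix, otherwise nil.
func strippingLooseSuffix(_ suffix: String,
                          from word: String,
                          locale: Locale? = Locale.current) -> Substring? {
    guard let match = word.range(of: suffix,
                                 options: [.anchored,
                                           .backwards,
                                           .caseInsensitive,
                                           .diacriticInsensitive,
                                           .numeric,
                                           .widthInsensitive],
                                 locale: locale) else {
        return nil
    }

    // The range indexes into `word`, so it can be used to slice it directly.
    return word[..<match.lowerBound]
}

if let stem = strippingLooseSuffix("ing", from: "doing") {
    print("Stem:", stem)
}
```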

Beware the range parameter

The compare(…) and range(of:…) methods also have a range parameter. This seems like a great idea – you can specify which specific subset of a string you care about, without having to actually break it out into a whole new String instance.

However, the range parameter is both a little unintuitive in its behaviour and fundamentally hard to use correctly.

On the first aspect, it’s critical to realise that it specifies the range within only the receiver (“happy” in the example below). It has no effect on the argument string (“party” in the example below). So you might innocently write the following:

let happy = "😁👍"
let party = "😁🎉"

if .orderedSame == happy.compare(party,
                                 options: [.caseInsensitive,
                                           .diacriticInsensitive,
                                           .numeric,
                                           .widthInsensitive],
                                 range: happy.startIndex ..< happy.index(after: happy.startIndex),
                                 locale: .current) {
    print("Grins all round.")
} else {
    print("…not happy?")
}

…not happy?

If you want to compare subsets of both strings, you need to explicitly slice the second string, e.g.:

let happy = "😁👍"
let party = "😁🎉"

if .orderedSame == happy.compare(party.prefix(1),
                                 options: [.caseInsensitive,
                                           .diacriticInsensitive,
                                           .numeric,
                                           .widthInsensitive],
                                 range: happy.startIndex ..< happy.index(after: happy.startIndex),
                                 locale: .current) {
    print("Grins all round.")
} else {
    print("…not happy?")
}

Grins all round.

Or, more simply:

let happy = "😁👍"
let party = "😁🎉"

if .orderedSame == happy.prefix(1).compare(party.prefix(1),
                                           options: [.caseInsensitive,
                                                     .diacriticInsensitive,
                                                     .numeric,
                                                     .widthInsensitive],
                                           locale: .current) {
    print("Grins all round.")
} else {
    print("…not happy?")
}

Grins all round.

But, you should rarely if ever actually do the above, because of the second aspect: slicing strings is actually really hard. Not technically, obviously, but if you want to do it correctly. The crux of the challenge is that two strings can have different lengths but still be equivalent (e.g. “ß” and “ss” in German), so slicing them independently is error-prone, unless you somehow account for the specific differences in their encoding. If you naively assume things like a specific length for a target string (e.g. the single character of “ß”) and apply that length to the input string, you might get incorrect results. e.g.:

let input = "Füssen"
let target = "Füß"

if .orderedSame == input.prefix(target.count).compare(target,
                                                      options: [.caseInsensitive,
                                                                .diacriticInsensitive,
                                                                .numeric,
                                                                .widthInsensitive],
                                                      locale: .current) {
    print("Something about feet.")
} else {
    print("Nothing afoot.")
}

Nothing afoot.

(in case you don’t speak German, that’s the wrong result logically – Füß is a prefix of Füssen)
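The way to sidestep the slicing is to not slice at all: hand both whole strings to an anchored range(of:…) and let it report how much of the input matched, as in the prefix section earlier. The sketch below shows the shape of that. I've deliberately not written down an expected result, since whether “ß” and “ss” are treated as equivalent depends on how Foundation handles these options and this locale (the de_DE identifier is my assumption for the example).

```swift
import Foundation

let input = "F\u{FC}ssen"     // "Füssen"
let target = "F\u{FC}\u{DF}"  // "Füß"

let options: String.CompareOptions = [.anchored,
                                      .caseInsensitive,
                                      .diacriticInsensitive,
                                      .numeric,
                                      .widthInsensitive]

// No assumptions about lengths: the search decides where the match ends.
let match = input.range(of: target,
                        options: options,
                        locale: Locale(identifier: "de_DE"))

if let found = match {
    // The range belongs to `input`, so its length may differ from target.count.
    print("Matched prefix:", input[found])
} else {
    print("No match with these options.")
}
```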

  1. Yes, really. I have no idea why commonPrefix(with:options:) doesn’t work correctly with the numeric option, given it presumably uses compare(_:options:range:locale:) or range(of:options:range:locale:) under the hood. Possibly some bad interaction with locale-specific settings, given it doesn’t let you specify the locale and it doesn’t document what it hard-codes it to.
  2. Yes, really. I don’t know why using the current locale (en_AU in this case) turns off baseline insensitivity when the numeric option is used, whereas it turns it on otherwise. Seems like a bug in Apple’s framework.
