Collection enumeration performance in Swift

Wed, 08 Nov 2023 07:10:08 +0000

Swift’s Collection and Sequence protocols provide two primary ways to enumerate (filter, map, reduce, etc.): functionally and imperatively. For example:

let result = data
    .filter { 0 != $0 }
    .map { $0 * $0 }
    .reduce(into: 0, &+=)

Or:

var result = 0

for value in data {
    if 0 != value {
        result &+= value * value
    }
}
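To try both snippets side by side, here is a minimal, self-contained sketch. The sample `data` array and the two function names are hypothetical (the snippets above do not show where `data` comes from), and the reduce step is spelled as an explicit closure purely for clarity.

```swift
// Hypothetical sample input; the snippets above do not show the contents of `data`.
let data = [3, 0, -2, 0, 5]

// Functional style: each step is a separate method call in a chain.
func sumOfSquaresFunctional(_ data: [Int]) -> Int {
    data
        .filter { 0 != $0 }
        .map { $0 * $0 }
        .reduce(into: 0) { $0 &+= $1 }
}

// Imperative style: a single loop filters, squares and sums.
func sumOfSquaresImperative(_ data: [Int]) -> Int {
    var result = 0

    for value in data {
        if 0 != value {
            result &+= value * value
        }
    }

    return result
}

print(sumOfSquaresFunctional(data))  // 38
print(sumOfSquaresImperative(data))  // 38
```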

Nominally these are equivalent – they’ll produce the same results for all correctly-implemented Collections and Sequences. So in principle which you use is purely a matter of stylistic preference.

But is it?

Do they actually perform equivalently?

Let’s examine an example that’s a little more involved than the above snippets, but still fundamentally pretty straightforward. The extra processing steps are to help distinguish any performance differences.

The pertinent parts are:

testData.next
    .filter { 0 != $0 }
    .map { $0.byteSwapped }
    .filter { ($0 & 0xff00) >> 8 < $0 & 0xff }
    .map { $0.leadingZeroBitCount }
    .filter { Int.bitWidth - 8 >= $0 }
    .reduce(into: 0, &+=)

And the imperative equivalent:

var result = 0

for value in testData.next {
    if 0 != value {
        let value = value.byteSwapped

        if (value & 0xff00) >> 8 < value & 0xff {
            let value = value.leadingZeroBitCount

            if Int.bitWidth - 8 >= value {
                result &+= value
            }
        }
    }
}

I’ve published the full source code, in case you’d like to review it further or run it yourself.

How does the performance compare?

On my iMac Pro (10-core Xeon W-2150B):

Dataset size	Functional (median)	Imperative (median)	Performance difference
0	234 ns	133 ns	1.67x
32 KiB	57 µs	16 µs	3.56x
1 MiB	1.7 ms	0.5 ms	3.36x
8 MiB	27 ms	4.2 ms	6.36x
32 MiB	147 ms	17 ms	8.65x

On my M2 MacBook Air:

Dataset size	Functional (median)	Imperative (median)	Performance difference
0	167 ns	83 ns	2.01x
32 KiB	37 µs	3.6 µs	10.20x
1 MiB	1,058 µs	112 µs	9.45x
8 MiB	12 ms	0.9 ms	13.23x
32 MiB	50 ms	3.7 ms	13.76x

The imperative version is many times faster! And the performance difference increases as the collection size increases. The functional version starts off not super terrible – at least on the same order of magnitude as the imperative version – but it tends rapidly towards being an order of magnitude slower.

Worse, the difference is much more pronounced on more modern CPUs, like Apple Silicon.

What’s going on?

There are a few compounding factors.

Smaller datasets are more likely to fit into CPU caches (the dataset sizes shown above were chosen to correspond to L1 / L1 / L2 / L3 / RAM, respectively, on my iMac Pro). Working on data in CPU caches is by nature faster – the lower-level the cache the better – and so helps hide inefficiencies.
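To put those sizes in terms of element counts, here is a rough back-of-the-envelope sketch. It assumes the elements are plain `Int` values (8 bytes each on 64-bit platforms), which the excerpts suggest but do not state outright; the variable names are illustrative only.

```swift
// Dataset sizes from the tables above, in bytes.
let datasetSizes = [0, 32 * 1024, 1024 * 1024, 8 * 1024 * 1024, 32 * 1024 * 1024]

// Assumption: each element is an Int, i.e. 8 bytes on 64-bit platforms.
let bytesPerElement = MemoryLayout<Int>.stride

// Number of elements that fit in each dataset size.
let elementCounts = datasetSizes.map { $0 / bytesPerElement }

print(elementCounts)
```

Under that assumption, on a 64-bit machine the counts are 0, 4,096, 131,072, 1,048,576 and 4,194,304 elements respectively.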

The functional version creates intermediary Arrays to store the intermediary results of every filter and map operation. This introduces malloc traffic, retains & releases, and writes to memory. The imperative version has none of that overhead – it simply reads every value in the collection once, performing the whole sequence of operations all in one go for each element, using only CPU registers (not so much as a function call, even!).
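One way to see those intermediate Arrays is to write the chain out with each step bound to a name. This is only an illustrative sketch: the function and variable names are hypothetical, and the input is assumed to be an `[Int]`. On an Array, `filter` and `map` each return a new Array, so every named value below is a separate buffer.

```swift
// The functional chain, with each intermediate result given a name.
func chainedWithNamedIntermediates(_ input: [Int]) -> Int {
    let nonZero: [Int] = input.filter { 0 != $0 }                           // new Array #1
    let swapped: [Int] = nonZero.map { $0.byteSwapped }                     // new Array #2
    let ordered: [Int] = swapped.filter { ($0 & 0xff00) >> 8 < $0 & 0xff }  // new Array #3
    let zeroCounts: [Int] = ordered.map { $0.leadingZeroBitCount }          // new Array #4
    let kept: [Int] = zeroCounts.filter { Int.bitWidth - 8 >= $0 }          // new Array #5
    return kept.reduce(into: 0) { $0 &+= $1 }
}

// The imperative equivalent: one pass over the input, no intermediate storage.
func singlePass(_ input: [Int]) -> Int {
    var result = 0

    for value in input {
        if 0 != value {
            let value = value.byteSwapped

            if (value & 0xff00) >> 8 < value & 0xff {
                let value = value.leadingZeroBitCount

                if Int.bitWidth - 8 >= value {
                    result &+= value
                }
            }
        }
    }

    return result
}

// Hypothetical sample input, built so that at least one value passes every filter.
let sample = [0, 1, -1, (0x0102).byteSwapped, (0x0201).byteSwapped]
print(chainedWithNamedIntermediates(sample) == singlePass(sample))  // true
```

Both functions compute the same value; the difference lies in how many times the data is written out to memory and read back along the way.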

Can you tell which approach results in way more memory use (and is a lot slower)?

Sidenote: Dataset load costs

Swift Sequence – Wade Tregaskis
