Benchmarked – Wade Tregaskis

URLSession performance for reading a byte stream

Fri, 03 May 2024 23:52:00 +0000

What’s the best way to read a stream of bytes with URLSession? That’s the simple question I set out to answer. I wrote some benchmarks. They read a 128 MiB file and perform a contrived aggregation of its content bytes (a joking “hash” of them, merely to ensure the actual reads aren’t optimised out).

⚠️ In a nutshell, the results here demonstrate the best-case performance for each of the methods evaluated. These benchmarks are very simple, which makes them relatively easy for the Swift compiler to optimise well. In less trivial, real-world code, the optimiser might not do so great. So these benchmarks and their results are merely one collective data point in the bigger picture of just how the heck to read files efficiently.

There are two key decisions you must make: which specific URLSession API to use, and how to access the bytes themselves.
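To make those two decisions concrete, here is a minimal sketch, not the benchmark code itself. It assumes the "hash" is a simple wrapping sum; `contrivedHash` and `hashOfResource` are hypothetical names introduced only for illustration. The API choice here is `bytes(from:)`, and the access choice is a plain for loop.

```swift
import Foundation

/// Hypothetical stand-in for the post's joking "hash": a wrapping sum of every byte.
/// Works with any async sequence of bytes, including `URLSession.AsyncBytes`.
func contrivedHash<Bytes: AsyncSequence>(of bytes: Bytes) async throws -> UInt64
    where Bytes.Element == UInt8
{
    var hash: UInt64 = 0
    for try await byte in bytes {
        hash &+= UInt64(byte)
    }
    return hash
}

#if canImport(Darwin)
/// API choice: `bytes(from:)`. Access choice: the `for` loop inside `contrivedHash`.
@available(macOS 12.0, iOS 15.0, tvOS 15.0, watchOS 8.0, *)
func hashOfResource(at url: URL) async throws -> UInt64 {
    let (bytes, _) = try await URLSession.shared.bytes(from: url)
    return try await contrivedHash(of: bytes)
}
#endif
```

Because `contrivedHash` is generic, the same loop can be exercised against any byte source, which makes it easy to swap the URLSession API underneath without touching the access code.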

Measurements

Each benchmark was run a hundred times or for 30 seconds (whichever limit was hit first). I’m highlighting here just the medians (in general there wasn’t much variation anyway), but you can dig into the other percentiles & metrics via the disclosure triangles, if you like.

I’m pretty sure the reads were all served out of the kernel’s in-memory file system cache, judging by the lack of SSD read I/O reported by iStat Menus. But I didn’t go out of my way to verify this.

⚠️ “Peak RAM” is as reported by the Benchmark package, based on (if I understand it correctly) periodic sampling of the process RSS. As such it’s not necessarily completely accurate, due to the potential to miss brief peaks.

M2 MacBook Air

Method	Wall time (ms)	CPU time (ms)	Throughput (MiB/s)	Peak RAM (MB)
`bytes(from:)` and for loop	79	138	1,620	91
`bytes(from:)` and `reduce`	79	138	1,620	84
`data(from:)` and for loop	605	641	212	265
`data(from:)` and for loop inside `withUnsafeBytes`	60	95	2,133	338
`data(from:)` and `forEach`	765	800	167	262
`data(from:)` and `reduce`	750	784	171	290
`dataTask(with:)` and an incremental delegate with for loop	560	617	229	53
`dataTask(with:)` and an incremental delegate with for loop inside `withUnsafeBytes`	36	75	3,556	26
`dataTask(with:)` and an incremental delegate with `forEach`	719	775	178	51
`dataTask(with:)` and an incremental delegate with `reduce`	709	765	167	45
`dataTask(with:completionHandler:)` and for loop	590	630	217	442
`dataTask(with:completionHandler:)` and for loop inside `withUnsafeBytes`	57	98	2,246	504
`dataTask(with:completionHandler:)` and `forEach`	742	783	173	470
`dataTask(with:completionHandler:)` and `reduce`	742	786	173	485
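For readers unfamiliar with the access styles named in the table, the following is a rough sketch of three of them applied to a `Data` value. It is an illustration under assumptions, not the benchmarked code: the function names and the wrapping-sum "hash" are made up here.

```swift
import Foundation

/// Plain `for` loop over Data's own iterator.
func hashWithForLoop(_ data: Data) -> UInt64 {
    var hash: UInt64 = 0
    for byte in data {
        hash &+= UInt64(byte)
    }
    return hash
}

/// Functional style via `reduce(into:)`.
func hashWithReduce(_ data: Data) -> UInt64 {
    data.reduce(into: UInt64(0)) { $0 &+= UInt64($1) }
}

/// `for` loop inside `withUnsafeBytes`, iterating the raw buffer directly.
func hashWithUnsafeBytes(_ data: Data) -> UInt64 {
    data.withUnsafeBytes { (buffer: UnsafeRawBufferPointer) -> UInt64 in
        var hash: UInt64 = 0
        for byte in buffer {
            hash &+= UInt64(byte)
        }
        return hash
    }
}

// 1,000 bytes cycling through 0...255.
let sample = Data((0..<1_000).map { UInt8(truncatingIfNeeded: $0) })
print(hashWithForLoop(sample), hashWithReduce(sample), hashWithUnsafeBytes(sample))
```

All three compute the same value; the table suggests the difference lies purely in how much iteration overhead the optimiser manages to remove.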


Swift’s native Clocks are very inefficient

Fri, 03 May 2024 02:10:07 +0000

By which I mean, things like ContinuousClock and SuspendingClock.

In absolute terms they don’t have much overhead – think sub-microsecond for most uses. Which makes them perfectly acceptable when they’re used sporadically (e.g. only a few times per second).

However, if you need to deal with time and timing more frequently, their inefficiency can become a serious bottleneck.

I stumbled into this because of a fairly common and otherwise uninteresting pattern – throttling UI updates on an I/O operation’s progress. This might look something like:

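import SwiftUI

// Generic over the byte source: any AsyncSequence whose elements are UInt8.
struct Example<Bytes: AsyncSequence>: View where Bytes.Element == UInt8 {
    let bytes: Bytes

    // Int64, since the byte-count format style formats Int64 values.
    @State var byteCount: Int64 = 0

    var body: some View {
        Text("Bytes so far: \(byteCount.formatted(.byteCount(style: .binary)))")
            .task {
                var unpostedByteCount: Int64 = 0
                let clock = ContinuousClock()
                var lastUpdate = clock.now

                do {
                    for try await byte in bytes {
                        … // Do something with the byte.

                        unpostedByteCount += 1

                        let now = clock.now
                        let delta = now - lastUpdate

                        // Post at least once a second, or every 100 ms
                        // when a megabyte or more is waiting to be posted.
                        if delta > .seconds(1)
                            || (delta > .milliseconds(100) && 1_000_000 <= unpostedByteCount) {
                            byteCount += unpostedByteCount
                            unpostedByteCount = 0
                            lastUpdate = now
                        }
                    }
                } catch {
                    // The byte stream failed; report or handle the error as appropriate.
                }
            }
    }
}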

☝️ This isn’t a complete implementation, as it won’t update the byte count if the download stalls (since the lack of incoming bytes will mean no iteration on the loop, and therefore no updates even if a full second passes). But it’s sufficient for demonstration purposes here.

🖐️ Why didn’t I just use throttle from swift-async-algorithms? I did, at first, and quickly discovered that its performance is horrible. While I do suspect I can ‘optimise’ it to not be atrocious, I haven’t pursued that as it was easier to just write my own throttling system.

The above seems fairly straightforward, but if you run it and have any non-trivial I/O rate – even just a few hundred kilobytes per second – you’ll find that it saturates an entire CPU core, not just wasting CPU time but limiting the I/O rate severely.

Using a SuspendingClock makes no difference.

In a nutshell, the problem is that Swift’s Clock protocol has significant overheads by design¹. If you look at a time profile of code like this, you’ll see things like:

That’s a lot of time wasted in function calls and struct initialisation and type conversion and protocol witnesses and all that guff. The only part that’s actually retrieving the time is the swift_get_time call (which is just a wrapper over clock_gettime, which is just a wrapper over clock_gettime_nsec_np(CLOCK_UPTIME_RAW), which is just a wrapper over mach_absolute_time).

I wrote some simple benchmarks of various alternative time-tracking methods. Here are the results with Swift 5.10 (showing the median runtime of each benchmark, which is a million iterations of checking the time):

Method	10-core iMac Pro	M2 MacBook Air
`ContinuousClock`	429 ms	258 ms
`SuspendingClock`	430 ms	247 ms
`Date`	30 ms	19 ms
`clock_gettime_nsec_np(CLOCK_MONOTONIC_RAW)`	32 ms	10 ms
`clock_gettime_nsec_np(CLOCK_UPTIME_RAW)`	27 ms	10 ms
`gettimeofday`	24 ms	12 ms
`mach_absolute_time`	15 ms	6 ms

All these alternative methods are well over an order of magnitude faster than Swift’s native clock APIs, showing just how dreadfully inefficient the Swift Clock API is.
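To give a feel for what "a million iterations of checking the time" means, here is a hand-rolled sketch. It is not the benchmark harness behind the table (the function name and structure are assumptions for illustration), and a single untuned loop like this will be noisier than a proper benchmark run.

```swift
import Foundation

/// Reads each clock `iterations` times and returns the elapsed seconds for each,
/// plus a counter that keeps the clock reads from being optimised away.
func timeClockReads(iterations: Int) -> (continuous: TimeInterval, date: TimeInterval, reads: Int) {
    var reads = 0

    // ContinuousClock: one `now` per iteration.
    let clock = ContinuousClock()
    let firstInstant = clock.now
    var start = Date()
    for _ in 0..<iterations {
        if clock.now >= firstInstant { reads += 1 }
    }
    let continuousSeconds = Date().timeIntervalSince(start)

    // Date: one `Date()` per iteration.
    start = Date()
    for _ in 0..<iterations {
        if Date().timeIntervalSinceReferenceDate > 0 { reads += 1 }
    }
    let dateSeconds = Date().timeIntervalSince(start)

    return (continuous: continuousSeconds, date: dateSeconds, reads: reads)
}

let result = timeClockReads(iterations: 1_000_000)
print("ContinuousClock: \(result.continuous * 1_000) ms, Date: \(result.date * 1_000) ms")
```

Build with optimisations enabled when trying this, since unoptimised builds exaggerate the overhead of both clocks.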

mach_absolute_time for the win

Unsurprisingly, mach_absolute_time is the fastest. It is what all these other APIs are actually based on; it is the lowest level of the time stack.

The downside to calling mach_absolute_time directly, though, is that it’s on Apple’s “naughty” list – apparently it’s been abused for device fingerprinting, so Apple require you to beg for special permission if you want to use it (even though it’s used by all these other APIs anyway, as the basis for their implementations, and there’s nothing you can get from mach_absolute_time that you can’t get from them too 🤨).

`Date` surprisingly not bad

I was quite surprised to see good ol’ Date performing competitively with the traditional C-level APIs, at least on x86-64. Even on arm64 it’s not bad, at still a third to half the speed of the C APIs. This surprised me because ~~it has the overhead of at least one Objective-C message send (for timeIntervalSinceNow), unless somehow the Swift compiler is optimising that into a static function call, or inlining it entirely…?~~

Update: I later looked at the disassembly, and found no message sends, only a plain function call to Foundation.Date.timeIntervalSinceNow.getter (which is only 40 instructions, on arm64, over clock_gettime and __stack_chk_fail – and the former is hundreds of instructions, so it’s adding relatively little overhead to the C API).

This isn’t being done by the compiler, it’s because that’s actually how it’s implemented in Foundation. I keep forgetting that Foundation from Swift is no longer just the old Objective-C Foundation, but rather mostly the new Foundation that’s written in native Swift. So these performance results likely don’t apply once you go back far enough in Apple OS releases (to when Swift really was calling into the Objective-C code for NSDate) – but it’s safe to rely on good Date performance now and in future.

I certainly wouldn’t be afraid to use Date broadly, going down to lower APIs only when truly necessary – which is pretty rarely, I’d wager; we’re talking a mere 19 to 30 nanoseconds to get the time elapsed since a reference date and compare it to a threshold. If that’s too slow, it might be an indication that there’s a bigger problem (like transferring data a single byte at a time, as in the example that started this post – but more on that in the next post).
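As an illustration of leaning on Date for this, the throttling rule from the example at the top of this post (update after a second, or after 100 ms once a megabyte is pending) could be sketched as below. `ProgressThrottle` is a hypothetical helper, not something from the post's code; the `now` parameters exist only so the logic can be exercised with fixed times.

```swift
import Foundation

/// Hypothetical helper: decides when accumulated byte counts should be published.
struct ProgressThrottle {
    private var lastUpdate: Date
    private var unposted = 0

    init(now: Date = Date()) {
        lastUpdate = now
    }

    /// Records `count` new bytes. Returns the batch to publish if an update is due,
    /// or nil if the caller should keep accumulating.
    mutating func record(_ count: Int, now: Date = Date()) -> Int? {
        unposted += count
        let delta = now.timeIntervalSince(lastUpdate)

        // Same rule as the ContinuousClock version: more than 1 s elapsed,
        // or more than 100 ms elapsed with at least 1,000,000 bytes pending.
        guard delta > 1 || (delta > 0.1 && unposted >= 1_000_000) else {
            return nil
        }

        let batch = unposted
        unposted = 0
        lastUpdate = now
        return batch
    }
}
```

Inside the byte loop this would be used as `if let batch = throttle.record(1) { byteCount += batch }`, with the same stall caveat as the original example.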

Follow-up

This post got some attention on HackerNews. Pleasingly, the comments there were almost all well-intentioned and interesting. It’s a bit beyond me to try to address all of them, but a few in particular raised good points that I would like to answer / clarify:

A lot of folks were curious about mach_absolute_time being on Apple’s naughty list. I don’t know for sure why it is either, but I think it’s very likely that it’s primarily because it essentially provides a reference time point, that’s very precise and pretty unique between computers. It’s not the boot time necessarily – because the timer pauses whenever the system is put to sleep – but even so it provides a simple way to nearly if not exactly identify an individual machine session (between boots & sleeps). It probably wouldn’t take many other fingerprinting data points to reliably pin-point a specific machine.

Secondarily, because it provides very precise timing capabilities (e.g. nanosecond-resolution on x86), it could possibly be a key component of timing attacks and broader device fingerprinting based on timing information (e.g. measuring how long it takes to perform an otherwise innocuous operation).

That all said, the only difference between it and some of the higher-level APIs wrapping it is their overhead. And it’s not apparent to me that merely making the “get-time” functionality 2x slower is going to magically mitigate all the above concerns, especially when we’re still talking just a few nanoseconds.

Admittedly my phrasing regarding Apple’s policies on mach_absolute_time – “beg for permission to use it” – is a little melodramatic. It’s revealing something of my personal opinions on certain Apple “security” practices. I love that Apple genuinely care about protecting everyone’s privacy, but sometimes I chafe at what feels like capricious or impractical specific policies.

In this particular case, it’s not apparent to me why this sort of protection is needed for native apps. In a web browser, sure, you’re running untrustworthy, essentially arbitrary code from all over the place, a lot of which is openly malicious (thanks, Google & Facebook, for your pervasive trackers – fuck you too). But a native app – or heck, even a dodgy non-native one like an Electron app – must be explicitly installed by the end user, among other barriers like code signing.

A few folks looked at the example case, of iterating a single byte at a time, and were suspicious of how performant that could possibly be anyway. This is a very fair reaction – it’s my ingrained instinct as well, from years of C/C++/Objective-C – but it’s relying on a few outdated assumptions. My next post already covered this for the most part, but in short here:

Through inlining, that code basically optimises down to an outer loop that fetches a new chunk of data (a pointer & length) plus an inner loop to iterate over that as direct memory access. The chunks are typically tens of kilobytes to megabytes, in my experience (depending on the source, e.g. network vs local storage, and the buffer sizes chosen by Apple’s framework code). So it actually is quite performant and essentially what you’d conventionally write in a file descriptor read loop. If and when it happens to optimise correctly. That’s the major caveat – sometimes the Swift compiler fails to properly optimise code like this, and then indeed the performance can really suck. But for simple cases like in this post’s example code, the optimiser has no trouble with it.

Similarly, a few folks questioned the need to check the clock on every byte, as in the example. That’s a valid critique of this sort of code in many contexts, and I concur that where possible one should try to be smarter about such things – i.e. work with sequences of chunks of bytes rather than sequences of individual bytes. With URLSession you can do exactly that, and it is indeed faster. But you can get acceptable real-world performance with this code, even in high-throughput cases, and it’s relatively simple and intuitive to write, so it’s not uncommon or necessarily unreasonable.

In addition, sometimes you’re at the mercy of the APIs available – e.g. sometimes you can only get an AsyncSequence. If you don’t care about complete accuracy, you can do things like only considering UI updates every N bytes. You’ll save CPU time and nobody will notice the difference for small enough N on a fast enough iteration, but if those prerequisites aren’t met you might read e.g. N-1 bytes and then hit a long pause, during which time you have the extra N-1 bytes in hand but you’re not showing as such in your UI.

Some folks noted that there are a lot of other clock APIs from Apple’s frameworks, like DispatchTime and CACurrentMediaTime. I didn’t include those in the benchmark because I just didn’t think of them at the time. If anyone wants to send me a pull request adding them to the code, I’d be very happy to accept it.

I haven’t checked all those other APIs specifically, but I can pretty much guarantee they’re all built on mach_absolute_time too (possibly via one or more of the other C APIs already covered in this post). In fact those two examples just mentioned are explicitly documented as using mach_absolute_time.

Kallikrates quietly pointed to a very interesting recent change in Apple’s Swift standard library code, “Make static [milli/micro/nano]seconds members on Duration inlinable”. Together with another patch, it seems very specifically aimed at eliminating some of the absurd overhead in Swift’s ContinuousClock & SuspendingClock implementations. The timing is a bit interesting – I don’t know if they were prompted by this post, but it’d be an unlikely coincidence otherwise.

In any case, I suspect it is possible to eliminate the overheads – there’s no apparent reason why they can’t be at least as efficient as Date already is – and so I hope that is what’s happening. Hopefully I’ll be able to re-run these benchmarks in a few months, with Swift 6, and see the performance gap eliminated. 🤞
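Returning to the "only consider UI updates every N bytes" idea from the follow-up points above, a minimal sketch might look like the following. The function name, the 100 ms threshold and the plain counting loop (standing in for `for try await byte in bytes`) are all assumptions made for illustration.

```swift
import Foundation

/// Processes `byteCount` bytes, reading the clock only once every `n` bytes.
/// Returns how many times the clock was actually read.
func processBytes(_ byteCount: Int, checkingClockEvery n: Int, onUpdate: (Int) -> Void) -> Int {
    var clockReads = 0
    var bytesSeen = 0
    var unposted = 0
    var lastUpdate = Date()

    for _ in 0..<byteCount {  // Stand-in for the real async byte loop.
        bytesSeen += 1
        unposted += 1

        // Skip the clock entirely for all but every n-th byte.
        guard bytesSeen % n == 0 else { continue }

        clockReads += 1
        let now = Date()
        if now.timeIntervalSince(lastUpdate) > 0.1 {
            onUpdate(unposted)
            unposted = 0
            lastUpdate = now
        }
    }

    // Caveat from the text: up to n - 1 trailing bytes (plus anything not yet
    // due for an update) remain unposted here until more data arrives.
    return clockReads
}

let reads = processBytes(10_000, checkingClockEvery: 1_024) { _ in }
print("Clock reads:", reads)
```

With N = 1,024 the clock is read roughly a thousand times less often, at the cost of the staleness described above when the stream pauses mid-stride.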

¹ One might quibble with the “by design” assertion. What I mean is that because it uses a protocol it’s susceptible to significant overheads – as is seen in these benchmarks – and because its internal implementation (a private _Int128 type, inside the standard library) is kept hidden, it limits the compiler’s ability to inline, which is in turn critical to eliminating what’s technically a lot of boilerplate. In contrast, if it were simply a struct using only public types internally, it would have avoided most of these overheads and been more amenable to inlining.

It’s not an irredeemable design (I think) – and that’s what the recent patches seem to be banking on, by tweaking the design in order to allow inlining and thus hopefully eliminate almost all the overhead.

Collection enumeration performance in Swift

Wed, 08 Nov 2023 07:10:08 +0000

Swift’s Collection and Sequence protocols provide two primary ways to enumerate (filter, map, reduce, etc): functionally and imperatively. For example:

let result = data
    .filter { 0 != $0 }
    .map { $0 * $0 }
    .reduce(into: 0, &+=)

Or:

var result = 0

for value in data {
    if 0 != value {
        result &+= value * value
    }
}

Nominally these are equivalent – they’ll produce the same results for all correctly-implemented Collections and Sequences. So in principle which you use is purely a matter of stylistic preference.

But is it?

Do they actually perform equivalently?

Let’s examine an example that’s a little more involved than the above snippets, but still fundamentally pretty straightforward. The extra processing steps are to help distinguish any performance differences.

The pertinent parts are:

testData.next
    .filter { 0 != $0 }
    .map { $0.byteSwapped }
    .filter { ($0 & 0xff00) >> 8 < $0 & 0xff }
    .map { $0.leadingZeroBitCount }
    .filter { Int.bitWidth - 8 >= $0 }
    .reduce(into: 0, &+=)

And the imperative equivalent:

var result = 0

for value in testData.next {
    if 0 != value {
        let value = value.byteSwapped

        if (value & 0xff00) >> 8 < value & 0xff {
            let value = value.leadingZeroBitCount

            if Int.bitWidth - 8 >= value {
                result &+= value
            }
        }
    }
}
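For anyone who wants to check the equivalence without the full benchmark source, the two excerpts can be wrapped into a small self-contained sketch like this. The function names and the sample dataset are made up for illustration; `testData.next` from the excerpts is replaced by a plain `[Int]` parameter.

```swift
/// Functional style: each filter/map step produces a new Array.
func functionalSum(_ data: [Int]) -> Int {
    data
        .filter { 0 != $0 }
        .map { $0.byteSwapped }
        .filter { ($0 & 0xff00) >> 8 < $0 & 0xff }
        .map { $0.leadingZeroBitCount }
        .filter { Int.bitWidth - 8 >= $0 }
        .reduce(into: 0) { $0 &+= $1 }
}

/// Imperative style: one pass, all steps applied per element.
func imperativeSum(_ data: [Int]) -> Int {
    var result = 0

    for value in data {
        if 0 != value {
            let value = value.byteSwapped

            if (value & 0xff00) >> 8 < value & 0xff {
                let value = value.leadingZeroBitCount

                if Int.bitWidth - 8 >= value {
                    result &+= value
                }
            }
        }
    }

    return result
}

// Arbitrary but deterministic sample values (includes a zero at index 0).
let sample: [Int] = (0..<10_000).map { $0 &* 0x0123_4567_89AB_CDEF }
print(functionalSum(sample) == imperativeSum(sample))
```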

I’ve published the full source code, in case you’d like to review it further or run it yourself.

How does the performance compare?

On my iMac Pro (10 cores (Xeon W-2150B)):

Dataset size	Functional (median)	Imperative (median)	Performance difference
0	234 ns	133 ns	1.67x
32 KiB	57 µs	16 µs	3.56x
1 MiB	1.7 ms	0.5 ms	3.36x
8 MiB	27 ms	4.2 ms	6.36x
32 MiB	147 ms	17 ms	8.65x

On my M2 MacBook Air:

Dataset size	Functional (median)	Imperative (median)	Performance difference
0	167 ns	83 ns	2.01x
32 KiB	37 µs	3.6 µs	10.20x
1 MiB	1,058 µs	112 µs	9.45x
8 MiB	12 ms	0.9 ms	13.23x
32 MiB	50 ms	3.7 ms	13.76x

The imperative version is many times faster! And the performance difference increases as the collection size increases. The functional version starts off not super terrible – at least on the same order of magnitude as the imperative version – but it tends rapidly towards being an order of magnitude slower.

Worse, the difference is much more pronounced on more modern CPUs, like Apple Silicon.

What’s going on?

There are a few compounding factors.

Smaller datasets are more likely to fit into CPU caches (the dataset sizes shown above were chosen to correspond to L1 / L1 / L2 / L3 / RAM, respectively, on my iMac Pro). Working on data in CPU caches is by nature faster – the lower-level the cache the better – and so helps hide inefficiencies.

The functional version creates intermediary Arrays to store the intermediary results of every filter and map operation. This introduces malloc traffic, retains & releases, and writes to memory. The imperative version has none of that overhead – it simply reads every value in the collection once, performing the whole sequence of operations all in one go for each element, using only CPU registers (not so much as a function call, even!).
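To make the intermediary Arrays visible, the simple pipeline from the start of this post can be written out one step at a time. This is only an illustrative sketch with made-up sample values; the eager `filter` and `map` on an Array each return a new Array, which is what the named constants below capture.

```swift
let data: [Int] = [0, 3, 0, 4, 5]

// Functional style, one step at a time: every step materialises a new Array.
let nonZero: [Int] = data.filter { 0 != $0 }  // [3, 4, 5]
let squares: [Int] = nonZero.map { $0 * $0 }  // [9, 16, 25]
let functional = squares.reduce(into: 0) { $0 &+= $1 }

// Imperative style: a single pass with no intermediate storage.
var imperative = 0
for value in data {
    if 0 != value {
        imperative &+= value * value
    }
}

print(functional, imperative)
```

On five elements the extra Arrays are irrelevant; at 32 MiB each one is a large allocation that has to be written and then read back.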

Can you tell which approach results in way more memory use (and is a lot slower)?

Sidenote: Dataset load costs

Z9 burst shooting buffer depth

Fri, 04 Feb 2022 03:16:55 +0000

Just some basic tests with the few cards I have.

	Lexar 2933x 128 GiB	ProGrade Gold 256 GiB	Pergear 512 GiB	Angelbird AV PRO 1 TiB
Type	XQD	CFExpress	CFExpress	CFExpress
20 FPS (lossless)	26 (11 – 37)	40 (34 – 43)	36 (36 – 37)	37 (37 – 37)
20 FPS (HE*)	60 (57 – 61)	60 (49 – 77)	60 (59 – 61)	62 (60 – 64)
20 FPS (HE)	75 (34 – 95)	85 (45 – 101)	100 (98 – 103)	104 (98 – 112)
30 FPS	196 (187 – 198)	183 (52 – 198)	192 (137 – 258)	192 (142 – 198)
120 FPS	706 (667 – 736)	706 (558 – 739)	737 (734 – 739)	736 (734 – 739)
Cost per GiB (Feb 2022)	$2.54	$1.13	$0.62	$0.57

Values shown are the average over all trials, with the worst & best individual results shown in parentheses.

Commentary

Surprisingly little performance difference

None of the cards tested are among the “known fastest” CFExpress cards, like the Delkin Blacks or ProGrade Cobalts. Nonetheless, I’m surprised at how minor the performance difference is between all of them, especially given there’s an XQD card in the mix.

CFExpress cards are not necessarily fast.

Angelbird AV PROs do not meet their promised performance

The Angelbird card claims a 1,000 MB/s minimum, sustained write speed. The XQD format is incapable of speeds above 500 MB/s. Yet the Angelbird is at best just 40% faster than the XQD Lexar. This suggests either the camera is the limiting factor – unlikely given that others have demonstrated much deeper bursts with other, apparently faster cards – or that the Angelbird doesn’t live up to its claims.

Blackmagic Disk Speed Test with a Pergear USB-C reader indicates the Angelbird almost hits 1,000 MB/s at the start of a sequential read or write, but within a second or two falls down to a sustained speed of only about 700 MB/s. And there’s that 40% difference again, vs XQD.

Average performance correlates with consistent performance

For example, the Pergear 512 GiB performs about the same on average as the ProGrade 256 GiB, but the Pergear was much more consistent. The Angelbird was a tad faster & more consistent again.

This also highlights why many trials are important, in order to determine the variance. I’d rather have an on-average slower card that’s very consistent than a “bursty” card that might crap out in a critical moment and cause me to miss the moment completely.

30 & 120 FPS modes are camera limited

There was practically no difference in performance between the cards in 30 FPS & 120 FPS modes.

The bandwidth demonstrated is well below the demonstrated capabilities of all these cards, at just a few hundred MB/s.

All this seems quite conclusive that in these extra-fast burst modes the Z9 is the bottleneck, not the memory card.

Sidenote: The ProGrade card showed occasional glitches (three in total across twenty trials) – where the Z9 would suddenly stop shooting mid-burst, where a split second prior it had still shown a significant amount left in the “buffer” (the rXXX counter). I’m not sure what to make of that – perhaps the Z9 relies on some basic level of performance and the ProGrade can’t consistently meet it, or perhaps something is glitching between the Z9 & the ProGrade card that causes the Z9 to error out and stop working.

Methodology

1/250, 24-70/4 @ f4, ISO 5000.

Z9 firmware 1.11.

I enabled the shutter sound at maximum volume, and held down the shutter until I heard a stutter.

For 20 FPS mode:

I counted any extra frames after the stutter and subtracted those from the numbers.
I also tested 1/2500 and saw no meaningful difference in results, and ISO 64 & 25,600 which improved and decreased (respectively) buffer depth by about 10% each (very likely corresponding to the file size differences, though I didn’t check).
Five trials, each testing each format in turn: lossless, HE*, HE.

For 30 & 120 FPS modes:

I never heard an extra frame after the first stutter – I don’t know if that means the camera ground to a complete halt or merely that it doesn’t reliably play the fake shutter sound in these modes. The consistency of the results in those modes leads me to believe it’s the former.
Ten trials, sequentially.

Cards were formatted in camera and empty at the start of each class of testing (20, 30, 120). Images were not erased between trials (empty cards are not representative of real-world conditions).

Autofocus was not engaged during shooting. I haven’t tested it comprehensively, but so far I’ve seen no impact on burst performance from using autofocus (including subject recognition).