<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:media="http://search.yahoo.com/mrss/"
>

<channel>
	<title>Benchmarked &#8211; Wade Tregaskis</title>
	<atom:link href="https://wadetregaskis.com/tags/benchmarked/feed/" rel="self" type="application/rss+xml" />
	<link>https://wadetregaskis.com</link>
	<description></description>
	<lastBuildDate>Sun, 19 May 2024 15:24:24 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://wadetregaskis.com/wp-content/uploads/2016/03/Stitch-512x512-1-256x256.png</url>
	<title>Benchmarked &#8211; Wade Tregaskis</title>
	<link>https://wadetregaskis.com</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">226351702</site>	<item>
		<title>URLSession performance for reading a byte stream</title>
		<link>https://wadetregaskis.com/urlsession-performance-for-reading-a-byte-stream/</link>
					<comments>https://wadetregaskis.com/urlsession-performance-for-reading-a-byte-stream/#respond</comments>
		
		<dc:creator><![CDATA[]]></dc:creator>
		<pubDate>Fri, 03 May 2024 23:52:00 +0000</pubDate>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Benchmarked]]></category>
		<category><![CDATA[Data]]></category>
		<category><![CDATA[NSData]]></category>
		<category><![CDATA[Swift]]></category>
		<category><![CDATA[URLSession]]></category>
		<category><![CDATA[withUnsafeBytes]]></category>
		<guid isPermaLink="false">https://wadetregaskis.com/?p=8006</guid>

					<description><![CDATA[What&#8217;s the best way to read a stream of bytes with URLSession? That&#8217;s the simple question I set out to answer. I wrote some benchmarks. They read a 128 MiB file and perform a contrived aggregation of its content bytes (a joking &#8220;hash&#8221; of them, merely to ensure the actual reads aren&#8217;t optimised out). ⚠️&#8230; <a class="read-more-link" href="https://wadetregaskis.com/urlsession-performance-for-reading-a-byte-stream/" data-wpel-link="internal">Read more</a>]]></description>
										<content:encoded><![CDATA[
<p>What&#8217;s the best way to read a stream of bytes with <code><a href="https://developer.apple.com/documentation/foundation/urlsession" data-wpel-link="external" target="_blank" rel="external noopener">URLSession</a></code>? That&#8217;s the simple question I set out to answer. I wrote <a href="https://github.com/wadetregaskis/Swift-Benchmarks/blob/main/Benchmarks/URLSession/URLSession.swift" data-wpel-link="external" target="_blank" rel="external noopener">some benchmarks</a>. They read a 128 MiB file and perform a contrived aggregation of its content bytes (a joking &#8220;hash&#8221; of them, merely to ensure the actual reads aren&#8217;t optimised out).</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<p>⚠️ In a nutshell, the results here demonstrate the <em>best-case</em> performance for each of the methods evaluated. These benchmarks are very simple, which makes them relatively easy for the Swift compiler to optimise well. In less trivial, real-world code, the optimiser might not do as well. So these benchmarks and their results are merely one collective data point in the bigger picture of just how the heck to read files efficiently.</p>
</div></div>



<p>There are two key decisions you must make: which specific <code>URLSession</code> API you will use, and how you will access the bytes themselves.</p>
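<p>To make those two decisions concrete, here is a minimal sketch (not the benchmark code itself) pairing two of the combinations measured below: <code>bytes(from:)</code> with a plain <code>for</code> loop, and <code>data(from:)</code> with a <code>for</code> loop inside <code>withUnsafeBytes</code>. The function names and the wrapping-sum &#8220;hash&#8221; are hypothetical stand-ins, and the async APIs assume macOS 12 / iOS 15 or later.</p>

<pre class="wp-block-code"><code>import Foundation

// Hypothetical stand-in for the post's joking "hash": a wrapping sum of every byte.

// Decision 1: bytes(from:) streams the response as an AsyncSequence of UInt8.
// Decision 2: read it with a plain for loop.
func hashViaAsyncBytes(of url: URL) async throws -&gt; UInt64 {
    let (bytes, _) = try await URLSession.shared.bytes(from: url)
    var hash: UInt64 = 0
    for try await byte in bytes {
        hash &amp;+= UInt64(byte)
    }
    return hash
}

// Decision 1: data(from:) buffers the whole response into a single Data.
// Decision 2: read it through the raw buffer exposed by withUnsafeBytes.
func hashViaUnsafeBytes(of url: URL) async throws -&gt; UInt64 {
    let (data, _) = try await URLSession.shared.data(from: url)
    return data.withUnsafeBytes { (buffer: UnsafeRawBufferPointer) in
        var hash: UInt64 = 0
        for byte in buffer {
            hash &amp;+= UInt64(byte)
        }
        return hash
    }
}</code></pre>

<p>Both functions return the same value for the same file; only the route to the bytes differs, which is what the measurements compare.</p>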



<h2 class="wp-block-heading">Measurements</h2>



<p>Each benchmark was run a hundred times or for 30 seconds (whichever limit was hit first).  I&#8217;m highlighting here just the medians (in general there wasn&#8217;t much variation anyway), but you can dig into the other percentiles &amp; metrics via the disclosure triangles, if you like.</p>



<p>I&#8217;m pretty sure the reads were all served out of the kernel&#8217;s in-memory file system cache, judging by the lack of SSD read I/O reported by <a href="https://bjango.com/mac/istatmenus/" data-wpel-link="external" target="_blank" rel="external noopener">iStat Menus</a>.  But I didn&#8217;t go out of my way to verify this.</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<p>⚠️ &#8220;Peak RAM&#8221; is as reported by the <a href="https://github.com/ordo-one/package-benchmark" data-wpel-link="external" target="_blank" rel="external noopener">Benchmark</a> package, based on (if I understand it correctly) periodic sampling of the process RSS.  As such it&#8217;s not necessarily completely accurate, due to the potential to miss brief peaks.</p>
</div></div>



<h3 class="wp-block-heading">M2 MacBook Air</h3>



<figure class="wp-block-table aligncenter"><table><thead><tr><th>Method</th><th class="has-text-align-right" data-align="right">Wall time (ms)</th><th class="has-text-align-right" data-align="right">CPU time (ms)</th><th class="has-text-align-right" data-align="right">Throughput (MiB/s)</th><th class="has-text-align-right" data-align="right">Peak RAM (MB)</th></tr></thead><tbody><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/3767351-bytes" data-wpel-link="external" target="_blank" rel="external noopener">bytes(from:)</a></code> and for loop</td><td class="has-text-align-right" data-align="right">79</td><td class="has-text-align-right" data-align="right">138</td><td class="has-text-align-right" data-align="right">1,620</td><td class="has-text-align-right" data-align="right">91</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/3767351-bytes" data-wpel-link="external" target="_blank" rel="external noopener">bytes(from:)</a></code> and <code><a href="https://developer.apple.com/documentation/foundation/urlsession/asyncbytes/3767347-reduce" data-wpel-link="external" target="_blank" rel="external noopener">reduce</a></code></td><td class="has-text-align-right" data-align="right">79</td><td class="has-text-align-right" data-align="right">138</td><td class="has-text-align-right" data-align="right">1,620</td><td class="has-text-align-right" data-align="right">84</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/3767353-data" data-wpel-link="external" target="_blank" rel="external noopener">data(from:)</a></code> and for loop</td><td class="has-text-align-right" data-align="right">605</td><td class="has-text-align-right" data-align="right">641</td><td class="has-text-align-right" data-align="right">212</td><td class="has-text-align-right" data-align="right">265</td></tr><tr><td><code><a 
href="https://developer.apple.com/documentation/foundation/urlsession/3767353-data" data-wpel-link="external" target="_blank" rel="external noopener">data(from:)</a></code> and for loop inside <code><a href="https://developer.apple.com/documentation/foundation/data/3139154-withunsafebytes" data-wpel-link="external" target="_blank" rel="external noopener">withUnsafeBytes</a></code></td><td class="has-text-align-right" data-align="right">60</td><td class="has-text-align-right" data-align="right">95</td><td class="has-text-align-right" data-align="right">2,133</td><td class="has-text-align-right" data-align="right">338</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/3767353-data" data-wpel-link="external" target="_blank" rel="external noopener">data(from:)</a></code> and <code><a href="https://developer.apple.com/documentation/foundation/data/1780184-foreach" data-wpel-link="external" target="_blank" rel="external noopener">forEach</a></code></td><td class="has-text-align-right" data-align="right">765</td><td class="has-text-align-right" data-align="right">800</td><td class="has-text-align-right" data-align="right">167</td><td class="has-text-align-right" data-align="right">262</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/3767353-data" data-wpel-link="external" target="_blank" rel="external noopener">data(from:)</a></code> and <code><a href="https://developer.apple.com/documentation/foundation/data/3126633-reduce" data-wpel-link="external" target="_blank" rel="external noopener">reduce</a></code></td><td class="has-text-align-right" data-align="right">750</td><td class="has-text-align-right" data-align="right">784</td><td class="has-text-align-right" data-align="right">171</td><td class="has-text-align-right" data-align="right">290</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/1411554-datatask" data-wpel-link="external" 
target="_blank" rel="external noopener">dataTask(with:)</a></code> and an incremental delegate with for loop</td><td class="has-text-align-right" data-align="right">560</td><td class="has-text-align-right" data-align="right">617</td><td class="has-text-align-right" data-align="right">229</td><td class="has-text-align-right" data-align="right">53</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/1411554-datatask" data-wpel-link="external" target="_blank" rel="external noopener">dataTask(with:)</a></code> and an incremental delegate with for loop inside <code><a href="https://developer.apple.com/documentation/foundation/data/3139154-withunsafebytes" data-wpel-link="external" target="_blank" rel="external noopener">withUnsafeBytes</a></code></td><td class="has-text-align-right" data-align="right">36</td><td class="has-text-align-right" data-align="right">75</td><td class="has-text-align-right" data-align="right">3,556</td><td class="has-text-align-right" data-align="right">26</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/1411554-datatask" data-wpel-link="external" target="_blank" rel="external noopener">dataTask(with:)</a></code> and an incremental delegate with <code><a href="https://developer.apple.com/documentation/foundation/data/1780184-foreach" data-wpel-link="external" target="_blank" rel="external noopener">forEach</a></code></td><td class="has-text-align-right" data-align="right">719</td><td class="has-text-align-right" data-align="right">775</td><td class="has-text-align-right" data-align="right">178</td><td class="has-text-align-right" data-align="right">51</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/1411554-datatask" data-wpel-link="external" target="_blank" rel="external noopener">dataTask(with:)</a></code> and an incremental delegate with <code><a 
href="https://developer.apple.com/documentation/foundation/data/3126633-reduce" data-wpel-link="external" target="_blank" rel="external noopener">reduce</a></code></td><td class="has-text-align-right" data-align="right">709</td><td class="has-text-align-right" data-align="right">765</td><td class="has-text-align-right" data-align="right">181</td><td class="has-text-align-right" data-align="right">45</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/1410330-datatask" data-wpel-link="external" target="_blank" rel="external noopener">dataTask(with:completionHandler:)</a></code> and for loop</td><td class="has-text-align-right" data-align="right">590</td><td class="has-text-align-right" data-align="right">630</td><td class="has-text-align-right" data-align="right">217</td><td class="has-text-align-right" data-align="right">442</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/1410330-datatask" data-wpel-link="external" target="_blank" rel="external noopener">dataTask(with:completionHandler:)</a></code> and for loop inside <code><a href="https://developer.apple.com/documentation/foundation/data/3139154-withunsafebytes" data-wpel-link="external" target="_blank" rel="external noopener">withUnsafeBytes</a></code></td><td class="has-text-align-right" data-align="right">57</td><td class="has-text-align-right" data-align="right">98</td><td class="has-text-align-right" data-align="right">2,246</td><td class="has-text-align-right" data-align="right">504</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/1410330-datatask" data-wpel-link="external" target="_blank" rel="external noopener">dataTask(with:completionHandler:)</a></code> and <code><a href="https://developer.apple.com/documentation/foundation/data/1780184-foreach" data-wpel-link="external" target="_blank" rel="external noopener">forEach</a></code></td><td
class="has-text-align-right" data-align="right">742</td><td class="has-text-align-right" data-align="right">783</td><td class="has-text-align-right" data-align="right">173</td><td class="has-text-align-right" data-align="right">470</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/1410330-datatask" data-wpel-link="external" target="_blank" rel="external noopener">dataTask(with:completionHandler:)</a></code> and <code><a href="https://developer.apple.com/documentation/foundation/data/3126633-reduce" data-wpel-link="external" target="_blank" rel="external noopener">reduce</a></code></td><td class="has-text-align-right" data-align="right">742</td><td class="has-text-align-right" data-align="right">786</td><td class="has-text-align-right" data-align="right">173</td><td class="has-text-align-right" data-align="right">485</td></tr></tbody></table></figure>
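<p>The throughput column is consistent with dividing the 128 MiB file size by the median wall time. A quick check against the first row (79 ms), using illustrative variable names:</p>

<pre class="wp-block-code"><code>// Throughput in MiB/s = file size in MiB / wall time in seconds.
let fileSizeMiB = 128.0
let medianWallTimeMs = 79.0  // bytes(from:) and for loop

let throughput = fileSizeMiB / (medianWallTimeMs / 1000.0)
print(Int(throughput.rounded()))  // prints 1620</code></pre>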



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Full results (raw text)</summary>
<pre class="wp-block-preformatted">bytewise read using bytes(from:) and for loop<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      35 │      36 │      37 │      37 │      37 │      67 │      69 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │      68 │      91 │      91 │      91 │      94 │      94 │      94 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases                   │    1841 │    1869 │    1883 │    1911 │    1925 │    1939 │    1953 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains                    │    1312 │    1332 │    1342 │    1362 │    1372 │    1382 │    1392 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      24 │      25 │      25 │      25 │      25 │      25 │      25 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │     136 │     137 │     138 │     138 │     138 │     139 │     140 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │      79 │      79 │      79 │      79 │      80 │      81 │      81 │     100 
│<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using bytes(from:) and reduce<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      35 │      36 │      37 │      37 │      37 │      67 │      69 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │      68 │      83 │      84 │      85 │      88 │      88 │      88 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases                   │    1840 │    1869 │    1897 │    1911 │    1925 │    1953 │    1967 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains                    │    1312 │    1332 │    1352 │    1362 │    1372 │    1392 │    1402 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      24 │      25 │      25 │      25 │      25 │      25 │      25 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │     137 │     138 │     138 │     138 │     138 │     139 │     140 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │      79 │      79 │      79 │      79 │      80 │      80 │      81 │     100 
│<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using data(from:) and for loop<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      40 │      43 │      44 │      45 │      47 │      76 │      76 │      50 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │     167 │     230 │     265 │     278 │     286 │     297 │     297 │      50 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases (K)               │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      50 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains (K)                │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      50 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      21 │      23 │      23 │      23 │      23 │      23 │      23 │      50 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │     631 │     639 │     641 │     647 │     659 │     676 │     676 │      50 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │     600 │     603 │     605 │     610 │     625 │     642 │     642 │      50 
│<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using data(from:) and for loop inside withUnsafeBytes<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      39 │      44 │      44 │      45 │      47 │      75 │      76 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │     150 │     293 │     338 │     398 │     419 │     439 │     445 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases                   │      11 │      11 │      11 │      11 │      11 │      11 │      11 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains                    │       5 │       5 │       5 │       5 │       5 │       5 │       5 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      21 │      22 │      22 │      23 │      23 │      23 │      23 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │      79 │      93 │      95 │      96 │      98 │     106 │     110 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │      49 │      59 │      60 │      61 │      62 │      68 │      
71 │     100 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using data(from:) and forEach<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      41 │      43 │      44 │      45 │      47 │      77 │      77 │      40 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │     145 │     228 │     262 │     279 │     292 │     296 │     296 │      40 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases (K)               │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      40 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains (K)                │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      40 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      22 │      22 │      23 │      23 │      23 │      23 │      23 │      40 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │     782 │     794 │     800 │     805 │     806 │     816 │     816 │      40 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │     750 │     760 │     765 │     771 │     773 │     783 │     783 │      
40 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using data(from:) and reduce<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      41 │      44 │      44 │      45 │      47 │      77 │      77 │      40 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │     165 │     254 │     290 │     338 │     353 │     361 │     361 │      40 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases (K)               │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      40 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains (K)                │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      40 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      22 │      22 │      22 │      23 │      23 │      23 │      23 │      40 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │     771 │     777 │     784 │     789 │     792 │     796 │     796 │      40 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │     738 │     742 │     750 │     756 │     758 │     760 │     760 │      40 
│<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using dataTask(with:) and an incremental delegate with for loop<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      34 │      35 │      36 │      36 │      37 │      69 │      69 │      54 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │      52 │      52 │      53 │      57 │      57 │      57 │      57 │      54 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases (K)               │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      54 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains (K)                │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      54 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      26 │      26 │      27 │      27 │      27 │      27 │      27 │      54 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │     616 │     616 │     617 │     617 │     617 │     620 │     620 │      54 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │     559 │     559 │     560 │     560 │     560 │     
564 │     564 │      54 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using dataTask(with:) and an incremental delegate with for loop inside withUnsafeBytes<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      47 │      51 │      53 │      55 │      57 │      87 │      87 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │      22 │      26 │      26 │      27 │      27 │      27 │      27 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases                   │    6135 │    6147 │    6147 │    6147 │    6150 │    6150 │    6150 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains                    │    2046 │    2051 │    2051 │    2051 │    2051 │    2051 │    2051 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      35 │      35 │      35 │      35 │      35 │      35 │      35 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │      73 │      74 │      75 │      75 │      76 │      76 │      76 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │      35 
│      36 │      36 │      36 │      36 │      37 │      37 │     100 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using dataTask(with:) and an incremental delegate with forEach<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      34 │      35 │      36 │      36 │      37 │      68 │      68 │      42 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │      47 │      50 │      51 │      51 │      51 │      51 │      51 │      42 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases (K)               │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      42 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains (K)                │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      42 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      26 │      27 │      27 │      27 │      27 │      27 │      27 │      42 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │     773 │     774 │     775 │     775 │     776 │     779 │     779 │      42 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall 
clock) (ms)     │     717 │     718 │     719 │     719 │     719 │     722 │     722 │      42 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using dataTask(with:) and an incremental delegate with reduce<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      34 │      35 │      36 │      36 │      37 │      70 │      70 │      43 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │      45 │      45 │      45 │      45 │      45 │      45 │      45 │      43 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases (K)               │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      43 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains (K)                │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      43 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      26 │      26 │      27 │      27 │      27 │      27 │      27 │      43 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │     763 │     764 │     765 │     766 │     766 │     768 │     768 │      43 
│<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │     707 │     709 │     709 │     710 │     710 │     714 │     714 │      43 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using dataTask(with:completionHandler:) and for loop <br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      39 │      43 │      44 │      47 │      50 │      79 │      79 │      51 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │     378 │     410 │     442 │     471 │     496 │     528 │     528 │      51 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases (K)               │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      51 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains (K)                │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      51 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      27 │      27 │      28 │      28 │      28 │      28 │      28 │      51 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │     624 │     627 │     630 │     632 │     634 │     640 │     
640 │      51 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │     587 │     589 │     590 │     594 │     597 │     601 │     601 │      51 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using dataTask(with:completionHandler:) and for loop inside withUnsafeBytes<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      38 │      43 │      45 │      46 │      48 │      76 │      76 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │     394 │     462 │     504 │     525 │     547 │     583 │     587 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases                   │       5 │       5 │       5 │       5 │       5 │       5 │       5 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains                    │       3 │       3 │       3 │       3 │       3 │       3 │       3 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      27 │      27 │      28 │      28 │      28 │      28 │      28 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │      92 │      96 │      98 
│     100 │     102 │     105 │     106 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │      55 │      57 │      57 │      58 │      59 │      61 │      63 │     100 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using dataTask(with:completionHandler:) and forEach<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      38 │      41 │      44 │      47 │      51 │      74 │      74 │      41 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │     391 │     430 │     470 │     560 │     587 │     617 │     617 │      41 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases (K)               │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      41 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains (K)                │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      41 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      27 │      28 │      28 │      28 │      28 │      28 │      28 │      41 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │     773 │     
776 │     783 │     789 │     790 │     795 │     795 │      41 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │     734 │     737 │     742 │     749 │     751 │     755 │     755 │      41 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using dataTask(with:completionHandler:) and reduce<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      38 │      43 │      44 │      46 │      51 │      77 │      77 │      41 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │     348 │     408 │     451 │     485 │     499 │     548 │     548 │      41 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases (K)               │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      41 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains (K)                │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      41 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      27 │      28 │      28 │      28 │      28 │      28 │      28 │      41 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │     
770 │     776 │     786 │     788 │     793 │     796 │     796 │      41 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │     732 │     737 │     742 │     749 │     752 │     755 │     755 │      41 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛</pre>
</details>



<h3 class="wp-block-heading">10-core iMac Pro</h3>



<figure class="wp-block-table aligncenter"><table><thead><tr><th>Method</th><th class="has-text-align-right" data-align="right">Wall time (ms)</th><th class="has-text-align-right" data-align="right">CPU time (ms)</th><th class="has-text-align-right" data-align="right">Throughput (MiB/s)</th><th class="has-text-align-right" data-align="right">Peak RAM (MB)</th></tr></thead><tbody><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/3767351-bytes" data-wpel-link="external" target="_blank" rel="external noopener">bytes(from:)</a></code> and for loop</td><td class="has-text-align-right" data-align="right">138</td><td class="has-text-align-right" data-align="right">250</td><td class="has-text-align-right" data-align="right">928</td><td class="has-text-align-right" data-align="right">54</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/3767351-bytes" data-wpel-link="external" target="_blank" rel="external noopener">bytes(from:)</a></code> and <code><a href="https://developer.apple.com/documentation/foundation/urlsession/asyncbytes/3767347-reduce" data-wpel-link="external" target="_blank" rel="external noopener">reduce</a></code></td><td class="has-text-align-right" data-align="right">139</td><td class="has-text-align-right" data-align="right">251</td><td class="has-text-align-right" data-align="right">921</td><td class="has-text-align-right" data-align="right">53</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/3767353-data" data-wpel-link="external" target="_blank" rel="external noopener">data(from:)</a></code> and for loop</td><td class="has-text-align-right" data-align="right">953</td><td class="has-text-align-right" data-align="right">1,023</td><td class="has-text-align-right" data-align="right">134</td><td class="has-text-align-right" data-align="right">257</td></tr><tr><td><code><a 
href="https://developer.apple.com/documentation/foundation/urlsession/3767353-data" data-wpel-link="external" target="_blank" rel="external noopener">data(from:)</a></code> and for loop inside <code><a href="https://developer.apple.com/documentation/foundation/data/3139154-withunsafebytes" data-wpel-link="external" target="_blank" rel="external noopener">withUnsafeBytes</a></code></td><td class="has-text-align-right" data-align="right">140</td><td class="has-text-align-right" data-align="right">208</td><td class="has-text-align-right" data-align="right">914</td><td class="has-text-align-right" data-align="right">344</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/3767353-data" data-wpel-link="external" target="_blank" rel="external noopener">data(from:)</a></code> and <code><a href="https://developer.apple.com/documentation/foundation/data/1780184-foreach" data-wpel-link="external" target="_blank" rel="external noopener">forEach</a></code></td><td class="has-text-align-right" data-align="right">1,181</td><td class="has-text-align-right" data-align="right">1,254</td><td class="has-text-align-right" data-align="right">108</td><td class="has-text-align-right" data-align="right">236</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/3767353-data" data-wpel-link="external" target="_blank" rel="external noopener">data(from:)</a></code> and <code><a href="https://developer.apple.com/documentation/foundation/data/3126633-reduce" data-wpel-link="external" target="_blank" rel="external noopener">reduce</a></code></td><td class="has-text-align-right" data-align="right">1,163</td><td class="has-text-align-right" data-align="right">1,233</td><td class="has-text-align-right" data-align="right">110</td><td class="has-text-align-right" data-align="right">232</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/1411554-datatask" 
data-wpel-link="external" target="_blank" rel="external noopener">dataTask(with:)</a></code> and an incremental delegate with for loop</td><td class="has-text-align-right" data-align="right">848</td><td class="has-text-align-right" data-align="right">964</td><td class="has-text-align-right" data-align="right">151</td><td class="has-text-align-right" data-align="right">40</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/1411554-datatask" data-wpel-link="external" target="_blank" rel="external noopener">dataTask(with:)</a></code> and an incremental delegate with for loop inside <code><a href="https://developer.apple.com/documentation/foundation/data/3139154-withunsafebytes" data-wpel-link="external" target="_blank" rel="external noopener">withUnsafeBytes</a></code></td><td class="has-text-align-right" data-align="right">56</td><td class="has-text-align-right" data-align="right">140</td><td class="has-text-align-right" data-align="right">2,286</td><td class="has-text-align-right" data-align="right">23</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/1411554-datatask" data-wpel-link="external" target="_blank" rel="external noopener">dataTask(with:)</a></code> and an incremental delegate with <code><a href="https://developer.apple.com/documentation/foundation/data/1780184-foreach" data-wpel-link="external" target="_blank" rel="external noopener">forEach</a></code></td><td class="has-text-align-right" data-align="right">1,066</td><td class="has-text-align-right" data-align="right">1,181</td><td class="has-text-align-right" data-align="right">120</td><td class="has-text-align-right" data-align="right">35</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/1411554-datatask" data-wpel-link="external" target="_blank" rel="external noopener">dataTask(with:)</a></code> and an incremental delegate with <code><a 
href="https://developer.apple.com/documentation/foundation/data/3126633-reduce" data-wpel-link="external" target="_blank" rel="external noopener">reduce</a></code></td><td class="has-text-align-right" data-align="right">1,072</td><td class="has-text-align-right" data-align="right">1,185</td><td class="has-text-align-right" data-align="right">119</td><td class="has-text-align-right" data-align="right">43</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/1410330-datatask" data-wpel-link="external" target="_blank" rel="external noopener">dataTask(with:completionHandler:)</a></code> and for loop</td><td class="has-text-align-right" data-align="right">948</td><td class="has-text-align-right" data-align="right">1,026</td><td class="has-text-align-right" data-align="right">135</td><td class="has-text-align-right" data-align="right">375</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/1410330-datatask" data-wpel-link="external" target="_blank" rel="external noopener">dataTask(with:completionHandler:)</a></code> and for loop inside <code><a href="https://developer.apple.com/documentation/foundation/data/3139154-withunsafebytes" data-wpel-link="external" target="_blank" rel="external noopener">withUnsafeBytes</a></code></td><td class="has-text-align-right" data-align="right">137</td><td class="has-text-align-right" data-align="right">215</td><td class="has-text-align-right" data-align="right">934</td><td class="has-text-align-right" data-align="right">591</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/1410330-datatask" data-wpel-link="external" target="_blank" rel="external noopener">dataTask(with:completionHandler:)</a></code> and <code><a href="https://developer.apple.com/documentation/foundation/data/1780184-foreach" data-wpel-link="external" target="_blank" rel="external noopener">forEach</a></code></td><td
class="has-text-align-right" data-align="right">1,179</td><td class="has-text-align-right" data-align="right">1,258</td><td class="has-text-align-right" data-align="right">109</td><td class="has-text-align-right" data-align="right">370</td></tr><tr><td><code><a href="https://developer.apple.com/documentation/foundation/urlsession/1410330-datatask" data-wpel-link="external" target="_blank" rel="external noopener">dataTask(with:completionHandler:)</a></code> and <code><a href="https://developer.apple.com/documentation/foundation/data/3126633-reduce" data-wpel-link="external" target="_blank" rel="external noopener">reduce</a></code></td><td class="has-text-align-right" data-align="right">1,176</td><td class="has-text-align-right" data-align="right">1,254</td><td class="has-text-align-right" data-align="right">109</td><td class="has-text-align-right" data-align="right">416</td></tr></tbody></table></figure>
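


<p>For reference, here is a minimal sketch of the general shape of the fastest row in the table above (an incremental delegate with a for loop inside <code>withUnsafeBytes</code>, at 56 ms wall time and 23 MB peak RAM). It is not the benchmark code itself: the <code>ByteAggregator</code> class, its completion closure, the XOR stand-in for the "hash", and the URL are all hypothetical names and choices made for illustration.</p>



<pre class="wp-block-code"><code>import Foundation

// Hypothetical delegate, for illustration only: it folds each chunk into a
// running value as it arrives, rather than accumulating the whole response.
final class ByteAggregator: NSObject, URLSessionDataDelegate {
    private var aggregate: UInt8 = 0
    private let completion: (Result&lt;UInt8, Error&gt;) -&gt; Void

    init(completion: @escaping (Result&lt;UInt8, Error&gt;) -&gt; Void) {
        self.completion = completion
    }

    // Called repeatedly, once per chunk of the response body.
    func urlSession(_ session: URLSession,
                    dataTask: URLSessionDataTask,
                    didReceive data: Data) {
        var value = aggregate
        // Iterate the raw buffer rather than the Data value itself.
        data.withUnsafeBytes { (buffer: UnsafeRawBufferPointer) in
            for byte in buffer {
                value ^= byte  // Stand-in for the benchmark's "hash" (an assumption).
            }
        }
        aggregate = value
    }

    // Called once, when the transfer finishes or fails.
    func urlSession(_ session: URLSession,
                    task: URLSessionTask,
                    didCompleteWithError error: Error?) {
        if let error = error {
            completion(.failure(error))
        } else {
            completion(.success(aggregate))
        }
    }
}

// Usage, with a placeholder URL:
let aggregator = ByteAggregator { result in print(result) }
let session = URLSession(configuration: .default,
                         delegate: aggregator,
                         delegateQueue: nil)
session.dataTask(with: URL(string: "https://example.com/128MiB.bin")!).resume()</code></pre>



<p>Because each chunk is consumed and released as it arrives, nothing close to the full 128 MiB needs to be resident at once, which is consistent with the low peak RAM figures for the delegate rows. In a command-line tool you would also need to keep the process alive until the completion closure fires.</p>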



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Full results (raw text)</summary>
<pre class="wp-block-preformatted">bytewise read using bytes(from:) and for loop<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      36 │      37 │      37 │      37 │      37 │      68 │      69 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │      45 │      51 │      54 │      56 │      62 │      66 │      66 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases                   │    1952 │    2023 │    2051 │    2079 │    2093 │    2135 │    2135 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains                    │    1392 │    1442 │    1462 │    1482 │    1492 │    1522 │    1522 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      21 │      21 │      21 │      21 │      21 │      21 │      21 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │     245 │     248 │     250 │     251 │     254 │     259 │     267 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │     136 │     138 │     138 │     140 │     141 │     145 │     146 │     100 
│<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using bytes(from:) and reduce<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      36 │      37 │      37 │      37 │      37 │      67 │      69 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │      48 │      52 │      53 │      54 │      56 │      58 │      60 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases                   │    1953 │    2009 │    2037 │    2065 │    2079 │    2121 │    2121 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains                    │    1392 │    1432 │    1452 │    1472 │    1482 │    1512 │    1512 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      21 │      21 │      21 │      21 │      21 │      21 │      21 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │     246 │     249 │     251 │     253 │     256 │     274 │     278 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │     136 │     138 │     139 │     140 │     141 │     150 │     151 │     100 
│<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using data(from:) and for loop<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      41 │      42 │      43 │      43 │      72 │      76 │      76 │      32 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │     188 │     243 │     257 │     266 │     279 │     316 │     316 │      32 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases (K)               │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      32 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains (K)                │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      32 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      18 │      18 │      19 │      19 │      19 │      19 │      19 │      32 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │    1010 │    1015 │    1023 │    1031 │    1040 │    1060 │    1060 │      32 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │     941 │     946 │     953 │     958 │     968 │     985 │     985 │      32 
│<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using data(from:) and for loop inside withUnsafeBytes<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      38 │      42 │      43 │      44 │      47 │      72 │      73 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │     244 │     328 │     344 │     352 │     358 │     369 │     370 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases                   │      11 │      11 │      11 │      11 │      11 │      11 │      11 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains                    │       5 │       5 │       5 │       5 │       5 │       5 │       5 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      18 │      18 │      18 │      18 │      18 │      19 │      19 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │     203 │     207 │     208 │     210 │     213 │     221 │     234 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │     135 │     139 │     140 │     141 │     143 │     148 │     
159 │     100 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using data(from:) and forEach<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      41 │      42 │      43 │      43 │      73 │      75 │      75 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │     171 │     225 │     236 │     263 │     283 │     288 │     288 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases (K)               │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains (K)                │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      18 │      18 │      18 │      19 │      19 │      19 │      19 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │    1247 │    1250 │    1254 │    1266 │    1271 │    1284 │    1284 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │    1177 │    1179 │    1181 │    1194 │    1199 │    1211 │    1211 │      
26 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using data(from:) and reduce<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      39 │      42 │      43 │      43 │      72 │      79 │      79 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │     207 │     226 │     232 │     243 │     260 │     278 │     278 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases (K)               │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains (K)                │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      18 │      18 │      18 │      19 │      19 │      19 │      19 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │    1225 │    1228 │    1233 │    1243 │    1250 │    1269 │    1269 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │    1157 │    1160 │    1163 │    1172 │    1178 │    1198 │    1198 │      26 
│<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using dataTask(with:) and an incremental delegate with for loop<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      35 │      35 │      36 │      36 │      65 │      68 │      68 │      36 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │      32 │      36 │      40 │      45 │      48 │      52 │      52 │      36 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases (K)               │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      36 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains (K)                │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      36 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      23 │      23 │      23 │      23 │      23 │      23 │      23 │      36 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │     954 │     958 │     964 │     965 │     975 │     992 │     992 │      36 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │     841 │     845 │     848 │     851 │     857 │     
871 │     871 │      36 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using dataTask(with:) and an incremental delegate with for loop inside withUnsafeBytes<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      47 │      51 │      53 │      54 │      56 │      83 │      84 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │      17 │      20 │      23 │      24 │      25 │      25 │      25 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases                   │    5895 │    6067 │    6087 │    6099 │    6111 │    6123 │    6123 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains                    │    1966 │    2024 │    2031 │    2035 │    2039 │    2043 │    2043 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      32 │      32 │      32 │      32 │      33 │      33 │      33 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │     137 │     139 │     140 │     142 │     147 │     152 │     155 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │      55 
│      55 │      56 │      56 │      58 │      60 │      62 │     100 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using dataTask(with:) and an incremental delegate with forEach<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      35 │      35 │      36 │      36 │      65 │      69 │      69 │      29 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │      29 │      33 │      35 │      37 │      39 │      41 │      41 │      29 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases (K)               │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      29 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains (K)                │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      29 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      23 │      23 │      23 │      23 │      23 │      23 │      23 │      29 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │    1169 │    1179 │    1181 │    1184 │    1190 │    1205 │    1205 │      29 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall 
clock) (ms)     │    1056 │    1065 │    1066 │    1069 │    1074 │    1086 │    1086 │      29 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using dataTask(with:) and an incremental delegate with reduce<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      35 │      35 │      35 │      36 │      65 │      68 │      68 │      28 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │      30 │      35 │      43 │      46 │      48 │      52 │      52 │      28 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases (K)               │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      28 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains (K)                │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      28 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      23 │      23 │      23 │      23 │      23 │      23 │      23 │      28 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │    1176 │    1182 │    1185 │    1190 │    1198 │    1200 │    1200 │      28 
│<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │    1061 │    1067 │    1072 │    1075 │    1080 │    1083 │    1083 │      28 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using dataTask(with:completionHandler:) and for loop <br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      40 │      42 │      43 │      43 │      72 │      75 │      75 │      32 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │     356 │     368 │     375 │     386 │     400 │     407 │     407 │      32 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases (K)               │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      32 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains (K)                │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      32 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      23 │      24 │      24 │      24 │      24 │      24 │      24 │      32 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │    1016 │    1023 │    1026 │    1032 │    1041 │    1048 │    
1048 │      32 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │     940 │     946 │     948 │     952 │     961 │     967 │     967 │      32 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using dataTask(with:completionHandler:) and for loop inside withUnsafeBytes<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      40 │      42 │      43 │      43 │      44 │      74 │      74 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │     370 │     580 │     591 │     599 │     607 │     613 │     614 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases                   │       5 │       5 │       5 │       5 │       5 │       5 │       5 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains                    │       3 │       3 │       3 │       3 │       3 │       3 │       3 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      23 │      23 │      24 │      24 │      24 │      24 │      24 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │     210 │     213 │     215 
│     217 │     220 │     236 │     238 │     100 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │     133 │     135 │     137 │     138 │     140 │     153 │     154 │     100 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using dataTask(with:completionHandler:) and forEach<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      41 │      42 │      42 │      43 │      71 │      75 │      75 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │     332 │     364 │     370 │     384 │     415 │     418 │     418 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases (K)               │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains (K)                │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      23 │      24 │      24 │      24 │      24 │      24 │      24 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │    1243 │    
1253 │    1258 │    1268 │    1270 │    1277 │    1277 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │    1164 │    1175 │    1179 │    1188 │    1190 │    1201 │    1201 │      26 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛<br><br>bytewise read using dataTask(with:completionHandler:) and reduce<br>╒════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕<br>│ Metric                     │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │<br>╞════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡<br>│ Malloc (total) (K)         │      40 │      42 │      42 │      43 │      72 │      75 │      75 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Memory (resident peak) (M) │     353 │     392 │     416 │     452 │     472 │     473 │     473 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Releases (K)               │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Retains (K)                │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │    4194 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Syscalls (total) (K)       │      23 │      24 │      24 │      24 │      24 │      24 │      24 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (total CPU) (ms)      │    
1241 │    1250 │    1254 │    1266 │    1278 │    1281 │    1281 │      26 │<br>├────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤<br>│ Time (wall clock) (ms)     │    1164 │    1172 │    1176 │    1185 │    1195 │    1200 │    1200 │      26 │<br>╘════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛</pre>
</details>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<p>☝️ I didn&#8217;t include it in these initial results, but there&#8217;s also <code><a href="https://developer.apple.com/documentation/foundation/data/1780329-enumeratebytes" data-wpel-link="external" target="_blank" rel="external noopener">enumerateBytes</a></code>, carried over from <code>NSData</code>.  It performs identically to <code>withUnsafeBytes</code>.  However, it is officially deprecated (Apple claims that for-loops are the replacement, even though they&#8217;re an order of magnitude slower 🤨).</p>



<p>🤔 I also tried to test the <code>NSData</code> <code><a href="https://developer.apple.com/documentation/foundation/nsdata/1410616-bytes" data-wpel-link="external" target="_blank" rel="external noopener">bytes</a></code> property, but no matter how it&#8217;s used, it always results in the benchmark crashing.  It seems like it is actually unusable from Swift due to a memory management bug in the bridging layer and/or Swift compiler…?</p>
</div></div>



<h2 class="wp-block-heading">Observations</h2>



<h3 class="wp-block-heading">Incremental reads + withUnsafeBytes is unequivocally the best method</h3>



<p>It&#8217;s dramatically faster than any other approach &#8211; both in wall time and overall CPU usage &#8211; <em>and</em> uses the least amount of memory by far.</p>



<p>Within the <code><a href="https://developer.apple.com/documentation/foundation/data" data-wpel-link="external" target="_blank" rel="external noopener">Data</a></code>-centric methods this doesn&#8217;t surprise me &#8211; with the incremental approach <code>URLSession</code> can just hand data back as it comes in, in whatever chunk sizes are most convenient.  In the other <code>Data</code>-based approaches it has to aggregate everything into one final contiguous blob.</p>
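
<p>As a rough illustration of why the incremental approach needs no contiguous blob, here is a minimal sketch that folds each chunk into a running aggregate as it arrives.  The <code>RunningHash</code> type and its multiply-and-add &#8220;hash&#8221; are hypothetical stand-ins, not the code used in these benchmarks.</p>

```swift
import Foundation

/// Hypothetical stand-in for the benchmark's contrived "hash" (an assumption,
/// not the original code).
struct RunningHash {
    private(set) var value: UInt64 = 0

    /// Folds one chunk into the aggregate, reading the chunk's bytes directly.
    mutating func consume(_ chunk: Data) {
        var hash = value
        chunk.withUnsafeBytes { (buffer: UnsafeRawBufferPointer) -> Void in
            for byte in buffer {
                // Wrapping arithmetic, so overflow never traps.
                hash = (hash &* 31) &+ UInt64(byte)
            }
        }
        value = hash
    }
}
```

<p>A delegate&#8217;s <code>didReceive</code> callback could call <code>consume</code> once per chunk; the result is the same whether the bytes arrive in one piece or many.</p>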



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<p>🤔 This is all assuming that <code>URLSession</code> never memory-maps files, which I did not actually verify but does seem to be the case based on the performance and behaviour.  This strikes me as very odd, however, because memory-mapping the files would very likely be significantly faster, in the cases where it has to provide the entire contents as a single <code>Data</code> instance.  And <code>Data</code> already supports memory-mapping a file, very easily.</p>



<p>Of course, if your use-case doesn&#8217;t involve local files, then memory-mapping probably doesn&#8217;t apply anyway (unless <code>URLSession</code> uses a disk cache and the file is already in the cache &#8211; but I don&#8217;t know if it supports that).</p>
</div></div>
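
<p>For reference, this is roughly what memory-mapping a local file looks like when you load it yourself with <code>Data</code>, bypassing <code>URLSession</code>.  The <code>loadMapped</code> helper is a hypothetical name for illustration, and this was not one of the benchmarked methods.</p>

```swift
import Foundation

/// Hypothetical helper: loads a local file, asking Foundation to memory-map
/// it rather than read it into a freshly allocated buffer.
func loadMapped(_ url: URL) throws -> Data {
    // `.alwaysMapped` requests mapping outright; `.mappedIfSafe` is the more
    // conservative option for files that might sit on removable volumes.
    try Data(contentsOf: url, options: .alwaysMapped)
}
```

<p>Whether this is actually faster for a given workload would need its own benchmark.</p>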



<h4 class="wp-block-heading">Implementation note</h4>



<p>I utilised the incremental API by basically adapting it to invoke a closure (shown below), mainly because it made it easier to then test different byte enumeration approaches within it, but the results should hold for typical implementations of <code><a href="https://developer.apple.com/documentation/foundation/urlsessiondatadelegate" data-wpel-link="external" target="_blank" rel="external noopener">URLSessionDataDelegate</a></code>.</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<p>⚠️ This isn&#8217;t a robust implementation; it&#8217;s not suitable for use in a real program, merely sufficient for this very specific application in these benchmarks. It doesn&#8217;t communicate failures correctly, behaves very poorly if misused (e.g. by using it for more than one operation), naively blocks the thread that&#8217;s awaiting the data, etc. Please don&#8217;t use it as-is, but feel free to evolve it into a real solution for your own uses.</p>
</div></div>



<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span role="button" tabindex="0" data-code="class IncrementalDataDelegate: NSObject, URLSessionDataDelegate {
    private let task: URLSessionTask
    private let handler: (Data) -&gt; ()
    private let done = NSCondition()

    init(_ task: URLSessionTask,
         handler: @escaping (Data) -&gt; ()) {
        self.task = task
        self.handler = handler
        super.init()
    }

    func urlSession(_ session: URLSession, dataTask: URLSessionDataTask, didReceive data: Data) {
        precondition(self.task == dataTask)
        self.handler(data)
    }

    func urlSession(_ session: URLSession, task: URLSessionTask, didCompleteWithError error: (any Error)?) {
        precondition(self.task == task)

        if let error {
            preconditionFailure(&quot;Error: \(error)&quot;)
        }

        self.done.broadcast()
    }

    func wait() {
        self.done.wait()
    }
}" style="color:#000000;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line"><span style="color: #0000FF">class</span><span style="color: #000000"> </span><span style="color: #267F99">IncrementalDataDelegate</span><span style="color: #000000">: NSObject, URLSessionDataDelegate {</span></span>
<span class="line"><span style="color: #000000">    </span><span style="color: #0000FF">private</span><span style="color: #000000"> </span><span style="color: #0000FF">let</span><span style="color: #000000"> task: URLSessionTask</span></span>
<span class="line"><span style="color: #000000">    </span><span style="color: #0000FF">private</span><span style="color: #000000"> </span><span style="color: #0000FF">let</span><span style="color: #000000"> handler: (Data) -&gt; ()</span></span>
<span class="line"><span style="color: #000000">    </span><span style="color: #0000FF">private</span><span style="color: #000000"> </span><span style="color: #0000FF">let</span><span style="color: #000000"> done = </span><span style="color: #795E26">NSCondition</span><span style="color: #000000">()</span></span>
<span class="line"></span>
<span class="line"><span style="color: #000000">    </span><span style="color: #0000FF">init</span><span style="color: #000000">(</span><span style="color: #795E26">_</span><span style="color: #000000"> </span><span style="color: #001080">task</span><span style="color: #000000">: URLSessionTask,</span></span>
<span class="line"><span style="color: #000000">         </span><span style="color: #795E26">handler</span><span style="color: #000000">: </span><span style="color: #0000FF">@escaping</span><span style="color: #000000"> (Data) -&gt; ()) {</span></span>
<span class="line"><span style="color: #000000">        </span><span style="color: #0000FF">self</span><span style="color: #000000">.</span><span style="color: #001080">task</span><span style="color: #000000"> = task</span></span>
<span class="line"><span style="color: #000000">        </span><span style="color: #0000FF">self</span><span style="color: #000000">.</span><span style="color: #001080">handler</span><span style="color: #000000"> = handler</span></span>
<span class="line"><span style="color: #000000">        </span><span style="color: #0000FF">super</span><span style="color: #000000">.</span><span style="color: #0000FF">init</span><span style="color: #000000">()</span></span>
<span class="line"><span style="color: #000000">    }</span></span>
<span class="line"></span>
<span class="line"><span style="color: #000000">    </span><span style="color: #0000FF">func</span><span style="color: #000000"> </span><span style="color: #795E26">urlSession</span><span style="color: #000000">(</span><span style="color: #795E26">_</span><span style="color: #000000"> </span><span style="color: #001080">session</span><span style="color: #000000">: URLSession, </span><span style="color: #795E26">dataTask</span><span style="color: #000000">: URLSessionDataTask, </span><span style="color: #795E26">didReceive</span><span style="color: #000000"> </span><span style="color: #001080">data</span><span style="color: #000000">: Data) {</span></span>
<span class="line"><span style="color: #000000">        </span><span style="color: #795E26">precondition</span><span style="color: #000000">(</span><span style="color: #0000FF">self</span><span style="color: #000000">.</span><span style="color: #001080">task</span><span style="color: #000000"> == dataTask)</span></span>
<span class="line"><span style="color: #000000">        </span><span style="color: #0000FF">self</span><span style="color: #000000">.</span><span style="color: #795E26">handler</span><span style="color: #000000">(data)</span></span>
<span class="line"><span style="color: #000000">    }</span></span>
<span class="line"></span>
<span class="line"><span style="color: #000000">    </span><span style="color: #0000FF">func</span><span style="color: #000000"> </span><span style="color: #795E26">urlSession</span><span style="color: #000000">(</span><span style="color: #795E26">_</span><span style="color: #000000"> </span><span style="color: #001080">session</span><span style="color: #000000">: URLSession, </span><span style="color: #795E26">task</span><span style="color: #000000">: URLSessionTask, </span><span style="color: #795E26">didCompleteWithError</span><span style="color: #000000"> </span><span style="color: #001080">error</span><span style="color: #000000">: (any </span><span style="color: #267F99">Error</span><span style="color: #000000">)?) {</span></span>
<span class="line"><span style="color: #000000">        </span><span style="color: #795E26">precondition</span><span style="color: #000000">(</span><span style="color: #0000FF">self</span><span style="color: #000000">.</span><span style="color: #001080">task</span><span style="color: #000000"> == task)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #000000">        </span><span style="color: #AF00DB">if</span><span style="color: #000000"> </span><span style="color: #0000FF">let</span><span style="color: #000000"> error {</span></span>
<span class="line"><span style="color: #000000">            </span><span style="color: #795E26">preconditionFailure</span><span style="color: #000000">(</span><span style="color: #A31515">&quot;Error: </span><span style="color: #0000FF">\(</span><span style="color: #000000FF">error</span><span style="color: #0000FF">)</span><span style="color: #A31515">&quot;</span><span style="color: #000000">)</span></span>
<span class="line"><span style="color: #000000">        }</span></span>
<span class="line"></span>
<span class="line"><span style="color: #000000">        </span><span style="color: #0000FF">self</span><span style="color: #000000">.</span><span style="color: #001080">done</span><span style="color: #000000">.</span><span style="color: #795E26">broadcast</span><span style="color: #000000">()</span></span>
<span class="line"><span style="color: #000000">    }</span></span>
<span class="line"></span>
<span class="line"><span style="color: #000000">    </span><span style="color: #0000FF">func</span><span style="color: #000000"> </span><span style="color: #795E26">wait</span><span style="color: #000000">() {</span></span>
<span class="line"><span style="color: #000000">        </span><span style="color: #0000FF">self</span><span style="color: #000000">.</span><span style="color: #001080">done</span><span style="color: #000000">.</span><span style="color: #795E26">wait</span><span style="color: #000000">()</span></span>
<span class="line"><span style="color: #000000">    }</span></span>
<span class="line"><span style="color: #000000">}</span></span></code></pre></div>
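
<p>One way to start evolving the waiting part: <code>NSCondition</code> is documented to require that the condition be locked around <code>wait()</code>, and a boolean predicate is needed so that a completion which arrives <em>before</em> the wait isn&#8217;t lost.  Below is a minimal sketch of that pattern; <code>CompletionSignal</code> is a hypothetical helper, not part of the benchmark code.</p>

```swift
import Foundation

/// Hypothetical one-shot completion signal (an illustration, not the original
/// benchmark code). Safe to signal before or after the waiter arrives.
final class CompletionSignal: @unchecked Sendable {
    private let condition = NSCondition()
    private var isDone = false  // Only touched while `condition` is locked.

    /// Marks the work as finished and wakes every waiter.
    func signal() {
        condition.lock()
        isDone = true
        condition.broadcast()
        condition.unlock()
    }

    /// Blocks until `signal()` has been called; returns at once if it already was.
    func wait() {
        condition.lock()
        while !isDone {  // The loop also guards against spurious wake-ups.
            condition.wait()
        }
        condition.unlock()
    }
}

/// Signals from a background thread, then waits for it on the calling thread.
func demonstrateSignal() -> Bool {
    let signal = CompletionSignal()
    Thread.detachNewThread {
        Thread.sleep(forTimeInterval: 0.01)
        signal.signal()
    }
    signal.wait()
    return true
}
```

<p>This still blocks the waiting thread, so it only addresses one of the caveats listed above.</p>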



<h3 class="wp-block-heading"><a href="https://developer.apple.com/documentation/foundation/urlsession/asyncbytes" data-wpel-link="external" target="_blank" rel="external noopener">AsyncBytes</a> (<a href="https://developer.apple.com/documentation/foundation/urlsession/3767351-bytes" data-wpel-link="external" target="_blank" rel="external noopener">bytes(from:)</a>) is surprisingly not bad  &#8211; <em>in this specific case</em></h3>



<p>This was surprising to me because generally I&#8217;ve seen Swift&#8217;s <code><a href="https://developer.apple.com/documentation/swift/asyncsequence" data-wpel-link="external" target="_blank" rel="external noopener">AsyncSequence</a></code> stuff &#8211; especially for operating on individual bytes &#8211; being unusably slow and inefficient.  It&#8217;s actually what prompted me to do these benchmarks, because I stubbornly tried using <code>bytes(from:)</code> in a current project and the performance in that real app was god-awful.  These benchmarks demonstrate that it doesn&#8217;t necessarily <em>have</em> to be, and there&#8217;s something more complicated going on.  I&#8217;m yet to get to the bottom of that.</p>



<p>The problem seems to be that async code in general &#8211; but <em>especially</em> anything involving <code>AsyncSequence</code>s &#8211; is <em>terribly</em> dependent on the compiler&#8217;s optimiser.  If the optimiser does anything less than an astounding job on it, the performance can drop off a cliff.</p>



<p>So, while these results nominally recommend <code>bytes(from:)</code> as a decent way to use <code>URLSession</code> &#8211; being a respectable second-fastest in these benchmarks and noticeably easier to use than the fastest method &#8211; I&#8217;d be very cautious about it and test the performance early and often.</p>
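
<p>For context, the byte-at-a-time async style looks roughly like the sketch below.  It is written against any <code>AsyncSequence</code> of bytes so that it stands alone; the <code>aggregate</code> function and its multiply-and-add &#8220;hash&#8221; are hypothetical, not the benchmark&#8217;s code.</p>

```swift
/// Hypothetical helper: folds every byte of an asynchronous sequence into a
/// contrived aggregate, one element at a time.
func aggregate<Bytes: AsyncSequence>(_ bytes: Bytes) async throws -> UInt64
    where Bytes.Element == UInt8
{
    var hash: UInt64 = 0
    for try await byte in bytes {
        // Each iteration is a potential suspension point, which is why the
        // optimiser matters so much here.
        hash = (hash &* 31) &+ UInt64(byte)
    }
    return hash
}
```

<p>On Apple platforms the sequence would come from <code>let (bytes, _) = try await URLSession.shared.bytes(from: url)</code>, with <code>bytes</code> then passed to <code>aggregate</code>.</p>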



<h3 class="wp-block-heading"><code>withUnsafeBytes</code> is <em>way</em> faster than &#8220;safe&#8221; access to <code>Data</code>&#8217;s contents</h3>



<p>It&#8217;s an order of magnitude faster on Apple Silicon, and &#8216;merely&#8217; seven to nine times faster on Intel.</p>



<p>This isn&#8217;t surprising &#8211; <code>Data</code>&#8217;s regular APIs involve actual <em>function calls</em> (if not also Objective-C message sends, depending on what exactly is being returned by <code>URLSession</code> (native <code>Data</code> or actual <code>NSData</code>) and how Swift imports <code>NSData</code> from Objective-C).  <code>withUnsafeBytes</code> provides basically direct memory access, with practically zero overhead.</p>
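
<p>To make the two styles concrete, here is a minimal side-by-side sketch.  Both functions compute the same contrived aggregate; the function names and the aggregate itself are hypothetical, chosen only for illustration.</p>

```swift
import Foundation

/// "Safe" access: iterates `Data` through its ordinary Collection interface.
func hashSafely(_ data: Data) -> UInt64 {
    var hash: UInt64 = 0
    for byte in data {
        hash = (hash &* 31) &+ UInt64(byte)
    }
    return hash
}

/// Direct access: iterates the raw buffer exposed by `withUnsafeBytes`.
/// The buffer pointer must not escape the closure.
func hashUnsafely(_ data: Data) -> UInt64 {
    data.withUnsafeBytes { (buffer: UnsafeRawBufferPointer) -> UInt64 in
        var hash: UInt64 = 0
        for byte in buffer {
            hash = (hash &* 31) &+ UInt64(byte)
        }
        return hash
    }
}
```

<p>The loop bodies are identical; only the thing being iterated differs.</p>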



<p>However, it has one notable downside…</p>



<h3 class="wp-block-heading"><code>withUnsafeBytes</code> doubles the memory usage of the target <code>Data</code></h3>



<p>This surprised and disappointed me &#8211; <code>Data</code> is <em>supposed</em> to already be a contiguous array of bytes, internally, so accessing those bytes with <code>withUnsafeBytes</code> should be nothing more than returning a pointer to that internal storage.  But in at least some cases, it doesn&#8217;t &#8211; instead, it makes a whole new memory allocation, copies its contents into it, and then provides that instead (and releases it afterwards &#8211; so repeated calls will incur this overhead every time).</p>



<p>This isn&#8217;t a <em>huge</em> issue when the <code>Data</code>s in question are small &#8211; clearly it doesn&#8217;t hamper performance all that much, since it&#8217;s still the fastest way to access the bytes of even a 128 MiB <code>Data</code> &#8211; but it can be an issue when the <code>Data</code>s in question are not small.  If you run out of free RAM, the cost of the kernel&#8217;s in-memory compression or swapping will very likely cripple performance far more than the slow APIs do here.</p>



<h3 class="wp-block-heading">Reading a file with URLSession takes more than one CPU core</h3>



<p>I pointedly included <em>both</em> wall time and CPU time to highlight that multiple cores are engaged simultaneously for a single read.  This isn&#8217;t surprising, but it&#8217;s important to remember if you&#8217;re doing lots of parallel I/O &#8211; you can&#8217;t just naively allocate one read operation per CPU core and expect linear scaling (even setting aside CPU frequency scaling etc.).</p>



<p>Though, that&#8217;s generally true anyway, because most systems don&#8217;t have enough disk or network I/O bandwidth to keep up with the CPU.</p>
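
<p>A quick way to read the tables with this in mind is to divide CPU time by wall time, which gives the average number of cores kept busy.  The sketch below uses p50 figures from the tables above; <code>averageCoresBusy</code> is a hypothetical helper name.</p>

```swift
/// Average number of cores kept busy: total CPU time over wall-clock time.
func averageCoresBusy(cpuMilliseconds: Double, wallMilliseconds: Double) -> Double {
    cpuMilliseconds / wallMilliseconds
}

// Incremental delegate + for loop inside withUnsafeBytes: 140 ms CPU, 56 ms wall.
let incrementalUnsafe = averageCoresBusy(cpuMilliseconds: 140, wallMilliseconds: 56)      // 2.5

// Completion handler + for loop inside withUnsafeBytes: 215 ms CPU, 137 ms wall.
let completionUnsafe = averageCoresBusy(cpuMilliseconds: 215, wallMilliseconds: 137)     // about 1.57

// Incremental delegate + forEach: 1181 ms CPU, 1066 ms wall.
let incrementalForEach = averageCoresBusy(cpuMilliseconds: 1181, wallMilliseconds: 1066) // about 1.11
```

<p>The faster the byte loop, the larger the share of the total taken by work happening on other threads.</p>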



<h3 class="wp-block-heading">for loops are faster than <code>forEach</code> &amp; <code>reduce</code></h3>



<p>This might surprise some folks.  It&#8217;s surely a bit of a sore point with functional programming dogmatists.  The difference isn&#8217;t <em>massive</em> &#8211; in these benchmarks it&#8217;s only about 20%.  Still, it&#8217;s measurable and noticeable.</p>



<p>I find the for-loop approach easier to write and read anyway, so IMO this is just another reason to favour that instead of functional programming styles.  But not a reason to unilaterally favour one over the other.</p>
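
<p>For comparison, the three styles look like this when computing the same aggregate.  This is a minimal sketch; the function names and the multiply-and-add &#8220;hash&#8221; are hypothetical rather than the benchmark&#8217;s actual code.</p>

```swift
/// Plain for loop.
func viaForLoop<S: Sequence>(_ bytes: S) -> UInt64 where S.Element == UInt8 {
    var hash: UInt64 = 0
    for byte in bytes {
        hash = (hash &* 31) &+ UInt64(byte)
    }
    return hash
}

/// forEach with a closure that mutates captured state.
func viaForEach<S: Sequence>(_ bytes: S) -> UInt64 where S.Element == UInt8 {
    var hash: UInt64 = 0
    bytes.forEach { hash = (hash &* 31) &+ UInt64($0) }
    return hash
}

/// reduce, threading the accumulator through each step.
func viaReduce<S: Sequence>(_ bytes: S) -> UInt64 where S.Element == UInt8 {
    bytes.reduce(UInt64(0)) { ($0 &* 31) &+ UInt64($1) }
}
```

<p>All three accept any sequence of bytes, including <code>Data</code>, and return the same value.</p>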



<h3 class="wp-block-heading"><code>forEach</code> &amp; <code>reduce</code> perform the same</h3>



<p>Neither surprising nor news, but worth noting.  In principle the optimiser should reduce them to the exact same machine code in the end.</p>



<h3 class="wp-block-heading">Similar performance characteristics between [old] Intel Xeons and Apple Silicon</h3>



<p>The M2 was faster, of course, but maybe not <em>as much</em> faster as one would expect &#8211; at best only twice as fast, which (subjectively) feels underwhelming given Apple&#8217;s numerous manufacturing and design advantages versus my iMac Pro&#8217;s old Xeon.</p>



<p>What I mean, though, is that the relative performance of the different methods is about the same irrespective of the platform.  What&#8217;s good (or bad) on an M2 is likewise on an Intel CPU.  Which is worth appreciating &#8211; although x86 is rapidly fading into irrelevance, it&#8217;s not quite there yet, and it&#8217;s always unpleasant when optimising for one architecture has the opposite effect on another.</p>



<h3 class="wp-block-heading">You&#8217;re probably not going to be I/O limited on Apple SSDs</h3>



<p>(for a single read at a time, that is)</p>



<p>Even in the very best case shown here &#8211; and despite the <em>very</em> light workload these benchmarks impose on the actual file data &#8211; the best throughput was merely ~3.6 GB/s.  That&#8217;s not bad, of course &#8211; only a few years ago that would have easily saturated any consumer storage device, even a big Thunderbolt RAID array of SSDs.  But these days most of Apple&#8217;s computers have PCIe 4, quad-lane SSDs that have read speeds of about 7 GB/s.</p>
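
<p>The throughput arithmetic is simple enough to sketch: bytes read divided by wall time.  <code>throughputGBps</code> is a hypothetical helper, and the 56 ms input is the p50 wall time from the incremental <code>withUnsafeBytes</code> table above.</p>

```swift
/// Throughput in decimal gigabytes per second for a read of `bytes` bytes
/// that took `wallMilliseconds` of wall-clock time.
func throughputGBps(bytes: Int, wallMilliseconds: Double) -> Double {
    Double(bytes) / (wallMilliseconds / 1000) / 1_000_000_000
}

let fileSize = 128 * 1024 * 1024  // 128 MiB = 134,217,728 bytes

// 56 ms of wall time works out to roughly 2.4 GB/s.
let example = throughputGBps(bytes: fileSize, wallMilliseconds: 56)
```

<p>Working backwards, ~3.6 GB/s corresponds to about 37 ms of wall time for the 128 MiB file.</p>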



<p><em>And</em>, this is all ignoring the fact that <em>actually</em> reading from an SSD has more overhead than merely reading from the kernel&#8217;s file system cache, as these benchmarks almost certainly did. So actual SSD read performance is very likely lower than what these benchmarks achieved.</p>



<p>That all said, it doesn&#8217;t necessarily take much to be I/O-limited &#8211; two or three concurrent reads, efficiently implemented, would probably do it.  Certainly if you just spin up an operation per CPU core and they all try to do I/O at once, even a rather inefficient implementation will hit the SSD&#8217;s limits.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://wadetregaskis.com/urlsession-performance-for-reading-a-byte-stream/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">8006</post-id>	</item>
		<item>
		<title>Swift&#8217;s native Clocks are very inefficient</title>
		<link>https://wadetregaskis.com/swifts-native-clocks-are-very-inefficient/</link>
					<comments>https://wadetregaskis.com/swifts-native-clocks-are-very-inefficient/#comments</comments>
		
		<dc:creator><![CDATA[]]></dc:creator>
		<pubDate>Fri, 03 May 2024 02:10:07 +0000</pubDate>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Benchmarked]]></category>
		<category><![CDATA[clock_gettime_nsec_np]]></category>
		<category><![CDATA[ContinuousClock]]></category>
		<category><![CDATA[gettimeofday]]></category>
		<category><![CDATA[Inefficient by design]]></category>
		<category><![CDATA[mach_absolute_time]]></category>
		<category><![CDATA[Sad]]></category>
		<category><![CDATA[SuspendingClock]]></category>
		<category><![CDATA[Swift]]></category>
		<guid isPermaLink="false">https://wadetregaskis.com/?p=7990</guid>

					<description><![CDATA[By which I mean, things like ContinuousClock and SuspendingClock. In absolute terms they don&#8217;t have much overhead &#8211; think sub-microsecond for most uses. Which makes them perfectly acceptable when they&#8217;re used sporadically (e.g. only a few times per second). However, if you need to deal with time and timing more frequently, their inefficiency can become&#8230; <a class="read-more-link" href="https://wadetregaskis.com/swifts-native-clocks-are-very-inefficient/" data-wpel-link="internal">Read more</a>]]></description>
										<content:encoded><![CDATA[
<p>By which I mean, things like <code><a href="https://developer.apple.com/documentation/swift/continuousclock" data-wpel-link="external" target="_blank" rel="external noopener">ContinuousClock</a></code> and <code><a href="https://developer.apple.com/documentation/swift/suspendingclock" data-wpel-link="external" target="_blank" rel="external noopener">SuspendingClock</a></code>.</p>



<p>In absolute terms they don&#8217;t have much overhead &#8211; think sub-microsecond for most uses. Which makes them perfectly acceptable when they&#8217;re used sporadically (e.g. only a few times per second).</p>



<p>However, if you need to deal with time and timing more frequently, their inefficiency can become a serious bottleneck.</p>



<p>I stumbled into this because of a fairly common and otherwise uninteresting pattern &#8211; throttling UI updates on an I/O operation&#8217;s progress. This might look something like:</p>



<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line"><span style="color: #0000FF">struct</span><span style="color: #000000"> </span><span style="color: #267F99">Example</span><span style="color: #000000">: View {</span></span>
<span class="line"><span style="color: #000000">    </span><span style="color: #0000FF">let</span><span style="color: #000000"> bytes: AsyncSequence&lt;</span><span style="color: #267F99">UInt8</span><span style="color: #000000">&gt;</span></span>
<span class="line"></span>
<span class="line"><span style="color: #000000">    </span><span style="color: #0000FF">@State</span><span style="color: #000000"> </span><span style="color: #0000FF">var</span><span style="color: #000000"> byteCount = </span><span style="color: #098658">0</span></span>
<span class="line"></span>
<span class="line"><span style="color: #000000">    </span><span style="color: #0000FF">var</span><span style="color: #000000"> body: some View {</span></span>
<span class="line"><span style="color: #000000">        </span><span style="color: #795E26">Text</span><span style="color: #000000">(</span><span style="color: #A31515">&quot;Bytes so far: </span><span style="color: #0000FF">\(</span><span style="color: #000000FF">byteCount.</span><span style="color: #795E26">formatted</span><span style="color: #000000FF">(.</span><span style="color: #795E26">byteCount</span><span style="color: #000000FF">(</span><span style="color: #795E26">style</span><span style="color: #000000FF">: .</span><span style="color: #001080">binary</span><span style="color: #000000FF">))</span><span style="color: #0000FF">)</span><span style="color: #A31515">&quot;</span><span style="color: #000000">)</span></span>
<span class="line"><span style="color: #000000">            .</span><span style="color: #001080">task</span><span style="color: #000000"> {</span></span>
<span class="line"><span style="color: #000000">                </span><span style="color: #0000FF">var</span><span style="color: #000000"> unpostedByteCount = </span><span style="color: #098658">0</span></span>
<span class="line"><span style="color: #000000">                </span><span style="color: #0000FF">let</span><span style="color: #000000"> clock = </span><span style="color: #795E26">ContinuousClock</span><span style="color: #000000">()</span></span>
<span class="line"><span style="color: #000000">                </span><span style="color: #0000FF">var</span><span style="color: #000000"> lastUpdate = clock.</span><span style="color: #001080">now</span></span>
<span class="line"></span>
<span class="line"><span style="color: #000000">                </span><span style="color: #AF00DB">for</span><span style="color: #000000"> </span><span style="color: #AF00DB">try</span><span style="color: #000000"> </span><span style="color: #AF00DB">await</span><span style="color: #000000"> byte </span><span style="color: #AF00DB">in</span><span style="color: #000000"> bytes {</span></span>
<span class="line"><span style="color: #000000">                    … </span><span style="color: #008000">// Do something with the byte.</span></span>
<span class="line"></span>
<span class="line"><span style="color: #000000">                    unpostedByteCount += </span><span style="color: #098658">1</span></span>
<span class="line"></span>
<span class="line"><span style="color: #000000">                    </span><span style="color: #0000FF">let</span><span style="color: #000000"> now = clock.</span><span style="color: #001080">now</span></span>
<span class="line"><span style="color: #000000">                    </span><span style="color: #0000FF">let</span><span style="color: #000000"> delta = now - lastUpdate</span></span>
<span class="line"></span>
<span class="line"><span style="color: #000000">                    </span><span style="color: #AF00DB">if</span><span style="color: #000000"> (    delta &gt; .</span><span style="color: #795E26">seconds</span><span style="color: #000000">(</span><span style="color: #098658">1</span><span style="color: #000000">)</span></span>
<span class="line"><span style="color: #000000">                         || (    (delta &gt; .</span><span style="color: #795E26">milliseconds</span><span style="color: #000000">(</span><span style="color: #098658">100</span><span style="color: #000000">)</span></span>
<span class="line"><span style="color: #000000">                              &amp;&amp; </span><span style="color: #098658">1_000_000</span><span style="color: #000000"> &lt;= unpostedByteCount))) {</span></span>
<span class="line"><span style="color: #000000">                        byteCount += unpostedByteCount</span></span>
<span class="line"><span style="color: #000000">                        unpostedByteCount = </span><span style="color: #098658">0</span></span>
<span class="line"><span style="color: #000000">                        lastUpdate = now</span></span>
<span class="line"><span style="color: #000000">                    }</span></span>
<span class="line"><span style="color: #000000">                }</span></span>
<span class="line"><span style="color: #000000">            }</span></span>
<span class="line"><span style="color: #000000">    }</span></span>
<span class="line"><span style="color: #000000">}</span></span></code></pre></div>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<p>☝️ This isn&#8217;t a complete implementation, as it won&#8217;t update the byte count if the download stalls (since the lack of incoming bytes will mean no iteration on the loop, and therefore no updates even if a full second passes). But it&#8217;s sufficient for demonstration purposes here.</p>



<p>🖐️ Why didn&#8217;t I just use <code><a href="https://github.com/apple/swift-async-algorithms/blob/main/Sources/AsyncAlgorithms/AsyncAlgorithms.docc/Guides/Throttle.md" data-wpel-link="external" target="_blank" rel="external noopener">throttle</a></code> from <a href="https://github.com/apple/swift-async-algorithms" data-wpel-link="external" target="_blank" rel="external noopener">swift-async-algorithms</a>? I did, at first, and quickly discovered that its performance is <em>horrible</em>. While I do suspect I can &#8216;optimise&#8217; it to not be atrocious, I haven&#8217;t pursued that as it was easier to just write my own throttling system.</p>
</div></div>



<p>The above seems fairly straightforward, but if you run it and have any non-trivial I/O rate &#8211; even just a few hundred kilobytes per second &#8211; you&#8217;ll find that it saturates an entire CPU core, not just wasting CPU time but limiting the I/O rate severely.</p>



<p>Using a <code>SuspendingClock</code> makes no difference.</p>



<p>In a nutshell, the problem is that Swift&#8217;s <code><a href="https://developer.apple.com/documentation/swift/clock" data-wpel-link="external" target="_blank" rel="external noopener">Clock</a></code> protocol has significant overheads by design<sup data-fn="2f4a7c64-e213-44df-a3da-0e5020545aad" class="fn"><a href="#2f4a7c64-e213-44df-a3da-0e5020545aad" id="2f4a7c64-e213-44df-a3da-0e5020545aad-link">1</a></sup>. If you look at a time profile of code like this, you&#8217;ll see things like:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img fetchpriority="high" decoding="async" width="900" height="716" src="https://wadetregaskis.com/wp-content/uploads/2024/05/ContinuousClock-overhead.webp" alt="Screenshot of Instruments showing the outline view for a Time Profile, expanded to show dozens of spurious, overhead functions taking up the vast majority of the runtime." class="wp-image-7991" srcset="https://wadetregaskis.com/wp-content/uploads/2024/05/ContinuousClock-overhead.webp 900w, https://wadetregaskis.com/wp-content/uploads/2024/05/ContinuousClock-overhead-256x204.webp 256w, https://wadetregaskis.com/wp-content/uploads/2024/05/ContinuousClock-overhead-768x611.webp 768w, https://wadetregaskis.com/wp-content/uploads/2024/05/ContinuousClock-overhead@2x.webp 1800w, https://wadetregaskis.com/wp-content/uploads/2024/05/ContinuousClock-overhead-256x204@2x.webp 512w" sizes="(max-width: 900px) 100vw, 900px" /></figure>
</div>


<p>That&#8217;s a lot of time wasted in function calls and struct initialisation and type conversion and protocol witnesses and all that guff. The only part that&#8217;s <em>actually</em> retrieving the time is the <code><a href="https://github.com/apple/swift/blob/625436af05b1cf8f1904096530235489daec9dac/stdlib/public/Concurrency/Clock.cpp#L30" data-wpel-link="external" target="_blank" rel="external noopener">swift_get_time</a></code> call (which is just a wrapper over <code><a href="https://www.manpagez.com/man/3/clock_gettime/" data-wpel-link="external" target="_blank" rel="external noopener">clock_gettime</a></code>, which is just a wrapper over <code><a href="https://www.manpagez.com/man/3/clock_gettime_nsec_np/" data-wpel-link="external" target="_blank" rel="external noopener">clock_gettime_nsec_np</a>(CLOCK_UPTIME_RAW)</code>, which is just a wrapper over <code><a href="https://developer.apple.com/documentation/kernel/1462446-mach_absolute_time" data-wpel-link="external" target="_blank" rel="external noopener">mach_absolute_time</a></code>).</p>



<p>I wrote <a href="https://github.com/wadetregaskis/Swift-Benchmarks/blob/main/Benchmarks/Clocks/Clocks.swift" data-wpel-link="external" target="_blank" rel="external noopener">some simple benchmarks of various alternative time-tracking methods</a>, with these results with Swift 5.10 (showing the median runtime of the benchmark, which is a million iterations of checking the time):</p>



<figure class="wp-block-table aligncenter"><table><thead><tr><th class="has-text-align-right" data-align="right">Method</th><th class="has-text-align-center" data-align="center">10-core iMac Pro</th><th class="has-text-align-center" data-align="center">M2 MacBook Air</th></tr></thead><tbody><tr><td class="has-text-align-right" data-align="right"><code><a href="https://developer.apple.com/documentation/swift/continuousclock" data-wpel-link="external" target="_blank" rel="external noopener">ContinuousClock</a></code></td><td class="has-text-align-center" data-align="center">429 ms</td><td class="has-text-align-center" data-align="center">258 ms</td></tr><tr><td class="has-text-align-right" data-align="right"><code><a href="https://developer.apple.com/documentation/swift/suspendingclock" data-wpel-link="external" target="_blank" rel="external noopener">SuspendingClock</a></code></td><td class="has-text-align-center" data-align="center">430 ms</td><td class="has-text-align-center" data-align="center">247 ms</td></tr><tr><td class="has-text-align-right" data-align="right"><code><a href="https://developer.apple.com/documentation/foundation/date" data-wpel-link="external" target="_blank" rel="external noopener">Date</a></code></td><td class="has-text-align-center" data-align="center">30 ms</td><td class="has-text-align-center" data-align="center">19 ms</td></tr><tr><td class="has-text-align-right" data-align="right"><code><a href="https://www.manpagez.com/man/3/clock_gettime_nsec_np/" data-wpel-link="external" target="_blank" rel="external noopener">clock_gettime_nsec_np(CLOCK_MONOTONIC_RAW)</a></code></td><td class="has-text-align-center" data-align="center">32 ms</td><td class="has-text-align-center" data-align="center">10 ms</td></tr><tr><td class="has-text-align-right" data-align="right"><code><a href="https://www.manpagez.com/man/3/clock_gettime_nsec_np/" data-wpel-link="external" target="_blank" rel="external 
noopener">clock_gettime_nsec_np(CLOCK_UPTIME_RAW)</a></code></td><td class="has-text-align-center" data-align="center">27 ms</td><td class="has-text-align-center" data-align="center">10 ms</td></tr><tr><td class="has-text-align-right" data-align="right"><code><a href="https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/gettimeofday.2.html" data-wpel-link="external" target="_blank" rel="external noopener">gettimeofday</a></code></td><td class="has-text-align-center" data-align="center">24 ms</td><td class="has-text-align-center" data-align="center">12 ms</td></tr><tr><td class="has-text-align-right" data-align="right"><code><a href="https://developer.apple.com/documentation/kernel/1462446-mach_absolute_time" data-wpel-link="external" target="_blank" rel="external noopener">mach_absolute_time</a></code></td><td class="has-text-align-center" data-align="center">15 ms</td><td class="has-text-align-center" data-align="center">6 ms</td></tr></tbody></table></figure>



<p>All these alternative methods are <em>well</em> over an order of magnitude faster than Swift&#8217;s native clock APIs, showing just how dreadfully inefficient the Swift <code>Clock</code> API is.</p>
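


<p>For a rough sense of how a comparison like this can be made, here&#8217;s a minimal sketch of a timing loop. It is <em>not</em> the code from the linked benchmarks; the <code>measure</code> helper and its labels are hypothetical, and it assumes macOS 13 or later (for <code>ContinuousClock</code>) and an optimised build.</p>

<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line">import Foundation</span>
<span class="line"></span>
<span class="line">// Hypothetical helper, not taken from the linked benchmarks: calls `check`</span>
<span class="line">// `iterations` times and reports the total elapsed time.</span>
<span class="line">@discardableResult</span>
<span class="line">func measure(_ label: String,</span>
<span class="line">             iterations: Int = 1_000_000,</span>
<span class="line">             _ check: () -&gt; Bool) -&gt; (nanoseconds: UInt64, hits: Int) {</span>
<span class="line">    var hits = 0</span>
<span class="line">    let start = clock_gettime_nsec_np(CLOCK_UPTIME_RAW)</span>
<span class="line"></span>
<span class="line">    for _ in 0..&lt;iterations {</span>
<span class="line">        if check() { hits &amp;+= 1 } // Use the result so the call is not optimised out.</span>
<span class="line">    }</span>
<span class="line"></span>
<span class="line">    let elapsed = clock_gettime_nsec_np(CLOCK_UPTIME_RAW) - start</span>
<span class="line">    print(&quot;\(label): \(Double(elapsed) / 1_000_000) ms&quot;)</span>
<span class="line">    return (elapsed, hits)</span>
<span class="line">}</span>
<span class="line"></span>
<span class="line">let clock = ContinuousClock()</span>
<span class="line">let clockStart = clock.now</span>
<span class="line">measure(&quot;ContinuousClock&quot;) {</span>
<span class="line">    let delta = clock.now - clockStart</span>
<span class="line">    return delta &gt; .seconds(1)</span>
<span class="line">}</span>
<span class="line"></span>
<span class="line">let dateStart = Date()</span>
<span class="line">measure(&quot;Date&quot;) { dateStart.timeIntervalSinceNow &lt; -1 }</span>
<span class="line"></span>
<span class="line">let rawStart = clock_gettime_nsec_np(CLOCK_UPTIME_RAW)</span>
<span class="line">measure(&quot;clock_gettime_nsec_np&quot;) {</span>
<span class="line">    clock_gettime_nsec_np(CLOCK_UPTIME_RAW) - rawStart &gt; 1_000_000_000</span>
<span class="line">}</span></code></pre></div>

<p>Each closure reads the time and compares it to a threshold, which is the same shape of work as the throttling example. A single run like this is noisy, so treat its output as indicative only; the table above reports medians from a proper benchmarking harness.</p>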



<h3 class="wp-block-heading">mach_absolute_time for the win</h3>



<p>Unsurprisingly, <code>mach_absolute_time</code> is the fastest. It is what all these other APIs are actually based on; it is the lowest level of the time stack.</p>



<p>The downside to calling <code>mach_absolute_time</code> <em>directly</em>, though, is that <a href="https://developer.apple.com/documentation/kernel/1462446-mach_absolute_time#discussion" data-wpel-link="external" target="_blank" rel="external noopener">it&#8217;s on Apple&#8217;s &#8220;naughty&#8221; list</a> &#8211; apparently it&#8217;s been abused for device fingerprinting, so Apple require you to beg for special permission if you want to use it (even though it&#8217;s used by all these other APIs anyway, as the basis for their implementations, and there&#8217;s nothing you can get from <code>mach_absolute_time</code> that you can&#8217;t get from them too 🤨).</p>



<h3 class="wp-block-heading"><code>Date</code> surprisingly not bad</h3>



<p>I was quite surprised to see good ol&#8217; <code><a href="https://developer.apple.com/documentation/foundation/date" data-wpel-link="external" target="_blank" rel="external noopener">Date</a></code> performing competitively with the traditional C-level APIs, at least on x86-64. Even on arm64 it&#8217;s not bad, still at a third to half the speed of the C APIs. This surprised me because <s>it has the overhead of at least one Objective-C message send (for <code><a href="https://developer.apple.com/documentation/foundation/date/1780473-timeintervalsincenow" data-wpel-link="external" target="_blank" rel="external noopener">timeIntervalSinceNow</a></code>), unless somehow the Swift compiler is optimising that into a static function call, or inlining it entirely…?</s></p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<p><strong>Update</strong>: I later looked at the disassembly, and found no message sends, only a plain function call to <code>Foundation.Date.timeIntervalSinceNow.getter</code> (which is only 40 instructions, on arm64, over <code>clock_gettime</code> and <code>__stack_chk_fail</code> &#8211; and the former is hundreds of instructions, so it&#8217;s adding relatively little overhead to the C API).</p>



<p>This isn&#8217;t being done by the compiler, it&#8217;s because <a href="https://github.com/apple/swift-foundation/blob/main/Sources/FoundationEssentials/Date.swift" data-wpel-link="external" target="_blank" rel="external noopener">that&#8217;s <em>actually</em> how it&#8217;s implemented in Foundation</a>. I keep forgetting that Foundation from Swift is no longer just the old Objective-C Foundation, but rather mostly the <em>new</em> Foundation that&#8217;s written in native Swift. So these performance results likely don&#8217;t apply once you go back far enough in Apple OS releases (to when Swift really was calling into the Objective-C code for <code>NSDate</code>) &#8211; but it&#8217;s safe to rely on good <code>Date</code> performance now and in future.</p>
</div></div>



<p>I certainly wouldn&#8217;t be afraid to use <code>Date</code> broadly, going down to lower APIs only when truly necessary &#8211; which is pretty rarely, I&#8217;d wager; we&#8217;re talking a mere 19 to 30 <em>nanoseconds</em> to get the time elapsed since a reference date <em>and</em> compare it to a threshold. If that&#8217;s too slow, it might be an indication that there&#8217;s a bigger problem (like transferring data a single byte at a time, as in the example that started this post &#8211; but more on that in <a href="https://wadetregaskis.com/urlsession-performance-for-reading-a-byte-stream/" data-wpel-link="internal">the next post</a>).</p>
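


<p>As an illustration, here&#8217;s a minimal sketch of the earlier throttling logic rebuilt on <code>Date</code>. The <code>ProgressThrottle</code> type is hypothetical (it&#8217;s not from any framework); its thresholds simply mirror the example at the top of this post, and the <code>now</code> parameters exist only so the logic can be exercised without waiting on the real clock.</p>

<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line">import Foundation</span>
<span class="line"></span>
<span class="line">/// Hypothetical helper: decides when accumulated progress is worth posting to the UI.</span>
<span class="line">struct ProgressThrottle {</span>
<span class="line">    private(set) var unpostedByteCount = 0</span>
<span class="line">    private(set) var lastUpdate: Date</span>
<span class="line"></span>
<span class="line">    init(now: Date = Date()) {</span>
<span class="line">        lastUpdate = now</span>
<span class="line">    }</span>
<span class="line"></span>
<span class="line">    /// Records `count` newly received bytes. Returns the number of bytes to add</span>
<span class="line">    /// to the displayed total if an update is due, otherwise nil.</span>
<span class="line">    mutating func record(_ count: Int, now: Date = Date()) -&gt; Int? {</span>
<span class="line">        unpostedByteCount += count</span>
<span class="line"></span>
<span class="line">        let delta = now.timeIntervalSince(lastUpdate) // Seconds.</span>
<span class="line"></span>
<span class="line">        // Same rule as the original example: at least once per second, or every</span>
<span class="line">        // 100 ms if a megabyte or more is waiting to be reported.</span>
<span class="line">        guard delta &gt; 1 || (delta &gt; 0.1 &amp;&amp; unpostedByteCount &gt;= 1_000_000) else {</span>
<span class="line">            return nil</span>
<span class="line">        }</span>
<span class="line"></span>
<span class="line">        let posted = unpostedByteCount</span>
<span class="line">        unpostedByteCount = 0</span>
<span class="line">        lastUpdate = now</span>
<span class="line">        return posted</span>
<span class="line">    }</span>
<span class="line">}</span></code></pre></div>

<p>Inside the loop you&#8217;d then write something like <code>if let posted = throttle.record(1) { byteCount += posted }</code>. Because <code>record</code> takes a count, the same helper works unchanged if you later switch from individual bytes to chunks.</p>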



<hr class="wp-block-separator has-alpha-channel-opacity is-style-dots"/>



<h3 class="wp-block-heading">Follow-up</h3>



<p>This post <a href="https://news.ycombinator.com/item?id=40262897" data-wpel-link="external" target="_blank" rel="external noopener">got some attention on HackerNews</a>. Pleasingly, the comments there were almost all well-intentioned and interesting. It&#8217;s a bit beyond me to try to address all of them, but a few in particular raised good points that I would like to answer / clarify:</p>



<ul class="wp-block-list">
<li>A lot of folks were curious about <code>mach_absolute_time</code> being on Apple&#8217;s naughty list. I don&#8217;t know for sure why it is either, but I think it&#8217;s very likely that it&#8217;s <em>primarily</em> because it essentially provides a reference time point, that&#8217;s very precise and pretty unique between computers. It&#8217;s not the boot time necessarily &#8211; because the timer pauses whenever the system is put to sleep &#8211; but even so it provides a simple way to nearly if not exactly identify an individual machine session (between boots &amp; sleeps). It probably wouldn&#8217;t take many other fingerprinting data points to reliably pin-point a specific machine.<br><br>Secondarily, because it provides very precise timing capabilities (e.g. nanosecond-resolution on x86), it could possibly be a key component of <a href="https://en.wikipedia.org/wiki/Timing_attack" data-wpel-link="external" target="_blank" rel="external noopener">timing attacks</a> and broader device fingerprinting based on timing information (e.g. measuring how long it takes to perform an otherwise innocuous operation).<br><br>That all said, the only difference between it and some of the higher-level APIs wrapping it is their overhead. And it&#8217;s not apparent to me that merely making the &#8220;get-time&#8221; functionality 2x slower is going to magically mitigate all the above concerns, especially when we&#8217;re still talking just a few nanoseconds.</li>



<li>Admittedly my phrasing regarding Apple&#8217;s policies on <code>mach_absolute_time</code> &#8211; &#8220;beg for permission to use it&#8221; &#8211; is a little melodramatic. It&#8217;s revealing something of my personal opinions on certain Apple &#8220;security&#8221; practices. I love that Apple genuinely care about protecting everyone&#8217;s privacy, but sometimes I chafe at what feels like capricious or impractical specific policies.<br><br>In this particular case, it&#8217;s not apparent to me why this sort of protection is needed for <em>native</em> apps. In a web browser, sure, you&#8217;re running untrustworthy, essentially arbitrary code from all over the place, a <em>lot</em> of which is openly malicious (thanks, Google &amp; Facebook, for your pervasive trackers &#8211; fuck you too). But a native app &#8211; or heck, even a dodgy non-native one like an Electron app &#8211; must be explicitly installed by the end user, among other barriers like code signing.</li>



<li>A few folks looked at the example case, of iterating a single byte at a time, and were suspicious of how performant that could possibly be anyway. This is a very fair reaction &#8211; it&#8217;s my ingrained instinct as well, from years of C/C++/Objective-C &#8211; <em>but</em> it&#8217;s relying on a few outdated assumptions. <a href="https://wadetregaskis.com/urlsession-performance-for-reading-a-byte-stream/" data-wpel-link="internal">My next post</a> already covered this for the most part, but in short here:<br><br>Through inlining, that code basically optimises down to an outer loop that fetches a new <em>chunk</em> of data (a pointer &amp; length) plus an inner loop to iterate over that as direct memory access. The chunks are typically tens of kilobytes to megabytes, in my experience (depending on the source, e.g. network vs local storage, and the buffer sizes chosen by Apple&#8217;s framework code). So it actually is quite performant and essentially what you&#8217;d conventionally write in a file descriptor read loop. <em>If and when</em> it happens to optimise correctly. That&#8217;s the major caveat &#8211; sometimes the Swift compiler fails to properly optimise code like this, and then indeed the performance can really suck. But for simple cases like in this post&#8217;s example code, the optimiser has no trouble with it.</li>



<li>Similarly, a few folks questioned the need to check the clock on <em>every</em> byte, as in the example. That&#8217;s a valid critique of this sort of code in many contexts, and I concur that where possible one <em>should</em> try to be smarter about such things &#8211; i.e. use sequences of bunches of bytes, not sequences of individual bytes.  <a href="https://wadetregaskis.com/urlsession-performance-for-reading-a-byte-stream/" data-wpel-link="internal">e.g. with <code>URLSession</code> you can</a>, and indeed it is faster to do it smarter like that.  But, you <em>can</em> get acceptable real-world performance with this code, even in high-throughput cases, and it&#8217;s relatively simple and intuitive to write, so it&#8217;s not uncommon or necessarily unreasonable.<br><br>In addition, sometimes you&#8217;re at the mercy of the APIs available &#8211; e.g. sometimes you can <em>only</em> get an <code>AsyncSequence&lt;UInt8&gt;</code>. If you don&#8217;t care about complete accuracy, you can do things like only considering UI updates every N bytes. You&#8217;ll save CPU time and nobody will notice the difference for small enough N on a fast enough iteration, but if those prerequisites aren&#8217;t met you might read e.g. N-1 bytes and then hit a long pause, during which time you <em>have</em> the extra N-1 bytes in hand but you&#8217;re not showing as such in your UI.</li>



<li>Some folks noted that there are a <em>lot</em> of other clock APIs from Apple&#8217;s frameworks, like <code><a href="https://developer.apple.com/documentation/dispatch/dispatchtime" data-wpel-link="external" target="_blank" rel="external noopener">DispatchTime</a></code> and <code><a href="https://developer.apple.com/documentation/quartzcore/1395996-cacurrentmediatime" data-wpel-link="external" target="_blank" rel="external noopener">CACurrentMediaTime</a></code>. I didn&#8217;t include those in the benchmark because I just didn&#8217;t think of them at the time. If anyone wants to send me a pull request adding them to <a href="https://github.com/wadetregaskis/Swift-Benchmarks/blob/main/Benchmarks/Clocks/Clocks.swift" data-wpel-link="external" target="_blank" rel="external noopener">the code</a>, I&#8217;d be very happy to accept it.<br><br>I haven&#8217;t checked all those other APIs specifically, but I can pretty much guarantee they&#8217;re all built on <code>mach_absolute_time</code> too (possibly via one or more of the other C APIs already covered in this post). In fact those two examples just mentioned are explicitly documented as using <code>mach_absolute_time</code>.</li>



<li><a href="https://news.ycombinator.com/user?id=Kallikrates" data-wpel-link="external" target="_blank" rel="external noopener">Kallikrates</a> quietly pointed to a very interesting recent change in Apple&#8217;s Swift standard library code, <a href="https://github.com/apple/swift/pull/73429" data-wpel-link="external" target="_blank" rel="external noopener">Make static [milli/micro/nano]seconds members on Duration inlinable</a>. It&#8217;s paired with <a href="https://github.com/apple/swift/pull/73419" data-wpel-link="external" target="_blank" rel="external noopener">another patch</a>, and together they seem very specifically aimed at eliminating some of the absurd overhead in Swift&#8217;s <code>ContinuousClock</code> &amp; <code>SuspendingClock</code> implementations. The timing is a bit interesting &#8211; I don&#8217;t know if they were prompted by this post, but it&#8217;d be an unlikely coincidence otherwise.<br><br>In any case, I suspect it is possible to eliminate the overheads &#8211; there&#8217;s no apparent reason why they can&#8217;t be at least as efficient as <code>Date</code> already is &#8211; and so I hope that is what&#8217;s happening. Hopefully I&#8217;ll be able to re-run these benchmarks in a few months, with Swift 6, and see the performance gap eliminated. 🤞</li>
</ul>
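


<p>The &#8220;only consider UI updates every N bytes&#8221; idea from the list above can be sketched as follows. <code>EveryNth</code> is a hypothetical helper, and the interval of 4,096 is an arbitrary choice for illustration, not a recommendation.</p>

<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line">/// Hypothetical helper: returns true once every `interval` calls, so that the</span>
<span class="line">/// clock only needs to be consulted on those calls.</span>
<span class="line">struct EveryNth {</span>
<span class="line">    let interval: Int</span>
<span class="line">    private var remaining: Int</span>
<span class="line"></span>
<span class="line">    init(_ interval: Int) {</span>
<span class="line">        precondition(interval &gt; 0, &quot;interval must be positive&quot;)</span>
<span class="line">        self.interval = interval</span>
<span class="line">        self.remaining = interval</span>
<span class="line">    }</span>
<span class="line"></span>
<span class="line">    mutating func tick() -&gt; Bool {</span>
<span class="line">        remaining -= 1</span>
<span class="line">        guard remaining == 0 else { return false }</span>
<span class="line">        remaining = interval</span>
<span class="line">        return true</span>
<span class="line">    }</span>
<span class="line">}</span>
<span class="line"></span>
<span class="line">// Usage: only look at the clock on every 4,096th byte.</span>
<span class="line">var gate = EveryNth(4_096)</span>
<span class="line">var clockChecks = 0</span>
<span class="line"></span>
<span class="line">for _ in 0..&lt;10_000 { // Stand-in for iterating the real byte sequence.</span>
<span class="line">    if gate.tick() {</span>
<span class="line">        clockChecks += 1 // Here you would read the time and maybe update the UI.</span>
<span class="line">    }</span>
<span class="line">}</span>
<span class="line"></span>
<span class="line">print(clockChecks) // Prints 2 (after bytes 4,096 and 8,192).</span></code></pre></div>

<p>The trade-off is the one already described: if the stream stalls partway through an interval, up to N-1 received bytes go unreported until more data arrives.</p>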


<ol class="wp-block-footnotes"><li id="2f4a7c64-e213-44df-a3da-0e5020545aad">One might quibble with the &#8220;by design&#8221; assertion.  What I mean is that because it uses a protocol it&#8217;s susceptible to significant overheads &#8211; as is seen in these benchmarks &#8211; and because its internal implementation (a private <code>_Int128</code> type, inside the standard library) is kept hidden, it limits the compiler&#8217;s ability to inline, which is in turn critical to eliminating what&#8217;s technically a lot of boilerplate.  In contrast, if it were simply a struct using only public types internally, it would have avoided most of these overheads and been more amenable to inlining.<br><br>It&#8217;s not an irredeemable design (I think) &#8211; and that&#8217;s what the <a href="https://github.com/apple/swift/pull/73429" data-wpel-link="external" target="_blank" rel="external noopener">recent</a> <a href="https://github.com/apple/swift/pull/73419" data-wpel-link="external" target="_blank" rel="external noopener">patches</a> seem to be banking on, by tweaking the design in order to allow inlining and thus hopefully eliminate almost all the overhead. <a href="#2f4a7c64-e213-44df-a3da-0e5020545aad-link" aria-label="Jump to footnote reference 1">↩︎</a></li></ol>]]></content:encoded>
					
					<wfw:commentRss>https://wadetregaskis.com/swifts-native-clocks-are-very-inefficient/feed/</wfw:commentRss>
			<slash:comments>13</slash:comments>
		
		
			<media:content url="https://wadetregaskis.com/wp-content/uploads/2024/05/ContinuousClock-overhead.webp" medium="image" />
<post-id xmlns="com-wordpress:feed-additions:1">7990</post-id>	</item>
		<item>
		<title>Collection enumeration performance in Swift</title>
		<link>https://wadetregaskis.com/collection-enumeration-performance-in-swift/</link>
					<comments>https://wadetregaskis.com/collection-enumeration-performance-in-swift/#comments</comments>
		
		<dc:creator><![CDATA[]]></dc:creator>
		<pubDate>Wed, 08 Nov 2023 07:10:08 +0000</pubDate>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Benchmarked]]></category>
		<category><![CDATA[Compiler optimisations]]></category>
		<category><![CDATA[Functional programming style]]></category>
		<category><![CDATA[Imperative programming style]]></category>
		<category><![CDATA[Programming style]]></category>
		<category><![CDATA[Swift]]></category>
		<category><![CDATA[Swift Collection]]></category>
		<category><![CDATA[Swift Sequence]]></category>
		<guid isPermaLink="false">https://blog.wadetregaskis.com/?p=5283</guid>

					<description><![CDATA[Swift&#8217;s Collection and Sequence protocols provide two primary ways to enumerate (filter, map, reduce, etc): functional-style and imperatively. For example: Or: Nominally these are equivalent &#8211; they&#8217;ll produce the same results for all correctly-implemented Collections and Sequences. So in principle which you use is purely a matter of stylistic preference. But is it? Do they&#8230; <a class="read-more-link" href="https://wadetregaskis.com/collection-enumeration-performance-in-swift/" data-wpel-link="internal">Read more</a>]]></description>
										<content:encoded><![CDATA[
<p>Swift&#8217;s <code>Collection</code> and <code>Sequence</code> protocols provide two primary ways to enumerate (filter, map, reduce, etc): functional-style and imperatively.  For example:</p>



<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line"><span style="color: #0000FF">let</span><span style="color: #000000"> result = data</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">filter</span><span style="color: #000000"> { </span><span style="color: #098658">0</span><span style="color: #000000"> != </span><span style="color: #0000FF">$0</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">map</span><span style="color: #000000"> { </span><span style="color: #0000FF">$0</span><span style="color: #000000"> * </span><span style="color: #0000FF">$0</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">reduce</span><span style="color: #000000">(</span><span style="color: #795E26">into</span><span style="color: #000000">: </span><span style="color: #098658">0</span><span style="color: #000000">, &amp;+=)</span></span></code></pre></div>



<p>Or:</p>



<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line"><span style="color: #0000FF">var</span><span style="color: #000000"> result = </span><span style="color: #098658">0</span></span>
<span class="line"></span>
<span class="line"><span style="color: #AF00DB">for</span><span style="color: #000000"> value </span><span style="color: #AF00DB">in</span><span style="color: #000000"> data {</span></span>
<span class="line"><span style="color: #000000">    </span><span style="color: #AF00DB">if</span><span style="color: #000000"> </span><span style="color: #098658">0</span><span style="color: #000000"> != value {</span></span>
<span class="line"><span style="color: #000000">        result &amp;+= value * value</span></span>
<span class="line"><span style="color: #000000">    }</span></span>
<span class="line"><span style="color: #000000">}</span></span></code></pre></div>



<p>Nominally these are equivalent &#8211; they&#8217;ll produce the same results for all correctly-implemented <code>Collection</code>s and <code>Sequence</code>s.  So in principle which you use is purely a matter of stylistic preference.</p>



<p>But is it?</p>



<p>Do they actually <em>perform</em> equivalently?</p>



<p>Let&#8217;s examine an example that&#8217;s a <em>little</em> more involved than the above snippets, but still fundamentally pretty straightforward.  The extra processing steps are to help distinguish any performance differences.</p>



<p>The pertinent parts are:</p>



<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line"><span style="color: #000000">testData.</span><span style="color: #001080">next</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">filter</span><span style="color: #000000"> { </span><span style="color: #098658">0</span><span style="color: #000000"> != </span><span style="color: #0000FF">$0</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">map</span><span style="color: #000000"> { </span><span style="color: #0000FF">$0</span><span style="color: #000000">.</span><span style="color: #001080">byteSwapped</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">filter</span><span style="color: #000000"> { (</span><span style="color: #0000FF">$0</span><span style="color: #000000"> &amp; </span><span style="color: #098658">0xff00</span><span style="color: #000000">) &gt;&gt; </span><span style="color: #098658">8</span><span style="color: #000000"> &lt; </span><span style="color: #0000FF">$0</span><span style="color: #000000"> &amp; </span><span style="color: #098658">0xff</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">map</span><span style="color: #000000"> { </span><span style="color: #0000FF">$0</span><span style="color: #000000">.</span><span style="color: #001080">leadingZeroBitCount</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">filter</span><span style="color: #000000"> { </span><span style="color: #267F99">Int</span><span style="color: #000000">.</span><span style="color: #001080">bitWidth</span><span style="color: #000000"> - </span><span style="color: #098658">8</span><span style="color: #000000"> &gt;= </span><span style="color: #0000FF">$0</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">reduce</span><span style="color: #000000">(</span><span style="color: #795E26">into</span><span style="color: #000000">: </span><span style="color: #098658">0</span><span style="color: #000000">, &amp;+=)</span></span></code></pre></div>



<p>And the imperative equivalent:</p>



<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line"><span style="color: #0000FF">var</span><span style="color: #000000"> result = </span><span style="color: #098658">0</span></span>
<span class="line"></span>
<span class="line"><span style="color: #AF00DB">for</span><span style="color: #000000"> value </span><span style="color: #AF00DB">in</span><span style="color: #000000"> testData.</span><span style="color: #001080">next</span><span style="color: #000000"> {</span></span>
<span class="line"><span style="color: #000000">    </span><span style="color: #AF00DB">if</span><span style="color: #000000"> </span><span style="color: #098658">0</span><span style="color: #000000"> != value {</span></span>
<span class="line"><span style="color: #000000">        </span><span style="color: #0000FF">let</span><span style="color: #000000"> value = value.</span><span style="color: #001080">byteSwapped</span></span>
<span class="line"></span>
<span class="line"><span style="color: #000000">        </span><span style="color: #AF00DB">if</span><span style="color: #000000"> (value &amp; </span><span style="color: #098658">0xff00</span><span style="color: #000000">) &gt;&gt; </span><span style="color: #098658">8</span><span style="color: #000000"> &lt; value &amp; </span><span style="color: #098658">0xff</span><span style="color: #000000"> {</span></span>
<span class="line"><span style="color: #000000">            </span><span style="color: #0000FF">let</span><span style="color: #000000"> value = value.</span><span style="color: #001080">leadingZeroBitCount</span></span>
<span class="line"></span>
<span class="line"><span style="color: #000000">            </span><span style="color: #AF00DB">if</span><span style="color: #000000"> </span><span style="color: #267F99">Int</span><span style="color: #000000">.</span><span style="color: #001080">bitWidth</span><span style="color: #000000"> - </span><span style="color: #098658">8</span><span style="color: #000000"> &gt;= value {</span></span>
<span class="line"><span style="color: #000000">                result &amp;+= value</span></span>
<span class="line"><span style="color: #000000">            }</span></span>
<span class="line"><span style="color: #000000">        }</span></span>
<span class="line"><span style="color: #000000">    }</span></span>
<span class="line"><span style="color: #000000">}</span></span></code></pre></div>



<p>I&#8217;ve published <a href="https://github.com/wadetregaskis/Swift-Benchmarks/blob/6566bd0c785053a3b6c9d6b7c43604ff2a636b35/Benchmarks/ArrayProcessing/ArrayProcessing.swift" data-type="link" data-id="https://github.com/wadetregaskis/Swift-Benchmarks/blob/6566bd0c785053a3b6c9d6b7c43604ff2a636b35/Benchmarks/ArrayProcessing/ArrayProcessing.swift" data-wpel-link="external" target="_blank" rel="external noopener">the full source code</a>, in case you&#8217;d like to review it further or run it yourself.</p>
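


<p>For a quick sanity check without cloning the repository, the two pipelines can be wrapped in functions and compared on some sample input.  This is only a minimal sketch: the function names <code>functionalSum</code> and <code>imperativeSum</code>, and the random <code>sample</code> array, are hypothetical placeholders rather than part of the published benchmarks.</p>



<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line">// Eager functional pipeline, as in the excerpt above.</span>
<span class="line">func functionalSum(_ data: [Int]) -&gt; Int {</span>
<span class="line">    return data</span>
<span class="line">        .filter { 0 != $0 }</span>
<span class="line">        .map { $0.byteSwapped }</span>
<span class="line">        .filter { ($0 &amp; 0xff00) &gt;&gt; 8 &lt; $0 &amp; 0xff }</span>
<span class="line">        .map { $0.leadingZeroBitCount }</span>
<span class="line">        .filter { Int.bitWidth - 8 &gt;= $0 }</span>
<span class="line">        .reduce(into: 0, &amp;+=)</span>
<span class="line">}</span>
<span class="line"></span>
<span class="line">// Imperative equivalent: a single pass with no intermediate collections.</span>
<span class="line">func imperativeSum(_ data: [Int]) -&gt; Int {</span>
<span class="line">    var result = 0</span>
<span class="line"></span>
<span class="line">    for value in data {</span>
<span class="line">        if 0 != value {</span>
<span class="line">            let value = value.byteSwapped</span>
<span class="line"></span>
<span class="line">            if (value &amp; 0xff00) &gt;&gt; 8 &lt; value &amp; 0xff {</span>
<span class="line">                let value = value.leadingZeroBitCount</span>
<span class="line"></span>
<span class="line">                if Int.bitWidth - 8 &gt;= value {</span>
<span class="line">                    result &amp;+= value</span>
<span class="line">                }</span>
<span class="line">            }</span>
<span class="line">        }</span>
<span class="line">    }</span>
<span class="line"></span>
<span class="line">    return result</span>
<span class="line">}</span>
<span class="line"></span>
<span class="line">// Hypothetical sample input: an explicit zero plus 1,000 random values.</span>
<span class="line">let sample = [0] + (0..&lt;1_000).map { _ in Int.random(in: Int.min...Int.max) }</span>
<span class="line"></span>
<span class="line">// Both styles should agree on any input.</span>
<span class="line">assert(functionalSum(sample) == imperativeSum(sample))</span></code></pre></div>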



<h2 class="wp-block-heading">How does the performance compare?</h2>



<p>On my iMac Pro (10-core Xeon W-2150B):</p>



<figure class="wp-block-table aligncenter"><table><thead><tr><th class="has-text-align-right" data-align="right">Dataset size</th><th class="has-text-align-right" data-align="right">Functional (median)</th><th class="has-text-align-right" data-align="right">Imperative (median)</th><th class="has-text-align-right" data-align="right">Performance difference</th></tr></thead><tbody><tr><td class="has-text-align-right" data-align="right">0</td><td class="has-text-align-right" data-align="right">234 ns</td><td class="has-text-align-right" data-align="right">133 ns</td><td class="has-text-align-right" data-align="right">1.67x</td></tr><tr><td class="has-text-align-right" data-align="right">32 KiB</td><td class="has-text-align-right" data-align="right">57 µs</td><td class="has-text-align-right" data-align="right">16 µs</td><td class="has-text-align-right" data-align="right">3.56x</td></tr><tr><td class="has-text-align-right" data-align="right">1 MiB</td><td class="has-text-align-right" data-align="right">1.7 ms</td><td class="has-text-align-right" data-align="right">0.5 ms</td><td class="has-text-align-right" data-align="right">3.36x</td></tr><tr><td class="has-text-align-right" data-align="right">8 MiB</td><td class="has-text-align-right" data-align="right">27 ms</td><td class="has-text-align-right" data-align="right">4.2 ms</td><td class="has-text-align-right" data-align="right">6.36x</td></tr><tr><td class="has-text-align-right" data-align="right">32 MiB</td><td class="has-text-align-right" data-align="right">147 ms</td><td class="has-text-align-right" data-align="right">17 ms</td><td class="has-text-align-right" data-align="right">8.65x</td></tr></tbody></table></figure>



<p>On my M2 MacBook Air:</p>



<figure class="wp-block-table aligncenter"><table><thead><tr><th class="has-text-align-right" data-align="right">Dataset size</th><th class="has-text-align-right" data-align="right">Functional (median)</th><th class="has-text-align-right" data-align="right">Imperative (median)</th><th class="has-text-align-right" data-align="right">Performance difference</th></tr></thead><tbody><tr><td class="has-text-align-right" data-align="right">0</td><td class="has-text-align-right" data-align="right">167 ns</td><td class="has-text-align-right" data-align="right">83 ns</td><td class="has-text-align-right" data-align="right">2.01x</td></tr><tr><td class="has-text-align-right" data-align="right">32 KiB</td><td class="has-text-align-right" data-align="right">37 µs</td><td class="has-text-align-right" data-align="right">3.6 µs</td><td class="has-text-align-right" data-align="right">10.20x</td></tr><tr><td class="has-text-align-right" data-align="right">1 MiB</td><td class="has-text-align-right" data-align="right">1,058 µs</td><td class="has-text-align-right" data-align="right">112 µs</td><td class="has-text-align-right" data-align="right">9.45x</td></tr><tr><td class="has-text-align-right" data-align="right">8 MiB</td><td class="has-text-align-right" data-align="right">12 ms</td><td class="has-text-align-right" data-align="right">0.9 ms</td><td class="has-text-align-right" data-align="right">13.23x</td></tr><tr><td class="has-text-align-right" data-align="right">32 MiB</td><td class="has-text-align-right" data-align="right">50 ms</td><td class="has-text-align-right" data-align="right">3.7 ms</td><td class="has-text-align-right" data-align="right">13.76x</td></tr></tbody></table></figure>



<p>The imperative version is <em>many</em> times faster!  And the performance difference increases as the collection size increases.  The functional version starts off not <em>super</em> terrible &#8211; at least on the same order of magnitude as the imperative version &#8211; but it tends rapidly towards being an order of magnitude slower.</p>



<p>Worse, the difference is much more pronounced on more modern CPUs, like Apple Silicon.</p>



<h3 class="wp-block-heading">What&#8217;s going on?</h3>



<p>There are a few compounding factors.</p>



<p>Smaller datasets are more likely to fit into CPU caches (the dataset sizes shown above were chosen to correspond to L1 / L1 / L2 / L3 / RAM, respectively, on my iMac Pro).  Working on data in CPU caches is by nature faster &#8211; the lower-level the cache the better &#8211; and so helps hide inefficiencies.</p>



<p>The functional version creates intermediary <code>Array</code>s to store the intermediary results of every <code>filter</code> and <code>map</code> operation.  This introduces malloc traffic, retains &amp; releases, and <em>writes to memory</em>.  The imperative version has none of that overhead &#8211; it simply reads every value in the collection once, performing the whole sequence of operations all in one go for each element, using only CPU registers (not so much as a function call, even!).</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img decoding="async" width="1358" height="352" src="https://wadetregaskis.com/wp-content/uploads/2023/11/Memory-allocations.webp" alt="Screenshot from Instruments showing the Allocations (memory usage) during the benchmarks' execution" class="wp-image-5365" style="object-fit:cover;width:679px;height:176px" srcset="https://wadetregaskis.com/wp-content/uploads/2023/11/Memory-allocations.webp 1358w, https://wadetregaskis.com/wp-content/uploads/2023/11/Memory-allocations-512x133@2x.webp 1024w, https://wadetregaskis.com/wp-content/uploads/2023/11/Memory-allocations-256x66.webp 256w, https://wadetregaskis.com/wp-content/uploads/2023/11/Memory-allocations-512x133.webp 512w" sizes="(max-width: 1358px) 100vw, 1358px" /><figcaption class="wp-element-caption">Can you tell which approach results in <em>way</em> more memory use (and is a lot slower)?</figcaption></figure>
</div>
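


<p>To make those intermediary arrays concrete, here is roughly what the eager chain from the very first example amounts to when spelled out one step at a time.  The names <code>nonZero</code> and <code>squared</code> are hypothetical, and this is a sketch of where the allocations come from rather than a description of the standard library&#8217;s actual implementation.</p>



<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line">let data = [0, 3, 0, 5]</span>
<span class="line"></span>
<span class="line">// Each eager step runs to completion and returns a brand new Array.</span>
<span class="line">let nonZero: [Int] = data.filter { 0 != $0 }  // first temporary Array: [3, 5]</span>
<span class="line">let squared: [Int] = nonZero.map { $0 * $0 }  // second temporary Array: [9, 25]</span>
<span class="line"></span>
<span class="line">// Only now are the values combined into the final result.</span>
<span class="line">let result = squared.reduce(into: 0, &amp;+=)     // 34</span></code></pre></div>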


<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Sidenote: Dataset load costs</summary>
<p>In real-world cases there&#8217;s sometimes a substantial baseline cost of reading the dataset in from memory (or disk, or the network), unless your dataset is small enough to fit in caches <em>and</em> was very recently populated there.  But even creaky old Intel machines like my iMac Pro have pretty decent memory bandwidth, such that it&#8217;s rarely the performance bottleneck unless your algorithm is quite trivial <em>and</em> well-optimised (e.g. to utilise SIMD instructions).</p>



<p>In the benchmarks I somewhat emptied the caches before each run, by alternating between two similar datasets, but this really just means that during each run it has to load in the initial dataset from one further level out (e.g. L2 instead of L1).  So these benchmarks aren&#8217;t really demonstrating the potential full cost of loading the dataset from RAM &#8211; let alone from disk or the network.</p>
</details>
</div></div>



<p>Furthermore, because the functional version is allocating those additional arrays, which take up more space, it tends to overflow caches sooner.  For example, in the 1 MiB case, instead of being able to operate entirely out of L2 on the iMac Pro, it has to fall back (at least partially) to L3.  L3 is significantly slower &#8211; higher latency &#8211; than L2, so that makes everything it does noticeably slower.  It&#8217;s worse when it no longer fits in any CPU caches and has to start going back and forth to RAM, as there&#8217;s a <em>big</em> jump in latency between CPU caches and RAM.</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Sidenote: Time per element</summary>
<p>I didn&#8217;t include it in the results table because it&#8217;s tangential, but as a quick note:  the time per element varies depending on which level of cache the execution fits within.</p>



<p>It&#8217;s a pretty consistent 4 ns (iMac Pro) / 0.9 ns (M2) in the imperative case (memory prefetching is able to keep up with the trivial linear read pattern).</p>



<p>But for the functional version it ranges from 13 ns (iMac Pro) / 9 ns (M2) for L1, up to 35 ns (iMac Pro) / 12 ns (M2) for RAM.  The memory prefetcher still does an admirable job keeping the performance relatively consistent, but it can&#8217;t completely cover up the inefficiencies.</p>



<p>The M2 scales better &#8211; suffers less of a performance impact as datasets get larger &#8211; because it has both greater memory bandwidth and <em>much</em> lower latency (especially when we get to RAM, as its RAM is in the CPU package rather than miles away across the motherboard on separate DIMMs).</p>
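


<p>Those per-element figures follow from the result tables by simple division.  A rough sketch for the 32 MiB row, assuming the dataset sizes are in bytes and every element is an 8-byte <code>Int</code> (an assumption based on the code excerpts, not something stated outright); the helper name <code>nanosecondsPerElement</code> is hypothetical.</p>



<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line">// Assumption: dataset sizes are in bytes and hold 8-byte Int elements.</span>
<span class="line">let elementCount = 32 * 1024 * 1024 / MemoryLayout&lt;Int&gt;.size  // 4,194,304 on 64-bit platforms</span>
<span class="line"></span>
<span class="line">// Converts a median run time in milliseconds into nanoseconds per element.</span>
<span class="line">func nanosecondsPerElement(medianMilliseconds: Double) -&gt; Double {</span>
<span class="line">    return medianMilliseconds * 1_000_000 / Double(elementCount)</span>
<span class="line">}</span>
<span class="line"></span>
<span class="line">let imperativeIntel = nanosecondsPerElement(medianMilliseconds: 17)   // about 4 ns</span>
<span class="line">let functionalIntel = nanosecondsPerElement(medianMilliseconds: 147)  // about 35 ns</span>
<span class="line">let imperativeM2 = nanosecondsPerElement(medianMilliseconds: 3.7)     // about 0.9 ns</span>
<span class="line">let functionalM2 = nanosecondsPerElement(medianMilliseconds: 50)      // about 12 ns</span></code></pre></div>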
</details>
</div></div>



<p>Things get <em>far worse</em> if you exceed available RAM, too.  I haven&#8217;t shown that in these results &#8211; mainly because it&#8217;s painfully time-consuming to run such benchmarks on my iMac Pro with 64 GiB of RAM &#8211; but suffice to say that once you start swapping, the performance goes <em>completely</em> down the toilet.</p>



<h2 class="wp-block-heading">So I should avoid the filter &amp; map methods?</h2>



<p>For trivially small datasets, the difference might be negligible.  Especially if you&#8217;re only using the input data once (if you&#8217;re reusing it many times over, you might want to look at caching the results anyway, or other such optimisations).</p>



<p>For non-trivial datasets, it is often wise to avoid using <code>filter</code> and <code>map</code>, at least &#8220;eagerly&#8221;.  What does that mean?  Well, there are actually <em>two</em> functional styles supported by <code>Collection</code> and <code>Sequence</code>…</p>



<h3 class="wp-block-heading">Enter lazy…</h3>



<p>By default <code>filter</code>, <code>map</code>, and other such operations are &#8220;eager&#8221; &#8211; as soon as they&#8217;re executed they enumerate their <em>entire</em> input, generate their <em>entire</em> output, and only then does execution move on to the <em>next</em> operation in the pipeline.</p>



<p>But there is an alternative &#8211; <em>lazy</em> versions of all of these.  You access them via the <code>lazy</code> property of <code>Collection</code> / <code>Sequence</code>, e.g.:</p>



<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line"><span style="color: #000000">testData.</span><span style="color: #001080">next</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #001080">lazy</span><span style="color: #000000"> </span><span style="color: #008000">// New!</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">filter</span><span style="color: #000000"> { </span><span style="color: #098658">0</span><span style="color: #000000"> != </span><span style="color: #0000FF">$0</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">map</span><span style="color: #000000"> { </span><span style="color: #0000FF">$0</span><span style="color: #000000">.</span><span style="color: #001080">byteSwapped</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">filter</span><span style="color: #000000"> { (</span><span style="color: #0000FF">$0</span><span style="color: #000000"> &amp; </span><span style="color: #098658">0xff00</span><span style="color: #000000">) &gt;&gt; </span><span style="color: #098658">8</span><span style="color: #000000"> &lt; </span><span style="color: #0000FF">$0</span><span style="color: #000000"> &amp; </span><span style="color: #098658">0xff</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">map</span><span style="color: #000000"> { </span><span style="color: #0000FF">$0</span><span style="color: #000000">.</span><span style="color: #001080">leadingZeroBitCount</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">filter</span><span style="color: #000000"> { </span><span style="color: #267F99">Int</span><span style="color: #000000">.</span><span style="color: #001080">bitWidth</span><span style="color: #000000"> - </span><span style="color: #098658">8</span><span style="color: #000000"> &gt;= </span><span style="color: #0000FF">$0</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">reduce</span><span style="color: #000000">(</span><span style="color: #795E26">into</span><span style="color: #000000">: </span><span style="color: #098658">0</span><span style="color: #000000">, &amp;+=)</span></span></code></pre></div>



<p>The <code>lazy</code> property returns a special &#8220;lazy&#8221; view of the underlying <code>Collection</code> or <code>Sequence</code>.  That view looks a lot like the original object &#8211; it has the same <code>map</code>, <code>filter</code>, etc methods &#8211; but its versions of those methods return <em>further</em> lazy views, rather than the actual results of the operation.  It doesn&#8217;t actually perform the operation until it&#8217;s strictly necessary.  And when it is necessary &#8211; such as when some code like <code>reduce</code> enumerates the results to produce a concrete value &#8211; it calculates the results on the fly, with no intermediary storage.</p>
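


<p>One way to see the difference is to log the order in which the closures run.  The sketch below is illustrative only; the <code>trace</code> function and its two closures are hypothetical names, and the pipeline is cut down to a single <code>filter</code> and <code>map</code>.  The eager version finishes all the filtering before any mapping begins, whereas the lazy version pushes each element through the whole pipeline before touching the next.</p>



<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line">// Records the order in which the closures run for a two-element input.</span>
<span class="line">func trace(lazily: Bool) -&gt; [String] {</span>
<span class="line">    var log: [String] = []</span>
<span class="line">    let input = [1, 2]</span>
<span class="line"></span>
<span class="line">    let isNonZero = { (value: Int) -&gt; Bool in</span>
<span class="line">        log.append("filter \(value)")</span>
<span class="line">        return 0 != value</span>
<span class="line">    }</span>
<span class="line">    let square = { (value: Int) -&gt; Int in</span>
<span class="line">        log.append("map \(value)")</span>
<span class="line">        return value * value</span>
<span class="line">    }</span>
<span class="line"></span>
<span class="line">    if lazily {</span>
<span class="line">        // Nothing runs until reduce starts enumerating the lazy views.</span>
<span class="line">        _ = input.lazy.filter(isNonZero).map(square).reduce(into: 0, &amp;+=)</span>
<span class="line">    } else {</span>
<span class="line">        // Each step completes, producing an Array, before the next begins.</span>
<span class="line">        _ = input.filter(isNonZero).map(square).reduce(into: 0, &amp;+=)</span>
<span class="line">    }</span>
<span class="line"></span>
<span class="line">    return log</span>
<span class="line">}</span>
<span class="line"></span>
<span class="line">print(trace(lazily: false))  // ["filter 1", "filter 2", "map 1", "map 2"]</span>
<span class="line">print(trace(lazily: true))   // ["filter 1", "map 1", "filter 2", "map 2"]</span></code></pre></div>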



<p>So, that all sounds good &#8211; should be faster, right?  Let&#8217;s see.</p>



<p>On my iMac Pro:</p>



<figure class="wp-block-table aligncenter"><table><thead><tr><th class="has-text-align-right" data-align="right">Dataset size</th><th class="has-text-align-right" data-align="right">Lazy functional (median)</th><th class="has-text-align-right" data-align="right">Imperative (median)</th><th class="has-text-align-right" data-align="right">Performance difference</th></tr></thead><tbody><tr><td class="has-text-align-right" data-align="right">0</td><td class="has-text-align-right" data-align="right">155 ns</td><td class="has-text-align-right" data-align="right">133 ns</td><td class="has-text-align-right" data-align="right">1.17x</td></tr><tr><td class="has-text-align-right" data-align="right">32 KiB</td><td class="has-text-align-right" data-align="right">15 µs</td><td class="has-text-align-right" data-align="right">16 µs</td><td class="has-text-align-right" data-align="right">0.94x</td></tr><tr><td class="has-text-align-right" data-align="right">1 MiB</td><td class="has-text-align-right" data-align="right">491 µs</td><td class="has-text-align-right" data-align="right">511 µs</td><td class="has-text-align-right" data-align="right">0.96x</td></tr><tr><td class="has-text-align-right" data-align="right">8 MiB</td><td class="has-text-align-right" data-align="right">4.1 ms</td><td class="has-text-align-right" data-align="right">4.2 ms</td><td class="has-text-align-right" data-align="right">0.98x</td></tr><tr><td class="has-text-align-right" data-align="right">32 MiB</td><td class="has-text-align-right" data-align="right">16 ms</td><td class="has-text-align-right" data-align="right">17 ms</td><td class="has-text-align-right" data-align="right">0.94x</td></tr></tbody></table></figure>



<p>A dramatic difference versus the eager style.  The lazy functional style is still slightly slower than the imperative style for <em>very</em> small collections, such as the empty one here, but it&#8217;s actually <em>slightly faster</em> for most!</p>



<p>The Swift compiler is doing a pretty amazing job in this case.  Nominally it still has to deal with a bunch of overhead &#8211; each of those lazy <code>filter</code> and <code>map</code> methods returns a lazy view object, and those objects form a logical chain, and have to call various methods on each other in order to pass data through the pipeline.  Indeed, in debug builds that&#8217;s exactly what you see in the compiled binary, and the performance is much worse.  But with the optimiser engaged, the compiler sees through all that boilerplate and eliminates it, reducing the whole thing down to a very efficient form.</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Wait, faster…?</summary>
<p>That it&#8217;s actually a tad faster is odd as in principle they should be almost identical &#8211; any optimisation the compiler can apply to the functional version should also be applicable to the imperative one, since the imperative one is basically an easier subcase where we&#8217;ve manually done the hard optimisations already.</p>



<p>Indeed looking at the machine code that the Swift compiler emits, they are <em>very</em> similar.  It&#8217;s unclear to me why there&#8217;s a reliable, measurable performance difference between them &#8211; perhaps an accidental consequence of slightly different instruction orderings and register selection.</p>
</details>
</div></div>



<p>However, on my M2 MacBook Air:</p>



<figure class="wp-block-table aligncenter"><table><thead><tr><th class="has-text-align-right" data-align="right">Dataset size</th><th class="has-text-align-right" data-align="right">Lazy functional (median)</th><th class="has-text-align-right" data-align="right">Imperative (median)</th><th class="has-text-align-right" data-align="right">Performance difference</th></tr></thead><tbody><tr><td class="has-text-align-right" data-align="right">0</td><td class="has-text-align-right" data-align="right">83 ns</td><td class="has-text-align-right" data-align="right">83 ns</td><td class="has-text-align-right" data-align="right">1.00x</td></tr><tr><td class="has-text-align-right" data-align="right">32 KiB</td><td class="has-text-align-right" data-align="right">12 µs</td><td class="has-text-align-right" data-align="right">3.6 µs</td><td class="has-text-align-right" data-align="right">3.31x</td></tr><tr><td class="has-text-align-right" data-align="right">1 MiB</td><td class="has-text-align-right" data-align="right">442 µs</td><td class="has-text-align-right" data-align="right">112 µs</td><td class="has-text-align-right" data-align="right">3.95x</td></tr><tr><td class="has-text-align-right" data-align="right">8 MiB</td><td class="has-text-align-right" data-align="right">3.6 ms</td><td class="has-text-align-right" data-align="right">0.9 ms</td><td class="has-text-align-right" data-align="right">3.94x</td></tr><tr><td class="has-text-align-right" data-align="right">32 MiB</td><td class="has-text-align-right" data-align="right">14 ms</td><td class="has-text-align-right" data-align="right">3.6 ms</td><td class="has-text-align-right" data-align="right">3.85x</td></tr></tbody></table></figure>



<p>Oh no &#8211; while the lazy functional version is several times faster than the eager functional version, it&#8217;s still <em>many</em> times slower than the imperative version (for non-empty collections).</p>



<p>It&#8217;s not entirely clear to me why this is the case; why the behaviour is so different to x86-64.  Looking at the machine code, the compiler&#8217;s optimiser has still successfully removed all the boilerplate and simplified it down to a tight loop of trivial integer operations.  It appears the difference might arise from the use of conditional instructions (for the functional version) versus branching (for the imperative version).  It&#8217;s not clear why the compiler uses different approaches for what are otherwise very similar blocks of code.  As such, the behaviour might change between compiler versions (this exploration used Swift 5.9) or cases (variations in code structure &#8211; or details &#8211; might cause the compiler to make different instruction selections).</p>



<p>Alas, explicable or not, the conclusion is clear:</p>



<figure class="wp-block-pullquote"><blockquote><p>Lazy functional-style performs better than eager functional-style, but still much worse than imperative style.</p></blockquote></figure>



<p>So my advice is to generally avoid the functional style.  Not religiously, but with moderate determination.</p>



<p>The only clear exception, where it&#8217;s okay to use the functional style, is if you&#8217;re inherently doing single operations at a time, like a simple <code>map</code> where you actually need to store the resulting <code>Array</code>.  You&#8217;ll get no benefit from using <code>lazy</code> in such cases, nor will the imperative version be meaningfully faster (usually).</p>
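


<p>As a small illustration of that exception, here is a lone <code>map</code> next to a hand-written equivalent.  The values and variable names are hypothetical; the point is only that both do a single pass and both have to allocate the one <code>Array</code> that&#8217;s actually wanted.</p>



<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line">let values = [1, 2, 3, 4]</span>
<span class="line"></span>
<span class="line">// Functional: a single eager map whose resulting Array is actually needed.</span>
<span class="line">let functional = values.map { $0 * $0 }</span>
<span class="line"></span>
<span class="line">// Imperative: the same work by hand, reserving capacity up front.</span>
<span class="line">var imperative: [Int] = []</span>
<span class="line">imperative.reserveCapacity(values.count)</span>
<span class="line"></span>
<span class="line">for value in values {</span>
<span class="line">    imperative.append(value * value)</span>
<span class="line">}</span>
<span class="line"></span>
<span class="line">assert(functional == imperative)  // both are [1, 4, 9, 16]</span></code></pre></div>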



<h3 class="wp-block-heading">Preview: Lazy considered harmful</h3>



<p>Unfortunately, in addition to still being slower than the imperative style on modern CPUs, there are several <em>further</em> aspects of lazy <code>Collection</code>s and <code>Sequence</code>s that are problematic.  I plan to dive deeper into this in a follow-up post, but here&#8217;s a teaser:</p>



<ul class="wp-block-list">
<li>In debug builds the performance is poor, because the optimiser essentially isn&#8217;t used.  Thus debugging (e.g. the regular <code>Run</code> action in Xcode) and unit testing may be slowed down significantly.</li>



<li>The optimisations only work if the compiler can see your whole pipeline.  If you start splitting things up in your code, or making things dynamic at runtime, the compiler might become unable to make the necessary optimisations.  In general if you start storing lazy <code>Collection</code>s / <code>Sequence</code>s anywhere, or returning them from properties or methods, you&#8217;re likely to miss out on the optimisations.</li>



<li>There are some serious pitfalls and sharp edges around lazy <code>Collection</code>s and <code>Sequence</code>s which can lead to them being not just <em>slower</em> than their eager brethren, but potentially dangerous (in the sense of not producing the expected results)!  Stay tuned for details.</li>
</ul>



<p></p>
]]></content:encoded>
					
					<wfw:commentRss>https://wadetregaskis.com/collection-enumeration-performance-in-swift/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5283</post-id>	</item>
		<item>
		<title>Z9 burst shooting buffer depth</title>
		<link>https://wadetregaskis.com/z9-burst-shooting-buffer-depth/</link>
					<comments>https://wadetregaskis.com/z9-burst-shooting-buffer-depth/#respond</comments>
		
		<dc:creator><![CDATA[]]></dc:creator>
		<pubDate>Fri, 04 Feb 2022 03:16:55 +0000</pubDate>
				<category><![CDATA[Photography]]></category>
		<category><![CDATA[Reviews]]></category>
		<category><![CDATA[Angelbird AV PRO]]></category>
		<category><![CDATA[Benchmarked]]></category>
		<category><![CDATA[CFExpress]]></category>
		<category><![CDATA[Lexar 2933x Professional]]></category>
		<category><![CDATA[Nikon]]></category>
		<category><![CDATA[Pergear]]></category>
		<category><![CDATA[ProGrade Gold]]></category>
		<category><![CDATA[Tested]]></category>
		<category><![CDATA[XQD]]></category>
		<category><![CDATA[Z9]]></category>
		<guid isPermaLink="false">https://blog.wadetregaskis.com/?p=4983</guid>

					<description><![CDATA[Just some basic tests with the few cards I have… <a class="read-more-link" href="https://wadetregaskis.com/z9-burst-shooting-buffer-depth/" data-wpel-link="internal">Read more</a>]]></description>
										<content:encoded><![CDATA[
<p>Just some basic tests with the few cards I have.</p>



<figure class="wp-block-table aligncenter"><table class="has-fixed-layout"><thead><tr><th class="has-text-align-center" data-align="center"></th><th class="has-text-align-center" data-align="center">Lexar 2933x 128 GiB</th><th class="has-text-align-center" data-align="center">ProGrade Gold 256 GiB</th><th class="has-text-align-center" data-align="center">Pergear 512 GiB</th><th class="has-text-align-center" data-align="center">Angelbird AV PRO 1 TiB</th></tr></thead><tbody><tr><td class="has-text-align-center" data-align="center">Type</td><td class="has-text-align-center" data-align="center">XQD</td><td class="has-text-align-center" data-align="center">CFExpress</td><td class="has-text-align-center" data-align="center">CFExpress</td><td class="has-text-align-center" data-align="center">CFExpress</td></tr><tr><td class="has-text-align-center" data-align="center">20 FPS (lossless)</td><td class="has-text-align-center" data-align="center">26 (11 &#8211; 37)</td><td class="has-text-align-center" data-align="center">40 (34 &#8211; 43)</td><td class="has-text-align-center" data-align="center">36 (36 &#8211; 37)</td><td class="has-text-align-center" data-align="center">37 (37 &#8211; 37)</td></tr><tr><td class="has-text-align-center" data-align="center">20 FPS (HE*)</td><td class="has-text-align-center" data-align="center">60 (57 &#8211; 61)</td><td class="has-text-align-center" data-align="center">60 (49 &#8211; 77)</td><td class="has-text-align-center" data-align="center">60 (59 &#8211; 61)</td><td class="has-text-align-center" data-align="center">62 (60 &#8211; 64)</td></tr><tr><td class="has-text-align-center" data-align="center">20 FPS (HE)</td><td class="has-text-align-center" data-align="center">75 (34 &#8211; 95)</td><td class="has-text-align-center" data-align="center">85 (45 &#8211; 101)</td><td class="has-text-align-center" data-align="center">100 (98 &#8211; 103)</td><td class="has-text-align-center" data-align="center">104 (98 &#8211; 112)</td></tr><tr><td 
class="has-text-align-center" data-align="center">30 FPS</td><td class="has-text-align-center" data-align="center">196 (187 &#8211; 198)</td><td class="has-text-align-center" data-align="center">183 (52 &#8211; 198)</td><td class="has-text-align-center" data-align="center">192 (137 &#8211; 258)</td><td class="has-text-align-center" data-align="center">192 (142 &#8211; 198)</td></tr><tr><td class="has-text-align-center" data-align="center">120 FPS</td><td class="has-text-align-center" data-align="center">706 (667 &#8211; 736)</td><td class="has-text-align-center" data-align="center">706 (558 &#8211; 739)</td><td class="has-text-align-center" data-align="center">737 (734 &#8211; 739)</td><td class="has-text-align-center" data-align="center">736 (734 &#8211; 739)</td></tr><tr><td class="has-text-align-center" data-align="center">Cost per GiB (Feb 2022)</td><td class="has-text-align-center" data-align="center">$2.54</td><td class="has-text-align-center" data-align="center">$1.13</td><td class="has-text-align-center" data-align="center">$0.62</td><td class="has-text-align-center" data-align="center">$0.57</td></tr><tr><td class="has-text-align-center" data-align="center">Purchase options</td><td class="has-text-align-center" data-align="center"><a href="https://www.amazon.com/dp/B012PKYW1U?th=1&amp;linkCode=ll1&amp;tag=wasbl08-20&amp;linkId=d7fbe9d94901562132d5cfadc387ffb5&amp;language=en_US&amp;ref_=as_li_ss_tl" data-wpel-link="external" target="_blank" rel="external noopener">Amazon</a></td><td class="has-text-align-center" data-align="center"><a href="https://www.amazon.com/dp/B0863981FZ?th=1&amp;linkCode=ll1&amp;tag=wasbl08-20&amp;linkId=047b79d2496108279c6b9fc16e153b98&amp;language=en_US&amp;ref_=as_li_ss_tl" data-wpel-link="external" target="_blank" rel="external noopener">Amazon</a></td><td class="has-text-align-center" data-align="center"><a 
href="https://www.amazon.com/dp/B08TH5N442?&amp;linkCode=ll1&amp;tag=wasbl08-20&amp;linkId=6adf8f79df10bf6206333404fcae8fae&amp;language=en_US&amp;ref_=as_li_ss_tl" data-wpel-link="external" target="_blank" rel="external noopener">Amazon</a></td><td class="has-text-align-center" data-align="center"><a href="https://www.amazon.com/dp/B08KFDTQW5?th=1&amp;linkCode=ll1&amp;tag=wasbl08-20&amp;linkId=4122a93e0b27dcff93ca6138316c1abe&amp;language=en_US&amp;ref_=as_li_ss_tl" data-wpel-link="external" target="_blank" rel="external noopener">Amazon</a></td></tr></tbody></table></figure>



<p>Values shown are the average over all trials, with the worst &amp; best individual results shown in parentheses.</p>
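
<p>For anyone wanting to produce the same kind of summary from their own trials, here is a minimal sketch. The frame counts are made up for illustration, <code>TrialSummary</code> and <code>summarise</code> are hypothetical names, and rounding the average to the nearest whole frame is an assumption rather than a statement of how the table was built.</p>

```swift
// Average plus worst and best results for one card in one mode.
struct TrialSummary {
    let average: Int
    let worst: Int
    let best: Int
}

// Returns nil when there are no trials to summarise.
func summarise(_ frameCounts: [Int]) -> TrialSummary? {
    guard let worst = frameCounts.min(), let best = frameCounts.max() else {
        return nil
    }
    let mean = Double(frameCounts.reduce(0, +)) / Double(frameCounts.count)
    // Assumption: round the mean to the nearest whole frame.
    return TrialSummary(average: Int(mean.rounded()), worst: worst, best: best)
}

// Made-up frame counts from five hypothetical trials.
let exampleTrials = [36, 37, 36, 36, 37]
if let s = summarise(exampleTrials) {
    print("\(s.average) (\(s.worst) - \(s.best))")  // prints "36 (36 - 37)"
}
```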



<h2 class="wp-block-heading" id="commentary">Commentary</h2>



<h3 class="wp-block-heading" id="surprisingly-little-performance-difference">Surprisingly little performance difference</h3>



<p>None of the cards tested are among the &#8220;known fastest&#8221; CFExpress cards, like the Delkin Blacks or ProGrade Cobalts.  Nonetheless, I&#8217;m surprised at how minor the performance difference is between all of them, <em>especially</em> given there&#8217;s an XQD card in the mix.</p>



<p>CFExpress cards are not necessarily fast.</p>



<h3 class="wp-block-heading" id="angelbird-av-pros-do-not-meet-their-promised-performance">Angelbird AV PROs do not meet their promised performance</h3>



<p>The Angelbird card claims a 1,000 MB/s <em>minimum</em>, <em>sustained</em> write speed.  The XQD format is incapable of speeds above 500 MB/s.  Yet the Angelbird is <em>at best</em> just 40% faster than the XQD Lexar.  This suggests either that the camera is the limiting factor &#8211; unlikely given that others have demonstrated <em>much</em> deeper bursts with other, apparently faster cards &#8211; or that the Angelbird doesn&#8217;t live up to its claims.</p>



<p>Blackmagic Disk Speed Test with a Pergear USB-C reader indicates the Angelbird <em>almost</em> hits 1,000 MB/s at the start of a sequential read or write, but within a second or two falls down to a sustained speed of only about 700 MB/s.  And there&#8217;s that 40% difference again, vs XQD.</p>
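
<p>The arithmetic behind that &#8220;40%&#8221; can be sketched as follows, using the average buffer depths from the table and the two write speeds quoted above. <code>percentFaster</code> is a hypothetical helper name, and the results are rounded to whole percentages.</p>

```swift
// Percentage by which `a` exceeds `b`.
func percentFaster(_ a: Double, than b: Double) -> Double {
    (a - b) / b * 100
}

// Average buffer depths from the table: Angelbird AV PRO vs Lexar XQD.
let comparisons: [(mode: String, angelbird: Double, lexar: Double)] = [
    ("20 FPS (lossless)", 37, 26),
    ("20 FPS (HE*)", 62, 60),
    ("20 FPS (HE)", 104, 75),
]

for c in comparisons {
    let pct = percentFaster(c.angelbird, than: c.lexar)
    print("\(c.mode): \(Int(pct.rounded()))% deeper buffer")
}

// Sustained speed of about 700 MB/s vs the 500 MB/s XQD ceiling.
let speedGain = percentFaster(700, than: 500)
print("Write speed: \(Int(speedGain.rounded()))% faster")
```

<p>That works out to roughly 42%, 3% and 39% for the three 20 FPS formats, and 40% for the write speeds.</p>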



<h3 class="wp-block-heading" id="average-performance-correlates-with-consistent-performance">Average performance correlates with consistent performance</h3>



<p>For example, the Pergear 512 GiB nominally performs about the same <em>on average</em> as the ProGrade 256 GiB, but the Pergear was much more consistent.  The Angelbird was a tad faster &amp; more consistent again.</p>



<p>This also highlights why many trials are important, in order to determine the variance.  I&#8217;d rather have an on-average slower card that&#8217;s very consistent than a &#8220;bursty&#8221; card that might crap out in a critical moment and cause me to miss the moment completely.</p>
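
<p>One way to put a number on that consistency is the standard deviation across trials.  The sketch below uses made-up frame counts for two hypothetical cards with the same average, and <code>meanAndSpread</code> is an illustrative name; it computes the population standard deviation.</p>

```swift
// Mean and population standard deviation of a set of trial results.
// Returns nil when there are no trials.
func meanAndSpread(_ xs: [Double]) -> (mean: Double, stdDev: Double)? {
    guard !xs.isEmpty else { return nil }
    let mean = xs.reduce(0, +) / Double(xs.count)
    let squaredDeviations = xs.map { ($0 - mean) * ($0 - mean) }
    let variance = squaredDeviations.reduce(0, +) / Double(xs.count)
    return (mean, variance.squareRoot())
}

// Made-up frame counts: same average, very different consistency.
let steadyCard: [Double] = [60, 61, 59, 60, 60]
let burstyCard: [Double] = [75, 45, 70, 50, 60]

for (name, trials) in [("steady", steadyCard), ("bursty", burstyCard)] {
    if let stats = meanAndSpread(trials) {
        print("\(name): mean \(stats.mean), standard deviation \(stats.stdDev)")
    }
}
```

<p>Both hypothetical cards average 60 frames, but the &#8220;bursty&#8221; one has a far larger spread, which an average alone would hide.</p>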



<h3 class="wp-block-heading" id="30-120-fps-modes-are-camera-limited">30 &amp; 120 FPS modes are camera limited</h3>



<p>There was practically no difference in performance between the cards in 30 FPS &amp; 120 FPS modes.</p>



<p>The bandwidth achieved in these modes, just a few hundred MB/s, is well below the demonstrated capabilities of all these cards.</p>



<p>All this seems quite conclusive that in these extra-fast burst modes the Z9 is the bottleneck, not the memory card.</p>



<p>Sidenote: The ProGrade card showed occasional glitches (three in total across twenty trials) &#8211; the Z9 would suddenly stop shooting mid-burst, even though a split second earlier it had still shown a significant amount left in the &#8220;buffer&#8221; (the rXXX counter).  I&#8217;m not sure what to make of that &#8211; perhaps the Z9 relies on some basic level of performance and the ProGrade can&#8217;t consistently meet it, or perhaps something is glitching between the Z9 &amp; the ProGrade card that causes the Z9 to error out and stop working.</p>



<h2 class="wp-block-heading" id="methodology">Methodology</h2>



<p>1/250, 24-70/4 @ f/4, ISO 5000.</p>



<p>Z9 firmware 1.11.</p>



<p>I enabled the shutter sound at maximum volume, and held down the shutter until I heard a stutter.</p>



<p>For 20 FPS mode:</p>



<ul class="wp-block-list">
<li>I counted any extra frames after the stutter and subtracted those from the numbers.</li>



<li>I also tested 1/2500 and saw no meaningful difference in results.  I also tested ISO 64 &amp; 25,600, which increased and decreased buffer depth (respectively) by about 10% each (very likely corresponding to the file size differences, though I didn&#8217;t check).</li>



<li>Five trials, each testing each format in turn: lossless, HE*, HE.</li>
</ul>



<p>For 30 &amp; 120 FPS modes:</p>



<ul class="wp-block-list">
<li>I never heard an extra frame after the first stutter &#8211; I don&#8217;t know if that means the camera ground to a complete halt or merely that it doesn&#8217;t reliably play the fake shutter sound in these modes. The consistency of the results in those modes leads me to believe it&#8217;s the former.</li>



<li>Ten trials, sequentially.</li>
</ul>



<p>Cards were formatted in camera and empty at the start of each class of testing (20, 30, 120).  Images were <em>not</em> erased between trials (empty cards are not representative of real-world conditions).</p>



<p>Autofocus was not engaged during shooting.  I haven&#8217;t tested it comprehensively, but so far I&#8217;ve seen no impact on burst performance from using autofocus (including subject recognition).</p>
]]></content:encoded>
					
					<wfw:commentRss>https://wadetregaskis.com/z9-burst-shooting-buffer-depth/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			<media:content url="https://wadetregaskis.com/wp-content/uploads/2022/02/Memory-cards-2048x617.jpg" medium="image" />
<post-id xmlns="com-wordpress:feed-additions:1">4983</post-id>	</item>
	</channel>
</rss>
