<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:media="http://search.yahoo.com/mrss/"
>

<channel>
	<title>Swift Sequence &#8211; Wade Tregaskis</title>
	<atom:link href="https://wadetregaskis.com/tags/swift-sequence/feed/" rel="self" type="application/rss+xml" />
	<link>https://wadetregaskis.com</link>
	<description></description>
	<lastBuildDate>Mon, 01 Jan 2024 22:43:34 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://wadetregaskis.com/wp-content/uploads/2016/03/Stitch-512x512-1-256x256.png</url>
	<title>Swift Sequence &#8211; Wade Tregaskis</title>
	<link>https://wadetregaskis.com</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">226351702</site>	<item>
		<title>Collection enumeration performance in Swift</title>
		<link>https://wadetregaskis.com/collection-enumeration-performance-in-swift/</link>
					<comments>https://wadetregaskis.com/collection-enumeration-performance-in-swift/#comments</comments>
		
		<dc:creator><![CDATA[]]></dc:creator>
		<pubDate>Wed, 08 Nov 2023 07:10:08 +0000</pubDate>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Benchmarked]]></category>
		<category><![CDATA[Compiler optimisations]]></category>
		<category><![CDATA[Functional programming style]]></category>
		<category><![CDATA[Imperative programming style]]></category>
		<category><![CDATA[Programming style]]></category>
		<category><![CDATA[Swift]]></category>
		<category><![CDATA[Swift Collection]]></category>
		<category><![CDATA[Swift Sequence]]></category>
		<guid isPermaLink="false">https://blog.wadetregaskis.com/?p=5283</guid>

					<description><![CDATA[Swift&#8217;s Collection and Sequence protocols provide two primary ways to enumerate (filter, map, reduce, etc): functional-style and imperatively. Nominally these are equivalent &#8211; they&#8217;ll produce the same results for all correctly-implemented Collections and Sequences. So in principle which you use is purely a matter of stylistic preference. But is it? Do they&#8230; <a class="read-more-link" href="https://wadetregaskis.com/collection-enumeration-performance-in-swift/" data-wpel-link="internal">Read more</a>]]></description>
										<content:encoded><![CDATA[
<p>Swift&#8217;s <code>Collection</code> and <code>Sequence</code> protocols provide two primary ways to enumerate (filter, map, reduce, etc): functional-style and imperatively.  For example:</p>



<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line"><span style="color: #0000FF">let</span><span style="color: #000000"> result = data</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">filter</span><span style="color: #000000"> { </span><span style="color: #098658">0</span><span style="color: #000000"> != </span><span style="color: #0000FF">$0</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">map</span><span style="color: #000000"> { </span><span style="color: #0000FF">$0</span><span style="color: #000000"> * </span><span style="color: #0000FF">$0</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">reduce</span><span style="color: #000000">(</span><span style="color: #795E26">into</span><span style="color: #000000">: </span><span style="color: #098658">0</span><span style="color: #000000">, &amp;+=)</span></span></code></pre></div>



<p>Or:</p>



<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line"><span style="color: #0000FF">var</span><span style="color: #000000"> result = </span><span style="color: #098658">0</span></span>
<span class="line"></span>
<span class="line"><span style="color: #AF00DB">for</span><span style="color: #000000"> value </span><span style="color: #AF00DB">in</span><span style="color: #000000"> data {</span></span>
<span class="line"><span style="color: #000000">    </span><span style="color: #AF00DB">if</span><span style="color: #000000"> </span><span style="color: #098658">0</span><span style="color: #000000"> != value {</span></span>
<span class="line"><span style="color: #000000">        result &amp;+= value * value</span></span>
<span class="line"><span style="color: #000000">    }</span></span>
<span class="line"><span style="color: #000000">}</span></span></code></pre></div>



<p>Nominally these are equivalent &#8211; they&#8217;ll produce the same results for all correctly-implemented <code>Collection</code>s and <code>Sequence</code>s.  So in principle which you use is purely a matter of stylistic preference.</p>
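
<p>As a quick sanity check of that equivalence, here is a minimal sketch that wraps the two snippets above in functions and runs them on the same input.  The function names and the sample data are made up for illustration; they are not part of the benchmark.</p>

```swift
// Functional style: filter, then map, then reduce.
func sumOfSquaresFunctional(_ data: [Int]) -> Int {
    data
        .filter { 0 != $0 }
        .map { $0 * $0 }
        .reduce(into: 0, &+=)
}

// Imperative style: one loop doing all three steps per element.
func sumOfSquaresImperative(_ data: [Int]) -> Int {
    var result = 0

    for value in data {
        if 0 != value {
            result &+= value * value
        }
    }

    return result
}

let sample = [3, 0, -4, 0, 5]
print(sumOfSquaresFunctional(sample), sumOfSquaresImperative(sample)) // prints "50 50"
```

<p>Both calls sum 9 + 16 + 25, so they agree on the result; the only open question is how much work each does to get there.</p>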



<p>But is it?</p>



<p>Do they actually <em>perform</em> equivalently?</p>



<p>Let&#8217;s examine an example that&#8217;s a <em>little</em> more involved than the above snippets, but still fundamentally pretty straightforward.  The extra processing steps are to help distinguish any performance differences.</p>



<p>The pertinent parts are:</p>



<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line"><span style="color: #000000">testData.</span><span style="color: #001080">next</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">filter</span><span style="color: #000000"> { </span><span style="color: #098658">0</span><span style="color: #000000"> != </span><span style="color: #0000FF">$0</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">map</span><span style="color: #000000"> { </span><span style="color: #0000FF">$0</span><span style="color: #000000">.</span><span style="color: #001080">byteSwapped</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">filter</span><span style="color: #000000"> { (</span><span style="color: #0000FF">$0</span><span style="color: #000000"> &amp; </span><span style="color: #098658">0xff00</span><span style="color: #000000">) &gt;&gt; </span><span style="color: #098658">8</span><span style="color: #000000"> &lt; </span><span style="color: #0000FF">$0</span><span style="color: #000000"> &amp; </span><span style="color: #098658">0xff</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">map</span><span style="color: #000000"> { </span><span style="color: #0000FF">$0</span><span style="color: #000000">.</span><span style="color: #001080">leadingZeroBitCount</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">filter</span><span style="color: #000000"> { </span><span style="color: #267F99">Int</span><span style="color: #000000">.</span><span style="color: #001080">bitWidth</span><span style="color: #000000"> - </span><span style="color: #098658">8</span><span style="color: #000000"> &gt;= </span><span style="color: #0000FF">$0</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">reduce</span><span style="color: #000000">(</span><span style="color: #795E26">into</span><span style="color: #000000">: </span><span style="color: #098658">0</span><span style="color: #000000">, &amp;+=)</span></span></code></pre></div>



<p>And the imperative equivalent:</p>



<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line"><span style="color: #0000FF">var</span><span style="color: #000000"> result = </span><span style="color: #098658">0</span></span>
<span class="line"></span>
<span class="line"><span style="color: #AF00DB">for</span><span style="color: #000000"> value </span><span style="color: #AF00DB">in</span><span style="color: #000000"> testData.</span><span style="color: #001080">next</span><span style="color: #000000"> {</span></span>
<span class="line"><span style="color: #000000">    </span><span style="color: #AF00DB">if</span><span style="color: #000000"> </span><span style="color: #098658">0</span><span style="color: #000000"> != value {</span></span>
<span class="line"><span style="color: #000000">        </span><span style="color: #0000FF">let</span><span style="color: #000000"> value = value.</span><span style="color: #001080">byteSwapped</span></span>
<span class="line"></span>
<span class="line"><span style="color: #000000">        </span><span style="color: #AF00DB">if</span><span style="color: #000000"> (value &amp; </span><span style="color: #098658">0xff00</span><span style="color: #000000">) &gt;&gt; </span><span style="color: #098658">8</span><span style="color: #000000"> &lt; value &amp; </span><span style="color: #098658">0xff</span><span style="color: #000000"> {</span></span>
<span class="line"><span style="color: #000000">            </span><span style="color: #0000FF">let</span><span style="color: #000000"> value = value.</span><span style="color: #001080">leadingZeroBitCount</span></span>
<span class="line"></span>
<span class="line"><span style="color: #000000">            </span><span style="color: #AF00DB">if</span><span style="color: #000000"> </span><span style="color: #267F99">Int</span><span style="color: #000000">.</span><span style="color: #001080">bitWidth</span><span style="color: #000000"> - </span><span style="color: #098658">8</span><span style="color: #000000"> &gt;= value {</span></span>
<span class="line"><span style="color: #000000">                result &amp;+= value</span></span>
<span class="line"><span style="color: #000000">            }</span></span>
<span class="line"><span style="color: #000000">        }</span></span>
<span class="line"><span style="color: #000000">    }</span></span>
<span class="line"><span style="color: #000000">}</span></span></code></pre></div>



<p>I&#8217;ve published <a href="https://github.com/wadetregaskis/Swift-Benchmarks/blob/6566bd0c785053a3b6c9d6b7c43604ff2a636b35/Benchmarks/ArrayProcessing/ArrayProcessing.swift" data-type="link" data-id="https://github.com/wadetregaskis/Swift-Benchmarks/blob/6566bd0c785053a3b6c9d6b7c43604ff2a636b35/Benchmarks/ArrayProcessing/ArrayProcessing.swift" data-wpel-link="external" target="_blank" rel="external noopener">the full source code</a>, in case you&#8217;d like to review it further or run it yourself.</p>



<h2 class="wp-block-heading">How does the performance compare?</h2>



<p>On my iMac Pro (10-core Xeon W-2150B):</p>



<figure class="wp-block-table aligncenter"><table><thead><tr><th class="has-text-align-right" data-align="right">Dataset size</th><th class="has-text-align-right" data-align="right">Functional (median)</th><th class="has-text-align-right" data-align="right">Imperative (median)</th><th class="has-text-align-right" data-align="right">Performance difference</th></tr></thead><tbody><tr><td class="has-text-align-right" data-align="right">0</td><td class="has-text-align-right" data-align="right">234 ns</td><td class="has-text-align-right" data-align="right">133 ns</td><td class="has-text-align-right" data-align="right">1.67x</td></tr><tr><td class="has-text-align-right" data-align="right">32 KiB</td><td class="has-text-align-right" data-align="right">57 µs</td><td class="has-text-align-right" data-align="right">16 µs</td><td class="has-text-align-right" data-align="right">3.56x</td></tr><tr><td class="has-text-align-right" data-align="right">1 MiB</td><td class="has-text-align-right" data-align="right">1.7 ms</td><td class="has-text-align-right" data-align="right">0.5 ms</td><td class="has-text-align-right" data-align="right">3.36x</td></tr><tr><td class="has-text-align-right" data-align="right">8 MiB</td><td class="has-text-align-right" data-align="right">27 ms</td><td class="has-text-align-right" data-align="right">4.2 ms</td><td class="has-text-align-right" data-align="right">6.36x</td></tr><tr><td class="has-text-align-right" data-align="right">32 MiB</td><td class="has-text-align-right" data-align="right">147 ms</td><td class="has-text-align-right" data-align="right">17 ms</td><td class="has-text-align-right" data-align="right">8.65x</td></tr></tbody></table></figure>



<p>On my M2 MacBook Air:</p>



<figure class="wp-block-table aligncenter"><table><thead><tr><th class="has-text-align-right" data-align="right">Dataset size</th><th class="has-text-align-right" data-align="right">Functional (median)</th><th class="has-text-align-right" data-align="right">Imperative (median)</th><th class="has-text-align-right" data-align="right">Performance difference</th></tr></thead><tbody><tr><td class="has-text-align-right" data-align="right">0</td><td class="has-text-align-right" data-align="right">167 ns</td><td class="has-text-align-right" data-align="right">83 ns</td><td class="has-text-align-right" data-align="right">2.01x</td></tr><tr><td class="has-text-align-right" data-align="right">32 KiB</td><td class="has-text-align-right" data-align="right">37 µs</td><td class="has-text-align-right" data-align="right">3.6 µs</td><td class="has-text-align-right" data-align="right">10.20x</td></tr><tr><td class="has-text-align-right" data-align="right">1 MiB</td><td class="has-text-align-right" data-align="right">1,058 µs</td><td class="has-text-align-right" data-align="right">112 µs</td><td class="has-text-align-right" data-align="right">9.45x</td></tr><tr><td class="has-text-align-right" data-align="right">8 MiB</td><td class="has-text-align-right" data-align="right">12 ms</td><td class="has-text-align-right" data-align="right">0.9 ms</td><td class="has-text-align-right" data-align="right">13.23x</td></tr><tr><td class="has-text-align-right" data-align="right">32 MiB</td><td class="has-text-align-right" data-align="right">50 ms</td><td class="has-text-align-right" data-align="right">3.7 ms</td><td class="has-text-align-right" data-align="right">13.76x</td></tr></tbody></table></figure>



<p>The imperative version is <em>many</em> times faster!  And the performance difference increases as the collection size increases.  The functional version starts off not <em>super</em> terrible &#8211; at least on the same order of magnitude as the imperative version &#8211; but it tends rapidly towards being an order of magnitude slower.</p>



<p>Worse, the difference is much more pronounced on more modern CPUs, like Apple Silicon.</p>



<h3 class="wp-block-heading">What&#8217;s going on?</h3>



<p>There are a few compounding factors.</p>



<p>Smaller datasets are more likely to fit into CPU caches (the dataset sizes shown above were chosen to correspond to L1 / L1 / L2 / L3 / RAM, respectively, on my iMac Pro).  Working on data in CPU caches is by nature faster &#8211; the lower-level the cache the better &#8211; and so helps hide inefficiencies.</p>



<p>The functional version creates intermediary <code>Array</code>s to store the intermediary results of every <code>filter</code> and <code>map</code> operation.  This introduces malloc traffic, retains &amp; releases, and <em>writes to memory</em>.  The imperative version has none of that overhead &#8211; it simply reads every value in the collection once, performing the whole sequence of operations all in one go for each element, using only CPU registers (not so much as a function call, even!).</p>
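
<p>To make those hidden intermediaries concrete, here is roughly what the eager chain amounts to once each intermediate <code>Array</code> is given a name, next to the single-pass loop.  This is an illustrative sketch rather than the benchmark&#8217;s actual code; the function names and local variable names are invented for the example.</p>

```swift
// The eager pipeline with every intermediate Array spelled out.
// Each `let` below is a separate Array that gets filled in and then discarded.
func eagerPipelineSpelledOut(_ data: [Int]) -> Int {
    let nonZero = data.filter { 0 != $0 }
    let swapped = nonZero.map { $0.byteSwapped }
    let ordered = swapped.filter { ($0 & 0xff00) >> 8 < $0 & 0xff }
    let zeroCounts = ordered.map { $0.leadingZeroBitCount }
    let bounded = zeroCounts.filter { Int.bitWidth - 8 >= $0 }

    return bounded.reduce(into: 0, &+=)
}

// The same computation in a single pass, with no intermediate storage.
func singlePass(_ data: [Int]) -> Int {
    var result = 0

    for value in data {
        if 0 != value {
            let value = value.byteSwapped

            if (value & 0xff00) >> 8 < value & 0xff {
                let value = value.leadingZeroBitCount

                if Int.bitWidth - 8 >= value {
                    result &+= value
                }
            }
        }
    }

    return result
}
```

<p>Five named arrays in the first function versus a single accumulator in the second: that is the extra memory traffic described above.</p>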


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img fetchpriority="high" decoding="async" width="1358" height="352" src="https://wadetregaskis.com/wp-content/uploads/2023/11/Memory-allocations.webp" alt="Screenshot from Instruments showing the Allocations (memory usage) during the benchmarks' execution" class="wp-image-5365" style="object-fit:cover;width:679px;height:176px" srcset="https://wadetregaskis.com/wp-content/uploads/2023/11/Memory-allocations.webp 1358w, https://wadetregaskis.com/wp-content/uploads/2023/11/Memory-allocations-512x133@2x.webp 1024w, https://wadetregaskis.com/wp-content/uploads/2023/11/Memory-allocations-256x66.webp 256w, https://wadetregaskis.com/wp-content/uploads/2023/11/Memory-allocations-512x133.webp 512w" sizes="(max-width: 1358px) 100vw, 1358px" /><figcaption class="wp-element-caption">Can you tell which approach results in <em>way</em> more memory use (and is a lot slower)?</figcaption></figure>
</div>


<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Sidenote: Dataset load costs</summary>
<p>In real-world cases there&#8217;s sometimes a substantial baseline cost of reading the dataset in from memory (or disk, or the network), unless your dataset is small enough to fit in caches <em>and</em> was very recently populated there.  But even creaky old Intel machines like my iMac Pro have pretty decent memory bandwidth, such that it&#8217;s rarely the performance bottleneck unless your algorithm is quite trivial <em>and</em> well-optimised (e.g. to utilise SIMD instructions).</p>



<p>In the benchmarks I somewhat emptied the caches before each run, by alternating between two similar datasets, but this really just means that during each run it has to load in the initial dataset from one further level out (e.g. L2 instead of L1).  So these benchmarks aren&#8217;t really demonstrating the potential full cost of loading the dataset from RAM &#8211; let alone from disk or the network.</p>
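
<p>One way such alternation could be arranged is sketched below.  The type <code>AlternatingDatasets</code> and its initialiser are hypothetical, written only to illustrate the idea behind a <code>next</code> property; the real benchmark code (linked earlier) may well be structured differently.</p>

```swift
// Hypothetical helper: holds two equally sized datasets and hands them out in turn,
// so that consecutive runs never read the array that was touched most recently.
struct AlternatingDatasets {
    private let datasets: [[Int]]
    private var index = 0

    init(elementCount: Int) {
        datasets = (0..<2).map { _ in
            (0..<elementCount).map { _ in Int.random(in: Int.min...Int.max) }
        }
    }

    // Returns the current dataset, then switches to the other one for the next call.
    var next: [Int] {
        mutating get {
            defer { index = (index + 1) % datasets.count }
            return datasets[index]
        }
    }
}

var testData = AlternatingDatasets(elementCount: 4096)
let firstRunInput = testData.next
let secondRunInput = testData.next // the other dataset
```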
</details>
</div></div>



<p>Furthermore, because the functional version is allocating those additional arrays, which take up more space, it tends to overflow caches sooner.  For example, in the 1 MiB case, instead of being able to operate entirely out of L2 on the iMac Pro, it has to fall back (at least partially) to L3.  L3 is significantly slower &#8211; higher latency &#8211; than L2, so that makes everything it does noticeably slower.  It&#8217;s worse when it no longer fits in any CPU caches and has to start going back and forth to RAM, as there&#8217;s a <em>big</em> jump in latency between CPU caches and RAM.</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Sidenote: Time per element</summary>
<p>I didn&#8217;t include it in the results table because it&#8217;s tangential, but as a quick note:  the time per element varies depending on which level of cache the execution fits within.</p>



<p>It&#8217;s a pretty consistent 4 ns (iMac Pro) / 0.9 ns (M2) in the imperative case (memory prefetching is able to keep up with the trivial linear read pattern).</p>



<p>But for the functional version it ranges from 13 ns (iMac Pro) / 9 ns (M2) for L1, up to 35 ns (iMac Pro) / 12 ns (M2) for RAM.  The memory prefetcher still does an admirable job keeping the performance relatively consistent, but it can&#8217;t completely cover up the inefficiencies.</p>



<p>The M2 scales better &#8211; suffers less of a performance impact as datasets get larger &#8211; because it has both greater memory bandwidth and <em>much</em> lower latency (especially when we get to RAM, as its RAM is in the CPU package rather than miles away across the motherboard on separate DIMMs).</p>
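
<p>Those per-element figures can be reproduced from the tables with a little arithmetic.  The sketch below assumes each element is an 8-byte <code>Int</code> (so the 32 MiB dataset holds 4,194,304 elements); that element size is an assumption on my part, though it is consistent with the numbers quoted.  The function name is made up for the example.</p>

```swift
// Time per element = total run time / number of elements in the dataset.
// Assumption: elements are Ints, 8 bytes each on a 64-bit platform.
func nanosecondsPerElement(totalNanoseconds: Double, datasetBytes: Int) -> Double {
    let elementCount = datasetBytes / MemoryLayout<Int>.stride
    return totalNanoseconds / Double(elementCount)
}

let thirtyTwoMiB = 32 * 1024 * 1024

// iMac Pro, 32 MiB: 17 ms imperative, 147 ms eager functional.
print(nanosecondsPerElement(totalNanoseconds: 17_000_000, datasetBytes: thirtyTwoMiB))
print(nanosecondsPerElement(totalNanoseconds: 147_000_000, datasetBytes: thirtyTwoMiB))
```

<p>That works out to roughly 4 ns and 35 ns per element respectively, matching the iMac Pro figures above.</p>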
</details>
</div></div>



<p>Things get <em>far worse</em> if you exceed available RAM, too.  I haven&#8217;t shown that in these results &#8211; mainly because it&#8217;s painfully time-consuming to run such benchmarks on my iMac Pro with 64 GiB of RAM &#8211; but suffice to say that once you start swapping, the performance goes <em>completely</em> down the toilet.</p>



<h2 class="wp-block-heading">So I should avoid the filter &amp; map methods?</h2>



<p>For trivially small datasets, the difference might be negligible.  Especially if you&#8217;re only using the input data once (if you&#8217;re reusing it many times over, you might want to look at caching the results anyway, or other such optimisations).</p>



<p>For non-trivial datasets, it is often wise to avoid using <code>filter</code> and <code>map</code>, at least &#8220;eagerly&#8221;.  What does that mean?  Well, there are actually <em>two</em> functional styles supported by <code>Collection</code> and <code>Sequence</code>…</p>



<h3 class="wp-block-heading">Enter lazy…</h3>



<p>By default <code>filter</code>, <code>map</code>, and other such operations are &#8220;eager&#8221; &#8211; as soon as they&#8217;re executed they enumerate their <em>entire</em> input, generate their <em>entire</em> output, and only then does execution move on to the <em>next</em> operation in the pipeline.</p>



<p>But there is an alternative &#8211; <em>lazy</em> versions of all of these.  You access them via the <code>lazy</code> property of <code>Collection</code> / <code>Sequence</code>, e.g.:</p>



<div class="wp-block-kevinbatdorf-code-block-pro padding-disabled" data-code-block-pro-font-family="" style="font-size:.875rem;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><pre class="shiki light-plus" style="background-color: #FFFFFF" tabindex="0"><code><span class="line"><span style="color: #000000">testData.</span><span style="color: #001080">next</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #001080">lazy</span><span style="color: #000000"> </span><span style="color: #008000">// New!</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">filter</span><span style="color: #000000"> { </span><span style="color: #098658">0</span><span style="color: #000000"> != </span><span style="color: #0000FF">$0</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">map</span><span style="color: #000000"> { </span><span style="color: #0000FF">$0</span><span style="color: #000000">.</span><span style="color: #001080">byteSwapped</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">filter</span><span style="color: #000000"> { (</span><span style="color: #0000FF">$0</span><span style="color: #000000"> &amp; </span><span style="color: #098658">0xff00</span><span style="color: #000000">) &gt;&gt; </span><span style="color: #098658">8</span><span style="color: #000000"> &lt; </span><span style="color: #0000FF">$0</span><span style="color: #000000"> &amp; </span><span style="color: #098658">0xff</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">map</span><span style="color: #000000"> { </span><span style="color: #0000FF">$0</span><span style="color: #000000">.</span><span style="color: #001080">leadingZeroBitCount</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">filter</span><span style="color: #000000"> { </span><span style="color: #267F99">Int</span><span style="color: #000000">.</span><span style="color: #001080">bitWidth</span><span style="color: #000000"> - </span><span style="color: #098658">8</span><span style="color: #000000"> &gt;= </span><span style="color: #0000FF">$0</span><span style="color: #000000"> }</span></span>
<span class="line"><span style="color: #000000">    .</span><span style="color: #795E26">reduce</span><span style="color: #000000">(</span><span style="color: #795E26">into</span><span style="color: #000000">: </span><span style="color: #098658">0</span><span style="color: #000000">, &amp;+=)</span></span></code></pre></div>



<p>The <code>lazy</code> property returns a special &#8220;lazy&#8221; view of the underlying <code>Collection</code> or <code>Sequence</code>.  That view looks a lot like the original object &#8211; it has the same <code>map</code>, <code>filter</code>, etc methods &#8211; but its versions of those methods return <em>further</em> lazy views, rather than the actual results of the operation.  It doesn&#8217;t actually perform the operation until it&#8217;s strictly necessary. And when it is necessary &#8211; such as when some code like <code>reduce</code> enumerates the results to produce a concrete value &#8211; it calculates the results on the fly, with no intermediary storage.</p>
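
<p>The deferral is easy to observe by counting how often the closure runs.  A minimal sketch, with a made-up function name and a three-element array standing in for real data:</p>

```swift
// Counts closure invocations before and after the lazy view is actually consumed.
func demonstrateLaziness() -> (callsBeforeUse: Int, callsAfterUse: Int, total: Int) {
    var calls = 0

    // Building the lazy view does not run the closure.
    let doubled = [1, 2, 3].lazy.map { value -> Int in
        calls += 1
        return value * 2
    }

    let callsBeforeUse = calls

    // reduce enumerates the view, which is what finally runs the closure.
    let total = doubled.reduce(0, +)

    return (callsBeforeUse, calls, total)
}
```

<p>Calling <code>demonstrateLaziness()</code> yields 0 calls before use, 3 calls after use, and a total of 12.  With an eager <code>map</code> the closure would already have run three times by the time the <code>let</code> was assigned.</p>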



<p>So, that all sounds good &#8211; should be faster, right?  Let&#8217;s see.</p>



<p>On my iMac Pro:</p>



<figure class="wp-block-table aligncenter"><table><thead><tr><th class="has-text-align-right" data-align="right">Dataset size</th><th class="has-text-align-right" data-align="right">Lazy functional (median)</th><th class="has-text-align-right" data-align="right">Imperative (median)</th><th class="has-text-align-right" data-align="right">Performance difference</th></tr></thead><tbody><tr><td class="has-text-align-right" data-align="right">0</td><td class="has-text-align-right" data-align="right">155 ns</td><td class="has-text-align-right" data-align="right">133 ns</td><td class="has-text-align-right" data-align="right">1.17x</td></tr><tr><td class="has-text-align-right" data-align="right">32 KiB</td><td class="has-text-align-right" data-align="right">15 µs</td><td class="has-text-align-right" data-align="right">16 µs</td><td class="has-text-align-right" data-align="right">0.94x</td></tr><tr><td class="has-text-align-right" data-align="right">1 MiB</td><td class="has-text-align-right" data-align="right">491 µs</td><td class="has-text-align-right" data-align="right">511 µs</td><td class="has-text-align-right" data-align="right">0.96x</td></tr><tr><td class="has-text-align-right" data-align="right">8 MiB</td><td class="has-text-align-right" data-align="right">4.1 ms</td><td class="has-text-align-right" data-align="right">4.2 ms</td><td class="has-text-align-right" data-align="right">0.98x</td></tr><tr><td class="has-text-align-right" data-align="right">32 MiB</td><td class="has-text-align-right" data-align="right">16 ms</td><td class="has-text-align-right" data-align="right">17 ms</td><td class="has-text-align-right" data-align="right">0.94x</td></tr></tbody></table></figure>



<p>A dramatic difference versus the eager style.  The lazy functional style is still slightly slower than the imperative style for <em>very</em> small collections, such as the empty one here, but it&#8217;s actually <em>slightly faster</em> for most!</p>



<p>The Swift compiler is doing a pretty amazing job in this case.  Nominally it still needs to do a bunch of overhead &#8211; each of those lazy <code>filter</code> and <code>map</code> methods returns a lazy view object, and those objects form a logical chain, and have to call various methods on each other in order to pass data through the pipeline.  Indeed, in debug builds that&#8217;s exactly what you see in the compiled binary, and the performance is much worse.  But with the optimiser engaged, the compiler sees through all that boilerplate and eliminates it, reducing the whole thing down to a very efficient form.</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Wait, faster…?</summary>
<p>That it&#8217;s actually a tad faster is odd as in principle they should be almost identical &#8211; any optimisation the compiler can apply to the functional version should also be applicable to the imperative one, since the imperative one is basically an easier subcase where we&#8217;ve manually done the hard optimisations already.</p>



<p>Indeed looking at the machine code that the Swift compiler emits, they are <em>very</em> similar.  It&#8217;s unclear to me why there&#8217;s a reliable, measurable performance difference between them &#8211; perhaps an accidental consequence of slightly different instruction orderings and register selection.</p>
</details>
</div></div>



<p>However, on my M2 MacBook Air:</p>



<figure class="wp-block-table aligncenter"><table><thead><tr><th class="has-text-align-right" data-align="right">Dataset size</th><th class="has-text-align-right" data-align="right">Lazy functional (median)</th><th class="has-text-align-right" data-align="right">Imperative (median)</th><th class="has-text-align-right" data-align="right">Performance difference</th></tr></thead><tbody><tr><td class="has-text-align-right" data-align="right">0</td><td class="has-text-align-right" data-align="right">83 ns</td><td class="has-text-align-right" data-align="right">83 ns</td><td class="has-text-align-right" data-align="right">1.00x</td></tr><tr><td class="has-text-align-right" data-align="right">32 KiB</td><td class="has-text-align-right" data-align="right">12 µs</td><td class="has-text-align-right" data-align="right">3.6 µs</td><td class="has-text-align-right" data-align="right">3.31x</td></tr><tr><td class="has-text-align-right" data-align="right">1 MiB</td><td class="has-text-align-right" data-align="right">442 µs</td><td class="has-text-align-right" data-align="right">112 µs</td><td class="has-text-align-right" data-align="right">3.95x</td></tr><tr><td class="has-text-align-right" data-align="right">8 MiB</td><td class="has-text-align-right" data-align="right">3.6 ms</td><td class="has-text-align-right" data-align="right">0.9 ms</td><td class="has-text-align-right" data-align="right">3.94x</td></tr><tr><td class="has-text-align-right" data-align="right">32 MiB</td><td class="has-text-align-right" data-align="right">14 ms</td><td class="has-text-align-right" data-align="right">3.6 ms</td><td class="has-text-align-right" data-align="right">3.85x</td></tr></tbody></table></figure>



<p>Oh no &#8211; while the lazy functional version is several times faster than the eager functional version, it&#8217;s still <em>many</em> times slower than the imperative version (for non-empty collections).</p>



<p>It&#8217;s not entirely clear to me why this is the case; why the behaviour is so different to x86-64.  Looking at the machine code, the compiler&#8217;s optimiser has still successfully removed all the boilerplate and simplified it down to a tight loop of trivial integer operations.  It appears the difference might arise from the use of conditional instructions (for the functional version) versus branching (for the imperative version).  It&#8217;s not clear why the compiler uses different approaches for what are otherwise very similar blocks of code.  As such, the behaviour might change between compiler versions (this exploration used Swift 5.9) or cases (variations in code structure &#8211; or details &#8211; might cause the compiler to make different instruction selections).</p>



<p>Alas, explicable or not, the conclusion is clear:</p>



<figure class="wp-block-pullquote"><blockquote><p>Lazy functional-style performs better than eager functional-style, but still much worse than imperative style.</p></blockquote></figure>



<p>So my advice is to generally avoid the functional style.  Not religiously, but with moderate determination.</p>



<p>The only clear exception, where it&#8217;s okay to use the functional style, is if you&#8217;re inherently doing single operations at a time, like a simple <code>map</code> where you actually need to store the resulting <code>Array</code>.  You&#8217;ll get no benefit from using <code>lazy</code> in such cases, nor will the imperative version be meaningfully faster (usually).</p>
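
<p>For instance, in a sketch like the following (function names invented for illustration), both versions have to allocate and fill one output <code>Array</code>, so there is no intermediate storage for either <code>lazy</code> or a hand-written loop to eliminate:</p>

```swift
// A single eager map whose result is genuinely needed as an Array.
func squaresFunctional(_ data: [Int]) -> [Int] {
    data.map { $0 &* $0 }
}

// The imperative equivalent still has to build the same output Array.
func squaresImperative(_ data: [Int]) -> [Int] {
    var result = [Int]()
    result.reserveCapacity(data.count)

    for value in data {
        result.append(value &* value)
    }

    return result
}
```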



<h3 class="wp-block-heading">Preview: Lazy considered harmful</h3>



<p>Unfortunately, in addition to still being slower than the imperative style on modern CPUs, there are several <em>further</em> aspects of lazy <code>Collection</code>s and <code>Sequence</code>s that are problematic.  I plan to dive deeper into this in a follow-up post, but here&#8217;s a teaser:</p>



<ul class="wp-block-list">
<li>In debug builds the performance is poor, because the optimiser essentially isn&#8217;t used.  Thus debugging (e.g. the regular <code>Run</code> action in Xcode) and unit testing may be slowed down significantly.</li>



<li>The optimisations only work if the compiler can see your whole pipeline.  If you start splitting things up in your code, or making things dynamic at runtime, the compiler might become unable to make the necessary optimisations.  In general if you start storing lazy <code>Collection</code>s / <code>Sequence</code>s anywhere, or returning them from properties or methods, you&#8217;re likely to miss out on the optimisations.</li>



<li>There are some serious pitfalls and sharp edges around lazy <code>Collection</code>s and <code>Sequence</code>s which can lead to them being not just <em>slower</em> than their eager brethren, but potentially dangerous (in the sense of not producing the expected results)!  Stay tuned for details.</li>
</ul>
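
<p>As one small illustration of the kind of surprise a lazy view can hold (not necessarily one of the pitfalls the follow-up will cover), a stored lazy view re-runs its closure on every traversal rather than remembering earlier results.  The function name here is made up for the sketch:</p>

```swift
// Traverses the same lazy view twice and reports how often the closure ran.
func countTransformCalls() -> Int {
    var calls = 0

    let doubled = [1, 2, 3].lazy.map { value -> Int in
        calls += 1
        return value * 2
    }

    _ = doubled.reduce(0, +) // first traversal: closure runs 3 times
    _ = doubled.reduce(0, +) // second traversal: closure runs 3 more times

    return calls
}
```

<p>This returns 6 rather than 3, which matters if the closure is expensive or has side effects.</p>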



]]></content:encoded>
					
					<wfw:commentRss>https://wadetregaskis.com/collection-enumeration-performance-in-swift/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5283</post-id>	</item>
	</channel>
</rss>
