How running two calculations in a single thread can speed up the Fractal by up to 90% (but more likely, 15%…).

My friend Kris suggested using SIMD for the fractal calculations. He has previously programmed with it, using it (in integer mode) for MD5 calculation.

By way of background, SIMD (Single Instruction Multiple Data) instructions are a feature provided by some processors (including modern x86 ones). They work on vectors of data, performing the same operation in parallel on every component of the vector. For example, in my case the vectors are of size 128 bits, which is large enough to hold two double precision floating point numbers. If I need to perform E = A*B and F = C*D, I can load (A, C) into one vector, and (B, D) into another, and then perform a single multiplication instruction, yielding the vector (E, F).

There is some overhead of packing and unpacking the values from the vectors, but if most of the computation is in the form of simple arithmetic operations, SIMD on vectors of size N can improve speed by a factor approaching N. It’s most easy to apply if you have a number of reasonably long parallel computations on independent elements: you just pack N of them at a time into vectors, perform the computation, and then unpack the results.

## SIMD’ing the Mandelbrot

But how can we apply it to the Mandelbrot fractal? We have a large number of pixels to compute; the problem is that some pixels take longer than others, and we don't know which ones those will be until we try them. The core function is:

```c
int mfunc(int max_iterations, double cx, double cy)
{
    /* Inputs are max_iterations, and the pixel coordinates (cx, cy). */
    double zr = 0, zi = 0;
    double zr2 = 0, zi2 = 0;
    int i = 0;

    /* Loop up to max_iterations times, or until boundary is reached. */
    while (i < max_iterations && zr2 + zi2 < 2.0*2.0)
    {
        double t;

        zr2 = zr*zr;
        zi2 = zi*zi;
        t = zr*zi;
        zr = zr2 - zi2 + cx;
        zi = 2*t + cy;

        i++;
    }

    /* Output is the value for this pixel. */
    return i;
}
```

We could try to adapt it to do two pixels in parallel. But one will likely finish sooner than the other, so the function would need to check for this, save that pixel’s output, then continue the computation until the other pixel has finished. This sounds like a mess, and it feels wasteful.

So initially, I thought about grouping the individual steps into pairs that can be done in parallel. Something like:

```c
zr2 = zr*zr;       zi2 = zi*zi;        /* Two independent multiplications -- easy. */
t = zr*zi;         zi2neg = zi2 * -1;  /* One left over; also negate zi2 while we have the chance. */
zr = zr2 + zi2neg; zi = t + t;         /* So that we can turn the subtraction into an addition.
                                          Turn the multiplication by 2 into an addition, too. */
zr = zr + cx;      zi = zi + cy;       /* Finally add the constants. */
```

We’ve turned 7 computations (in the original 5 lines) into 4. But we still need to perform the comparison with the boundary value.

I never implemented the above, because Kris said, “how about something like this:”

```
void mfunc(int max_iterations, FUNC next_pixel, FUNC output_pixel)
{
    while (1)
    {
        if (component 0 is done)
        {
            output_pixel(component 0);
            component 0 = next_pixel();
        }
        if (component 1 is done)
        {
            output_pixel(component 1);
            component 1 = next_pixel();
        }
        do computations using SIMD
    }
}
```

In other words, *do* run two pixel computations in parallel, but when one finishes, output it and start a new calculation in its place. (I’m paraphrasing Kris’s pseudocode a bit here. He also predicted that it would end up getting and outputting the pixels itself rather than using callbacks. After all, the point of using SIMD is low-level optimisation, and function call overhead could be significant. And callbacks can’t be inlined.)

## Implementation and results

The above scheme has been implemented in mfunc.c. It's currently supported only by the SIMPLE drawing mode, but it should be useable by the others.

As an aside, my favourite drawing mode, TRACE, is almost 9 times as slow as SIMPLE; Amdahl's law therefore implies a theoretical 5% speedup from using SIMD on the pixel calculations. TRACE spends most of its time maintaining its priority queue. There's room for optimisation there: every movement of an element up or down the heap is done by a call to memmove, even though elements are relatively small (12 bytes). And that size could be further reduced. (An element contains a priority, and the pixel coordinates. If we really wanted to compress them, we could use 12 bits each for x and y, leaving an ample 8 bits for priority. Packing them into a single word with the priority at the high end means we could manipulate them by ordinary value assignment rather than block memory moves.) The advantages of TRACE mode are that a) it fills in the interesting areas first, b) it can flood fill the empty bits, and c) it's way more interesting to watch.

After implementing SIMD, casual measurements of the PPM in the graphical display showed a speedup of about 15%. I then implemented a proper benchmark mode. With a command line like `fractal --benchmark --mode SIMPLE_SIMD`, the program will run 5 iterations of a standard benchmark (2000×2000, depth 256), without doing anything graphical.

The results were:

Mode | Pixels per second
---|---
SIMPLE | 870938
SIMPLE_LOOP | 773687
SIMPLE_SIMD | 1021408
PARALLEL | 1849903
TRACE | 86168

So: implementing SIMD has rewarded us with a 32% speed boost over ordinary double arithmetic. Somehow I was expecting more! I guess a lot of it is eaten by the parts of the code that aren't yet parallelised. The rest of the expected gain may have been consumed by the different costs of the SIMD instructions compared to their scalar counterparts (I have not researched the official costs of these instructions, relying instead on overall benchmarking like this). We could use four single precision components per vector, but I prefer to keep the extra precision for accuracy in highly-zoomed-in fractals. (Note: it's definitely worth investigating this, and determining the zoom level at which double precision really matters.)

## Optimising it

I’ve done a bit of work since, trying to optimise that inner loop.

I’ve made a smaller benchmark, consisting only of the `mfunc` functions with some stub `next_pixel` and `output_pixel` functions. Both MSVC and GCC have options to compile to assembly code, so I’ve been poring over that. It can be pretty hard to follow; GCC in particular will move code all over the place. MSVC has an option to annotate the assembly code with the original C source code, which helps (some code motion still occurs, however).

The `mfunc_simd` function has a particular shape:

```
int mfunc_simd
{
    while (1)
    {
        if (check 1)   /* Only succeeds about once in a thousand times. */
            ....
        if (check 2)   /* Only succeeds about once in a thousand times. */
            ....
        mandelbrot computations   /* Occurs every iteration; SIMD goes here. */
    }
}
```

We know from general knowledge about the Mandelbrot set, and from the typical usage of the program, how often various parts of the code are likely to be executed. (Profiling is always a good way to test these predictions; it’s less useful when we get down to this low a level of optimisation.) Unfortunately the compiler doesn’t have this level of knowledge. It can figure some things by analysis (e.g. two instructions are in the same loop, so are both called the same number of times); and it may have heuristics (e.g. an instruction is inside a conditional statement, so may be called less often than one outside). Compilers can often recognise certain loop forms (like iteration of a constant integer range). Some compilers can even do clever stuff like simulating every possible execution and proving useful things about program behaviour.

But they can’t prove arbitrary mathematical things about non-terminating complex transformations; and they don’t have access to outside information (not even Wikipedia!); and you can’t expect them to read your mind to know how you intend to use a program. My concern was that the compiler would optimise the code inside the checks at the expense of the main computation. In particular, I don’t want it to spend a register on a variable that’s only used in there. It’s more important (about 1000 times as important) that the variables used in the main computation are all in registers.

Unfortunately the compiler wasn’t using registers as much as it could in that code. A reasonable (but simple) heuristic for code execution frequency is loop nesting depth. But nesting within a conditional makes code less likely to execute. A possible fix in this case is to put the main computations in their own special loop, which terminates when any of those check conditions are true. (When that happens, the checks and their accompanying code can be run, and if they don’t cause the function to exit, the outer loop will repeat.)

Another obvious (but probably insignificant) optimisation is the counter variables `i0` and `i1`. For most iterations, they will both be incremented, and neither will have reached the limit. We can replace them with a single counter in the inner loop, and do the bookkeeping and checking on the distinct counters outside it. Interestingly, this moves the counter to a register for the entire inner loop. (Integer registers are not in high demand in this function, so it’s strange that the compiler was not already using them effectively.)
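Combining the two restructurings described above, the shape of the function becomes something like this (in the same pseudocode style as before; `n` and `limit` are my names):

```
while (1)
{
    n = 0;
    limit = max_iterations - max(i0, i1);   /* iterations until the first counter could expire */

    /* Inner loop: single counter, no per-pixel bookkeeping. */
    while (n < limit && neither pixel has escaped)
    {
        mandelbrot computations   /* SIMD, all variables in registers */
        n++;
    }

    /* Outer loop (rare): fold the shared counter back into the real ones,
       then do the output/restart checks as before. */
    i0 += n;
    i1 += n;
    if (check 1) ....
    if (check 2) ....
}
```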

The next change was at the point of testing the boundary conditions. Extracting data from SIMD vectors can be done in two ways. We can keep the vector in a C union with scalar (e.g. double, or long) variables; conversion is done by forcing the data into memory as a vector, then reading it back out as a scalar. Or, we can use SIMD functions to extract components of it. Being new to SIMD programming I had defaulted to the first method (the Streaming SIMD documentation can be daunting; there are lots of useful functions in there, but it takes insight and research to find them). But it turns out there was a function for extracting the sign bits of a vector’s components into a simple integer. This was exactly what I needed to check whether either of those components had exceeded the boundary.

One last optimisation mystifies me a bit. We exit the inner loop as soon as the counter variable expires, thus avoiding the boundary calculations. But then we do the boundary calculations immediately outside that loop, because we need the data. Just as much work gets done either way, but the second way is reliably slightly faster!

An “optimisation” I experimented with was replacing the short-circuiting boolean operator `||` with the bitwise one `|`. The first would generate a branch in the middle of the loop, which I worried could hinder the CPU’s branch prediction. (I’ve read stories of highly optimised graphics code simulating conditional code using bitwise arithmetic instead of branches.) But that change made the function slower, so I’ve removed it.

I have not included the changes in the main program; but post-optimisation, the smaller SIMD test yields these figures:

Mode | MSVC | GCC (Cygwin)
---|---|---
LOOP | 19870 |
SIMD (original) | 35359 | 37493
SIMD (optimised) | 36385 | 37514

(For some reason the LOOP implementation doesn’t perform the same computation in GCC. It could be a compiler bug, but is far more likely to be due to relying on undefined behaviour.)

In MSVC, the SIMD speedup (over LOOP) climbed from 78% to 83%. (Note that this test ran at a much higher depth than the benchmark; more time is spent in the main calculation, so SIMD speedups are more noticeable.)

In GCC, there was little change. Note, though, that GCC already does a slightly better optimisation job on the SIMD version.

Was it worth it? Contrary to my earlier estimates of a measly 15% speedup, I’ve shown that SIMD provides almost an 80% speedup. And we can improve it slightly with a few code changes. More usefully, I’ve explored different SIMD usage decisions (e.g. in-memory unions vs extraction functions), and program structure.

I haven’t applied these optimisations to the main program yet. When I do, I’ll expand the benchmark functionality and do some more testing to determine what the speedup is in various common scenarios (zoomed out, zoomed in on detail, zoomed in on an edge).

This was my first foray into SIMD programming! It can be tricky, but it’s not impossible. And there are plenty of other, non-fractal, applications for it.

Once again, the fractal work is hosted at my project on Google Code, and there is a binary package for Windows. The latest binary includes the benchmark, and the next release will include the SIMD test code.


Thanks for the interesting information. See also:

http://shootout.alioth.debian.org/u32/performance.php?test=mandelbrot

Hi Adam, thanks for the link! The Fortran one thrashes all the others in performance, so I had a look at its code and noticed two clever things:

I think it’s a bit easier for the shootout programs, since they compute simple in/outness rather than a continuous colour pattern. But some of their optimisations might be worth borrowing anyway.

I have never programmed in Fortran, but I can appreciate how easy it makes arithmetic on vectors.

Hi Edmund.

Thanks for the answer. When speed is important, array languages seem to be the best:

http://blogs.mathworks.com/loren/2011/07/18/a-mandelbrot-set-on-the-gpu/

I am thinking about OpenCL. What is your opinion of it?

Please also look at:

http://en.wikibooks.org/wiki/Fractals

Very interesting. I just played around implementing this with SIMD in C#, and compared the results to an optimized way of computing it given by the Intel guys (http://evolvedmicrobe.com/blogs/). You may be interested to see how they use the horizontal add/subtract version to speed things up. The Intel assembly version also seems to make better use of the registers.

My program right now is slow (I am guessing) on the condition checks, and because it is doing two at once in the inner loop and not incrementing like it should, so it is always running. If I ever have time, I will be curious to see how much those changes improve it.