Massively parallel fractals

I’ve implemented a mode in the fractal program that lets it run on a GPU, using OpenCL (fixes issue 19!). It’s not yet very fast, but IMHO is a good start. Writing my first GPU program was actually quite a challenge.

While discussing earlier fractal work, such as using SIMD with the TRACE mode, I speculated that one day there would be a mode that computes hundreds of pixels in parallel on a “many-core” GPU. Surprisingly, that day has finally come! To be honest, I wasn’t seriously expecting to reach it when I talked about it as future work.

Porting libstdcl

I started a couple of months ago by downloading the AMD APP SDK (their version of OpenCL), and something called libstdcl. The latter is a library that wraps the OpenCL API in something slightly friendlier, for C. Now, libstdcl seems to be designed for 64-bit, when compiled on Windows at any rate. But fractal is a 32-bit program. And, as I discovered, the free version of MS Visual C doesn’t support 64-bit.

So my first task was porting libstdcl to 32-bit Windows. To forestall any awe-filled reactions to that, I’ll point out that this entailed nothing more than:

  • Making it detect Windows in 32-bit mode, by universally changing #ifdef _WIN64 to #if defined(_WIN64) || defined(WIN32). I also took a quick glance (about 30 seconds’ worth) at the Windows code to check it will work in 32-bit mode too.
  • Rewriting the C code to put all declarations at the top of each block. No real thought required, but lots of genuine irritation resulting. Why does MS Visual C 2010 still not support C99? D:
  • Deciphering some of the clever (and unsupported by MSVC) macro tricks used to set kernel parameters.

Very little comprehension of the libstdcl code (and less of the OpenCL API) was required. Once the library was compiled, it was a similar but smaller task to port one of the example programs. I chose hello_stdcl, which appeared to run successfully. At least I’m guessing so, since I don’t know precisely what it was computing. It ran without complaining and output lots of stuff; I assume it’s correct.

An OpenCL mfunc kernel

The next challenge was translating the fractal calculations into something runnable on a GPU. Recall the direct “mfunc” routine from fractal:

int mfunc_direct(double zx, double zy, double cx, double cy, int max_iterations, double *fx, double *fy)
{
    int i = 0;
    double zr = zx, zi = zy;
    double zr2 = 0.0, zi2 = 0.0;

    /* Iterate until the depth is exhausted or the point escapes the radius-2 circle. */
    while (i < max_iterations && zr2 + zi2 < 2.0*2.0)
    {
        double t;

        zr2 = zr*zr;
        zi2 = zi*zi;
        t = zr*zi;
        zr = zr2 - zi2 + cx;
        zi = t + t + cy;

        i++;
    }

    /* Store the final point. */
    *fx = zr;
    *fy = zi;

    /* A point that never escaped gets depth 0. */
    if (zr2 + zi2 < 2.0*2.0)
        return 0;

    return i;
}

The goal is to write a similar function — a kernel — that can be applied to many pixels in parallel. The main problem is that different pixels take different amounts of work to compute; this translates to different numbers of iterations until the exit condition is reached.

I tackled this earlier in the PARALLEL mode, by having each thread do a number of pixels evenly distributed across the drawing. The total work for each thread should therefore be similar.

The same technique is used here, except that only several thousand pixels are worked on at a time. There are 64 threads, and each thread i works on pixels 64j + i for all j within range.

The inputs to each calculation are four coordinate values (zx, zy, cx, cy), and the outputs are the depth value at that coordinate, and the final point (fx, fy) resulting from the calculation. Since a kernel will work on many pixels, it takes arrays of all the inputs, and stores its outputs in arrays too. The first draft kernel is:

__kernel void mfunc_kern(
    /* Work size: number of threads and total number of pixels. */
    uint n,
    uint w,

    /* Input coordinates. */
    __global float *gzx,
    __global float *gzy,
    __global float *gcx,
    __global float *gcy,

    /* Outputs. */
    __global unsigned short *vs,
    __global float *fx,
    __global float *fy,

    /* Maximum depth to search to. */
    uint depth)
{
    int i = get_global_id(0);
    int j = i;
    while (j < w)
    {
        /* Do mfunc on the pixel at j. */

        /* Move coordinates into local variables. */
        float zr = gzx[j];
        float zi = gzy[j];
        float zr2 = 0.0f;
        float zi2 = 0.0f;
        float cr = gcx[j];
        float ci = gcy[j];

        /* Remaining depth to search up to. */
        int remaining = depth;

        /* Iterate until depth exhausted or point escapes set. */
        while (remaining && zr2 + zi2 < 4.0f)
        {
            float t;

            zr2 = zr*zr;
            zi2 = zi*zi;
            t = zr*zi;
            zr = zr2 - zi2 + cr;
            zi = 2.0f * t + ci;

            remaining--;
        }

        /* Store final point. */
        fx[j] = zr;
        fy[j] = zi;

        /* Store depth at point, depending on whether it has escaped. */
        if (zr2 + zi2 < 4.0f)
            vs[j] = 0;
        else
            vs[j] = remaining;

        /* Next pixel. */
        j += n;
    }
}

This renders an image indistinguishable from the ones generated on the CPU. A very satisfying result to mark the new year. Get it right first, then make it fast.

…that is far from perfect

Unfortunately the GPU takes longer to render it than the parallelism-challenged CPU; even the SIMPLE mode with no SIMD can be faster, and the PARALLEL mode with SIMD_FLOAT is many times faster; that mode does up to 24 pixels in parallel on my six-core CPU. But the GPU should be doing up to 224 if I’ve read my GPU’s specs right. (It has that many stream processors. Ideally I would also use SIMD on the GPU, making use of each stream processor’s 5 processing elements, to reach 1120 simultaneous pixels.)

Is each pixel computation more than 10 times slower on the GPU? Unlikely. I don’t know the relative scalar speeds on GPUs vs CPUs (it could easily be benchmarked) but I doubt it’s that much. And there are more obvious performance effects to consider first.

The first is that there is some overhead in marshalling the data and dispatching a computation to the GPU. This was especially evident when the program did a single row at a time. It was painfully slow, and just as slow on the deep parts as the shallow parts, indicating that the cost of the individual kernel iterations was not accounting for a significant part of the total cost. The speed was dramatically improved by doing 25 rows at a time.
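
To give a concrete picture of what one dispatch involves, here is a rough sketch of a per-batch dispatch using the raw OpenCL API (the real program goes through libstdcl, and the names here are hypothetical rather than taken from fractal):

#include <CL/cl.h>

/* Hypothetical helper: dispatch one batch of w pixels (e.g. 25 rows' worth) to the
   mfunc kernel and read back the depth values.  The kernel, queue and buffers are
   assumed to have been created and filled elsewhere; error checking is omitted. */
static void dispatch_batch(cl_command_queue queue, cl_kernel kern,
                           cl_mem gzx, cl_mem gzy, cl_mem gcx, cl_mem gcy,
                           cl_mem vs_buf, cl_mem fx_buf, cl_mem fy_buf,
                           cl_uint w, cl_uint depth, unsigned short *vs)
{
    cl_uint n = 64;                /* number of work-items ("threads") */
    size_t global_size = n;

    clSetKernelArg(kern, 0, sizeof(n), &n);
    clSetKernelArg(kern, 1, sizeof(w), &w);
    clSetKernelArg(kern, 2, sizeof(cl_mem), &gzx);
    clSetKernelArg(kern, 3, sizeof(cl_mem), &gzy);
    clSetKernelArg(kern, 4, sizeof(cl_mem), &gcx);
    clSetKernelArg(kern, 5, sizeof(cl_mem), &gcy);
    clSetKernelArg(kern, 6, sizeof(cl_mem), &vs_buf);
    clSetKernelArg(kern, 7, sizeof(cl_mem), &fx_buf);
    clSetKernelArg(kern, 8, sizeof(cl_mem), &fy_buf);
    clSetKernelArg(kern, 9, sizeof(depth), &depth);

    /* One kernel launch covers the whole batch; each work-item strides through
       roughly w/n pixels, so the dispatch overhead is paid once per batch. */
    clEnqueueNDRangeKernel(queue, kern, 1, NULL, &global_size, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, vs_buf, CL_TRUE, 0, w * sizeof(unsigned short), vs, 0, NULL, NULL);
}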

The next symptom is that the OpenCL compiler actually gives a warning (not printed by default by libstdcl), saying:

mfunc_kern kernel has register spilling. Lower performance expected.

This means that there are so many variables in use in the kernel that some of them are spilled to “local memory”, which apparently is not very local to the stream processor that needs to use it. One way around this might be to combine the 4 input arrays into a single array; using an array of 4-float vectors might help too, though the computations on them won’t be vectorised.
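
As a rough sketch of that idea (hypothetical and untested), the four per-pixel inputs could be packed into one float4 each, so the kernel juggles fewer pointer arguments; the iteration itself is unchanged:

__kernel void mfunc_kern_packed(
    uint n,
    uint w,
    /* One float4 per pixel, holding (zx, zy, cx, cy). */
    __global float4 *coords,
    __global unsigned short *vs,
    __global float *fx,
    __global float *fy,
    uint depth)
{
    int j = get_global_id(0);
    while (j < w)
    {
        /* A single vector load replaces four separate reads. */
        float4 p = coords[j];
        float zr = p.x, zi = p.y, cr = p.z, ci = p.w;
        float zr2 = 0.0f, zi2 = 0.0f;
        int remaining = depth;

        while (remaining && zr2 + zi2 < 4.0f)
        {
            float t;

            zr2 = zr*zr;
            zi2 = zi*zi;
            t = zr*zi;
            zr = zr2 - zi2 + cr;
            zi = 2.0f * t + ci;

            remaining--;
        }

        fx[j] = zr;
        fy[j] = zi;
        vs[j] = (zr2 + zi2 < 4.0f) ? 0 : remaining;

        j += n;
    }
}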

The big performance impact is the branchiness of the kernel code. Each compute unit on the GPU has 16 stream processors. But in order to run them in parallel, it needs to run them in lockstep: the same instruction in the kernel is executed simultaneously on each, using the data elements belonging to that processor. If execution reaches a branch, such as the if-statement near the end of the kernel that stores the depth value, then all threads should branch the same way. That way they can continue in lockstep.

If they do not, then the GPU will cope by running some of the threads only, then the remaining threads only, until the two branches join. This is slower than running them all in parallel. There are coding techniques that can remove branches, by executing both branches but masking out any changes to destination variables. That branch could be rewritten as:

    /* Set to remaining*1 if condition is true, or remaining*0 if it's false. */
    vs[j] = remaining * (zr2 + zi2 >= 4.0f);

(For what it’s worth, that if-statement could easily be moved out into the main program and run on the CPU. That removes any benefits it might receive from massive parallelism, of course. But that could be worthwhile if it no longer encumbers the more expensive iteration part of the kernel.)
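
As a minimal sketch of that (using the same names as the kernel; note the test is not bit-for-bit identical, because the kernel compares the squares from one iteration before the stored point), the kernel would store remaining unconditionally, and the main program would then do:

    /* Hypothetical CPU post-processing, after reading vs, fx and fy back from the GPU:
       zero the depth of any point that never escaped the radius-2 circle. */
    int j;
    for (j = 0; j < w; j++)
    {
        if (fx[j]*fx[j] + fy[j]*fy[j] < 4.0f)
            vs[j] = 0;
    }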

It’s not clear what happens in the case of a while loop though. What if some of the threads exit the inner loop, but others continue in it? What then happens if some of those remaining ones later exit? Will the threads devolve into none of them being in lockstep, and effectively have to be individually executed one after the other?

It can be hard to peer through the layers of libraries, drivers and multiple devices to see exactly how my code is being executed. I’m still a novice at these things, having only been GPU programming for two days.

There are large programming guides, such as the nVidia CUDA one and the AMD APP one; from what I’ve read so far the underlying architectures, and hence good programming heuristics, are similar enough to make them both potentially useful. There are also some tools such as a Kernel Analyser that comes with the AMD SDK; it prints the Radeon assembly code that the kernel compiles to (which I’ve not completely understood yet; it’s not like CPU assembly code).

I suspect that the mfunc kernel will benefit from removing some of its branches. One way of handling the differing iteration lengths for pixels is to have the escape/exhaustion condition handled as an if-statement inside an infinite loop, as it was in the SIMD code. That way the GPU might have an easier time of recognising that it was a small bit of extra code to run for a few pixels, and that normal lockstep computation could continue after it.
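
A sketch of that restructuring of the inner loop (untested; the behaviour is identical, only the shape of the control flow changes):

        /* Unconditional loop; both exit conditions gathered into one branch. */
        for (;;)
        {
            float t;

            if (!remaining || zr2 + zi2 >= 4.0f)
                break;

            zr2 = zr*zr;
            zi2 = zi*zi;
            t = zr*zi;
            zr = zr2 - zi2 + cr;
            zi = 2.0f * t + ci;

            remaining--;
        }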

Another possibility is to compute the same number of iterations for all pixels, explicitly masking out the effects for any pixels that were complete. This could be done in batches: do (say) 32 iterations for all pixels, then another 32 for those that are still going, and so on, until all are done. (This is a little like the ITERATIVE mode, which seeks to resume pixels that were not finished in a previous iteration.)
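
Here is a sketch of what that batched form might look like inside the kernel (hypothetical and untested; a batch size of 32 is assumed, and the select operator stands in for the per-iteration branch):

        /* Check the exit conditions only once per batch of 32 iterations; within a
           batch, updates for a pixel that has already finished are masked out. */
        while (remaining > 0 && zr2 + zi2 < 4.0f)
        {
            int k;
            for (k = 0; k < 32; k++)
            {
                /* 1 while this pixel is still iterating, 0 once it has finished. */
                int live = (remaining > 0) && (zr2 + zi2 < 4.0f);
                float t = zr*zi;
                float nzr2 = zr*zr;
                float nzi2 = zi*zi;

                zr2 = live ? nzr2 : zr2;
                zi2 = live ? nzi2 : zi2;
                zr = live ? (nzr2 - nzi2 + cr) : zr;
                zi = live ? (2.0f * t + ci) : zi;
                remaining -= live;
            }
        }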

Last of all, there is a potential speedup from using SIMD on the GPU. This raises the possibility of a full set of GPU-specific mfunc modes. The current mode uses single-precision floats; double precision should be supported on some GPUs.
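
For the double-precision variant, the only OpenCL-specific change beyond swapping float for double should be enabling the extension at the top of the kernel source, on devices that have it:

/* Request double-precision support; some older AMD devices expose it as cl_amd_fp64 instead. */
#pragma OPENCL EXTENSION cl_khr_fp64 : enable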

Ramifications for other modes

To finish, we could unify the OpenCL implementation with the other drawing modes. Currently it’s SIMPLE_OPENCL; it’s essentially SIMPLE mode (drawing row by row from the top). But there is no reason why the large number of pixels it works on should be a contiguous group. The OpenCL computation is more similar to an mfunc mode than a drawing mode: it takes a large number of input coordinates, works on them in parallel, and returns a large number of pixel results.

Making the OpenCL computation generic like this brings the program full circle again. We have several drawing modes: SIMPLE, PARALLEL, TRACE, etc. Orthogonal to these are several mfunc modes: FLOAT, SIMD, SIMD_FLOAT, OPENCL, OPENCL_SIMD, etc. But within the mfunc modes are several more orthogonal variations:

  • Scalar or SIMD?
  • Integer, float, or double precision?
  • CPU or GPU (or both!)?
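
One hypothetical way of keeping those combinations manageable in the code (the names are illustrative, not fractal’s actual ones) would be to describe an mfunc mode as a combination of orthogonal options:

typedef enum { MFUNC_SCALAR, MFUNC_SIMD } mfunc_width;
typedef enum { MFUNC_INT, MFUNC_FLOAT, MFUNC_DOUBLE } mfunc_precision;
typedef enum { MFUNC_CPU, MFUNC_GPU, MFUNC_BOTH } mfunc_device;

/* An mfunc mode is then one choice along each axis. */
typedef struct
{
    mfunc_width width;
    mfunc_precision precision;
    mfunc_device device;
} mfunc_options;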

After covering all those, it might finally be time to start making the program actually useable. ;-)


6 Responses to Massively parallel fractals

  1. Pingback: Project summary | EJRH

  2. Pingback: Some practical uses of goto | EJRH

  3. Demolishun says:

    You might want to check out numpy, pyopencl, and of course python. I am working with fractals in that environment using Matplotlib and related libraries to show the projections. I am not all that up on OpenCL, but it seems like pyopencl is hiding the details of parallel processing enough for me that I don’t seem to have to care about dividing up the processing amongst the GPU/CPU cores. I took my numpy calcs and made them more than 250 times faster. I am calculating mandelbrot set projections with smoothing between color bands for images that are 1024×1024 with a maxit of 2048 in 0.4 seconds. I usually keep my maxit at 512, but the bottleneck seems to be the copying of the data to be viewed and not the actual calculation. Simply amazing.

    • ejrh says:

      Ok, I’ll check them out. But I’ll be reading the source to find out how they do it, so that I can use the same tricks for my C program. ;)

      I’ve wondered where the bottleneck is, but been too lazy to do any benchmarking to find out. I would not be surprised if there was a relatively large one per GPU dispatch, but that should only be a problem if we’re making lots of dispatches per image.

  4. Joseph says:

    To clarify one thing, according to http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf, page 116, when register spilling happens, those variables go to “global” memory, not “local” memory.
