Low-level priority queue optimisations

Optimising the Fractal’s trace mode by changing the priority queue and enabling SIMD.

As mentioned in an aside of the previous post, the priority queue implementation used by the TRACE mode is inefficient. During some profiling using gprof, I noticed that the sift_up method was taking around 50% of the time. This method is called when the root of the heap is consumed; it determines which child should be sifted into the root’s position, then calls itself recursively on the child’s vacated position.

The PQ implementation is generic. It doesn’t assume anything about its elements, except that each has a 32-bit signed integer priority (where the lowest number becomes the root), and a fixed size item. The PQ is initialised with the item size, and calls to push and pop items are done using pointers to the item data. Internally, each slot is the priority concatenated with the item, and memmove is used to move items up and down the heap. In the absence of template programming, Just-in-Time optimisation, and low-level hacks, there’s a certain overhead of this implementation.

There are $1920 \times 1080 \times 4 = 8294400$ pixels when the fractal drawn on my desktop.  But it’s possible for a pixel to be queued, then requeued again with a different priority, if two of its neighbours are evaluated before it.  There are 8 neighbours so ostensibly the queue could grow to 8294400 \times 8 = 66355200 items.  The priority queue is stored as a heap; when it has this many items, the depth will be log_2 66355200 \approx 25.98.  That is how many items will potentially need to be moved each time the root of the heap is consumed.

The items on the TRACE mode’s priority queue are COORD structures, containing x and y coordinates (32-bit integers). Combined with the priority, each slot in the heap holds 12 bytes. I have checked, and memmove is always used to move them. There is no way the compiler can see that item_size will in fact be 8 bytes; even if pq.c and trace.c were a single file, I am not sure the compiler would be able detect that item_size will always be 8. My surmise was that the cost of making a separate memmove call just to move 12 bytes would be greater than simply doing 3 word assignments directly. I thought the compiler would agree with me, if I provided suitable encouragement:

if (pq->slot_size == 12)
    memmove(*x, *c1, 12);   // Moving exactly 12 bytes, so inline the call, damn it!
else
    memmove(*x, *c1, pq->slot_size);

But the optimised assembly output in both MSVC and GCC still call memmove.

I want PQ to remain generic, and did not want to add a special case for size-12 slots just because that’s how PQ would be used in this program. And I also had a plan to reduce the item size — in fact, to combine the item with the priority, because 32 bits really is enough information for them both. Adding explicit support for priority queues with no items seemed reasonable. The slot size would be 4, which meant slot assignment was simple integer assignment:

if (pq->item_size == 0)
    *x = *c1;   // No item data, so just assign the priority.
else
    memmove(*x, *c1, pq->slot_size);

The next step is to pack the priority, and x and y coordinates into a single 32 bit signed integer, which PQ would then treat as the priority. My largest screen is 1920×1080, and the fractal is drawn at double resolution, for antialiasing. Therefore 12 bits should be enough to hold each pixel coordinate (up to 4096 pixels in each direction). That leaves 8 bits for the priority. There are two approaches to maintaining these three items in 4 bytes:

  1. Using bit masks and shift operators. For example, extraction of bits 12-23 would be x = (data >> 12) & 4095 — first we shift it 12 bits down, then we mask out the higher bits, leaving the value we want in x.
  2. Using C’s bit fields. Bit fields are a low level feature that allow us to divide an integer into groups of bits.

I chose to try the second approach. I think it is the first time I’ve used bit fields in a C program. (Most of my C programming is higher level…)

typedef struct COORDS
{
    unsigned int x:12;
    unsigned int y:12;
    char priority:8;
} COORDS;

Note that this entire structure is treated as the item priority in PQ. It’s therefore best if it compares priorities starting at the priority bit field. A bit of experimentation was required to determine that putting it last in the structure puts it in the most significant bits of the integer form used inside PQ.

Some additional debugging was needed to get it drawing properly again. There is an outstanding minor effect of the compressed priority space — in interesting parts of the fractal, most pixels will have the maximum priority, so the fractal does not trace as effectively as it could (it tends to floor fill the sufficiently interesting pixels from the top). This will be fixed by taking the logarithm of the pixel value as its priority.

I develop Fractal on both Linux and Windows. Compiling the new version in MSVC revealed some interesting problems. Garbage was coming out the PQ. It turned out that the size of the COORDS bitfield was 5 bytes, not 4! There is documentation on structure packing and alignment on MSDN but it was too copious and obtuse to help me. I found the answer in the GCC documentation (of all places). When describing an option it provides for compatibility on Windows, it explains the MSVC packing algorithm: “Adjacent bit fields are packed into the same 1-, 2-, or 4-byte allocation unit if the integral types are the same size and if the next bit field fits into the current allocation unit.” Now, my fields do fit into the allocation unit; but the last field was a different type, so it had to be put in a new unit. Changing it to unsigned long int solves the problem.

This is the first time in my experience that changing a type from 8-bit byte to 32-bit unsigned long int results in a smaller object. ;-)

I have implemented TRACE_LOOP and TRACE_SIMD modes. The performance improvement of TRACE_SIMD over TRACE is approximately the same as it was for SIMPLE_SIMD and SIMPLE. Tracing is still significantly slower than other drawing modes. But the gap has been reduced: TRACE was around 8-10 times slower on the original benchmark; now it’s 4-5 times lower.

I’ve also changed the standard benchmark to focus on a slightly more interesting part of the fractal. The pixels in the new area will be deeper, and so provide a better benchmark for the main calculation.

The new benchmark results are:

Mode Depth 100 Depth 1000 Depth 10000
SIMPLE 1358497 699527 665732
SIMPLE_LOOP 1335164 730508 695060
SIMPLE_SIMD 1679148 932749 926970
PARALLEL 3326163 3302414 1965352
TRACE 584272 352806 344100
TRACE_LOOP 462722 349757 341257
TRACE_SIMD 491707 398985 393672

I have not implemented SIMD for the PARALLEL mode yet; nor have I made the SIMPLE and TRACE modes parallel! Ultimately there are at least three distinct dimensions to the mode model the program uses:

  • The drawing mode proper. This determines the order in which pixels are computed and drawn.
  • The main calculation mode: whether SIMD is used (and potentially, what variety of SIMD). It could also mean other variations on that main loop.
  • The multiprocessing mode: whether threads are used.

Implementing new modes helps to develop this model. This is not unlike the development of a scientific theory by accumulating evidence of various related phenomena. Simple mode with SIMD and tracing mode without it were once two incomparable particles in the model; but the SIMD aspects of each have been unified (thanks to the abstraction of mfunc_simd and its next_pixel and output_pixel callbacks). For these modes, at least, the drawing mode is more or less orthogonal to the calculation mode. Parallel mode, like an exotic particle, does not yet fit into this framework. But once it adapts to SIMD, and once it shares its threading abilities (via another abstraction, hopefully) with the other drawing modes, the model will be complete. Until one day a new mode is discovered that spurs new “science” to be done.  One such speculative discovery is offloading computation to a GPU.

This entry was posted in Programming, Science and tagged , , , , . Bookmark the permalink.

2 Responses to Low-level priority queue optimisations

  1. Pingback: Don’t use signed ints if you don’t need them | EJRH

  2. Pingback: Massively parallel fractals | EJRH

Leave a comment