Speeding Up Slow Sums – Faster BSD Checksums

I recently made some changes to the Entangle archive that required re-verifying the checksums on a good majority of the archive. That’s about 200 TB worth of checksums. Even with six systems running in parallel and RAIDs hitting 600 MB/s, it still takes a while.

One thing I noticed was that the I/O rate while the sums were being performed was ~210 MB/s. A respectable speed, but cp’ing a file from the RAID to /dev/null results in ~600 MB/s. So either my CPU was too slow or the ‘sum’ operation wasn’t as efficient as it could be.

To be clear, I’m using the BSD sum, which is a 16-bit sum. For each byte of data to be summed, the sum is rotated right by one bit position and then the new byte is added to the sum. This should be a piece of cake for an Apollo Lake CPU, so it was time to dig into The Source.

The sum command is part of CentOS’s coreutils, so after pulling the src RPM and unpacking it, I started perusing the code. The first suspicion was of course the I/O side of things. Copying a file to /dev/null gives a best-case idea of how fast data can be read off the RAID. But there are lots of inefficient ways to read data.

There are some best practices to get the OS to cooperate with your I/O, especially if you’re reading a file linearly. The call to posix_fadvise hinting at sequential I/O was already there. The I/O size was 8K, which is a bit small, but oddly increasing it had the effect of making things slower. I’d guess it has to do with 8K being close to a sweet spot in parallelism between the filesystem, kernel, and libc.
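For reference, the hint amounts to something like this (a sketch of my own; coreutils wraps the call in its own helper, and the function name here is made up):

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>   /* posix_fadvise, POSIX_FADV_SEQUENTIAL */
#include <stdio.h>   /* FILE, fileno */

/* Tell the kernel we'll read the stream front to back so it can read
   ahead aggressively. It's only advice, so failure is harmless. */
static void advise_sequential(FILE *fp)
{
    posix_fadvise(fileno(fp), 0, 0, POSIX_FADV_SEQUENTIAL);
}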

The code makes repeated calls to getc. Hmm. Perhaps we can speed things up a bit by fread’ing 16K or so at a time. This resulted in a slight increase, but we’re still well below the RAID’s read rate.
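The change amounts to something like this (a sketch of how I’d structure it, not the actual patch; the buffer size and names are mine):

#include <stdio.h>
#include <stddef.h>

/* Sum a stream by reading 16K at a time instead of calling getc()
   once per byte. Returns 0 on success, -1 on read error. */
static int bsd_sum_stream(FILE *fp, unsigned *sum_out)
{
    unsigned char buf[16384];
    unsigned checksum = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (size_t i = 0; i < n; i++) {
            checksum = (checksum >> 1) + ((checksum & 1) << 15);
            checksum += buf[i];
            checksum &= 0xffff;
        }
    }
    if (ferror(fp))
        return -1;
    *sum_out = checksum;
    return 0;
}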

There’s nothing obviously inefficient about the I/O, so it’s time to look at the sum function itself. The core loop looks like:

total_bytes++;
checksum = (checksum >> 1) + ((checksum & 1) << 15);
checksum += ch;
checksum &= 0xffff;

So the first thing we can note is the increment of total_bytes. This is easy to get rid of. If we’re not using stdin, then we can just call ftell() after we’re done to determine how many bytes were read. (If we’re reading from stdin we can’t use ftell(), so we handle stdin as a separate case.)
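In sketch form (my own illustration of how the cases might split, not the coreutils code; with the fread() approach above, the stdin path can accumulate the fread() return values instead of counting per byte):

#include <stdio.h>

/* Get the number of bytes consumed without a per-byte counter: for a
   regular file the stream position after the read loop is the answer;
   for stdin (or any unseekable stream) fall back to a count built from
   the fread() return values. */
static long finish_byte_count(FILE *fp, long bytes_from_fread)
{
    if (fp == stdin)
        return bytes_from_fread;
    long pos = ftell(fp);
    return pos >= 0 ? pos : bytes_from_fread;
}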

Next is the last line, which clamps the checksum value to 16 bits. Do we need that? Well, if we declare checksum to be a uint16_t then we don’t. The addition will wrap around and the MSBs will be lost – which is exactly what the masking is doing anyway.

So we’re down to:

checksum = (checksum >> 1) + ((checksum & 1) << 15);
checksum += ch;

This is now running at about 400 MB/s. Much improved, but maybe we can do better. So let’s look at the assembly code. Way back in the mists of time when Commodore 64s and TRS-80s roamed the earth and the fortunate few packed Winchesters (err, hard drives), I learned Z-80 machine language as my second coding language (BASIC of course being the first!). Yes, I didn’t have an assembler, so I was hand-assembling machine code. The Z-80 was a spiffed up version of the 8080, an ancestor of the Intel x86 processors we know and love today. The Z-80/8080 had a certain “ror” – rotate right – instruction. Despite undergoing some linguistic drift, current x86 processors have the same instruction. The 16-bit version of that instruction – “rorw” – should replace all the shifts, masks, and adds in that first line.

But when I look at the assembler, I see…a bunch of shifts, masks, and adds. To be honest I wasn’t that surprised. I mean, gcc would need to be pretty good to notice that the first line is really a 16-bit rotate. But a bit of web searching shows that gcc should be able to figure this out. So why isn’t it? After much deliberation, I made a small change…

checksum = (checksum >> 1) | ((checksum & 1) << 15);
checksum += ch;

Voilà! We’re now summing along at 540 MB/s. Apparently that addition was enough to throw off the optimizer. For those of you who are curious, the main sum loop looks like this in assembler:

.L40:
    movzbl (%r8), %eax  // Get next byte to sum
    rolw $15, %bx       // Rotate sum left by 15
    addq $1, %r8        // increment buffer pointer
    addl %eax, %ebx     // Add the byte to the sum
    cmpq %r15, %r8      // See if we're at the end 
                        // of the buffer
    jne .L40            // If not at end, then do 
                        // next byte

The only oddity is the use of rolw rather than a rorw instruction. Rotating a 16-bit value left by 15 is the same as rotating it right by 1, so no harm done.

Some further gains might be had by loop unrolling and other methods, but 540 MB/s is close enough to the 600 MB/s throughput from the drives for me.
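If someone did want to chase that last bit, a hand-unrolled inner loop would look something like this (an untested sketch, not something I’ve benchmarked; the rotate-and-add chain is still serial, so the only savings are in loop overhead):

#include <stdint.h>
#include <stddef.h>

static uint16_t bsd_sum_unrolled(const unsigned char *buf, size_t len)
{
    uint16_t checksum = 0;
    size_t i = 0;

    /* Four bytes per iteration to amortize the increment/compare/branch. */
    for (; i + 4 <= len; i += 4) {
        checksum = (uint16_t)((checksum >> 1) | (checksum << 15)) + buf[i];
        checksum = (uint16_t)((checksum >> 1) | (checksum << 15)) + buf[i + 1];
        checksum = (uint16_t)((checksum >> 1) | (checksum << 15)) + buf[i + 2];
        checksum = (uint16_t)((checksum >> 1) | (checksum << 15)) + buf[i + 3];
    }
    for (; i < len; i++)                                   /* leftover bytes */
        checksum = (uint16_t)((checksum >> 1) | (checksum << 15)) + buf[i];

    return checksum;
}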