Branchless Programming: An Actual Example
Let us recall that any number XORed with 0 is just that number i.e. 5 ^ 0 == 5 resolves to true. Let us also recall that it is foolish and dangerous to try and implement carry-less multiplication if suitable, secure libraries and hardware acceleration(s) are available for cryptographic purposes.
The Advanced Encryption Standard calls for a carry-less multiplication of two polynomials of up to degree 7 whose coefficients are represented by the bits in an 8-bit byte. For example, 0b00000011 * 0b0000100 represents a left-shift of 0b00000011 by 2 or a left-shift of 0b00000100 by 1 XORed with 0b000000100. Either approach will yield 0b00001100.
What happens when the carry-less multiplication exceeds the width of a byte? You reduce the number by left shifting 0x011b (called the modulo) to align it with the highest set bit, XOR, check to see if it fits in 8 bits, and repeat if necessary. A clever reader may recognize that the highest set bit in such a multiplication, before reduction, is the 14th bit i.e. the 7th bit left-shifted by 7. I’m not sure what the largest number possible is, but that’s left for a later exercise.
Since 0x011b is equivalent to 0b00000001 00011011, we can see that it may have to get aligned to bytes 14 through 8. That’s 7 total checks we’ll have to run. How can we check to see if an individual bit is set? We right-shift the number in question and mask it with 0x01. If that equals 1, we know we have to apply the XOR again. Why not just XOR the number by 0x011b 7 times? The XOR doesn’t simply shift the highest bit down. It may leave a “0 gap”.
0b0101010101010101 ^ 0b0100011011000000 = 0b0001001110010101
Notice that the highest bit in 14 disappeared, but a bit appeared in 12.
Our algorithm looks like this:
- Check original number for highest bit.
- If high (1), shift the modulo to align to bit and XOR. Store result in original number.
- If low (0), do nothing.
- Adjust location to check.
I have seen this coded with an if-statement. The danger with this approach, however, is that it opens the door for side channel attacks to weaken encryption. When the processor performs the XOR operation, it draws a different amount of current than when skipping then operation entirely. It even “sounds” different to an attacker using a very sensitive microphone.
The solution I have come up with is to multiply the modulo by the state of the bit we’re examining and perform an XOR with this new value regardless of the state of the bit. If the bit is high, the modulo remains unchanged, and the number is reduced. If the bit is low, the modulo is changed into 0, and the number remains unchanged.
Imagine that I have stored the result of the multiplication in a variable named accumulator.
for(int i = 0; i < 7; i++) { accumulator ^= ((modulo << (6 - i)) * ((accumulator >> (14 - i)) & 0x01)); }
- Left-shift the modulo 6 to 0 times.
- Multiply the modulo by the accumulator right-shifted 14 to 8 times masked with 0x01.
- XOR the accumulator by this amount and store back in accumulator.
- Multiply the modulo by the accumulator right-shifted 14 to 8 times masked with 0x01.
That’s it. It took several paragraphs to explain these three lines of code.
There are a handful of reasons why branchless programming should be employed. In a hyperthreaded environment, for example, the processor may try to guess which “branch” of an if-else statement should be followed and will place this branch on the execution stack preemptively. Should the guess be wrong, the stack has to be unwound wasting processor cycles. The compiler may attempt to convert branching code into branchless machine code on its own. There is no way to be certain without looking at the assembly code it spits out.
There are good reasons NOT to use branchless programming: it is cryptic. That may seem like only one reason, but it’s many reasons because many people will look at it and scratch their heads.