I’d like to write a bit about a project my friends and I have been working on recently.
LLVM is a compiler toolkit that makes it easy to write reasonably fast programming languages without having to implement backends for every processor architecture known to man. A big reason for its popularity is that it boasts a rich intermediate representation (IR) language. IR is sort of like assembly code, but much more convenient and programmerfriendly. The most significant difference for this article is the fact that IR is statically typed. LLVM uses IR to generate actual assembly and machine code. LLVM stands for lowlevel virtual machine and IR is its bitcode.
SIMD stands for single instruction, multiple data. It describes a class of machine instructions which let you use large registers to operate on vectors of elements at once. For example, using the SSE2 instruction extensions to x86 you could use the standard 128bit floating point registers to store vectors of four 32bit integers and add them together in one instruction:
%xmm0, %xmm1 paddd
This takes more than one cycle, but it’s much faster than four adds. There are a slew of these vector instructions. All the common arithmetic operators and also some operations that only make sense on vectors. Dot products, vector sums, etc.
New processors are adding more and more SIMD instructions. The current iMac Pro has a Xeon processor with AVX512 technology: 512bit registers with extensive vector instruction sets so you can work on sixteen 32bit integers at once.
LLVM’s type system accounts for SIMD with vector types.
<4 x i32>
is a type describing a vector of 32bit
integers, so to generate the above assembly you might write a line of IR
that looks like this:
%sum = <4 x i32> add %v1, %v2
While working on a vector of int
’s (or
long
’s or char
’s or short
’s) is
nice, some algorithms lend themselves better to smaller field widths.
DNA base pairs can be encoded as 2bit numbers, booleans as 1bit
numbers, and there are applications in GPS (the navigation kind) that
rely on 4bit values. There aren’t native instructions for these. The
smallest vector element type that is commonly supported is a byte. What
to do?
Well, LLVM IR has support for arbitrary vector types. That means you
can write <5 x i7>
or <64 x i2>
and your program will work without errors. So how does LLVM generate
native code when there are no native instructions defined on these
types?
Scalarization. The generated code goes through each element of your vector individually, removes it from the vector, does whatever you want with that element, and then pops it back into another vector. The vector is scalarized. The result is a 53KB ASM file yielding a ridiculously slow 32KB binary when compiling a file containing a single function:
@define add(<128 x i1> %v1, <128 x i1> %v2) {
%sum = <128 x i1> add %v1, %v2
ret <128 x i1> %sum
}
This is silly! Adding two 1bit values is just an XOR, and adding two
vectors of 1bit values means just taking the XOR of the vectors. So
what we really want to generate is xor %xmm0, %xmm1
.
One of the best things about LLVM IR, its type system, is getting in the way. It would be really nice to write a compiler that generates IR similar to that above. Luckily, LLVM is built to be extendible. Compilation works in passes. One pass might find variables you never use and get rid of them. One might find tailrecursive functions and turn them into loops. Another may inline function calls. This way, different people can write different optimizations concurrently without needing to cooperate. What my friends and I did is write a pass to subvert LLVM’s type system.
Given an IR file containing nonstandard types (I’ll refer to them as illegal types), we produce an IR file containing only standard types (I’ll call them legal). This is called type legalization and it comprises one stage of our pass.
We take
@define add(<128 x i1> %v1, <128 x i1> %v2) {
%sum = <128 x i1> add %v1, %v2
ret <128 x i1> %sum
}
and transform it into
@define add(<4 x i32> %v1, <4 x i32> %v2) {
%sum = <4 x i32> add %v1, %v2
ret <4 x i32> %sum
}
But that’s not enough. XORing registers is not the same as adding
int
’s. We also have to transform that add
instruction into an xor
. And we need to do it
first, so we don’t accidentally transform an instruction that
didn’t have illegal types before our pass. So first we analyze the
program to find all the illegal types. Every value that touches an
illegal type gets recorded. Then, we switch out the add
for
an xor
, and finally we can change the types.
Changing that add
to an xor
is called SWAR: SIMD Within A
Register. We treat each 32bit vector element as its own register and we
implement SIMD within that. SIMD within SIMD. To add 128 1bit values,
we XOR four 32bit values. XORing 32bit values in order to add 1bit
values is a SWAR technique since it lets us operate on thirtytwo values
with one instruction.
So our final output should look like this:
@define add(<4 x i32> %v1, <4 x i32> %v2) {
%sum = <4 x i32> xor %v1, %v2
ret <4 x i32> %sum
}
Now this generates the ASM we’re looking for! And it runs as fast as
you can expect a single xor
instruction to run (very very
very fast).
This example was easy though; 1bit values are not complicated. What
happens if your input is multiplying vectors of 4bit values? Well, you
can’t just swap in an xor
and call it a day, but it’s not
too difficult to come up with algorithms to perform 4bit multiplication
in packed 128bit registers. There are algorithms for all the main
arithmetic operations in fact, in varying performance and
complexity.
We built a function that contains our SWAR implementation of multiply for vectors of 4bit integers and splice in a call to it instead. I won’t bore you with the details, it’s some annoying bittwiddling.
What I am really excited to show are the results. We benchmarked a very simple program with and without our pass.
define <128 x i1> @add(<128 x i1> %v1, <128 x i1> %v2) {
entry:
%sum = add <128 x i1> %v1, %v2
ret <128 x i1> %sum
}
define i32 @main() {
entry:
br label %loop
loop:
%i = phi i64 [ 7812500, %entry ], [ %i1, %loop ]
%v1 = phi <128 x i1> [ <i1 0, ..., i1 0>, %entry ], [ %sum, %loop ]
%v2 = phi <128 x i1> [ <i1 1, ..., i1 1>, %entry ], [ %v2ctr, %loop ]
%sum = call <128 x i1> @add(<128 x i1> %v1, <128 x i1> %v2)
%v2ctr = call <128 x i1> @add(<128 x i1> %v2, <128 x i1> <i1 1, ..., i1 1>)
%i1 = sub i64 %i, 1
%cmp = icmp eq i64 %i1, 0
br i1 %cmp, label %fin, label %loop
fin:
ret i32 0
}
This code adds two vectors and increments every element of the second vector 7812500 times. The choice of this last number is arbitrary, we only chose it because 7812500(128) = 1 billion. We had analogous programs for the 2bit and 4bit cases.
Each program was compiled using LLVM 5 times with and 5 times without our pass. Then we ran the programs 5 times each, and reported here are the means (I don’t include the variance because it was negligible: 0.00 or 0.01 for all experiments).
Mean Compile Time (s)
i1

i2

i4



with  0.05  0.04  0.03 
without  1.78  0.88  0.42 
Mean Run Time (s)
i1

i2

i4



with  0.05  0.07  0.06 
without  6.26  3.00  1.50 
You don’t need a MannWhitney U test to tell you that’s an
improvement. The i1
program runs 125 times faster with our
approach and it was faster to compile, too!
Of course, we benchmarked a single stupid program and our approach may not actually fit with how LLVM IR is generated in reallife. There are a slew of instructions that would need to be supported and most aren’t as simple as bitwise addition. I’m not really interested in pursuing this further, because I’m satisfied at beating LLVM in this little game.