Comparing 160-bit hash values with SSE

I’m writing this up as it’s one of those things I was trying to do and thought I’d just be able to crib how to do it from the Internet, yet a good amount of googling turned up nothing.  So here it is for anyone else looking for the same thing.

I was working on something at work where we put a whole load of entries into a dictionary (an STL map) using a 160-bit key (SHA-1 hash).  Using STL maps requires you to define an ordering on your keys, and I’d knocked up a quick naïve Key class that compared keys by iterating through them in 32-bit chunks, testing whether each chunk was less than or greater than its counterpart—you only need to continue to the next chunk if they are equal.  Here’s the class:

class Key {
 public:
  unsigned int & operator [] (int i) {
    return _data[i];
  }

  unsigned int operator [] (int i) const {
    return _data[i];
  }

  bool operator < (const Key &) const;

 private:
  static const int size = 5;

  unsigned int _data[size];
};

And here’s the simple comparison operator:

inline bool Key::operator < (const Key &k) const {
  for (int i = 0; i < size; i++)
  {
    if (_data[i] < k._data[i])
      return true;
    else if (_data[i] > k._data[i])
      return false;
  }

  // Equal
  return false;
}

There was a lot of talk about using SSE at the time so I decided that as an academic exercise I’d try to optimise the key comparison using SSE.  After all, loops with comparisons inside are bad, aren’t they?

Deciphering SSE

My only past experience with SIMD instruction sets had been writing a bilinear interpolation function using 3DNow! back in 2000 – SSE was totally new to me.  First I had to work out how to use it from C++ (this is on Linux).

With GCC on Linux it seems you have a choice of intrinsics you can use.  GCC has built-in vector types which appear to support simple operations and to be portable to different architectures; the manual didn’t explain them very well and it was unclear whether I’d be able to do what I was trying to do using them.  And then there are the SSE-specific intrinsics, which I think were originally defined by Microsoft.  These are in various include files depending on which subset of SSE you want:

  • xmmintrin.h – SSE
  • emmintrin.h – SSE2
  • pmmintrin.h – SSE3
  • tmmintrin.h – SSSE3
  • smmintrin.h – SSE4.1
  • ammintrin.h – SSE4a
  • bmmintrin.h – SSE5

See how the already confusing SSE numbering scheme is further obfuscated?

These define functions like _mm_add_ps(a, b) that will add two vectors of four 32-bit fp values.  If you actually look in the header you’ll see it maps to __builtin_ia32_addps(a, b), which will (I hope) compile down to the ADDPS instruction.  There isn’t always a direct mapping from the intrinsic name to the instruction name.  Microsoft’s MSDN seems to be the go-to place for documentation on these, and I found the best way to map between intrinsics and instructions was to type them into Google and look for the MSDN links.

Armed with this knowledge, and deciding I didn’t need anything beyond SSE2, I set forth.

Figuring out how to do it

I downloaded the AMD SSE docs and tried to figure out how to compare two long values without ending up with a loop again.  And it’s not at all obvious when all you have is a big alphabetical list of opcodes.  I followed a few red-herring lines of thinking for a while, then realised I could implement the same algorithm but do the comparisons in parallel, removing the evil loop/compare/branch.

SSE2 has packed equality and greater-than instructions for 8-, 16- and 32-bit integers.  I decided to avoid the floating point comparisons since my hash values could have bit patterns for illegal fp values.  PCMPGTD seemed the obvious choice since it compares four 32-bit values at once—I decided to ignore the last 32 bits of my 160-bit key until I’d figured this bit out.  Unfortunately it sets its result by setting each 32-bit chunk to either all 1’s or all 0’s, and it wasn’t clear where to go from there.  However, the result from the packed 8-bit comparison PCMPGTB can be fed into the PMOVMSKB instruction to get a nice packed 16-bit array of comparison results.  Thus I can get the packed 8-bit greater-than, equal and less-than results from 128-bit values thusly:

inline bool Key::operator < (const Key &k) const {
  // Note: _mm_load_si128 requires 16-byte-aligned data, so _data must
  // be 16-byte aligned (or use the unaligned _mm_loadu_si128 instead).
  __m128i i1 = _mm_load_si128(reinterpret_cast<const __m128i *>(_data));
  __m128i i2 = _mm_load_si128(reinterpret_cast<const __m128i *>(k._data));

  __m128i gt128 = _mm_cmpgt_epi8(i1, i2);
  __m128i eq128 = _mm_cmpeq_epi8(i1, i2);
  unsigned int gt = _mm_movemask_epi8(gt128);
  unsigned int eq = _mm_movemask_epi8(eq128);
  unsigned int lt = ~(gt | eq) & 0x0000ffff;
  // (the return value is derived from these masks below)
}

The idea of the initial algorithm was that I start by comparing the most significant chunks of the two values; if a > b for that chunk then the rest of the bits don’t matter, likewise if a < b.  Only if the chunks are equal do I have to move on to the next most significant chunk.  Since I now have two binary masks for gt and lt of the chunks, I can interpret them as integers and simply say

bool result = gt < lt;

and we’ve compared two 128-bit values with no loops or branches.  And indeed it compiles down to:

movdqa    (%rsi), %xmm0
movdqa    (%rdi), %xmm1
movdqa    %xmm1, %xmm2
pcmpgtb    %xmm0, %xmm2
pcmpeqb    %xmm1, %xmm0
pmovmskb    %xmm2, %edx
pmovmskb    %xmm0, %eax
orl    %edx, %eax
notl    %eax
andl    $65535, %eax
cmpl    %eax, %edx
setb    %al
ret

with gcc.

Two asides

When I started looking at the compiler’s assembly output I was completely confused for at least a couple of hours about how it worked at all.  I’m not too familiar with x86 assembler (I grew up on the 68000) and I’d been reading all the Intel/AMD docs, which use op dst, src ordering, so the gcc output looked like it was loading things then overwriting them, using uninitialised values and all sorts of lunacy.  Turns out gcc uses AT&T syntax, which is op src, dst: Intel’s mov eax, ebx is AT&T’s movl %ebx, %eax.

Also, the SSE version of the comparison is not equivalent to the original – we compare in 8-bit chunks rather than 32-bit and the order of the comparisons is messed up further by the little-endian storage of the 32-bit values in the array.  But it doesn’t matter as long as the ordering is consistent with itself.

Extending to 160 bits

We have 32 bits left over to compare.  Rather than trying to marshal them into our SSE scheme we treat them as the least significant chunk and only defer to them if the first 128 bits are equal, so the result becomes

bool r = gt < lt || (gt == lt && _data[4] < k._data[4]);

(The gt == lt test is the “first 128 bits are equal” check: the two masks are disjoint, so they can only be equal when both are zero.  Without it, a key that was greater in the first 128 bits but smaller in the last 32 would also compare less, and the ordering would no longer be consistent.)

Our code now has conditional branching back in it, but it has to be better than the original, no?

Evaluation

I wrote some tests to exercise the STL map – inserting a million random keys and then looking up both keys we know are in the map and keys we know are not.  The results?  The SSE code is 10% slower.

A colleague pointed out that if the hash is any good then the comparison should almost always be decided on the first chunk, so in the original algorithm only one 32-bit comparison usually happens anyway.  In the SSE code we are loading more data than we need most of the time and paying the additional overhead of marshalling it around.

So the conclusion is: think about your problem before plunging in thinking “SSE good, loops and branches bad”.  But it was a good SSE learning experience for me.


2 Comments on “Comparing 160-bit hash values with SSE”

  1. Joey says:

    This is funny. Then sad. I shall continue to avoid assembler.

  2. onitake says:

    thank you for the post.
    i was optimizing a small vector math library i wrote and had some trouble with the result of cmpeqps. what is it good for if i have to transfer a whole xmm register into memory to evaluate the result?
    movmskps was exactly what i was looking for.

    in my case, the work wasn’t futile: my hand-optimized dot product runs about 10% faster than the mulss/addss version generated by gcc 4.2’s tree-ssa optimizer, while my cross product routine reduces benchmark runtime by 20%. not as much as i expected, but i’m sure there’s still room for improvement.
    i suspect there are still some unnecessary memory accesses. they hurt performance much more than the occasional serialization…

