Here is a macro for a fast, branchless population count on a 32-bit register. I have put this together from various pieces of information found in News postings over the past 6 year.
POPCOUNT1 implements a branchless computation of the population count (ie. the number of 1-bits) of a 32-bit register. It is based on a O(log(n)) algorithm that succesively groups the bits into groups of 2, 4, 8, 16, and 32, while maintaining a count of 1-bits in each group.
n = 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh n >> 4 = 00000aaa0bbb0ccc0ddd0eee0fff0ggg sum = 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZwhere
WWWW
, XXXX
, YYYY
, and ZZZZ
are the interesting sums each at most 1000b, or 8 decimal.
0000pppp0000qqqq0000rrrr0000ssss * 00000001000000010000000100000001 = :0000pppp0000qqqq0000rrrr0000ssss 0000pppp:0000qqqq0000rrrr0000ssss 0000pppp0000qqqq:0000rrrr0000ssss 0000pppp0000qqqq0000rrrr:0000ssss --------------------------------------------------------------- 0000ppppxxxxxxxxwwwwwwww:vvvvvvvvuuuuuuuutttttttt0000sssswhere
pppp
, qqqq
, rrrr
, and ssss
are the 4-bit groups we want to add, and vvvvvvvvv is the sum we are interest in. If integer multiplies are slow, we continue reducing the number of bit groups by a factor of two per step until only one group remains:
n = n + (n >> 8) n = 0000WWWW0000XXXX0000YYYY0000ZZZZ n >> 8 = 000000000000WWWW0000XXXX0000YYYY sum = 0000WWWW000ppppp000qqqqq000rrrrrwhere
ppppp
and rrrrr
are the interesting sums (and each is at most 10000b or 16 decimal). The sum qqqqq
is junk, but it is not necessary to discard it this time.
n = (n + (n >> 16)) & 0x3F n = 0000WWWW000ppppp000qqqqq000rrrrr n >> 16= 00000000000000000000WWWW000ppppp sum = 0000WWWW000ppppp000sssss00ttttttwhere
sssss
is at most 11000b and tttttt
at most 100000b (32 decimal).
; ; population count of a 32 bit register ; ; input: ; esi = number to population count ; ; output: ; edx = population count ; ; destroys: ; esi, edx ; eflags ; MACRO POPCOUNT1 mov edx,esi shr edx,1 and edx,055555555h ; (n >> 1) & 0x55555555 sub esi,edx ; n - ((n >> 1) & 0x55555555) mov edx,esi shr edx,2 ; n >> 2 and esi,033333333h ; n & 0x33333333 and edx,033333333h ; (n >> 2) & 0x33333333 add esi,edx ; n = (n & 0x33333333) + ((n >> 2) & 0x33333333) mov edx,esi shr edx,4 ; n >> 4 add edx,esi ; n + (n >> 4) and edx,00F0F0F0Fh ; n = (n + (n >> 4) & 0x0F0F0F0F) IFDEF FASTMUL imul edx,001010101h ; add by multiplying with a "magic number" shr edx,24 ; shift result into place ENDIF IFNDEF FASTMUL mov esi,edx shr esi,8 ; n >> 8 add edx,esi ; n = n + (n >> 8) mov esi,edx shr esi,16 ; n >> 16 add edx,esi ; n = n + (n >> 16) and edx,00000003Fh ; popcount ENDIF ENDMNote that FASTMUL can be defined if your CPU has does a multiply in few cycles (for example the AMD which can do a multiply in 2 cycles. This gem comes from the K6 optimization manual, available at AMDs homepage. There are a gem showing how to use this macro, see Emulate bit scan instructions for further information.