![]() | ![]() ![]() ![]() |
Here is a macro for a fast, branchless population count on a 32-bit register. I have put this together from various pieces of information found in News postings over the past 6 year.
POPCOUNT1 implements a branchless computation of the population count (ie. the number of 1-bits) of a 32-bit register. It is based on a O(log(n)) algorithm that succesively groups the bits into groups of 2, 4, 8, 16, and 32, while maintaining a count of 1-bits in each group.
n = 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh n >> 4 = 00000aaa0bbb0ccc0ddd0eee0fff0ggg sum = 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZwhere
WWWW, XXXX, YYYY, and ZZZZ are the interesting sums each at most 1000b, or 8 decimal.
0000pppp0000qqqq0000rrrr0000ssss * 00000001000000010000000100000001 =
:0000pppp0000qqqq0000rrrr0000ssss
0000pppp:0000qqqq0000rrrr0000ssss
0000pppp0000qqqq:0000rrrr0000ssss
0000pppp0000qqqq0000rrrr:0000ssss
---------------------------------------------------------------
0000ppppxxxxxxxxwwwwwwww:vvvvvvvvuuuuuuuutttttttt0000ssss
where pppp, qqqq, rrrr, and ssss are the 4-bit groups we want to add, and vvvvvvvvv is the sum we are interest in. If integer multiplies are slow, we continue reducing the number of bit groups by a factor of two per step until only one group remains:
n = n + (n >> 8) n = 0000WWWW0000XXXX0000YYYY0000ZZZZ n >> 8 = 000000000000WWWW0000XXXX0000YYYY sum = 0000WWWW000ppppp000qqqqq000rrrrrwhere
ppppp and rrrrr are the interesting sums (and each is at most 10000b or 16 decimal). The sum qqqqq is junk, but it is not necessary to discard it this time.
n = (n + (n >> 16)) & 0x3F n = 0000WWWW000ppppp000qqqqq000rrrrr n >> 16= 00000000000000000000WWWW000ppppp sum = 0000WWWW000ppppp000sssss00ttttttwhere
sssss is at most 11000b and tttttt at most 100000b (32 decimal).
;
; population count of a 32 bit register
;
; input:
; esi = number to population count
;
; output:
; edx = population count
;
; destroys:
; esi, edx
; eflags
;
MACRO POPCOUNT1
mov edx,esi
shr edx,1
and edx,055555555h ; (n >> 1) & 0x55555555
sub esi,edx ; n - ((n >> 1) & 0x55555555)
mov edx,esi
shr edx,2 ; n >> 2
and esi,033333333h ; n & 0x33333333
and edx,033333333h ; (n >> 2) & 0x33333333
add esi,edx ; n = (n & 0x33333333) + ((n >> 2) & 0x33333333)
mov edx,esi
shr edx,4 ; n >> 4
add edx,esi ; n + (n >> 4)
and edx,00F0F0F0Fh ; n = (n + (n >> 4) & 0x0F0F0F0F)
IFDEF FASTMUL
imul edx,001010101h ; add by multiplying with a "magic number"
shr edx,24 ; shift result into place
ENDIF
IFNDEF FASTMUL
mov esi,edx
shr esi,8 ; n >> 8
add edx,esi ; n = n + (n >> 8)
mov esi,edx
shr esi,16 ; n >> 16
add edx,esi ; n = n + (n >> 16)
and edx,00000003Fh ; popcount
ENDIF
ENDMNote that FASTMUL can be defined if your CPU has does a multiply in few cycles (for example the AMD which can do a multiply in 2 cycles. This gem comes from the K6 optimization manual, available at AMDs homepage. There are a gem showing how to use this macro, see Emulate bit scan instructions for further information.