Here is a macro for a fast, branchless population count on a 32-bit register. I have put this together from various pieces of information found in News postings over the past 6 year.

POPCOUNT1 implements a branchless computation of the population count (ie. the number of 1-bits) of a 32-bit register. It is based on a O(log(n)) algorithm that succesively groups the bits into groups of 2, 4, 8, 16, and 32, while maintaining a count of 1-bits in each group.

Partition the register into groups of 2 bits. Compute the population count for each 2-bit group and store the result in the 2-bit group. This calls for the following transformation to be performed for each 2-bit group: 00b -> 00b, 01b -> 01b, 10b -> 01b, 11b -> 10b If the orginal value of a 2-bit group is x, then the new value will be x - (x >> 1). In order to handle all 2-bit groups simultaneously, we have to mask appropriately to prevent spilling from one group to the next lower group. Thus: n = n - ((n >> 1) & 0x55555555)
Now, add the population count of adjacent 2-bit groups and store the sum to the 4-bit group resulting from merging these adjacent 2-bit groups. To do this simultaneous to all groups, we mask out the odd numbered groups, mask out the even numbered groups, and then add the odd numered groups to the even numbered groups: n = (n & 0x33333333) + ((n >> 2) & 0x33333333) Each 4-bit field now has value 0000b, 0001b, 0010b, 0011b, or 0100b.
Now, for the first time, the value in each k-bit field is small enough that adding two k-bit fields results in a value that still fits in the k-bit field. Thus the following computation is performed:
The result is four 8-bit fields whose lower half has the correct sum and whose upper half contains junk that has to be masked out. Pictorially:
where WWWW, XXXX, YYYY, and ZZZZ are the interesting sums each at most 1000b, or 8 decimal.

We now have two choices of how to continue. If the target machine has a fast integer multiply, we can use an IMUL to add the 4-bit fields like so:

0000pppp0000qqqq0000rrrr0000ssss * 00000001000000010000000100000001 =

                        :0000pppp0000qqqq0000rrrr0000ssss
                0000pppp:0000qqqq0000rrrr0000ssss
        0000pppp0000qqqq:0000rrrr0000ssss
0000pppp0000qqqq0000rrrr:0000ssss
---------------------------------------------------------------
0000ppppxxxxxxxxwwwwwwww:vvvvvvvvuuuuuuuutttttttt0000ssss

where pppp, qqqq, rrrr, and ssss are the 4-bit groups we want to add, and vvvvvvvvv is the sum we are interest in. If integer multiplies are slow, we continue reducing the number of bit groups by a factor of two per step until only one group remains:

For the next step, adjacent 8-bit fields are added together and the sum is put into 16-bit fields created by merging the adjacent 8-bit fields. Compute:
where ppppp and rrrrr are the interesting sums (and each is at most 10000b or 16 decimal). The sum qqqqq is junk, but it is not necessary to discard it this time.
In the last step, the values of the two 16-bit groups are added, producing the final result in the six least significant bits. Any remaining junk is masked out.
where sssss is at most 11000b and tttttt at most 100000b (32 decimal).

;
; population count of a 32 bit register
;
; input:
;   esi = number to population count
;
; output:
;   edx = population count
;
; destroys:
;  esi, edx
;  eflags
;

MACRO   POPCOUNT1
        mov     edx,esi
        shr     edx,1
        and     edx,055555555h          ; (n >> 1) & 0x55555555
        sub     esi,edx                 ; n - ((n >> 1) & 0x55555555)
        mov     edx,esi
        shr     edx,2                   ; n >> 2
        and     esi,033333333h          ; n & 0x33333333
        and     edx,033333333h          ; (n >> 2)  & 0x33333333
        add     esi,edx                 ; n = (n & 0x33333333) + ((n >> 2) & 0x33333333)
        mov     edx,esi
        shr     edx,4                   ; n >> 4
        add     edx,esi                 ; n + (n >> 4)
        and     edx,00F0F0F0Fh          ; n = (n + (n >> 4) & 0x0F0F0F0F)
        IFDEF FASTMUL
                imul    edx,001010101h  ; add by multiplying with a "magic number"
                shr     edx,24          ; shift result into place
        ENDIF
        IFNDEF FASTMUL
                mov     esi,edx
                shr     esi,8           ; n >> 8
                add     edx,esi         ; n = n + (n >> 8)
                mov     esi,edx
                shr     esi,16          ; n >> 16
                add     edx,esi         ; n = n + (n >> 16)
                and     edx,00000003Fh  ; popcount
        ENDIF
ENDM

Note that FASTMUL can be defined if your CPU has does a multiply in few cycles (for example the AMD which can do a multiply in 2 cycles. This gem comes from the K6 optimization manual, available at AMDs homepage. There are a gem showing how to use this macro, see Emulate bit scan instructions for further information.

Gem writer: Norbert Juffa
last updated: 1998-03-16