Population count of a 32-bit registerAssembler/80386

Here is a macro for a fast, branchless population count on a 32-bit register. I have put this together from various pieces of information found in News postings over the past 6 year.

POPCOUNT1 implements a branchless computation of the population count (ie. the number of 1-bits) of a 32-bit register. It is based on a O(log(n)) algorithm that succesively groups the bits into groups of 2, 4, 8, 16, and 32, while maintaining a count of 1-bits in each group.

  1. Partition the register into groups of 2 bits. Compute the population count for each 2-bit group and store the result in the 2-bit group. This calls for the following transformation to be performed for each 2-bit group: 00b -> 00b, 01b -> 01b, 10b -> 01b, 11b -> 10b If the orginal value of a 2-bit group is x, then the new value will be x - (x >> 1). In order to handle all 2-bit groups simultaneously, we have to mask appropriately to prevent spilling from one group to the next lower group. Thus: n = n - ((n >> 1) & 0x55555555)

  2. Now, add the population count of adjacent 2-bit groups and store the sum to the 4-bit group resulting from merging these adjacent 2-bit groups. To do this simultaneous to all groups, we mask out the odd numbered groups, mask out the even numbered groups, and then add the odd numered groups to the even numbered groups: n = (n & 0x33333333) + ((n >> 2) & 0x33333333) Each 4-bit field now has value 0000b, 0001b, 0010b, 0011b, or 0100b.

  3. Now, for the first time, the value in each k-bit field is small enough that adding two k-bit fields results in a value that still fits in the k-bit field. Thus the following computation is performed: The result is four 8-bit fields whose lower half has the correct sum and whose upper half contains junk that has to be masked out. Pictorially:

    where WWWW, XXXX, YYYY, and ZZZZ are the interesting sums each at most 1000b, or 8 decimal.

  4. We now have two choices of how to continue. If the target machine has a fast integer multiply, we can use an IMUL to add the 4-bit fields like so: where pppp, qqqq, rrrr, and ssss are the 4-bit groups we want to add, and vvvvvvvvv is the sum we are interest in. If integer multiplies are slow, we continue reducing the number of bit groups by a factor of two per step until only one group remains:

    1. For the next step, adjacent 8-bit fields are added together and the sum is put into 16-bit fields created by merging the adjacent 8-bit fields. Compute:
        n = n + (n >> 8)
        
        n      = 0000WWWW0000XXXX0000YYYY0000ZZZZ
        n >> 8 = 000000000000WWWW0000XXXX0000YYYY
        sum    = 0000WWWW000ppppp000qqqqq000rrrrr
      where ppppp and rrrrr are the interesting sums (and each is at most 10000b or 16 decimal). The sum qqqqq is junk, but it is not necessary to discard it this time.

    2. In the last step, the values of the two 16-bit groups are added, producing the final result in the six least significant bits. Any remaining junk is masked out.
        n = (n + (n >> 16)) & 0x3F
        
        n      = 0000WWWW000ppppp000qqqqq000rrrrr
        n >> 16= 00000000000000000000WWWW000ppppp
        sum    = 0000WWWW000ppppp000sssss00tttttt
      where sssss is at most 11000b and tttttt at most 100000b (32 decimal).
    ;
    ; population count of a 32 bit register
    ;
    ; input:
    ;   esi = number to population count
    ;
    ; output:
    ;   edx = population count
    ;
    ; destroys:
    ;  esi, edx
    ;  eflags
    ;
    
    MACRO   POPCOUNT1
            mov     edx,esi
            shr     edx,1
            and     edx,055555555h          ; (n >> 1) & 0x55555555
            sub     esi,edx                 ; n - ((n >> 1) & 0x55555555)
            mov     edx,esi
            shr     edx,2                   ; n >> 2
            and     esi,033333333h          ; n & 0x33333333
            and     edx,033333333h          ; (n >> 2)  & 0x33333333
            add     esi,edx                 ; n = (n & 0x33333333) + ((n >> 2) & 0x33333333)
            mov     edx,esi
            shr     edx,4                   ; n >> 4
            add     edx,esi                 ; n + (n >> 4)
            and     edx,00F0F0F0Fh          ; n = (n + (n >> 4) & 0x0F0F0F0F)
            IFDEF FASTMUL
                    imul    edx,001010101h  ; add by multiplying with a "magic number"
                    shr     edx,24          ; shift result into place
            ENDIF
            IFNDEF FASTMUL
                    mov     esi,edx
                    shr     esi,8           ; n >> 8
                    add     edx,esi         ; n = n + (n >> 8)
                    mov     esi,edx
                    shr     esi,16          ; n >> 16
                    add     edx,esi         ; n = n + (n >> 16)
                    and     edx,00000003Fh  ; popcount
            ENDIF
    ENDM
    Note that FASTMUL can be defined if your CPU has does a multiply in few cycles (for example the AMD which can do a multiply in 2 cycles. This gem comes from the K6 optimization manual, available at AMDs homepage. There are a gem showing how to use this macro, see Emulate bit scan instructions for further information.
    Gem writer: Norbert Juffa
    last updated: 1998-03-16