Replace bit scan instructions
Assembler/80386+FPU

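BSF and BSR are the most poorly optimized instructions on the Pentium, taking approximately 11 + 2*n clock cycles, where n is the number of zeros skipped. (On later processors they take only 1 or 2 clock cycles.) The replacement uses the FPU: the value is loaded as a 64-bit integer from an 8-byte scratch variable temp and stored back as a double, whose exponent field then holds 3FFh plus the index of the highest set bit. The following code emulates BSR ECX,EAX: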

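        test    eax,eax                 ; zero input: skip with the zero flag set,
        jz      short bs1               ; as BSR does
        mov     [dword ptr temp],eax    ; low dword of the 64-bit integer
        mov     [dword ptr temp+4],0    ; high dword = 0, so the value is non-negative
        fild    [qword ptr temp]        ; load as a 64-bit integer
        fstp    [qword ptr temp]        ; store back as a 64-bit double
        wait    ; WAIT only needed for compatibility with earlier processors
        mov     ecx,[dword ptr temp+4]  ; high dword: sign, exponent, top of mantissa
        shr     ecx,20                  ; keep the 11-bit biased exponent
        sub     ecx,3ffh                ; remove the bias: ecx = index of highest set bit
        test    eax,eax       ; clear zero flag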
bs1:
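
The same exponent trick can be checked in C. This is only a minimal sketch, not part of the original gem: the function name bsr_via_double is made up for illustration, and it assumes that double is an IEEE 754 64-bit value, as it is on the x86 FPU.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the original gem): index of the highest
   set bit of x, read from the exponent of a double as the assembly does.
   Assumes double is IEEE 754 binary64. x must be nonzero. */
int bsr_via_double(uint32_t x)
{
    double d;
    uint64_t bits;
    uint32_t hi;

    assert(x != 0);                  /* like BSR, no result for zero input */
    d = (double)x;                   /* exact: 32 bits fit in the 53-bit mantissa */
    memcpy(&bits, &d, sizeof bits);  /* reinterpret the stored bit pattern */
    hi = (uint32_t)(bits >> 32);     /* corresponds to the dword at temp+4 */
    return (int)(hi >> 20) - 0x3FF;  /* sign bit is 0, so this is exponent - bias */
}
```

For example, 80000000h is 2^31, so its biased exponent is 3FFh + 31 and the function returns 31.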
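
The following code emulates BSF ECX,EAX. It first isolates the lowest set bit with EAX AND (-EAX), so that the lowest and the highest set bit coincide. Note that, unlike the real BSF, it overwrites EAX:
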
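        test    eax,eax                 ; zero input: skip with the zero flag set,
        jz      short bs2               ; as BSF does
        xor     ecx,ecx
        mov     [dword ptr temp+4],ecx  ; high dword of the 64-bit integer = 0
        sub     ecx,eax                 ; ecx = -eax
        and     eax,ecx                 ; eax = lowest set bit only (EAX is overwritten)
        mov     [dword ptr temp],eax    ; low dword of the 64-bit integer
        fild    [qword ptr temp]        ; load as a 64-bit integer
        fstp    [qword ptr temp]        ; store back as a 64-bit double
        wait    ; WAIT only needed for compatibility with earlier processors
        mov     ecx,[dword ptr temp+4]  ; high dword: sign, exponent, top of mantissa
        shr     ecx,20                  ; keep the 11-bit biased exponent
        sub     ecx,3ffh                ; remove the bias: ecx = index of lowest set bit
        test    eax,eax       ; clear zero flag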
bs2:
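
The bit isolation step of the BSF emulation can be sketched in C as follows. Again this is an illustration under assumptions rather than the author's code: the name bsf_via_double is hypothetical, and double is assumed to be an IEEE 754 64-bit value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the original gem): index of the lowest
   set bit of x. Assumes double is IEEE 754 binary64. x must be nonzero. */
int bsf_via_double(uint32_t x)
{
    uint32_t lowest;
    double d;
    uint64_t bits;

    assert(x != 0);                  /* like BSF, no result for zero input */
    lowest = x & (0u - x);           /* same as SUB ECX,EAX / AND EAX,ECX */
    d = (double)lowest;              /* a power of two: exponent = bit index */
    memcpy(&bits, &d, sizeof bits);  /* reinterpret the stored bit pattern */
    return (int)((uint32_t)(bits >> 32) >> 20) - 0x3FF;
}
```

Because x AND (-x) leaves exactly one bit set, the double is an exact power of two and its unbiased exponent is the wanted bit index. Unlike the assembly version, this sketch works on a copy and leaves its argument untouched.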

This gem comes from Agner Fog's "How to optimize for Pentium(tm) microprocessor". This manual is highly recommended.
Gem writer: Agner Fog
Last updated: 1998-03-16