BSF and BSR are the most poorly optimized instructions on the Pentium, taking approximately 11 + 2*n clock cycles, where n is the number of zeros skipped. (On later processors they take only 1 or 2 clock cycles.)
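The following code emulates BSR ECX,EAX:

        test    eax,eax
        jz      short bs1               ; EAX = 0: exit with ZF set, as BSR does
        mov     [dword ptr temp],eax    ; temp is a qword scratch variable: low dword = EAX
        mov     [dword ptr temp+4],0    ; high dword = 0, so the 64-bit integer is non-negative
        fild    [qword ptr temp]        ; load it as a 64-bit integer
        fstp    [qword ptr temp]        ; store it back as a 64-bit floating-point number
        wait                            ; WAIT only needed for compatibility with earlier processors
        mov     ecx,[dword ptr temp+4]  ; high dword: sign, exponent, top of mantissa
        shr     ecx,20                  ; isolate the biased exponent (sign bit is 0)
        sub     ecx,3ffh                ; remove the bias: ECX = index of highest set bit
        test    eax,eax                 ; clear zero flag
bs1: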
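The idea behind the BSR emulation can be sketched in C: convert the integer to a double and read the exponent field, which holds the index of the highest set bit plus a bias of 3FFh. This is only an illustration, not part of the original gem. The function name bsr_via_double and the return value of -1 for a zero input are assumptions made for the sketch, and it assumes that double is an IEEE 754 64-bit type.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the original text): returns the index of the
   highest set bit of x, or -1 when x is 0. The real BSR instruction sets ZF
   and leaves the destination undefined in that case. */
int bsr_via_double(uint32_t x)
{
    if (x == 0)
        return -1;

    double d = (double)x;               /* exact: 32 bits fit in a 53-bit mantissa */
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);     /* reinterpret the double's bit pattern */

    uint32_t hi = (uint32_t)(bits >> 32);   /* the dword the assembly reads at temp+4 */
    return (int)(hi >> 20) - 0x3FF;     /* biased exponent minus the bias */
}
```

For example, bsr_via_double(3) gives 1, because 3.0 is stored as 1.5 * 2^1.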
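The following code emulates BSF ECX,EAX:

        test    eax,eax
        jz      short bs2               ; EAX = 0: exit with ZF set, as BSF does
        xor     ecx,ecx
        mov     [dword ptr temp+4],ecx  ; temp is a qword scratch variable: high dword = 0
        sub     ecx,eax                 ; ECX = -EAX
        and     eax,ecx                 ; isolate the lowest set bit (modifies EAX, unlike BSF)
        mov     [dword ptr temp],eax    ; low dword = the isolated bit
        fild    [qword ptr temp]        ; load it as a 64-bit integer
        fstp    [qword ptr temp]        ; store it back as a 64-bit floating-point number
        wait                            ; WAIT only needed for compatibility with earlier processors
        mov     ecx,[dword ptr temp+4]  ; high dword: sign, exponent, top of mantissa
        shr     ecx,20                  ; isolate the biased exponent (sign bit is 0)
        sub     ecx,3ffh                ; remove the bias: ECX = index of lowest set bit
        test    eax,eax                 ; clear zero flag
bs2: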
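The BSF emulation uses the same exponent trick after first isolating the lowest set bit with x AND (-x), which leaves a single power of two whose exponent is the wanted bit index. A minimal C sketch follows, again only as an illustration. The function name bsf_via_double and the return value of -1 for a zero input are assumptions, and double is assumed to be an IEEE 754 64-bit type.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the original text): returns the index of the
   lowest set bit of x, or -1 when x is 0. The real BSF instruction sets ZF
   and leaves the destination undefined in that case. */
int bsf_via_double(uint32_t x)
{
    if (x == 0)
        return -1;

    uint32_t lowest = x & (0u - x);     /* keep only the lowest set bit */
    double d = (double)lowest;          /* an exact power of two */
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);     /* reinterpret the double's bit pattern */

    /* Bits 52..62 hold the biased exponent; subtract the bias 3FFh. */
    return (int)((bits >> 52) & 0x7FF) - 0x3FF;
}
```

For example, bsf_via_double(12) gives 2, because 12 AND -12 is 4, which is 2^2.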
This gem comes from Agner Fog's "How to optimize for Pentium(tm) microprocessor". This manual is highly recommended.