Sometimes you need to take a 64-bit number (stored in two registers, HI:LO) and shift it left by up to 63 places, using two more dwords (eHI:eLO) to take the overflow, with the exact shift count stored in CL:
eHI : eLO : HI : LO
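One solution is:
shld eLO, HI, cl ; shift by count mod 32 (eLO starts out as zero)
shld HI, LO, cl
shl LO, cl ; SHLD leaves its source register unchanged, so LO must be shifted too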
shl ecx,28 ; put 5th bit into CF
jnc done ; shift by 32
mov eHI, eLO
mov eLO, HI
mov HI, LO
xor LO, LO
done:
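Which is rather clunky, but still close to optimal, since we have to handle shift counts greater than or equal to 32 somehow. I'd replace the SHL ECX,28 with a simple TEST CL,32 (branching with JZ instead of JNC) or CMP CL,32 (branching with JB). That also keeps the count intact and avoids an off-by-one: SHL ECX,28 moves bit 4 (value 16) of the count into CF, and it would take SHL ECX,27 to get bit 5 (value 32) there. Otherwise your code seems fine.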
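To make the intended result concrete, here is a small C model of the two-step approach above (shift by the count mod 32, then move everything up one dword if bit 5 of the count is set). It is only a sketch for checking results, not a replacement for the assembly: the struct `wide128` and the function `shift_two_step` are names made up for this illustration, and eHI:eLO are assumed to be zero on entry.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the four dwords eHI : eLO : HI : LO. */
typedef struct { uint32_t lo, hi, elo, ehi; } wide128;

/* Shift the 64-bit value hi:lo left by count (0..63) into 128 bits. */
static wide128 shift_two_step(uint32_t hi, uint32_t lo, unsigned count)
{
    wide128 r = { lo, hi, 0, 0 };      /* assumption: eHI and eLO start as zero */
    unsigned c = count & 31;           /* SHLD and SHL only use the count mod 32 */

    assert(count < 64);
    if (c != 0) {                      /* a shift by 32 - 0 would be undefined in C */
        r.elo = r.hi >> (32 - c);                  /* shld eLO, HI, cl */
        r.hi  = (r.hi << c) | (r.lo >> (32 - c));  /* shld HI, LO, cl  */
        r.lo  = r.lo << c;                         /* shl  LO, cl      */
    }
    if (count & 32) {                  /* test cl,32 / jz done */
        r.ehi = r.elo;
        r.elo = r.hi;
        r.hi  = r.lo;
        r.lo  = 0;
    }
    return r;
}
```

Comparing the registers after the assembly routine against this model for all 64 counts is a cheap way to catch a wrong flag test or a missing instruction.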
The only other idea I'd look into would be to get rid of the double shifts (SHLD) entirely, since they are very slow. On a Pentium you could probably use dedicated code for most of the shift counts, and avoid the variable shifts as well. The jump that selects the code for a given shift count would be a pretty much unpredictable branch though, so on a PPro/PII it would be very slow.
It might be worthwhile to generate two sets of code (not just the shifts, but all of the inner loop(s)), one for Pentium and the other for PPro+.
Pentium code:
jmp ShiftTable[ecx*4] ; ECX = shift count (0..63), ShiftTable = table of the 64 ShiftTarget addresses
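MACRO SmallShift count, eHi, eLo, HI, LO
mov eHi,LO ; eHi is only used as a scratch register here
mov eLo,HI
shr eHi,32-(count) ; bits of LO that move up into HI
shl HI,count
shr eLo,32-(count) ; bits of HI that move up into eLo
or HI,eHi
shl LO,count
ENDM SmallShift
; (count) is parenthesized so that an argument like COUNT - 32 expands correctly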
COUNT = 1
REPT 31
ShiftTarget&&<COUNT>:
SmallShift COUNT, eHi, eLo, HI, LO
xor eHi,eHi
jmp ShiftDone
COUNT = COUNT + 1
ENDM
ShiftTarget32:
mov eLo, HI
mov HI, LO
xor LO, LO
jmp ShiftDone
COUNT = 33
REPT 31
ShiftTarget&&<COUNT>:
SmallShift COUNT - 32, eLo, eHi, HI, LO
mov eLo, HI
mov HI, LO ; the remaining 32-place shift moves every dword up by one
xor LO, LO
jmp ShiftDone
COUNT = COUNT + 1
ENDM
ShiftTarget0:
ShiftDone:
The code above will have one initial unpredictable branch, taking about 4-5 cycles, and a perfectly predictable JMP at the end, so it should run in about 10 cycles on a Pentium. The "naive" code at the top will use 9-11 cycles if its branch is correctly predicted, so it is only when you randomly alternate between shift counts above and below 32 that the much more complicated version can be a big win.
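For checking the table-driven version, the same kind of C model can mirror its four cases (count 0, 1..31, 32 and 33..63) while using only plain single-register shifts, as SmallShift does. Again this is just a sketch under assumptions: `regs128`, `small_shift` and `shift_dispatch` are hypothetical names, eHI:eLO are taken to be zero on entry, and the if/else chain merely stands in for the jump through ShiftTable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the four dwords eHI : eLO : HI : LO. */
typedef struct { uint32_t lo, hi, elo, ehi; } regs128;

/* Models SmallShift for 1 <= c <= 31 without any double shifts.
   *ext receives the bits that are shifted out of HI. */
static void small_shift(unsigned c, uint32_t *ext, uint32_t *hi, uint32_t *lo)
{
    uint32_t carry = *lo >> (32 - c);  /* the macro keeps this in a scratch register */
    *ext = *hi >> (32 - c);
    *hi  = (*hi << c) | carry;
    *lo  = *lo << c;
}

static regs128 shift_dispatch(uint32_t hi, uint32_t lo, unsigned count)
{
    regs128 r = { lo, hi, 0, 0 };      /* assumption: eHI and eLO start as zero */

    assert(count < 64);
    if (count == 0) {
        /* ShiftTarget0: nothing to do */
    } else if (count < 32) {           /* ShiftTarget1..31 */
        small_shift(count, &r.elo, &r.hi, &r.lo);
    } else if (count == 32) {          /* ShiftTarget32: move each dword up by one */
        r.elo = r.hi; r.hi = r.lo; r.lo = 0;
    } else {                           /* ShiftTarget33..63 */
        small_shift(count - 32, &r.ehi, &r.hi, &r.lo);
        r.elo = r.hi; r.hi = r.lo; r.lo = 0;
    }
    return r;
}
```

In the last case the overflow goes straight into eHI, which is why the macro is invoked with its eHi and eLo arguments swapped for counts above 32.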