Sometimes it might be needed to take a 64-bit number (stored in two registers) and shift it up to 63 places (using two more dwords to take the extra), with the exact number stored in CL:eHI : eLO : HI : LO One solution is:

        shld    eLo, Hi, cl     ; shift mod 32
        shld    HI, LO, cl
        shl     ecx,28          ; put 5th bit into CF
 
        jnc     done            ; shift by 32
        mov     eHI, eLO
        mov     eLO, HI
        mov     HI, LO
        xor     LO, LO
done:

Which is rather clunky, but still close to optimal since we will have to handle shiftcounts greater or equal to 32. I'd replace the SHL ECX,28 with a simple TEST CL,32 or CMP CL,32 but otherwise your code seems to be close to optimal.

The only other idea I'd look into would be to get totally rid of the double shifts, since they are very slow. On a Pentium you could probably use dedicated code for most of the shift counts, and avoid the variable shifts as well. This would be a pretty much unpredictable branch though, so on a PPro/PII it would be very slow.

It might be worthwhile to generate two sets of code (not just the shifts, but all of the inner loop(s), one for Pentium and the other for PPro+.

Pentium code:

        jmp     ShiftTable[ecx*4]

MACRO   SmallShift      count, eHi, eLo, HI, LO

        mov     eHi,LO
        mov     eLo,HI
        shr     eHi,32-count
        shl     HI,count
        shr     eLo,32-count
        or      HI,eHi
        shl     LO,count

ENDM    SmallShift

COUNT = 1
REPT 31
ShiftTarget&&<COUNT>:

        SmallShift COUNT, eHi, eLo, HI, LO
        xor     eHi,eHi
        jmp     ShiftDone
        COUNT = COUNT + 1

ENDM

ShiftTarget32:
        mov     eLo, HI
        mov     HI, LO
        xor     LO, LO
        jmp     ShiftDone

COUNT = 33
REPT 31
ShiftTarget&&<COUNT>:
        SmallShift COUNT - 32, eLo, eHi, HI, LO
        mov     eLo, HI
        xor     LO, LO
        jmp     ShiftDone
        COUNT = COUNT + 1
ENDM

ShiftTarget0:
ShiftDone:

The code above will have one initial unpredictable branch, taking about 4-5 cycles, and a perfectly predicatble JMP at the end, so it should run in about 10 cycles on a Pentium. The "naive" code above will use 9-11 cycles if the branch is correctly predicted, so it is only when you randomly alternate between shift counts above/below 32 that the much more complicated version can be a big win.

Gem writer: Terje Mathisen
last updated: 1998-03-16