Snippets from Ramblins in RealtimeAssembler/Pentium

by: Michael Abrash, published in: Dr Dobbs Sourcebook

I found these gems in Dr Dobbs Sourcebook (September/October 1996). They are all written by Michael Abrash and it is with his permission I make them available here. Further publishing WILL require Mr. Abrash or Dr. Dobbs permisson.
The routines cover highspeed crossproduct, dotproduct and matrix transformations. They are designed to fully utilize the Intel Pentium(tm) floating point pipeline. The optimizations used here can be found in "Optimizations for Intel's 32-Bit Processors", available at Intels Homepage.
After reading Mr Abrash article I learned several ways to speed up FPU math code. I also learned that fixedpoint will not be used in many areas because it can not give you the speed and accuracy needed today.

First a dot product, fairly simple to understand. You have two vectors ([vec0],[vec1]), and the result comes in [dot].

; unoptimized dot product; 17 cycles
        fld     [vec0+0]        ;starts & ends on cycle 0
        fmul    [vec1+0]        ;starts on cycle 1
        fld     [vec0+4]        ;starts & ends on cycle 2
        fmul    [vec1+4]        ;starts on cycle 3
        fld     [vec0+8]        ;starts & ends on cycle 4
        fmul    [vec1+8]        ;starts on cycle 5
                                ;stalls for cycles 6-7
        faddp   st(1),st(0)     ;starts on cycle 8
                                ;stalls for cycles 9-10
        faddp   st(1),st(0)     ;starts on cycle 11
                                ;stalls for cycles 12-14
        fstp    [dot]           ;starts on cycle 15,
                                ;ends on cycle 16
And now an optimized version of the dot product:
; optimized dot product; 15 cycles
        fld     [vec0+0]        ;starts & ends on cycle 0
        fmul    [vec1+0]        ;starts on cycle 1
        fld     [vec0+4]        ;starts & ends on cycle 2
        fmul    [vec1+4]        ;starts on cycle 3
        fld     [vec0+8]        ;starts & ends on cycle 4
        fmul    [vec1+8]        ;starts on cycle 5
        fxch    st(1)           ;no cost
        faddp   st(2),st(0)     ;starts on cycle 6
                                ;stalls for cycles 7-8
        faddp   st(1),st(0)     ;starts on cycle 9
                                ;stalls for cycles 10-12
        fstp    [dot]           ;starts on cycle 13,
                                ;ends on cycle 14
Here is a cross-product. Here you also have two vectors, [vec0],[vec1], but now the result is another vector [vec2]:
; unoptimized cross product; 36 cycles
        fld     [vec0+4]        ;starts & ends on cycle 0
        fmul    [vec1+8]        ;starts on cycle 1
        fld     [vec0+8]        ;starts & ends on cycle 2
        fmul    [vec1+4]        ;starts on cycle 3
                                ;stalls for cycles 4-5
        fsubrp  st(1),st(0)     ;starts on cycle 6
                                ;stalls for cycles 7-9
        fstp    [vec2+0]        ;starts on cycle 10,
                                ;ends on cycle 11
        fld     [vec0+8]        ;starts & ends on cycle 12
        fmul    [vec1+0]        ;starts on cycle 13
        fld     [vec0+0]        ;starts & ends on cycle 14
        fmul    [vec1+8]        ;starts on cycle 15
                                ;stalls for cycles 16-17
        fsubrp  st(1),st(0)     ;starts on cycle 18
                                ;stalls for cycles 19-21
        fstp    [vec2+4]        ;starts on cycle 22,
                                ;ends on cycle 23
        fld     [vec0+0]        ;starts & ends on cycle 24
        fmul    [vec1+4]        ;starts on cycle 25
        fld     [vec0+4]        ;starts & ends on cycle 26
        fmul    [vec1+0]        ;starts on cycle 27
                                ;stalls for cycles 28-29
        fsubrp  st(1),st(0)     ;starts on cycle 30
                                ;stalls for cycles 31-33
        fstp    [vec2+8]        ;starts on cycle 34,
                                ;ends on cycle 35
Same as above, only much more optimized. And now, it is fast:
; optimized cross product; 22 cycles
        fld     [vec0+4]        ;starts & ends on cycle 0
        fmul    [vec1+8]        ;starts on cycle 1
        fld     [vec0+8]        ;starts & ends on cycle 2
        fmul    [vec1+0]        ;starts on cycle 3
        fld     [vec0+0]        ;starts & ends on cycle 4
        fmul    [vec1+4]        ;starts on cycle 5
        fld     [vec0+8]        ;starts & ends on cycle 6
        fmul    [vec1+4]        ;starts on cycle 7
        fld     [vec0+0]        ;starts & ends on cycle 8
        fmul    [vec1+8]        ;starts on cycle 9
        fld     [vec0+4]        ;starts & ends on cycle 10
        fmul    [vec1+0]        ;starts on cycle 11
        fxch    st(2)           ;no cost
        fsubrp  st(5),st(0)     ;starts on cycle 12
        fsubrp  st(3),st(0)     ;starts on cycle 13
        fsubrp  st(1),st(0)     ;starts on cycle 14
        fxch    st(2)           ;no cost
                                ;stalls for cycle 15
        fstp    [vec2+0]        ;starts on cycle 16,
                                ;ends on cycle 17
        fstp    [vec2+4]        ;starts on cycle 18,
                                ;ends on cycle 19
        fstp    [vec2+8]        ;starts on cycle 20,
                                ;ends on cycle 21
Here is a fast routine for matrix transformation. The structures are quite complex here. Mr Abrash uses a 4x4 matrix with bottom row equal to: [0,0,0,1]. The code should be rather self explaining:
; optimized transformation: 34 cycles
        fld     [vec0+0]        ;starts & ends on cycle 0
        fmul    [matrix+0]      ;starts on cycle 1
        fld     [vec0+0]        ;starts & ends on cycle 2
        fmul    [matrix+16]     ;starts on cycle 3
        fld     [vec0+0]        ;starts & ends on cycle 4
        fmul    [matrix+32]     ;starts on cycle 5
        fld     [vec0+4]        ;starts & ends on cycle 6
        fmul    [matrix+4]      ;starts on cycle 7
        fld     [vec0+4]        ;starts & ends on cycle 8
        fmul    [matrix+20]     ;starts on cycle 9
        fld     [vec0+4]        ;starts & ends on cycle 10
        fmul    [matrix+36]     ;starts on cycle 11
        fxch    st(2)           ;no cost
        faddp   st(5),st(0)     ;starts on cycle 12
        faddp   st(3),st(0)     ;starts on cycle 13
        faddp   st(1),st(0)     ;starts on cycle 14
        fld     [vec0+8]        ;starts & ends on cycle 15
        fmul    [matrix+8]      ;starts on cycle 16
        fld     [vec0+8]        ;starts & ends on cycle 17
        fmul    [matrix+24]     ;starts on cycle 18
        fld     [vec0+8]        ;starts & ends on cycle 19
        fmul    [matrix+40]     ;starts on cycle 20
        fxch    st(2)           ;no cost
        faddp   st(5),st(0)     ;starts on cycle 21
        faddp   st(3),st(0)     ;starts on cycle 22
        faddp   st(1),st(0)     ;starts on cycle 23
        fxch    st(2)           ;no cost
        fadd    [matrix+12]     ;starts on cycle 24
        fxch    st(1)           ;starts on cycle 25
        fadd    [matrix+28]     ;starts on cycle 26
        fxch    st(2)           ;no cost
        fadd    [matrix+44]     ;starts on cycle 27
        fxch    st(1)           ;no cost
        fstp    [vec1+0]        ;starts on cycle 28,
                                ;ends on cycle 29
        fstp    [vec1+8]        ;starts on cycle 30,
                                ;ends on cycle 31
        fstp    [vec1+4]        ;starts on cycle 32,
                                ;ends on cycle 33
Gem writer (with permission): Michael Abrash
Published in: Dr. Dobbs Sourcebook
last updated: 1998-06-15