I found these gems in Dr Dobbs Sourcebook (September/October 1996). They are all written by Michael Abrash and it is with his permission I make them available here. Further publishing WILL require Mr. Abrash or Dr. Dobbs permisson.
The routines cover highspeed crossproduct, dotproduct and matrix transformations. They are designed to fully utilize the Intel Pentium(tm) floating point pipeline. The optimizations used here can be found in "Optimizations for Intel's 32-Bit Processors", available at Intels Homepage.
After reading Mr Abrash article I learned several ways to speed up FPU math code. I also learned that fixedpoint will not be used in many areas because it can not give you the speed and accuracy needed today.
First a dot product, fairly simple to understand. You have two vectors ([vec0],[vec1]
), and the result comes in [dot]
.; unoptimized dot product; 17 cycles
fld [vec0+0] ;starts & ends on cycle 0
fmul [vec1+0] ;starts on cycle 1
fld [vec0+4] ;starts & ends on cycle 2
fmul [vec1+4] ;starts on cycle 3
fld [vec0+8] ;starts & ends on cycle 4
fmul [vec1+8] ;starts on cycle 5
;stalls for cycles 6-7
faddp st(1),st(0) ;starts on cycle 8
;stalls for cycles 9-10
faddp st(1),st(0) ;starts on cycle 11
;stalls for cycles 12-14
fstp [dot] ;starts on cycle 15,
;ends on cycle 16
And now an optimized version of the dot product:; optimized dot product; 15 cycles
fld [vec0+0] ;starts & ends on cycle 0
fmul [vec1+0] ;starts on cycle 1
fld [vec0+4] ;starts & ends on cycle 2
fmul [vec1+4] ;starts on cycle 3
fld [vec0+8] ;starts & ends on cycle 4
fmul [vec1+8] ;starts on cycle 5
fxch st(1) ;no cost
faddp st(2),st(0) ;starts on cycle 6
;stalls for cycles 7-8
faddp st(1),st(0) ;starts on cycle 9
;stalls for cycles 10-12
fstp [dot] ;starts on cycle 13,
;ends on cycle 14
Here is a cross-product. Here you also have two vectors, [vec0],[vec1]
, but now the result is another vector [vec2]
:; unoptimized cross product; 36 cycles
fld [vec0+4] ;starts & ends on cycle 0
fmul [vec1+8] ;starts on cycle 1
fld [vec0+8] ;starts & ends on cycle 2
fmul [vec1+4] ;starts on cycle 3
;stalls for cycles 4-5
fsubrp st(1),st(0) ;starts on cycle 6
;stalls for cycles 7-9
fstp [vec2+0] ;starts on cycle 10,
;ends on cycle 11
fld [vec0+8] ;starts & ends on cycle 12
fmul [vec1+0] ;starts on cycle 13
fld [vec0+0] ;starts & ends on cycle 14
fmul [vec1+8] ;starts on cycle 15
;stalls for cycles 16-17
fsubrp st(1),st(0) ;starts on cycle 18
;stalls for cycles 19-21
fstp [vec2+4] ;starts on cycle 22,
;ends on cycle 23
fld [vec0+0] ;starts & ends on cycle 24
fmul [vec1+4] ;starts on cycle 25
fld [vec0+4] ;starts & ends on cycle 26
fmul [vec1+0] ;starts on cycle 27
;stalls for cycles 28-29
fsubrp st(1),st(0) ;starts on cycle 30
;stalls for cycles 31-33
fstp [vec2+8] ;starts on cycle 34,
;ends on cycle 35
Same as above, only much more optimized. And now, it is fast:; optimized cross product; 22 cycles
fld [vec0+4] ;starts & ends on cycle 0
fmul [vec1+8] ;starts on cycle 1
fld [vec0+8] ;starts & ends on cycle 2
fmul [vec1+0] ;starts on cycle 3
fld [vec0+0] ;starts & ends on cycle 4
fmul [vec1+4] ;starts on cycle 5
fld [vec0+8] ;starts & ends on cycle 6
fmul [vec1+4] ;starts on cycle 7
fld [vec0+0] ;starts & ends on cycle 8
fmul [vec1+8] ;starts on cycle 9
fld [vec0+4] ;starts & ends on cycle 10
fmul [vec1+0] ;starts on cycle 11
fxch st(2) ;no cost
fsubrp st(5),st(0) ;starts on cycle 12
fsubrp st(3),st(0) ;starts on cycle 13
fsubrp st(1),st(0) ;starts on cycle 14
fxch st(2) ;no cost
;stalls for cycle 15
fstp [vec2+0] ;starts on cycle 16,
;ends on cycle 17
fstp [vec2+4] ;starts on cycle 18,
;ends on cycle 19
fstp [vec2+8] ;starts on cycle 20,
;ends on cycle 21
Here is a fast routine for matrix transformation. The structures are quite complex here. Mr Abrash uses a 4x4 matrix with bottom row equal to: [0,0,0,1]
. The code should be rather self explaining: ; optimized transformation: 34 cycles
fld [vec0+0] ;starts & ends on cycle 0
fmul [matrix+0] ;starts on cycle 1
fld [vec0+0] ;starts & ends on cycle 2
fmul [matrix+16] ;starts on cycle 3
fld [vec0+0] ;starts & ends on cycle 4
fmul [matrix+32] ;starts on cycle 5
fld [vec0+4] ;starts & ends on cycle 6
fmul [matrix+4] ;starts on cycle 7
fld [vec0+4] ;starts & ends on cycle 8
fmul [matrix+20] ;starts on cycle 9
fld [vec0+4] ;starts & ends on cycle 10
fmul [matrix+36] ;starts on cycle 11
fxch st(2) ;no cost
faddp st(5),st(0) ;starts on cycle 12
faddp st(3),st(0) ;starts on cycle 13
faddp st(1),st(0) ;starts on cycle 14
fld [vec0+8] ;starts & ends on cycle 15
fmul [matrix+8] ;starts on cycle 16
fld [vec0+8] ;starts & ends on cycle 17
fmul [matrix+24] ;starts on cycle 18
fld [vec0+8] ;starts & ends on cycle 19
fmul [matrix+40] ;starts on cycle 20
fxch st(2) ;no cost
faddp st(5),st(0) ;starts on cycle 21
faddp st(3),st(0) ;starts on cycle 22
faddp st(1),st(0) ;starts on cycle 23
fxch st(2) ;no cost
fadd [matrix+12] ;starts on cycle 24
fxch st(1) ;starts on cycle 25
fadd [matrix+28] ;starts on cycle 26
fxch st(2) ;no cost
fadd [matrix+44] ;starts on cycle 27
fxch st(1) ;no cost
fstp [vec1+0] ;starts on cycle 28,
;ends on cycle 29
fstp [vec1+8] ;starts on cycle 30,
;ends on cycle 31
fstp [vec1+4] ;starts on cycle 32,
;ends on cycle 33