Hadamard Transform in MMX
Assembler/Pentium+MMX

This is an implementation of a Hadamard transform in the form of a 4x4 matrix multiplication. The matrix is:
       + + + +
  H =  + - + -
       + + - -
       + - - +
The '+' stands for +1 and the '-' for -1. The implementation uses MMX. The vector to transform is stored in an MMX register as four components of 16 bits each. The magnitude of each component is about 12 bits.
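
As a plain scalar reference for the multiplication described above, the following C sketch computes H times a 4-component vector. The name hadamard4_ref and the 32-bit output type are illustrative assumptions, not part of the original gem.

```c
#include <stdint.h>

/* Rows of H as given above: '+' is +1, '-' is -1. */
static const int16_t H[4][4] = {
    { 1,  1,  1,  1 },
    { 1, -1,  1, -1 },
    { 1,  1, -1, -1 },
    { 1, -1, -1,  1 },
};

/* Hypothetical helper: out = H * in, with every sum kept in 32 bits. */
void hadamard4_ref(const int16_t in[4], int32_t out[4])
{
    for (int r = 0; r < 4; r++) {
        int32_t sum = 0;
        for (int c = 0; c < 4; c++)
            sum += (int32_t)H[r][c] * in[c];
        out[r] = sum;
    }
}
```

For the input (1, 2, 3, 4) this gives (10, -2, -4, 0). If each component really stays within 12 bits of magnitude (at most 4095), each output is at most 4 * 4095 = 16380 in magnitude and so still fits in a signed 16-bit word.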

PMADDWD does four 16x16 -> 32-bit multiplies plus two 32+32 -> 32-bit adds in a single instruction. Using it can reduce the time to about 9 cycles with some more unrolling/scheduling.

This has the added benefit of being much more generally useful, because all the intermediate operations are done on 32-bit values, with a single saturated pack operation at the end to get back to 16-bit results.
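
The data flow described in the two paragraphs above can be modelled in scalar C. This is only a sketch: pmaddwd_model, sat16 and hadamard4_model are hypothetical names, and the model ignores instruction scheduling as well as the PMADDWD wraparound case where all four input words are 0x8000.

```c
#include <stdint.h>

/* Model of PMADDWD: four 16x16 -> 32-bit multiplies, adjacent pairs summed.
   (The all-0x8000 wraparound corner case is not modelled.) */
static void pmaddwd_model(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}

/* Model of one lane of PACKSSDW: clamp a 32-bit value to the int16 range. */
static int16_t sat16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Hypothetical model of the whole routine; rows[r] is row r of the matrix. */
void hadamard4_model(const int16_t v[4], const int16_t rows[4][4], int16_t out[4])
{
    for (int r = 0; r < 4; r++) {
        int32_t pair[2];
        pmaddwd_model(v, rows[r], pair);    /* tr0+tr1, tr2+tr3 */
        out[r] = sat16(pair[0] + pair[1]);  /* PADDD, then PACKSSDW */
    }
}
```

With the +1/-1 matrix, the input (1, 2, 3, 4) yields (10, -2, -4, 0). An input of four words equal to 32767 gives a first sum of 131068, which the final pack clamps to 32767 instead of letting it wrap.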

;
; hadamard transform (req mmx)
;
; input:
;   mm0 = vector to transform
;   matrix = matrix of hadamard transform
;            (4 rows of 4 signed 16-bit words, 8 bytes per row)
;
; output:
;   mm0 = result from transform
;
; destroys:
;   mm1, mm2, mm3, mm4
;   flags
;

        movq    mm1,mm0
        pmaddwd mm0,matrix[0]   ; t00+t01 t02+t03

        movq    mm2,mm1
        pmaddwd mm1,matrix[8]   ; t10+t11 t12+t13

        movq    mm3,mm2
        pmaddwd mm2,matrix[16]  ; t20+t21 t22+t23

        pmaddwd mm3,matrix[24]  ; t30+t31 t32+t33
        movq    mm4,mm0

        punpckldq mm0,mm1       ; t00+t01 t10+t11
        punpckhdq mm4,mm1       ; t02+t03 t12+t13

        movq    mm1,mm2
        punpckldq mm2,mm3       ; t20+t21 t30+t31

        punpckhdq mm1,mm3       ; t22+t23 t32+t33
        paddd   mm0,mm4         ; t0      t1

        paddd   mm1,mm2         ; t2      t3

        packssdw mm0,mm1        ; t0 t1 t2 t3
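
The listing reads its coefficients from matrix[0], matrix[8], matrix[16] and matrix[24], but the gem does not show how matrix is defined. A layout consistent with those byte offsets, sketched here in C as an assumption, is four rows of four signed 16-bit words, with word 0 of each row lining up with the lowest component of mm0. The helper load_row is hypothetical and only mimics the 8-byte memory read done by each PMADDWD.

```c
#include <stdint.h>
#include <string.h>

/* Assumed layout: one 8-byte row per PMADDWD operand, '+' = +1, '-' = -1. */
static const int16_t matrix[4][4] = {
    { 1,  1,  1,  1 },   /* byte offset  0 */
    { 1, -1,  1, -1 },   /* byte offset  8 */
    { 1,  1, -1, -1 },   /* byte offset 16 */
    { 1, -1, -1,  1 },   /* byte offset 24 */
};

/* Hypothetical helper: copy the 8-byte row found at a byte offset of
   0, 8, 16 or 24, as `pmaddwd mmN,matrix[offset]` would read it. */
void load_row(unsigned byte_offset, int16_t row[4])
{
    memcpy(row, (const unsigned char *)matrix + byte_offset, 8);
}
```

For example, load_row(8, row) fills row with (1, -1, 1, -1), the second row of H.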

Gem writer: Terje Mathisen
last updated: 1998-06-06