Hadamard Transform in MMX (Assembler/Pentium+MMX)
        + + + +
    H = + - + -
        + + - -
        + - - +

Here '+' stands for +1 and '-' for -1. The implementation uses MMX. The vector to transform is stored in a single MMX register as four components of 16 bits each. The components' magnitude is about 12 bits.
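As a point of comparison, the matrix-vector product defined by H can be written as a plain scalar loop. This is only a minimal sketch for checking results, not part of the original routine; the name `hadamard4_ref` and the 32-bit output type are assumptions made for illustration.

```c
#include <stdint.h>

/* Rows of the 4x4 matrix H from the description, with entries +1 / -1. */
static const int H[4][4] = {
    { 1,  1,  1,  1 },
    { 1, -1,  1, -1 },
    { 1,  1, -1, -1 },
    { 1, -1, -1,  1 },
};

/* out = H * in. hadamard4_ref is a hypothetical helper name.
   Sums are kept in 32 bits so that no input can overflow them. */
static void hadamard4_ref(const int16_t in[4], int32_t out[4])
{
    for (int r = 0; r < 4; ++r) {
        int32_t acc = 0;
        for (int c = 0; c < 4; ++c)
            acc += H[r][c] * (int32_t)in[c];
        out[r] = acc;
    }
}
```

For the input (1, 2, 3, 4) this gives (10, -2, -4, 0). With components below 2^12 in magnitude, each output is at most 4 * 4095 = 16380 in magnitude, so it still fits in a signed 16-bit word.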
PMADDWD performs four 16x16 -> 32-bit multiplies and two 32+32 -> 32-bit adds in a single instruction. Using it can reduce the time to about 9 cycles, given some more unrolling and scheduling.
This approach is also more generally useful: all intermediate operations are done on 32-bit values, and a single saturating pack at the end converts back to 16-bit results.
; hadamard transform (req mmx)
;
; input:
; mm0 = vector to transform
; matrix = matrix of hadamard transform
;
; output:
; mm0 = result from transform
;
; destroys:
; mm1, mm2, mm3, mm4
; flags
;
movq mm1,mm0
pmaddwd mm0,matrix[0] ; t00+t01 t02+t03
movq mm2,mm1
pmaddwd mm1,matrix[8] ; t10+t11 t12+t13
movq mm3,mm2
pmaddwd mm2,matrix[16] ; t20+t21 t22+t23
pmaddwd mm3,matrix[24] ; t30+t31 t32+t33
movq mm4,mm0
punpckldq mm0,mm1 ; t00+t01 t10+t11
punpckhdq mm4,mm1 ; t02+t03 t12+t13
movq mm1,mm2
punpckldq mm2,mm3 ; t20+t21 t30+t31
punpckhdq mm1,mm3 ; t22+t23 t32+t33
paddd mm0,mm4 ; t0 t1
paddd mm1,mm2 ; t2 t3
packssdw mm0,mm1 ; t0 t1 t2 t3
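
The data flow of the listing above can be modeled in portable C, which may help when checking the MMX version against known inputs. This is a sketch under stated assumptions: the contents of `matrix` are not shown in the source, so the row-major layout (four 16-bit words per row, 8 bytes per row, matching the offsets 0, 8, 16, 24) and the +1/-1 entries are inferred from the description. The helper names `pmaddwd_model`, `sat16` and `hadamard4_mmx_model` are hypothetical.

```c
#include <stdint.h>

/* Assumed contents of `matrix`: one row of H per 8-byte quadword. */
static const int16_t matrix[4][4] = {
    { 1,  1,  1,  1 },   /* matrix[0]  */
    { 1, -1,  1, -1 },   /* matrix[8]  */
    { 1,  1, -1, -1 },   /* matrix[16] */
    { 1, -1, -1,  1 },   /* matrix[24] */
};

/* Model of PMADDWD: multiply corresponding 16-bit words, then add
   each adjacent pair of 32-bit products. */
static void pmaddwd_model(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}

/* Model of one PACKSSDW lane: signed saturation from 32 to 16 bits. */
static int16_t sat16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Mirrors the listing: four PMADDWDs give two partial sums per row,
   the unpack + PADDD steps add the two halves of each row, and the
   final pack saturates the four 32-bit sums to 16 bits. */
static void hadamard4_mmx_model(const int16_t v[4], int16_t t[4])
{
    int32_t partial[4][2];
    for (int r = 0; r < 4; ++r)
        pmaddwd_model(v, matrix[r], partial[r]);
    for (int r = 0; r < 4; ++r)
        t[r] = sat16(partial[r][0] + partial[r][1]);
}
```

With inputs of about 12 bits the saturation never triggers. With full-range inputs such as (32767, -32768, 32767, -32768), the second component would be 131070 and is clamped to 32767.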