What if you want a routine to invert an 8x8, 16x16, or 32x32 bit array? Inversion, as used here, means exchanging the bit position and the byte/word/dword offset of each bit, i.e. transposing the bit matrix (as opposed to flipping each bit, which is also called inversion):
    Bit #:  76543210            76543210
    x0      00010000b       y0  00000000b
    x1      00111000b       y1  01111000b
    x2      01101100b       y2  01111100b
    x3      11000110b  -->  y3  00010110b
    x4      11111110b       y4  00010011b
    x5      11000110b       y5  00010110b
    x6      11000110b       y6  01111100b
    x7      00000000b       y7  01111000b

The inversion consists of a 90-degree clockwise rotation plus a flip around the X axis. The latter part can be disregarded, since that can be done by using negative instead of positive Y-increments.
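To pin down the definition, here is a minimal bit-by-bit reference model in C. It is only a sketch for checking faster versions against: the function name `transpose8x8_ref` is made up for illustration, and it assumes bit 0 is the least significant bit of each byte, as in the diagram.

```c
#include <stdint.h>

/* Reference model: bit i of y[j] equals bit j of x[i]
 * (bit 0 = least significant bit). Slow, but easy to verify. */
static void transpose8x8_ref(const uint8_t x[8], uint8_t y[8])
{
    for (int j = 0; j < 8; j++) {
        uint8_t row = 0;
        for (int i = 0; i < 8; i++) {
            /* take bit j of input row i and place it at bit i */
            row |= (uint8_t)(((x[i] >> j) & 1u) << i);
        }
        y[j] = row;
    }
}
```

Feeding it the x0..x7 bytes from the diagram (10h, 38h, 6Ch, C6h, FEh, C6h, C6h, 00h) should give the y0..y7 column.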
I'd do the rotation part with one or more lookup tables. 4x4 bit chunks would need a single table of 64K 16-bit entries (i.e. 128KB), which will fit easily inside the 256K of L2 cache that I assume you have.
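For what it's worth, such a table could be generated along these lines. This is a sketch under assumptions of mine, not something stated above: the 16-bit index packs the four rows as nibbles (nibble r = row r, bit c = column c), and the entry stores the transposed chunk directly, so no separate flip is needed. The name `build_rotate_4x4` is hypothetical; `rotate_4x4` is the table name used in the assembly.

```c
#include <stdint.h>

/* 64K entries of 16 bits each = 128 KB. */
static uint16_t rotate_4x4[65536];

/* Assumed layout: input bit (4*r + c) is row r, column c of the chunk.
 * Each entry holds the transposed chunk: output bit (4*c + r). */
static void build_rotate_4x4(void)
{
    for (uint32_t v = 0; v < 65536; v++) {
        uint16_t t = 0;
        for (int r = 0; r < 4; r++) {
            for (int c = 0; c < 4; c++) {
                if (v & (1u << (4 * r + c))) {
                    t |= (uint16_t)(1u << (4 * c + r));
                }
            }
        }
        rotate_4x4[v] = t;
    }
}
```

Call `build_rotate_4x4()` once at startup. Since transposing twice gives back the original, looking up an entry's own value should return the index you started with, which makes a cheap sanity check.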
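Each 8x4 block would then be split into two 4x4 chunks, which are rotated independently and joined together, something like this:

    xor  eax,eax            ; clear upper halves: eax/ebx end up as table indices
    xor  ebx,ebx
    mov  al,[x0]            ; 00,01,02,03,04,05,06,07
    mov  bl,[x1]            ; 08,09,0a,0b,0c,0d,0e,0f
    mov  ah,[x2]            ; 10,11,12,13,14,15,16,17
    mov  bh,[x3]            ; 18,19,1a,1b,1c,1d,1e,1f
    mov  esi,eax
    and  eax,0f0fh          ; 00,01,02,03 and 10,11,12,13
    shr  esi,4
    mov  edx,ebx
    shl  edx,4
    and  ebx,0f0f0h         ; 0c,0d,0e,0f and 1c,1d,1e,1f
    and  esi,0f0fh          ; 04,05,06,07 and 14,15,16,17
    and  edx,0f0f0h         ; 08,09,0a,0b and 18,19,1a,1b
    or   eax,edx            ; 00,01,02,03,08,09,0a,0b,10,11,12,13,18,19,1a,1b
    or   ebx,esi            ; 04,05,06,07,0c,0d,0e,0f,14,15,16,17,1c,1d,1e,1f
    mov  ax,rotate_4x4[eax*2]
    mov  bx,rotate_4x4[ebx*2]
    mov  [temp1],eax
    mov  [temp2],ebx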
Using the same code twice and splitting/merging the two sets of results, it seems like it should take about 30 cycles for each 8x8 block.
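The "same code twice, then split/merge" step can be modelled in C roughly as follows. Treat it as a sketch under my own assumptions rather than the definitive routine: the table stores 4x4 transposes with nibble r = row r, the nibble masks are 0F0Fh and F0F0h, and the helper names `build_rotate_4x4`, `half_block` and `transpose8x8` are made up for illustration.

```c
#include <stdint.h>

static uint16_t rotate_4x4[65536];

/* Fill the table: entry v holds the 4x4 transpose of v
 * (input bit 4*r + c  ->  output bit 4*c + r). */
static void build_rotate_4x4(void)
{
    for (uint32_t v = 0; v < 65536; v++) {
        uint16_t t = 0;
        for (int b = 0; b < 16; b++) {
            if (v & (1u << b)) {
                t |= (uint16_t)(1u << (4 * (b & 3) + (b >> 2)));
            }
        }
        rotate_4x4[v] = t;
    }
}

/* One 8x4 block: gather the low and the high nibbles of four rows
 * into two 16-bit indices, then look both up. */
static void half_block(const uint8_t x[4], uint16_t *lo, uint16_t *hi)
{
    uint32_t a = x[0] | ((uint32_t)x[2] << 8);   /* like al / ah */
    uint32_t b = x[1] | ((uint32_t)x[3] << 8);   /* like bl / bh */
    *lo = rotate_4x4[(a & 0x0F0Fu) | ((b << 4) & 0xF0F0u)];
    *hi = rotate_4x4[((a >> 4) & 0x0F0Fu) | (b & 0xF0F0u)];
}

/* Full 8x8 block: run the 8x4 step on rows 0..3 and rows 4..7,
 * then merge the resulting nibbles into the eight output bytes. */
static void transpose8x8(const uint8_t x[8], uint8_t y[8])
{
    uint16_t tl, th, bl, bh;
    half_block(x, &tl, &th);        /* rows 0..3 */
    half_block(x + 4, &bl, &bh);    /* rows 4..7 */
    for (int j = 0; j < 4; j++) {
        int s = 4 * j;
        y[j]     = (uint8_t)(((tl >> s) & 0xFu) | (((bl >> s) & 0xFu) << 4));
        y[j + 4] = (uint8_t)(((th >> s) & 0xFu) | (((bh >> s) & 0xFu) << 4));
    }
}
```

After a single call to `build_rotate_4x4()`, `transpose8x8()` applied to the x0..x7 bytes of the diagram should reproduce the y0..y7 bytes. The merge loop is where the remaining cycles go in the assembly version: each output byte takes one nibble from the upper 8x4 block and one from the lower.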
With a 90MHz Pentium, this equates to about 90/30 = 3 million blocks per second, or 3*8 = 24MB/s, which is probably faster than the rest of the system could keep up with.