MOVS with SI=-X, DI=+X
Assembler/8086

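As we know, it is not possible to do MOVS with SI=-X, DI=+X (one pointer stepping down while the other steps up), because a single direction flag moves SI and DI the same way. X is 2, 4 or 8, depending on the size you want to move. Here is a solution for this problem with X=2, i.e. a copy that reverses the order of 16-bit words. The code reads upward from SI and writes downward from the end of the destination, which gives the same reversed result.
In 16-bit mode: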

;
; movs with si=-x, di=+x, x=2,4,8
;
; input:
;   si = source
;   di = destination
;   cx = number of bytes
;
; output:
;   none (data copied)
;
; destroys:
;   ax, bx, cx, dx, bp, si, di
;   flags
; 


        add     di,cx
        sub     cx,8            ; cx has (even) number of bytes to handle
        jb      done4
next4:
        mov     ax,[si]
        mov     bx,[si+2]
        mov     dx,[si+4]
        mov     bp,[si+6]
        mov     [di-8],bp
        mov     [di-6],dx
        mov     [di-4],bx
        mov     [di-2],ax
        add     si,8
        sub     di,8
        sub     cx,8
        jae     next4
done4:
; Handle the last 0 to 3 16-bit words here!
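One way to fill in that tail, as a minimal sketch and not the author's own code: at done4, cx holds the remaining byte count minus 8, so adding 8 back gives 0, 2, 4 or 6 bytes (assuming the even count the loop expects). The labels tail2 and tail2_done are made-up names, and this simple loop is not scheduled for pairing.

```asm
        add     cx,8            ; undo the last sub: cx = bytes left (0, 2, 4 or 6)
        jz      tail2_done      ; nothing left to copy
tail2:
        mov     ax,[si]         ; next source word, ascending
        add     si,2
        sub     di,2            ; step destination down one word
        mov     [di],ax
        sub     cx,2
        jnz     tail2
tail2_done:
```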
This version pairs perfectly on a Pentium: the 12 instructions of the loop issue as 6 pairs, for a throughput of 6 cycles for each block of 4 words, i.e. 1.5 cycles for each reversal.
I do all accesses in ascending order within the unrolled loop, since most PC hardware handles that better than reversed writes.
In 32-bit mode, with X=4 (each 32-bit access moves two 16-bit words, and ROL by 16 swaps them inside the register):
;
; movs with si=-x, di=+x, x=2,4,8
;
; input:
;   esi = source
;   edi = destination
;   ecx = number of bytes (preferably divisible by 4!)
;
; output:
;   none (data copied)
;
; destroys:
;   eax, ebx, ecx, edx, ebp, esi, edi
;   eflags
;

        lea     edi,[edi+ecx-4] ; Point edi at last 2-word block
        sub     ecx,16
        jb      done8
next8:
        mov     eax,[esi]
        mov     ebx,[esi+4]
        rol     ebx,16
        mov     edx,[esi+8]
        rol     edx,16
        mov     ebp,[esi+12]
        rol     ebp,16
        add     esi,16
        rol     eax,16
        mov     [edi-12],ebp
        mov     [edi-8],edx
        mov     [edi-4],ebx
        mov     [edi],eax
        sub     edi,16
        sub     ecx,16
        jae     next8
done8:
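The 32-bit listing stops with fewer than 16 bytes left and does not show the tail. A possible tail is sketched here, assuming only that the byte count is even. The labels tail4, last_word and tail4_done are hypothetical, and the code is not scheduled for pairing. At done8, ecx holds the remaining byte count minus 16, and edi still points at the last dword of the part of the destination not yet written.

```asm
        add     ecx,16          ; undo the last sub: ecx = bytes left (0 to 14, even)
tail4:
        cmp     ecx,4
        jb      last_word       ; fewer than two words left
        mov     eax,[esi]       ; two source words
        add     esi,4
        rol     eax,16          ; swap them inside the register
        mov     [edi],eax
        sub     edi,4
        sub     ecx,4
        jmp     tail4
last_word:
        test    ecx,ecx
        jz      tail4_done      ; count was divisible by 4
        mov     ax,[esi]        ; single leftover word (the last source word)
        mov     [edi+2],ax      ; goes into the first word of the destination
tail4_done:
```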
This version would also pair perfectly, using just 8 cycles for the 16-instruction unrolled loop, which would copy/reverse eight 16-bit words at a rate of 1 reversal/cycle.
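To see why rotating by 16 gives the right result, here is a two-register trace with made-up sample data (the word values are purely illustrative, and edi is assumed to point at the last dword of an 8-byte destination). x86 is little-endian, so the word at the lower address lands in the low half of the register.

```asm
; assume memory at [esi] holds the words 1111h, 2222h, 3333h, 4444h
        mov     eax,[esi]       ; eax = 22221111h (low word comes first in memory)
        mov     ebx,[esi+4]     ; ebx = 44443333h
        rol     eax,16          ; eax = 11112222h
        rol     ebx,16          ; ebx = 33334444h
        mov     [edi-4],ebx     ; stores the words 4444h, 3333h
        mov     [edi],eax       ; stores the words 2222h, 1111h
; destination now reads 4444h, 3333h, 2222h, 1111h
```

Writing the dwords in reverse order reverses the pairs, and the rotate reverses the two words within each pair, so together they reverse the whole sequence of 16-bit words.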
Gem writer: Terje Mathisen
last updated: 1998-03-16