Copying data using the FPU
ASM/Pentium+FPU

This gem shows you how to use the FPU to copy data between memory locations. The following loop can be used for block memory copying. I don't know who originally developed this kind of loop, but it has been presented in various documents. This version comes from Agner Fog's excellent Pentium optimization manual.

;
; copying data using the fpu
;
; input:
;   esi = source
;   edi = destination
;   ecx = number of 16-byte chunks to move
;
; output:
;   none (data from esi is copied to edi)
;
; destroys:
;   esi, edi, ecx
;   flags, fp flags
;

topofloop:
	fild    qword ptr [esi]
	fild    qword ptr [esi+8]
	fxch
	fistp   qword ptr [edi]
	fistp   qword ptr [edi+8]
	add     esi,16
	add     edi,16
	dec     ecx
	jnz     topofloop

The loop is optimal on (a fast) Pentium when both the source and destination are aligned on 64-bit boundaries and the destination is not in the cache. (Additionally the loop can be optimal on PPro if the destination does not permit write-combining.)
If the destination is in the cache (or the destination memory permits write combining on PPro) then REP MOVSD will be faster.
The loop is faster than REP MOVSD because each FISTP writes 64 bits at a time, so it does half as many writes to external memory (with the noted exceptions). External memory is usually very slow compared to the execution time of the loop. Consequently, after a few iterations the write buffers of the CPU fill up and subsequent iterations execute at the speed of external memory. For small memory blocks you should use a simple DWORD copy loop, because the overhead of the FPU copy loop is much higher than that of most other memory copy loops.
You might think that you should use FLD/FSTP instead of FILD/FISTP. Unfortunately FLD/FSTP would not work very well, because not all 64-bit values are normal floating-point values, and the handling of denormal floating-point numbers is very slow.
But it gets even worse. Denormals (see notes) only make FLD/FSTP copying slow; the copy is still functionally correct. If the data happens to represent an SNAN (see notes), however, it will be quietly converted to a QNAN (see notes) if IE is masked (CW.IM = 1), so the copy no longer matches the source, or you will get an exception if IE is unmasked (CW.IM = 0).
Therefore one should really forget about trying to use FLD/FSTP for memory copy loops.

For related information, see Agner Fog's Pentium optimization manual (you can find it at http://www.agner.org/assem) and Intel's Pentium Pro developer's manual, volume 3, for information on write buffers, caches, write-combining, etc. (it can be found at Intel's developer WWW site).

notes:
SNANs are all the numbers where bits <62:52> = 7FFh, bit <51> = 0 and bits <50:0> != 0. An SNAN is converted to a QNAN by setting bit <51>.
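
The SNAN definition above translates directly into a bit test on the raw 64-bit data. What follows is only a minimal C sketch, not part of the original gem; the names is_snan64 and quiet_nan64 are hypothetical, and the value is treated as an untyped 64-bit pattern, just as a copy loop would see it.

```c
#include <stdint.h>

/* Hypothetical helper: returns nonzero if the raw 64-bit pattern
   would be interpreted as a double-precision SNAN by FLD. */
int is_snan64(uint64_t bits)
{
    uint64_t exponent = (bits >> 52) & 0x7FF;           /* bits <62:52> */
    uint64_t quiet    = (bits >> 51) & 1;               /* bit  <51>    */
    uint64_t payload  = bits & 0x0007FFFFFFFFFFFFULL;   /* bits <50:0>  */

    return exponent == 0x7FF && quiet == 0 && payload != 0;
}

/* Hypothetical helper: the SNAN -> QNAN conversion described above,
   i.e. the same pattern with bit <51> set. */
uint64_t quiet_nan64(uint64_t bits)
{
    return bits | (1ULL << 51);
}
```

For example, the pattern 7FF0000000000001h is an SNAN, and quieting it yields 7FF8000000000001h, which is no longer the value that was in the source buffer.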

Denormals are numbers whose exponent field has all bits set to 0 and whose mantissa is non-zero. In terms of the copy process, this means that bits <62:52> (the exponent field) of an aligned 64-bit entity are zero while bits <51:0> are not all zero.
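
The denormal condition can be checked the same way. The following is a minimal C sketch under the same assumptions as the definition above (raw 64-bit data read as a double); the name is_denormal64 is hypothetical and not part of the original gem.

```c
#include <stdint.h>

/* Hypothetical helper: returns nonzero if the raw 64-bit pattern
   would be interpreted as a double-precision denormal by FLD. */
int is_denormal64(uint64_t bits)
{
    uint64_t exponent = (bits >> 52) & 0x7FF;           /* bits <62:52> */
    uint64_t mantissa = bits & 0x000FFFFFFFFFFFFFULL;   /* bits <51:0>  */

    return exponent == 0 && mantissa != 0;
}
```

Note that any non-zero 64-bit integer smaller than 2^52 (for example the value 1) has this form, so ordinary integer data can easily contain denormal patterns.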

Gem writer: (code) Agner Fog
(text) Vesa Karvonen
(comments) Norbert Juffa
last updated: 1998-03-16