Copying data using the FPU | ASM/Pentium+FPU |
;
; copying data using the fpu
;
; input:
; esi = source
; edi = destination
; ecx = number of 16-byte chunks to move (must be at least 1)
;
; output:
; none (data from esi is copied to edi)
;
; destroys:
; esi, edi, ecx
; flags, fp flags
;
topofloop:
fild qword ptr [esi]
fild qword ptr [esi+8]
fxch
fistp qword ptr [edi]
fistp qword ptr [edi+8]
add esi,16
add edi,16
dec ecx
jnz topofloop
The loop is optimal on (a fast) Pentium when both the source and destination are aligned on 64-bit boundaries and the destination is not in the cache. (Additionally the loop can be optimal on PPro if the destination does not permit write-combining.)
If the destination is in the cache (or the destination memory permits write-combining on PPro), then REP MOVSD will be faster.
The loop is faster than REP MOVSD because it stores 64 bits at a time instead of 32, and so does half as many writes to external memory (with the noted exceptions). External memory is usually very slow compared to the execution time of the loop. Consequently, after a few iterations the write buffers of the CPU fill up, and subsequent iterations execute at the speed of external memory. For small memory blocks you should use a simple DWORD copy loop, because the overhead of the FPU copy loop is much higher than that of most other memory copy loops.
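When testing the routine, it can help to have a plain C model of what the loop is supposed to produce. The following is a minimal sketch and not part of the original tip: copy_chunks16_ref and is_aligned8 are hypothetical names, and the model says nothing about speed. Unlike the assembly loop, which decrements ecx only after the first pass, the model tolerates a chunk count of zero.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reference model of the FPU loop: copies `chunks` 16-byte
   chunks, each as two 64-bit integers. Only the result is modelled. */
static void copy_chunks16_ref(void *dst, const void *src, size_t chunks)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (chunks--) {
        uint64_t lo, hi;
        memcpy(&lo, s, 8);       /* fild  qword ptr [esi]   */
        memcpy(&hi, s + 8, 8);   /* fild  qword ptr [esi+8] */
        memcpy(d, &lo, 8);       /* fistp qword ptr [edi]   */
        memcpy(d + 8, &hi, 8);   /* fistp qword ptr [edi+8] */
        s += 16;                 /* add esi,16 */
        d += 16;                 /* add edi,16 */
    }
}

/* Nonzero when p lies on a 64-bit boundary, the alignment the text
   asks for on both source and destination. */
static int is_aligned8(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}
```

A test harness could fill a source buffer with arbitrary bytes, run both the assembly routine and copy_chunks16_ref, and compare the two destinations with memcmp.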
You might think that you should use FLD/FSTP instead of FILD/FISTP. Unfortunately FLD/FSTP would not work very well, because not all 64-bit values are normal floating point values, and the handling of denormal floating point numbers is very slow.
It gets even worse. Denormals (see notes) make FLD/FSTP copying slow, but the copy is still functionally correct. However, if the data represents an SNAN (see notes), it is quietly converted to a QNAN (see notes) if the invalid-operation exception (IE) is masked (CW.IM = 1), or you get an exception if it is unmasked (CW.IM = 0).
Therefore one should really forget about trying to use FLD/FSTP for memory copy loops.
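To make the corruption concrete, the masked-exception case can be modelled on the raw bit pattern. This is only a sketch under the bit layout given in the notes: fld_fstp_copy_model and qword_is_snan are hypothetical names, and the model covers just the IE-masked case, not the exception path or the timing cost of denormals.

```c
#include <stdint.h>

#define EXP_MASK  0x7FF0000000000000ULL  /* bits <62:52>, exponent field */
#define QUIET_BIT 0x0008000000000000ULL  /* bit  <51>                    */
#define LOW_MASK  0x0007FFFFFFFFFFFFULL  /* bits <50:0>                  */

/* Nonzero if FLD qword would see this pattern as a signalling NaN. */
static int qword_is_snan(uint64_t q)
{
    return (q & EXP_MASK) == EXP_MASK
        && (q & QUIET_BIT) == 0
        && (q & LOW_MASK) != 0;
}

/* Hypothetical model of what an FLD/FSTP copy stores when IE is masked:
   every pattern comes out unchanged except an SNAN, which gets bit 51 set. */
static uint64_t fld_fstp_copy_model(uint64_t q)
{
    return qword_is_snan(q) ? (q | QUIET_BIT) : q;
}
```

For instance, the qword 7FF0000000000001h would arrive at the destination as 7FF8000000000001h under this model, so the copy no longer matches the source.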
For related information see Agner Fog's Pentium optimization manual (you can find it at http://www.agner.org/assem) and Intel's Pentium Pro developer's manual, volume 3, for information on write buffers, caches, write-combining etc. (it can be found at Intel's developer WWW site).
notes:
SNANs are all the numbers where bits <62:52> = 7FFh, bit <51> = 0 and bits <50:0> != 0. An SNAN is converted to a QNAN by setting bit <51>.
Denormals are numbers whose exponent field has all bits set to 0 and whose mantissa is non-zero. In terms of the copy process: bits <62:52> (the exponent field) of an aligned 64-bit entity are all zero while bits <51:0> are not.
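The two definitions above translate directly into bit tests on each aligned 64-bit entity. The sketch below is one way to write them; classify_qword, count_class and the enum names are hypothetical, chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

enum qword_class { QW_ORDINARY, QW_DENORMAL, QW_SNAN };

/* Classify a raw 64-bit pattern according to the notes above. */
static enum qword_class classify_qword(uint64_t q)
{
    uint64_t exponent = (q >> 52) & 0x7FF;           /* bits <62:52> */
    uint64_t mantissa = q & 0x000FFFFFFFFFFFFFULL;   /* bits <51:0>  */

    if (exponent == 0 && mantissa != 0)
        return QW_DENORMAL;                          /* slow under FLD/FSTP */
    if (exponent == 0x7FF && mantissa != 0 && (mantissa >> 51) == 0)
        return QW_SNAN;                              /* altered or traps under FLD/FSTP */
    return QW_ORDINARY;
}

/* Count how many of the n qwords in data fall into class c. */
static size_t count_class(const uint64_t *data, size_t n, enum qword_class c)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (classify_qword(data[i]) == c)
            hits++;
    return hits;
}
```

Running count_class over a typical buffer of integer or pixel data gives a feel for how often such patterns turn up. Small 64-bit integers, for example, all have a zero exponent field and therefore classify as denormals.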