In most cases, the 486 is free from flow-dependence penalties, which means that an instruction which uses the result of the previous instruction will not cause a slowdown:
add eax,ebx
add ecx,eax
takes two cycles. On a Pentium it takes two cycles too, but
add eax,ebx
add ecx,edx
takes one cycle, because the second instruction does not use the result of the first, so the two can be 'pair'-ed. These situations are quite well described in the application note "Intel Architecture Optimization Manual" released by Intel. I just want to point to one interesting thing. Generally the 486 has two types of flow-dependence penalties:
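1. Using a full register immediately after one of its subregisters was modified (that is, using (E)AX, (E)BX, (E)CX, (E)DX after AL, BH etc. has been changed).
2. Using a register for addressing immediately after it was modified (note that LEA is an addressing instruction).
For example, how many cycles does the following code sequence eat (in protected mode, assuming 100% cache hit):
add ecx,ebp
adc bl,dl
mov al,[ebx]
On the 486 the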
ADD is one, the ADC is another one, but the MOV takes three cycles even if the operand is already in the cache. Why? There is a double penalty: one clock for using a register for addressing right after it was modified (Address Generation Interlock, AGI), and another cycle for using a register after its subregister was modified (Flow Break). So this innocent MOV instruction costs three cycles. I'm a smart coder, I'm gonna put an instruction between the ADC and the MOV, and the problem is solved! Really? The
add ecx,ebp
adc bl,dl
sub esi,ebp
mov al,[ebx]
sequence takes 5 clocks: the
ADD, ADC and SUB take three, but the MOV takes two, because ONE cycle inserted BETWEEN the ADC and the MOV can save only ONE penalty, not TWO. So for a perfect one-clock-per-instruction ratio, at least TWO instructions have to be inserted. Or one two-cycle instruction like SHR, or even a prefixed one like ADD AX,BX in 32-bit code.
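As a sketch of the two-instruction fix, the sequence below places two independent single-cycle instructions between the ADC and the MOV. The second filler (add edi,ebp) is a hypothetical instruction, chosen only because it does not touch EBX or any of its subregisters. The cycle counts in the comments follow the reasoning above and are not measured.

```asm
; NASM-style syntax, 32-bit protected mode, all data assumed to be in cache
add ecx,ebp     ; 1 cycle
adc bl,dl       ; 1 cycle, modifies a subregister of EBX
sub esi,ebp     ; 1 cycle, filler from the text: covers one penalty
add edi,ebp     ; 1 cycle, hypothetical second filler: covers the other penalty
mov al,[ebx]    ; 1 cycle expected, since both penalties have expired
```

If the reasoning holds, that is five instructions in five cycles. In a real routine the two fillers would be useful work moved from elsewhere in the loop. The only requirement is that neither of them writes EBX, BX, BL or BH.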