Penalties on the 486

In most cases the 486 is free from flow-dependence penalties, which means that an instruction that uses the result of the previous instruction does not cause a slowdown:

        add     eax,ebx
        add     ecx,eax
takes two cycles. On a Pentium it also takes two cycles, but the
        add     eax,ebx
        add     ecx,edx
takes one cycle, because the second instruction does not use the result of the first, so the two can be paired. These situations are described quite well in the application note "Intel Architecture Optimization Manual" released by Intel. I just want to point out one interesting thing. Generally the 486 has two types of flow-dependence penalties, and both show up in this sequence:
        add     ecx,ebp
        adc     bl,dl
        mov     al,[ebx]
On the 486 the ADD takes one cycle and the ADC another, but the MOV takes three cycles even if the operand is already in the cache. Why? There is a double penalty: one clock for using a register right after it was modified (Address Generation Interlock, AGI), and another for using a register right after one of its subregisters was modified (Flow Break). So this innocent MOV instruction costs three cycles. I'm a smart coder, I'm gonna put an instruction between the ADC and the MOV, and the problem is solved! Really? The
        add     ecx,ebp
        adc     bl,dl
        sub     esi,ebp
        mov     al,[ebx]
sequence takes 5 clocks: the ADD, ADC and SUB take three, but the MOV takes two, because ONE cycle inserted BETWEEN the ADC and the MOV can save only ONE penalty, not TWO. So for a perfect ratio of one clock per instruction, at least TWO instructions have to be inserted. Alternatively, insert one two-cycle instruction like SHR, or even a prefixed instruction like ADD AX,BX in 32-bit code.
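As a sketch of the two-instruction fix, the sequence might look like the one below. The second filler (ADD EDI,EBP) is only an assumption for illustration; any useful one-cycle instruction that does not modify EBX or one of its subregisters should do. Going by the rule above, each of the five instructions should then cost one clock:
        add     ecx,ebp
        adc     bl,dl           ; modifies BL, a subregister of EBX
        sub     esi,ebp         ; filler 1: hides one penalty cycle
        add     edi,ebp         ; filler 2 (hypothetical): hides the other
        mov     al,[ebx]        ; expected to run without penalty now
The fillers are only free if they do work the routine needs anyway. If nothing useful can be moved between the ADC and the MOV, this version is no faster than the original three-instruction one.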
Gem writer: Ervin Toth
last updated: 1998-03-16