In most cases, the 486 is free from flow-dependence penalties, which means that an instruction that uses the result of the previous instruction will not cause a slowdown:

        add     eax,ebx
        add     ecx,eax

takes two cycles. On a Pentium it also takes two cycles, but the

        add     eax,ebx
        add     ecx,edx

takes one cycle, because the second instruction does not use the result of the first, so the two can be paired. These situations are described quite well in the application note "Intel Architecture Optimization Manual" released by Intel. I just want to point out one interesting thing. Generally, the 486 has two types of flow-dependence penalties:

Immediately using a register after its 8-bit subregister was modified (this applies to (E)AX, (E)BX, (E)CX and (E)DX after AL, BH, etc. have been changed).
Using a register in addressing immediately after it was modified (this is valid for all registers, and beware: LEA is an addressing instruction).

For example, how many cycles does the following code sequence take (in protected mode, assuming a 100% cache hit rate)?

        add     ecx,ebp
        adc     bl,dl
        mov     al,[ebx]

On the 486 the ADD takes one cycle and the ADC another, but the MOV takes three cycles even though the operand is already in the cache. Why? There is a double penalty: one clock for using a register in addressing right after it was modified (Address Generation Interlock, AGI), and another for using a register right after its subregister was modified (Flow Break). So this innocent MOV instruction costs three cycles. I'm a smart coder, so I'll put an instruction between the ADC and the MOV, and the problem is solved! Really? The

        add     ecx,ebp
        adc     bl,dl
        sub     esi,ebp
        mov     al,[ebx]

sequence still takes 5 clocks: the ADD, ADC and SUB take three, but the MOV takes two, because ONE cycle inserted BETWEEN the ADC and the MOV can hide only ONE penalty, not TWO. So for a perfect ratio of one clock per instruction, at least TWO instructions have to be inserted, or one two-cycle instruction such as SHR, or even a prefixed instruction such as ADD AX,BX in 32-bit code (its operand-size prefix costs an extra clock).
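To make the clock arithmetic above easy to check, here is a toy Python model that applies the two penalty rules exactly as this gem states them. Using a register for addressing right after it was written costs one clock (AGI). Using a full register right after its 8-bit part was written costs one clock (flow break). Each clock spent between the producing and the consuming instruction hides one penalty. This is a sketch, not a real 486 timing simulator. The `Insn` record, the base clock counts (one per instruction, two for SHR as the text says) and the filler instructions `inc edi` and `shr edi,2` are assumptions chosen for illustration. Only 32-bit register names are tracked, and 8-bit reads and flags are ignored because neither rule applies to them.

```python
from dataclasses import dataclass

F = frozenset  # shorthand for the register sets below


@dataclass(frozen=True)
class Insn:
    text: str                    # assembly text, for display only
    base: int = 1                # assumed base clocks with a cache hit
    writes: frozenset = F()      # 32-bit registers written as a whole
    sub_writes: frozenset = F()  # 32-bit registers whose 8-bit part is written (BL -> "ebx")
    reads: frozenset = F()       # 32-bit registers read as data
    addr: frozenset = F()        # registers used to form a memory address


def cycles(seq):
    """Per-instruction clocks under the gem's two-penalty rule of thumb (toy model)."""
    result = []
    for i, ins in enumerate(seq):
        stall = 0
        for reg in ins.reads | ins.addr:
            # Walk back to the most recent instruction that wrote reg (wholly or partly).
            for j in range(i - 1, -1, -1):
                prev = seq[j]
                if reg in prev.writes or reg in prev.sub_writes:
                    # AGI if reg forms an address, flow break if only its low byte was written.
                    penalties = (reg in ins.addr) + (reg in prev.sub_writes)
                    # Clocks spent between producer and consumer; each one hides one penalty.
                    gap = sum(p.base for p in seq[j + 1:i])
                    stall = max(stall, penalties - gap)
                    break
        result.append(ins.base + stall)
    return result


ADD = Insn("add ecx,ebp", writes=F({"ecx"}), reads=F({"ecx", "ebp"}))
ADC = Insn("adc bl,dl", sub_writes=F({"ebx"}))                       # writes BL, the low byte of EBX
MOV = Insn("mov al,[ebx]", sub_writes=F({"eax"}), addr=F({"ebx"}))   # EBX used for addressing
SUB = Insn("sub esi,ebp", writes=F({"esi"}), reads=F({"esi", "ebp"}))
INC = Insn("inc edi", writes=F({"edi"}), reads=F({"edi"}))           # hypothetical one-clock filler
SHR = Insn("shr edi,2", base=2, writes=F({"edi"}), reads=F({"edi"}))  # hypothetical two-clock filler

if __name__ == "__main__":
    for seq in ([ADD, ADC, MOV], [ADD, ADC, SUB, MOV],
                [ADD, ADC, SUB, INC, MOV], [ADD, ADC, SHR, MOV]):
        counts = cycles(seq)
        print(" / ".join(ins.text for ins in seq), "->", counts, "total", sum(counts))
```

Under this model the MOV costs 3, 2, 1 and 1 clocks in the four sequences. Every sequence totals 5 clocks, but the last two spend those clocks on useful instructions instead of stalls, which is the point of inserting work between the ADC and the MOV.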

Gem writer: Ervin Toth
last updated: 1998-03-16