Performance monitoring
Assembler/Pentium

This gem presents a small yet effective method of measuring the number of cycles needed to execute a piece of code, using the RDTSC instruction.

Beginning with the Pentium processor, it is possible to access the time-stamp counter, a 64-bit MSR (model-specific register) that is incremented every clock cycle and set to zero on reset. The counter is read with the RDTSC instruction (read time-stamp counter), which returns the low 32 bits of the count in EAX and the high 32 bits in EDX. RDTSC returns the number of cycles elapsed, not the time taken to execute them. To convert cycles to time use this formula (frequency given in Hz):

        time = cycles / frequency
Since the low 32 bits of the counter wrap around quickly (after about 21 seconds at 200 MHz), the package returns the full 64-bit count.
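As an illustration of the formula, here is a minimal sketch in C (not part of the package) that joins the two register halves and converts the result to seconds. The function names tsc_combine and cycles_to_seconds and the 200 MHz clock used in the note below are assumptions made for the example.

```c
#include <stdint.h>

/* Join the two halves returned by RDTSC (EDX = high, EAX = low). */
static uint64_t tsc_combine(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | low;
}

/* time = cycles / frequency, with the frequency given in Hz. */
static double cycles_to_seconds(uint64_t cycles, double frequency_hz)
{
    return (double)cycles / frequency_hz;
}
```

With an assumed frequency of 200e6 Hz, cycles_to_seconds(tsc_combine(0, 1000000), 200e6) is 0.005 seconds, and cycles_to_seconds(tsc_combine(1, 0), 200e6) is about 21.47 seconds, the time it takes the low 32 bits alone to wrap around.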
The Pentium Pro and Pentium II processors support out-of-order execution: instructions may be executed in a different order than the one in which you wrote them. This can be a source of measurement errors if not taken care of.
To prevent this, the programmer must serialize the instruction stream by inserting a serializing instruction such as CPUID before the RDTSC instruction. There is, however, a problem with this: the CPUID instruction itself takes some time to execute. The solution is to measure the execution time of CPUID and subtract it from the cycle count returned by RDTSC.
A strange thing about CPUID is that it may take longer to execute the first couple of times it is called. The best thing to do is to call the instruction three times and measure the third call. This is what the code below does.
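The bookkeeping described here amounts to a 64-bit subtraction followed by removal of the calibrated CPUID cost. The following is a minimal sketch in C, not the package itself; the names cpuid_overhead and net_cycles and all sample numbers are assumptions for illustration, and the counts are presumed to come from RDTSC readings taken elsewhere.

```c
#include <stdint.h>
#include <stddef.h>

/* Pick the calibration value: CPUID may be slow the first couple of
   times it runs, so only the last sample (the third of three) is kept. */
static uint32_t cpuid_overhead(const uint32_t *samples, size_t count)
{
    return count ? samples[count - 1] : 0;
}

/* Cycles spent in the measured code: stop - start - CPUID overhead.
   Unsigned 64-bit arithmetic handles the borrow between the halves. */
static uint64_t net_cycles(uint64_t start, uint64_t stop, uint32_t overhead)
{
    return stop - start - overhead;
}
```

For example, with the hypothetical calibration samples {310, 215, 213}, a start count of 0xFFFFFF00 and a stop count of 0x1000003D5, the raw difference is 1237 cycles and the net count is 1237 - 213 = 1024. The package does the same work with a SUB/SBB pair on the two 32-bit halves.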

How to use the package

The package must be included in your code segment. Note that the data at the end of the package must be placed in a data segment.
Then, if your CPU is a Pentium Pro or a Pentium II, PProPII must be defined (the package must then serialize execution to keep RDTSC from being executed out of order).
The package must be initialized by:

        call    monitor_init
Then you call the macros like this:
;
; some other piece of code

        time_start

; the code you may want to measure
;                .
;                .
;                .

        time_stop
        mov     [mycountlow],eax
        mov     [mycounthigh],edx
This was a simple example of the package. The above example does not compensate for cache effects (code or data not being in the cache). If cache effects are not wanted, you must "pretouch" the data, simply by reading it. Then run the measurement several times to take care of the code cache:
;
; some other piece of code

        mov     ecx,4           ; execute test code 4 times
meassureloop:
        push    ecx

        time_start

; the code you may want to measure
;                .
;                .
;                .

        time_stop
        pop     ecx
        mov     [mycount_low+ecx*4-4],eax   ; ecx = 4..1, so index 3..0
        mov     [mycount_high+ecx*4-4],edx
        dec     ecx
        jnz     meassureloop
Note: The mycount variables must (in this example) be arrays of four doublewords each.
Also note that data used by a section of code should be placed together to minimize cache effects.
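Once the loop has stored its counts, one of them has to be chosen as the result. The gem does not say which; a common choice, assumed here, is the smallest count, since that run most likely had code and data in the cache. The sketch below is in C, and the function name best_cycles and the sample values in the note are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Combine the stored low/high halves of each run and return the
   smallest 64-bit count. Returns UINT64_MAX if runs is zero. */
static uint64_t best_cycles(const uint32_t *low, const uint32_t *high,
                            size_t runs)
{
    uint64_t best = UINT64_MAX;

    for (size_t i = 0; i < runs; i++) {
        uint64_t cycles = ((uint64_t)high[i] << 32) | low[i];
        if (cycles < best)
            best = cycles;
    }
    return best;
}
```

With low halves {1510, 1032, 1024, 1027} and all high halves zero, best_cycles returns 1024; the 1510 would be a cold-cache run.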

Performance monitoring package

The package is supposed to run under plain DOS (no EMM or similar), since memory managers may interrupt the process. Also, RDTSC can be made a privileged instruction: when the TSD flag in CR4 is set, it cannot be executed at CPL 3. This is not really a problem, since a real monitoring session should be performed in an environment where the program isn't interrupted; an interruption would distort the cycle count. It is also wise to insert a CLI right before time_start to disable maskable interrupts, and an STI after time_stop to enable them again.
Here is the actual monitoring package:

;
; Performance monitoring package
;
; define PProPII if your CPU is a Pentium Pro or a Pentium II
;
; implements:
;
;   monitor_init
;     initializes the package
;
;   time_start
;     start cycle count here
;
;   time_stop
;     stop counting here
;
; note:
;   the package cannot do nested measurements, since the macros
;   keep the start count in a single variable
;

;
; define cpuid and rdtsc instructions via macros
; this is not necessary if your assembler supports them
;
MACRO   cpuid
        db      0fh,0a2h
ENDM

MACRO   rdtsc
        db      0fh,031h
ENDM


;
; monitor_init:
;
; input:
;   nothing
;
; output:
;   cpuid_cycle = initialized to execution time of cpuid
;
; destroys:
;   nothing
;

monitor_init:

IFDEF PProPII
        pushfd
        pushad

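        mov     esi,3           ; loop counter (cpuid destroys ecx)
getcpuidtime:
        cpuid
        rdtsc
        mov     [cycle],eax
        cpuid
        rdtsc
        sub     eax,[cycle]
        mov     [cpuid_cycle],eax   ; the third pass overwrites the first two
        dec     esi
        jnz     getcpuidtime

        popad
        popfd                   ; must match pushfd above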
ENDIF
        ret

;
; time_start - start timing point here
;
; input:
;   none
;
; output:
;   time_cycles initialized
;
; destroys:
;   eax, ebx, ecx, edx
;   eflags
;

MACRO   time_start
IFDEF PProPII
        cpuid
ENDIF
        rdtsc
        mov     [time_cycles],eax
        mov     [time_cycles+4],edx
ENDM

;
; time_stop - stop timing point here
;
; input:
;   none
;
; output:
;   eax = low cycle count
;   edx = high cycle count
;
; destroys:
;   eax, ebx, ecx, edx
;   eflags
;

MACRO   time_stop
IFDEF PProPII
        cpuid
ENDIF
        rdtsc
        sub     eax,[time_cycles]
        sbb     edx,[time_cycles+4]

IFDEF PProPII
        sub     eax,[cpuid_cycle]
        sbb     edx,0
ENDIF

ENDM

;
; place the following data in your data segment
;

time_cycles     dq      ?
cycle           dd      ?
cpuid_cycle     dd      ?
Gem writer: John Eckerdal
last updated: 1998-06-06