Fast strlen() (Assembler/80386)

Fast implementation of strlen()

Recently, someone wrote to me to point out that strlen() is a very commonly called function, and asked about possible performance improvements for it. At first, without thinking too hard about it, I didn't see any opportunity to fundamentally improve the algorithm. I was right about that, but at the level of low-level implementation scrutiny there is plenty of opportunity. The algorithm is a byte scan, and the typical thing a C version does wrong is issue a separate load for every byte, when a single 32-bit load can fetch four bytes at once.
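For reference, the byte scan under discussion can be sketched in C roughly as follows. The function name strlen_bytewise is hypothetical, and this only illustrates the shape of the loop; it is not any particular library's implementation.

```c
#include <stddef.h>

/* Hypothetical name: the straightforward byte-at-a-time scan. */
size_t strlen_bytewise(const char *s)
{
    const char *p = s;

    while (*p != '\0')      /* one memory load per byte examined */
        p++;

    return (size_t)(p - s); /* distance from start to terminator */
}
```

Each iteration loads a single byte, so a string of n characters costs n + 1 loads (the extra one is for the terminator).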
;
; fast strlen()
;
; input:
;   eax = offset to string
;
; output:
;   ecx = length
;
; destroys:
;   ebx
;   eflags
;

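        lea     ecx,[eax-1]     ; start one byte early; the inc below undoes it
l1:     inc     ecx             ; byte loop: runs until ecx is dword aligned
        test    ecx,3
        jz      l2              ; aligned: switch to the dword loop
        cmp     [byte ptr ecx],0
        jne     l1
        jmp     l6              ; terminator found before reaching alignment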
l2:     mov     ebx,[ecx]       ; U
        add     ecx,4           ;   V
        test    bl,bl           ; U
        jz      l5              ;   V
        test    bh,bh           ; U
        jz      l4              ;   V
        test    ebx,0ff0000h    ; U
        jz      l3              ;   V
        test    ebx,0ff000000h  ; U
        jnz     l2              ;   V +1brt
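        inc     ecx             ; fall through: byte 3 is the terminator
l3:     inc     ecx             ; entered here when byte 2 is the terminator
l4:     inc     ecx             ; entered here when byte 1 is the terminator
l5:     sub     ecx,4           ; undo the add; ecx now points at the terminator
l6:     sub     ecx,eax         ; length = terminator address - string start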
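Here, I've sacrificed size for performance by essentially unrolling the loop four times: each pass performs one 32-bit load and then tests the four loaded bytes in registers. The U and V comments mark the Pentium pipe in which each instruction is expected to issue. If the input strings are fairly long (which is when performance will matter), then on a Pentium the asm code will execute at a rate of 1.5 clocks per byte, while the C compiler's code takes 3 clocks per byte.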
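The same idea can be sketched in C, as a rough rendering of the assembly routine rather than a replacement for it. The name strlen_dword is hypothetical. The sketch assumes a little-endian machine (as on x86). Like the assembly, it may read up to three bytes past the terminator within the same aligned dword. That is harmless on x86, because an aligned dword never crosses a page boundary, but it is outside what ISO C guarantees.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name: align first, then examine four bytes per load. */
size_t strlen_dword(const char *s)
{
    const char *p = s;

    /* Prologue: scan single bytes until p is 4-byte aligned. */
    while (((uintptr_t)p & 3) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Main loop: one 32-bit load, then four tests on the loaded value. */
    for (;;) {
        uint32_t w;

        /* Aligned 4-byte load; may touch up to 3 bytes after the terminator. */
        memcpy(&w, p, sizeof w);

        /* Assumption: little-endian, so the lowest address is the low byte. */
        if ((w & 0x000000FFu) == 0) return (size_t)(p - s);
        if ((w & 0x0000FF00u) == 0) return (size_t)(p - s) + 1;
        if ((w & 0x00FF0000u) == 0) return (size_t)(p - s) + 2;
        if ((w & 0xFF000000u) == 0) return (size_t)(p - s) + 3;

        p += 4;
    }
}
```

Whether a compiler turns this into code as tight as the hand-scheduled loop is a separate question; the sketch is only meant to show the align-then-load-four-bytes structure.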
Note: The assembly routine can be used on earlier CPUs (below the 80386) too; just use 16-bit registers. But then it may not be the fastest implementation.
Gem writer: Paul Hsieh
last updated: 1998-06-07