GSoC status update #2 (memcmp)

14-06-2024

What is memcmp even?

The man page on FreeBSD has the following to say:

MEMCMP(3)              FreeBSD Library Functions Manual              MEMCMP(3)

NAME
     memcmp – compare byte string

LIBRARY
     Standard C Library (libc, -lc)

SYNOPSIS
     #include <string.h>

     int
     memcmp(const void *b1, const void *b2, size_t len);

DESCRIPTION
     The memcmp() function compares byte string b1 against byte string b2.
     Both strings are assumed to be len bytes long.

RETURN VALUES
     The memcmp() function returns zero if the two strings are identical,
     otherwise returns the difference between the first two differing bytes
     (treated as unsigned char values, so that ‘\200’ is greater than ‘\0’,
     for example).  Zero-length strings are always identical.

SEE ALSO
     bcmp(3), strcasecmp(3), strcmp(3), strcoll(3), strxfrm(3),
     timingsafe_memcmp(3), wmemcmp(3)

STANDARDS
     The memcmp() function conforms to ISO/IEC 9899:1990 (“ISO C90”).

FreeBSD 15.0-CURRENT            August 15, 2016           FreeBSD 15.0-CURRENT

Now when starting off coding anything I like to begin with a minimal example, a PoC (proof of concept). So for these string functions I first make a small wrapper program that calls either the libc function or my assembly function, then compare the outputs to get some quick turnaround time before integrating the routine into libc and running the proper unit tests.
So I did just that, and to my surprise libc wasn’t returning the expected values.
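For the curious, such a harness can be as small as the following sketch, where my_memcmp is just a placeholder name for the assembly routine under test (linked in separately); it is not part of the final libc code.

#include <stdio.h>
#include <string.h>

/* placeholder declaration for the assembly routine under test */
int my_memcmp(const void *b1, const void *b2, size_t len);

int main() {
    const char *a = "hello world";
    const char *b = "hello worlE";
    size_t len = strlen(a) + 1;

    int want = memcmp(a, b, len);       /* libc reference */
    int got  = my_memcmp(a, b, len);    /* routine under test */

    printf("libc: %i, mine: %i -> %s\n", want, got,
        want == got ? "OK" : "MISMATCH");
    return want == got ? 0 : 1;
}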

Here be dragons 🐉

So when I run my little wrapper program with the following:

#include <stdio.h>
#include <string.h>

int main() {
    printf("%i\n", memcmp("\200", "\0", 1));
}

It returns 1. That’s odd! However, ISO/IEC 9899:1990 only specifies that a number greater than, equal to, or less than zero shall be returned, so going by the standard it’s a valid return value. But it doesn’t match up with what the FreeBSD man page states.
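For reference, the byte-difference behaviour described in the FreeBSD man page is what a plain scalar implementation naturally produces; roughly like this sketch (not the actual libc source):

#include <stddef.h>

int memcmp_ref(const void *b1, const void *b2, size_t len) {
    const unsigned char *p1 = b1, *p2 = b2;

    for (; len != 0; len--, p1++, p2++) {
        if (*p1 != *p2)
            return *p1 - *p2;   /* difference of the first differing bytes */
    }
    return 0;
}

With these semantics memcmp_ref("\200", "\0", 1) evaluates to 0x80 - 0x00 = 128.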

But if we edit our program a bit like the following we notice something interesting.

#include <stdio.h>
#include <string.h>

int main() {
  printf("%i\n", memcmp((char[] ) {"\200"}, "\0", 1));
}

It returns 128, as expected!

The reason for this is that on Aarch64 we have inherited string functions from the Arm Optimized Routines. These functions weren’t originally written with FreeBSD in mind, and this has simply gone unnoticed. If we instead glance over at the man page for memcmp in glibc, we see the following regarding its return value:

RETURN VALUE 

    The memcmp() function returns an integer less than, equal to, or greater than
    zero if the first n bytes of s1 is found, respectively, to be less than, to
    match, or be greater than the first n bytes of s2.

    For a nonzero return value, the sign is determined by the sign of the difference
    between the first pair of bytes (interpreted as unsigned char) that differ in s1
    and s2.

    If n is zero, the return value is zero.

I strongly suspect that nobody is checking the return value beyond <, =, or >, except for the unit tests… Which do fail on FreeBSD!

I generated this result page with the kyua command kyua report-html -r ${TEST}.db; everything you need to know about the FreeBSD test suite is available on the FreeBSD wiki.

Finally, these inherited Arm Optimized Routines are located in /usr/src/contrib/arm-optimized-routines and plugged in to the string functions Makefile.inc like the following:

AARCH64_STRING_FUNCS= \
    memchr \
    memcmp \
    memcpy \
    memmove \
    memrchr \
    memset \
    stpcpy \
    strchr \
    strchrnul \
    strcmp \
    strcpy \
    strlen \
    strncmp \
    strnlen \
    strrchr

#
# Add the above functions. Generate an asm file that includes the needed
# Arm Optimized Routines file defining the function name to the libc name.
# Some file need multiple macros defined or a weak symbol added we can
# override the generated file in these cases.
#
.for FUNC in ${AARCH64_STRING_FUNCS}
.if !exists(${FUNC}.S)
${FUNC}.S:
    printf '/* %sgenerated by libc/aarch64/string/Makefile.inc */\n' @ > ${.TARGET}
    printf '#define __%s_aarch64 %s\n' ${FUNC} ${FUNC} >> ${.TARGET}
    printf '#include "aarch64/%s.S"\n' ${FUNC} >> ${.TARGET}
CLEANFILES+=    ${FUNC}.S
.endif

MDSRCS+=    ${FUNC}.S
CFLAGS.${FUNC}.S+=-I${SRCTOP}/contrib/arm-optimized-routines/string
.endfor

There we could swap out the faulty one for a compliant compiler-generated implementation, but let’s get going with porting ours!

@@ -15,11 +15,12 @@ AARCH64_STRING_FUNCS= \
        strchrnul \
        strcmp \
        strcpy \
-       memcmp \
        strncmp \
        strnlen \
        strrchr

+MDSRCS+= \
+       memcmp.S
 #
 # Add the above functions. Generate an asm file that includes the needed
 # Arm Optimized Routines file defining the function name to the libc name.

These functions already have Arm NEON enhancements from the Arm Optimized Routines (an x marks a NEON implementation, SWAR a scalar SWAR one):

    memchr             |   x
    memcmp             |   x
    memcpy             |   SWAR
    memmove            |   SWAR
    memrchr            |   x
    memset             |   x
    stpcpy             |   x
    strchr             |   x
    strchrnul          |   x
    strcmp             |   SWAR
    strcpy             |   x
    strlen             |   x
    strncmp            |   SWAR
    strnlen            |   x
    strrchr            |   x

Now I won’t be touching memcpy, as its implementation could benefit from another design [1]. If we ignore the functions with pre-existing implementations by Arm, we are left with the following list. But let’s finish off memcmp first, since we already got started on it, and maybe we can beat Arm at their own game.

           FUNCTION          AMD64
           bcmp              S1
           index             S1
           memccpy           S1
           rindex            S1
           stpncpy           S1
           strcat            S1
           strcmp            S1
           strcspn           S2
           strlcat           S1
           strlcpy           S1
           strncat           S1
           strncmp           S1
           strncpy           S1
           strpbrk           S2
           strsep            S2
           strspn            S2
           timingsafe_bcmp   S1

Scalar compiler output

After the above small hiccup I took a look at the .c code for memcmp and its generated assembly. Nothing particularly notable here.

memcmp assembly listing

amd64 SIMD implementation

Now this is a long one

ARCHENTRY(memcmp, baseline)
    cmp     $32, %rdx       # enough to permit use of the long kernel?
    ja      .Llong

    test        %rdx, %rdx      # zero bytes buffer?
    je      .L0

    /*
     * Compare strings of 1--32 bytes.  We want to do this by
     * loading into two xmm registers and then comparing.  To avoid
     * crossing into unmapped pages, we either load 32 bytes from
     * the start of the buffer or 32 bytes before its end, depending
     * on whether there is a page boundary between the overread area
     * or not.
     */

    /* check for page boundaries overreads */
    lea     31(%rdi), %eax      # end of overread
    lea     31(%rsi), %r8d
    lea     -1(%rdi, %rdx, 1), %ecx # last character in buffer
    lea     -1(%rsi, %rdx, 1), %r9d
    xor     %ecx, %eax
    xor     %r9d, %r8d
    test        $PAGE_SIZE, %eax    # are they on different pages?
    jz      0f

    /* fix up rdi */
    movdqu      -32(%rdi, %rdx, 1), %xmm0
    movdqu      -16(%rdi, %rdx, 1), %xmm1
    lea     -8(%rsp), %rdi      # end of replacement buffer
    sub     %rdx, %rdi      # start of replacement buffer
    movdqa      %xmm0, -40(%rsp)    # copy to replacement buffer
    movdqa      %xmm1, -24(%rsp)

0:  test        $PAGE_SIZE, %r8d
    jz      0f

    /* fix up rsi */
    movdqu      -32(%rsi, %rdx, 1), %xmm0
    movdqu      -16(%rsi, %rdx, 1), %xmm1
    lea     -40(%rsp), %rsi     # end of replacement buffer
    sub     %rdx, %rsi      # start of replacement buffer
    movdqa      %xmm0, -72(%rsp)    # copy to replacement buffer
    movdqa      %xmm1, -56(%rsp)

    /* load data and compare properly */
0:  movdqu      16(%rdi), %xmm1
    movdqu      16(%rsi), %xmm3
    movdqu      (%rdi), %xmm0
    movdqu      (%rsi), %xmm2
    mov     %edx, %ecx
    mov     $-1, %edx
    shl     %cl, %rdx       # ones where the buffer is not
    pcmpeqb     %xmm3, %xmm1
    pcmpeqb     %xmm2, %xmm0
    pmovmskb    %xmm1, %ecx
    pmovmskb    %xmm0, %eax
    shl     $16, %ecx
    or      %ecx, %eax      # ones where the buffers match
    or      %edx, %eax      # including where the buffer is not
    not     %eax            # ones where there is a mismatch
#ifndef BCMP
    bsf     %eax, %edx      # location of the first mismatch
    cmovz       %eax, %edx      # including if there is no mismatch
    movzbl      (%rdi, %rdx, 1), %eax   # mismatching bytes
    movzbl      (%rsi, %rdx, 1), %edx
    sub     %edx, %eax
#endif
    ret

    /* empty input */
.L0:    xor     %eax, %eax
    ret

    /* compare 33+ bytes */
    ALIGN_TEXT
.Llong: movdqu      (%rdi), %xmm0       # load head
    movdqu      (%rsi), %xmm2
    mov     %rdi, %rcx
    sub     %rdi, %rsi      # express rsi as distance from rdi
    and     $~0xf, %rdi     # align rdi to 16 bytes
    movdqu      16(%rsi, %rdi, 1), %xmm1
    pcmpeqb     16(%rdi), %xmm1     # compare second half of this iteration
    add     %rcx, %rdx      # pointer to last byte in buffer
    jc      .Loverflow      # did this overflow?
0:  pcmpeqb     %xmm2, %xmm0
    pmovmskb    %xmm0, %eax
    xor     $0xffff, %eax       # any mismatch?
    jne     .Lmismatch_head
    add     $64, %rdi       # advance to next iteration
    jmp     1f          # and get going with the loop

    /*
     * If we got here, a buffer length was passed to memcmp(a, b, len)
     * such that a + len < a.  While this sort of usage is illegal,
     * it is plausible that a caller tries to do something like
     * memcmp(a, b, SIZE_MAX) if a and b are known to differ, intending
     * for memcmp() to stop comparing at the first mismatch.  This
     * behaviour is not guaranteed by any version of ISO/IEC 9899,
     * but usually works out in practice.  Let's try to make this
     * case work by comparing until the end of the address space.
     */
.Loverflow:
    mov     $-1, %rdx       # compare until the end of memory
    jmp     0b

    /* process buffer 32 bytes at a time */
    ALIGN_TEXT
0:  movdqu      -32(%rsi, %rdi, 1), %xmm0
    movdqu      -16(%rsi, %rdi, 1), %xmm1
    pcmpeqb     -32(%rdi), %xmm0
    pcmpeqb     -16(%rdi), %xmm1
    add     $32, %rdi       # advance to next iteration
1:  pand        %xmm0, %xmm1        # 0xff where both halves matched
    pmovmskb    %xmm1, %eax
    cmp     $0xffff, %eax       # all bytes matched?
    jne     .Lmismatch
    cmp     %rdx, %rdi      # end of buffer reached?
    jb      0b

    /* less than 32 bytes left to compare */
    movdqu      -16(%rdx), %xmm1    # load 32 byte tail through end pointer
    movdqu      -16(%rdx, %rsi, 1), %xmm3
    movdqu      -32(%rdx), %xmm0
    movdqu      -32(%rdx, %rsi, 1), %xmm2
    pcmpeqb     %xmm3, %xmm1
    pcmpeqb     %xmm2, %xmm0
    pmovmskb    %xmm1, %ecx
    pmovmskb    %xmm0, %eax
    shl     $16, %ecx
    or      %ecx, %eax      # ones where the buffers match
    not     %eax            # ones where there is a mismatch
#ifndef BCMP
    bsf     %eax, %ecx      # location of the first mismatch
    cmovz       %eax, %ecx      # including if there is no mismatch
    add     %rcx, %rdx      # pointer to potential mismatch
    movzbl      -32(%rdx), %eax     # mismatching bytes
    movzbl      -32(%rdx, %rsi, 1), %edx
    sub     %edx, %eax
#endif
    ret

#ifdef BCMP
.Lmismatch:
    mov     $1, %eax
.Lmismatch_head:
    ret
#else /* memcmp */
.Lmismatch_head:
    tzcnt       %eax, %eax      # location of mismatch
    add     %rax, %rcx      # pointer to mismatch
    movzbl      (%rcx), %eax        # mismatching bytes
    movzbl      (%rcx, %rsi, 1), %ecx
    sub     %ecx, %eax
    ret

.Lmismatch:
    movdqu      -48(%rsi, %rdi, 1), %xmm1
    pcmpeqb     -48(%rdi), %xmm1    # reconstruct xmm1 before PAND
    pmovmskb    %xmm0, %eax     # mismatches in first 16 bytes
    pmovmskb    %xmm1, %edx     # mismatches in second 16 bytes
    shl     $16, %edx
    or      %edx, %eax      # mismatches in both
    not     %eax            # matches in both
    tzcnt       %eax, %eax      # location of mismatch
    add     %rax, %rdi      # pointer to mismatch
    movzbl      -64(%rdi), %eax     # mismatching bytes
    movzbl      -64(%rdi, %rsi, 1), %ecx
    sub     %ecx, %eax
    ret
#endif
ARCHEND(memcmp, baseline)

The following table might be helpful when reading unless the x86 registers are already ingrained in your head. :-)

Name   Type        64-bit (long)   32-bit (int)   16-bit (short)   8-bit (char)   Notes
rax    scratch     rax             eax            ax               ah and al      Values are returned from functions in this register.
rcx    scratch     rcx             ecx            cx               ch and cl      Typical scratch register. Some instructions also use it as a counter.
rdx    scratch     rdx             edx            dx               dh and dl      Scratch register. Function argument #3 in 64-bit Linux.
rbx    preserved   rbx             ebx            bx               bh and bl      Preserved register: don't use it without saving it!
rsp    preserved   rsp             esp            sp               spl            The stack pointer. Points to the top of the stack.
rbp    preserved   rbp             ebp            bp               bpl            Preserved register. Sometimes used to store the old value of the stack pointer, or the "base".
rsi    scratch     rsi             esi            si               sil            Scratch register. Also used to pass function argument #2 in 64-bit Linux.
rdi    scratch     rdi             edi            di               dil            Scratch register. Function argument #1 in 64-bit Linux.
r8     scratch     r8              r8d            r8w              r8b            Scratch register. r8-r15 were added in 64-bit mode, so they have numbers, not names.
r9     scratch     r9              r9d            r9w              r9b            Scratch register.
r10    scratch     r10             r10d           r10w             r10b           Scratch register.
r11    scratch     r11             r11d           r11w             r11b           Scratch register.
r12    preserved   r12             r12d           r12w             r12b           Preserved register. You can use it, but you need to save and restore it.
r13    preserved   r13             r13d           r13w             r13b           Preserved register.
r14    preserved   r14             r14d           r14w             r14b           Preserved register.
r15    preserved   r15             r15d           r15w             r15b           Preserved register.

x86 registers courtesy of [2]

Luckily this code is well commented, but in short it does the following: for inputs of at most 32 bytes it loads both buffers into a pair of XMM registers (detouring through a replacement buffer on the stack if the 32-byte overread would cross into another page), compares them with pcmpeqb and derives the first mismatching byte from the resulting bitmask; for longer inputs it walks both buffers 32 bytes at a time and, once a mismatching block is found, computes the byte difference the same way.

Translating to Aarch64

To begin with we ignore the special case regarding page crossing and save it for last, like a dessert. We start by loading registers pairwise, so 32 bytes are loaded immediately. Then we compare using cmeq, like we did with strlen, except that there we compared against #0.

Then we need to reduce the width so the result can fit in a GPR; we use shrn for that, just as we did before.

For matching bytes cmeq produces FF, so the register is all FF’s if the chunks were identical, and in that case we want to keep looping. The problem is that we can’t do a floating point compare with all FF’s, as that bit pattern is a NaN, so we move the value to a GPR and invert it. Inverting it also lets us count the zeroes: if we get 32 zeroes (FF’s before inverting) there was no mismatch and we can keep looking! To make sure we don’t look for too long we do a comparison with the limit first; if we are past our limit and no mismatch was found, we simply return 0.

.Lbegin:
    mov x8,x0           // store base address for later
    mov x9,x1

    ldp q0,q1,[x0]      //load 32 bytes of b1 into vector registers
    ldp q2,q3,[x1]

    cmeq    v0.16b,v0.16b,v2.16b    // do compare between 0-15 b1 vs b2
    shrn    v0.8b,v0.8h,#4      // shift them right to fit in x1
    cmeq    v1.16b,v1.16b,v3.16b    // do compare between 16-31 b1 vs b2
    shrn    v1.8b,v1.8h,#4

    fmov    x1,d0
    fmov    x3,d1

    mvn x0,x1           // invert to use clz
    cbz x0,0f
    rbit    x0,x0
    clz x0,x0           // if this is zero check bytes 16..32
    b   1f
    
0:  
    rbit    x1,x3
    mvn x1,x1
    clz x0,x1
    add x0,x0,#64
1:  
    lsr x0,x0,#2
    cmp x0,limit        // if we have more matches than the limit return 0
    b.ge    .Lnone  
    cmp x0,#32          // x0 == 32 if no hit (32 0's)
    b.eq    .Lloop
2:  
    ldrb    w4,[x8,x0]      // gonna need to edit this based on align offset
    ldrb    w5,[x9,x0]  
    subs    w0,w4,w5        // get the byte difference
    ret
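For readers who find C easier to follow, the same first-32-bytes sequence expressed with NEON intrinsics looks roughly like this. It is only a sketch: it assumes reading a full 32 bytes from both buffers is safe (which is exactly what the page-crossing handling further down takes care of), and my_memcmp32 is a made-up name.

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Compare the first (at most 32) bytes of b1 and b2; valid for len <= 32
 * and only if 32-byte loads from both pointers cannot fault. */
int my_memcmp32(const unsigned char *b1, const unsigned char *b2, size_t len) {
    uint8x16_t a0 = vld1q_u8(b1), a1 = vld1q_u8(b1 + 16);
    uint8x16_t c0 = vld1q_u8(b2), c1 = vld1q_u8(b2 + 16);

    /* cmeq: 0xff in every byte position where the inputs match */
    uint8x16_t eq0 = vceqq_u8(a0, c0);
    uint8x16_t eq1 = vceqq_u8(a1, c1);

    /* shrn #4: narrow each 16-byte result into a 64-bit mask, one nibble per byte */
    uint64_t m0 = vget_lane_u64(vreinterpret_u64_u8(
        vshrn_n_u16(vreinterpretq_u16_u8(eq0), 4)), 0);
    uint64_t m1 = vget_lane_u64(vreinterpret_u64_u8(
        vshrn_n_u16(vreinterpretq_u16_u8(eq1), 4)), 0);

    /* invert so non-zero nibbles mark mismatches, then count trailing zeroes */
    uint64_t d0 = ~m0, d1 = ~m1;
    size_t i;

    if (d0 != 0)
        i = __builtin_ctzll(d0) / 4;        /* byte index of first mismatch */
    else if (d1 != 0)
        i = 16 + __builtin_ctzll(d1) / 4;
    else
        return 0;                           /* all 32 bytes equal */

    if (i >= len)
        return 0;                           /* mismatch lies past the limit */
    return b1[i] - b2[i];                   /* difference of the mismatching bytes */
}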

Now this approach works for comparisons of up to 32 bytes, but we want to turn it into a loop. That’s quite straightforward with the way the code was written: simply make the loads increment the pointers and subtract from the limit.

    subs    limit,limit,#32
    b.lt    .Lnone          //we could do special handling here
    ldp q0,q1,[x8,32]!      //load 32 bytes of b1 into vector registers
    ldp q2,q3,[x9,32]!

Now the problem with this approach is that the loop condition is buried deep within a code path, so even with speculative execution and friends it will struggle.

Now we could try doing something like this:

.Lloop:
    subs    limit,limit,#32
    b.lt    .Lnone          //we could do special handling here
    ldp     q0,q1,[x8,32]!      //load 32 bytes of b1 into vector registers
    ldp     q2,q3,[x9,32]!
    
    cmeq    v0.16b,v0.16b,v2.16b    // do compare between 0-15 b1 vs b2
    cmeq    v1.16b,v1.16b,v3.16b    // do compare between 16-31 b1 vs b2

    umin    v2.16b,v0.16b,v1.16b
    umin    v2.16b,v2.16b,v2.16b    // separate lowest value    
    NOT     v2.16b,v2.16b           // invert it (FF -> we want to loop)    
    shrn    v2.8b,v2.8h,#4
    fcmp    d2,#0.0
    b.eq    .Lloop

Benchmarks showed that this variant did improve performance, but I tried a few other ways to tighten the loop that turned out to be even faster.

.Lloop:
    subs    limit,limit,#32
    b.lt    .Lnone          //we could do special handling here
    ldp q0,q1,[x8,32]!      //load 32 bytes of b1 into vector registers
    ldp q2,q3,[x9,32]!
    
    eor v4.16b,v0.16b,v2.16b    // nonzero where bytes 0-15 of this block differ
    eor v5.16b,v1.16b,v3.16b    // nonzero where bytes 16-31 differ
    umaxp   v4.16b,v4.16b,v5.16b    // pairwise max folds both halves into v4
    umaxp   v4.16b,v4.16b,v4.16b    // fold again so d4 covers all 32 bytes
    fmov    tmp, d4
    cbz tmp, .Lloop

    cmeq    v0.16b,v0.16b,v2.16b    // do compare between 0-15 b1 vs b2  
    cmeq    v1.16b,v1.16b,v3.16b    // do compare between 16-31 b1 vs b2

Despite not reusing the result from cmeq, this resulted in a few percent better performance across several Arm64 cores. This could be attributed to cmeq and shrn utilizing fewer pipelines than eor and umaxp, as shown for the Neoverse V2 core in its software optimization guide. [3]

A nice thing I realized is that this quick check is applicable to the first 32-byte iteration as well. Adding it there resulted in almost a 50% improvement for short strings using the now familiar suite of benchmarks.

Aarch64 registers.

Stay within your page

Now to the dessert, the crossing into unmapped pages.

We put our memcmp through the unit tests as previously described and we get the following.

One failed! But why? And it’s not even a test of memcmp!

If we look into the log we notice the following

Process with PID 9597 exited with signal 11 and dumped core; attempting to gather stack trace
[New LWP 120611]
Core was generated by `/usr/obj/home/getz/gsoc/freebsd-src/arm64.aarch64/lib/libc/tests/string/checkdir'.
Program terminated with signal SIGSEGV, Segmentation fault.
Address not mapped to object.
#0  memcmp () at /home/getz/gsoc/freebsd-src/lib/libc/aarch64/string/memcmp.S:78
78    ldp q2,q3,[x1] //load 32 bytes of b2 into vector registers

So it’s a SIGSEGV due to an address not being mapped, i.e. an unmapped page.

If we now run it through gdb (or lldb) we see the following values for x0 and x1 right before the attempted load.

>>> p/tz  $x0
$2 = 0x0000000041035000
>>> p/tz $x1
$3 = 0x0000000041036fff

To handle this we can first check if we are allowed to compare more than 32 bytes (limit > 32). If not, we check how many bytes are left until the next page crossing. If it’s less than 32 bytes, we load our 32 bytes from the buffer’s pointer minus that offset. Then, after the load, we shift the entire vector register to get rid of the unwanted bytes.
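In C terms, the check and the backed-up load boil down to something like the following sketch (PAGE_SIZE is hard-coded to 4 KiB here purely for illustration; the real code gets it from the system headers):

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096      /* assumption: 4 KiB pages, the common arm64 configuration */

/*
 * For len <= 32: if a straight 32-byte load at p would cross into the next
 * page, back the pointer up so the 32-byte window ends at p + len instead,
 * and report how many leading garbage bytes the caller has to discard
 * (the assembly does that discard with TBL).
 */
const unsigned char *safe_window(const unsigned char *p, size_t len, size_t *discard) {
    uintptr_t a = (uintptr_t)p;

    /* the PAGE_SIZE bit flips exactly when p and p + 32 sit on different
     * pages, which is what the eor/tst pair in the assembly tests */
    if (((a ^ (a + 32)) & PAGE_SIZE) != 0) {
        *discard = 32 - len;
        return p - *discard;    /* window is now [p + len - 32, p + len) */
    }
    *discard = 0;
    return p;
}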

On the SSE side there is the PSHUFB instruction that does this, and luckily there is a more or less equivalent instruction on Aarch64: TBL. We need TBL because the other shift instructions, like EXT and USHL, either don’t work with variable shifts or don’t work across an entire vector register. USHL, for example, shifts each element by a given amount depending on the arrangement specifier (the .16b piece you often see).


The number followed by a letter on the vector registers is the arrangement specifier: it tells you the number and the size of the elements in the vector. A simple way to keep track of them is that B is a byte (8 bits), H a halfword (16 bits), S a single word (32 bits) and D a doubleword (64 bits), so v0.16b means 16 byte-sized elements (courtesy of [4]).


So for checking if we are about to cross a page we can do the following

cmp     x2,#32
b.hi    .Lbegin    // Jump to regular loop if we are allowed to check more than 32 bytes
add     x3,x8,#32
add     x4,x9,#32
eor     x3,x3,x8
eor     x4,x4,x9
tst     w3,#PAGE_SIZE
b.ne    .Lspecialhandling
tst     w4,#PAGE_SIZE
b.ne    .Lspecialhandling

Then there’s also a piece that was overlooked in the main loop. As neither buf1 nor buf2 is aligned, there is a chance that we may overread when there’s less than 32 bytes left to compare. This is easily solved by adding a new branch where we previously checked whether the limit had gone down to 0.

.Lloop:
    subs    x2,x2,#32
    b.le    .Lnone
    cmp x2,#32
    b.le    .Llast32
    ldp q0,q1,[x8,32]!      // load 32 bytes to vector registers
    ldp q2,q3,[x9,32]!

    eor v4.16b,v0.16b,v2.16b
    eor v5.16b,v1.16b,v3.16b
    umaxp   v4.16b,v4.16b,v5.16b
    umaxp   v4.16b,v4.16b,v4.16b
    fmov    x6,d4
    cbz x6,.Lloop
    b   .Lmatch

    /* If 32 bytes left to compare only load 32 bytes from x8 - limit to
     * avoid overread */
.Llast32:
    mov x3,#32
    sub x3,x3,x2        // x3 = 32 - x2
    add x2,x2,x3        // add the amount we shifted to limit
    sub x8,x8,x3
    sub x9,x9,x3

    ldp q0,q1,[x8,#32]! // load 32 bytes to vector registers
    ldp q2,q3,[x9,#32]!

.Lmatch:
    cmeq    v0.16b,v0.16b,v2.16b    // compare between 0-15 b1 vs b2
    cmeq    v1.16b,v1.16b,v3.16b    // compare between 16-31 b1 vs b2

16 byte wide shifts

So now the only piece left in the puzzle is to shift the vector register for the case when we have 32 bytes or fewer to compare and they are located near the end of a page.
To utilize TBL here we need to generate a suitable permutation mask. To do this we can take the identity permutation 0, 1, 2, ..., 15, 16, 17, ..., 31 and add to each byte to rotate it, or simply load from it at a desired offset. This table can easily be prepared in a data segment.

    .section .rodata
    .p2align 4
shift_table:
    .byte 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
    .byte 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31

USHR/USHL could also work, but as they can only shift within 64-bit elements they would only be useful for very short strings, unless there is some nice way to utilize them here.

The TBL/TBX instructions are rather tricky to figure out from the reference manual, but Arm provides a graphical representation of how they operate, which cleared some things up for me.

TBL, the lookup table instruction, courtesy of [5].
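To get a feel for what TBL actually computes, here is a tiny standalone intrinsics example (illustrative only, not part of the libc code): every index byte selects one byte out of the 32-byte register pair, and any index of 32 or larger yields zero.

#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

int main() {
    uint8_t bytes[32], idx[16], out[16];

    for (int i = 0; i < 32; i++)
        bytes[i] = (uint8_t)(i + 100);      /* table contents: 100, 101, ..., 131 */
    for (int i = 0; i < 16; i++)
        idx[i] = (uint8_t)(i + 5);          /* start selecting at byte 5 */

    uint8x16x2_t tbl = { { vld1q_u8(bytes), vld1q_u8(bytes + 16) } };
    uint8x16_t res = vqtbl2q_u8(tbl, vld1q_u8(idx));

    vst1q_u8(out, res);
    for (int i = 0; i < 16; i++)
        printf("%u ", (unsigned)out[i]);    /* prints 105 106 ... 120 */
    printf("\n");
    return 0;
}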

We simply check if the 32nd byte from our buffer’s starting address is located on another page, as given by the PAGE_SIZE macro defined in sys/arm64/include/param.h. Then we shift and load by the appropriate amount.

So the check and fix can be implemented like the following:

ENTRY(_memcmp)

    mov x8, x0
    mov x9, x1
    cbz x2, .Lnone      // 0 length

    /*
     * Check if buffer is located at end of page to avoid crossing
     * into unmapped page. If so, we load 32 bytes from the end of the
     * limit and check the other buffer.
     */

    cmp x2, #32
    b.hi    .Lbegin
    add x3, x8, #32
    add x4, x9, #32
    eor x3, x3, x8
    eor x4, x4, x9

    tst w3, #PAGE_SIZE
    b.eq    0f

    mov x3, #32
    sub x3, x3, x2
    sub x8, x8, x3

    /*
     * We perform a variable shift in the vector register using TBL,
     * a suitable permutation is generated by loading a table of bytes
     * with a desired offset.
     */

    adrp    x0, shift_table
    add x0, x0, :lo12:shift_table
    add x0, x0, x3
    ldp q0, q1, [x8]
    ldp q4, q5, [x0]        // load permutation table
    tbl v0.16b, {v0.16b, v1.16b}, v4.16b
    tbl v1.16b, {v0.16b, v1.16b}, v5.16b
    add x8, x8, x3      // reset pointer to beginning of src
    b   1f

0:
    ldp q0, q1, [x8]

1:
    tst w4, #PAGE_SIZE
    b.eq    0f

    mov x3, #32
    sub x3, x3, x2
    sub x9, x9, x3

    ldp q2, q3, [x9]
    adrp    x0, shift_table
    add x0, x0, :lo12:shift_table
    add x0, x0, x3
    ldp q4, q5, [x0]
    tbl v2.16b, {v2.16b, v3.16b}, v4.16b
    tbl v3.16b, {v2.16b, v3.16b}, v5.16b
    add x9, x9, x3
    b   1f

    /*
     * Compare strings of 1--32 bytes.  We do this by loading into two
     * vector registers and then doing a quick compare with XOR,  UMAXP
     * to determine if the first 32 bytes all match.
     */
.Lbegin:
    ldp q0, q1, [x8]
0:
    ldp q2, q3, [x9]
1:

    /*
     * REST OF CODE HERE
     */

END(_memcmp)

    .section .rodata
    .p2align 4
shift_table:
    .byte 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
    .byte 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31

Benchmarks

Colin Percival was kind enough to give me a voucher for AWS credits, so now I can also benchmark on a more powerful Arm core of the kind you are more likely to find running a FreeBSD system.

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
            │ memcmpScalar │              memcmpARM              │             memcmpSIMD              │
            │    sec/op    │   sec/op     vs base                │   sec/op     vs base                │
MemcmpShort   139.41µ ± 2%   63.96µ ± 1%  -54.12% (p=0.000 n=20)   25.14µ ± 1%  -81.97% (p=0.000 n=20)
MemcmpMid      93.38µ ± 3%   12.09µ ± 1%  -87.05% (p=0.000 n=20)   10.91µ ± 1%  -88.31% (p=0.000 n=20)
MemcmpLong    86.722µ ± 7%   4.648µ ± 1%  -94.64% (p=0.000 n=20)   4.931µ ± 0%  -94.31% (p=0.000 n=20)
geomean        104.1µ        15.32µ       -85.29%                  11.06µ       -89.38%

            │ memcmpScalar │                memcmpARM                │               memcmpSIMD                │
            │     B/s      │      B/s       vs base                  │      B/s       vs base                  │
MemcmpShort   855.1Mi ± 2%   1864.0Mi ± 1%   +117.99% (p=0.000 n=20)   4742.2Mi ± 1%   +454.58% (p=0.000 n=20)
MemcmpMid     1.247Gi ± 3%    9.629Gi ± 1%   +671.89% (p=0.000 n=20)   10.668Gi ± 1%   +755.20% (p=0.000 n=20)
MemcmpLong    1.342Gi ± 6%   25.048Gi ± 1%  +1765.92% (p=0.000 n=20)   23.608Gi ± 0%  +1658.68% (p=0.000 n=20)
geomean       1.118Gi         7.600Gi        +579.66%                   10.53Gi        +841.33%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
            │ memcmpScalar  │              memcmpARM               │             memcmpSIMD              │
            │    sec/op     │    sec/op     vs base                │   sec/op     vs base                │
MemcmpShort    183.12µ ± 0%   101.29µ ± 0%  -44.68% (p=0.000 n=20)   39.96µ ± 0%  -78.18% (p=0.000 n=20)
MemcmpMid      129.12µ ± 0%    24.55µ ± 1%  -80.99% (p=0.000 n=20)   21.48µ ± 0%  -83.37% (p=0.000 n=20)
MemcmpLong    111.374µ ± 0%    6.288µ ± 0%  -94.35% (p=0.000 n=20)   7.593µ ± 0%  -93.18% (p=0.000 n=20)
geomean         138.1µ         25.01µ       -81.89%                  18.68µ       -86.47%

            │ memcmpScalar │                memcmpARM                │               memcmpSIMD                │
            │     B/s      │      B/s       vs base                  │      B/s       vs base                  │
MemcmpShort   651.0Mi ± 0%   1176.9Mi ± 0%    +80.78% (p=0.000 n=20)   2983.1Mi ± 0%   +358.23% (p=0.000 n=20)
MemcmpMid     923.3Mi ± 0%   4856.0Mi ± 1%   +425.95% (p=0.000 n=20)   5551.0Mi ± 0%   +501.23% (p=0.000 n=20)
MemcmpLong    1.045Gi ± 0%   18.514Gi ± 0%  +1671.24% (p=0.000 n=20)   15.332Gi ± 0%  +1366.84% (p=0.000 n=20)
geomean       863.3Mi         4.656Gi        +452.24%                   6.233Gi        +639.33%

References

[1]: https://research.google/pubs/automemcpy-a-framework-for-automatic-generation-of-fundamental-memory-operations/
[2]: https://www.cs.uaf.edu/2017/fall/cs301/lecture/09_11_registers.html
[3]: https://developer.arm.com/documentation/109898/latest/
[4]: https://stackoverflow.com/a/73293215/17418180