GSoC status update #2 (memcmp)
14-06-2024
What is memcmp even?
The man page on FreeBSD has the following to say:
MEMCMP(3) FreeBSD Library Functions Manual MEMCMP(3)
NAME
memcmp – compare byte string
LIBRARY
Standard C Library (libc, -lc)
SYNOPSIS
#include <string.h>
int
memcmp(const void *b1, const void *b2, size_t len);
DESCRIPTION
The memcmp() function compares byte string b1 against byte string b2.
Both strings are assumed to be len bytes long.
RETURN VALUES
The memcmp() function returns zero if the two strings are identical,
otherwise returns the difference between the first two differing bytes
(treated as unsigned char values, so that ‘\200’ is greater than ‘\0’,
for example). Zero-length strings are always identical.
SEE ALSO
bcmp(3), strcasecmp(3), strcmp(3), strcoll(3), strxfrm(3),
timingsafe_memcmp(3), wmemcmp(3)
STANDARDS
The memcmp() function conforms to ISO/IEC 9899:1990 (“ISO C90”).
FreeBSD 15.0-CURRENT August 15, 2016 FreeBSD 15.0-CURRENT
Now when starting off coding anything I like to begin with a minimal example, a
PoC (Proof of Concept). So for these string functions I first make a small
wrapper program that calls either the libc function or my assembly function. Then I
compare the outputs to get quick turnaround times before integrating
the function into libc and running the proper unit tests.
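A minimal sketch of such a wrapper, assuming a hypothetical my_memcmp symbol that would normally be provided by the assembly object under test (the stub below just forwards to libc so the example is self-contained):
#include <stdio.h>
#include <string.h>

/* Stand-in for the routine under test; in practice the .S object is linked in instead. */
static int
my_memcmp(const void *b1, const void *b2, size_t len)
{
	return (memcmp(b1, b2, len));
}

int
main(void)
{
	const char a[] = "\200abc";
	const char b[] = "\0abc";

	/* Print both results side by side for a quick sanity check. */
	printf("libc: %d\tcandidate: %d\n",
	    memcmp(a, b, sizeof(a)), my_memcmp(a, b, sizeof(a)));
	return (0);
}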
So I did just that, and to my surprise libc wasn’t returning the expected
values.
Here be dragons 🐉
So when I run my little wrapper program with the following:
#include <stdio.h>
#include <string.h>

int main() {
	printf("%i\n", memcmp("\200", "\0", 1));
}
It returns 1, that's odd. However, ISO/IEC 9899:1990 only specifies that a number greater than, equal to, or less than zero shall be returned, so going by the standard it's a valid return value. But it doesn't match up with what the FreeBSD man page states.
But if we edit our program a bit like the following we notice something interesting.
#include <stdio.h>
#include <string.h>

int main() {
	printf("%i\n", memcmp((char[]){"\200"}, "\0", 1));
}
It returns 128, as expected!
The reason for this is that on Aarch64 we have inherited string functions from
the Arm Optimized Routines. These functions weren't originally written with
FreeBSD in mind and this has simply gone unnoticed. If we instead glance over at
the manpage for memcmp on glibc we see the following regarding its return
value:
RETURN VALUE
The memcmp() function returns an integer less than, equal to, or greater than
zero if the first n bytes of s1 is found, respectively, to be less than, to
match, or be greater than the first n bytes of s2.
For a nonzero return value, the sign is determined by the sign of the difference
between the first pair of bytes (interpreted as unsigned char) that differ in s1
and s2.
If n is zero, the return value is zero.
I strongly suspect that nobody is checking the return value beyond <, =, or >, except for the unit tests… which do fail on FreeBSD!
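To make the distinction concrete, here is a small illustration of my own (not from the test suite) of the only check that is portable versus the one that leans on the FreeBSD man page wording:
#include <assert.h>
#include <string.h>

int
main(void)
{
	const unsigned char a[] = { 0x80 };
	const unsigned char b[] = { 0x00 };

	/* Portable: ISO C only guarantees the sign of the result. */
	assert(memcmp(a, b, 1) > 0);

	/*
	 * Not portable: expecting the exact byte difference (128 here) is
	 * only promised by the FreeBSD man page, not by the standard.
	 */
	/* assert(memcmp(a, b, 1) == 0x80); */
	return (0);
}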
I generated a result page for the failing tests using the kyua command
kyua report-html -r ${TEST}.db; everything you need to know about the
FreeBSD test suite is available on the FreeBSD wiki.
Finally, these inherited Arm Optimized Routines are located in
/usr/src/contrib/arm-optimized-routines and plugged into the string functions
Makefile.inc like the following:
AARCH64_STRING_FUNCS= \
memchr \
memcmp \
memcpy \
memmove \
memrchr \
memset \
stpcpy \
strchr \
strchrnul \
strcmp \
strcpy \
strlen \
strncmp \
strnlen \
strrchr
#
# Add the above functions. Generate an asm file that includes the needed
# Arm Optimized Routines file defining the function name to the libc name.
# Some file need multiple macros defined or a weak symbol added we can
# override the generated file in these cases.
#
.for FUNC in ${AARCH64_STRING_FUNCS}
.if !exists(${FUNC}.S)
${FUNC}.S:
	printf '/* %sgenerated by libc/aarch64/string/Makefile.inc */\n' @ > ${.TARGET}
	printf '#define __%s_aarch64 %s\n' ${FUNC} ${FUNC} >> ${.TARGET}
	printf '#include "aarch64/%s.S"\n' ${FUNC} >> ${.TARGET}
CLEANFILES+= ${FUNC}.S
.endif
MDSRCS+= ${FUNC}.S
CFLAGS.${FUNC}.S+=-I${SRCTOP}/contrib/arm-optimized-routines/string
.endfor
There we can swap out the faulty one for a compliant compiler-generated implementation, but let's get going with porting ours!
@@ -15,11 +15,12 @@ AARCH64_STRING_FUNCS= \
strchrnul \
strcmp \
strcpy \
- memcmp \
strncmp \
strnlen \
strrchr
+MDSRCS+= \
+ memcmp.S
#
# Add the above functions. Generate an asm file that includes the needed
# Arm Optimized Routines file defining the function name to the libc name.
These functions already have ARM NEON enhancements from the Arm Optimized Routines:
memchr | x
memcmp | x
memcpy | SWAR
memmove | SWAR
memrchr | x
memset | x
stpcpy | x
strchr | x
strchrnul | x
strcmp | SWAR
strcpy | x
strlen | x
strncmp | SWAR
strnlen | x
strrchr | x
Now I won't be touching memcpy as its implementation could benefit from another
design [1]. If we ignore the functions with pre-existing implementations
by Arm we are left with the following list. But let's finish off memcmp first,
because we already got started on it and maybe we can beat Arm at their own
game.
FUNCTION AMD64
bcmp S1
index S1
memccpy S1
rindex S1
stpncpy S1
strcat S1
strcmp S1
strcspn S2
strlcat S1
strlcpy S1
strncat S1
strncmp S1
strncpy S1
strpbrk S2
strsep S2
strspn S2
timingsafe_bcmp S1
Scalar compiler output
After the above small hiccup I took a look at the .c code for memcmp and its
generated assembly. Nothing particularly notable here.
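For reference, the scalar version boils down to a plain byte loop; a minimal sketch of the idea (not the exact libc source) would be:
#include <stddef.h>

/* Minimal scalar sketch: compare byte by byte as unsigned char. */
static int
memcmp_scalar(const void *b1, const void *b2, size_t len)
{
	const unsigned char *p1 = b1, *p2 = b2;

	for (; len != 0; len--, p1++, p2++) {
		if (*p1 != *p2)
			return (*p1 - *p2);
	}
	return (0);
}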
amd64 SIMD implementation
Now this is a long one:
ARCHENTRY(memcmp, baseline)
cmp $32, %rdx # enough to permit use of the long kernel?
ja .Llong
test %rdx, %rdx # zero bytes buffer?
je .L0
/*
* Compare strings of 1--32 bytes. We want to do this by
* loading into two xmm registers and then comparing. To avoid
* crossing into unmapped pages, we either load 32 bytes from
* the start of the buffer or 32 bytes before its end, depending
* on whether there is a page boundary between the overread area
* or not.
*/
/* check for page boundaries overreads */
lea 31(%rdi), %eax # end of overread
lea 31(%rsi), %r8d
lea -1(%rdi, %rdx, 1), %ecx # last character in buffer
lea -1(%rsi, %rdx, 1), %r9d
xor %ecx, %eax
xor %r9d, %r8d
test $PAGE_SIZE, %eax # are they on different pages?
jz 0f
/* fix up rdi */
movdqu -32(%rdi, %rdx, 1), %xmm0
movdqu -16(%rdi, %rdx, 1), %xmm1
lea -8(%rsp), %rdi # end of replacement buffer
sub %rdx, %rdi # start of replacement buffer
movdqa %xmm0, -40(%rsp) # copy to replacement buffer
movdqa %xmm1, -24(%rsp)
0: test $PAGE_SIZE, %r8d
jz 0f
/* fix up rsi */
movdqu -32(%rsi, %rdx, 1), %xmm0
movdqu -16(%rsi, %rdx, 1), %xmm1
lea -40(%rsp), %rsi # end of replacement buffer
sub %rdx, %rsi # start of replacement buffer
movdqa %xmm0, -72(%rsp) # copy to replacement buffer
movdqa %xmm1, -56(%rsp)
/* load data and compare properly */
0: movdqu 16(%rdi), %xmm1
movdqu 16(%rsi), %xmm3
movdqu (%rdi), %xmm0
movdqu (%rsi), %xmm2
mov %edx, %ecx
mov $-1, %edx
shl %cl, %rdx # ones where the buffer is not
pcmpeqb %xmm3, %xmm1
pcmpeqb %xmm2, %xmm0
pmovmskb %xmm1, %ecx
pmovmskb %xmm0, %eax
shl $16, %ecx
or %ecx, %eax # ones where the buffers match
or %edx, %eax # including where the buffer is not
not %eax # ones where there is a mismatch
#ifndef BCMP
bsf %eax, %edx # location of the first mismatch
cmovz %eax, %edx # including if there is no mismatch
movzbl (%rdi, %rdx, 1), %eax # mismatching bytes
movzbl (%rsi, %rdx, 1), %edx
sub %edx, %eax
#endif
ret
/* empty input */
.L0: xor %eax, %eax
ret
/* compare 33+ bytes */
ALIGN_TEXT
.Llong: movdqu (%rdi), %xmm0 # load head
movdqu (%rsi), %xmm2
mov %rdi, %rcx
sub %rdi, %rsi # express rsi as distance from rdi
and $~0xf, %rdi # align rdi to 16 bytes
movdqu 16(%rsi, %rdi, 1), %xmm1
pcmpeqb 16(%rdi), %xmm1 # compare second half of this iteration
add %rcx, %rdx # pointer to last byte in buffer
jc .Loverflow # did this overflow?
0: pcmpeqb %xmm2, %xmm0
pmovmskb %xmm0, %eax
xor $0xffff, %eax # any mismatch?
jne .Lmismatch_head
add $64, %rdi # advance to next iteration
jmp 1f # and get going with the loop
/*
* If we got here, a buffer length was passed to memcmp(a, b, len)
* such that a + len < a. While this sort of usage is illegal,
* it is plausible that a caller tries to do something like
* memcmp(a, b, SIZE_MAX) if a and b are known to differ, intending
* for memcmp() to stop comparing at the first mismatch. This
* behaviour is not guaranteed by any version of ISO/IEC 9899,
* but usually works out in practice. Let's try to make this
* case work by comparing until the end of the address space.
*/
.Loverflow:
mov $-1, %rdx # compare until the end of memory
jmp 0b
/* process buffer 32 bytes at a time */
ALIGN_TEXT
0: movdqu -32(%rsi, %rdi, 1), %xmm0
movdqu -16(%rsi, %rdi, 1), %xmm1
pcmpeqb -32(%rdi), %xmm0
pcmpeqb -16(%rdi), %xmm1
add $32, %rdi # advance to next iteration
1: pand %xmm0, %xmm1 # 0xff where both halves matched
pmovmskb %xmm1, %eax
cmp $0xffff, %eax # all bytes matched?
jne .Lmismatch
cmp %rdx, %rdi # end of buffer reached?
jb 0b
/* less than 32 bytes left to compare */
movdqu -16(%rdx), %xmm1 # load 32 byte tail through end pointer
movdqu -16(%rdx, %rsi, 1), %xmm3
movdqu -32(%rdx), %xmm0
movdqu -32(%rdx, %rsi, 1), %xmm2
pcmpeqb %xmm3, %xmm1
pcmpeqb %xmm2, %xmm0
pmovmskb %xmm1, %ecx
pmovmskb %xmm0, %eax
shl $16, %ecx
or %ecx, %eax # ones where the buffers match
not %eax # ones where there is a mismatch
#ifndef BCMP
bsf %eax, %ecx # location of the first mismatch
cmovz %eax, %ecx # including if there is no mismatch
add %rcx, %rdx # pointer to potential mismatch
movzbl -32(%rdx), %eax # mismatching bytes
movzbl -32(%rdx, %rsi, 1), %edx
sub %edx, %eax
#endif
ret
#ifdef BCMP
.Lmismatch:
mov $1, %eax
.Lmismatch_head:
ret
#else /* memcmp */
.Lmismatch_head:
tzcnt %eax, %eax # location of mismatch
add %rax, %rcx # pointer to mismatch
movzbl (%rcx), %eax # mismatching bytes
movzbl (%rcx, %rsi, 1), %ecx
sub %ecx, %eax
ret
.Lmismatch:
movdqu -48(%rsi, %rdi, 1), %xmm1
pcmpeqb -48(%rdi), %xmm1 # reconstruct xmm1 before PAND
pmovmskb %xmm0, %eax # mismatches in first 16 bytes
pmovmskb %xmm1, %edx # mismatches in second 16 bytes
shl $16, %edx
or %edx, %eax # mismatches in both
not %eax # matches in both
tzcnt %eax, %eax # location of mismatch
add %rax, %rdi # pointer to mismatch
movzbl -64(%rdi), %eax # mismatching bytes
movzbl -64(%rdi, %rsi, 1), %ecx
sub %ecx, %eax
ret
#endif
ARCHEND(memcmp, baseline)
The following table might be helpful when reading, unless the x86 registers are already ingrained in your head [2]. :-)
Name | Notes | Type | 64-bit long | 32-bit int | 16-bit short | 8-bit char
rax | Values are returned from functions in this register. | scratch | rax | eax | ax | ah and al
rcx | Typical scratch register. Some instructions also use it as a counter. | scratch | rcx | ecx | cx | ch and cl
rdx | Scratch register. Function argument #3 in 64-bit Linux. | scratch | rdx | edx | dx | dh and dl
rbx | Preserved register: don't use it without saving it! | preserved | rbx | ebx | bx | bh and bl
rsp | The stack pointer. Points to the top of the stack. | preserved | rsp | esp | sp | spl
rbp | Preserved register. Sometimes used to store the old value of the stack pointer, or the "base". | preserved | rbp | ebp | bp | bpl
rsi | Scratch register. Also used to pass function argument #2 in 64-bit Linux. | scratch | rsi | esi | si | sil
rdi | Scratch register. Function argument #1 in 64-bit Linux. | scratch | rdi | edi | di | dil
r8 | Scratch register. These were added in 64-bit mode, so they have numbers, not names. | scratch | r8 | r8d | r8w | r8b
r9 | Scratch register. | scratch | r9 | r9d | r9w | r9b
r10 | Scratch register. | scratch | r10 | r10d | r10w | r10b
r11 | Scratch register. | scratch | r11 | r11d | r11w | r11b
r12 | Preserved register. You can use it, but you need to save and restore it. | preserved | r12 | r12d | r12w | r12b
r13 | Preserved register. | preserved | r13 | r13d | r13w | r13b
r14 | Preserved register. | preserved | r14 | r14d | r14w | r14b
r15 | Preserved register. | preserved | r15 | r15d | r15w | r15b
Luckily this code is well commented, but in short it does the following:
1. Check whether we are comparing more than 32 bytes.
2. If not, check whether the bytes we are about to load would cross a page boundary, since we load 32 bytes at a time. For example, we might only be comparing 20 bytes while the page boundary sits 25 bytes from the beginning of one of our inputs (see the C sketch after this list).
3. If that is the case, copy the data to a replacement/bounce buffer on the stack and override rdi/rsi.
4. If no funny business is required we simply start comparing data in a similar way to strlen, except now we have a mask telling us whether everything matched or not. So we look for the occurrence of the first 00 rather than FF as we did in strlen. Once we do get a hit we load the differing bytes at that offset from our base addresses and subtract them before returning.
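A rough C rendition of that page-boundary check, assuming a 4096-byte page (overread_crosses_page is just an illustrative name):
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096	/* assumed here for illustration */

/*
 * Does a 32-byte load starting at p touch a page that the last byte of
 * the len-byte buffer does not? Since the two addresses are less than a
 * page apart, the PAGE_SIZE bit differs exactly when a page boundary
 * lies between them.
 */
static int
overread_crosses_page(const void *p, size_t len)
{
	uintptr_t end_of_overread = (uintptr_t)p + 31;
	uintptr_t last_byte = (uintptr_t)p + len - 1;

	return (((end_of_overread ^ last_byte) & PAGE_SIZE) != 0);
}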
Translating to Aarch64
To begin with we ignore the special case regarding page crossing and save it for
last, like a dessert. We start by loading registers pairwise, so 32 bytes are
loaded immediately. Then we compare using cmeq like we did with strlen, except
that there we compared with #0. Then we need to reduce the width so the result
can fit in a GPR; we use shrn for that, just as we did before.
But for matches the result is FF, so the register is all FF's if the chunks were
identical, and if they were identical we want to keep looping. The problem is
that we can't do a floating point compare with FF as it's NaN, therefore we move
the result to a GPR and invert it. Inverting it also lets us count the zeroes,
and if the count comes out to 32 (all FF's before inverting) no differing byte
was found and we can keep looking!
To make sure we don't look for too long we do a comparison with the max limit
first; if we are past our limit and no difference was found then we simply return
0.
.Lbegin:
mov x8,x0 // store base address for later
mov x9,x1
ldp q0,q1,[x0] //load 32 bytes of b1 into vector registers
ldp q2,q3,[x1]
cmeq v0.16b,v0.16b,v2.16b // do compare between 0-15 b1 vs b2
shrn v0.8b,v0.8h,#4 // shift them right to fit in x1
cmeq v1.16b,v1.16b,v3.16b // do compare between 16-31 b1 vs b2
shrn v1.8b,v1.8h,#4
fmov x1,d0
fmov x3,d1
mvn x0,x1 // invert to use clz
cbz x0,0f
rbit x0,x0
clz x0,x0 // if this is zero check bytes 16..32
b 1f
0:
rbit x1,x3
mvn x1,x1
clz x0,x1
add x0,x0,#64
1:
lsr x0,x0,#2
cmp x0,limit // if we have more matches than the limit return 0
b.ge .Lnone
cmp x0,#32 // x0 == 32 if no hit (32 0's)
b.eq .Lloop
2:
ldrb w4,[x8,x0] // gonna need to edit this based on align offset
ldrb w5,[x9,x0]
subs w0,w4,w5 // get the byte difference
ret
Now this approach works for comparisons of up to 32 bytes, but we want to turn it into a loop. It's quite straightforward with the way the code was written: simply make the loads increment the pointers and subtract from the limit:
subs limit,limit,#32
b.lt .Lnone //we could do special handling here
ldp q0,q1,[x8,32]! //load 32 bytes of b1 into vector registers
ldp q2,q3,[x9,32]!
Now the problem with this approach is that the loop condition is buried deep within the code path, so even with speculative execution it will struggle.
Now we could try doing something like this:
.Lloop:
subs limit,limit,#32
b.lt .Lnone //we could do special handling here
ldp q0,q1,[x8,32]! //load 32 bytes of b1 into vector registers
ldp q2,q3,[x9,32]!
cmeq v0.16b,v0.16b,v2.16b // do compare between 0-15 b1 vs b2
cmeq v1.16b,v1.16b,v3.16b // do compare between 16-32 b1 vs b2
umin v2.16b,v0.16b,v1.16b
umin v2.16b,v2.16b,v2.16b // separate lowest value
NOT v2.16b,v2.16b // invert it (FF -> we want to loop)
shrn v2.8b,v2.8h,#4
fcmp d2,#0.0
b.eq .Lloop
Benchmarks showed that this resulted in better performance, but I tried a few other ways to tighten the loop that turned out to be even faster.
.Lloop:
subs limit,limit,#32
b.lt .Lnone //we could do special handling here
ldp q0,q1,[x8,32]! //load 32 bytes of b1 into vector registers
ldp q2,q3,[x9,32]!
eor v4.16b,v0.16b,v2.16b // v4 = b1 XOR b2 (first 16 bytes of this block)
eor v5.16b,v1.16b,v3.16b // v5 = b1 XOR b2 (next 16 bytes of this block)
umaxp v4.16b,v4.16b,v5.16b // v4 = pairwise max of v4 and v5
umaxp v4.16b,v4.16b,v4.16b // pairwise max again, any mismatch ends up in the low 64 bits
fmov tmp, d4
cbz tmp, .Lloop
cmeq v0.16b,v0.16b,v2.16b // do compare between 0-15 b1 vs b2
cmeq v1.16b,v1.16b,v3.16b // do compare between 16-32 b1 vs b2
Despite not reusing the result from cmeq this gave a few percent better
performance across several Arm64 cores. This could be attributed to cmeq and
shrn utilizing fewer pipelines than eor and umaxp, as shown for the Neoverse V2
core in the optimization guide [3].
A nice thing I realized is that this quick check is applicable for the first 32-byte iteration as well. Adding it there resulted in almost a 50% improvement for short strings using the now familiar suite of benchmarks.
Stay within your page
Now for the dessert: crossing into unmapped pages.
We put our memcmp through the unit tests as previously described and we get the following.
One failed! But why? And it’s not even a test of memcmp!
If we look into the log we notice the following:
Process with PID 9597 exited with signal 11 and dumped core; attempting to gather stack trace
[New LWP 120611]
Core was generated by `/usr/obj/home/getz/gsoc/freebsd-src/arm64.aarch64/lib/libc/tests/string/checkdir'.
Program terminated with signal SIGSEGV, Segmentation fault.
Address not mapped to object.
#0 memcmp () at /home/getz/gsoc/freebsd-src/lib/libc/aarch64/string/memcmp.S:78
78 ldp q2,q3,[x1] //load 32 bytes of b2 into vector registers
So it's a SIGSEGV due to an address not being mapped, i.e. an unmapped page.
If we now run it through gdb (or lldb) we see the following values for x0 and
x1 right before the attempted load:
>>> p/tz $x0
$2 = 0x0000000041035000
>>> p/tz $x1
$3 = 0x0000000041036fff
To handle this we can first check if we are allowed to compare more than 32 bytes (limit > 32). If not, we check how many bytes are left until the next page crossing. If it's less than 32 bytes we load our 32 bytes from the buffer's pointer minus that offset. Then, after the load, we shift the entire vector register to get rid of the unwanted bytes.
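In scalar terms the trick looks something like this (load32_backshifted is just an illustrative name; the real code of course does the discard with a vector shift rather than a copy):
#include <stddef.h>
#include <string.h>

/*
 * Read the 32 bytes ending at p + len instead of the 32 bytes starting
 * at p, then drop the leading "off" junk bytes. This path is only taken
 * when a forward 32-byte load was detected to cross a page, i.e. p
 * already sits near the end of its page, so p - off stays mapped.
 */
static void
load32_backshifted(unsigned char out[32], const unsigned char *p, size_t len)
{
	size_t off = 32 - len;		/* bytes to discard, len <= 32 */

	memcpy(out, p - off, 32);	/* read ends exactly at p + len */
	memmove(out, out + off, len);	/* "shift" away the unwanted bytes */
}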
On SSE there is a PSHUFB instruction that can do such a shift, and luckily there
is a more or less equivalent instruction on Aarch64: TBL. We need to use TBL as
the other shift instructions like EXT and USHL either don't work with variable
shifts or don't work across the entire vector register. USHL, for instance, would
shift each element by a certain amount depending on the arrangement specifier
(the .16b piece you often see).
The number followed by a letter after a vector register is the arrangement specifier: it gives the number and size of the elements in the vector. A simple way to keep track of them is by thinking:
B = Bytes (8 bit)
H = Halfwords (16 bit)
S = Single words (32 bit)
D = Double words (64 bit)
courtesy of [4]
So to check if we are about to cross a page we can do the following:
cmp x2,#32
b.hi .Lbegin // Jump to regular loop if we are allowed to check more than 32 bytes
add x3,x8,#32
add x4,x9,#32
eor x3,x3,x8
eor x4,x4,x9
tst w3,#PAGE_SIZE
b.ne .Lspecialhandling
tst w4,#PAGE_SIZE
b.ne .Lspecialhandling
Then there's also a piece that was overlooked in the main loop. As neither buf1 nor buf2 is aligned, there is a chance that we may overread when there's less than 32 bytes left to compare. This is easily solved by adding a new branch where we previously checked if the limit had gone down to 0:
.Lloop:
subs x2,x2,#32
b.le .Lnone
cmp x2,#32
b.le .Llast32
ldp q0,q1,[x8,32]! // load 32 bytes to vector registers
ldp q2,q3,[x9,32]!
eor v4.16b,v0.16b,v2.16b
eor v5.16b,v1.16b,v3.16b
umaxp v4.16b,v4.16b,v5.16b
umaxp v4.16b,v4.16b,v4.16b
fmov x6,d4
cbz x6,.Lloop
b .Lmatch
/* If 32 bytes left to compare only load 32 bytes from x8 - limit to
* avoid overread */
.Llast32:
mov x3,#32
sub x3,x3,x2 // x3 = 32 - x2
add x2,x2,x3 // add the amount we shifted to limit
sub x8,x8,x3
sub x9,x9,x3
ldp q0,q1,[x8,#32]! // load 32 bytes to vector registers
ldp q2,q3,[x9,#32]!
.Lmatch:
cmeq v0.16b,v0.16b,v2.16b // compare between 0-15 b1 vs b2
cmeq v1.16b,v1.16b,v3.16b // compare between 16-31 b1 vs b2
16 byte wide shifts
So now the only piece left in the puzzle is to shift the vector register for the
case when we have 32 bytes or less to compare and they are located near the end
of a page.
To utilize TBL here we need to generate a suitable permutation mask. To do this
we can take the identity permutation 0, 1, 2, ..., 15, 16, 17, ..., 31 and add to
each byte to rotate it, or load from it with a desired offset. This table can
easily be prepared in a data segment:
.section .rodata
.p2align 4
shift_table:
.byte 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
.byte 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
USHR/USHL could also work, but as they can only shift 64-bit words they would
only be useful for very short strings, unless there is some nice way to utilize
them here. The TBL/TBX instructions are rather tricky to figure out from the
reference manual alone, but Arm provides a graphical representation of how they
operate, which cleared some things up for me.
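Roughly, the two-register form of TBL behaves like the following scalar model (tbl2 is just an illustrative name, not a real intrinsic), which also shows why indexing the identity table at an offset acts as a byte-wise shift:
#include <stddef.h>
#include <stdint.h>

/*
 * Model of TBL with a two-register (32-byte) table: each index byte
 * selects a byte from the table, and out-of-range indices yield 0.
 * Feeding it the identity table {0, 1, ..., 31} read at offset "off"
 * gives indices off..off+15, i.e. the loaded data shifted down by off.
 */
static void
tbl2(uint8_t dst[16], const uint8_t table[32], const uint8_t idx[16])
{
	for (size_t i = 0; i < 16; i++)
		dst[i] = (idx[i] < 32) ? table[idx[i]] : 0;
}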
We simply check if the 32nd byte from our buffer's starting address is located on
another page, as determined by the PAGE_SIZE macro defined in
sys/arm64/include/param.h. Then we shift and load by the appropriate amount.
So the check and fix can be implemented like the following:
ENTRY(_memcmp)
mov x8, x0
mov x9, x1
cbz x2, .Lnone // 0 length
/*
* Check if buffer is located at end of page to avoid crossing
* into unmapped page. If so, we load 32 bytes from the end of the
* limit and check the other buffer.
*/
cmp x2, #32
b.hi .Lbegin
add x3, x8, #32
add x4, x9, #32
eor x3, x3, x8
eor x4, x4, x9
tst w3, #PAGE_SIZE
b.eq 0f
mov x3, #32
sub x3, x3, x2
sub x8, x8, x3
/*
* We perform a variable shift in the vector register using TBL,
* a suitable permutation is generated by loading a table of bytes
* with a desired offset.
*/
adrp x0, shift_table
add x0, x0, :lo12:shift_table
add x0, x0, x3
ldp q0, q1, [x8]
ldp q4, q5, [x0] // load permutation table
tbl v0.16b, {v0.16b, v1.16b}, v4.16b
tbl v1.16b, {v0.16b, v1.16b}, v5.16b
add x8, x8, x3 // reset pointer to beginning of src
b 1f
0:
ldp q0, q1, [x8]
1:
tst w4, #PAGE_SIZE
b.eq 0f
mov x3, #32
sub x3, x3, x2
sub x9, x9, x3
ldp q2, q3, [x9]
adrp x0, shift_table
add x0, x0, :lo12:shift_table
add x0, x0, x3
ldp q4, q5, [x0]
tbl v2.16b, {v2.16b, v3.16b}, v4.16b
tbl v3.16b, {v2.16b, v3.16b}, v5.16b
add x9, x9, x3
b 1f
/*
 * Compare strings of 1--32 bytes. We do this by loading into two
 * vector registers and then doing a quick compare with XOR and UMAXP
 * to determine if the first 32 bytes all match.
 */
.Lbegin:
ldp q0, q1, [x8]
0:
ldp q2, q3, [x9]
1:
/*
* REST OF CODE HERE
*/
END(_memcmp)
.section .rodata
.p2align 4
shift_table:
.byte 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
.byte 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
Benchmarks
Colin Percival was kind enough to give me a voucher for AWS credits, so now I can also benchmark on a more powerful ARM core which might be more commonly found running a FreeBSD system.
os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
│ memcmpScalar │ memcmpARM │ memcmpSIMD │
│ sec/op │ sec/op vs base │ sec/op vs base │
MemcmpShort 139.41µ ± 2% 63.96µ ± 1% -54.12% (p=0.000 n=20) 25.14µ ± 1% -81.97% (p=0.000 n=20)
MemcmpMid 93.38µ ± 3% 12.09µ ± 1% -87.05% (p=0.000 n=20) 10.91µ ± 1% -88.31% (p=0.000 n=20)
MemcmpLong 86.722µ ± 7% 4.648µ ± 1% -94.64% (p=0.000 n=20) 4.931µ ± 0% -94.31% (p=0.000 n=20)
geomean 104.1µ 15.32µ -85.29% 11.06µ -89.38%
│ memcmpScalar │ memcmpARM │ memcmpSIMD │
│ B/s │ B/s vs base │ B/s vs base │
MemcmpShort 855.1Mi ± 2% 1864.0Mi ± 1% +117.99% (p=0.000 n=20) 4742.2Mi ± 1% +454.58% (p=0.000 n=20)
MemcmpMid 1.247Gi ± 3% 9.629Gi ± 1% +671.89% (p=0.000 n=20) 10.668Gi ± 1% +755.20% (p=0.000 n=20)
MemcmpLong 1.342Gi ± 6% 25.048Gi ± 1% +1765.92% (p=0.000 n=20) 23.608Gi ± 0% +1658.68% (p=0.000 n=20)
geomean 1.118Gi 7.600Gi +579.66% 10.53Gi +841.33%
os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
│ memcmpScalar │ memcmpARM │ memcmpSIMD │
│ sec/op │ sec/op vs base │ sec/op vs base │
MemcmpShort 183.12µ ± 0% 101.29µ ± 0% -44.68% (p=0.000 n=20) 39.96µ ± 0% -78.18% (p=0.000 n=20)
MemcmpMid 129.12µ ± 0% 24.55µ ± 1% -80.99% (p=0.000 n=20) 21.48µ ± 0% -83.37% (p=0.000 n=20)
MemcmpLong 111.374µ ± 0% 6.288µ ± 0% -94.35% (p=0.000 n=20) 7.593µ ± 0% -93.18% (p=0.000 n=20)
geomean 138.1µ 25.01µ -81.89% 18.68µ -86.47%
│ memcmpScalar │ memcmpARM │ memcmpSIMD │
│ B/s │ B/s vs base │ B/s vs base │
MemcmpShort 651.0Mi ± 0% 1176.9Mi ± 0% +80.78% (p=0.000 n=20) 2983.1Mi ± 0% +358.23% (p=0.000 n=20)
MemcmpMid 923.3Mi ± 0% 4856.0Mi ± 1% +425.95% (p=0.000 n=20) 5551.0Mi ± 0% +501.23% (p=0.000 n=20)
MemcmpLong 1.045Gi ± 0% 18.514Gi ± 0% +1671.24% (p=0.000 n=20) 15.332Gi ± 0% +1366.84% (p=0.000 n=20)
geomean 863.3Mi 4.656Gi +452.24% 6.233Gi +639.33%
References
[1]: https://research.google/pubs/automemcpy-a-framework-for-automatic-generation-of-fundamental-memory-operations/
[2]: https://www.cs.uaf.edu/2017/fall/cs301/lecture/09_11_registers.html
[3]: https://developer.arm.com/documentation/109898/latest/
[4]: https://stackoverflow.com/a/73293215/17418180