GSoC status update #5 (cpy, ccpy, lcpy)
04-08-2024
What's been going on the last few weeks
So the last few weeks I've been working on the *cpy functions. I gave
strcspn a try, but it's really tricky, so I decided to save it for last
and perhaps continue with it past the GSoC deadline if needed.
I started with memccpy and let it be the base for strlcpy. I also noticed
that stpncpy is quite similar, but I haven't gotten around to writing it
yet, although the functions that are already done will serve as a rough
base for it.
memccpy
It turned out to be rather tricky to get this one right, a classic case of the
ninety-ninety rule. It went smoothly at first, but ironing out little
off-by-one bugs here and there really took a while. Having two debuggers open
side by side and comparing register values from the amd64 variant to my aarch64
port was really useful. To complicate things further, I found a bug in the
existing memccpy implementation, which fuz took care of. [1] This first
resulted in a small degradation in performance, but after some rework it turned
into an improvement over the original faulty code! :-)
This meant that I rewrote parts of memccpy three times in order to improve performance. In the end it turned out quite nice, beating the scalar performance by over 1000% for longer strings across all processors I was able to benchmark on. Sadly, it turned out to be a whole 20% slower on short strings on the Neoverse N1 chip.
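As a reference for what the SIMD version has to get right, here is a minimal scalar sketch of memccpy's contract; memccpy_ref is a hypothetical name for illustration, not the libc routine:

```c
#include <stddef.h>

/*
 * Minimal scalar sketch of memccpy's contract: copy at most len bytes
 * from src to dst, stopping after the first byte equal to c has been
 * copied.  Return a pointer to the byte just after c in dst, or NULL
 * if c was not found within the first len bytes.
 */
static void *
memccpy_ref(void *restrict dst, const void *restrict src, int c, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (len-- > 0) {
		if ((*d++ = *s++) == (unsigned char)c)
			return (d);
	}
	return (NULL);
}
```

The return value pointing one past the match is exactly the kind of boundary where the off-by-one bugs like to hide.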
strlcpy
Very similar to memccpy, except that we return the length of the src string
instead of a pointer. This means that we essentially run strlen at the same
time, which leads to some nice multi-tasking. An idea I had was to use a
strlen implementation that is fast for short to medium strings for the case
where the whole source string isn't copied. My reasoning is that it's more
common to hit the limit near the end of a string than near the beginning.
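That contract — copy bounded, but report the full source length either way — can be sketched in scalar C like this; strlcpy_ref is a hypothetical stand-in for illustration, not the SIMD implementation:

```c
#include <stddef.h>

/*
 * Minimal scalar sketch of strlcpy: copy at most dstsize-1 bytes and
 * NUL-terminate (if dstsize > 0), then keep walking src to its end so
 * the full source length is returned even when the copy was cut short.
 * This tail walk is the "strlen at the same time" part.
 */
static size_t
strlcpy_ref(char *restrict dst, const char *restrict src, size_t dstsize)
{
	const char *s = src;

	if (dstsize > 0) {
		while (--dstsize > 0 && (*dst++ = *s) != '\0')
			s++;
		if (dstsize == 0)
			*dst = '\0';
	}
	while (*s != '\0')
		s++;
	return ((size_t)(s - src));
}
```

A return value greater than or equal to dstsize tells the caller the result was truncated.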
strlcpy is presented in this paper [2], and recently the Linux kernel has
been moving away from strlcpy in favor of strscpy. [3]
There surely is a meme out there about the naming of each new variant of the
string functions meant to make them more "safe". :-)
strlcat
strlcat is rather simple to implement using existing string functions. Take
the one from FreeBSD's libc: [4]
size_t
strlcat(char *restrict dst, const char *restrict src, size_t dstsize)
{
	char *loc = __memchr(dst, '\0', dstsize);

	if (loc != NULL) {
		size_t dstlen = (size_t)(loc - dst);

		return (dstlen + __strlcpy(loc, src, dstsize - dstlen));
	} else
		return (dstsize + strlen(src));
}
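To see the two branches in action outside of libc, here is a hypothetical standalone version using plain memchr and a tiny bounded copy in place of the internal __memchr/__strlcpy; the names strlcat_ref and bounded_copy are mine, not from the tree:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for __strlcpy: bounded copy returning strlen(src). */
static size_t
bounded_copy(char *dst, const char *src, size_t dstsize)
{
	const char *s = src;

	if (dstsize > 0) {
		while (--dstsize > 0 && (*dst++ = *s) != '\0')
			s++;
		if (dstsize == 0)
			*dst = '\0';
	}
	while (*s != '\0')
		s++;
	return ((size_t)(s - src));
}

/* Same shape as the FreeBSD strlcat above, with memchr from <string.h>. */
static size_t
strlcat_ref(char *dst, const char *src, size_t dstsize)
{
	char *loc = memchr(dst, '\0', dstsize);

	if (loc != NULL) {
		size_t dstlen = (size_t)(loc - dst);

		return (dstlen + bounded_copy(loc, src, dstsize - dstlen));
	}
	return (dstsize + strlen(src));
}
```

The first branch handles the normal case where dst is properly terminated within dstsize; the second handles an unterminated dst, where nothing is appended and the return value just reports how much space would have been needed.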
Now the only issue is that I haven't written a SIMD memchr, as there is already a well-optimized variant in the arm-optimized-routines repository in /contrib. The first problem is that I should not touch anything in /contrib, as it is just pulled from a git repository. The second is that Arm labels their memchr __memchr_aarch64, and then the Makefile does a little rewrite.
.for FUNC in ${AARCH64_STRING_FUNCS}
.if !exists(${FUNC}.S)
${FUNC}.S:
	printf '/* %sgenerated by libc/aarch64/string/Makefile.inc */\n' @ > ${.TARGET}
	printf '#define __%s_aarch64 %s\n' ${FUNC} ${FUNC} >> ${.TARGET}
	printf '#include "aarch64/%s.S"\n' ${FUNC} >> ${.TARGET}
CLEANFILES+=	${FUNC}.S
.endif
.endfor
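To make that rewrite concrete: for a function in AARCH64_STRING_FUNCS, say strlen, the generated wrapper file would look roughly like this (my reconstruction of the printf output above, not a file from the tree):

```c
/* @generated by libc/aarch64/string/Makefile.inc */
#define __strlen_aarch64 strlen
#include "aarch64/strlen.S"
```

So the define renames the Arm-internal symbol to the public libc name when the /contrib source is assembled.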
The reason why we can't just call memchr is that libc functions must be
designed such that they do not appear to call other libc functions, in case
the user overrides any of them. Calling something __something instead is a
rather simple fix. :-)
So the solution is to create a little wrapper for memchr, which naturally brings us to memcpy!
memcpy
I noticed that the memcpy currently included in libc wasn't SIMD optimized,
despite there being an ASIMD memcpy variant in /contrib.
The reason is that, unlike for all the other string functions, Arm has
provided both a scalar memcpy and a SIMD memcpy. This is probably due to the
string functions being grandfathered in from an older optimized-routines
repository. Anyway, ASIMD is part of the base ISA of AArch64, so as long as
it's not slower there's little incentive not to use it. The fix was simple.
diff --git a/lib/libc/aarch64/string/memcpy.S b/lib/libc/aarch64/string/memcpy.S
--- a/lib/libc/aarch64/string/memcpy.S
+++ b/lib/libc/aarch64/string/memcpy.S
@@ -1,6 +1,6 @@
-#define __memcpy_aarch64 memcpy
-#define __memmove_aarch64 memmove
-#include "aarch64/memcpy.S"
+#define __memcpy_aarch64_simd memcpy
+#define __memmove_aarch64_simd memmove
+#include "aarch64/memcpy-advsimd.S"
Coming weeks
GSoC is slowly nearing its end; there's just over 2 weeks left of the standard (12-week) coding period. These final weeks will consist of me tying up some loose ends, such as the missing bcmp ifdefs for memcmp, which bundles both bcmp and memcmp into the same file, and finishing the implementation of the fancy algorithm for strcspn.
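The bcmp/memcmp bundling idea can be sketched in C — this is a hypothetical sketch of the principle, while the real bundling happens with ifdefs in a single assembly file: bcmp only has to report equal or not equal, so a shared implementation can take a cheaper exit path when built as bcmp.

```c
#include <stddef.h>

/*
 * Hypothetical sketch of why bcmp and memcmp can share one source file:
 * both walk the buffers the same way, but bcmp only needs a
 * zero/nonzero answer, so it can skip computing the ordering of the
 * first mismatching byte.
 */
static int
cmp_common(const void *a, const void *b, size_t n, int bool_only)
{
	const unsigned char *p = a;
	const unsigned char *q = b;

	for (; n > 0; n--, p++, q++) {
		if (*p != *q)
			return (bool_only ? 1 : *p - *q);
	}
	return (0);
}
```

In the assembly version, an ifdef selects between the two return paths at build time instead of a runtime flag.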
Benchmarks
There are quite a few benchmarks, so it's better to visit the differential revision (DR) for each of them instead.
References
[1]: https://reviews.freebsd.org/D46052
[2]: http://www.usenix.org/publications/library/proceedings/usenix99/full_papers/millert/millert.pdf
[3]: https://lwn.net/Articles/905777/
[4]: https://cgit.freebsd.org/src/tree/lib/libc/amd64/string/strlcat.c