GSoC status update #5 (cpy, ccpy, lcpy)
04-08-2024
What's been going on the last few weeks
So the last few weeks I've been working on the *cpy functions. I gave
strcspn a try, but it's really tricky, so I decided to save it for last
and perhaps continue with it past the GSoC deadline if needed.
I started with memccpy and let it be the base for strlcpy. I also noticed
that stpncpy is quite similar, but I haven't gotten around to writing it
yet, although the functions that are already done will serve as a rough
base for it.
memccpy
It turned out to be rather tricky to get this one right, a classic case of the
ninety-ninety rule. It went smoothly at first, but ironing out little
off-by-one bugs here and there really took a while. Having two debuggers open
side by side and comparing register values from the amd64 variant to my aarch64
port was really useful. To complicate things further, I found a bug in the
existing memccpy implementation, which fuz took care of. [1] This first
resulted in a small degradation in performance, but after some rework it turned
into an improvement over the original faulty code! :-)
This meant that I rewrote parts of memccpy three times in order to improve performance. In the end it turned out quite nice, beating the scalar performance by over 1000% for longer strings across all processors I was able to benchmark on. Sadly, it turned out to be a whole 20% slower on short strings on the Neoverse N1 chip.
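As a reference for what the SIMD version has to get right, here is a minimal scalar sketch of memccpy's contract; memccpy_ref is a hypothetical name for illustration, not the libc routine:

```c
#include <stddef.h>

/*
 * Minimal scalar sketch of memccpy's contract: copy at most len bytes
 * from src to dst, stopping after the first byte equal to c has been
 * copied.  Return a pointer to the byte just after c in dst, or NULL
 * if c was not found within the first len bytes.
 */
static void *
memccpy_ref(void *restrict dst, const void *restrict src, int c, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (len-- > 0) {
		if ((*d++ = *s++) == (unsigned char)c)
			return (d);
	}
	return (NULL);
}
```

The return value pointing one past the match is exactly the kind of boundary where the off-by-one bugs like to hide.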
strlcpy
Very similar to memccpy, except that we return the length of the src string
instead of a pointer. This means that we essentially run strlen at the same
time, which leads to some nice multi-tasking. An idea I had was to use a
strlen implementation that is fast for short to medium strings for the case
where the whole source string isn't copied. My reasoning is that it's more
common to hit the limit near the end of a string than near the beginning.
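That contract — copy bounded, but report the full source length either way — can be sketched in scalar C like this; strlcpy_ref is a hypothetical stand-in for illustration, not the SIMD implementation:

```c
#include <stddef.h>

/*
 * Minimal scalar sketch of strlcpy: copy at most dstsize-1 bytes and
 * NUL-terminate (if dstsize > 0), then keep walking src to its end so
 * the full source length is returned even when the copy was cut short.
 * This tail walk is the "strlen at the same time" part.
 */
static size_t
strlcpy_ref(char *restrict dst, const char *restrict src, size_t dstsize)
{
	const char *s = src;

	if (dstsize > 0) {
		while (--dstsize > 0 && (*dst++ = *s) != '\0')
			s++;
		if (dstsize == 0)
			*dst = '\0';
	}
	while (*s != '\0')
		s++;
	return ((size_t)(s - src));
}
```

A return value greater than or equal to dstsize tells the caller the result was truncated.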
strlcpy is presented in this paper [2], and recently the Linux kernel has
been moving away from strlcpy in favor of strscpy. [3]
There surely is a meme out there about the naming of each new variant of the
string functions meant to make them more "safe". :-)
strlcat
strlcat is rather simple to implement using existing string functions. Take
the one from FreeBSD's libc: [4]
size_t
strlcat(char *restrict dst, const char *restrict src, size_t dstsize)
{
	char *loc = __memchr(dst, '\0', dstsize);

	if (loc != NULL) {
		size_t dstlen = (size_t)(loc - dst);

		return (dstlen + __strlcpy(loc, src, dstsize - dstlen));
	} else
		return (dstsize + strlen(src));
}
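To see the two branches in action outside of libc, here is a hypothetical standalone version using plain memchr and a tiny bounded copy in place of the internal __memchr/__strlcpy; the names strlcat_ref and bounded_copy are mine, not from the tree:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for __strlcpy: bounded copy returning strlen(src). */
static size_t
bounded_copy(char *dst, const char *src, size_t dstsize)
{
	const char *s = src;

	if (dstsize > 0) {
		while (--dstsize > 0 && (*dst++ = *s) != '\0')
			s++;
		if (dstsize == 0)
			*dst = '\0';
	}
	while (*s != '\0')
		s++;
	return ((size_t)(s - src));
}

/* Same shape as the FreeBSD strlcat above, with memchr from <string.h>. */
static size_t
strlcat_ref(char *dst, const char *src, size_t dstsize)
{
	char *loc = memchr(dst, '\0', dstsize);

	if (loc != NULL) {
		size_t dstlen = (size_t)(loc - dst);

		return (dstlen + bounded_copy(loc, src, dstsize - dstlen));
	}
	return (dstsize + strlen(src));
}
```

The first branch handles the normal case where dst is properly terminated within dstsize; the second handles an unterminated dst, where nothing is appended and the return value just reports how much space would have been needed.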
Now the only issue is that I haven't written a SIMD memchr, as there is already a well-optimized variant in the arm-optimized-routines repository in /contrib. The first problem is that I should not touch anything in /contrib, as it is just pulled from a git repository. The second is that Arm labels their memchr __memchr_aarch64, and then the Makefile does a little rewrite.
.for FUNC in ${AARCH64_STRING_FUNCS}
.if !exists(${FUNC}.S)
${FUNC}.S:
	printf '/* %sgenerated by libc/aarch64/string/Makefile.inc */\n' @ > ${.TARGET}
	printf '#define __%s_aarch64 %s\n' ${FUNC} ${FUNC} >> ${.TARGET}
	printf '#include "aarch64/%s.S"\n' ${FUNC} >> ${.TARGET}
CLEANFILES+=	${FUNC}.S
.endif
.endfor
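To make that rewrite concrete: for a function in AARCH64_STRING_FUNCS, say strlen, the generated wrapper file would look roughly like this (my reconstruction of the printf output above, not a file from the tree):

```c
/* @generated by libc/aarch64/string/Makefile.inc */
#define __strlen_aarch64 strlen
#include "aarch64/strlen.S"
```

So the define renames the Arm-internal symbol to the public libc name when the /contrib source is assembled.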
The reason why we can't just call memchr is that libc functions must be
designed such that they do not appear to call other libc functions, in case
the user overrides any of them. Calling something __something instead is a
rather simple fix. :-)
So the solution is to create a little wrapper for memchr, which naturally brings us to memcpy!
memcpy
I noticed that the memcpy currently included in libc wasn't SIMD optimized,
despite there being an ASIMD memcpy variant in /contrib.
The reason is that, unlike for all the other string functions, Arm has
provided both a scalar memcpy and a SIMD memcpy. This is probably due to the
string functions being grandfathered in from an older optimized-routines
repository. Anyway, ASIMD is part of the base ISA of AArch64, so as long as
it's not slower there's little incentive not to use it. The fix was simple.
diff --git a/lib/libc/aarch64/string/memcpy.S b/lib/libc/aarch64/string/memcpy.S
--- a/lib/libc/aarch64/string/memcpy.S
+++ b/lib/libc/aarch64/string/memcpy.S
@@ -1,6 +1,6 @@
-#define __memcpy_aarch64 memcpy
-#define __memmove_aarch64 memmove
-#include "aarch64/memcpy.S"
+#define __memcpy_aarch64_simd memcpy
+#define __memmove_aarch64_simd memmove
+#include "aarch64/memcpy-advsimd.S"
Coming weeks
GSoC is slowly nearing its end; there's just over 2 weeks left of the standard (12-week) coding period. These final weeks will consist of me tying up some loose ends, such as the missing bcmp ifdefs for memcmp, which bundles both bcmp and memcmp into the same file, and finishing the implementation of the fancy algorithm for strcspn.
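The bcmp/memcmp bundling idea can be sketched in C — this is a hypothetical sketch of the principle, while the real bundling happens with ifdefs in a single assembly file: bcmp only has to report equal or not equal, so a shared implementation can take a cheaper exit path when built as bcmp.

```c
#include <stddef.h>

/*
 * Hypothetical sketch of why bcmp and memcmp can share one source file:
 * both walk the buffers the same way, but bcmp only needs a
 * zero/nonzero answer, so it can skip computing the ordering of the
 * first mismatching byte.
 */
static int
cmp_common(const void *a, const void *b, size_t n, int bool_only)
{
	const unsigned char *p = a;
	const unsigned char *q = b;

	for (; n > 0; n--, p++, q++) {
		if (*p != *q)
			return (bool_only ? 1 : *p - *q);
	}
	return (0);
}
```

In the assembly version, an ifdef selects between the two return paths at build time instead of a runtime flag.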
Benchmarks
There are quite a few benchmarks, so it's better to visit the differential revision (DR) for each of them instead.
References
[1]: https://reviews.freebsd.org/D46052
[2]: http://www.usenix.org/publications/library/proceedings/usenix99/full_papers/millert/millert.pdf
[3]: https://lwn.net/Articles/905777/
[4]: https://cgit.freebsd.org/src/tree/lib/libc/amd64/string/strlcat.c