    Hello Beastie and Google Summer of Code!
    
    01-06-2024
    
    <!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-refresh-toc -->
    
    Table of Contents
    
      · Hello Beastie and Google Summer of Code! 
        #hello-beastie-and-google-summer-of-code 
          · Background #background 
          · Why write assembly #why-write-assembly 
              · Architectural levels #architectural-levels 
          · amd64 strlen implementation #amd64-strlen-implementation 
              · Substituting missing instructions 
                #substituting-missing-instructions 
          · strlen PoC #strlen-poc 
              · Simple strlen #simple-strlen 
              · Naïve implementation #naïve-implementation 
              · Improved implementation #improved-implementation 
              · FCMP to avoid GPR move #fcmp-to-avoid-gpr-move 
          · libc integration #libc-integration 
          · Tests #tests 
          · Benchmarks #benchmarks 
      · What's next #whats-next 
          · References #references 
      · Bonus: Hello World in Aarch64 Assembly 
        #bonus-hello-world-in-aarch64-assembly 
    
    <!-- markdown-toc end -->
    
    I have been accepted into Google Summer of Code (GSoC) 2024!
    [1 https://summerofcode.withgoogle.com/]
    
    Participating in GSoC has been on my radar for many years, but I
    always thought, oh, I'll do it next summer. Well, that summer is
    now! :D
    
    ## Background
    
    The admission process for GSoC is as follows: organizations
    publish some suggested projects with possible mentors, and
    students either get in touch with said mentors or come up with
    their own project and convince someone to mentor them. Then
    voting takes place and the top N applicants for each organization
    are accepted, where N is the number of slots Google has allocated
    for that organization. So I found a project that I thought
    sounded interesting, got in touch with the mentor, and sent in a
    proposal [2 https://dflund.se/~getz/GSOC/FreeBSDproposal.txt]
    through Google's portal. The project was posted on the FreeBSD
    GSoC ideas page [3 
    https://wiki.freebsd.org/SummerOfCodeIdeas#Port_of_libc_SIMD_enhancements_to_other_architectures].
    The project is to port SIMD enhanced string functions in libc from
    amd64 (x86_64) to arm64 (Aarch64).
    
    A great incentive of the program is that you get the chance to be
    mentored by members of the community. I have the pleasure of
    being mentored by two wonderful people, Robert Clausecker <fuz>
    and Ed Maste <emaste>. Another GSoC contributor is porting the
    same algorithms to RISC-V. He is basing his implementation on the
    base RISC-V ISA using SIMD Within A Register (SWAR) techniques,
    and thus has no dependency on processor extensions such as the
    recently ratified 1.0 RISC-V Vector Extension, for which almost
    no hardware is currently available (and no hardware running
    FreeBSD). His blog documenting his adventures is available at [4 
    https://strajabot.com/].
    
    As for me, I've been using FreeBSD for a few years now. Before
    that I used Linux, but after frustration over the lack of good
    documentation and a fragmented system I switched and haven't
    looked back. I still enjoy the Linux kernel, but userland and
    distros are something I just want out of my way.
    
    But this is not an article about why FreeBSD is superior to
    Linux, if it even is (yes, in some regards, but YMMV), but rather
    about my project for the summer.
    
    ## Why write assembly
    
    Most libc functions on other platforms already benefit from being
    handwritten in assembly, in both scalar and SIMD variants. String
    functions are particularly unfit for an autovectorizing compiler,
    as we make atypical use of SIMD instructions. For the scalar
    implementations we may use some Bit Twiddling Hacks such as those
    on Sean Eron Anderson's site [5 
    https://graphics.stanford.edu/~seander/bithacks.html].
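    
    A classic example from that page is the SWAR trick for detecting
    a zero byte in a word, the kind of thing the scalar
    implementations build on. A minimal sketch in C:
    
    #include <stdint.h>
    
    /* A word has a zero byte iff (v - 0x01...01) & ~v & 0x80...80 is
     * nonzero: the subtraction borrows through a zero byte and sets
     * its high bit, while ~v masks out bytes whose high bit was
     * already set. */
    static int
    has_zero_byte(uint64_t v)
    {
        return ((v - 0x0101010101010101ULL) & ~v &
            0x8080808080808080ULL) != 0;
    }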
    
    Compilers also struggle to decide which operations to do in GPRs
    and which to do in vector registers. Register allocation is also
    a problem on amd64, where the compiler may spill onto the stack
    whereas handwritten assembly would be left with registers to
    spare. This is not as extreme a problem on arm64, as we have way
    more registers than amd64 to play around with.
    
    Another compelling reason to have performance critical libc
    functions written in assembly is that all programs that link
    against libc will benefit from these improvements. Although this
    puts some additional pressure on me as an implementer, as the
    code can't break other people's programs just because they abused
    libc in interesting ways. An example of that is how memcmp on
    FreeBSD differs from the ISO/IEC 9899:1999 requirements. In
    particular, FreeBSD documents that memcmp returns the difference
    between the first two mismatching characters, as opposed to merely
    returning a negative/positive integer or zero.
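    
    For example, a program relying on the documented FreeBSD
    behaviour might look like this (a sketch; ISO C only guarantees
    the sign of the result, not its value):
    
    #include <stdio.h>
    #include <string.h>
    
    int
    main() {
        /* On FreeBSD this prints -2 ('a' - 'c'); ISO C only promises
         * some negative value. */
        printf("%d\n", memcmp("a", "c", 1));
    }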
    
    But my project will solely deal with using Arm NEON instructions
    to SIMD-ify the string functions, although bit-twiddling GPRs
    does sound enticing.
    
    ### Architectural levels
    
    The AMD64 SysV ABI supplement defines the following architecture
    levels. On FreeBSD we have implementations of most of the string
    functions in libc at the scalar and baseline levels. Users are
    able to choose which level of enhancements to use with the
    ARCHLEVEL environment variable. A complete list of enhanced
    functions is available in the simd(7) manpage [6 
    https://man.freebsd.org/cgi/man.cgi?query=simd&manpath=FreeBSD+15.0-CURRENT].
    
    scalar     scalar enhancements only (no SIMD)
    baseline   cmov, cx8, x87 FPU, fxsr, MMX, osfxsr, SSE, SSE2
    x86-64-v2  cx16, lahf/sahf, popcnt, SSE3, SSSE3, SSE4.1, SSE4.2
    x86-64-v3  AVX, AVX2, BMI1, BMI2, F16C, FMA, lzcnt, movbe, osxsave
    x86-64-v4  AVX-512F/BW/CD/DQ/VL
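    
    For example, to force the scalar implementations when testing
    (assuming I'm reading simd(7) right):
    
    ARCHLEVEL=scalar ./a.out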
    
    ## amd64 strlen implementation
    
    The amd64 strlen implementation consists of two parts, a scalar one
    (using bit-twiddling) and a Vectorized one (baseline). I’ll focus
    on the SIMD one here, the interested reader can navigate to
    /usr/src/lib/libc/amd64/string or point their browser to [7 
    https://github.com/freebsd/freebsd-src/blob/main/lib/libc/amd64/string/strlen.S]
    for the scalar implementation.
    
    ARCHENTRY(strlen, baseline)
        mov %rdi, %rcx
        pxor    %xmm1, %xmm1
        and $~0xf, %rdi         # align string
        pcmpeqb (%rdi), %xmm1   # compare head (with junk before string)
        mov %rcx, %rsi          # string pointer copy for later
        and $0xf, %ecx          # amount of bytes rdi is past 16 byte alignment
        pmovmskb %xmm1, %eax
        add $32, %rdi           # advance to next iteration
        shr %cl, %eax           # clear out matches in junk bytes
        test    %eax, %eax      # any match? (can't use ZF from SHR as CL=0 is possible)
        jnz 2f
    
        ALIGN_TEXT
    1:  pxor    %xmm1, %xmm1
        pcmpeqb -16(%rdi), %xmm1    # find NUL bytes
        pmovmskb %xmm1, %eax
        test    %eax, %eax          # were any NUL bytes present?
        jnz 3f
    
        /* the same unrolled once more */
        pxor    %xmm1, %xmm1
        pcmpeqb (%rdi), %xmm1
        pmovmskb %xmm1, %eax
        add $32, %rdi           # advance to next iteration
        test    %eax, %eax
        jz  1b
    
        /* match found in loop body */
        sub $16, %rdi           # undo half the advancement
    3:  tzcnt   %eax, %eax      # find the first NUL byte
        sub %rsi, %rdi          # string length until beginning of (%rdi)
        lea -16(%rdi, %rax, 1), %rax    # that plus loc. of NUL byte: full string length
        ret
    
        /* match found in head */
    2:  tzcnt   %eax, %eax      # compute string length
        ret
    ARCHEND(strlen, baseline)
    
    Most of these instructions aren't anything odd (MOV, XOR, AND,
    SHR), but what stands out is PCMPEQB and PMOVMSKB.
    
    PMOVMSKB [8 https://www.felixcloutier.com/x86/pmovmskb] is one of
    the most useful instructions for finding the index of our NUL
    character. The string functions in libc obviously operate on
    C-style strings, so they are NUL-terminated. So what we do there
    is a compare followed by figuring out where the match was.
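    
    In C, the compare-and-locate idiom looks roughly like this (a
    sketch using SSE2 intrinsics, mirroring what the assembly above
    does for a single 16-byte chunk; first_nul_sse2 is a name of my
    own invention):
    
    #include <emmintrin.h>  /* SSE2 */
    
    /* return the index of the first NUL in a 16-byte aligned chunk,
     * or -1 if the chunk contains none */
    static int
    first_nul_sse2(const char *p)
    {
        __m128i chunk = _mm_load_si128((const __m128i *)p);
        __m128i cmp = _mm_cmpeq_epi8(chunk, _mm_setzero_si128()); /* PCMPEQB */
        int mask = _mm_movemask_epi8(cmp);      /* PMOVMSKB */
        return mask != 0 ? __builtin_ctz(mask) : -1;    /* TZCNT */
    }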
    
    But here's the kicker: there is no PMOVMSKB instruction for
    Aarch64, which has caused a whole lot of headache. I'm basing
    this on the amount of posts online regarding substitutions for
    PMOVMSKB on Aarch64, whereas missing instructions are otherwise
    rarely complained about. [9 
    https://branchfree.org/2019/04/01/fitting-my-head-through-the-arm-holes-or-two-sequences-to-substitute-for-the-missing-pmovmskb-instruction-on-arm-neon/][10
    https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon][11
    https://www.corsix.org/content/whirlwind-tour-aarch64-vector-instructions]
    
    ### Substituting missing instructions
    
    The most promising substitute for PMOVMSKB appears to be SHRN.
    With it we can take the 128-bit comparison result, in which CMEQ
    has left each byte either all 0's or all 1's, shift each 16-bit
    lane right by #imm and truncate it to 8 bits. With a shift of 4,
    each byte of the comparison result becomes a single nibble
    (half-byte) in a 64-bit mask, indicating whether we had a match
    at that position. Hopefully that comes out clearly, otherwise
    there is an excellent video courtesy of [10 
    https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon]
    which shows it in action.
    
    <figure>
        <video src='./shrn_explained.mp4' controls autoplay></video>
        <figcaption><code>SHRN</code> explained, courtesy of <a href="https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon">[10]</a></figcaption>
    </figure>
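    
    The same trick as a minimal sketch in C using NEON intrinsics
    (first_nul_neon is a hypothetical helper of mine, not the libc
    code):
    
    #include <arm_neon.h>
    #include <stdint.h>
    
    /* return the index of the first NUL in a 16-byte aligned chunk,
     * or -1 if the chunk contains none */
    static int
    first_nul_neon(const char *p)
    {
        uint8x16_t chunk = vld1q_u8((const uint8_t *)p);
        uint8x16_t cmp = vceqq_u8(chunk, vdupq_n_u8(0));    /* CMEQ */
        /* shift each 16-bit lane right by 4 and narrow: one nibble
         * per input byte */
        uint8x8_t nibbles = vshrn_n_u16(vreinterpretq_u16_u8(cmp), 4); /* SHRN */
        uint64_t mask = vget_lane_u64(vreinterpret_u64_u8(nibbles), 0); /* FMOV */
        return mask != 0 ? __builtin_ctzll(mask) >> 2 : -1; /* RBIT+CLZ */
    }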
    
    ## strlen PoC
    
    Here is an evolution of my attempts at porting strlen to Aarch64.
    They also come in the form of a git repository [12 
    https://git.sr.ht/~getz/aarch64_string.h/] where I keep my
    experiments before they're ready to be integrated into libc at my
    fork of freebsd-src that I keep on github [13 
    https://github.com/soppelmann/freebsd-src].
    It's also good to know while reading this code that registers
    x9-x15 are "corruptible", meaning that a function can change them
    to whatever without having to restore them afterwards. Registers
    d0-d7 are "parameter and result registers". So according to the
    Arm procedure call standard, x0 etc. is where the input to a
    function is stored, and you can do whatever you want with x9-x15
    without breaking anything.
    
    ### Simple strlen
    
    The simplest variant checks the first chunk without any loop.
    This simple example will only give valid results for short
    strings which are 16 byte aligned, that is, strings whose data
    begins on a memory address that is a multiple of 16. You can
    create such a string to test it out like this:
    
    #include <stdalign.h>
    #include <stdio.h>
    #include <string.h>
    
    extern size_t _strlen(const char * ptr);
    
    int
    main() {
        alignas(16) char string[] = "str";
        printf("strlen: %zu\n", _strlen(string));
    }
    
    Now for the simple strlen implementation. Note that <machine/asm.h>
    has nice little macros ENTRY() and END() which are used throughout.
    
    ENTRY(_strlen)
        BIC x10,x0,#0xf       // align pointer down to 16 bytes
        LDR q0,[x10]          // load input into vector register
        CMEQ v0.16b,v0.16b,#0 // look for 0's
        SHRN v0.8b,v0.8h,#4   // narrow match mask to one nibble per byte
        FMOV x0,d0            // move mask to GPR
        RBIT x0,x0            // reverse bits as there is no ctz instruction
        CLZ x0,x0             // count leading zeros
        LSR x0,x0,#2          // divide by 4 to get the byte index
        RET
    END(_strlen)
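    
    Assuming the C harness above lives in test.c and the assembly in
    _strlen.S (hypothetical file names; the .S file needs to
    #include <machine/asm.h> for the ENTRY()/END() macros), building
    it is just:
    
    cc test.c _strlen.S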
    
    ### Naïve implementation
    
    Now for a simple but naïve solution for creating a loop and
    handling strings which are not already 16 byte aligned. We simply
    calculate the offset to the nearest 16 byte boundary, traverse to
    the boundary with a scalar byte-by-byte loop, and then switch to
    the SIMD variant.
    
    ENTRY(_strlen)
        MOV x10,x0            // cursor
        MOV x11,#0            // bytes consumed so far
        ANDS x9,x0,#0xf       // bytes past 16 byte alignment
        B.EQ .Laligned_loop
    
    .Lunaligned_start:
        LDRB w5,[x10],#1      // load a single byte, advance cursor
        CBZ w5,.Lfound_null
        ADD x11,x11,#1
        TST x10,#0xf          // reached the boundary yet?
        B.NE .Lunaligned_start
        B .Laligned_loop
    
    .Lnext_it:
        ADD x11,x11,#16
        ADD x10,x10,#16
    
    .Laligned_loop:
        LDR q0,[x10]
        CMEQ v0.16b,v0.16b,#0
        SHRN v0.8b,v0.8h,#4
        FMOV x0,d0
        CBZ x0,.Lnext_it
        RBIT x0,x0
        CLZ x0,x0
        LSR x0,x0,#2
        ADD x0,x0,x11         // index in chunk plus bytes before it
        RET
    
    .Lfound_null:
        MOV x0,x11            // NUL found during the scalar traversal
        RET
    END(_strlen)
    
    This doesn't make use of handy Aarch64 addressing modes such as
    the pre-indexed LDR qreg,[xreg,#imm]!, which increments the base
    register by the immediate value before the load.
    
    ### Improved implementation
    
    Now we improve our previous implementation by using SIMD
    instructions for the first chunk as well, aligning down and
    shifting the junk bits before the string out of the mask. We keep
    the GPR move for the first check to improve performance for short
    strings, as a GPR move is required for the result anyway and
    statistically short strings are the most common in real world
    scenarios. This statement is backed by a survey conducted by the
    LLVM project. I have not been able to find a direct link to those
    results, but I will update this post when I find them. See: 
    https://code.ornl.gov/llvm-doe/llvm-project/-/tree/doe/libc/benchmarks/distributions
    
    ENTRY(_strlen)
        BIC x10,x0,#0xf
        LDR q0,[x10]
        CMEQ    v0.16b,v0.16b,#0
        SHRN    v0.8b,v0.8h,#4
        FMOV    x1,d0       // move mask to GPR
        LSL x2,x0,#2        // nibble offset of the string start (mod 64)
        LSR x1,x1,x2        // shift out nibbles before the string
        CBZ x1,.Lloop       // jump if no hit
        RBIT    x1,x1
        CLZ x0,x1
        LSR x0,x0,#2
        RET
    
    .Lloop:
        LDR q0,[x10,#16]!    // increment by 16, then load
        CMEQ v0.16b,v0.16b,#0
        SHRN v0.8b,v0.8h,#4
        FMOV x1,d0           // get mask in case of a hit
        CBZ x1,.Lloop        // x1 is zero if no hit in this chunk
    
    .Ldone:
        SUB x0,x10,x0        // bytes before the current chunk
        RBIT x1,x1
        CLZ x3,x1
        LSR x3,x3,#2         // index of the NUL within the chunk
        ADD x0,x0,x3
        RET
    END(_strlen)
    
    ### FCMP to avoid GPR move
    
    After improving on the naïve implementation I realized that we
    can avoid a move from a SIMD register to a GPR by using FCMP in
    the loop. I also realized that we can avoid a few instructions if
    the input is already 16 byte aligned (see .Laligned), although
    this does introduce a new branch to be resolved. The benchmarks
    below will indicate whether or not this is a worthwhile
    improvement.
    
    ENTRY(_strlen)
        BIC x10,x0,#0xf
        AND x9,x0,#0xf
        LDR q0,[x10]
        CMEQ    v0.16b,v0.16b,#0
        SHRN    v0.8b,v0.8h,#4
        CBZ x9,.Laligned
        FMOV    x1,d0
        LSL x2,x0,#2
        LSR x1,x1,x2
        CBZ x1,.Lloop
        RBIT    x1,x1
        CLZ x0,x1
        LSR x0,x0,#2
        RET
    
    .Laligned:
        FMOV    x1,d0
        CBNZ    x1,.Ldone
    
    .Lloop:
        LDR q0,[x10,#16]!
        CMEQ v0.16b,v0.16b,#0
        SHRN v0.8b,v0.8h,#4
        FCMP d0,#0.0         // mask is zero iff no NUL in this chunk
        B.EQ .Lloop          // loop while no hit
        FMOV x1,d0
    .Ldone:
        SUB x0,x10,x0
        RBIT x1,x1
        CLZ x3,x1
        LSR x3,x3,#2
        ADD x0,x0,x3
        RET
    END(_strlen)
    
    Now we can look at what further improvements can be done. We
    could avoid loop carried dependencies: each iteration currently
    uses a pre-indexed load to advance to the next chunk, so we could
    unroll the loop twice and increment x10 once every two iterations
    to make it easier for the CPU to run two iterations at once.
    Aarch64 also has instructions for loading several SIMD registers
    at once using the LD1, LD2, ... family of instructions. [16 
    https://www.scs.stanford.edu/~zyedidia/arm64/ld2_advsimd_mult.html]
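    
    To make the idea concrete, here is a hypothetical sketch of the
    two-chunks-per-iteration structure in NEON intrinsics (not the
    eventual assembly; strlen_unrolled and its 32-byte alignment
    assumption are mine):
    
    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>
    
    static size_t
    strlen_unrolled(const char *s)
    {
        const uint8_t *p = (const uint8_t *)s; /* assume 32-byte aligned */
        uint8x16_t cmp;
        uint64_t mask;
    
        /* the minimum of all 32 bytes is 0 iff either chunk has a NUL */
        while (vminvq_u8(vminq_u8(vld1q_u8(p), vld1q_u8(p + 16))) != 0)
            p += 32;
    
        /* locate the NUL with the SHRN nibble mask, chunk by chunk */
        cmp = vceqq_u8(vld1q_u8(p), vdupq_n_u8(0));
        mask = vget_lane_u64(vreinterpret_u64_u8(
            vshrn_n_u16(vreinterpretq_u16_u8(cmp), 4)), 0);
        if (mask == 0) {
            p += 16;
            cmp = vceqq_u8(vld1q_u8(p), vdupq_n_u8(0));
            mask = vget_lane_u64(vreinterpret_u64_u8(
                vshrn_n_u16(vreinterpretq_u16_u8(cmp), 4)), 0);
        }
        return (size_t)(p - (const uint8_t *)s) +
            (__builtin_ctzll(mask) >> 2);
    }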
    
    ## libc integration
    
    Getting a string function written in assembly integrated into
    libc on FreeBSD isn't as big of an ordeal as it may sound. We
    simply use the ENTRY() and END() macros in the code and add the
    filenames to the associated Makefile.inc located at 
    lib/libc/aarch64/string/Makefile.inc like the following:
    
    @@ -15,11 +15,12 @@ AARCH64_STRING_FUNCS= \
            strchrnul \
            strcmp \
            strcpy \
    -       memcmp \
            strncmp \
            strnlen \
            strrchr
    
    +MDSRCS+= \
    +       memcmp.S
     #
     # Add the above functions. Generate an asm file that includes the needed
     # Arm Optimized Routines file defining the function name to the libc name.
    
    We can then build libc as a shared library and load it using 
    LD_PRELOAD for running regression tests, as running FreeBSD with
    a broken libc makes FreeBSD very sad and prone to severe errors.
    It's always nice to avoid a broken install while debugging.
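    
    For example (hypothetical paths, use whatever your build
    produced):
    
    LD_PRELOAD=/path/to/objdir/libc.so.7 ./strlen_test  # hypothetical paths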
    
    Building libc is as simple as
    
    cd /usr/src/lib/libnetbsd && make
    cd /usr/src/lib/libc && make
    
    # OR
    
    make -C /usr/src/lib/libc MAKEOBJDIRPREFIX=/tmp/objdir WITHOUT_TESTS=yes
    
    As for debugging it’s as simple as loading up a test binary with 
    lldb and setting a breakpoint for the string function being
    developed, strlen in our case.
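    
    A minimal session looks something like this (the binary name is
    just an example):
    
    $ lldb ./strlen_test
    (lldb) breakpoint set --name _strlen
    (lldb) run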
    
    ## Tests
    
    FreeBSD comes bundled with an excellent Test Suite [14 
    https://wiki.freebsd.org/TestSuite], they are written using a test
    framework Kyua with the ATF library. The FreeBSD wiki page has all
    the information necessary for this and there’s no need for my to
    repeat it here. :-) 
    
    Running the tests is as simple as going to 
    /usr/src/lib/libc/tests/string and running make check. If you
    haven't already run buildworld then you will need to build
    lib/libnetbsd first, as FreeBSD also borrows some tests from
    upstream NetBSD, located at 
    /usr/src/contrib/netbsd-tests/lib/libc/string.
    
    ## Benchmarks
    
    Benchmarks are executed using fuz' strperf program [17 
    https://github.com/clausecker/strperf]; its output is compatible
    with benchstat from devel/go-perf. I benchmark all the
    implementations against the implementations in libc. I also tried
    running the code on a Raspberry Pi 5 running Debian to see how
    strlen holds up against glibc's implementation.
    
    I have benchmarked the previously described implementations,
    looking at the performance impact of substituting a GPR move
    followed by a CBNZ with an FCMP followed by a B.EQ, and the
    impact of branching immediately in the case of an already aligned
    string.
    
    It’s also important to note that these implementations are hardware
    dependent as different cores may utilize more or fewer pipelines
    for specific instructions.
    
    To test against glibc I borrowed a Raspberry Pi 5 running Debian
    and installed bmake to compile strperf (apt install bmake).
    
    Then generating benchmark results is as simple as:
    for i in {1..20}; do ./strlen >> results/${TEST}; done
    
    This produced the following results when evaluated with benchstat.
    You might need to scroll horizontally to view all the results.
    
    os: FreeBSD
    arch: arm64
    cpu: ARM Cortex-A76 r4p1
            │ libc_Scalar  │              libc_ARM               │                 GPR                 │             GPR_aligned             │                FCMP                 │            FCMP_aligned             │
            │    sec/op    │   sec/op     vs base                │   sec/op     vs base                │   sec/op     vs base                │   sec/op     vs base                │   sec/op     vs base                │
    Short      186.9µ ± 1%   134.6µ ± 0%  -28.01% (p=0.000 n=20)   121.0µ ± 0%  -35.26% (p=0.000 n=20)   118.5µ ± 0%  -36.62% (p=0.000 n=20)   121.8µ ± 0%  -34.85% (p=0.000 n=20)   119.8µ ± 0%  -35.91% (p=0.000 n=20)
    Mid        45.05µ ± 1%   37.07µ ± 0%  -17.73% (p=0.000 n=20)   33.36µ ± 0%  -25.96% (p=0.000 n=20)   30.43µ ± 1%  -32.45% (p=0.000 n=20)   33.37µ ± 0%  -25.93% (p=0.000 n=20)   29.98µ ± 1%  -33.45% (p=0.000 n=20)
    Long      13.894µ ± 0%   4.442µ ± 0%  -68.03% (p=0.000 n=20)   6.978µ ± 0%  -49.78% (p=0.000 n=20)   6.977µ ± 0%  -49.79% (p=0.000 n=20)   6.852µ ± 0%  -50.68% (p=0.000 n=20)   5.627µ ± 0%  -59.50% (p=0.000 n=20)
    geomean    48.91µ        28.09µ       -42.58%                  30.43µ       -37.79%                  29.30µ       -40.09%                  30.31µ       -38.03%                  27.24µ       -44.31%
    
            │ libc_Scalar  │                libc_ARM                │                  GPR                  │              GPR_aligned              │                  FCMP                  │              FCMP_aligned              │
            │     B/s      │      B/s       vs base                 │      B/s       vs base                │      B/s       vs base                │      B/s       vs base                 │      B/s       vs base                 │
    Short     637.7Mi ± 1%    885.9Mi ± 0%   +38.91% (p=0.000 n=20)    985.1Mi ± 0%  +54.47% (p=0.000 n=20)   1006.1Mi ± 0%  +57.77% (p=0.000 n=20)    978.9Mi ± 0%   +53.49% (p=0.000 n=20)    995.0Mi ± 0%   +56.02% (p=0.000 n=20)
    Mid       2.584Gi ± 1%    3.141Gi ± 0%   +21.55% (p=0.000 n=20)    3.490Gi ± 0%  +35.07% (p=0.000 n=20)    3.825Gi ± 1%  +48.03% (p=0.000 n=20)    3.488Gi ± 0%   +35.01% (p=0.000 n=20)    3.883Gi ± 1%   +50.26% (p=0.000 n=20)
    Long      8.379Gi ± 0%   26.210Gi ± 0%  +212.81% (p=0.000 n=20)   16.684Gi ± 0%  +99.12% (p=0.000 n=20)   16.686Gi ± 0%  +99.15% (p=0.000 n=20)   16.990Gi ± 0%  +102.77% (p=0.000 n=20)   20.690Gi ± 0%  +146.93% (p=0.000 n=20)
    geomean   2.380Gi         4.145Gi        +74.15%                   3.826Gi       +60.76%                   3.973Gi       +66.92%                   3.841Gi        +61.37%                   4.274Gi        +79.56%
    
    os: Linux
    arch: aarch64
            │ strlen_glibc │
            │    sec/op    │
    Short      132.1µ ± 0%
    Mid        36.29µ ± 1%
    Long       4.365µ ± 4%
    geomean    27.55µ
    
            │ strlen_glibc │
            │     B/s      │
    Short     902.7Mi ± 0%
    Mid       3.208Gi ± 1%
    Long      26.67Gi ± 4%
    geomean   4.225Gi
    
    ## What's next
    
    Despite resulting in worse performance for longer strings, I will
    now continue the porting effort and translate memcmp to Aarch64
    NEON. I will also continue optimizing strlen by unrolling the
    main loop twice as previously mentioned, so instead of the
    current LDR q0,[x10,#16]! I'll do an LDP q1, q2, [x10, #32]!
    
    Hopefully I can get that strlen done by the end of the week and
    submit it for review to the FreeBSD Phabricator instance.
    
    I also attended LundLinuxCon [18 https://lundlinuxcon.org] last
    week and saw a really interesting talk regarding safer flexible
    arrays in the Linux kernel [19 
    https://embeddedor.com/slides/2024/llc/llc2024.pdf]. It involved
    a new warning flag and some niceties that were recently merged
    into GCC 15 [20 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108896]
    and LLVM 18 [21 https://github.com/llvm/llvm-project/pull/76348].
    I will try building FreeBSD with those flags and see how many
    warnings are present, but first I need to read up a little bit on
    whether or not FreeBSD even permits the use of flexible arrays in
    the kernel. :-) 
    
    ## References
    
    [1 https://summerofcode.withgoogle.com/] 
    https://summerofcode.withgoogle.com/
    [2 https://dflund.se/~getz/GSOC/FreeBSDproposal.txt] 
    https://dflund.se/~getz/GSOC/FreeBSDproposal.txt
    [3 
    https://wiki.freebsd.org/SummerOfCodeIdeas#Port_of_libc_SIMD_enhancements_to_other_architectures]
    https://wiki.freebsd.org/SummerOfCodeIdeas#Port_of_libc_SIMD_enhancements_to_other_architectures
    [4 https://strajabot.com/] https://strajabot.com/
    [5 https://graphics.stanford.edu/~seander/bithacks.html] 
    https://graphics.stanford.edu/~seander/bithacks.html
    [6 
    https://man.freebsd.org/cgi/man.cgi?query=simd&manpath=FreeBSD+15.0-CURRENT]
    https://man.freebsd.org/cgi/man.cgi?query=simd&manpath=FreeBSD+15.0-CURRENT
    [7 
    https://github.com/freebsd/freebsd-src/blob/main/lib/libc/amd64/string/strlen.S]
    https://github.com/freebsd/freebsd-src/blob/main/lib/libc/amd64/string/strlen.S
    [8 https://www.felixcloutier.com/x86/pmovmskb] 
    https://www.felixcloutier.com/x86/pmovmskb
    [9 
    https://branchfree.org/2019/04/01/fitting-my-head-through-the-arm-holes-or-two-sequences-to-substitute-for-the-missing-pmovmskb-instruction-on-arm-neon/]
    https://branchfree.org/2019/04/01/fitting-my-head-through-the-arm-holes-or-two-sequences-to-substitute-for-the-missing-pmovmskb-instruction-on-arm-neon/
    [10 
    https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon]
    https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon
    [11 
    https://www.corsix.org/content/whirlwind-tour-aarch64-vector-instructions]
    https://www.corsix.org/content/whirlwind-tour-aarch64-vector-instructions
    [12 https://git.sr.ht/~getz/aarch64_string.h/] 
    https://git.sr.ht/~getz/aarch64_string.h/
    [13 https://github.com/soppelmann/freebsd-src] 
    https://github.com/soppelmann/freebsd-src
    [14 https://wiki.freebsd.org/TestSuite] 
    https://wiki.freebsd.org/TestSuite
    [15 
    https://developer.arm.com/documentation/PJDOC-466751330-593177/latest/]
    https://developer.arm.com/documentation/PJDOC-466751330-593177/latest/
    [16 
    https://www.scs.stanford.edu/~zyedidia/arm64/ld2_advsimd_mult.html]
    https://www.scs.stanford.edu/~zyedidia/arm64/ld2_advsimd_mult.html
    [17 https://github.com/clausecker/strperf] 
    https://github.com/clausecker/strperf
    [18 https://lundlinuxcon.org] https://lundlinuxcon.org
    [19 https://embeddedor.com/slides/2024/llc/llc2024.pdf] 
    https://embeddedor.com/slides/2024/llc/llc2024.pdf
    [20 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108896] 
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108896
    [21 https://github.com/llvm/llvm-project/pull/76348] 
    https://github.com/llvm/llvm-project/pull/76348 
    
    Bonus: Hello World in Aarch64 Assembly
    
    The FreeBSD Developers' Handbook has a section on writing
    assembly, but it's targeted towards x86 and is rather outdated.
    Using its suggestions, Hello World would look like this.
    
        .text
        .global _start
    
    kernel:
        int $0x80
        ret
    
    _start:
        mov     $4, %rax
        mov     $1, %rdi
        mov     $message, %rsi
        mov     $13, %rdx
        call    kernel
        mov     $1, %rax
        mov     $69, %rdi
        syscall
    
    message:
        .ascii "Hello, world\n"
    
    But INT 80h is much slower than syscall, or than simply using
    svc #0 on arm64. This is because a lot of microcode is run when
    INT is executed, whereas the microcode for syscall is much
    simpler. Although INT 80h still works on amd64!
    
    While 0x80 is the i386 syscall interface, it incidentally works
    for amd64 tasks because we don't check whether the process doing
    a syscall is a 32 bit or 64 bit one. But arguments are truncated
    to 32 bits, so you can't e.g. pass pointers to the stack! And
    arm64 of course doesn't have the INT instruction.
    
    In general, the method of doing syscalls is different on each
    architecture. FreeBSD is moving towards what Win32 and Solaris
    already pioneered: syscalls should be done by calling library
    functions so the kernel ABI and API can be adapted in the future.
    For this reason syscalls will be split into a new library,
    libsys, in FreeBSD 15.
    
    If you’re wondering how to figure out what number each syscall
    corresponds to then you can check 
    https://cgit.freebsd.org/src/tree/sys/kern/syscalls.master
    
    /*
    Compile with the following for non Aarch64 host:
    aarch64-unknown-freebsd14.0-gcc13 --sysroot /usr/local/freebsd-sysroot/aarch64 hello.S -nostdlib
    
    Run with:
    qemu-aarch64-static ./a.out
    */
    
    .text
    
    /* Our application's entry point. */
    .global _start
    
    _start:
        /* syscall write(int fd, const void *buf, size_t count) */
        mov     x0, #1      /* fd := STDOUT_FILENO */
        ldr     x1, =msg    /* buf := msg */
        ldr     x2, =len    /* count := len */
        mov     w8, #4     /* write is syscall #4 */
        svc     #0          /* invoke syscall */
    
        /* syscall exit(int status) */
        mov     x0, #69      /* status := 69 */
        mov     w8, #1     /* exit is syscall #1 */
        svc     #0          /* invoke syscall */
    
    /* Data segment: define our message string and calculate its length. */
    .data
    msg:
        .ascii "Hello, world\n"
    len = . - msg