GSoC status update #4 (A bit of everything)
14-07-2024
What’s been going on this week
This week I put the finishing touches on strncmp. It’s essentially the same as strcmp, except that it has special handling for the limit and leaves the main loop when there are fewer than 32 bytes left of the limit.
I also got some good code review on str(n)cmp and applied it, which resulted in a performance improvement of a few percent. [1]
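To make the limit handling concrete, here is a minimal scalar sketch of the loop structure described above. It is not the assembly routine: the name sketch_strncmp is hypothetical, and the byte-wise inner loop stands in for the 16-byte vector comparisons. It only shows the idea that the main loop runs while a full 32-byte block fits inside the limit, and a tail handles the rest.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Scalar model of the loop structure only. sketch_strncmp is a
 * hypothetical name; the real routine compares 16-byte vectors.
 */
int
sketch_strncmp(const char *a, const char *b, size_t n)
{
	const unsigned char *p = (const unsigned char *)a;
	const unsigned char *q = (const unsigned char *)b;
	size_t i;

	assert(a != NULL && b != NULL);

	/* Main loop: a full 32-byte block is known to fit inside the limit. */
	while (n >= 32) {
		for (i = 0; i < 32; i++) {
			if (p[i] != q[i] || p[i] == '\0')
				return (p[i] - q[i]);
		}
		p += 32;
		q += 32;
		n -= 32;
	}

	/* Tail: fewer than 32 bytes of the limit remain, check it per byte. */
	for (i = 0; i < n; i++) {
		if (p[i] != q[i] || p[i] == '\0')
			return (p[i] - q[i]);
	}
	return (0);
}
```

With a limit of 35, two 40-byte strings that first differ at index 35 compare equal; raising the limit to 36 exposes the difference in the tail.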
strncmp pains
I thought I could get away with a clever solution for handling short strings (less than 16 bytes) near a page boundary in strncmp, but I was mistaken.
After updating the test suite to place buffers right at the end of a page, with DEATH waiting across the boundary, I noticed that it failed.
So I had to revert to the more complicated handling and make it even more complicated: simply checking for null bytes isn’t enough, I also need to insert a fake null byte wherever the limit is. I solved the first part like this, and the latter part is a problem for tomorrow morning. :-)
@@ -103,24 +103,56 @@ ENTRY(strncmp)
.p2align 4
.Llt16:
- tbz w3, #PAGE_SHIFT, 0f
+ /*
+ * Check if either string is located at end of page to avoid crossing
+ * into unmapped page. If so, we load 16 bytes from the nearest
+ * alignment boundary and shift based on the offset.
+ */
+ tbz w3, #PAGE_SHIFT, 2f
ldr q0, [x8] // load aligned head
ldr q1, [x10]
+ lsl x14, x9, #2
+ lsl x15, x11, #2
+ lsl x3, x13, x14 // string head
+ lsl x4, x13, x15
+
+ cmeq v5.16b, v0.16b, #0
+ cmeq v6.16b, v1.16b, #0
+
+ shrn v5.8b, v5.8h, #4
+ shrn v6.8b, v6.8h, #4
+ fmov x5, d5
+ fmov x6, d6
+
adrp x14, shift_data
add x14, x14, :lo12:shift_data
/* heads may cross page boundary, avoid unmapped loads */
+ tst x5, x3
+ b.eq 0f
+
ldr q4, [x14, x9] // load permutation table
tbl v0.16b, {v0.16b}, v4.16b
+
+ b 1f
+ .p2align 4
+0:
+ ldr q0, [x0] // load true head
+1:
+ tst x6, x4
+ b.eq 0f
+
ldr q4, [x14, x11]
tbl v4.16b, {v1.16b}, v4.16b
+
b 1f
.p2align 4
-0:
+2:
ldr q0, [x0] // load true heads
+0:
ldr q4, [x1]
1:
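The two ideas in play here can be modelled in plain C. This is a sketch under assumptions, not the code from the diff: the function names and SKETCH_PAGE_SHIFT (4 KiB pages) are hypothetical, and the limit mask is simplified to mark every position at or beyond the limit. The nibble mask mirrors what cmeq followed by shrn #4 yields (four bits per byte), and the final index mirrors rbit, clz and a shift right by 2.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/*
 * Nonzero if addr + 16 lies in a different page than addr. This is
 * conservative: a load that ends exactly at the page end is flagged too.
 */
int
crosses_page(uintptr_t addr)
{
	return ((int)(((addr ^ (addr + 16)) >> SKETCH_PAGE_SHIFT) & 1));
}

/* Four mask bits per byte, set where the byte is NUL. */
uint64_t
nul_nibble_mask(const unsigned char chunk[16])
{
	uint64_t mask = 0;
	int i;

	for (i = 0; i < 16; i++) {
		if (chunk[i] == 0)
			mask |= (uint64_t)0xf << (4 * i);
	}
	return (mask);
}

/* "Fake NUL" mask: every byte position at or beyond the limit is set. */
uint64_t
limit_nibble_mask(unsigned limit)
{
	assert(limit <= 16);
	return (limit == 16 ? 0 : ~(uint64_t)0 << (4 * limit));
}

/* Byte index of the lowest set nibble, or 16 if the mask is empty. */
unsigned
first_stop(uint64_t mask)
{
	unsigned bit = 0;

	if (mask == 0)
		return (16);
	while ((mask & 1) == 0) {
		mask >>= 1;
		bit++;
	}
	return (bit >> 2);
}
```

OR-ing the two masks makes the limit behave like a terminator: for a chunk with no NUL byte and a limit of 5, first_stop(nul_nibble_mask(chunk) | limit_nibble_mask(5)) is 5.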
Next function, strcspn
It’s also time to port a new function. I was choosing between memccpy and strcspn, and I chose to start exploring strcspn. It requires some really interesting tricks to make it fast. On x86, when SSE4.2 is available, we can use the amazing pcmpistri instruction to compare a vector register with a set without overreading past a null byte. [2]
On AArch64 there is no equivalent. SVE2 has the MATCH instruction, but that is of no use to us: the Graviton 3 CPU that I’ve been benchmarking on has SVE support but not SVE2. FreeBSD is also just about to get SVE support; it’s about to land in -CURRENT any second now, IIRC.
strcspn tomfoolery
Checking with SIMD whether a byte is present in a set is not a completely new problem. Although I haven’t seen any such algorithm implemented for AArch64, there is one for x86 developed by Wojciech Muła and Geoff Langdale. [3]
I won’t go into detail just yet; it’s quite a lot to wrap my head around, but the linked article is a great resource. A version of this algorithm is used by the Intel Hyperscan project. [4][5]
An interesting suggestion fuz had was to duplicate a byte from the set and iterate over the string with it during the setup phase. If an early set member matches, this approach would be greatly beneficial.
This is because the LUT for the algorithm can be built using scalar registers while the check with ld1r and cmeq is done in vector registers, fully utilizing the processor’s different pipelines.
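For orientation, the lookup-table idea can be shown with the classic scalar approach. This is a minimal sketch and not the SIMD algorithm from [3]: sketch_strcspn is a hypothetical name, and the table is a plain 256-bit bitmap held in four 64-bit words, which is one way a LUT can be built with scalar registers only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar bitmap strcspn; sketch_strcspn is a hypothetical name. */
size_t
sketch_strcspn(const char *s, const char *set)
{
	uint64_t lut[4] = { 0, 0, 0, 0 };	/* one bit per byte value */
	const unsigned char *p;
	unsigned char c;

	assert(s != NULL && set != NULL);

	/* Setup phase: mark every set member; NUL always ends the scan. */
	lut[0] |= 1;
	for (p = (const unsigned char *)set; (c = *p) != '\0'; p++)
		lut[c >> 6] |= (uint64_t)1 << (c & 63);

	/* Scan phase: stop at the first byte whose bit is set. */
	for (p = (const unsigned char *)s; ; p++) {
		c = *p;
		if ((lut[c >> 6] >> (c & 63)) & 1)
			break;
	}
	return ((size_t)(p - (const unsigned char *)s));
}
```

For example, sketch_strcspn("hello, world", ", ") returns 5, the length of the prefix that contains no set member. A vectorized version would test 16 bytes per step instead of one.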
Another useful article is [6] by Harold Aptroot, but I will need to figure out how to work around the lack of the extremely versatile but tricky gf2p8affineqb (Galois Field Affine Transformation) instruction on AArch64.
A list of unexpected uses for the Galois Field Affine Transformation instruction is available here. [7]
Performance analysis
My mentor also introduced me to some really cool projects that simulate different computer architectures and predict the throughput of basic blocks. The first is uiCA (uops.info Code Analyzer) [8], developed by researchers at Saarland University. The second is OSACA (Open Source Architecture Code Analyzer) [9], which supports both x86 and AArch64. It’s developed by RRZE-HPC at the Erlangen National High Performance Computing Center. OSACA also has integration with Compiler Explorer.
These tools are a clear step up from SimpleScalar, which I used in university for our computer architecture course. [10] SimpleScalar is legacy research software. I suspect it might soon be swapped out of our curriculum; that course has been getting serious upgrades over the last few years.
For my strncmp implementation OSACA produces the following:
OSACA results
Open Source Architecture Code Analyzer (OSACA) - 0.5.2
Analyzed file: /app/example.asm
Architecture: A64FX
Timestamp: 2024-07-14 17:00:03
-------------------------- WARNING: No micro-architecture was specified -------------------------
A default uarch for this particular ISA was used. Specify the uarch with --arch.
See --help for more information.
-------------------------------------------------------------------------------------------------
----------------- WARNING: You are analyzing a large amount of instruction forms ----------------
Analysis across loops/block boundaries often do not make much sense.
Specify the kernel length with --length. See --help for more information.
If this is intentional, you can safely ignore this message.
-------------------------------------------------------------------------------------------------
P - Throughput of LOAD operation can be hidden behind a past or future STORE instruction
* - Instruction micro-ops not bound to a port
X - No throughput/latency information for this instruction in data file
Combined Analysis Report
------------------------
Port pressure in cycles
| 0 - 0DV | 1 | 2 | 3 | 4 | 5 - 5D | 6 - 6D | 7 || CP | LCD |
---------------------------------------------------------------------------------------------------
1 | | | | | | | | || | | strncmp:
2 | | | | | | | | || 0.0 | 0.0 | X bic x8, x0, #0xf // x8 is x0 but aligned to the boundary
3 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | and x9, x0, #0xf // x9 is the offset
4 | | | | | | | | || | | X bic x10, x1, #0xf // x10 is x1 but aligned to the boundary
5 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | and x11, x1, #0xf // x11 is the offset
7 | | | | 0.50 | 0.500 | | | || | | subs x2, x2, #1
8 | | | | | | | | 1.00 || | | b.mi .Lempty
10 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | mov x13, #-1 // save constant for later
12 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x3, x0, #16 // end of head
13 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x4, x1, #16
14 | | | | | | | | || | | X eor x3, x3, x0
15 | | | | | | | | || | | X eor x4, x4, x1 // bits that changed
16 | | | | 0.50 | 0.500 | 0.00 | 0.00 | || | | orr x3, x3, x4 // in either str1 or str2
17 | | | | 0.50 | 0.500 | | | || | | cmp x2,#16
18 | | | | | | | | 1.00 || | | b.lt .Llt16
19 | | | | | | | | || | | X tbz w3, #4096, .Lbegin
21 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x8] // load aligned head
22 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x10]
24 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | lsl x14, x9, #2
25 | | | | | | | | || | | X lsl x3, x13, x14 // string head
26 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | lsl x15, x11, #2
27 | | | | | | | | || | | X lsl x4, x13, x15
29 | | | | | | | | || | | X cmeq v5.16b, v0.16b, #0
30 | | | | | | | | || | | X cmeq v6.16b, v1.16b, #0
32 | | | | | | | | || | | X shrn v5.8b, v5.8h, #4
33 | | | | | | | | || | | X shrn v6.8b, v6.8h, #4
34 | | | | | | | | || | | X fmov x5, d5
35 | | | | | | | | || | | X fmov x6, d6
37 | | | | | | | | || | | X adrp x14, shift_data
38 | | | | | | | | || | | X add x14, x14, :lo12:shift_data
40 | | | | | | | | || | | X tst x5, x3
41 | | | | | | | | 1.00 || | | b.eq zero
43 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q4, [x14, x9] // load permutation table
44 | | | | | | | | || | | X tbl v0.16b, {v0.16b}, v4.16b
46 | | | | | | | | 1.00 || | | b one
47 | | | | | | | | || | | .p2align 4
48 | | | | | | | | || | | zero:
49 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x0] // load true head
50 | | | | | | | | || | | one:
51 | | | | | | | | || | | X tst x6, x4
52 | | | | | | | | 1.00 || | | b.eq zeroo
54 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q4, [x14, x11]
55 | | | | | | | | || | | X tbl v4.16b, {v1.16b}, v4.16b
57 | | | | | | | | 1.00 || | | b onee
59 | | | | | | | | || | | .p2align 4
60 | | | | | | | | || | | .Lbegin:
61 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x0] // load true heads
62 | | | | | | | | || | | zeroo:
63 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q4, [x1]
64 | | | | | | | | || | | onee:
65 | | | | | | | | || | | X cmeq v2.16b, v0.16b, #0 // NUL byte present?
66 | | | | | | | | || | | X cmeq v4.16b, v0.16b, v4.16b // which bytes match?
68 | | | | | | | | || | | X orn v2.16b, v2.16b, v4.16b // mismatch or NUL byte?
70 | | | | | | | | || | | X shrn v2.8b, v2.8h, #4
71 | | | | | | | | || | | X fmov x5, d2
73 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lhead_mismatch
74 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q2, [x8, #16] // load second chunk
75 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q3, [x10, #16]
77 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x2, x2, x11
78 | | | | 0.50 | 0.500 | | | || | | sub x2, x2, #16 // account for length of RSI chunk?
80 | | | | | | | | || | | X subs x9, x9, x11
81 | | | | | | | | 1.00 || | | b.lt .Lswapped // if not swap operands
82 | | | | | | | | 1.00 || | | b .Lnormal
84 | | | | | | | | || | | .p2align 4
85 | | | | | | | | || | | .Llt16:
87 | | | | | | | | || | | X tbz w3, #4096, two
89 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x8] // load aligned head
90 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x10]
92 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | lsl x14, x9, #2
93 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | lsl x15, x11, #2
94 | | | | | | | | || | | X lsl x3, x13, x14 // string head
95 | | | | | | | | || | | X lsl x4, x13, x15
97 | | | | | | | | || | | X cmeq v5.16b, v0.16b, #0
98 | | | | | | | | || | | X cmeq v6.16b, v1.16b, #0
100 | | | | | | | | || | | X shrn v5.8b, v5.8h, #4
101 | | | | | | | | || | | X shrn v6.8b, v6.8h, #4
102 | | | | | | | | || | | X fmov x5, d5
103 | | | | | | | | || | | X fmov x6, d6
105 | | | | | | | | || | | X adrp x14, shift_data
106 | | | | | | | | || | | X add x14, x14, :lo12:shift_data
108 | | | | | | | | || | | X tst x5, x3
109 | | | | | | | | 1.00 || | | b.eq zerooo
111 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q4, [x14, x9] // load permutation table
112 | | | | | | | | || | | X tbl v0.16b, {v0.16b}, v4.16b
114 | | | | | | | | 1.00 || | | b oneee
115 | | | | | | | | || | | .p2align 4
116 | | | | | | | | || | | zerooo:
117 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x0] // load true head
118 | | | | | | | | || | | oneee:
119 | | | | | | | | || | | X tst x6, x4
120 | | | | | | | | 1.00 || | | b.eq noll
122 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q4, [x14, x11]
123 | | | | | | | | || | | X tbl v4.16b, {v1.16b}, v4.16b
125 | | | | | | | | 1.00 || | | b ett
127 | | | | | | | | || | | .p2align 4
128 | | | | | | | | || | | two:
129 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x0] // load true heads
130 | | | | | | | | || | | noll:
131 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q4, [x1]
132 | | | | | | | | || | | ett:
134 | | | | | | | | || | | X cmeq v2.16b, v0.16b, #0 // NUL byte present?
135 | | | | | | | | || | | X cmeq v4.16b, v0.16b, v4.16b // which bytes match?
137 | | | | | | | | || | | X bic v2.16b, v4.16b, v2.16b // match and not NUL byte
139 | | | | | | | | || | | X shrn v2.8b, v2.8h, #4
140 | | | | | | | | || | | X fmov x5, d2
141 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | lsl x4, x2, #2
142 | | | | | | | | || | | X lsl x4, x13, x4
143 | | | | | | | | || | | X orn x5, x4, x5 // mismatch or NUL byte?
145 | | | | | | | | || | | .Lhead_mismatch:
146 | | | | | | | | || | | X rbit x3, x5
147 | | | | | | | | || | | X clz x3, x3 // index of mismatch
148 | | | | | | | | || | | X lsr x3, x3, #2
149 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldrb w4, [x0, x3]
150 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldrb w5, [x1, x3]
151 | | | | 0.50 | 0.500 | | | || | | sub w0, w4, w5
152 | | | | | | 0.50 | 0.50 | || | | ret
154 | | | | | | | | || | | .p2align 4
155 | | | | | | | | || | | .Lnormal:
156 | | | | 0.50 | 0.500 | | | || | | sub x12, x10, x9
157 | | | | 0.75 | 0.750 | 0.25 0.50 | 0.25 0.50 | || | | ldr q0, [x12, #16]!
158 | | | | 0.50 | 0.500 | | | || | | sub x10, x10, x8
159 | | | | 0.50 | 0.500 | | | || | | sub x11, x10, x9
161 | | | | | | | | || | | X cmeq v1.16b, v3.16b, #0 // NUL present?
162 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v2.16b // Mismatch between chunks?
163 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4
164 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4
165 | | | | | | | | || | | X fmov x6, d1
166 | | | | | | | | || | | X fmov x5, d0
168 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | 1.0 | add x8, x8, #32 // advance to next iteration
170 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | lsl x4, x2, #2
171 | | | | | | | | || | | X lsl x4, x13, x4
172 | | | | 0.50 | 0.500 | 0.00 | 0.00 | || | | orr x3, x6, x4 // introduce a null byte match
173 | | | | 0.50 | 0.500 | | | || | | cmp x2, #16 // does the buffer end within x2
174 | | | | 0.50 | 0.500 | | | || | | csel x6, x3, x6, lt
175 | | | | 0.50 | 0.500 | | | || | | cbnz x6, .Lnulfound2 // NUL or end of buffer found?
176 | | | | | | | | || | | X mvn x5, x5
177 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lmismatch2
178 | | | | 0.50 | 0.500 | | | || | | sub x2, x2, #16
179 | | | | 0.50 | 0.500 | | | || | | cmp x2, #32 // end of buffer within first main loop iteration?
180 | | | | | | | | 1.00 || | | b.lt .Ltail
182 | | | | | | | | || | | .p2align 4
183 | | | | | | | | || | | nada:
184 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x8, x11]
185 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x8, x10]
186 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q2, [x8]
188 | | | | | | | | || | | X cmeq v1.16b, v1.16b, #0 // end of string?
189 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v2.16b // do the chunks match?
191 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4
192 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4
193 | | | | | | | | || | | X fmov x6, d1
194 | | | | | | | | || | | X fmov x5, d0
195 | | | | 0.50 | 0.500 | | | || | | cbnz x6, .Lnulfound
196 | | | | | | | | || | | X mvn x5, x5 // any mismatches?
197 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lmismatch
199 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | 1.0 | add x8, x8, #16
202 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x8, x11]
203 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x8, x10]
204 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q2, [x8]
206 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | 1.0 | add x8, x8, #16
207 | | | | | | | | || | | X cmeq v1.16b, v1.16b, #0
208 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v2.16b
210 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4
211 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4
212 | | | | | | | | || | | X fmov x6, d1
213 | | | | | | | | || | | X fmov x5, d0
214 | | | | 0.50 | 0.500 | | | || | | cbnz x6, .Lnulfound2
215 | | | | | | | | || | | X mvn x5, x5
216 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lmismatch2
217 | | | | 0.50 | 0.500 | | | || | | sub x2, x2, #32
218 | | | | 0.50 | 0.500 | | | || | | cmp x2, #32 // end of buffer within next iteration
219 | | | | | | | | 1.00 || | | b.ge nada // if yes, process tail
222 | | | | | | | | || | | .Ltail:
223 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x8, x11]
224 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x8, x10]
225 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q2, [x8]
227 | | | | | | | | || | | X cmeq v1.16b, v1.16b, #0 // end of string?
228 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v2.16b // do the chunks match?
230 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4
231 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4
232 | | | | | | | | || | | X fmov x6, d1
233 | | | | | | | | || | | X fmov x5, d0
237 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | lsl x4, x2, #2
238 | | | | | | | | || | | X lsl x4, x13, x4
239 | | | | 0.50 | 0.500 | 0.00 | 0.00 | || | | orr x3, x6, x4 // introduce a null byte match
240 | | | | 0.50 | 0.500 | | | || | | cmp x2, #16 // does the buffer end within x2
241 | | | | 0.50 | 0.500 | | | || | | csel x6, x3, x6, lt
243 | | | | 0.50 | 0.500 | | | || | | cbnz x6, .Lnulfound // NUL or end of string found
244 | | | | | | | | || | | X mvn x5, x5
245 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lmismatch
247 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | 1.0 | add x8, x8, #16
249 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x8, x11]
250 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x8, x10]
251 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q2, [x8]
253 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | 1.0 | add x8, x8, #16
254 | | | | | | | | || | | X cmeq v1.16b, v1.16b, #0
255 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v2.16b
257 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4
258 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4
259 | | | | | | | | || | | X fmov x6, d1
260 | | | | | | | | || | | X fmov x5, d0
262 | | | | | | | | || | | X ubfiz x4, x2, #2, #4 // (x2 - 16) << 2
263 | | | | | | | | || | | X lsl x4, x13, x4 // take first half into account
264 | | | | 0.50 | 0.500 | 0.00 | 0.00 | || | | orr x6, x6, x4 // introduce a null byte match
266 | | | | | | | | || | | .Lnulfound2:
267 | | | | 0.50 | 0.500 | | | || 1.0 | 1.0 | sub x8, x8, #16
269 | | | | | | | | || | | .Lnulfound:
270 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | mov x4, x6
272 | | | | | | | | || | | X ubfiz x7, x9, #2, #4
273 | | | | | | | | || | | X lsl x6, x6, x7 // adjust NUL mask to indices
275 | | | | | | | | || | | X orn x5, x6, x5
276 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lmismatch
278 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x8, x9]
279 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x8, x10]
281 | | | | | | | | || | | X cmeq v1.16b, v0.16b, v1.16b
282 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4
283 | | | | | | | | || | | X fmov x5, d1
285 | | | | | | | | || | | X orn x5, x4, x5
287 | | | | | | | | || | | X rbit x3, x5
288 | | | | | | | | || | | X clz x3, x3
289 | | | | | | | | || | | X lsr x5, x3, #2
291 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | 1.0 | add x10, x10, x8 // restore x10 pointer
292 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x8, x8, x9 // point to corresponding chunk in x0
294 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldrb w4, [x8, x5]
295 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldrb w5, [x10, x5]
296 | | | | 0.50 | 0.500 | | | || | | sub w0, w4, w5
297 | | | | | | 0.50 | 0.50 | || | | ret
299 | | | | | | | | || | | .p2align 4
300 | | | | | | | | || | | .Lmismatch2:
301 | | | | 0.50 | 0.500 | | | || | | sub x8, x8, #16 // roll back second increment
302 | | | | | | | | || | | .Lmismatch:
303 | | | | | | | | || | | X rbit x3, x5
304 | | | | | | | | || | | X clz x3, x3 // index of mismatch
305 | | | | | | | | || | | X lsr x3, x3, #2
306 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x11, x8, x11
308 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldrb w4, [x8, x3]
309 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldrb w5, [x11, x3]
310 | | | | 0.50 | 0.500 | | | || | | sub w0, w4, w5 // difference of the mismatching chars
311 | | | | | | 0.50 | 0.50 | || | | ret
313 | | | | | | | | || | | .p2align 4
314 | | | | | | | | || | | .Lswapped:
315 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x12, x8, x9
316 | | | | 0.75 | 0.750 | 0.25 0.50 | 0.25 0.50 | || | | ldr q0, [x12, #16]!
317 | | | | 0.50 | 0.500 | | | || | | sub x8, x8, x10
318 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x11, x8, x9
319 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x2,x2,x9
320 | | | | | | | | || | | X neg x9, x9
322 | | | | | | | | || | | X cmeq v1.16b, v2.16b, #0
323 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v3.16b
324 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4
325 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4
326 | | | | | | | | || | | X fmov x6, d1
327 | | | | | | | | || | | X fmov x5, d0
329 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | 1.0 | add x10, x10, #32
331 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | lsl x4, x2, #2
332 | | | | | | | | || | | X lsl x4, x13, x4
333 | | | | 0.38 | 0.370 | 0.13 | 0.12 | || | | orr x3,x6,x4 // introduce a null byte match
334 | | | | 0.50 | 0.500 | | | || | | cmp x2,#16
335 | | | | 0.50 | 0.500 | | | || | | csel x6, x3, x6, lt
336 | | | | 0.50 | 0.500 | | | || | | cbnz x6, .Lnulfound2s
337 | | | | | | | | || | | X mvn x5, x5
338 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lmismatch2s
340 | | | | 0.50 | 0.500 | | | || | | sub x2, x2, #16
341 | | | | 0.50 | 0.500 | | | || | | cmp x2, #32
342 | | | | | | | | 1.00 || | | b.lt .Ltails
344 | | | | | | | | || | | .p2align 4
345 | | | | | | | | || | | nein:
346 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x10, x11]
347 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x10, x8]
348 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q2, [x10]
350 | | | | | | | | || | | X cmeq v1.16b, v1.16b, #0
351 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v2.16b
353 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4
354 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4
355 | | | | | | | | || | | X fmov x6, d1
356 | | | | | | | | || | | X fmov x5, d0
357 | | | | 0.50 | 0.500 | | | || | | cbnz x6, .Lnulfounds
358 | | | | | | | | || | | X mvn x5, x5
359 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lmismatchs
361 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | 1.0 | add x10, x10, #16
364 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x10, x11]
365 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x10, x8]
366 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q2, [x10]
368 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | 1.0 | add x10, x10, #16
369 | | | | | | | | || | | X cmeq v1.16b, v1.16b, #0
370 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v2.16b
372 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4
373 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4
374 | | | | | | | | || | | X fmov x6, d1
375 | | | | | | | | || | | X fmov x5, d0
376 | | | | 0.50 | 0.500 | | | || | | cbnz x6, .Lnulfound2s
377 | | | | | | | | || | | X mvn x5, x5
378 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lmismatch2s
379 | | | | 0.50 | 0.500 | | | || | | sub x2, x2, #32
380 | | | | 0.25 | 0.750 | | | || | | cmp x2, #32
381 | | | | | | | | 1.00 || | | b.ge nein
383 | | | | | | | | || | | .Ltails:
384 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x10, x11]
385 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x10, x8]
386 | | | | | | 0.51 0.50 | 0.49 0.50 | || | | ldr q2, [x10]
388 | | | | | | | | || | | X cmeq v1.16b, v1.16b, #0
389 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v2.16b
391 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4
392 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4
393 | | | | | | | | || | | X fmov x6, d1
394 | | | | | | | | || | | X fmov x5, d0
397 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | lsl x4, x2, #2
398 | | | | | | | | || | | X lsl x4, x13, x4
399 | | | | 0.50 | 0.000 | 0.24 | 0.26 | || | | orr x3, x6, x4 // introduce a null byte match
400 | | | | 0.50 | 0.500 | | | || | | cmp x2, #16
401 | | | | 0.50 | 0.500 | | | || | | csel x6, x3, x6, lt
403 | | | | 0.50 | 0.500 | | | || | | cbnz x6, .Lnulfounds
404 | | | | | | | | || | | X mvn x5, x5
405 | | | | 0.74 | 0.260 | | | || | | cbnz x5, .Lmismatchs
407 | 0.50 | | 0.50 | 0.00 | -0.01 | | | || 1.0 | 1.0 | add x10, x10, #16
409 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x10, x11]
410 | | | | | | 0.50 0.50 | 0.50 0.50 | || 5.0 | | ldr q1, [x10, x8]
411 | | | | | | 0.51 0.50 | 0.49 0.50 | || | | ldr q2, [x10]
413 | 0.50 | | 0.50 | 0.00 | -0.01 | | | || | 1.0 | add x10, x10, #16
414 | | | | | | | | || 0.0 | | X cmeq v1.16b, v1.16b, #0
415 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v2.16b
417 | | | | | | | | || 0.0 | | X shrn v1.8b, v1.8h, #4
418 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4
419 | | | | | | | | || 0.0 | | X fmov x6, d1
420 | | | | | | | | || | | X fmov x5, d0
422 | | | | | | | | || | | X ubfiz x4, x2, #2, #4
423 | | | | | | | | || | | X lsl x4, x13, x4
424 | | | | 0.00 | 0.510 | 0.23 | 0.26 | || 1.0 | | orr x6, x6, x4 // introduce a null byte match
426 | | | | | | | | || | | .Lnulfound2s:
427 | | | | 0.50 | 0.500 | | | || | 1.0 | sub x10, x10, #16
428 | | | | | | | | || | | .Lnulfounds:
429 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | | mov x4, x6
431 | | | | | | | | || | | X ubfiz x7, x9, #2, #4
432 | | | | | | | | || | | X lsl x6, x6, x7
434 | | | | | | | | || | | X orn x5, x6, x5
436 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lmismatchs
438 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x10, x9]
439 | | | | | | 0.50 0.50 | 0.50 0.50 | || | 5.0 | ldr q1, [x10, x8]
441 | | | | | | | | || | 0.0 | X cmeq v1.16b, v0.16b, v1.16b
442 | | | | | | | | || | 0.0 | X shrn v1.8b, v1.8h, #4
443 | | | | | | | | || | 0.0 | X fmov x5, d1
445 | | | | | | | | || 0.0 | 0.0 | X orn x5, x4, x5
447 | | | | | | | | || 0.0 | 0.0 | X rbit x3, x5
448 | | | | | | | | || 0.0 | 0.0 | X clz x3, x3
449 | | | | | | | | || 0.0 | 0.0 | X lsr x5, x3, #2
451 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x11, x10, x8
452 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x10, x10, x9
454 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldrb w4, [x10, x5]
455 | | | | | | 0.50 0.50 | 0.50 0.50 | || 5.0 | 5.0 | ldrb w5, [x11, x5]
456 | | | | 0.50 | 0.500 | | | || | | sub w0, w5, w4
457 | | | | | | 0.50 | 0.50 | || | | ret
459 | | | | | | | | || | | .p2align 4
460 | | | | | | | | || | | .Lmismatch2s:
461 | | | | 0.50 | 0.500 | | | || | | sub x10, x10, #16
462 | | | | | | | | || | | .Lmismatchs:
463 | | | | | | | | || 0.0 | 0.0 | X rbit x3, x5
464 | | | | | | | | || 0.0 | 0.0 | X clz x3, x3
465 | | | | | | | | || 0.0 | 0.0 | X lsr x3, x3, #2
466 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x11, x10, x11
468 | | | | | | 0.50 0.50 | 0.50 0.50 | || 5.0 | | ldrb w4, [x10, x3]
469 | | | | | | 0.50 0.50 | 0.50 0.50 | || | 5.0 | ldrb w5, [x11, x3]
470 | | | | 0.50 | 0.500 | | | || 1.0 | 1.0 | sub w0, w5, w4
471 | | | | | | 0.50 | 0.50 | || | | ret
473 | | | | | | | | || | | .p2align 4
474 | | | | | | | | || | | .Lempty:
475 | | | | | | | | || 0.0 | 0.0 | X eor x0, x0, x0
476 | | | | | | 0.50 | 0.50 | || | | ret
478 | | | | | | | | || | | .section .rodata
479 | | | | | | | | || | | .p2align 4
480 | | | | | | | | || | | shift_data:
481 | | | | | | | | || | | .byte 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
482 | | | | | | | | || | | .fill 16, 1, -1
483 | | | | | | | | || | | .size shift_data, .-shift_data
18.0 18.0 29.9 29.87 31.1 28.0 31.1 28.0 16.0 29.0 29.0
Loop-Carried Dependencies Analysis Report
-----------------------------------------
2 | 21.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
2 | 21.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
2 | 19.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
2 | 19.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
2 | 15.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 451, 455, 463, 464, 465, 468, 470, 475]
2 | 15.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 451, 455, 463, 464, 465, 469, 470, 475]
2 | 11.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 451, 466, 469, 470, 475]
2 | 24.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
2 | 24.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
2 | 24.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 438, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
2 | 24.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 438, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
2 | 24.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
2 | 24.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
2 | 20.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 451, 455, 463, 464, 465, 468, 470, 475]
2 | 20.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 451, 455, 463, 464, 465, 469, 470, 475]
2 | 16.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 451, 466, 469, 470, 475]
2 | 17.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 452, 461, 466, 469, 470, 475]
2 | 16.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 452, 461, 468, 470, 475]
2 | 26.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
2 | 26.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
2 | 24.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
2 | 24.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
2 | 20.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 451, 455, 463, 464, 465, 468, 470, 475]
2 | 20.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 451, 455, 463, 464, 465, 469, 470, 475]
2 | 16.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 451, 466, 469, 470, 475]
2 | 29.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
2 | 29.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
2 | 29.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 438, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
2 | 29.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 438, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
2 | 29.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
2 | 29.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
2 | 25.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 451, 455, 463, 464, 465, 468, 470, 475]
2 | 25.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 451, 455, 463, 464, 465, 469, 470, 475]
2 | 21.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 451, 466, 469, 470, 475]
2 | 22.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 452, 461, 466, 469, 470, 475]
2 | 21.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 452, 461, 468, 470, 475]
2 | 27.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
2 | 27.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
2 | 25.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
2 | 25.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
2 | 21.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 451, 455, 463, 464, 465, 468, 470, 475]
2 | 21.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 451, 455, 463, 464, 465, 469, 470, 475]
2 | 17.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 451, 466, 469, 470, 475]
3 | 22.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 292, 301, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
3 | 22.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 292, 301, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
3 | 20.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 292, 301, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
3 | 20.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 292, 301, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
3 | 16.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 292, 301, 317, 451, 455, 463, 464, 465, 468, 470, 475]
3 | 16.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 292, 301, 317, 451, 455, 463, 464, 465, 469, 470, 475]
3 | 12.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 292, 301, 317, 451, 466, 469, 470, 475]
3 | 17.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 319, 340, 379, 422, 423, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
3 | 17.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 319, 340, 379, 422, 423, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
3 | 17.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 320, 438, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
3 | 17.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 320, 438, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
3 | 10.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 320, 452, 461, 466, 469, 470, 475]
3 | 9.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 320, 452, 461, 468, 470, 475]
7 | 8.0 | subs x2, x2, #1 | [7, 77, 78, 178, 217, 319, 340, 379]
It’s also accessible here if you want to play around with it.
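For sifting through rows like the ones above offline, here is a minimal Python sketch. It assumes the four pipe-separated columns are the starting line number, the summed latency of the path, the instruction text, and the list of line numbers along the path. That reading of the columns is an assumption, and the names `parse_row` and `costliest` are hypothetical helpers, not part of any tool mentioned in this post.

```python
import ast

# Three rows copied verbatim from the output above.
SAMPLE_ROWS = [
    "3 | 10.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 320, 452, 461, 466, 469, 470, 475]",
    "3 | 9.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 320, 452, 461, 468, 470, 475]",
    "7 | 8.0 | subs x2, x2, #1 | [7, 77, 78, 178, 217, 319, 340, 379]",
]


def parse_row(line):
    """Split one 'start | cost | instruction | [path]' row into a dict.

    Assumption: the instruction text never contains a '|' character,
    which holds for the rows shown in this post.
    """
    start, cost, instr, path = (field.strip() for field in line.split("|"))
    return {
        "start": int(start),
        "cost": float(cost),
        "instr": instr,
        "path": ast.literal_eval(path),  # safely turns "[3, 80, ...]" into a list
    }


def costliest(lines):
    """Return the parsed row with the largest value in the second column."""
    return max((parse_row(line) for line in lines), key=lambda row: row["cost"])


worst = costliest(SAMPLE_ROWS)
print(worst["start"], worst["cost"], worst["path"])
```

Sorting the parsed rows by `cost` instead of taking the maximum gives the full ranking, which is handy for spotting paths that differ only in their last few nodes.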
References
[1]: https://reviews.freebsd.org/D45943
[2]: https://cgit.freebsd.org/src/tree/lib/libc/amd64/string/strcspn.S#n228
[3]: http://0x80.pl/articles/simd-byte-lookup.html
[4]: https://twitter.com/geofflangdale/status/1053227022795722752
[5]: https://github.com/intel/hyperscan/blob/master/src/nfa/truffle.c
[6]: https://bitmath.blogspot.com/2023/04/not-transposing-16x16-bitmatrix.html
[7]: https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1b137ca0
[8]: https://uica.uops.info/
[9]: https://github.com/RRZE-HPC/OSACA
[10]: https://pages.cs.wisc.edu/~mscalar/simplescalar.html