Add SVE Support #432

wx257osn2 · 2023-06-04T00:46:47Z

Add ADA_SVE macro in include/ada/common_defs.h enabled if defined(__ARM_FEATURE_SVE)
Implement has_tabs_or_newline using Arm SVE
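
A minimal sketch of what such a macro could look like. The names `ADA_SVE` and `__ARM_FEATURE_SVE` come from the description above; the `sve_enabled()` helper is hypothetical and exists only so the snippet can be exercised. Compilers define `__ARM_FEATURE_SVE` only when SVE code generation is enabled (for example with `-march=armv8-a+sve` on GCC or Clang), so on a default build the macro stays undefined.

```cpp
// Sketch of a feature macro in the style of include/ada/common_defs.h.
// The exact definition in the PR may differ.
#if defined(__ARM_FEATURE_SVE)
#define ADA_SVE 1
#endif

// Hypothetical helper (not part of ada): reports whether this translation
// unit was compiled with SVE enabled.
constexpr bool sve_enabled() noexcept {
#if ADA_SVE
  return true;
#else
  return false;
#endif
}
```

Code paths can then be selected with `#if ADA_SVE ... #elif ADA_NEON ... #endif`, mirroring the existing `ADA_NEON` guard in src/unicode.cpp.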

anonrig · 2023-06-04T01:44:50Z

src/unicode.cpp

@@ -44,7 +46,27 @@ constexpr bool to_lower_ascii(char* input, size_t length) noexcept {
  }
  return non_ascii == 0;
 }
-#if ADA_NEON
+#endif


This line is causing build errors

Oops, I fixed it.

lemire · 2023-06-04T16:29:46Z

Optimizations are always eagerly invited. However, we need to be careful with optimized code that runs neither in our continuous integration tests nor on our development machines.

As is, the code won't build: see my fix. Even with the fix, SVE is not enabled by default: you need to pass a compiler flag (e.g., -march=armv8-a+sve). Let us use a Graviton 3 system (GCC 11) and run benchdata three times with the current main branch and three times with the PR...

Before...

BasicBench_AdaURL_aggregator_href   27475530 ns     27472820 ns           25 GHz=2.59283 cycle/byte=8.19443 cycles/url=711.761 instructions/byte=27.0696 instructions/cycle=3.30341 instructions/ns=8.56519 instructions/url=2.35124k ns/url=274.511 speed=316.243M/s time/byte=3.16212ns time/url=274.66ns url/s=3.64087M/s
BasicBench_AdaURL_aggregator_href   27473560 ns     27467550 ns           26 GHz=2.5946 cycle/byte=8.19026 cycles/url=711.4 instructions/byte=27.0627 instructions/cycle=3.30425 instructions/ns=8.5732 instructions/url=2.35064k ns/url=274.185 speed=316.304M/s time/byte=3.16152ns time/url=274.607ns url/s=3.64157M/s
BasicBench_AdaURL_aggregator_href   27487336 ns     27484277 ns           25 GHz=2.59334 cycle/byte=8.19364 cycles/url=711.693 instructions/byte=27.067 instructions/cycle=3.30342 instructions/ns=8.56691 instructions/url=2.35102k ns/url=274.431 speed=316.111M/s time/byte=3.16344ns time/url=274.774ns url/s=3.63935M/s

After...

BasicBench_AdaURL_aggregator_href   27470432 ns     27466735 ns           25 GHz=2.5941 cycle/byte=8.20583 cycles/url=712.752 instructions/byte=26.7438 instructions/cycle=3.25912 instructions/ns=8.4545 instructions/url=2.32295k ns/url=274.759 speed=316.313M/s time/byte=3.16142ns time/url=274.599ns url/s=3.64168M/s
BasicBench_AdaURL_aggregator_href   27451202 ns     27447098 ns           26 GHz=2.59269 cycle/byte=8.19058 cycles/url=711.428 instructions/byte=26.7439 instructions/cycle=3.2652 instructions/ns=8.46564 instructions/url=2.32295k ns/url=274.397 speed=316.54M/s time/byte=3.15916ns time/url=274.402ns url/s=3.64428M/s
BasicBench_AdaURL_aggregator_href   27483095 ns     27479923 ns           25 GHz=2.59405 cycle/byte=8.20717 cycles/url=712.869 instructions/byte=26.7438 instructions/cycle=3.25859 instructions/ns=8.45295 instructions/url=2.32295k ns/url=274.809 speed=316.161M/s time/byte=3.16294ns time/url=274.731ns url/s=3.63993M/s

As you can see, the SVE code is not faster than the NEON code: time/url stays between roughly 274.4 and 274.8 ns in both sets of runs.

You can get a benefit with SVE, but you have to rewrite the code with a tail...

BasicBench_AdaURL_aggregator_href   27167271 ns     27161413 ns           26 GHz=2.59365 cycle/byte=8.09789 cycles/url=703.376 instructions/byte=26.7557 instructions/cycle=3.30404 instructions/ns=8.56953 instructions/url=2.32398k ns/url=271.191 speed=319.869M/s time/byte=3.12628ns time/url=271.546ns url/s=3.68261M/s
BasicBench_AdaURL_aggregator_href   27195788 ns     27192649 ns           26 GHz=2.59409 cycle/byte=8.11229 cycles/url=704.627 instructions/byte=26.7557 instructions/cycle=3.29817 instructions/ns=8.55576 instructions/url=2.32398k ns/url=271.628 speed=319.501M/s time/byte=3.12988ns time/url=271.859ns url/s=3.67838M/s
BasicBench_AdaURL_aggregator_href   27245132 ns     27239912 ns           26 GHz=2.59493 cycle/byte=8.11311 cycles/url=704.698 instructions/byte=26.7532 instructions/cycle=3.29753 instructions/ns=8.55687 instructions/url=2.32376k ns/url=271.567 speed=318.947M/s time/byte=3.13532ns time/url=272.331ns url/s=3.672M/s

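ada_really_inline bool has_tabs_or_newline(
    std::string_view user_input) noexcept {
  // Broadcast each character we are looking for to every byte lane.
  const svuint8_t mask1 = svdup_n_u8('\r');
  const svuint8_t mask2 = svdup_n_u8('\n');
  const svuint8_t mask3 = svdup_n_u8('\t');
  // Accumulates, per lane, whether any block so far contained a match.
  svbool_t running = svdup_n_b8(false);
  const size_t lanes = svcntb();  // bytes per SVE vector (32 on Graviton 3)
  size_t i = 0;
  // Main loop: full vectors only, governed by an all-true predicate.
  for (; i + lanes <= user_input.size(); i += lanes) {
    const svbool_t mask = svptrue_b8();
    svuint8_t word = svld1_u8(mask, (const uint8_t*)user_input.data() + i);
    running = svorr_b_z(mask,
                        svorr_b_z(mask, running,
                                  svorr_b_z(mask, svcmpeq_u8(mask, word, mask1),
                                            svcmpeq_u8(mask, word, mask2))),
                        svcmpeq_u8(mask, word, mask3));
  }
  // Tail: fewer than `lanes` bytes remain, so load and compare them under a
  // partial predicate.
  if (i < user_input.size()) {
    // The ORs use an all-true predicate: a zeroing OR governed by the partial
    // predicate would clear matches that the main loop recorded in lanes
    // beyond the tail.
    const svbool_t all = svptrue_b8();
    const svbool_t mask = svwhilelt_b8_u64(i, user_input.size());
    svuint8_t word = svld1_u8(mask, (const uint8_t*)user_input.data() + i);
    running = svorr_b_z(all,
                        svorr_b_z(all, running,
                                  svorr_b_z(all, svcmpeq_u8(mask, word, mask1),
                                            svcmpeq_u8(mask, word, mask2))),
                        svcmpeq_u8(mask, word, mask3));
  }
  return svptest_any(svptrue_b8(), running);
}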

See https://lemire.me/blog/2023/03/10/trimming-spaces-from-strings-faster-with-sve-on-an-amazon-graviton-3-processor/ for a related discussion and other ways to skin this cat.

It is a 1% gain, which might be significant, but SVE on the Graviton 3 has 32-byte SIMD registers, whereas it appears that all SVE implementations on commodity processors are moving to 16-byte SIMD registers. Also, I did not test with other compilers.
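
The 1% figure can be checked from the time/url values reported in the benchmark output. The sketch below averages the three main-branch runs and the three runs of the SVE version with a tail, then computes the relative reduction. The helper names are illustrative; only the numbers are taken from this thread.

```cpp
#include <array>
#include <numeric>

// time/url values in nanoseconds, copied from the benchmark output above.
constexpr std::array<double, 3> main_branch{274.66, 274.607, 274.774};
constexpr std::array<double, 3> sve_with_tail{271.546, 271.859, 272.331};

// Arithmetic mean of three runs.
inline double mean(const std::array<double, 3>& v) {
  return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}

// Relative reduction in time per URL, as a fraction of the baseline.
inline double relative_gain() {
  return (mean(main_branch) - mean(sve_with_tail)) / mean(main_branch);
}
```

With these inputs the result is close to 0.01, that is, about a 1% reduction in time per URL.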

There may well be other optimizations worth considering with SVE.

Co-authored-by: Daniel Lemire <[email protected]>

wx257osn2 · 2023-06-04T18:55:53Z

> As is, the code won't build: see my fix.

Oops, I missed that. Thanks.

> You can get a benefit with SVE, but you have to rewrite the code with a tail...

It seems that the performance of svwhilelt on Neoverse V1 is not good. At least when all of the split code is small enough to fit in the instruction cache, your proposal (handling the remaining elements in a separate step after the loop) seems to perform better.

I think SVE code will pay off once instruction latency or register width improves, but that is not the case on Graviton 3.

lemire · 2023-06-05T15:01:04Z

@wx257osn2 This PR could well become profitable in the future.

anonrig reviewed Jun 4, 2023

implement has_tabs_or_newline in Arm SVE

13e790b

wx257osn2 force-pushed the add-sve-support branch from 96e8c44 to 13e790b on June 4, 2023 04:43

lemire reviewed Jun 4, 2023

add missing mask argument

1290de0

Co-authored-by: Daniel Lemire <[email protected]>
