-
-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SVE Support #432
base: main
Are you sure you want to change the base?
Add SVE Support #432
Conversation
src/unicode.cpp
Outdated
@@ -44,7 +46,27 @@ constexpr bool to_lower_ascii(char* input, size_t length) noexcept { | |||
} | |||
return non_ascii == 0; | |||
} | |||
#if ADA_NEON | |||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is causing build errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I fixed it.
96e8c44
to
13e790b
Compare
Optimizations are always eagerly invited. However, we need to be careful with optimized code that does not run in our continuous integration tests nor on our development machines. As is, the code won't build: see my fix. Even with the fix, by default SVE will not be enabled, you need to pass a compiler flag (e.g., Before...
After...
As you can see, the SVE code is not faster than the NEON code. You can get a benefit with SVE, but you have to rewrite the code with a tail...
ada_really_inline bool has_tabs_or_newline(
std::string_view user_input) noexcept {
const svuint8_t mask1 = svdup_n_u8('\r');
const svuint8_t mask2 = svdup_n_u8('\n');
const svuint8_t mask3 = svdup_n_u8('\t');
svbool_t running = svdup_n_b8(false);
const size_t lanes = svcntb();
size_t i = 0;
for (; i + lanes <= user_input.size(); i += lanes) {
const svbool_t mask = svptrue_b8();
svuint8_t word = svld1_u8(mask, (const uint8_t*)user_input.data() + i);
running = svorr_b_z(mask,
svorr_b_z(mask, running,
svorr_b_z(mask, svcmpeq_u8(mask, word, mask1),
svcmpeq_u8(mask, word, mask2))),
svcmpeq_u8(mask, word, mask3));
}
if (i < user_input.size()) {
const svbool_t mask = svwhilelt_b8_u64(i, user_input.size());
svuint8_t word = svld1_u8(mask, (const uint8_t*)user_input.data() + i);
running = svorr_b_z(mask,
svorr_b_z(mask, running,
svorr_b_z(mask, svcmpeq_u8(mask, word, mask1),
svcmpeq_u8(mask, word, mask2))),
svcmpeq_u8(mask, word, mask3));
}
return svptest_any(svptrue_b8(), running);
} See https://lemire.me/blog/2023/03/10/trimming-spaces-from-strings-faster-with-sve-on-an-amazon-graviton-3-processor/ for a related discussion and other ways to skin this cat. It is a 1% gain which might be significant, but SVE on the graviton 3 has 32-byte SIMD registers whereas it appears that all SVE implementations on commodity processors are moving to 16-byte SIMD registers. Also I did not test on other compilers. There may well be other optimizations worth considering with SVE. |
Co-authored-by: Daniel Lemire <[email protected]>
Oops, I missed. Thanks.
It seems that the performance of I think that SVE code will give some benefits when the latency of instructions or register width will be improved, but anyway it doesn't so on Graviton 3. |
@wx257osn2 It could be that this PR could become profitable in the future. |
ADA_SVE
macro ininclude/ada/common_defs.h
enabled ifdefined(__ARM_FEATURE_SVE)
has_tabs_or_newline
using Arm SVE