This document contains notes about various ideas that for one reason or another are not being actively pursued.
The current plan for a SIMD-accelerated inner loop for handling ASCII bytes makes no use of the bit of information that if the buffers didn't end but the ASCII loop exited, the next byte will not be an ASCII byte.
Both uconv and ICU outperform encoding_rs when decoding single-byte to UTF-16. unconv doesn't even do anything fancy to manually unroll the loop (see below). Both handle even the ASCII range using table lookup. That is, there's no branch for checking if we're in the lower or upper half of the encoding.
However, adding SIMD acceleration for the ASCII half will likely be a bigger win than eliminating the branch to decide ASCII vs. non-ASCII.
ICU currently outperforms encoding_rs (by over x2!) when decoding a single-byte encoding to UTF-16. This appears to be thanks to manually unrolling the conversion loop by 16. See ucnv_MBCSSingleToBMPWithOffsets.
Notably, none of the single-byte encodings have bytes that'd decode to the
upper half of BMP. Therefore, if the unmappable marker has the highest bit set
instead of being zero, the check for unmappables within a 16-character stride
can be done either by ORing the BMP characters in the stride together and
checking the high bit or by loading the upper halves of the BMP charaters
in a u8x8
register and checking the high bits using the _mm_movemask_epi8
/ pmovmskb
SSE2 instruction.
Since the failure mode of SIMD ASCII acceleration involves wasted aligment checks and a wasted SIMD read when the next code unit is non-ASCII and non-Latin scripts have runs of non-ASCII even if ASCII spaces and punctuation is used, consider handling the next two or three bytes following non-ASCII as non-SIMD before looping back to the SIMD mode. Maybe move back to SIMD ASCII faster if there's ASCII that's not space or punctuation. Maybe with the "space or punctuation" check in place, this code can be allowed to be in place even for UTF-8 and Latin single-byte (i.e. not having different code for Latin and non-Latin single-byte).
Instead of returning to acceleration directly after non-ASCII, consider continuing to the alignment boundary without acceleration.
When the SIMD ASCII check fails, the data has already been read from memory. Test whether it's faster to read the data by lane from the SIMD register than to read it again from RAM (cache).
These two are ordered by radical and then by stroke count, so in principle, they should be mostly Unicode-ordered, although at least Level 2 Hanzi isn't fully Unicode-ordered. Is "mostly" good enough for encode accelelation?
Experiment with a function that computes (i / 94, i % 94)
more efficiently
than generic code.
On Cortex-A57, it might be a good idea to move the destination into 16-byte alignment.
Currently, Aarch64 runs the generic ALU UTF-8 validation code that aligns reads. That's probably unnecessary on Aarch64. (SIMD was slower than ALU!)
When there are at least four bytes left, read all four. With each byte index into tables corresponding to magic values indexable by byte in each position.
In the value read from the table indexed by lead byte, encode the following in 16 bits: advance 2 bits (2, 3 or 4 bytes), 9 positional bits one of which is set to indicate the type of lead byte (8 valid types, in the 8 lowest bits, and invalid, ASCII would be tenth type), and the mask for extracting the payload bits from the lead byte (for conversion to UTF-16 or UTF-32).
In the tables indexable by the trail bytes, in each positions corresponding byte the lead byte type, store 1 if the trail is invalid given the lead and 0 if valid given the lead.
Use the low 8 bits of the of the 16 bits read from the first table to mask (bitwise AND) one positional bit from each of the three other values. Bitwise OR the results together with the bit that is 1 if the lead is invalid. If the result is zero, the sequence is valid. Otherwise it's invalid.
Use the advance to advance. In the conversion to UTF-16 or UTF-32 case, use the mast for extracting the meaningful bits from the lead byte to mask them from the lead. Shift left by 6 as many times as the advance indicates, etc.