Unicode support? #40
Hmm looks like the spec requires things to be unicode escaped?
Is one of the rules. |
Oh never mind. Didn't implement it lol:
It's a todo |
This will be a good exercise to do - define the unicode char classes in a separate file as token macros and use that to define identifiers |
Yes, I thought that much. |
https://www.unicode.org/Public/UNIDATA/UnicodeData.txt has the definitions. Should be easy to produce private token definitions from this - at least up to FFFD. Looks like Unicode is now more than 16 bits :( JavaCC doesn't support that yet. |
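A minimal sketch (Python, not part of the project) of how UnicodeData.txt could be turned into per-category code-point lists; the sample lines follow the file's semicolon-separated format, and the 0xFFFD cutoff mirrors the comment above:

```python
from collections import defaultdict

# Sample lines in UnicodeData.txt's format:
# field 0 = code point (hex), field 2 = general category.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0C12;TELUGU LETTER O;Lo;0;L;;;;;N;;;;;
"""

def chars_by_category(lines, max_cp=0xFFFD):
    """Group code points <= max_cp by Unicode general category."""
    cats = defaultdict(list)
    for line in lines:
        fields = line.split(";")
        cp = int(fields[0], 16)
        if cp <= max_cp:
            cats[fields[2]].append(cp)
    return cats

cats = chars_by_category(SAMPLE.splitlines())
print(sorted(cats))  # ['Ll', 'Lo', 'Lu']
```

Each category list can then be emitted as a private token definition (Ll, Lu, Lo, ...) in the grammar.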
Initial version! Also added test in my mother tongue (Telugu) lol: "SELECT 1 ఒకట;" |
I must confess that I am not able to differentiate between Thai, Khmer, Lao and Telugu. |
It's Telugu - spoken in the south east state of Andhra Pradesh in India which is where I'm from originally (though now I'm basically a native Californian - after 30 years lol) |
That's a lot of Tokens!
|
Hmm, where did you get that from? See my PR - I just massaged the UnicodeData.txt file (attached above) and produced char sets and ranges, exactly as defined in that Unicode data file. They are not tokens. They are explicitly listed out along with comments, so if there is a problem we can check easily. Look at the unicode-identifiers.txt file in my PR |
Like I mentioned earlier, spec compliance and easy verification of the spec are what I'm going for in this project. |
I did that and I found that every single Character is defined one by one (unless I am completely lost again). Of course it is correct and pure, but you can achieve the same by defining ranges (instead of token by token). |
It's not that big - only 22k. Also, they are character lists - why are you displaying char lists as tokens? There are only 6 local tokens - like Ll, Lu, Lo etc. So the tool processing this grammar should do a better job lol. Or we can strip out the comments if grammar loading is a problem. |
No that could be bug prone. But if you want to do that, write a preprocessor that takes this file and compacts into ranges before the concatenation.
|
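The preprocessor suggested above could be a simple sort-and-merge over the code points. A Python sketch (illustrative only; the `"\uXXXX"` output format matches the JavaCC character-set syntax seen later in this thread):

```python
def compact(codepoints):
    """Merge a list of code points into contiguous (lo, hi) ranges."""
    ranges = []
    for cp in sorted(codepoints):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp          # extend the current run
        else:
            ranges.append([cp, cp])     # start a new run
    return [tuple(r) for r in ranges]

def to_javacc(ranges):
    """Render (lo, hi) ranges as a JavaCC character set."""
    parts = [f'"\\u{lo:04X}"' if lo == hi else f'"\\u{lo:04X}"-"\\u{hi:04X}"'
             for lo, hi in ranges]
    return "[" + ", ".join(parts) + "]"

pts = [0x41, 0x42, 0x43, 0x61, 0x62, 0x0C12]
print(compact(pts))              # [(65, 67), (97, 98), (3090, 3090)]
print(to_javacc(compact(pts)))
```

Running the compaction before concatenating the grammar keeps the checked-in char lists verbatim while the generated grammar stays small.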
The TEXT file is 1.5 MB, and without mangling it will end up like that in the grammar (where it will be translated into a single token, of course). Although I don't understand why we don't want to provide ranges as per the Unicode page - but that's rather a matter of taste and not debatable. |
This one: https://jrgraphix.net/r/Unicode/ |
There is no easy way to find the ranges in general. But it should be easy to do in your document generator! When there are char lists, you compact them for display purposes. |
That's not enough. You need Lu, Ll etc. as defined for each of those separate languages in the Unicode data txt file I attached. Also, this is not the official spec so we can't use that lol. |
Some ranges are commented out already?
Certainly Mr. Erdogan won't be too happy about that, but I would like to understand the idea behind it, please. |
Unfortunately JavaCC cannot handle characters wider than 2 bytes :( Anything > 0xFFFF (beyond "\uffff") won't work. |
Ok, I get that. Although, for me this is not the hill to die upon. |
This part of the Unicode spec is relatively new and I don't think anyone supports it properly yet (not sure if Java even supports it - haven't checked). If/when Java supports it, we can extend JavaCC to do that as well.
👍🏾 |
Yes, lets turn that into a selling point: "The only OSMANYA supporting SQL Parser in the world!". :-D |
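For context on the 16-bit limit: anything past U+FFFF occupies two UTF-16 code units, which is what a 16-bit-character lexer actually sees. A quick Python illustration, using an Osmanya letter (U+10480) only as an example from that block:

```python
ch = "\U00010480"  # OSMANYA LETTER ALEF, outside the Basic Multilingual Plane

# One code point, but two UTF-16 code units (a surrogate pair):
units = len(ch.encode("utf-16-be")) // 2
print(len(ch), units)  # 1 2
```

A lexer limited to "\uffff" would see two separate surrogate characters here, never the single letter.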
Somewhat related - the spec does not allow digits/numbers from other languages!
|
In my opinion, this makes a kind of sense since the Database would need to calculate with the Numbers eventually. |
In case anyone else may want to use it: ./unicode.sh Downloads/unicode-identifiers.txt I kept it verbose for debugging purposes. We can remove the noise when verified. |
Check out the PR again - I added an awk script (lol) to fix up the char ranges. We should add a test that it actually captures all the Unicode chars in all the ranges properly. Now we have < 500 ranges. |
Good morning, thank you for investing in this. Unfortunately your script warns:
are@archlinux ~/D/s/p/p/grammar (unicode-support)> ./prepare-javacc-grammar.sh
awk: ./compact_char_sets.awk:3: warning: regexp escape sequence `\"' is not a known regexp operator
awk: ./compact_char_sets.awk:16: (FILENAME=- FNR=4) warning: regexp escape sequence `\u' is not a known regexp operator
I have compared the output and we achieve more or less the same. The main difference is the segregation into Unicode categories: with categories, more ranges; without categories, more compact output. Although my AWK skills are extremely poor, I have 3 comments: Confirmed, when I found the generated file. All good.
So my recommendation would be to a) check in the script and b) check in the compacted Unicode files as well, and use those by default. They should literally never change. Although I feel I am digressing here. One more thing: if we start pre-preprocessing with AWK scripts now, should we not rather operate on the official source https://www.unicode.org/Public/UNIDATA/UnicodeData.txt instead of the intermediate unicode-identifiers.txt? |
I have rebuilt the ranges based on https://www.unicode.org/Public/UNIDATA/UnicodeData.txt, using the "L" category only.
I get a few more Ranges. Please see attached. Example:
|
Then what about those:
I don't speak Tamil, but with my understanding of Thai I would have expected those to be "in":
You can't write Thai without those (although they never stand alone). |
Scientific Symbols?
|
Configurable categories:
My favorite Thai vowels and Small Dollar Sign are in, as well as currency symbols. |
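A sketch of what "configurable categories" could look like in a generator. The category sets below (letters, plus Mn for combining vowel signs and Sc for currency signs) are the examples from this thread, not a spec mandate:

```python
LETTERS = {"Lu", "Ll", "Lt", "Lm", "Lo"}
EXTRAS = {"Mn", "Sc"}  # combining marks (Thai/Telugu vowel signs), currency signs

def allowed(category, include_extras=True):
    """Decide whether a Unicode general category is accepted in identifiers."""
    return category in LETTERS or (include_extras and category in EXTRAS)

print(allowed("Mn"), allowed("Mn", include_extras=False))  # True False
```

The generator can then emit one grammar variant per configuration instead of hard-coding the "L" categories only.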
The original unicode list is complete. So they should all work if allowed! Here is the snippet:
|
Hmm let me check
It should be trivial to do that since we have the original file with all the allowed chars. Just generate a test from that.
Yeah - when we are doing that - might as well do the best possible job!
That's already in the original file
It is - it is one of the really old tools that's there on any self-respecting Linux installation lol
Interesting idea! Should be doable. In fact that's a CSV - so we could bootstrap using sqlite or something lol |
I have done that, just check the attached Bash script above please. |
One minor issue is the few ranges that file has - entries like First> and Last> mark the beginning/end of ranges - annoying :( |
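Those First>/Last> marker pairs can be folded into ranges while scanning the file. A Python sketch over the CJK pair (the 4E00-9FCC bounds match the grammar snippet later in this thread; sample lines are in the file's format):

```python
SAMPLE = """\
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
00C4;LATIN CAPITAL LETTER A WITH DIAERESIS;Lu;0;L;0041 0308;;;;N;;;;00E4;
"""

def expand_ranges(lines):
    """Turn <..., First>/<..., Last> entry pairs into (lo, hi) ranges;
    ordinary entries become single-point ranges."""
    out, first = [], None
    for line in lines:
        cp_hex, name = line.split(";")[:2]
        cp = int(cp_hex, 16)
        if name.endswith(", First>"):
            first = cp
        elif name.endswith(", Last>"):
            out.append((first, cp))
            first = None
        else:
            out.append((cp, cp))
    return out

print(expand_ranges(SAMPLE.splitlines()))  # [(19968, 40908), (196, 196)]
```

This keeps the big CJK and Hangul blocks as single ranges instead of tens of thousands of individual entries.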
Sorry, this does not make sense without the "Mn" at least. Example: อักษรไทย, you will need the A Vowel. |
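The point about Thai vowel signs can be checked directly: they are combining marks (general category Mn), not letters, so a letters-only identifier rule rejects them. Python's unicodedata serves as an independent reference here:

```python
import unicodedata

# THAI CHARACTER O ANG (a letter) followed by MAI HAN-AKAT (a vowel sign)
for ch in "\u0E2D\u0E31":
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)}")
# U+0E2D Lo
# U+0E31 Mn
```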
I didn't come up with the spec lol, so I can't do that if it's not in there! Also, they had:
I need to add that as well |
Ideally we should have these as predefined in JavaCC so any other grammar wanting to use it will benefit from it. If you want to contribute that to JavaCC, please go ahead. I will keep it like this for now. |
In the grammar, I had:
One more todo lol. I will add it today. That will give you 'Mn' (I also noticed that for Telugu - having another vowel gives syntax error) |
Agreed. We can close this issue. |
I was good with 7 bit ASCII, but my wife stood right behind me!
While you are on it: your OUTPUT file has a spelling error (sorry for being pedantic.) |
Yeah like my PR title said - it's the initial implementation. Let me know if you can take it over and fix it up. |
|
Yes, if SH/Bash is acceptable. |
Sure as long as it works and we can verify that it does, we are good! Let's add a test that can be generated from Unicodedata.txt one for each allowed char (and also a negative test to make sure we did not add anything extra) |
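A sketch of how such tests could be generated; the statement shape copies the "SELECT 1 ఒకట;" example above, and the expected pass/fail flag would be fed to whatever test harness drives the parser (which is not shown here):

```python
def gen_cases(allowed, disallowed):
    """One positive SQL statement per allowed code point,
    one negative statement per disallowed code point."""
    pos = [(f"SELECT 1 {chr(cp)};", True) for cp in allowed]
    neg = [(f"SELECT 1 {chr(cp)};", False) for cp in disallowed]
    return pos + neg

# e.g. TELUGU LETTER O must parse; '#' must not
cases = gen_cases(allowed=[0x0C12], disallowed=[ord("#")])
for sql, should_parse in cases:
    print(repr(sql), should_parse)
```

Deriving both lists from UnicodeData.txt gives exactly the "one test per allowed char plus negative tests" coverage suggested above.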
OK added the identifier extend stuff and now it works for vowel additions in Telugu. So should work for Thai as well. Check it out. The PR is now clean. I removed my awk shit. It just uses the full list. You can add your shell script separately. I will keep the reference grammar clean. |
FYI - the spec does NOT allow underscore "_" as part of identifier lol - learnt the hard way today. |
Greetings, from my other project I have learned that the CJK Block needs to be added explicitly: // CJK Unified Ideographs block according to https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
| <#CJK: ["\u4E00"-"\u62FF", "\u6300"-"\u77FF", "\u7800"-"\u8CFF", "\u8D00"-"\u9FCC"]> |
Parse error at line 1, column 10. Encountered: from
select * from मकान;
Same for Thai, Traditional Chinese and German Umlauts.
I believe we will need to allow Unicode Alphabet Letters explicitly?