Unicode support? #40
Hmm looks like the spec requires things to be unicode escaped?
Is one of the rules. |
Oh never mind. Didn't implement it lol:
It's a todo |
This will be a good exercise to do - define the unicode char classes in a separate file as token macros and use that to define identifiers |
Yes, I thought that much. |
https://www.unicode.org/Public/UNIDATA/UnicodeData.txt has the definitions. Should be easy to produce private token definitions from this - at least up to FFFD. Looks like Unicode is now more than 16 bits :( JavaCC doesn't support that yet. |
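A minimal sketch (Python, not part of the project) of how UnicodeData.txt could be turned into per-category code-point lists; the sample lines follow the file's semicolon-separated format, and the 0xFFFD cutoff mirrors the comment above:

```python
from collections import defaultdict

# Sample lines in UnicodeData.txt's format:
# field 0 = code point (hex), field 2 = general category.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0C12;TELUGU LETTER O;Lo;0;L;;;;;N;;;;;
"""

def chars_by_category(lines, max_cp=0xFFFD):
    """Group code points <= max_cp by Unicode general category."""
    cats = defaultdict(list)
    for line in lines:
        fields = line.split(";")
        cp = int(fields[0], 16)
        if cp <= max_cp:
            cats[fields[2]].append(cp)
    return cats

cats = chars_by_category(SAMPLE.splitlines())
print(sorted(cats))  # ['Ll', 'Lo', 'Lu']
```

Each category list can then be emitted as a private token definition (Ll, Lu, Lo, ...) in the grammar.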
Initial version! Also added test in my mother tongue (Telugu) lol: "SELECT 1 ఒకట;" |
I must confess that I am not able to differentiate between Thai, Khmer, Lao and Telugu. |
It's Telugu - spoken in the south east state of Andhra Pradesh in India which is where I'm from originally (though now I'm basically a native Californian - after 30 years lol) |
That's a lot of Tokens!
|
Hmm, where did you get that from? See my PR - I just massaged the UnicodeData.txt file (attached above) and produced char sets and ranges, exactly as defined in that Unicode data file. They are not tokens. They are explicitly listed out along with comments, so if there is a problem we can check easily. Look at the unicode-identifiers.txt file in my PR |
Like I mentioned earlier, spec compliance and easy verification of the spec are what I'm going for in this project. |
I did that and I found that every single Character is defined one by one (unless I am completely lost again). Of course it is correct and pure, but you can achieve the same by defining ranges (instead of token by token). |
It's not that big - only 22k. Also, they are character lists - why are you displaying char lists as tokens? There are only 6 local tokens - like Ll, Lu, Lo etc. So the tool processing this grammar should do a better job lol. Or we can strip out the comments if grammar loading is a problem. |
No that could be bug prone. But if you want to do that, write a preprocessor that takes this file and compacts into ranges before the concatenation.
|
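The preprocessor suggested above could be a simple sort-and-merge over the code points. A Python sketch (illustrative only; the `"\uXXXX"` output format matches the JavaCC character-set syntax seen later in this thread):

```python
def compact(codepoints):
    """Merge a list of code points into contiguous (lo, hi) ranges."""
    ranges = []
    for cp in sorted(codepoints):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp          # extend the current run
        else:
            ranges.append([cp, cp])     # start a new run
    return [tuple(r) for r in ranges]

def to_javacc(ranges):
    """Render (lo, hi) ranges as a JavaCC character set."""
    parts = [f'"\\u{lo:04X}"' if lo == hi else f'"\\u{lo:04X}"-"\\u{hi:04X}"'
             for lo, hi in ranges]
    return "[" + ", ".join(parts) + "]"

pts = [0x41, 0x42, 0x43, 0x61, 0x62, 0x0C12]
print(compact(pts))              # [(65, 67), (97, 98), (3090, 3090)]
print(to_javacc(compact(pts)))
```

Running the compaction before concatenating the grammar keeps the checked-in char lists verbatim while the generated grammar stays small.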
The TEXT file is 1.5 MB, and without mangling it will end up like that in the grammar (where it will be translated into a single token, of course). Although I don't understand why we don't want to provide ranges as per the Unicode page - but that's rather a matter of taste and not debatable. |
This one: https://jrgraphix.net/r/Unicode/ |
There is no easy way to find the ranges in general. But it should be easy to do in your document generator! When there are char lists, you compact them for display purposes. |
That's not enough. You need Lu, Ll etc. as defined for each of those separate languages in the Unicode data txt file I attached. Also, this is not the official spec so we can't use that lol. |
Some ranges are commented out already?
Certainly Mr. Erdogan won't be too happy about that, but I would like to understand the idea behind it, please. |
Unfortunately JavaCC cannot handle characters wider than 2 bytes :( Anything > 0xFFFF (beyond "\uffff") won't work. |
Ok, I get that. Although, for me this is not the hill to die upon. |
This part of the Unicode spec is relatively new and I don't think anyone supports it properly yet (not sure if Java even supports it - haven't checked). If/when Java supports it, we can extend JavaCC to do that as well.
👍🏾 |
Yes, lets turn that into a selling point: "The only OSMANYA supporting SQL Parser in the world!". :-D |
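For context on the 16-bit limit: anything past U+FFFF occupies two UTF-16 code units, which is what a 16-bit-character lexer actually sees. A quick Python illustration, using an Osmanya letter (U+10480) only as an example from that block:

```python
ch = "\U00010480"  # OSMANYA LETTER ALEF, outside the Basic Multilingual Plane

# One code point, but two UTF-16 code units (a surrogate pair):
units = len(ch.encode("utf-16-be")) // 2
print(len(ch), units)  # 1 2
```

A lexer limited to "\uffff" would see two separate surrogate characters here, never the single letter.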
Somewhat related - the spec does not allow digits/numbers from other languages!
|
In my opinion, this makes a kind of sense since the Database would need to calculate with the Numbers eventually. |
In case anyone else may want to use it: ./unicode.sh Downloads/unicode-identifiers.txt I kept it verbose for debugging purposes. We can remove the noise when verified. |
Check out the PR again - I added an awk script (lol) to fix up the char ranges. We should add a test that it actually captures all the Unicode chars in all the ranges properly. Now we have < 500 ranges. |
Good morning, thank you for investing in this. Unfortunately your script warns:
are@archlinux ~/D/s/p/p/grammar (unicode-support)> ./prepare-javacc-grammar.sh
awk: ./compact_char_sets.awk:3: warning: regexp escape sequence `\"' is not a known regexp operator
awk: ./compact_char_sets.awk:16: (FILENAME=- FNR=4) warning: regexp escape sequence `\u' is not a known regexp operator
I have compared the output and we achieve more or less the same. The main difference is the segregation into Unicode categories: with categories, more ranges; without categories, more compact output. Although my AWK skills are extremely poor, I have 3 comments: Confirmed, when I found the generated file. All good.
So my recommendation would be to a) check in the script and b) check in the compacted Unicode files as well, and use those by default. They should literally never change. Although I feel I am digressing here. One more thing: if we start pre-preprocessing with AWK scripts now, should we not rather operate on the official source https://www.unicode.org/Public/UNIDATA/UnicodeData.txt instead of the intermediate unicode-identifiers.txt? |
I have rebuilt the ranges based on https://www.unicode.org/Public/UNIDATA/UnicodeData.txt, using the "L" category only.
I get a few more Ranges. Please see attached. Example:
|
Then what about those:
I don't speak Tamil, but with my understanding of Thai I would have expected those to be "in":
You can't write Thai without those (although they never stand alone). |
Scientific Symbols?
|
Configurable categories:
My favorite Thai vowels and Small Dollar Sign are in, as well as currency symbols. |
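A sketch of what "configurable categories" could look like in a generator. The category sets below (letters, plus Mn for combining vowel signs and Sc for currency signs) are the examples from this thread, not a spec mandate:

```python
LETTERS = {"Lu", "Ll", "Lt", "Lm", "Lo"}
EXTRAS = {"Mn", "Sc"}  # combining marks (Thai/Telugu vowel signs), currency signs

def allowed(category, include_extras=True):
    """Decide whether a Unicode general category is accepted in identifiers."""
    return category in LETTERS or (include_extras and category in EXTRAS)

print(allowed("Mn"), allowed("Mn", include_extras=False))  # True False
```

The generator can then emit one grammar variant per configuration instead of hard-coding the "L" categories only.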
The original unicode list is complete. So they should all work if allowed! Here is the snippet:
|
Hmm let me check
It should be trivial to do that since we have the original file with all the allowed chars. Just generate a test from that.
Yeah - when we are doing that - might as well do the best possible job!
That's already in the original file
It is - it is one of the really old tools that's there on any self-respecting Linux installation lol
Interesting idea! Should be doable. In fact that's a CSV - so we could bootstrap using sqlite or something lol |
I have done that, just check the attached Bash script above please. |
One minor issue is the few ranges that file has - entries like First> and Last> mark the beginning/end of ranges - annoying :( |
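Those First>/Last> marker pairs can be folded into ranges while scanning the file. A Python sketch over the CJK pair (the 4E00-9FCC bounds match the grammar snippet later in this thread; sample lines are in the file's format):

```python
SAMPLE = """\
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
00C4;LATIN CAPITAL LETTER A WITH DIAERESIS;Lu;0;L;0041 0308;;;;N;;;;00E4;
"""

def expand_ranges(lines):
    """Turn <..., First>/<..., Last> entry pairs into (lo, hi) ranges;
    ordinary entries become single-point ranges."""
    out, first = [], None
    for line in lines:
        cp_hex, name = line.split(";")[:2]
        cp = int(cp_hex, 16)
        if name.endswith(", First>"):
            first = cp
        elif name.endswith(", Last>"):
            out.append((first, cp))
            first = None
        else:
            out.append((cp, cp))
    return out

print(expand_ranges(SAMPLE.splitlines()))  # [(19968, 40908), (196, 196)]
```

This keeps the big CJK and Hangul blocks as single ranges instead of tens of thousands of individual entries.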
Sorry, this does not make sense without the "Mn" at least. Example: อักษรไทย, you will need the A Vowel. |
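The point about Thai vowel signs can be checked directly: they are combining marks (general category Mn), not letters, so a letters-only identifier rule rejects them. Python's unicodedata serves as an independent reference here:

```python
import unicodedata

# THAI CHARACTER O ANG (a letter) followed by MAI HAN-AKAT (a vowel sign)
for ch in "\u0E2D\u0E31":
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)}")
# U+0E2D Lo
# U+0E31 Mn
```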
I didn't come up with the spec lol, so I can't do that if it's not in there! Also, they had:
I need to add that as well |
Ideally we should have these as predefined in JavaCC so any other grammar wanting to use it will benefit from it. If you want to contribute that to JavaCC, please go ahead. I will keep it like this for now. |
In the grammar, I had:
One more todo lol. I will add it today. That will give you 'Mn' (I also noticed that for Telugu - having another vowel gives syntax error) |
Agreed. We can close this issue. |
I was good with 7 bit ASCII, but my wife stood right behind me!
While you are on it: your OUTPUT file has a spelling error (sorry for being pedantic.) |
Yeah like my PR title said - it's the initial implementation. Let me know if you can take it over and fix it up. |
|
Yes, if SH/Bash is acceptable. |
Sure as long as it works and we can verify that it does, we are good! Let's add a test that can be generated from Unicodedata.txt one for each allowed char (and also a negative test to make sure we did not add anything extra) |
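A sketch of how such tests could be generated; the statement shape copies the "SELECT 1 ఒకట;" example above, and the expected pass/fail flag would be fed to whatever test harness drives the parser (which is not shown here):

```python
def gen_cases(allowed, disallowed):
    """One positive SQL statement per allowed code point,
    one negative statement per disallowed code point."""
    pos = [(f"SELECT 1 {chr(cp)};", True) for cp in allowed]
    neg = [(f"SELECT 1 {chr(cp)};", False) for cp in disallowed]
    return pos + neg

# e.g. TELUGU LETTER O must parse; '#' must not
cases = gen_cases(allowed=[0x0C12], disallowed=[ord("#")])
for sql, should_parse in cases:
    print(repr(sql), should_parse)
```

Deriving both lists from UnicodeData.txt gives exactly the "one test per allowed char plus negative tests" coverage suggested above.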
OK added the identifier extend stuff and now it works for vowel additions in Telugu. So should work for Thai as well. Check it out. The PR is now clean. I removed my awk shit. It just uses the full list. You can add your shell script separately. I will keep the reference grammar clean. |
FYI - the spec does NOT allow underscore "_" as part of identifier lol - learnt the hard way today. |
Greetings, from my other project I have learned that the CJK Block needs to be added explicitly: // CJK Unified Ideographs block according to https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
| <#CJK: ["\u4E00"-"\u62FF", "\u6300"-"\u77FF", "\u7800"-"\u8CFF", "\u8D00"-"\u9FCC"]> |
Parse error at line 1, column 10. Encountered: from
select * from मकान;
Same for Thai, Traditional Chinese and German Umlauts.
I believe we will need to allow Unicode Alphabet Letters explicitly?