Add support for unicode 16.0.0. #157
base: master
I haven't looked at Unicode 16; what are the non-binary properties for?
I think that this documents it: https://www.gnu.org/software/libunistring/manual/html_node/Indic-conjunct-break.html
@@ -46,7 +46,7 @@ let print_elements ch hashtbl cats =
   (fun (b, e) -> Printf.sprintf "0x%x, 0x%x" b e)
   (Cset.union_list (Hashtbl.find_all hashtbl c) :> (int * int) list)
   in
-  Printf.fprintf ch " let %s = Sedlex_cset.of_list\n [" c;
+  Printf.fprintf ch " let %s = Sedlex_utils.Cset.of_list\n [" c;
I don't understand why we need this diff
unicode.ml needs to be compilable from within src/syntax and as examples/unicode_old.ml.
Ideally, unicode.ml should be copiable to unicode_old.ml without any changes.
If you use Sedlex_ppx.Sedlex_cset, it does not compile in src/syntax.
Happy to find a better solution but, also, I don't think that it matters.
examples/regressions.ml
Outdated
@@ -32,6 +32,7 @@ let compare name (old_l : (int * int) list) (new_l : Sedlex_ppx.Sedlex_cset.t) =
   code_points

 let test new_l (name, old_l) =
+  let old_l = Sedlex_utils.Cset.to_list old_l in
I think it would be better to manipulate cset directly. See #159
It still fails with master code (should be renamed main too). This call is just a pass-through. Looks like defining the type as private requires it. Again, not sure that this really matters.
type t = private (int * int) list
let to_list l = l
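A minimal sketch of why to_list can be a plain pass-through, using a hypothetical stand-in module named Cset_sketch (not the real Sedlex_utils.Cset, whose of_list may normalise its input). With a private type abbreviation, code outside the module cannot build a value of type t directly, but it can still read one, either through to_list or through an explicit coercion like the one already used in print_elements:

```ocaml
(* Hypothetical stand-in for the cset module; only the shape of the
   signature matters for this illustration. *)
module Cset_sketch : sig
  type t = private (int * int) list
  val of_list : (int * int) list -> t
  val to_list : t -> (int * int) list
end = struct
  type t = (int * int) list
  let of_list l = l (* assumption: no normalisation in this sketch *)
  let to_list l = l (* identity: inside the module, t is just a list *)
end

let s = Cset_sketch.of_list [ (0x41, 0x5a); (0x61, 0x7a) ]

(* Two ways to view the private value as a plain interval list. *)
let via_function = Cset_sketch.to_list s
let via_coercion = (s :> (int * int) list)

let () = assert (via_function = via_coercion)
```

Under these assumptions the two forms are interchangeable, so whether to_list stays is mostly a readability choice.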
@@ -1,6 +1,6 @@
 (executables
  (names tokenizer regressions complement subtraction repeat performance)
- (libraries sedlex sedlex_ppx)
+ (libraries sedlex sedlex_ppx sedlex_utils)
sedlex_utils only contains the cset implementation that is already accessible using Sedlex_ppx.Sedlex_cset.
@@ -38,6 +38,7 @@ let compare name (old_ : CSet.t) (new_ : CSet.t) =
 let test new_l (name, old_l) =
   (* Cn is for unassigned code points, which are allowed to be
    * used in future version. *)
+  let old_l = CSet.to_list old_l in
This line can be dropped, but you should also drop the lines below
let old_l =
List.fold_left
(fun acc (a, b) -> CSet.union acc (CSet.interval a b))
CSet.empty old_l
in
Then to_list is no longer needed probably
This PR adds support for Unicode 16.0.0.
Notes:
DerivedCoreProperties.txt. Those are not currently supported by the library and should be skipped for now.
0x1171e missing in mn. This could be intentional.
There is quite a bit of noise due to some required module renaming to make the new "old" unicode ml file compile in the regression tests.
Otherwise, this is a fairly straightforward update.
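As a rough illustration of the "should be skipped for now" note, here is a sketch of telling binary entries apart from non-binary ones in DerivedCoreProperties.txt. It assumes the usual UCD layout of semicolon-separated fields followed by a # comment, where a non-binary entry carries an extra value field. The names entry, classify and binary_entries are hypothetical and are not taken from the sedlex generator:

```ocaml
(* Hypothetical classification of one line of DerivedCoreProperties.txt. *)
type entry =
  | Binary of string * string              (* range, property *)
  | Non_binary of string * string * string (* range, property, value *)
  | Blank                                  (* empty, comment-only or unrecognised *)

let classify line =
  (* Drop the trailing "# ..." comment, if any. *)
  let data =
    match String.index_opt line '#' with
    | Some i -> String.sub line 0 i
    | None -> line
  in
  match List.map String.trim (String.split_on_char ';' data) with
  | [ range; prop ] when range <> "" -> Binary (range, prop)
  | [ range; prop; value ] -> Non_binary (range, prop, value)
  | _ -> Blank

(* Keep only the binary entries, skipping everything else. *)
let binary_entries lines =
  List.filter_map
    (fun l -> match classify l with Binary (r, p) -> Some (r, p) | _ -> None)
    lines

let sample =
  [ "# illustrative lines only";
    "0041..005A    ; Alphabetic # L&  [26]";
    "094D          ; InCB; Linker # Mn";
    "" ]

let kept = binary_entries sample
```

With this shape, supporting the non-binary properties later would only mean handling the Non_binary case instead of discarding it.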