
Meta-issue: ICU4X as a low-level dependency #5124

Open
sffc opened this issue Jun 25, 2024 · 19 comments
Labels
C-meta Component: Relating to ICU4X as a whole discuss Discuss at a future ICU4X-SC meeting S-epic Size: Major project (create smaller child issues) U-google User: Google 1st party

Comments

@sffc
Member

sffc commented Jun 25, 2024

ICU4X aims to be lightweight and portable, and it achieves those goals after all is said and done, but during the build process, ICU4X is fairly hefty when it comes to linker size, compile times, source code size, and number of crates.

This was observed downstream in servo/rust-url#939 (#5120, #5121), as well as elsewhere.

There are low-hanging things we can do to improve this as well as other things that might have tradeoffs that we need to weigh.

CC @Manishearth @hsivonen

@sffc sffc added C-meta Component: Relating to ICU4X as a whole discuss Discuss at a future ICU4X-SC meeting S-epic Size: Major project (create smaller child issues) U-google User: Google 1st party labels Jun 25, 2024
@sffc
Member Author

sffc commented Jun 25, 2024

Notes from 2024-06-20:

Issues:

servo/rust-url#937
servo/rust-url#939

original PR: servo/rust-url#923

Discussion:

  • @Manishearth - rust-url is a URL parser. The idna crate was reimplemented on top of icu4x, using the normalizer and the uts46 data. That removed a lot of implementation code and a dependency on unicode-normalization, and it saves idna from having to maintain its own tables. But the icu4x dependency got backed out due to various issues, which I'll list out. For crates like rust-url, here are the things their users want:
    • The lower down the dependency tree, the more conservative you need to be on MSRV, more conservative than a project of our size can be.
    • Compile times, especially depending on syn.
    • Sheer number of crates; more things to wrangle.
  • @Manishearth - The end result is that their users don't like getting these extra dependencies. My take is that unicode_rs exists and is fine. People could add feature-gated dependencies on icu4x from these crates. I think we should ameliorate some of these problems. We don't need as many dependencies for baked data. Maybe we could get rid of icu_locale_core in icu_properties for example.
  • @robertbastian - Do these crates use zerovec-derive?
  • @Manishearth - It uses CodePointTrie.
  • @robertbastian - We have 30 #[make_ule] in our repo. Can we just stop using it?
  • @Manishearth - That buys us nothing. Even if we do that, they still won't take us as a dependency.
  • @robertbastian - But syn is the biggest chunk.
  • @Manishearth
  • @sffc I do think from a policy position, it's important for us to endeavor to be friendly to the Rust ecosystem. This is part of Henri's work to integrate with Firefox, and he wants to centralize the idna data with ICU4X. This is a positive change, and should be landed behind a feature flag. No controversy on those two things? (general assent from group)
  • @sffc ICU4X is a library. It's not a final end product. Library typically means kind of low level in the tree.
  • @sffc The core components of ICU4X, like zerovec and things. Let's look at zerovec itself; we want it to be a core utility. We've spent a lot of time investing into that for that purpose. zerovec requires syn which makes it more difficult for certain kinds of clients to depend on. icu_properties is low level and we should endeavor to make it clean. We can't get all the way to unicode-normalization, but we can get most of the way there.
  • @sffc I don't agree with Manish's statement that there's nothing we can do to get to the point where people will use ICU4X in this crate as a low level dependency. ICU4X has a bunch of things going for it: our data model, the fact that we come from Unicode, the fact that we keep our data up to date. There may be some cost to that. On all three fronts: compile times, MSRV, depcount, we can make progress. There's a lot of low hanging fruit. If we tackled this low hanging fruit, we may be able to get clients to adopt ICU4X.
  • @Manishearth - People using rust-url aren't looking for zero copy deserialization.
  • @Manishearth - Making a static-only properties crate seems like a win. But stopping using #[make_ule] is a cost to us and I don't see the tradeoff.
  • @robertbastian - All I see on the issue is MSRV and compile times.
  • @Manishearth - The others are things I expect will come up. This type of discussion has happened before.
  • @sffc - The types of problems highlighted in this issue are exacerbated by rust-url being a low-level crate, but these problems are adoption problems for all types of clients.
  • @Manishearth I think this should be a goal, but one for which we are careful not to sacrifice anything but developer time/prioritization (do not sacrifice code cleanliness or whatever)
  • @sffc we can get rid of make_ule in properties with a macro probably
  • @Manishearth agreed
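The #[make_ule] discussion above is about impls along these lines. A minimal sketch, using a simplified stand-in trait (this is NOT zerovec's actual ULE API; the real trait also covers alignment and zero-copy reinterpretation of byte slices):

```rust
// Simplified stand-in for zerovec's ULE trait, for illustration only.
pub unsafe trait Ule: Copy + Sized {
    /// Check that every byte is a valid encoding of this type.
    fn validate(bytes: &[u8]) -> Result<(), &'static str>;
}

/// A toy property value: a General_Category-like enum stored as one byte.
#[derive(Copy, Clone, Debug, PartialEq)]
#[repr(u8)]
pub enum Category {
    Letter = 0,
    Digit = 1,
    Other = 2,
}

// This is the shape of impl that #[make_ule] derives today. Writing it by
// hand (or generating it with a lighter declarative macro) drops the
// syn/proc-macro dependency, at the cost of keeping the unsafe block
// in-tree where it must be audited directly.
unsafe impl Ule for Category {
    fn validate(bytes: &[u8]) -> Result<(), &'static str> {
        if bytes.iter().all(|&b| b <= Category::Other as u8) {
            Ok(())
        } else {
            Err("invalid Category discriminant")
        }
    }
}
```

The tradeoff discussed above is exactly this: the derive keeps the unsafe in one audited macro crate, while hand-written impls multiply the unsafe but remove syn from the build graph.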

@Manishearth
Member

Manishearth commented Jun 25, 2024

Explicitly writing out some of the things I said during the meeting (but @hsivonen wasn't present):

In general I think serving lower-level crates is a good goal. But I want to highlight that these lower-level crates live on a spectrum, and rust-url is on the far end of that spectrum: it is used very heavily, and the union of all its users' needs can be rather precise.

I am very much in favor of identifying incremental changes that bring us closer to that end of the spectrum. There's a lot of low hanging fruit here, including getting rid of syn deps for many of our crates.

I am less in favor of treating "usable by rust-url as a default dependency" as a goal in and of itself. It's a good north star to aim for, but I want to caution us against having hope that it will happen, or treating it as a goal in service of which we sacrifice other stuff, like code cleanliness or unsafe code proliferation. I'm happy to see Henri continue to discuss the topic on the rust-url thread since there is still some chance, but I don't want ICU4X to treat this as something definitely achievable.

(The situation I really wish to avoid is that we start giving up other things in service of making rust-url a client, and then rust-url's users oppose it anyway)

@Manishearth
Member

Full list of low hanging fruit that comes to my mind:

@djc

djc commented Jun 26, 2024

Appreciate this discussion! While I agree that url might be on the far end of what low-level dependencies require, in the end I think much of ICU4X is sort of conceptually a low-level dependency due to the subject matter, and so IMO it would be great if it were an awesome candidate across the board for all sorts of low-level crates. For one example, I'm also a chrono maintainer and I think depending on date/time internationalization stuff from ICU4X has come up, but in my perception chrono is probably even more low-level than url.

For crates that are all part of the same ecosystem, maybe it would be reasonable to expand "macros" using pre-publish codegen? This can be a pretty nice technique for moving compile-time dependencies to dev-dependencies and generally help save compile time -- I imagine a lot of the Unicode data is statically known based on data files and doesn't conceptually need to be derived/compiled by downstream projects?

See things like mattsse/chromiumoxide#80 and https://github.com/open-telemetry/opentelemetry-rust-contrib/blob/main/opentelemetry-stackdriver/tests/generate.rs.
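The pattern those links show can be sketched as follows (hypothetical generator and file path; the linked repos do this in a `tests/generate.rs`). The generated file is committed, so downstream builds never run the generator or pull in its dependencies; a freshness check catches a stale checked-in copy on the maintainer's side:

```rust
use std::fs;

// Hypothetical generator: in a real setup this would emit the tables or
// impls that a proc macro would otherwise expand at every downstream build.
pub fn generate_source() -> String {
    let table: [u8; 4] = [0, 1, 1, 2];
    let items: Vec<String> = table.iter().map(|b| b.to_string()).collect();
    format!("pub static TABLE: [u8; 4] = [{}];\n", items.join(", "))
}

/// Run as a #[test] in practice: fails for maintainers when the committed
/// src/generated.rs (hypothetical path) is out of date, while downstream
/// users only ever compile the committed file.
pub fn generated_code_is_up_to_date(path: &str) -> bool {
    fs::read_to_string(path)
        .map(|on_disk| on_disk == generate_source())
        .unwrap_or(false)
}
```

The design choice is that codegen moves from a build-time proc-macro dependency to a dev-time step, trading compile time for checked-in (and thus directly auditable) output.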

@robertbastian
Member

For crates that are all part of the same ecosystem, maybe it would be reasonable to expand "macros" using pre-publish codegen?

This will put unsafe code into every crate, which people generally dislike.

@djc

djc commented Jun 26, 2024

For crates that are all part of the same ecosystem, maybe it would be reasonable to expand "macros" using pre-publish codegen?

This will put unsafe code into every crate, which people generally dislike.

IMO that's just silly -- whether you publish code with derive macros that then expand to unsafe, or generate the code containing unsafe pre-publish, is exactly equivalent, and I don't think you should be afraid of making this argument. It should be more about ensuring the used abstractions are safe in practice.

@robertbastian
Member

This makes a big difference for auditing. You have to audit the code of the derive macro once, but when it's inlined you have no guarantee where it came from, and you'll have to audit it hundreds of times.

@djc

djc commented Jun 26, 2024

But if you're generating code within crates originating from the icu4x project, surely the auditing issue is solvable?

@robertbastian
Member

"The icu4x project" is not something that exists on crates.io. If crate foo adds an icu dependency, and someone with a code audit requirement wants to use foo, they will need to audit the expanded version of icu if we expand it before publishing to crates.io.

@sffc
Member Author

sffc commented Jun 26, 2024

We already check in unsafe generated code from databake so that clients don't need to run or depend on icu4x-datagen.

https://github.com/unicode-org/icu4x/tree/main/provider/data/properties/data

I can certainly see an argument that checking in generated unsafe impls could be seen in a similar light.

@hsivonen
Member

The way I view url and ICU4X is this: url is very common, and it depends on idna. idna requires a Unicode normalizer and access to four Unicode properties (one of which is also part of normalization and makes sense to be obtained from the normalizer, so really normalization and three properties). The ICU4X efforts at small binary footprint are (partially) defeated if an app ends up with another copy of data for what idna needs. Therefore, for ICU4X to deliver on small footprint of the compiled product, it needs to be able to serve as the dependency that idna uses for Unicode data.

idna 1.0 demonstrates ICU4X success as far as the characteristics of the compilation output go: Despite ICU4X trading away performance in the interest of normalization data sliceability, idna 1.0 results in better performance and smaller binary size than idna 0.5.

As for concerns about the compilation process itself:

I think it's bad to accept "larger number of crates is worse" as a problem to be addressed. If anything, part of the compilation time increase is that we have a catch-all crate for properties that's been designed to be sliced up by LTO instead of having a crate per property. A crate per property would obviously result in a larger number of crates, but that approach (as in https://crates.io/crates/unicode-joining-type and https://crates.io/crates/unicode-general-category ) would likely improve the compile time of idna.

MSRV and compile time are very different concerns: MSRV is a non-issue when the user of Rust is on board with "stability without stagnation": ICU4X's MSRV will stay far enough back that anyone who keeps updating their toolchain will not see an MSRV problem even if they don't update Rust all the time. The MSRV issue affects folks who do "stability" via Debian-level stagnation. (AFAICT, ICU4X's MSRV accommodates users who are on Red Hat's or Ferrocene's Rust toolchain cycle. The former having come up in our policy discussion and the latter by coincidence.)

Compile times, on the other hand, affect users regardless of compiler version (though newer compiler may be faster).

While more energy usage is worse than less energy usage, and in that sense excessive compile compute is always bad, the developer experience impact of ICU4X compile times probably varies significantly by hardware and the overall application dependency graph.

servo/rust-url#939 says "now take a whole 20 seconds or so longer to compile than before". When I compile reqwest without application code, I see the compile take about 1.3 seconds longer (dev and release) on M3 Pro (macOS) and 1.0 seconds (release) to 1.2 seconds (dev) longer on Threadripper Pro 5975WX (Ubuntu). See https://hsivonen.com/test/moz/reqwest-idna/ (Note that reqwest builds a different dependency graph on macOS and Linux.)

In both cases, but particularly on the Threadripper, the CPU is underutilized, so if the application has another long dependency chain that does not have url or reqwest in it, there's a good opportunity to absorb ICU4X's extra 1.3 seconds in build parallelism. Of course, if there isn't such a chain, then there isn't.

It looks like the sources for the compile time increase are:

  1. ICU4X supporting dynamic data loading, even though it's not actually used here. Proc macros can be attributed to this.
  2. The data loading mechanism supporting locale-dependent data despite this case only using locale-independent data.
  3. icu_normalizer depending on a type from icu_properties, so the two cannot compile in parallel.
  4. The design of icu_properties being to compile everything first and then LTO away the unused parts, as opposed to manually picking the properties of interest for compilation.
  5. icu_properties carrying some locale-dependent CLDR data despite this case involving only locale-independent Unicode Character Database data.
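A hypothetical Cargo feature layout addressing points 1, 2, and 5 might look like the following sketch (feature names are illustrative, not ICU4X's actual features):

```toml
[features]
# Baked, locale-independent tables only: what an idna-style user needs.
default = ["compiled_data"]
compiled_data = []
# Dynamic data loading and locale-dependent data become opt-in, keeping
# serde and the runtime data-provider machinery out of the default build.
serde = ["dep:serde"]
locale_data = []
```

Whether such a split is practical depends on how deeply the data-provider plumbing is woven through the constructors, which is what the list above enumerates.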

As for audits:

If the recipient really wants to audit everything themselves, then generated code on crates.io is worse for auditability. However, if the recipient uses cargo-vet and trusts the core participants in the cargo-vet ecosystem, chances are that the imported audits will cover ICU4X. I've pondered whether I should propose that Unicode or the ICU4X project become a cargo-vet audit publisher, but it's not really clear how useful that would be compared to Mozilla and Google already being cargo-vet publishers.

@Manishearth
Member

If the recipient really wants to audit everything themselves, then generated code on crates.io is worse for auditability. However, if the recipient uses cargo-vet and trusts the core participants in the cargo-vet ecosystem, chances are that the imported audits will cover ICU4X. I've pondered whether I should propose that Unicode or the ICU4X project become a cargo-vet audit publisher, but it's not really clear how useful that would be compared to Mozilla and Google already being cargo-vet publishers.

Trust is multifaceted and lives on a spectrum, even if people are relying on cargo-vet, "this is harder to audit" is still worse. As someone both publishing and performing unsafe reviews, it is easier to trust an external unsafe review if it is for a simpler crate.

I think it's bad to accept "larger number of crates is worse" as a problem to be addressed

From a supply chain / trust perspective it tends to be trickier to manage, even for multiple crates from a single project. It's not a hard blocker, but I've been on the other side of this equation and it is legitimately sometimes a concern.

I think it is a useful endeavor to try and reduce crate count where unnecessary. I don't think going as far as slicing per-property is something we should do.

@Manishearth
Member

Manishearth commented Jun 26, 2024

IMO that's just silly -- whether you publish code with derive macros that then expand to unsafe, or generate the code containing unsafe pre-publish, is exactly equivalent, and I don't think you should be afraid of making this argument. It should be more about ensuring the used abstractions are safe in practice.

@djc I regularly perform unsafe audits for Google, it is a major impact on the auditing process when there's a bunch more unsafe vs macros/derives that generate unsafe (which are hard to review but not as tedious). This isn't silly, it's a real concern that will keep cropping up.

For crates that are all part of the same ecosystem, maybe it would be reasonable to expand "macros" using pre-publish codegen? This can be a pretty nice technique for moving compile-time dependencies to dev-dependencies and generally help save compile time -- I imagine a lot of the Unicode data is statically known based on data files and doesn't conceptually need to be derived/compiled by downstream projects?

The unicode data is already pre-baked into the crate. The derive macros do not generate unicode data bundles (we use the icu4x-datagen binary for that), they generate more fundamental stuff like the ULE/VarULE and Yokeable impls, which ICU4X's architecture relies on. We have some ideas of getting rid of them for some crates (in compiled data mode Yokeable may not strictly be necessary, and ULE/VarULE in many cases could be macro-generated, see #5127)

@djc

djc commented Jun 27, 2024

IMO that's just silly -- whether you publish code with derive macros that then expand to unsafe, or generate the code containing unsafe pre-publish, is exactly equivalent, and I don't think you should be afraid of making this argument. It should be more about ensuring the used abstractions are safe in practice.

@djc I regularly perform unsafe audits for Google, it is a major impact on the auditing process when there's a bunch more unsafe vs macros/derives that generate unsafe (which are hard to review but not as tedious). This isn't silly, it's a real concern that will keep cropping up.

Sorry, I was thinking more of random reddit commenters but failed to consider at-scale unsafe audits. Still feels like there's a way out here where you could mark generated code with some tokens that make it obvious where/how the code was generated and this would make things easier for the auditor?

The unicode data is already pre-baked into the crate. The derive macros do not generate unicode data bundles (we use the icu4x-datagen binary for that), they generate more fundamental stuff like the ULE/VarULE and Yokeable impls, which ICU4X's architecture relies on. We have some ideas of getting rid of them for some crates (in compiled data mode Yokeable may not strictly be necessary, and ULE/VarULE in many cases could be macro-generated, see #5127)

Sounds great!

MSRV and compile time are very different concerns: MSRV is a non-issue when the user of Rust is on board with "stability without stagnation": ICU4X's MSRV will stay far enough back that anyone who keeps updating their toolchain will not see an MSRV problem even if they don't update Rust all the time. The MSRV issue affects folks who do "stability" via Debian-level stagnation. (AFAICT, ICU4X's MSRV accommodates users who are on Red Hat's or Ferrocene's Rust toolchain cycle. The former having come up in our policy discussion and the latter by coincidence.)

Similar to @Manishearth's comments on the "number of crates is irrelevant" claim, IMO this is just not that simple. As a fairly experienced library maintainer, there exists some pressure to make things easy to compile on Debian installations by "users" (as opposed to "developers") -- cargo install is a pretty easy distribution method, and if there is not otherwise a lot of pressure in the library to upgrade to a newer MSRV, it sucks to be "forced" to bump the MSRV due to lower-level libraries.

I could turn the argument around to say that "stability without stagnation" should also mean that you can still get bugfix releases for most of your libraries even when you are stuck on an older compiler. (I also don't think it's fair to call Debian's release policy stagnation -- it slows things down but it definitely keeps progressing.)

(Also note that it appears that MS is funding work at the Rust Foundation to come up with an LTS policy for Rust itself, so it seems pressure on staying compatible with older compilers will not decrease, and might even increase as/while the amount of change in the toolchain gets smaller over time.)

@robertbastian robertbastian added this to the 2.x Priority ⟨P2⟩ milestone Jun 27, 2024
@Manishearth
Member

Still feels like there's a way out here where you could mark generated code with some tokens that make it obvious where/how the code was generated and this would make things easier for the auditor?

This assumes that we control other organizations' review processes. Like, yes, we can make things easier for the auditor, but it's still more complicated, and definitely still a level above not having unsafe code at all, which can typically avoid needing a human involved to verify these things.

I mentioned Google as an example, not as the only unsafe audit process in existence.

Also note that it appears that MS is funding work at the Rust Foundation to come up with an LTS policy for Rust itself, so it seems pressure on staying compatible with older compilers will not decrease, and might even increase as/while the amount of change in the toolchain gets smaller over time

yes, though I imagine libraries that care will come out with LTS versions too. rust-url LTS would be able to use the older idna code, or older ICU4X, or something. The moment you're in a special scenario like this a lot of additional measures can be taken.

@hsivonen
Member

hsivonen commented Jul 8, 2024

Filed a new finding relevant to this meta issue: compiled_data constructors having a measurable run-time cost: #5187

@djc

djc commented Jul 10, 2024

The lower down the dependency tree, the more conservative you need to be on MSRV, more conservative than a project of our size can be.

Why is this? What are important things you need from 1.67 that aren't in 1.63?

@robertbastian
Member

  • 1.64: Cargo workspace inheritance
  • 1.65: let-else, GATs
  • 1.67: {integer}::ilog10, const char constructors

=== Current MSRV (1.67) ===

  • 1.68: default alloc error handler
  • 1.70: OnceCell, OnceLock

=== ICU4X 2.0 MSRV (1.70) ===

  • 1.73: {integer}::div_ceil
  • 1.74: const_transmute_copy, Cargo.toml [lints]
  • 1.75: return-position impl Trait in traits
  • 1.79: inline const expressions, associated type bounds, <[u8]>::utf8_chunks
  • 1.80: LazyCell, LazyLock
  • 1.81: Error in core

Rust 1.67 is 1.5 years old today, Rust 1.70 will be over a year old when we release ICU4X 2.0.
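For concreteness, two of the post-1.63 features listed above, sketched in isolation (illustrative code, not taken from ICU4X):

```rust
use std::sync::OnceLock;

// let-else (stabilized in Rust 1.65): early return without nesting a match.
pub fn first_digit(s: &str) -> Option<u32> {
    let Some(c) = s.chars().next() else {
        return None;
    };
    c.to_digit(10)
}

// OnceLock (stabilized in Rust 1.70): a lazily initialized static without
// an external dependency on once_cell or lazy_static.
static SQUARES: OnceLock<Vec<u32>> = OnceLock::new();

pub fn squares() -> &'static [u32] {
    SQUARES.get_or_init(|| (0..4).map(|i| i * i).collect())
}
```

Both are exactly the kind of feature a library gives up by pinning to a Debian-era toolchain, which is the tradeoff being debated here.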

@hsivonen
Member

According to https://mastodon.social/@kornel/112809004985600187 , fewer than 0.1% of crates.io requests are made by Cargo 1.63. It looks like very bad allocation of effort to put effort into making newly-published crate versions accommodate rustc from Debian stable.
