Meta-issue: ICU4X as a low-level dependency #5124
Comments
Notes from 2024-06-20: Issues: servo/rust-url#937 original PR: servo/rust-url#923 Discussion:
|
Explicitly writing out some of the things I said during the meeting (but @hsivonen wasn't present): In general I think serving lower-level crates is a good goal. But I want to highlight that these lower-level crates live on a spectrum, and rust-url is on the far end of that spectrum: it is used very heavily and the union of all its users' needs can be rather precise. I am very much in favor of identifying incremental changes that bring us closer to that end of the spectrum. There's a lot of low hanging fruit here, including getting rid of
I am less in favor of treating "usable by rust-url as a default dependency" as a goal in and of itself. It's a good north star to aim for, but I want to caution us against having hope that it will happen, or treating it as a goal in service of which we sacrifice other stuff, like code cleanliness or keeping unsafe code from proliferating. I'm happy to see Henri continue to discuss the topic on the rust-url thread since there is still some chance, but I don't want ICU4X to treat this as something definitely achievable.
(The situation I really wish to avoid is that we start giving up other things in service of making rust-url a client, and then rust-url's users oppose it anyway.) |
Full list of low hanging fruit that comes to my mind:
|
Appreciate this discussion! While I agree that url might be on the far end of what low-level dependencies require, in the end I think much of ICU4X is sort of conceptually a low-level dependency due to the subject matter, and so IMO it would be great if it was an awesome candidate across the board for all sorts of low-level crates. For one example, I'm also a chrono maintainer and I think depending on date/time internationalization stuff from ICU4X has come up, but in my perception chrono is probably even more low-level than url. For crates that are all part of the same ecosystem, maybe it would be reasonable to expand "macros" using pre-publish codegen? This can be a pretty nice technique for moving compile-time dependencies to dev-dependencies and generally help save compile time -- I imagine a lot of the Unicode data is statically known based on data files and doesn't conceptually need to be derived/compiled by downstream projects? See things like mattsse/chromiumoxide#80 and https://github.com/open-telemetry/opentelemetry-rust-contrib/blob/main/opentelemetry-stackdriver/tests/generate.rs. |
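A minimal sketch of the pre-publish codegen technique mentioned above, assuming a hypothetical property table (`RANGES`) and hypothetical helper names (`generate_table_source`, `is_up_to_date`); none of these are ICU4X APIs. The generator renders Rust source from statically known data, and a check compares the checked-in file against fresh output, so the generator only needs to run in a test and downstream builds compile the plain generated file:

```rust
use std::fmt::Write;

/// Hypothetical stand-in for data derived from Unicode data files:
/// (first code point, last code point, general category).
const RANGES: &[(u32, u32, &str)] = &[(0x0041, 0x005A, "Lu"), (0x0061, 0x007A, "Ll")];

/// Renders Rust source for a static lookup table.
/// In a real setup this would run before publishing, not in downstream builds.
fn generate_table_source(ranges: &[(u32, u32, &str)]) -> String {
    let mut out = String::from("// @generated; do not edit by hand.\n");
    out.push_str("pub static GENERAL_CATEGORY: &[(u32, u32, &str)] = &[\n");
    for (start, end, cat) in ranges {
        // Writing into a String cannot fail, so unwrap is fine here.
        writeln!(out, "    (0x{start:04X}, 0x{end:04X}, {cat:?}),").unwrap();
    }
    out.push_str("];\n");
    out
}

/// Returns true when the checked-in source matches freshly generated output.
/// A test would call this with the contents of the checked-in file.
fn is_up_to_date(checked_in: &str, ranges: &[(u32, u32, &str)]) -> bool {
    checked_in == generate_table_source(ranges)
}
```

In a test, one would read the checked-in file (for example with `std::fs::read_to_string`) and fail if `is_up_to_date` returns false, which keeps the generator and its dependencies in dev-dependencies only.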
This will put unsafe code into every crate, which people generally dislike. |
IMO that's just silly -- whether you publish code with derive macros that then goes to expand to |
This makes a big difference for auditing. You have to audit the code of the derive macro once, but when it's inlined you have no guarantee where it came from, and you'll have to audit it hundreds of times. |
But if you're generating code within crates originating from the icu4x project, surely the auditing issue is solvable? |
"The icu4x project" is not something that exists on crates.io. If crate |
We already do check-in unsafe generated code from databake so that clients don't need to run or depend on https://github.com/unicode-org/icu4x/tree/main/provider/data/properties/data I can certainly see an argument that checking-in generated unsafe impls could be seen in a similar light. |
The way I view
As for concerns about the compilation process itself: I think it's bad to accept "larger number of crates is worse" as a problem to be addressed. If anything, part of the compilation time increase is that we have a catch-all crate for properties that's been designed to be sliced up by LTO instead of having a crate per property. Crate per property would obviously result in a larger number of crates, but the approach of having a crate per property (as in https://crates.io/crates/unicode-joining-type and https://crates.io/crates/unicode-general-category ) would likely improve the compile time of
MSRV and compile time are very different concerns: MSRV is a non-issue when the user of Rust is on board with "stability without stagnation": ICU4X's MSRV will stay far enough back that anyone who keeps updating their toolchain will not see an MSRV problem even if they don't update Rust all the time. The MSRV issue affects folks who do "stability" via Debian-level stagnation. (AFAICT, ICU4X's MSRV accommodates users who are on Red Hat's or Ferrocene's Rust toolchain cycle. The former having come up in our policy discussion and the latter by coincidence.)
Compile times, on the other hand, affect users regardless of compiler version (though a newer compiler may be faster). While more energy usage is worse than less energy usage and in that sense excessive compile compute is always bad, the developer experience impact of ICU4X compile times probably varies significantly by hardware and the overall application dependency graph. servo/rust-url#939 says "now take a whole 20 seconds or so longer to compile than before". When I compile reqwest without application code, I see the compile take about 1.3 seconds longer (dev and release) on M3 Pro (macOS) and 1.0 seconds (release) to 1.2 seconds (dev) longer on Threadripper Pro 5975WX (Ubuntu). See https://hsivonen.com/test/moz/reqwest-idna/ (Note that reqwest builds a different dependency graph on macOS and Linux.)
In both cases, but particularly on the Threadripper, the CPU is underutilized, so if the application has another long dependency chain that does not have It looks like the sources for the compile time increase are:
As for audits: If the recipient really wants to audit everything themselves, then generated code on crates.io is worse for auditability. However, if the recipient uses cargo-vet and trusts the core participants in the cargo-vet ecosystem, chances are that the imported audits will cover ICU4X. I've pondered if I should propose that Unicode or the ICU4X project become a cargo-vet audit publisher, but it's not really clear how useful that would be compared to Mozilla and Google already being cargo-vet publishers. |
Trust is multifaceted and lives on a spectrum. Even if people are relying on cargo-vet, "this is harder to audit" is still worse. As someone both publishing and performing unsafe reviews, I find it easier to trust an external unsafe review if it is for a simpler crate.
From a supply chain / trust perspective it tends to be trickier to manage, even for multiple crates from a single project. It's not a hard blocker, but I've been on the other side of this equation and it is legitimately sometimes a concern. I think it is a useful endeavor to try and reduce crate count where unnecessary. I don't think going as far as slicing per-property is something we should do. |
@djc I regularly perform unsafe audits for Google, it is a major impact on the auditing process when there's a bunch more unsafe vs macros/derives that generate unsafe (which are hard to review but not as tedious). This isn't silly, it's a real concern that will keep cropping up.
The unicode data is already pre-baked into the crate. The derive macros do not generate unicode data bundles (we use the icu4x-datagen binary for that), they generate more fundamental stuff like the ULE/VarULE and Yokeable impls, which ICU4X's architecture relies on. We have some ideas of getting rid of them for some crates (in compiled data mode Yokeable may not strictly be necessary, and ULE/VarULE in many cases could be macro-generated, see #5127) |
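To make the audit concern concrete, here is a simplified illustration of the kind of unsafe code that unaligned-type impls involve. It is a sketch in plain Rust under stated assumptions, not the `zerovec` ULE trait and not the actual output of ICU4X's derives; `U32Unaligned` and its methods are hypothetical names. The single `unsafe` block with its safety argument is what a reviewer would otherwise meet once per type if such code were written out or checked in for each type rather than produced by one audited derive:

```rust
/// Hypothetical ULE-style type: a u32 stored as 4 little-endian bytes,
/// so it has alignment 1 and can live at any offset in a data blob.
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct U32Unaligned([u8; 4]);

impl U32Unaligned {
    /// Converts from the ordinary aligned representation.
    pub fn from_aligned(value: u32) -> Self {
        Self(value.to_le_bytes())
    }

    /// Converts back to the ordinary aligned representation.
    pub fn to_aligned(self) -> u32 {
        u32::from_le_bytes(self.0)
    }

    /// Reinterprets a byte slice as a slice of `U32Unaligned` without copying.
    /// Returns `None` if the length is not a multiple of 4.
    pub fn slice_from_bytes(bytes: &[u8]) -> Option<&[Self]> {
        if bytes.len() % 4 != 0 {
            return None;
        }
        // SAFETY: `Self` is `repr(transparent)` over `[u8; 4]`, so it has
        // size 4, alignment 1, and every bit pattern is valid. The length
        // was checked to be a multiple of 4, and the returned slice borrows
        // from `bytes`, so it cannot outlive the underlying data.
        Some(unsafe {
            core::slice::from_raw_parts(bytes.as_ptr().cast::<Self>(), bytes.len() / 4)
        })
    }
}
```

For example, `U32Unaligned::slice_from_bytes(&[1, 0, 0, 0])` yields a one-element slice whose element converts back to `1` via `to_aligned`.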
Sorry, I was thinking more of random reddit commenters but failed to consider at-scale
Sounds great!
Similar to @Manishearth's comments on the "number of crates is irrelevant", IMO this is just not that simple. As a fairly experienced library maintainer, there exists some pressure on making things easy to compile on Debian installations by "users" (as opposed to "developers") -- I could turn the argument around to say that "stability without stagnation" should also mean that you can still get bugfix releases for most of your libraries even when you are stuck on an older compiler. (I also don't think it's fair to call Debian's release policy stagnation -- it slows things down but it definitely keeps progressing.) (Also note that it appears that MS is funding work at the Rust Foundation to come up with an LTS policy for Rust itself, so it seems pressure on staying compatible with older compilers will not decrease, and might even increase as/while the amount of change in the toolchain gets smaller over time.) |
This assumes that we control other organizations' review processes. Like, yes, we can make things easier for the auditor, but it's still more complicated, and definitely still a level above not having unsafe code at all, which can typically avoid needing a human involved to verify these things. I mentioned Google as an example, not as the only unsafe audit process in existence.
Yes, though I imagine libraries that care will come out with LTS versions too. rust-url LTS would be able to use the older idna code, or older ICU4X, or something. The moment you're in a special scenario like this a lot of additional measures can be taken. |
Filed a new finding relevant to this meta issue. |
Why is this? What are important things you need from 1.67 that aren't in 1.63? |
Rust 1.67 is 1.5 years old today, Rust 1.70 will be over a year old when we release ICU4X 2.0. |
According to https://mastodon.social/@kornel/112809004985600187 , fewer than 0.1% of crates.io requests are made by Cargo 1.63. It looks like a very bad allocation of effort to make newly-published crate versions accommodate rustc from Debian stable. |
ICU4X aims to be lightweight and portable, and it achieves those goals after all is said and done, but during the build process, ICU4X is fairly hefty when it comes to linker size, compile times, source code size, and number of crates.
This was observed downstream in servo/rust-url#939 (#5120, #5121), as well as elsewhere.
There are low-hanging things we can do to improve this as well as other things that might have tradeoffs that we need to weigh.
CC @Manishearth @hsivonen