Cache "live" results from crates.io #48

Open
Shnatsel opened this issue Feb 24, 2021 · 7 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@Shnatsel
Member

Shnatsel commented Feb 24, 2021

When downloading data via the crates.io API, we could cache it for later reuse. This would help if the user wants to view both crates and publishers commands for their crate or adjust the cargo-metadata parameters (e.g. target platform).

The timestamp of when the data was downloaded should be preserved; the cached data should be used only if the --cache-max-age configuration allows it.

If there are any cache entries with a timestamp from the future, they should be discarded.
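A minimal sketch of the freshness check described above, using only the standard library. The names `CacheEntry`, `CacheDecision` and `check_entry` are hypothetical and not taken from the project; `max_age` stands in for whatever value `--cache-max-age` resolves to.

```rust
use std::time::{Duration, SystemTime};

/// A cached API response plus the time it was downloaded.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, PartialEq)]
pub struct CacheEntry<T> {
    pub downloaded_at: SystemTime,
    pub data: T,
}

/// What to do with a cache entry when it is looked up.
#[derive(Debug, PartialEq)]
pub enum CacheDecision {
    /// Entry is recent enough to be used.
    Use,
    /// Entry is older than the allowed maximum age; re-download.
    Expired,
    /// Entry claims to come from the future; discard it.
    Discard,
}

/// Decide whether a cache entry may be used at time `now`,
/// given the maximum age allowed by the configuration.
pub fn check_entry<T>(
    entry: &CacheEntry<T>,
    now: SystemTime,
    max_age: Duration,
) -> CacheDecision {
    match now.duration_since(entry.downloaded_at) {
        // `duration_since` returns Err when `downloaded_at` is later than `now`,
        // i.e. the timestamp is from the future.
        Err(_) => CacheDecision::Discard,
        Ok(age) if age <= max_age => CacheDecision::Use,
        Ok(_) => CacheDecision::Expired,
    }
}
```

Passing `now` as a parameter rather than calling `SystemTime::now()` inside the function keeps the check easy to test; a caller would pass `SystemTime::now()` in practice.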

@Shnatsel Shnatsel added the enhancement New feature or request label Feb 24, 2021
@tarcieri
Member

Is the data you need cached locally somewhere already, e.g. either in the crates.io index itself (which can be consumed using the crates_index crate), or via the crate file cache located in ~/.cargo/registry/cache?

@Shnatsel
Member Author

Good question! Unfortunately it is not. We need the data about the crates.io publishers, which is present in neither of those places.

@tarcieri
Member

Aah, unfortunate. Perhaps it'd be worth opening an upstream issue to include that information in the index?

@Shnatsel
Member Author

I don't think it's a good idea to include it in the index, actually. This info is not needed for most uses - that's why it's not in the index!
It is included in the daily database dumps, but they are currently served as a monolithic ~250 MB .tar.gz archive even though we need only 10 MB (uncompressed) from it. Splitting that into a separate file would achieve a 100x reduction in traffic for the update subcommand; this is discussed in more detail in #45.

@Shnatsel
Member Author

If we choose to use a granular cache, it makes sense to store it on disk as JSON, since it's basically a map and we already depend on serde_json because we need to parse JSON from the crates.io API.

And we already have the cache directory created for storing the crates.io dump.

@tarcieri
Member

I'm not sure they ever made a conscious decision whether or not to include it in the index. It's a feature that was added to crates.io quite a while after the index was created. It's also (somewhat) low-cardinality data that would compress well.

I think the nice part about having it in the index is the index provides a timestamped/append-only(-ish) cryptographic(-ish, with the unfortunate problem of SHA-1 collisions) log, so including audit info would commit to that, as opposed to it potentially being retroactively modified by an attacker in the event of a crates.io compromise.

@Shnatsel
Member Author

https://crates.io/crates/structsy sounds like a better way to store data on disk than JSON files.

@Shnatsel Shnatsel added the help wanted Extra attention is needed label Jan 23, 2022

2 participants