Cache "live" results from crates.io #48

Shnatsel · 2021-02-24T18:19:06Z

When downloading data via the crates.io API, we could cache it for later reuse. This would help if the user wants to run both the crates and publishers commands on their crate, or to adjust the cargo-metadata parameters (e.g. target platform) between runs.

The timestamp of when the data was downloaded should be preserved; the cached data should be used only if the --cache-max-age configuration allows it.

If there are any cache entries with a timestamp from the future, they should be discarded.
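
A minimal sketch of that freshness rule, using only the standard library. The names `Freshness` and `check_freshness` are hypothetical, and `max_age` stands in for whatever value --cache-max-age resolves to; none of this is taken from the actual codebase.

```rust
use std::time::{Duration, SystemTime};

/// Outcome of checking one cache entry against the current time.
#[derive(Debug, PartialEq, Eq)]
pub enum Freshness {
    /// Young enough to be reused.
    Fresh,
    /// Older than the allowed maximum age; re-download.
    Stale,
    /// Timestamp lies in the future; discard the entry.
    FromFuture,
}

/// Classifies a cache entry by its download timestamp.
/// `max_age` is an assumed stand-in for the --cache-max-age setting.
pub fn check_freshness(
    downloaded_at: SystemTime,
    now: SystemTime,
    max_age: Duration,
) -> Freshness {
    match now.duration_since(downloaded_at) {
        Ok(age) if age <= max_age => Freshness::Fresh,
        Ok(_) => Freshness::Stale,
        // `duration_since` returns an error when `downloaded_at` is later than `now`.
        Err(_) => Freshness::FromFuture,
    }
}
```

Passing `now` in explicitly, rather than calling `SystemTime::now()` inside the function, keeps the check easy to test.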

tarcieri · 2021-02-24T18:21:39Z

Is the data you need cached locally somewhere already, e.g. either in the crates.io index itself (which can be consumed using the crates_index crate), or via the crate file cache located in ~/.cargo/registry/cache?

Shnatsel · 2021-02-24T18:25:54Z

Good question! Unfortunately it is not. We need the data about the crates.io publishers, which is present in neither of those places.

tarcieri · 2021-02-24T18:30:43Z

Aah, unfortunate. Perhaps it'd be worth opening an upstream issue to include that information in the index?

Shnatsel · 2021-02-24T18:33:24Z

I don't think it's a good idea to include it in the index, actually. This info is not needed for most uses - that's why it's not in the index!
It is included in the daily database dumps, but they are currently served as a monolithic ~250 MB .tar.gz archive even though we need only 10 MB (uncompressed) from it. Splitting that into a separate file would achieve a 100x reduction in traffic for the update subcommand; this is discussed in more detail in #45.

Shnatsel · 2021-02-24T18:34:38Z

If we choose to use a granular cache, it makes sense to store it on disk as JSON, since it's basically a map and we already depend on serde_json for parsing JSON from the crates.io API.

And we already have the cache directory created for storing the crates.io dump.
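
As a rough illustration of the "basically a map" shape, here is a dependency-free sketch of a granular cache keyed by crate name. `PublisherCache`, `CacheEntry`, and their fields are hypothetical names, and the on-disk JSON step is deliberately left out; with serde derives on these types, the map could presumably be written to and read from the cache directory via serde_json.

```rust
use std::collections::HashMap;

/// One cached API response. Field names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct CacheEntry {
    /// Seconds since the Unix epoch when the data was downloaded.
    pub fetched_at: u64,
    /// Publisher logins returned for the crate.
    pub publishers: Vec<String>,
}

/// In-memory cache keyed by crate name.
#[derive(Debug, Default)]
pub struct PublisherCache {
    entries: HashMap<String, CacheEntry>,
}

impl PublisherCache {
    /// Stores (or overwrites) the publishers for a crate, stamped with `now`.
    pub fn insert(&mut self, crate_name: &str, publishers: Vec<String>, now: u64) {
        let entry = CacheEntry { fetched_at: now, publishers };
        self.entries.insert(crate_name.to_string(), entry);
    }

    /// Returns cached publishers only if the entry is at most `max_age_secs`
    /// old. Entries stamped later than `now` are treated as misses.
    pub fn get(&self, crate_name: &str, now: u64, max_age_secs: u64) -> Option<&[String]> {
        let entry = self.entries.get(crate_name)?;
        // `checked_sub` yields `None` when the timestamp is in the future.
        let age = now.checked_sub(entry.fetched_at)?;
        if age <= max_age_secs {
            Some(entry.publishers.as_slice())
        } else {
            None
        }
    }
}
```

Storing the timestamp per entry, rather than once per file, lets individual crates expire independently.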

tarcieri · 2021-02-24T18:45:28Z

I'm not sure they ever made a conscious decision whether or not to include it in the index. It's a feature that was added to crates.io quite a while after the index was created. It's also (somewhat) low-cardinality data that would compress well.

I think the nice part about having it in the index is the index provides a timestamped/append-only(-ish) cryptographic(-ish, with the unfortunate problem of SHA-1 collisions) log, so including audit info would commit to that, as opposed to it potentially being retroactively modified by an attacker in the event of a crates.io compromise.

Shnatsel · 2022-01-23T23:05:28Z

https://crates.io/crates/structsy sounds like a better way to store data on disk than JSON files.

Shnatsel added the enhancement New feature or request label Feb 24, 2021

Shnatsel added the help wanted Extra attention is needed label Jan 23, 2022
