feat: cache npm metadata #5491
Implemented cached metadata trimming and the metadata cache size on the berry repo (install after |
Overall this looks like a good idea; a couple of questions:
I'm seeing similar results on GH actions now.
These are the benchmark results for serializing and deserializing a mega cache file:
That's because we have to do it for the install state since it includes
Here I'll keep using JSON because overall it's faster.
It might be a bit faster, but it would make the code a lot more complicated because it requires mutexing and doesn't play nicely with the existing resolver architecture which is better suited for concurrent operations. For now, I think the current implementation is fine, we can always tweak it in the future if we need to. |
PR is now finished and ready for review.
Final benchmarks from GitHub Actions (Gatsby):
TL;DR:
Before
Testing install-full-cold
Benchmark #1: yarn install
Time (mean ± σ): 53.801 s ± 0.328 s [User: 76.642 s, System: 5.323 s]
Range (min … max): 53.312 s … 54.258 s 10 runs
Testing install-cache-only
Benchmark #1: yarn install
Time (mean ± σ): 23.105 s ± 0.313 s [User: 25.800 s, System: 3.807 s]
Range (min … max): 22.708 s … 23.590 s 10 runs
Testing install-cache-and-lock
Benchmark #1: yarn install
Time (mean ± σ): 7.691 s ± 0.071 s [User: 9.150 s, System: 1.762 s]
Range (min … max): 7.601 s … 7.839 s 10 runs
Testing install-ready
Benchmark #1: yarn add dummy-pkg@link:./dummy-pkg
Time (mean ± σ): 1.893 s ± 0.019 s [User: 2.268 s, System: 0.211 s]
Range (min … max): 1.863 s … 1.924 s 10 runs
After
Testing install-full-cold
Benchmark #1: yarn install
Time (mean ± σ): 53.710 s ± 0.372 s [User: 79.227 s, System: 4.789 s]
Range (min … max): 53.285 s … 54.395 s 10 runs
Testing install-cache-only
Benchmark #1: yarn install
Time (mean ± σ): 12.371 s ± 0.277 s [User: 13.988 s, System: 2.125 s]
Range (min … max): 11.930 s … 12.803 s 10 runs
Testing install-cache-and-lock
Benchmark #1: yarn install
Time (mean ± σ): 7.677 s ± 0.053 s [User: 9.183 s, System: 1.722 s]
Range (min … max): 7.574 s … 7.736 s 10 runs
Testing install-ready
Benchmark #1: yarn add dummy-pkg@link:./dummy-pkg
Time (mean ± σ): 1.898 s ± 0.028 s [User: 2.270 s, System: 0.205 s]
Range (min … max): 1.859 s … 1.950 s 10 runs
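To put the numbers above in perspective, here is a small sketch that derives the relative change from the reported means. The values are copied from the benchmark output; the `results` array and the helper names are only for illustration.

```typescript
// Mean wall-clock times in seconds, copied from the hyperfine output above.
const results = [
  {scenario: "install-full-cold", before: 53.801, after: 53.710},
  {scenario: "install-cache-only", before: 23.105, after: 12.371},
  {scenario: "install-cache-and-lock", before: 7.691, after: 7.677},
  {scenario: "install-ready", before: 1.893, after: 1.898},
];

// How many times faster the "after" run is compared to the "before" run.
function speedup(before: number, after: number): number {
  return before / after;
}

// Relative change of the mean, in percent (negative means faster).
function percentChange(before: number, after: number): number {
  return ((after - before) / before) * 100;
}

results.forEach(({scenario, before, after}) => {
  console.log(
    `${scenario}: ${speedup(before, after).toFixed(2)}x, ` +
    `${percentChange(before, after).toFixed(1)}%`,
  );
});
```

Only `install-cache-only` moves meaningfully (about 1.87x faster, roughly a 46.5% reduction). In the other three scenarios the difference between the means is smaller than the spread of the individual runs.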
Thanks @paul-soporan, some questions
PS: That interests me because I could update some gists:
It is completely independent of the archive cache or its settings. It will always be inside
It is a global cache, that applies to all projects on the machine, and will grow indefinitely, just like the mirror / global cache (which is not purged over time at the moment either). The more metadata that's stored in this cache (and the more recent it is), the faster installs will be in new projects when dependencies aren't already part of the lockfile.
It is a registry metadata cache. The lockfile is based on it, not the other way around. It is also completely useless when the lockfile exists and is up-to-date with the manifests.
Only if you want to trim it to reduce disk usage. The data in the cache will only be used when it's safe to do so, so we shouldn't run into the risk of stale data. We store the i.e. We already have a builtin invalidation mechanism. Note: We also use the cached data directly for exact version (e.g. when
Most likely yes, even though it's not a breaking change. We could technically backport it to 3.x, but 4.x is getting closer and closer and we probably won't have another minor 3.x version. |
Thanks a lot for this comprehensive answer. |
}

function getMetadataFolder(configuration: Configuration) {
  return ppath.join(configuration.get(`globalFolder`), `npm-metadata`);
I'm thinking perhaps `<global>/cache/metadata` might be a better location, what do you think? Actions that already cache the `<global>/cache` folder would benefit from that out of the box, rather than having to add a new path (or cache the whole `<global>`, which may not be as common).
I'd also suggest moving the `npm-` prefix from the folder to each individual file name, for consistency with the cache itself (which contains files from the npm, git, http fetchers, etc).
Ping on this comment?
(Sorry for the delay.)

> I'm thinking perhaps `<global>/cache/metadata` might be a better location, what do you think?

I don't really like the idea, the mirror and the npm metadata cache are 2 separate things, and merging them in a single folder would lead to confusion. We've also got `index` which would have to be moved to `cache` too with this line of thinking.
This would just complicate things like `yarn cache clean`, where `--mirror` would have to stop removing the entire folder and just remove the archives.
I'd prefer all of them to be in separate folders. One thing we could do would be to move the mirror to `<global>/cache/mirror` in addition to what you proposed, then I'd be fine with it and it would still benefit from existing caching (but it would look a bit weird to have the global cache in `<global>/cache/mirror`, but I guess that's just the consequence of the mirror and the global cache being the same thing).
> I'd also suggest moving the npm- prefix from the folder to each individual file name, by consistency with the cache itself (which contains files from the npm, git, http fetchers, etc).

I'd rather not. It would just give the illusion of consistency.
The cache can do it because it's the sole source that controls that folder.
The npm metadata cache is supposed to be specific to the `plugin-npm` resolvers - they are the only ones that populate and control it.
Moving the prefix to the filenames and having the folder called just `metadata` gives the illusion that the folder is a generalized metadata cache. And it could be indeed, if other resolvers ever need to cache metadata in it, but it would have different shapes (and possibly different filename formats) based on the resolvers that cache the data and that might lead to confusion.
Edit: The files also wouldn't be in a single folder like the cache, since e.g. for npm we have `npm-metadata/<hash>/<registry>/lodash.json`.
It would also have to be controlled by the core, which would be tasked with automatically generating the paths to ensure that no 2 resolvers accidentally use the same path (and also to make it possible for us to change the metadata cache path in the future).
In addition, we're not even certain that the current kind of cache is better than a monolithic one, that's something I'm open to experimenting with in the future.
That's why I think that opening the folder to anything but the npm resolvers is not something I want to do yet, if ever.
🤔 Thought more about it and I'd be open to making it `<global>/metadata/npm/<hash>/<registry>/lodash.json`.
This way, we still have a common metadata folder but we make it clear that each resolver has to manage it manually.
What do you think?
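A minimal sketch of how a resolver could build a path in the layout proposed above. `getNpmMetadataPath` and `registryToFolderName` are hypothetical helpers, and the way the registry URL is turned into a folder name is an assumption made for the example, not something this PR specifies.

```typescript
// Assumption: the registry folder is derived from the URL by dropping the
// protocol and trailing slashes, then replacing characters that are awkward
// in folder names. The real mapping may differ.
function registryToFolderName(registry: string): string {
  return registry
    .replace(/^[a-z]+:\/\//i, "")
    .replace(/\/+$/, "")
    .replace(/[^a-zA-Z0-9.@_-]+/g, "_");
}

// Builds <global>/metadata/npm/<hash>/<registry>/<package>.json.
// `hash` is taken as an opaque string; how it is computed is out of scope here.
function getNpmMetadataPath(
  globalFolder: string,
  hash: string,
  registry: string,
  packageName: string,
): string {
  return [
    globalFolder.replace(/\/+$/, ""),
    "metadata",
    "npm", // each resolver owns its own subfolder under the shared metadata folder
    hash,
    registryToFolderName(registry),
    `${packageName}.json`,
  ].join("/");
}

console.log(
  getNpmMetadataPath("/home/user/.yarn/berry", "abc123", "https://registry.yarnpkg.com", "lodash"),
);
```

Keeping the resolver name as its own path segment means another resolver could later add a sibling folder without the core having to arbitrate file names.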
That seems reasonable. I like the idea of moving the cache into cache/mirror (that said, I think we could also just rename --mirror into -g for the same effect; I think I'd like this even better 🤔).
🤔 I still think I'd prefer to keep the mirror, metadata, and index separate.

> I think we could also just rename --mirror into -g for the same effect

We could have a `-g` too but I prefer to retain the granularity. `yarn cache clean` will become interactive in a future PR anyways (just like we discussed some time ago), and it will clean everything by default.
const runRequest = () => prettyNetworkError(request(target, null, {configuration, wrapNetworkRequest, ...rest}), {configuration, customErrorMessage})
  .then(response => response.body);

// We cannot cache responses when wrapNetworkRequest is used, as it can differ between calls
Do you mean that the user could mutate the request wrapping? Perhaps it'd be worth having a secondary `wrapNetworkRequestPure` hook? Maybe for later.
(Many usages of `wrapNetworkRequest` will want to log the queries, or perhaps cache them even further, but not modify their results in an inconsistent way)
This is not about the hook, it's about the function parameter.
The `wrapNetworkRequest` hook can indeed modify the response, but for a given URL it's kinda guaranteed that the same hooks will run even across different calls.
However, the function argument (that I'm using for caching npm metadata) can be passed different values in different calls, meaning that in one call I could pass a `wrapNetworkRequest` that returns the body `foo`, and in a different call one that returns the body `bar`, therefore breaking caching.
Thinking about it more, the entire cache in `httpUtils` is a bit unsafe because it only depends on the target, but the result could be different depending on the `configuration` (settings and hooks) / `headers` / `jsonResponse`.
I assume that it was implemented with the assumption that there won't be different option bags passed to it across different calls for a given URL in the same process, but that's in no way enforced and could lead to bugs.
In any case, I don't intend to fix that in this PR.
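To illustrate the point about the function argument, here is a simplified, synchronous sketch of a body cache keyed only by target that is bypassed whenever a per-call wrapper is passed. It is not the `httpUtils` code: `getBody`, `RequestFn`, `WrapFn`, and `bodyCache` are hypothetical names, and the real requests are asynchronous.

```typescript
// Synchronous stand-ins for illustration only.
type RequestFn = (target: string) => string;
type WrapFn = (executor: () => string) => string;

// Keyed only by target, like the cache discussed above.
const bodyCache = new Map<string, string>();

function getBody(target: string, request: RequestFn, wrapNetworkRequest?: WrapFn): string {
  const run = () => request(target);

  // A per-call wrapper may return a different body on every call, so its
  // result is neither read from nor written to the shared cache.
  if (wrapNetworkRequest) return wrapNetworkRequest(run);

  const cached = bodyCache.get(target);
  if (cached !== undefined) return cached;

  const body = run();
  bodyCache.set(target, body);
  return body;
}
```

If the wrapped results were cached under the same key, the wrapper that returns `foo` and the one that returns `bar` would overwrite each other, and later callers would get whichever body happened to be stored last.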
> However, the function argument (that I'm using for caching npm metadata) can be passed different values in different calls, meaning that in one call I could pass a wrapNetworkRequest that returns the body foo, and in a different call one that returns the body bar, therefore breaking caching.

Yes, that's what I mean by having a pure version of the hook - one that would semantically require implementers to return the same output given the same input.
But yes, that's a follow-up, maybe not even necessary until requested.
Co-authored-by: Maël Nison <[email protected]>
This reverts commit dd07bf4.
The results are even better in my case so great job!
**What's the problem this PR addresses?**

- 07b4ceb changed the metadata cache structure but didn't update the benchmark script to match.
- We're not removing the `$globalFolder/index` folder so `install-full-cold` is skipping some work.

Follow-up to #5491

**How did you fix it?**

Clear the entire global folder.

**Checklist**

- [x] I have read the [Contributing Guide](https://yarnpkg.com/advanced/contributing).
- [x] I have set the packages that need to be released for my changes to be effective.
- [x] I will check that all automated PR checks pass before the PR gets reviewed.

**What's the problem this PR addresses?**

Now that #5491 has landed we can cache the metadata for our e2e tests to potentially speed them up.

**How did you fix it?**

Cache metadata.

**Checklist**

- [x] I have read the [Contributing Guide](https://yarnpkg.com/advanced/contributing).
- [x] I have set the packages that need to be released for my changes to be effective.
- [x] I will check that all automated PR checks pass before the PR gets reviewed.

What's the problem this PR addresses?
Resolving package metadata is slower than it has to be because, most of the time, Yarn has already fetched it in the past, and some things can be cached and reused.
This should improve performance in various cases (ranging from creating new projects and cache-only-but-no-lockfile installs to `yarn up` when no new versions are available), since the server can avoid resending the response body if nothing has changed.
How did you fix it?
This PR makes Yarn cache npm package metadata inside `<globalFolder>/metadata/npm/<cacheKey>/<registry>/<package>.json` when `getPackageMetadata` is used.
If an exact version is requested, Yarn will return the metadata from disk directly and avoid hitting the network altogether.
Otherwise, Yarn will set the `If-None-Match` & `If-Modified-Since` headers using the `etag` & `last-modified` values that were cached during previous requests. This tells the server to skip sending the response body and just respond with `304`, making Yarn reuse the cached metadata.
TODO:
Make `yarn cache clean` clean the npm metadata cache (different PR)
Checklist