Redefining model parameter compute
#187
Comments
One example I'm remembering where this does have a major impact is the rotator classes. The design where we fit these iteratively with a tolerance check simply can't be done in a single dask graph, so I had to drop that and just run a fixed number of iterations. The other piece that couldn't be done lazily was sorting the modes by variance explained, because that would require reindexing by a dask array, which isn't currently allowed. I worked around that, though, by manually redefining the `compute` method on the model classes and bundling the sorting operation in at the end. It would be interesting to do more benchmarking on compute time and memory footprint when adding some intermediate compute/persist steps on the preprocessor factors like you mention. Although if we saw major improvements there, we could also try inserting things like […]
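For readers following along: the reason a tolerance check breaks laziness is that the loop condition needs a concrete boolean, which forces the graph to evaluate on every iteration. A minimal sketch of the trade-off, with a made-up `_rotation_step` placeholder rather than the actual rotator code:

```python
import dask.array as da

def _rotation_step(loadings):
    # Placeholder for a single varimax-style update (illustrative only).
    return loadings / da.linalg.norm(loadings)

def rotate_fixed(loadings, n_iter=20):
    # Fully lazy: a fixed number of iterations builds one dask graph,
    # evaluated only when the caller finally computes the result.
    for _ in range(n_iter):
        loadings = _rotation_step(loadings)
    return loadings

def rotate_until_converged(loadings, tol=1e-6, max_iter=100):
    # Breaks laziness: the loop condition needs a concrete float,
    # so the graph built so far is computed on every single iteration.
    for _ in range(max_iter):
        updated = _rotation_step(loadings)
        delta = float(abs(updated - loadings).max())  # triggers a compute here
        loadings = updated
        if delta < tol:
            break
    return loadings
```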
Oh right, I completely forgot about the rotator classes, thanks for pointing me to those. To be more specific, I recently adapted the Decomposer (#184) to perform a truncated SVD based on a specified amount of explained variance. Right now, I just prohibit users from using variance-based truncation with Dask. Do you think it's possible to adapt the code with […]?
I don't think so. All […]. Since this is a truncation, and we're running the larger computation anyway, I wonder if we could make this feature Dask-enabled by using the […]. On larger datasets in practice I've seen very little difference in runtime with […].
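To make the idea concrete, here is one possible shape for Dask-enabled variance truncation, sketched under the assumption that persisting the SVD factors and then evaluating only the small singular-value vector is acceptable. The function name and the SVD call here are illustrative, not the actual Decomposer code:

```python
import dask
import dask.array as da

def truncated_svd_by_variance(X, threshold=0.9):
    # Requires X to be chunked along a single dimension so that
    # dask's tall-and-skinny SVD applies.
    u, s, v = da.linalg.svd(X)

    # Run the expensive computation once and keep the factors in
    # (distributed) memory instead of recomputing per downstream step.
    u, s, v = dask.persist(u, s, v)

    # The singular values form a tiny 1-D array, so computing them is
    # cheap once the factors are persisted.
    variance = (s ** 2).compute()
    ratio = variance.cumsum() / variance.sum()
    n_modes = int((ratio < threshold).sum()) + 1

    # Plain integer slicing keeps the truncated factors lazy/persisted.
    return u[:, :n_modes], s[:n_modes], v[:n_modes, :]
```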
I admit that this is a bit beyond my programming brain right now, and I can't really see how to apply […]
Right now, setting `compute=True` triggers a sequential computation of the following steps: […]

I'm starting to doubt the usefulness of the `compute` parameter in its current form. When @slevang managed to implement fully lazy evaluation, I thought it made sense to keep this parameter, but now I'm not so sure. My experience has been that it's almost always faster to let Dask handle everything and find the most efficient way to compute, rather than forcing intermediate steps with `compute=True`.
I plan to run some benchmarks when I get a chance to confirm this.

The only real benefit I see for this parameter might be in cross-decomposition models (like CCA, MCA, etc.), where PCA preprocessing is done on the individual datasets to make subsequent steps more computationally feasible. Often, this results in a reduced PC space that fits into memory, so continuing with eager computation for the SVD, components, scores, etc. makes sense. Another advantage could be when defining the number of PCs based on explained variance rather than a fixed number, since that requires computing the individual PCA models upfront.
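For the cross-decomposition case, that benefit could look roughly like the sketch below. Everything here is invented for illustration (the `pca_scores` and `fit_mca` helpers are placeholders, not the actual xeofs internals): the small PC-space representations are materialized once, and every subsequent step runs eagerly on in-memory data.

```python
import dask
import dask.array as da
import numpy as np

def pca_scores(X, n_pcs):
    # Placeholder lazy PCA reduction: project onto the leading modes
    # (real code would go through the preprocessor pipeline).
    u, s, v = da.linalg.svd(X)
    return u[:, :n_pcs] * s[:n_pcs]

def fit_mca(X1, X2, n_pcs=50):
    scores1 = pca_scores(X1, n_pcs)  # everything lazy up to here
    scores2 = pca_scores(X2, n_pcs)

    # The (time x n_pcs) score matrices are small, so materializing
    # them is cheap; the rest of the fit runs eagerly on numpy data.
    scores1, scores2 = dask.compute(scores1, scores2)

    # Cross-covariance SVD now operates fully in memory.
    return np.linalg.svd(scores1.T @ scores2, full_matrices=False)
```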
So, I'm thinking it might be more practical to redefine the `compute` parameter as something like `compute_preprocessor`. This would compute the following in one go, using Dask's efficiency: […]

From there, we can continue with eager computation for the SVD, scores, and components.
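A rough sketch of what that could look like in a base model's fit method; the class, attribute, and helper names are invented for illustration and do not match the actual xeofs internals:

```python
import dask

class BaseModelSketch:
    """Illustrative only: names do not match the real xeofs classes."""

    def __init__(self, preprocessor, compute_preprocessor=False):
        self.preprocessor = preprocessor
        self.compute_preprocessor = compute_preprocessor

    def fit(self, X):
        # Build the preprocessing graph lazily: scaling, weighting,
        # stacking, NaN handling (and PCA reduction for cross models).
        X_pre = self.preprocessor.fit_transform(X)

        if self.compute_preprocessor:
            # A single dask.compute call evaluates all preprocessor
            # factors together, letting Dask optimize the whole graph at
            # once instead of forcing each step sequentially the way
            # compute=True does today.
            (X_pre,) = dask.compute(X_pre)

        # From here on, computation is eager: SVD, scores, components.
        self._decompose(X_pre)
        return self

    def _decompose(self, X_pre):
        raise NotImplementedError  # SVD/scores/components in subclasses
```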
I'd love to hear your thoughts on this, @slevang, since you probably have much more experience with Dask's magic! :)