
Redefining model parameter compute #187

Open
nicrie opened this issue Aug 19, 2024 · 4 comments
Labels
design Internal code structure

Comments

@nicrie
Contributor

nicrie commented Aug 19, 2024

Right now, setting compute=True triggers a sequential computation of the following steps:

  1. the preprocessor scaler
  2. optional NaN checks
  3. SVD
  4. scores and components

I’m starting to doubt the usefulness of the compute parameter in its current form. When @slevang managed to implement fully lazy evaluation, I thought it made sense to keep this parameter, but now I'm not so sure. My experience has been that it’s almost always faster to let Dask handle everything and find the most efficient way to compute, rather than forcing intermediate steps with compute=True. I plan to run some benchmarks when I get a chance to confirm this.
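For the benchmarks, something along these lines is what I have in mind. This is only a rough sketch: it assumes the model class takes the `compute` keyword and exposes a `.compute()` method as it does today, and the exact import path of the class may differ between versions.

```python
import timeit

import dask.array as da
import xarray as xr

import xeofs as xe  # import path of the model class may differ between versions

# Synthetic lazy (time, lat, lon) data
data = xr.DataArray(
    da.random.random((500, 90, 180), chunks=(100, 90, 180)),
    dims=("time", "lat", "lon"),
    name="t2m",
)

def run(eager: bool) -> None:
    # compute=True materializes scaler, NaN checks, SVD, scores and components
    # step by step; compute=False keeps everything lazy so Dask can optimize
    # the full graph in a single pass.
    model = xe.models.EOF(n_modes=10, compute=eager)
    model.fit(data, dim="time")
    if not eager:
        model.compute()  # one final compute over the whole graph

print("eager:", timeit.timeit(lambda: run(True), number=3))
print("lazy :", timeit.timeit(lambda: run(False), number=3))
```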

The only real benefit I see for this parameter might be in cross-decomposition models (like CCA, MCA, etc.), where PCA preprocessing is done on individual datasets to make subsequent steps more computationally feasible. Often, this results in a reduced PC space that fits into memory, so continuing with eager computation for the SVD, components, scores, etc., makes sense. Another advantage could be when defining the number of PCs based on explained variance rather than a fixed number—this requires computing individual PCA models upfront.

So, I’m thinking it might be more practical to redefine the compute parameter as something like compute_preprocessor. This would compute the following in one go, using Dask’s efficiency:

  1. the preprocessor scaler
  2. optional NaN checks
  3. PCA preprocessing

From there, we can continue with eager computation for the SVD, scores, and components.
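To make the idea concrete, here is a minimal sketch of the proposed split using plain dask arrays; the function and variable names are placeholders, not the actual xeofs classes.

```python
import dask
import dask.array as da
import numpy as np

def fit(X: da.Array, n_pcs: int = 50, n_modes: int = 10):
    # --- lazy block, evaluated in one Dask pass (the proposed compute_preprocessor) ---
    scaled = (X - X.mean(axis=0)) / X.std(axis=0)         # 1. preprocessor scaler
    has_nan = da.isnan(scaled).any()                      # 2. optional NaN check
    u, s, _ = da.linalg.svd_compressed(scaled, k=n_pcs)   # 3. PCA preprocessing
    has_nan, pc_scores = dask.compute(has_nan, u * s)     # single compute
    if has_nan:
        raise ValueError("input contains NaNs after preprocessing")

    # --- eager block: the reduced PC space typically fits into memory ---
    u2, s2, v2 = np.linalg.svd(pc_scores, full_matrices=False)
    return u2[:, :n_modes], s2[:n_modes], v2[:n_modes]
```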

I’d love to hear your thoughts on this, @slevang, since you probably have much more experience with Dask’s magic! :)

@nicrie nicrie added the design Internal code structure label Aug 19, 2024
@slevang
Contributor

slevang commented Aug 19, 2024

One example I remember where this does have a major impact is the rotator classes. The design where we fit these iteratively with a tolerance check simply can't be expressed in a single dask graph, so I had to drop that and just run a fixed number of iterations.
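To illustrate the structural problem, here is a sketch using power iteration as a stand-in for the actual rotation update (the real code lives in the rotator classes): a data-dependent stopping criterion forces a compute on every iteration, while a fixed iteration count keeps the whole loop inside one lazy graph.

```python
import dask.array as da

def leading_mode_fixed(A: da.Array, n_iter: int = 20) -> da.Array:
    """Fixed number of iterations: the whole loop is a single lazy graph."""
    v = da.ones(A.shape[1], chunks=A.chunksize[1])
    for _ in range(n_iter):
        v = A.T @ (A @ v)
        v = v / da.linalg.norm(v)
    return v  # still lazy, nothing has been computed yet

def leading_mode_tol(A: da.Array, tol: float = 1e-8, max_iter: int = 1000) -> da.Array:
    """Tolerance check: the float() conversion is an eager sync point on every
    iteration, so this cannot be expressed as one lazy graph."""
    v = da.ones(A.shape[1], chunks=A.chunksize[1])
    for _ in range(max_iter):
        v_new = A.T @ (A @ v)
        v_new = v_new / da.linalg.norm(v_new)
        if float(abs(v_new - v).max()) < tol:  # forces a compute each iteration
            break
        v = v_new
    return v
```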

The other piece that couldn't be done lazily was sorting the modes by variance explained, because that would require reindexing by a dask array which isn't currently allowed. I worked around that though by manually redefining the compute method on the model classes and bundling the sorting operation in at the end.
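Schematically, that workaround looks something like the following (not the actual xeofs code, just the pattern of bundling the data-dependent sorting into an overridden compute method):

```python
import dask
import numpy as np
import xarray as xr

class LazyModel:
    """Sketch of the pattern: results stay lazy during fit, and the
    variance-based reordering happens only once compute() materializes them."""

    def __init__(self, components: xr.DataArray, explained_variance: xr.DataArray):
        # Both arguments are assumed to be lazy, dask-backed DataArrays with a
        # "mode" dimension, produced during fit().
        self.components = components
        self.explained_variance = explained_variance

    def compute(self) -> None:
        # Materialize everything in a single Dask pass ...
        self.components, self.explained_variance = dask.compute(
            self.components, self.explained_variance
        )
        # ... then sort the modes by explained variance. This needs concrete
        # values, which is why it cannot be part of the lazy graph.
        order = np.argsort(-self.explained_variance.values)
        self.components = self.components.isel(mode=order)
        self.explained_variance = self.explained_variance.isel(mode=order)
```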

It would be interesting to do more benchmarking on the compute time and memory footprint of adding some intermediate compute/persist steps on the preprocessor factors like you mention. Although if we saw major improvements there, we could also try inserting things like dask.graph_manipulation.wait_on in strategic locations, which essentially does the same thing as specifying something like compute_preprocessor but still allows you to build the entire graph lazily.
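For reference, a minimal example of how wait_on can act as a barrier inside an otherwise fully lazy graph (synthetic data, not the xeofs preprocessor itself):

```python
import dask.array as da
from dask.graph_manipulation import wait_on

# Lazy "preprocessing" on synthetic data
X = da.random.random((100_000, 5_000), chunks=(10_000, 5_000))
scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Barrier: every chunk of `scaled` must finish before any downstream task
# starts, but the overall graph is still built lazily.
scaled = wait_on(scaled)

# Downstream decomposition, still lazy until .compute() / dask.compute()
u, s, v = da.linalg.svd_compressed(scaled, k=10)
```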

@nicrie
Contributor Author

nicrie commented Aug 20, 2024

Oh right, I completely forgot about the rotator classes. Thanks for pointing me to dask.graph_manipulation.wait_on - I wasn’t aware of that one. Do you think this approach could help build the entire graph for the PCA Preprocessor lazily, where the number of PCs to retain (and thus the shape of the output matrices) is determined by the data instead of being a fixed parameter?

To be more specific, I recently adapted (#184) the Decomposer to perform a truncated SVD based on a specified amount of explained variance. Right now, I just prohibit users from using variance-based truncation with Dask. Do you think it’s possible to adapt the code with wait_on so that the model can still be built lazily even when it’s based on variance?
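For context, variance-based truncation boils down to something like the following (an eager numpy sketch, not the actual Decomposer code from #184); the number of retained modes depends on the values of the singular values, not just on their count:

```python
import numpy as np

def n_modes_for_variance(singular_values: np.ndarray, threshold: float = 0.9) -> int:
    """Smallest number of modes whose cumulative explained variance reaches
    `threshold`. The result depends on the data, so the output shape of the
    truncated SVD cannot be known while the graph is still lazy."""
    expvar_ratio = singular_values**2 / np.sum(singular_values**2)
    return int(np.searchsorted(np.cumsum(expvar_ratio), threshold)) + 1

# Example: modes explaining 60 %, 25 %, 10 % and 5 % of the variance
s = np.sqrt([0.60, 0.25, 0.10, 0.05])
print(n_modes_for_variance(s, threshold=0.9))  # -> 3
```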

@slevang
Contributor

slevang commented Aug 20, 2024

I don't think so. All wait_on does is tell dask, when it orders and executes the task graph, that all tasks up to that point need to run before anything downstream. You still can't build indexing operations that use dask arrays as the indexer. Indexing with an array of unknown size would be very hard to ever implement in the dask framework, while indexing with an array of known size but unknown values is maybe possible but, I believe, not implemented.

Since this is a truncation, and we're running the larger computation anyway, I wonder if we could make this feature Dask-enabled by using the model._post_compute() trick I used for the variance sorting. That seems like a natural fit, although in some of the more complex cross models this might get messy.

In practice, on larger datasets I've seen very little difference in runtime with dask.array.linalg.svd_compressed between computing, say, 10 modes vs. 100. So I default to some larger number and then subset down from there as needed once the decomposition has run.
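In code, roughly (svd_compressed is the real dask function; the 90 % cutoff below is just one example of subsetting afterwards):

```python
import dask
import dask.array as da

X = da.random.random((100_000, 5_000), chunks=(10_000, 5_000))

# Runtime of the randomized SVD is fairly insensitive to k on large inputs,
# so request a generous number of modes up front ...
u, s, v = da.linalg.svd_compressed(X, k=100)
u, s, v = dask.compute(u, s, v)

# ... and subset afterwards, e.g. down to the modes covering 90 % of the
# variance captured by those 100 modes.
expvar = s**2 / (s**2).sum()
n_keep = int((expvar.cumsum() < 0.90).sum()) + 1
u, s, v = u[:, :n_keep], s[:n_keep], v[:n_keep]
```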

@nicrie
Contributor Author

nicrie commented Aug 20, 2024

I admit that this is a bit beyond my programming brain right now, and I can't really see how to apply model._post_compute() to the cross models (maybe I've just been staring at the screen for too long today 😅). I hope to send a PR today or tomorrow with a slightly cleaner implementation of the BaseCrossModel class - maybe that will help too.
