Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CCA with nan values #177

Open
jiwon-j opened this issue Aug 2, 2024 · 13 comments
Open

CCA with nan values #177

jiwon-j opened this issue Aug 2, 2024 · 13 comments
Labels
documentation Improvements or additions to documentation

Comments

@jiwon-j
Copy link

jiwon-j commented Aug 2, 2024

Hello,
now i'm trying to conduct CCA with two datasets, SST(ERA5) and precipitation(GPCP).
i'm following this example(https://xeofs.readthedocs.io/en/latest/auto_examples/2multi/plot_cca.html),
However, I get an error saying that the array should not contain infs or NaNs.
I realize that the SST data used in the example code is a different ERSST than the one I used, but the SST data will naturally have a nan value for land, and that's what the result plot in the example shows.
Do I need to pre-process the nan values before CCA?

@jiwon-j jiwon-j added the documentation Improvements or additions to documentation label Aug 2, 2024
@nicrie
Copy link
Contributor

nicrie commented Aug 2, 2024

hey did you verify that you only have full-dimensional NaNs within your data? Grid cells representing land in SST data sets are full-dimensional and not a problem. However, there may be some hidden, individual NaNs within your data.

Can you post the exact error message?

@jiwon-j
Copy link
Author

jiwon-j commented Aug 2, 2024

File ~/miniconda3/lib/python3.10/site-packages/xeofs/models/cca.py:312, in CCA._fit_algorithm(self, views)
309 [assert_not_complex(view) for view in views]
311 self.c = process_parameter("c", self.c, 0, self.n_views)
--> 312 eigvals, eigvecs = self._solve_gevp(views)
313 self.eigvals = eigvals
314 self.eigvecs = eigvecs
...
--> 630 raise ValueError(
631 "array must not contain infs or NaNs")
632 return a

ValueError: array must not contain infs or NaNs

it show like above.
but when i count the number of nan values for each year in dataset, all numbers are same.
So I don't think this dataset has any nan values other than the nan value for land area.
(I also checked that there was no inf value.)

@nicrie
Copy link
Contributor

nicrie commented Aug 2, 2024

Would it be possible for you to share a minimal working example that shows the error? Perhaps you can isolate the error for a small subset of the data that you can share so I can have a look at it?

@jiwon-j
Copy link
Author

jiwon-j commented Aug 3, 2024

data_list = [precip_xr, SST_xr]

model = xeofs.models.CCA(
    n_modes=2,
    use_coslat=True,
    pca=True,
    variance_fraction=0.9,
    init_pca_modes=0.30,
)
model.fit(data_list, dim="time")
components = model.components()
scores = model.scores()

two datas have same time steps.
I just tried inputting only precipitation data and then only SST data (ex, data_list=[SST_xr, SST_xr] and data_list=[precip_xr, precip_xr]) and no errors appeared.
Could it be problem that the two datasets are different and have different resolutions?

@nicrie
Copy link
Contributor

nicrie commented Aug 3, 2024

Thank you for sharing the example. In theory, different spatial resolutions do not matter in the context of CCA.

I'm not immediately sure what the underlying issue might be. Could you provide more details about the dataset sizes, specifically the number of samples and grid points for both PRCP and SST? Have you experimented with the variance_fraction parameter? There might be edge cases where the PCA preprocessing selects either only one principal component or too many (exceeding the number of samples), leading to an ill-posed CCA problem and possibly resulting in NaNs.

I'd be happy to help you identify the issue :) But to assist you more effectively, I would need a fully reproducible example, including the data that demonstrates the error.

@jiwon-j
Copy link
Author

jiwon-j commented Aug 4, 2024

thank you for your caring

  1. about dataset
    GPCP precipitation dataset have resolution of 2.5 x 2.5.
    In the domain I defined, the data has a total of 784 grids, each with 28 steps of latitude and longitude.
    and ERA5 SST dataset have resolution of 0.25 x 0.25.
    i have 801 longitude steps and 561 latitude steps, so it have total of 449361 grid points.

at first, i set the variance_fraction to 0.9, but it still return the error 'ValueError: array must not contain infs or NaNs' whenever i set the option to the range of 0.1~0.9.

@nicrie
Copy link
Contributor

nicrie commented Aug 4, 2024

Thanks for the info! What are the number of time steps in both data sets ?

@jiwon-j
Copy link
Author

jiwon-j commented Aug 5, 2024

it was 20 same for both

@nicrie
Copy link
Contributor

nicrie commented Aug 5, 2024

ok in that case it may be that you're indeed running into one of these edge cases mentioned above. Would you mind testing how many modes you need for each data set to explain at least 90 % of your variance? For this you want to fit an EOF(n_modes=20, standardize=False, use_coslat=True) model on both data sets separately and then print the model.explained_variance_ratio() where model is either SST or PRCP. Could you report back with the output for both models?

@jiwon-j
Copy link
Author

jiwon-j commented Aug 5, 2024

  1. GPCP precipitation
model = xeofs.models.EOF(n_modes=20,
                        standardize=False,
                        use_coslat=True,)
model.fit(precip_P1_xr, dim='time')
model.explained_variance_ratio()

array([2.15341437e-01, 1.91357740e-01, 1.07027890e-01, 6.50667319e-02,
       6.04127807e-02, 5.37885277e-02, 4.60878220e-02, 3.65153038e-02,
       3.32812623e-02, 2.91941211e-02, 2.61814632e-02, 2.43078690e-02,
       2.32423052e-02, 1.88798822e-02, 1.86685204e-02, 1.50839098e-02,
       1.30147616e-02, 1.25216876e-02, 1.00259842e-02, 2.05406409e-16])

this data need at least 13 modes.

  1. ERA5 SST
model = xeofs.models.EOF(n_modes=20,
                        standardize=False,
                        use_coslat=True,)
model.fit(SST_P1_xr, dim='time')
model.explained_variance_ratio()

array([4.18731037e-01, 9.03651828e-02, 7.90773958e-02, 5.90202030e-02,
       5.12331694e-02, 4.36145692e-02, 3.98111592e-02, 3.75551818e-02,
       2.69943871e-02, 2.56492026e-02, 2.26891458e-02, 1.77368605e-02,
       1.70833855e-02, 1.59150489e-02, 1.38574669e-02, 1.27135494e-02,
       1.11173236e-02, 8.87997290e-03, 7.95575824e-03, 7.39957211e-16])

SST need at least 12 modes to explain 90% of variance.

@nicrie
Copy link
Contributor

nicrie commented Aug 5, 2024

I'm not so sure about this anymore... Setting variance_fraction=0.5 should result in selecting 3 and 2 principal components, which seems fine. I don't see anything unusual in the variance_fraction range of 0.1 to 0.9 that you mentioned earlier. I fear that without a fully reproducible example, including the data, I can't provide further help. Sorry about that.

@jiwon-j
Copy link
Author

jiwon-j commented Aug 5, 2024

That's okay, thanks for your help in the meantime.
I'll try a few more things and update if the issue is resolved.

@nicrie
Copy link
Contributor

nicrie commented Sep 5, 2024

hey @jiwon-j just in case you're still facing the same problem, you may give the new (xeofs version 3) xe.cross.CCA class a try and see if it solves the issue you had? This implementation of CCA is simpler (only allows 2 sets of data), but it's likely more robust than the one you used in version 2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

No branches or pull requests

2 participants