CCA with nan values #177

jiwon-j · 2024-08-02T08:56:19Z

Hello,
now i'm trying to conduct CCA with two datasets, SST(ERA5) and precipitation(GPCP).
i'm following this example(https://xeofs.readthedocs.io/en/latest/auto_examples/2multi/plot_cca.html),
However, I get an error saying that the array should not contain infs or NaNs.
I realize that the SST data used in the example code is a different ERSST than the one I used, but the SST data will naturally have a nan value for land, and that's what the result plot in the example shows.
Do I need to pre-process the nan values before CCA?

nicrie · 2024-08-02T10:05:32Z

hey did you verify that you only have full-dimensional NaNs within your data? Grid cells representing land in SST data sets are full-dimensional and not a problem. However, there may be some hidden, individual NaNs within your data.

Can you post the exact error message?

jiwon-j · 2024-08-02T10:57:37Z

File ~/miniconda3/lib/python3.10/site-packages/xeofs/models/cca.py:312, in CCA._fit_algorithm(self, views)
309 [assert_not_complex(view) for view in views]
311 self.c = process_parameter("c", self.c, 0, self.n_views)
--> 312 eigvals, eigvecs = self._solve_gevp(views)
313 self.eigvals = eigvals
314 self.eigvecs = eigvecs
...
--> 630 raise ValueError(
631 "array must not contain infs or NaNs")
632 return a

ValueError: array must not contain infs or NaNs

it show like above.
but when i count the number of nan values for each year in dataset, all numbers are same.
So I don't think this dataset has any nan values other than the nan value for land area.
(I also checked that there was no inf value.)

nicrie · 2024-08-02T11:12:54Z

Would it be possible for you to share a minimal working example that shows the error? Perhaps you can isolate the error for a small subset of the data that you can share so I can have a look at it?

jiwon-j · 2024-08-03T07:02:38Z

data_list = [precip_xr, SST_xr]

model = xeofs.models.CCA(
    n_modes=2,
    use_coslat=True,
    pca=True,
    variance_fraction=0.9,
    init_pca_modes=0.30,
)
model.fit(data_list, dim="time")
components = model.components()
scores = model.scores()

two datas have same time steps.
I just tried inputting only precipitation data and then only SST data (ex, data_list=[SST_xr, SST_xr] and data_list=[precip_xr, precip_xr]) and no errors appeared.
Could it be problem that the two datasets are different and have different resolutions?

nicrie · 2024-08-03T10:31:58Z

Thank you for sharing the example. In theory, different spatial resolutions do not matter in the context of CCA.

I'm not immediately sure what the underlying issue might be. Could you provide more details about the dataset sizes, specifically the number of samples and grid points for both PRCP and SST? Have you experimented with the variance_fraction parameter? There might be edge cases where the PCA preprocessing selects either only one principal component or too many (exceeding the number of samples), leading to an ill-posed CCA problem and possibly resulting in NaNs.

I'd be happy to help you identify the issue :) But to assist you more effectively, I would need a fully reproducible example, including the data that demonstrates the error.

jiwon-j · 2024-08-04T08:29:47Z

thank you for your caring

about dataset
GPCP precipitation dataset have resolution of 2.5 x 2.5.
In the domain I defined, the data has a total of 784 grids, each with 28 steps of latitude and longitude.
and ERA5 SST dataset have resolution of 0.25 x 0.25.
i have 801 longitude steps and 561 latitude steps, so it have total of 449361 grid points.

at first, i set the variance_fraction to 0.9, but it still return the error 'ValueError: array must not contain infs or NaNs' whenever i set the option to the range of 0.1~0.9.

nicrie · 2024-08-04T10:14:09Z

Thanks for the info! What are the number of time steps in both data sets ?

jiwon-j · 2024-08-05T00:23:13Z

it was 20 same for both

nicrie · 2024-08-05T08:26:42Z

ok in that case it may be that you're indeed running into one of these edge cases mentioned above. Would you mind testing how many modes you need for each data set to explain at least 90 % of your variance? For this you want to fit an EOF(n_modes=20, standardize=False, use_coslat=True) model on both data sets separately and then print the model.explained_variance_ratio() where model is either SST or PRCP. Could you report back with the output for both models?

jiwon-j · 2024-08-05T10:21:29Z

GPCP precipitation

model = xeofs.models.EOF(n_modes=20,
                        standardize=False,
                        use_coslat=True,)
model.fit(precip_P1_xr, dim='time')
model.explained_variance_ratio()

array([2.15341437e-01, 1.91357740e-01, 1.07027890e-01, 6.50667319e-02,
       6.04127807e-02, 5.37885277e-02, 4.60878220e-02, 3.65153038e-02,
       3.32812623e-02, 2.91941211e-02, 2.61814632e-02, 2.43078690e-02,
       2.32423052e-02, 1.88798822e-02, 1.86685204e-02, 1.50839098e-02,
       1.30147616e-02, 1.25216876e-02, 1.00259842e-02, 2.05406409e-16])

this data need at least 13 modes.

ERA5 SST

model = xeofs.models.EOF(n_modes=20,
                        standardize=False,
                        use_coslat=True,)
model.fit(SST_P1_xr, dim='time')
model.explained_variance_ratio()

array([4.18731037e-01, 9.03651828e-02, 7.90773958e-02, 5.90202030e-02,
       5.12331694e-02, 4.36145692e-02, 3.98111592e-02, 3.75551818e-02,
       2.69943871e-02, 2.56492026e-02, 2.26891458e-02, 1.77368605e-02,
       1.70833855e-02, 1.59150489e-02, 1.38574669e-02, 1.27135494e-02,
       1.11173236e-02, 8.87997290e-03, 7.95575824e-03, 7.39957211e-16])

SST need at least 12 modes to explain 90% of variance.

nicrie · 2024-08-05T11:02:52Z

I'm not so sure about this anymore... Setting variance_fraction=0.5 should result in selecting 3 and 2 principal components, which seems fine. I don't see anything unusual in the variance_fraction range of 0.1 to 0.9 that you mentioned earlier. I fear that without a fully reproducible example, including the data, I can't provide further help. Sorry about that.

jiwon-j · 2024-08-05T11:11:24Z

That's okay, thanks for your help in the meantime.
I'll try a few more things and update if the issue is resolved.

nicrie · 2024-09-05T12:50:12Z

hey @jiwon-j just in case you're still facing the same problem, you may give the new (xeofs version 3) xe.cross.CCA class a try and see if it solves the issue you had? This implementation of CCA is simpler (only allows 2 sets of data), but it's likely more robust than the one you used in version 2.

jiwon-j added the documentation Improvements or additions to documentation label Aug 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CCA with nan values #177

CCA with nan values #177

jiwon-j commented Aug 2, 2024

nicrie commented Aug 2, 2024

jiwon-j commented Aug 2, 2024 •

edited

Loading

nicrie commented Aug 2, 2024

jiwon-j commented Aug 3, 2024

nicrie commented Aug 3, 2024

jiwon-j commented Aug 4, 2024

nicrie commented Aug 4, 2024

jiwon-j commented Aug 5, 2024

nicrie commented Aug 5, 2024

jiwon-j commented Aug 5, 2024

nicrie commented Aug 5, 2024

jiwon-j commented Aug 5, 2024

nicrie commented Sep 5, 2024

CCA with nan values #177

CCA with nan values #177

Comments

jiwon-j commented Aug 2, 2024

nicrie commented Aug 2, 2024

jiwon-j commented Aug 2, 2024 • edited Loading

nicrie commented Aug 2, 2024

jiwon-j commented Aug 3, 2024

nicrie commented Aug 3, 2024

jiwon-j commented Aug 4, 2024

nicrie commented Aug 4, 2024

jiwon-j commented Aug 5, 2024

nicrie commented Aug 5, 2024

jiwon-j commented Aug 5, 2024

nicrie commented Aug 5, 2024

jiwon-j commented Aug 5, 2024

nicrie commented Sep 5, 2024

jiwon-j commented Aug 2, 2024 •

edited

Loading