FEAT - Add quadratic datafit with no access to the target #249

QB3 · 2024-05-27T14:55:31Z

Description of the feature

Exact feature

Solve the following optimization problem

$$\arg \min_{\beta} \frac{1}{2n} || X \beta ||^2 - \frac{1}{n} \beta^\top X^\top y + \text{penalty} \enspace,$$

with no access to $y$, but with access to $X^\top y$. This is the usual quadratic datafit $\frac{1}{2n} || y - X \beta ||^2$ with the constant term $\frac{1}{2n} || y ||^2$ dropped.

Additional context

Context:
I have been discussing with @shz9 how to implement a specific datafit for genomics applications (@shz9 is finishing his PhD on the statistical analysis of genomics data). From what I understood, genomics data are sensitive, so one does not have access to the target $y$; one only has access to the design matrix $X$ and an estimate of $X^\top y$ (usually computed from another dataset).

Steps

I guess we have to add the datafit $$\frac{1}{2n} || X \beta ||^2 - \frac{1}{n} \beta^\top X^\top y \enspace ,$$ and handle the fact that no $y$ is provided.
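To make the request concrete, here is a minimal NumPy sketch of the value and gradient of such a datafit, computed from $X$ and $X^\top y$ only. It uses the sign convention obtained by expanding $\frac{1}{2n} || y - X \beta ||^2$ and dropping the constant $\frac{1}{2n} || y ||^2$. The function names `datafit_value` and `datafit_grad` are hypothetical and are not part of the skglm API; `y` is only generated here to check the result against the full quadratic loss.

```python
import numpy as np


def datafit_value(X, Xty, beta):
    """Value of 1/(2n) ||X beta||^2 - 1/n beta^T X^T y (no access to y)."""
    n = X.shape[0]
    Xb = X @ beta
    return Xb @ Xb / (2 * n) - beta @ Xty / n


def datafit_grad(X, Xty, beta):
    """Gradient with respect to beta: (X^T X beta - X^T y) / n."""
    n = X.shape[0]
    return (X.T @ (X @ beta) - Xty) / n


# Synthetic check: y is used only to build Xty and the reference quantities.
rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)
Xty = X.T @ y

full_loss = np.sum((y - X @ beta) ** 2) / (2 * n)
dropped_constant = y @ y / (2 * n)
full_grad = X.T @ (X @ beta - y) / n
```

The gradient matches that of the full quadratic loss exactly, and the values differ only by the constant `dropped_constant`, so both problems have the same minimizers.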

mathurinm · 2024-05-27T16:05:07Z

Thanks @QB3 that sounds doable

How would we pass this with the current design? y does not have a default value in Solver.solve, so it's required. Making it a keyword argument may break some stuff.

Thoughts @Badr-MOUFAD @QB3 ?

Badr-MOUFAD · 2024-05-27T16:19:15Z

It is true that the form of the problem is a bit different but it fits our framework as we can write the datafit as $F(y, X\beta)$

As you mentioned @mathurinm, while it breaks our conventions, I don’t think it would break the code.
We can create a new datafit and pass (X, XTy) to Solver.solve.

Thinking about it, once we add support for this case, we will have three cases in total that don't abide by the conventions: this one, Cox, and LinearSVC.
Perhaps we should start thinking about a more flexible API that covers all these cases.

mathurinm · 2024-05-27T16:25:55Z

There's a working (without intercept) implementation with minimal overhead (we inherit most of the stuff from Quadratic) in #250.

I still need to get the intercept update step right, but it was not painful at all to implement!

shz9 · 2024-05-27T16:32:48Z

Thanks @QB3, this would be great!

Just to add a bit more context: in statistical genetics, we have extremely high-dimensional data (tens of millions of features) that we'd like to use to predict a trait (e.g. blood cholesterol levels) or a disease (e.g. diabetes). Due to privacy constraints, the large cohorts that collect the data release neither $X$ nor $y$; instead, they release "summary statistics" in the form of $X^\top y$.

So, we have to resort to approximations of $X$. In previous work, people relied on the form:

$$ \frac{1}{n}||y - X\beta||_2^2 = \frac{1}{n}\Big[y^\top y -2\beta^\top X^\top y + \beta^\top X^\top X\beta\Big] $$

We usually assume both $y$ and $X$ are standardized. The first term is a constant. The second term can be defined based on the summary statistics. The third term can be approximated using small, publicly available data (under the assumption that covariance $X^\top X$ between features is roughly the same across cohorts).
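As a quick sanity check of the decomposition above, the following sketch verifies it numerically on synthetic data. It assumes "standardized" means zero mean and unit variance with the population (`ddof=0`) convention, in which case $y^\top y = n$ and the diagonal of $X^\top X$ equals $n$. All data here are random placeholders, not genomics data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Standardize columns of X and y (np.std uses ddof=0 by default).
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = (y - y.mean()) / y.std()

beta = rng.standard_normal(p)

# Left-hand side needs y and X directly.
lhs = np.sum((y - X @ beta) ** 2) / n

# Right-hand side only needs y^T y (a constant), X^T y and X^T X.
yty = y @ y      # equals n after standardization
Xty = X.T @ y    # the released summary statistics
XtX = X.T @ X    # could be approximated from a public reference panel
rhs = (yty - 2 * beta @ Xty + beta @ XtX @ beta) / n
```

Since `yty` does not depend on `beta`, only `Xty` and `XtX` matter for the optimization.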

Given this, two options are possible for the implementation:

1. Provide an external $X$ that has roughly similar characteristics to the censored data (what @QB3 suggested).
2. Provide an external $X^\top X$ (which is a sparse, block-diagonal matrix). This is what practitioners in the field usually do. One main benefit of this second approach is that some of the larger cohorts that release $X^\top y$ can also release $X^\top X$ estimated from their data, because it doesn't violate privacy agreements.

Happy to help with the implementation or testing.

mathurinm · 2024-05-27T17:20:42Z

For the implementation passing X and Xty, see the working draft at WIP ENH add censored quadratic df #250.
For the one with XtX, I guess we need to do something else, like directly modifying our GramCD solver to support being passed XtX and Xty as arguments. I'll give this a try too.
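To illustrate what a solver taking XtX and Xty could look like, here is a minimal sketch of cyclic coordinate descent for the L1-penalized problem $\frac{1}{2n} \beta^\top X^\top X \beta - \frac{1}{n} \beta^\top X^\top y + \alpha ||\beta||_1$. This is plain NumPy and not skglm's GramCD; the function `gram_cd_lasso`, its signature, and the choice of an L1 penalty are assumptions made for illustration. It uses a dense XtX for simplicity, whereas the block-diagonal matrices mentioned above would call for a sparse format.

```python
import numpy as np


def soft_threshold(x, tau):
    """Proximal operator of tau * |.| for a scalar x."""
    return np.sign(x) * max(abs(x) - tau, 0.0)


def gram_cd_lasso(XtX, Xty, n_samples, alpha, n_iter=200):
    """Cyclic coordinate descent using only XtX and Xty (hypothetical helper)."""
    p = XtX.shape[0]
    beta = np.zeros(p)
    # Coordinate-wise Lipschitz constants of the smooth part.
    lipschitz = np.diag(XtX) / n_samples
    for _ in range(n_iter):
        for j in range(p):
            if lipschitz[j] == 0.0:
                continue  # all-zero feature: coefficient stays at 0
            grad_j = (XtX[j] @ beta - Xty[j]) / n_samples
            beta[j] = soft_threshold(
                beta[j] - grad_j / lipschitz[j], alpha / lipschitz[j]
            )
    return beta


# Synthetic usage: X and y only serve to build the summary quantities.
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(n)
XtX, Xty = X.T @ X, X.T @ y

alpha_max = np.max(np.abs(Xty)) / n  # smallest alpha giving beta = 0
alpha = 0.1 * alpha_max
beta_hat = gram_cd_lasso(XtX, Xty, n, alpha)
beta_zero = gram_cd_lasso(XtX, Xty, n, 1.01 * alpha_max)
```

Each coordinate update costs O(p) with a dense XtX and never touches X or y, which is the property needed when only summary statistics are available.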

mathurinm linked a pull request May 27, 2024 that will close this issue

WIP ENH add censored quadratic df #250

Open
