FEAT - Add quadratic datafit with no access to the target #249

Open
QB3 opened this issue May 27, 2024 · 5 comments · May be fixed by #250

Comments

@QB3
Collaborator

QB3 commented May 27, 2024

Description of the feature

Exact feature

Solve the following optimization problem

$$\arg \min_{\beta} \frac{1}{2n} || X \beta ||_2^2 - \frac{1}{n} \beta^\top X^\top y + \text{penalty} \enspace,$$

with no access to $y$, but with access to $X^\top y$.

Additional context

I have been discussing with @shz9 how to implement a specific datafit for genomic applications (@shz9 is finishing his PhD on statistical analysis of genomics data). From what I understood, genomics data are sensitive: one does not have access to the target $y$, only to the design matrix $X$ and an estimate of $X^\top y$ (usually computed from another dataset).

Steps

I guess we have to add the datafit $$\frac{1}{2n} || X \beta ||_2^2 - \frac{1}{n} \beta^\top X^\top y \enspace ,$$ and handle the fact that no $y$ is provided.
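
For illustration, here is a minimal NumPy sketch of what such a datafit could compute. It assumes the linear term carries the minus sign obtained by expanding $\frac{1}{2n} ||y - X\beta||_2^2$ and dropping the constant $\frac{1}{2n} ||y||_2^2$. The function names are hypothetical and are not skglm's API; the point is only that the value and the gradient need $X$ and $X^\top y$, never $y$ itself.

```python
import numpy as np


def xty_quadratic_value(X, Xty, beta):
    """Value of ||X beta||^2 / (2n) - beta^T (X^T y) / n (hypothetical helper)."""
    n_samples = X.shape[0]
    Xbeta = X @ beta
    return Xbeta @ Xbeta / (2 * n_samples) - beta @ Xty / n_samples


def xty_quadratic_gradient(X, Xty, beta):
    """Gradient with respect to beta: (X^T X beta - X^T y) / n."""
    n_samples = X.shape[0]
    return (X.T @ (X @ beta) - Xty) / n_samples
```

Up to the constant $\frac{1}{2n} ||y||_2^2$, the value matches the usual quadratic datafit, and the gradient is identical to it, so a solver that only uses values and gradients should not notice the missing $y$.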

@mathurinm
Collaborator

Thanks @QB3, that sounds doable.

How would we pass the current design? y does not have a default value in Solver.solve, so it's required. Making it a keyword arg may break some stuff.

Thoughts, @Badr-MOUFAD @QB3?

@Badr-MOUFAD
Collaborator

It is true that the form of the problem is a bit different, but it fits our framework, as we can write the datafit as $F(y, X\beta)$.

As you mentioned @mathurinm, while it breaks our conventions, I don't think it would break the code.
We can create a new datafit and pass (X, Xty) to Solver.solve.

Thinking about it, once this case is supported we will have three cases in total that don't abide by the conventions: this one, Cox, and LinearSVC.
Perhaps we should start thinking about a more flexible API that covers all these cases.

@mathurinm linked a pull request on May 27, 2024 that will close this issue
@mathurinm
Collaborator

There's a working implementation (without intercept) with minimal overhead in #250: we inherit most of the code from Quadratic.

I still need to get the intercept update step right, but it was not painful at all to implement!

@shz9

shz9 commented May 27, 2024

Thanks @QB3, this would be great!

Just to add a bit more context: in statistical genetics, we have extremely high-dimensional data (tens of millions of features) that we'd like to use to predict a trait (e.g. blood cholesterol levels) or disease (e.g. diabetes). For privacy reasons, the large cohorts that collect the data release neither $X$ nor $y$, but they do release "summary statistics" in the form of $X^\top y$.

So, we have to resort to approximations of $X$. In previous work, people relied on the form:

$$ \frac{1}{n}||y - X\beta||_2^2 = \frac{1}{n}\Big[y^\top y -2\beta^\top X^\top y + \beta^\top X^\top X\beta\Big] $$

We usually assume both $y$ and $X$ are standardized. The first term is a constant. The second term can be defined based on the summary statistics. The third term can be approximated using small, publicly available data (under the assumption that covariance $X^\top X$ between features is roughly the same across cohorts).
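
As a quick sanity check of the expansion above, the following sketch compares both sides numerically on synthetic data. The data and variable names are made up for illustration; the only claim is that the right-hand side can be evaluated from $y^\top y$, $X^\top y$ and $X^\top X$ alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 50, 5
X = rng.standard_normal((n_samples, n_features))
y = rng.standard_normal(n_samples)
beta = rng.standard_normal(n_features)

# Left-hand side: needs the individual-level data X and y.
lhs = np.sum((y - X @ beta) ** 2) / n_samples

# Right-hand side: needs only y^T y (a constant), X^T y and X^T X.
yty, Xty, XtX = y @ y, X.T @ y, X.T @ X
rhs = (yty - 2 * beta @ Xty + beta @ XtX @ beta) / n_samples

identity_holds = np.isclose(lhs, rhs)
```

Since $y^\top y$ does not depend on $\beta$, it can be dropped when minimizing, which leaves only the two summary quantities.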

Given this, two options are possible for the implementation:

  1. Provide external $X$ that has roughly similar characteristics to censored data (what @QB3 suggested).
  2. Provide external $X^\top X$ (which is a sparse, block-diagonal matrix). This is what practitioners in the field usually do. One main benefit of this second approach is that some of the larger cohorts that release $X^\top y$ can also release $X^\top X$ estimated from their data, because it doesn't violate privacy agreements.

Happy to help with the implementation or testing.

@mathurinm
Collaborator

mathurinm commented May 27, 2024

  • For the implementation passing X and Xty, see the working draft in #250 (WIP ENH add censored quadratic df).
  • For the one with XtX, I guess we need to do something else, like directly modifying our GramCD solver so that it can be passed XtX and Xty as arguments. I'll give this a try too.
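
To make the second bullet concrete, here is a rough sketch of cyclic coordinate descent that takes only XtX and Xty. It is not skglm's GramCD solver: it assumes an L1 penalty with strength `alpha`, a dense NumPy XtX, and no intercept, and the function names are hypothetical.

```python
import numpy as np


def soft_threshold(x, tau):
    """Proximal operator of tau * |.| applied to a scalar."""
    return np.sign(x) * max(abs(x) - tau, 0.0)


def gram_cd_lasso(XtX, Xty, n_samples, alpha, n_iter=200):
    """Minimize beta^T XtX beta / (2n) - beta^T Xty / n + alpha * ||beta||_1."""
    n_features = XtX.shape[0]
    beta = np.zeros(n_features)
    # Gradient of the smooth part, (XtX beta - Xty) / n, evaluated at beta = 0.
    grad = -Xty / n_samples
    # Coordinate-wise Lipschitz constants of the smooth part.
    lipschitz = np.diag(XtX) / n_samples
    for _ in range(n_iter):
        for j in range(n_features):
            if lipschitz[j] == 0.0:
                continue  # feature with zero norm: its coefficient stays at 0
            old = beta[j]
            beta[j] = soft_threshold(
                old - grad[j] / lipschitz[j], alpha / lipschitz[j]
            )
            if beta[j] != old:
                # Keep the gradient in sync using only column j of XtX.
                grad += (beta[j] - old) * XtX[:, j] / n_samples
    return beta
```

Each coordinate update touches one column of XtX, so a sparse, block-diagonal XtX would only require changing how that column is read.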
