Commit

Merge branch 'dev' of https://github.com/Quantmetry/qolmat into dev
Julien Roussel authored and Julien Roussel committed Apr 15, 2024
2 parents 05417ca + 59c25cd commit 0b579e3
Showing 12 changed files with 462 additions and 328 deletions.
1 change: 1 addition & 0 deletions HISTORY.rst
@@ -10,6 +10,7 @@ History
* Tutorial plot_tuto_categorical showcasing mixed type imputation
* Titanic dataset added
* accuracy metric implemented
* metrics.py rationalized, and split with algebra.py

0.1.3 (2024-03-07)
------------------
30 changes: 17 additions & 13 deletions docs/imputers.rst
@@ -3,24 +3,28 @@ Imputers

All imputers can be found in the ``qolmat.imputations`` folder.

1. Simple (mean/median/shuffle)
-------------------------------
Imputes the missing values using the mean/median along each column or with a random value in each column. See the :class:`~qolmat.imputations.imputers.ImputerSimple` and :class:`~qolmat.imputations.imputers.ImputerShuffle` classes.
1. Simple (mean/median/mode)
----------------------------
Imputes the missing values using a simple statistic: the mode (most frequent value) for categorical columns, and the mean, median or mode (depending on the user parameter) for numerical columns. See :class:`~qolmat.imputations.imputers.ImputerSimple`.
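
The rule described above can be sketched in plain Python. This is only an illustration under stated assumptions, not Qolmat's implementation: the helper ``impute_simple``, its ``strategy`` argument and the sample columns are hypothetical, and missing values are represented by ``None``.

```python
from statistics import mean, median, mode


def impute_simple(values, strategy="mean"):
    """Fill None entries of one column with a single statistic of the observed entries.

    Hypothetical helper for illustration; it is not part of Qolmat.
    """
    observed = [v for v in values if v is not None]
    # Non-numeric (categorical) columns always use the mode.
    is_numeric = all(isinstance(v, (int, float)) for v in observed)
    if not is_numeric or strategy == "mode":
        fill = mode(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        fill = mean(observed)
    return [fill if v is None else v for v in values]


ages = [22.0, None, 30.0, 38.0]   # numerical column
ports = ["S", "C", None, "S"]     # categorical column

imputed_ages = impute_simple(ages, strategy="median")
imputed_ports = impute_simple(ports)
```

Every gap in a column receives the same value, so this kind of imputation preserves the column's central tendency but shrinks its variance.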

2. LOCF
2. Shuffle
----------
Imputes the missing values using a random value sampled in the same column. See :class:`~qolmat.imputations.imputers.ImputerShuffle`.
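
A minimal sketch of this idea, assuming missing values are marked with ``None``. The helper ``impute_shuffle`` is hypothetical and is not Qolmat's code; it only shows that each gap receives a value drawn from the observed entries of the same column.

```python
import random


def impute_shuffle(values, rng):
    """Replace each None with a value drawn at random from the observed entries.

    Hypothetical helper for illustration; it is not part of Qolmat.
    """
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


rng = random.Random(0)  # seeded generator so the draw is reproducible
column = [1.5, None, 3.0, None, 4.5]
imputed = impute_shuffle(column, rng)
```

Unlike a mean or median fill, sampling keeps the marginal distribution of the column roughly intact, at the cost of ignoring any relationship with other columns.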

3. LOCF
-------
Imputes the missing values using the last observation carried forward. See the :class:`~qolmat.imputations.imputers.ImputerLOCF` class.
Imputes the missing values using the last observation carried forward. See :class:`~qolmat.imputations.imputers.ImputerLOCF`.
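
Last observation carried forward can be illustrated with a short loop. The helper ``impute_locf`` below is a hypothetical sketch, not Qolmat's implementation, and it assumes missing values are marked with ``None``.

```python
def impute_locf(values):
    """Carry the last observed value forward over None entries.

    Hypothetical helper for illustration; it is not part of Qolmat.
    """
    filled = []
    last = None  # nothing has been observed yet
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


series = [None, 2.0, None, None, 5.0, None]
imputed = impute_locf(series)
```

In this sketch a gap at the very start of the series stays missing, because there is no earlier observation to carry forward; a complete imputer needs a fallback for that case.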

3. Time interpolation and TSA decomposition
4. Time interpolation and TSA decomposition
-------------------------------------------
Imputes missing using some interpolation strategies supported by `pd.Series.interpolate <https://pandas.pydata.org/docs/reference/api/pandas.Series.interpolate.html>`_. It is done column by column. See the :class:`~qolmat.imputations.imputers.ImputerInterpolation` class. When data are temporal with clear seasonal decomposition, we can interpolate on the residuals instead of directly interpolate the raw data. Series are de-seasonalised based on `statsmodels.tsa.seasonal.seasonal_decompose <https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html>`_, residuals are imputed via linear interpolation, then residuals are re-seasonalised. It is also done column by column. See the :class:`~qolmat.imputations.imputers.ImputerResiduals` class.
Imputes missing values using one of the interpolation strategies supported by `pd.Series.interpolate <https://pandas.pydata.org/docs/reference/api/pandas.Series.interpolate.html>`_. It is done column by column. See the :class:`~qolmat.imputations.imputers.ImputerInterpolation` class. When data are temporal with a clear seasonal decomposition, we can interpolate the residuals instead of directly interpolating the raw data. Series are de-seasonalised based on `statsmodels.tsa.seasonal.seasonal_decompose <https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html>`_, residuals are imputed via linear interpolation, then residuals are re-seasonalised. It is also done column by column. See :class:`~qolmat.imputations.imputers.ImputerResiduals`.
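
The three steps (de-seasonalise, interpolate the residuals, re-seasonalise) can be sketched as follows. This is an assumption-laden illustration rather than Qolmat's code: the seasonal component is taken as already known, whereas the documentation above estimates it with ``seasonal_decompose``, and the helpers ``interpolate_linear`` and ``impute_residuals`` are hypothetical.

```python
def interpolate_linear(values):
    """Linearly interpolate interior None gaps; leading/trailing gaps stay None."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk over consecutive pairs of observed positions and fill between them.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def impute_residuals(values, seasonal):
    """De-seasonalise, interpolate the residuals, then add the seasonality back."""
    residuals = [None if v is None else v - s for v, s in zip(values, seasonal)]
    residuals = interpolate_linear(residuals)
    return [None if r is None else r + s for r, s in zip(residuals, seasonal)]


# Assumed, already-estimated seasonal component (hypothetical values).
seasonal = [0.0, 2.0, 0.0, -2.0, 0.0, 2.0]
values = [10.0, 12.0, None, None, 10.0, 12.0]

imputed = impute_residuals(values, seasonal)
```

Interpolating the raw series between 12.0 and 10.0 would give a straight line, whereas working on the residuals restores the seasonal dip inside the gap.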


4. MICE
5. MICE
-------
Multiple Imputation by Chained Equations: multiple imputations based on ICE. It uses `IterativeImputer <https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer>`_. See the :class:`~qolmat.imputations.imputers.ImputerMICE` class.

5. RPCA
6. RPCA
-------
Robust Principal Component Analysis (RPCA) is a modification of the statistical procedure of PCA which makes it possible to work with a data matrix :math:`\mathbf{D} \in \mathbb{R}^{n \times d}` containing missing values and grossly corrupted observations. We consider here the imputation task alone, but these methods can also tackle anomaly correction.

@@ -46,7 +50,7 @@ The class :class:`RpcaNoisy` implements an recommanded improved version, which r
with :math:`\mathbf{E} = \mathbf{D} - \mathbf{M} - \mathbf{A}`.
See the :class:`~qolmat.imputations.imputers.ImputerRpcaNoisy` class for implementation details.

6. SoftImpute
7. SoftImpute
-------------
SoftImpute is an iterative method for matrix completion that uses nuclear-norm regularization [11]. It is a faster alternative to RPCA, although it is much less robust due to the quadratic penalization. Given a matrix :math:`\mathbf{D} \in \mathbb{R}^{n \times d}` with observed entries indexed by the set :math:`\Omega`, this algorithm solves the following problem:

@@ -56,11 +60,11 @@ SoftImpute is an iterative method for matrix completion that uses nuclear-norm r
The imputed values are then given by the matrix :math:`M=LQ` on the unobserved data.
See the :class:`~qolmat.imputations.imputers.ImputerSoftImpute` class for implementation details.

7. KNN
8. KNN
------
K-nearest neighbors, based on `KNNImputer <https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html>`_. See the :class:`~qolmat.imputations.imputers.ImputerKNN` class.

8. EM sampler
9. EM sampler
-------------
Imputes missing values via EM algorithm [5], and more precisely via MCEM algorithm [6]. See the :class:`~qolmat.imputations.imputers.ImputerEM` class.
Suppose the data :math:`\mathbf{X}` has a density :math:`p_\theta` parametrized by some parameter :math:`\theta`. The EM algorithm makes it possible to draw samples from this distribution by alternating between the expectation and maximization steps.
@@ -104,7 +108,7 @@ Two parametric distributions are implemented:
* :class:`~qolmat.imputations.em_sampler.VARpEM`: [7]: :math:`\mathbf{X} \in \mathbb{R}^{n \times d} \sim VAR_p(\nu, B_1, ..., B_p)` is generated by a VAR(p) process such that :math:`X_t = \nu + B_1 X_{t-1} + ... + B_p X_{t-p} + u_t` where :math:`\nu \in \mathbb{R}^d` is a vector of intercept terms, the :math:`B_i \in \mathbb{R}^{d \times d}` are the lag coefficient matrices and :math:`u_t` is white noise with nonsingular covariance matrix :math:`\Sigma_u \in \mathbb{R}^{d \times d}`, so that :math:`\theta = (\nu, B_1, ..., B_p, \Sigma_u)`.


9. TabDDPM
10. TabDDPM
-----------

:class:`~qolmat.imputations.diffusions.ddpms.TabDDPM` is a deep learning imputer based on Denoising Diffusion Probabilistic Models (DDPMs) [8] for handling multivariate tabular data. Our implementation mainly follows the works of [8, 9]. Diffusion models focus on modeling the process of data transitions from noisy and incomplete observations to the underlying true data. They include two main processes:
25 changes: 5 additions & 20 deletions examples/benchmark.md
@@ -16,9 +16,6 @@ jupyter:
**This notebook aims to present the Qolmat repo through an example of a multivariate time series.
In Qolmat, a few data imputation methods are implemented as well as a way to evaluate their performance.**

```python

```

First, import some useful libraries

@@ -36,26 +33,18 @@ from IPython.display import Image
import pandas as pd
from datetime import datetime
import numpy as np
import scipy
import hyperopt as ho
from hyperopt.pyll.base import Apply as hoApply
np.random.seed(1234)
import pprint
from matplotlib import pyplot as plt
import matplotlib.image as mpimg
import matplotlib.ticker as plticker

tab10 = plt.get_cmap("tab10")
plt.rcParams.update({'font.size': 18})

from typing import Optional

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, HistGradientBoostingRegressor


import sys
from qolmat.benchmark import comparator, missing_patterns, hyperparameters
from qolmat.benchmark import comparator, missing_patterns
from qolmat.imputations import imputers
from qolmat.utils import data, utils, plot

@@ -240,12 +229,8 @@ dfs_imputed = {name: imp.fit_transform(df_plot) for name, imp in dict_imputers.i
```

```python tags=[]
dfs_imputed["VAR_max"].groupby("station").min()
```

```python tags=[]
# station = df_plot.index.get_level_values("station")[0]
station = "Huairou"
station = df_plot.index.get_level_values("station")[0]
# station = "Huairou"
df_station = df_plot.loc[station]
dfs_imputed_station = {name: df_plot.loc[station] for name, df_plot in dfs_imputed.items()}
```
@@ -362,7 +347,7 @@ comparison = comparator.Comparator(
)
```

```python jupyter={"outputs_hidden": true} tags=[]
```python tags=[]
generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=3, groups=('station',), subset=cols_to_impute, ratio_masked=ratio_masked)

comparison = comparator.Comparator(
@@ -393,7 +378,7 @@ plt.show()
df_plot = df_data[cols_to_impute]
```

```python jupyter={"outputs_hidden": true} tags=[]
```python tags=[]
dfs_imputed = {name: imp.fit_transform(df_plot) for name, imp in dict_imputers.items()}
```

2 changes: 1 addition & 1 deletion examples/tutorials/plot_tuto_categorical.py
@@ -57,7 +57,7 @@
# - manage categorical features through one-hot encoding
# - manage missing features (native to the HistGradientBoosting)

pipestimator = preprocessing.make_robust_MixteHGB(allow_new=False)
pipestimator = preprocessing.make_robust_MixteHGB(avoid_new=True)
imputer_hgb = ImputerRegressor(estimator=pipestimator, handler_nan="none")
imputer_wrap_hgb = preprocessing.WrapperTransformer(imputer_hgb, bt)
