---
title: "Survival modeling in mlr3 using Bayesian Additive Regression Trees (BART)"
description: |
Demonstrate use of survival BART on the lung dataset via mlr3proba and distr6.
author:
  - name: Andreas Bender, Philip Studener, John Zobolas
    orcid: # TODO: add ORCIDs (0000-0002-3609-8674)
    url: # TODO: add proba link
date: 2024-07-30
bibliography: ../../bibliography.bib # TODO: Add references to discrete time here
---

{{< include ../../_setup.qmd >}}

## Intro

In this tutorial we illustrate how to perform *discrete time-to-event analysis* [@Tutz...] using the packages `mlr3` and `mlr3proba`.
Discrete time-to-event analysis can be viewed as a reduction technique, where various survival tasks can be "reduced" to a more standard classification task.
To do so, the data needs to be transformed into a specific format. Details are provided later, but briefly, we partition the follow-up into a fixed number of intervals and create a new binary outcome variable that indicates each subject's status within each interval. We can use this new outcome to estimate the probability of an event within an interval, conditional on time and features. These conditional probabilities can then be combined to calculate survival probabilities.
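Briefly, writing $\lambda(t) = P(y_t = 1 \mid T \geq t)$ for the discrete hazard, i.e. the probability of an event in interval $t$ given survival up to that interval, the survival probability follows as the product over all preceding intervals:

$$S(t) = P(T > t) = \prod_{k=1}^{t} \bigl(1 - \lambda(k)\bigr).$$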

The advantage of this approach is that once the data are transformed, any classifier available in the `mlr3` ecosystem can be used for survival analysis.
The disadvantage is that the data transformation is cumbersome to perform manually and that it takes quite some boilerplate code to apply it consistently across resampling iterations and to combine the conditional probabilities (discrete hazards) into survival probabilities.
However, as we illustrate below, with the discrete time-to-event pipeline implemented in `mlr3proba` via `mlr3pipelines`, all of these details are abstracted away so that practitioners can focus on modeling their data.
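To give a flavor of what this looks like, here is a sketch (the pipeline key and arguments are assumptions based on the development version of `mlr3proba` and may differ):

```{r}
#| eval: false
# sketch: wrap a classifier into a survival learner via the discrete
# time-to-event pipeline (assumed interface, not run here)
grlrn <- ppl(
  "survtoclassif_disctime",          # assumed pipeline key
  learner = lrn("classif.log_reg"),  # any classifier that predicts probabilities
  cut = 100                          # assumed argument: partition of the follow-up
) |> as_learner()
```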

## Libraries

```{r}
#| output: false
library(mlr3extralearners)
library(mlr3pipelines)
library(mlr3proba)
library(survival)
library(pammtools)
library(distr6)
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)
```

## Data

We will use the `tumor` data set from the package `pammtools` [@Bender2019].
The data set contains survival times (in days) of patients with a tumor in the stomach area that was removed during an operation. Some features of interest are:

- `resection`: Was an additional area around the tumor removed during the operation (yes/no)?
- `complications`: Did complications occur during the operation (yes/no)?
- `metastases`: Had the tumor already metastasized (yes/no)?
- `age`: The patient's age
- `charlson_score`: Comorbidity score


Below we look at the first rows of the data set, create the survival task, and check for missing values:

```{r}
# load data
data("tumor", package = "pammtools")
head(tumor)
# create survival task
tsk_tumor = TaskSurv$new(id = "tumor", backend = tumor, event = "status", time = "days")
tsk_tumor$missings()
```

Next we look at the marginal survival probabilities using the Kaplan-Meier estimate:

```{r}
mean(tumor$status) # about 50% of subjects are censored during the follow-up
autoplot(tsk_tumor) + ylim(c(0, 1))
```

The median survival time is approximately 1500 days. The probability of surviving beyond 3000 days is about 25%.


## Discrete time-to-event analysis

In this section we first show, for illustration, how discrete time-to-event analysis works "by hand".
Later we show how to use `mlr3proba` to achieve the same.

The first step is to transform the data (we use the `as_ped` function from the package `pammtools`).
To do so, we first have to decide how to partition the follow-up into intervals.
The general trade-off is:

- a high number of intervals $\rightarrow$ higher precision, but also higher variability and computational burden
- a low number of intervals $\rightarrow$ lower variability and computational burden, but lower precision

For now, we simply divide the follow-up into 100 intervals of equal length, rounded to whole days (so each interval covers approximately 38 days).
The output below shows the original data and the first and last row of the transformed data for selected subjects.
Note how each subject (`id`) has a different number of rows in the transformed data.
This is because we only create intervals for a subject while they are still at risk for an event (i.e. not censored or deceased in a previous interval).
Subject 4 has only one entry, as their survival time is 33 days, so they experience the event in the first interval.
The new outcome variable is called `ped_status` and indicates, for each interval in which a subject was at risk, whether the subject survived the interval (0) or experienced an event in it (1).
By construction, a subject has status 0 for all intervals except the last one, which can be either 0 or 1.


```{r}
cut <- seq(0, max(tumor$days), length.out = 101) |> round() # 101 cut points define 100 intervals
disc_data <- pammtools::as_ped(data = tumor, Surv(days, status) ~ ., cut = cut) |>
  select(-offset)
unique(disc_data$interval) # the intervals in the transformed data
tumor |> slice(c(1, 3, 4)) # original data
disc_data |> filter(id %in% c(1, 3, 4)) |> group_by(id) |> summarize(number_intervals = n())
disc_data |> filter(id %in% c(1, 3, 4)) |> group_by(id) |> slice(1, n())
```

Once we have the transformed data, we can fit a classifier (one that returns class probabilities) to the new outcome variable to obtain predictions of the discrete-time hazard, i.e. the probability $\lambda(t) = P(y_t = 1 \mid T \geq t)$ (here $t$ is shorthand for interval $1, 2, 3, \ldots$).

For illustration, we first use a logistic regression without covariates.
Note, however, that since the new 0/1 variable is the outcome, the variables that represent time (e.g. `tend` or `interval`) can (and should) be viewed as features. We use them to estimate different event probabilities in different intervals (i.e. probabilities conditional on the feature time).


```{r}
m_0 <- glm(ped_status ~ interval, data = disc_data, family = binomial())
coef(m_0)
```
The coefficients we obtain from this model can be interpreted as (logit) discrete baseline hazards.
Because of the reference coding, the intercept is the logit baseline hazard in the first interval; the other coefficients are the differences in the logit baseline hazard compared to the first interval.
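For instance, applying the inverse logit recovers the discrete baseline hazards on the probability scale:

```{r}
# transform the coefficients back to discrete hazards on the probability scale
logit_haz <- c(coef(m_0)[1], coef(m_0)[1] + coef(m_0)[-1])
head(plogis(logit_haz)) # discrete hazards in the first intervals
```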

We can use these values to calculate the survival probability for each time point (for comparison, we also add the previous Kaplan-Meier estimate to the graph):

```{r}
km <- survival::survfit(Surv(days, status) ~ 1, data = tumor) |> broom::tidy()
ndf <- disc_data |> pammtools::make_newdata(tend = unique(tend))
ndf$disc_haz <- predict(m_0, newdata = ndf, type = "response") # discrete hazards
ndf$surv_prob <- cumprod(1 - ndf$disc_haz) # survival probabilities
ggplot(ndf, aes(x = tend, y = surv_prob)) +
  pammtools::geom_surv(aes(col = "Discrete-Time")) +
  geom_step(data = km, aes(x = time, y = estimate, col = "Kaplan-Meier"), lty = 2) +
  ylim(c(0, 1)) +
  scale_color_discrete(name = "method") +
  labs(y = "Survival Probability")
```

As you can see, there is barely any difference between the two estimates.
This illustrates that essentially any survival time distribution can be approximated using the discrete-time approach. What we have shown here anecdotally can also be shown mathematically (TODO: add Efron citation).
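Briefly, the reason is that the likelihood contribution of subject $i$, observed up to interval $t_i$, factorizes into Bernoulli terms over the intervals at risk,

$$L_i = \prod_{k=1}^{t_i} \lambda(k)^{y_{ik}} \bigl(1 - \lambda(k)\bigr)^{1 - y_{ik}},$$

where $y_{ik}$ is the binary status of subject $i$ in interval $k$. This is exactly the likelihood of a binary regression model fit to the transformed data, so maximizing the classification likelihood also maximizes the discrete-time survival likelihood.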


## Modeling the baseline hazard, or "What is a featureless discrete-time model?"



## Covariate effect vs. stratified baseline hazard

Similar to a Cox model, in a standard discrete-time model (using logistic regression or similar) we have to decide whether a covariate simply shifts the baseline hazard or whether it alters the shape of the baseline hazard. In the survival analysis context this is known as the proportional hazards (PH) assumption vs. time-varying effects.
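In the discrete-time framework, this choice maps directly onto the model formula: a main effect of the covariate shifts the (logit) baseline hazard, while an interaction with time lets the shape of the hazard vary. A minimal sketch using the transformed data from above:

```{r}
#| eval: false
# (1) complications shifts the baseline hazard (PH-like assumption)
m_shift <- glm(ped_status ~ interval + complications, data = disc_data, family = binomial())
# (2) complications alters the shape of the baseline hazard (time-varying effect);
# with many intervals this is heavily parameterized -- in practice a smooth
# function of time would be preferable
m_tve <- glm(ped_status ~ interval * complications, data = disc_data, family = binomial())
```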

For illustration, we consider the `complications` variable in the `tumor` data set.
For comparison, we first fit a standard Cox PH model, which indicates that `complications` has a substantial effect on survival (the hazard for subjects with complications is roughly twice that of subjects without).

```{r}
table(tumor$complications)
m_cox <- survival::coxph(Surv(days, status) ~ complications, data = tumor)
summary(m_cox)
```

However, looking at the Schoenfeld residuals for this model, we see that the PH assumption is not fulfilled:

```{r}
zph <- cox.zph(m_cox)
zph
plot(zph)
```



## References
