Notes:
I am not a core contributor to the Lifelines project, just a user.
I hope you and your families are well during these times! Moreover, please be safe as I encourage y’all to Social Distance, Wear Masks, Wash Hands, and of course be well. #Masks4All
Survival analysis estimates the time until an event of interest occurs for some sample or population. It originated in medical research, where one wants to understand the time of death, modeled as a random variable $T$. Since then, the analysis has been applied to other problems such as customer churn, error logging, and mechanical failure.
As a summary: “Survival analysis attempts to answer questions such as: what is the proportion of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail?” - Source
Say we have a database of customers and their subscriptions for Insta2U, a fake delivery service company for 2U albums. This database has the start and end dates of each subscription, along with each customer's associated features/signals.
churn_dataSample.head()
customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | Churn | Churn - Yes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | No | 0 |
5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | No | 0 |
3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | Yes | 1 |
7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | No | 0 |
9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | Yes | 1 |
Because we have the start and end dates for these customers, we can observe their life expectancies (or their churn) over time $t$, seen below:
Moreover, there are nice properties of this implementation we can utilize to ensure we have an accurate depiction of observing the event of interest, in our case churn, over time.
Look at Table 5. First it's amazing to see a paper w/ >1,100 vented #Covid19 pts. But of these 1151 vented pts, *at a median follow-up time of 4.5d* only 38 went home, 282 died, while 831 were still hospitalized. 282/(38+282) = 88% mortality. But thats *quite* misleading👇(4/x) pic.twitter.com/za3S6qEBFj
— John W Scott, MD MPH (@DrJohnScott) April 23, 2020
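To make the point in the tweet concrete, here is a small arithmetic sketch using only the counts quoted there. The variable names are mine, and treating the still-hospitalized patients as right-censored is my reading of the thread, not a claim from the paper itself.

```python
# Counts quoted in the tweet above (ventilated patients, median follow-up 4.5 days).
went_home = 38
died = 282
still_hospitalized = 831  # outcome not yet observed, i.e. right-censored
total = went_home + died + still_hospitalized

# Naive rate: silently drops everyone whose outcome is still unknown.
naive_mortality = died / (went_home + died)

# Bounds once the censored patients are kept in the denominator.
lower_bound = died / total                         # if every censored patient survives
upper_bound = (died + still_hospitalized) / total  # if every censored patient dies

print(f"naive mortality: {naive_mortality:.1%}")
print(f"possible range:  {lower_bound:.1%} to {upper_bound:.1%}")
```

The naive figure only describes patients whose outcome was already known at follow-up, which is why the true rate could sit anywhere in the much wider range.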
If we look at the following image, we observe cases where people have not yet experienced the event and continue to persist. If we take the mean of these data points, we underestimate the average lifetime, because those who stay alive past time $t$ are recorded only up to $t$, which skews our average downward.
And if we only consider one segment of individuals, say those whose event of interest has already occurred, we still underestimate the average.
Survival analysis solves this problem of right-censoring, mentioned later. Even with no censoring, the analysis is still great for understanding time and events of interest.
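The underestimation described above can be checked with a quick simulation. This is a minimal sketch under assumed numbers: the exponential lifetimes with mean 40 and the observation cutoff at 50 are hypothetical choices of mine, not values from the churn dataset.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical setup: true lifetimes are exponential with mean 40,
# but we stop observing at t = 50, so longer lifetimes are right-censored.
TRUE_MEAN = 40.0
CUTOFF = 50.0
lifetimes = [random.expovariate(1 / TRUE_MEAN) for _ in range(10_000)]

observed = [min(t, CUTOFF) for t in lifetimes]  # what we actually get to record
event_seen = [t <= CUTOFF for t in lifetimes]   # False means censored


def mean(values):
    return sum(values) / len(values)


mean_true = mean(lifetimes)
mean_all_observed = mean(observed)  # treats censored durations as complete
mean_events_only = mean([d for d, e in zip(observed, event_seen) if e])

print(f"true mean lifetime:         {mean_true:.1f}")
print(f"mean of observed durations: {mean_all_observed:.1f}")
print(f"mean of completed events:   {mean_events_only:.1f}")
```

Both naive averages come out below the true mean, which is the bias that survival analysis is designed to avoid.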
from lifelines.plotting import plot_lifetimes
ax = plot_lifetimes(churn_dataSample['tenure'],event_observed = churn_dataSample['Churn - Yes'])
ax.vlines(50, 0, 50, linestyles='--')
ax.set_xlabel("time")
ax.set_title("Customer Tenure, at t=50")
The survival function $S(t)$ gives the probability that an observation survives past some time $t$, as $t$ ranges up to $t \rightarrow \infty$.
As a definition:
Let $T$ be a non-negative random lifetime taken from the population. The survival function $S(t)$ is defined as
$$S(t) = \text{Pr}(T>t) = 1-F(t)$$where $T \geq 0$ is the response variable and $F(t) = \text{Pr}(T \leq t)$ is its cumulative distribution function.
Its properties include: $0 \leq S(t) \leq 1$, and $S(t)$ is a non-increasing function of $t$.
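As an illustration of the definition, here is a minimal sketch of an empirical survival function for the simple case where every lifetime is fully observed (no censoring). The function names and the five lifetimes are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def empirical_survival(lifetimes, t):
    """Fraction of lifetimes strictly greater than t, an estimate of Pr(T > t)."""
    return sum(1 for x in lifetimes if x > t) / len(lifetimes)


def empirical_cdf(lifetimes, t):
    """Fraction of lifetimes less than or equal to t, an estimate of F(t)."""
    return sum(1 for x in lifetimes if x <= t) / len(lifetimes)


lifetimes = [1, 2, 2, 34, 45]  # hypothetical, fully observed lifetimes
grid = [0, 1, 2, 10, 34, 45, 60]

survival = [empirical_survival(lifetimes, t) for t in grid]
cdf = [empirical_cdf(lifetimes, t) for t in grid]

for t, s, f in zip(grid, survival, cdf):
    print(f"t={t:>2}  S(t)={s:.1f}  F(t)={f:.1f}")
```

On this toy data the curve starts at 1, never increases, and satisfies $S(t) = 1 - F(t)$ at every grid point. Once censoring enters, this simple counting no longer works, which is what motivates the Kaplan-Meier estimator.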
Survival regression allows us to regress other features against another variable, in this case durations. This regression differs from ordinary regression in that:
from lifelines import KaplanMeierFitter
Benefits:
lifelines.statistics.logrank_test()
Kaplan-Meier estimation gives us an algorithm to estimate $S(t)$ from data.
We calculate the Survival Function through the following formula:
$\hat{S}(t) = \prod_{t_i \leq t} \tfrac{n_i-d_i}{n_i}$
where $d_i$ is the number of death events at time $t_i$ and $n_i$ is the number of subjects at risk of death just prior to time $t_i$.
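The product formula can be sketched in a few lines of plain Python. This is a hand-rolled illustration, not how Lifelines implements it; the `kaplan_meier` helper and the six-subject toy data are hypothetical. The sketch includes the event time itself in the product (all $t_i$ up to and including $t$), which is the usual right-continuous convention.

```python
def kaplan_meier(durations, event_observed):
    """Return {event_time: estimated S(event_time)} using the product-limit formula."""
    event_times = sorted({d for d, e in zip(durations, event_observed) if e})
    estimate = {}
    survival = 1.0
    for t_i in event_times:
        # n_i: subjects still at risk just before t_i (duration >= t_i).
        n_i = sum(1 for d in durations if d >= t_i)
        # d_i: events (not censorings) that happen exactly at t_i.
        d_i = sum(1 for d, e in zip(durations, event_observed) if e and d == t_i)
        survival *= (n_i - d_i) / n_i
        estimate[t_i] = survival
    return estimate


# Hypothetical toy data: 1 = event observed, 0 = right-censored.
durations = [1, 2, 2, 3, 4, 5]
event_observed = [1, 1, 0, 1, 0, 1]

km = kaplan_meier(durations, event_observed)
for t, s in km.items():
    print(f"t={t}  S_hat={s:.4f}")
```

Censored subjects never contribute a $d_i$, but they do count toward $n_i$ for as long as they were observed, which is how the estimator uses partial information instead of discarding it.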
Let's look at our Customer Churn scenario, and apply it as an example using this Customer dataset from the Lifelines example workbooks.
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
kmf.fit(durations = churn_data['tenure'], event_observed = churn_data['Churn - Yes'])
<lifelines.KaplanMeierFitter:"KM_estimate", fitted with 7043 total observations, 5174 right-censored observations>
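import matplotlib.pyplot as plt
# Plot the fitted survival curve together with its confidence band.
kmf.plot(ci_show=True)
plt.title('Kaplan Meier Survival Curve')
plt.ylabel('P(Being a customer)')
plt.xlabel('Tenure')
plt.tight_layout()
# firstDemoImage is assumed to hold the output image path.
plt.savefig(firstDemoImage)
plt.show()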
print("We output the event occurrence table over time 'event_at'")
kmf.event_table.head()
We output the event occurrence table over time 'event_at'
event_at | removed | observed | censored | entrance | at_risk |
---|---|---|---|---|---|
0 | 11 | 0 | 11 | 7043 | 7043 |
1 | 613 | 380 | 233 | 0 | 7032 |
2 | 238 | 123 | 115 | 0 | 6419 |
3 | 200 | 94 | 106 | 0 | 6181 |
4 | 176 | 83 | 93 | 0 | 5981 |
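The event table is all the Kaplan-Meier product needs. As a rough check, the sketch below multiplies out the factors $(n_i - d_i)/n_i$ using the `observed` and `at_risk` values copied by hand from the five rows shown above; the variable names are mine.

```python
# (event_at, observed, at_risk) copied from the first five rows of the event table.
rows = [
    (0, 0, 7043),
    (1, 380, 7032),
    (2, 123, 6419),
    (3, 94, 6181),
    (4, 83, 5981),
]

survival = 1.0
estimates = {}
for event_at, observed, at_risk in rows:
    # Each row contributes one factor (n_i - d_i) / n_i to the running product.
    survival *= (at_risk - observed) / at_risk
    estimates[event_at] = survival
    print(f"tenure={event_at}  S_hat={survival:.6f}")
```

At tenure 1 this gives roughly 0.946, which falls inside the confidence interval that Lifelines reports for that time point.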
print("Another convenient piece of Lifelines functionality is " +
"printing out statistical information as DataFrames")
kmf.confidence_interval_.head()
Another convenient piece of Lifelines functionality is printing out statistical information as DataFrames
| | KM_estimate_lower_0.95 | KM_estimate_upper_0.95 |
---|---|---|
0.0 | 1.000000 | 1.000000 |
1.0 | 0.940418 | 0.951002 |
2.0 | 0.921506 | 0.933672 |
3.0 | 0.906857 | 0.920108 |
4.0 | 0.893733 | 0.907879 |
An example of this implementation will be covered in a future talk.
However, there are examples of implementing Survival Regression in the Lifelines documentation.
There is also an R implementation, the survival package.