Survival Analysis

with Lifelines

A little bit about me

  • Data Analyst @ Autodesk
  • You can find me at RaulingAverage.dev
  • Enjoy Coffee, Learning, and Running..near the beach

Notes:

  • This presentation does not reflect any workings or material from Autodesk
  • I am a user of the Lifelines project, not a core contributor

  • I hope you and your families are well during these times! Moreover, please be safe as I encourage y’all to Social Distance, Wear Masks, Wash Hands, and of course be well. #Masks4All

What will we be talking about today?

Survival Analysis

What is Survival Analysis?

Survival Analysis is used to estimate the time until an event of interest occurs for some sample or population. It originated in medical research, where one would like to understand the time until death, modeled as a random variable $T$. Since then, the analysis has been applied to other problems such as customer churn, error logging, and mechanical failure.

As a summary: “Survival analysis attempts to answer questions such as: what is the proportion of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail?” - Source

Why should I consider Survival Analysis?
Let's look at the customer churn scenario.

Say we have a database of customers and their subscriptions for Insta2U, a fake delivery service company for 2U albums. This database has the start and end dates of each subscription, and each customer's associated features/signals.

In [4]:
churn_dataSample.head()
Out[4]:
customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | Churn | Churn - Yes
7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | No | 0
5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | No | 0
3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | Yes | 1
7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | No | 0
9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | Yes | 1

Because we have the start and end dates for these customers, we can observe their life expectancies (or their churn) over time t, seen below:

Moreover, survival analysis has useful properties that help us accurately depict the event of interest, in our case churn, over time.

Survival Curve Example

Another recent example...

What are those special properties?

Censoring

  • If we look at the following image, we see cases where people have not yet experienced the event and continue to persist. If we take the mean of these observed durations, we underestimate the average lifetime, because those who are still alive after time t are recorded with durations shorter than their true lifetimes.

  • And if we only consider one segment of individuals, say those whose event has already occurred, we still underestimate the average.

So, survival analysis solves the problem of right-censoring, mentioned later. Even with no censoring, it is still great for understanding time and events of interest.

In [5]:
from lifelines.plotting import plot_lifetimes

ax = plot_lifetimes(churn_dataSample['tenure'],event_observed = churn_dataSample['Churn - Yes'])
ax.vlines(50, 0, 50, linestyles='--')
ax.set_xlabel("time")
ax.set_title("Customer Tenure, at t=50")
Out[5]:
Text(0.5, 1.0, 'Customer Tenure, at t=50')
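
To make the underestimation concrete, here is a minimal simulation sketch. The exponential lifetimes, the mean of 30, and the observation cutoff at t=50 (chosen to mirror the dashed line in the plot above) are all assumptions for illustration, not values from the churn dataset.

```python
import random

random.seed(0)

TRUE_MEAN = 30.0   # assumed average lifetime of the hypothetical customers
CUTOFF = 50.0      # assumed time at which we stop observing

# Simulated true lifetimes; in practice these are unknown for censored customers.
lifetimes = [random.expovariate(1 / TRUE_MEAN) for _ in range(10_000)]

# What we actually record: the lifetime, or the cutoff if the customer is still active.
observed = [min(t, CUTOFF) for t in lifetimes]
# Customers whose event was seen before the cutoff.
churned = [t for t in lifetimes if t < CUTOFF]

sample_mean = sum(lifetimes) / len(lifetimes)
naive_mean_all = sum(observed) / len(observed)
naive_mean_churned = sum(churned) / len(churned)

print(f"mean of true lifetimes:        {sample_mean:.1f}")
print(f"naive mean, censored included: {naive_mean_all:.1f}")
print(f"naive mean, churned only:      {naive_mean_churned:.1f}")
```

Both naive averages come out below the mean of the true lifetimes, which is the bias that censoring-aware estimators are designed to avoid.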

Probabilities over Time

Survival Function

The survival function $S(t)$ gives the probability of surviving past some time $t$, for $t$ ranging from $0$ to $\infty$.

As a definition:

Let $T$ be a non-negative random lifetime taken from the population. The survival function $S(t)$ is defined as

$$S(t) = \text{Pr}(T>t) = 1-F_T(t)$$

where $T \geq 0$ is the response variable and $F_T(t) = \text{Pr}(T \leq t)$ is its cumulative distribution function.

Its properties include:

  • As $t$ ranges from $0$ to $\infty$, the survival function has the following properties:
    • $S(t)$ is non-increasing
    • At $t=0$, $S(t) = 1$. In other words, the probability of surviving past time $t=0$ is 1.
      • Moreover, as $t \rightarrow \infty$, $S(t) \rightarrow 0$. As time goes to infinity, the survival curve goes to 0.
    • $0\leq S(t)\leq 1$
    • $F_T(t) = 1 - S(t)$, where $F_T(t)$ is the cumulative distribution function
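
These properties can be checked on a small example. The sketch below assumes a handful of made-up, fully observed lifetimes (no censoring) and estimates $S(t)$ as the fraction of lifetimes greater than $t$. The function names are hypothetical helpers, not part of Lifelines.

```python
def empirical_survival(lifetimes, t):
    """Fraction of lifetimes strictly greater than t, an estimate of Pr(T > t)."""
    return sum(1 for life in lifetimes if life > t) / len(lifetimes)

def empirical_cdf(lifetimes, t):
    """Fraction of lifetimes less than or equal to t, an estimate of Pr(T <= t)."""
    return sum(1 for life in lifetimes if life <= t) / len(lifetimes)

# Hypothetical, fully observed lifetimes in months.
lifetimes = [2, 5, 5, 8, 12, 20, 20, 31]
grid = list(range(0, 40))

curve = [empirical_survival(lifetimes, t) for t in grid]
cdf = [empirical_cdf(lifetimes, t) for t in grid]

print("S(0)  =", curve[0])    # everyone survives past t=0
print("S(39) =", curve[-1])   # nobody survives past the longest lifetime
```

On this grid the curve starts at 1, never increases, ends at 0, and always equals one minus the empirical distribution function.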

Survival Curve Example

Regression

Survival regression allows us to regress other features against another variable, in this case durations. This regression differs from traditional linear regression in that:

  • It accounts for censoring, unlike traditional linear regression
  • Though it can operate like traditional linear regression, it is used to explore the relationship between the 'survival' of a person and their characteristics
    • Predict survivability through the model, as opposed to predicting point estimates
  • All models attempt to represent the hazard rate $h(t|x_i)$ for subjects $i=1,\ldots,n$
    • Cox’s proportional hazard model
    • Aalen’s additive model
    • Exponential

And more!
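
As an illustration of the first model in the list, Cox's proportional hazard model is commonly written as $h(t|x_i) = b_0(t)\exp(\sum_j \beta_j x_{ij})$, where $b_0(t)$ is a baseline hazard shared by everyone. The sketch below uses a made-up baseline hazard, made-up coefficients, and made-up covariates (none of them fitted to the churn data) to show the "proportional" part: the ratio of two subjects' hazards does not change with time.

```python
import math

def baseline_hazard(t):
    # Hypothetical baseline hazard b0(t); any non-negative function of t would do.
    return 0.01 + 0.002 * t

def cox_hazard(t, x, beta):
    # h(t | x) = b0(t) * exp(sum_j beta_j * x_j)
    return baseline_hazard(t) * math.exp(sum(b * xi for b, xi in zip(beta, x)))

beta = [0.5, -0.3]        # hypothetical coefficients, not fitted to any data
subject_a = [1, 0]        # hypothetical covariates for one customer
subject_b = [0, 1]        # hypothetical covariates for another customer

# The hazard ratio between the two subjects at several points in time.
ratios = [
    cox_hazard(t, subject_a, beta) / cox_hazard(t, subject_b, beta)
    for t in (1, 12, 48)
]
print(ratios)
```

The baseline hazard cancels in the ratio, so every entry equals $\exp(0.5 - (-0.3)) = \exp(0.8)$ regardless of $t$.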

Okay, I'm convinced. How can we implement this process?

Survival Curve Example

from lifelines import KaplanMeierFitter

Benefits:

  • SciKit-Learn friendly
  • Built on top of Pandas
  • Only focus is survival analysis
  • Handles right, left and interval censored data
  • Estimating Hazard Rates
  • Defining personal Survival Models
  • Compare two or more survival functions
    • lifelines.statistics.logrank_test()
  • and more!

Kaplan-Meier Estimate & S(t)

How is the S(t) calculated? Kaplan-Meier Estimation

Kaplan-Meier Estimation gives us an algorithm to calculate $S(t)$.

We calculate the Survival Function through the following formula:

$$\hat{S}(t) = \prod_{t_i \leq t} \frac{n_i-d_i}{n_i}$$

where $d_i$ is the number of death events at time $t_i$ and $n_i$ is the number of subjects at risk of death just prior to time $t_i$.

Source
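
The product can be sketched in a few lines of plain Python. This is an illustrative implementation, not how Lifelines computes it. The function name `kaplan_meier` and the toy data are made up, and it assumes that events at a given time count toward that time's estimate.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(time, S_hat)] at each distinct observed time.

    durations: observed time for each subject
    events:    1 if the event was observed, 0 if the subject was censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    removed = Counter(durations)   # events plus censored subjects leaving at t
    at_risk = len(durations)       # n_i: subjects at risk just before time t
    survival = 1.0
    curve = []
    for t in sorted(removed):
        d = deaths.get(t, 0)                       # d_i: events at time t
        survival *= (at_risk - d) / at_risk        # multiply in (n_i - d_i) / n_i
        curve.append((t, survival))
        at_risk -= removed[t]                      # drop everyone who left at t
    return curve

# Toy data: six subjects, two of them censored (event flag 0).
toy_curve = kaplan_meier([1, 2, 2, 3, 4, 5], [1, 1, 0, 1, 0, 1])
for t, s in toy_curve:
    print(t, round(s, 4))
```

Censored subjects never lower the curve directly. They only shrink the at-risk count used at later times, which is how the estimator uses their partial information.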

Let's look at our Customer Churn scenario, and apply it as an example using this Customer dataset from the Lifelines example workbooks.

Churn

In [6]:
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()

kmf.fit(durations = churn_data['tenure'], event_observed = churn_data['Churn - Yes'])
Out[6]:
<lifelines.KaplanMeierFitter:"KM_estimate", fitted with 7043 total observations, 5174 right-censored observations>
In [7]:
kmf.plot(ci_show=True)
plt.title('Kaplan Meier Survival Curve')
plt.ylabel('P(Being a customer)')
plt.xlabel('Tenure')
plt.tight_layout()
plt.savefig(firstDemoImage)
plt.show()
In [8]:
print("We output the Event occurrence table over time 'event_at'")

kmf.event_table.head()
We output the Event occurrence table over time 'event_at'
Out[8]:
          removed  observed  censored  entrance  at_risk
event_at
0              11         0        11      7043     7043
1             613       380       233         0     7032
2             238       123       115         0     6419
3             200        94       106         0     6181
4             176        83        93         0     5981
In [9]:
print("Another convenient part of Lifelines' functionality is " +
      "printing out statistical information as DataFrames")
kmf.confidence_interval_.head()
Another convenient part of Lifelines' functionality is printing out statistical information as DataFrames
Out[9]:
     KM_estimate_lower_0.95  KM_estimate_upper_0.95
0.0                1.000000                1.000000
1.0                0.940418                0.951002
2.0                0.921506                0.933672
3.0                0.906857                0.920108
4.0                0.893733                0.907879
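
As a sanity check, the first rows of the event table can be turned into survival estimates by hand. This sketch copies the `observed` and `at_risk` columns and the confidence bounds from the outputs above, multiplies the per-period survival fractions, and checks that each running product falls inside the printed 95% interval. It assumes events at a given time count toward that time's estimate.

```python
# (event_at, observed, at_risk) copied from the first rows of kmf.event_table
event_rows = [
    (0, 0, 7043),
    (1, 380, 7032),
    (2, 123, 6419),
    (3, 94, 6181),
    (4, 83, 5981),
]

# (lower, upper) 95% bounds copied from kmf.confidence_interval_.head()
bounds = {
    0: (1.000000, 1.000000),
    1: (0.940418, 0.951002),
    2: (0.921506, 0.933672),
    3: (0.906857, 0.920108),
    4: (0.893733, 0.907879),
}

survival = 1.0
estimates = {}
for t, observed, at_risk in event_rows:
    survival *= (at_risk - observed) / at_risk   # (n_i - d_i) / n_i
    estimates[t] = survival

for t, s in estimates.items():
    lower, upper = bounds[t]
    print(t, round(s, 4), lower <= s <= upper)
```

Each hand-computed estimate lands between the lower and upper bounds for its time, which suggests the fitted curve is the running product of these fractions.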

What about Regression?

An example of this implementation will be covered in a future talk.

However, there are examples of implementing Survival Regression in the Lifelines documentation.

With that being said

Resources