Notes:
And I am not a core-contributor to the Lifelines project, but a user
I hope you and your families are well during these times! Moreover, please be safe as I encourage y’all to Social Distance, Wear Masks, and contribute to social movements during these next critical months.
Survival Analysis is an analytics process to estimate the time of event of interest for some group, sample, or population.
In it’s origination through medical research, one would like to understand the time of an "event of interest" will occur, in consideration of time t. Since it's origination, the analysis has been used in other applications like customer churn, error logging, or mechanical failure.
As a summary, “Survival analysis attempts to answer questions such as:
What is the proportion of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail?” - Source
churn_dataSample.head(3)
gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | Churn | Churn - Yes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
customerID | ||||||||||||||||||||
7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | No | 0 |
5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | No | 0 |
3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | Yes | 1 |
print(f"Average Tenure for customers, all data points considered: {churn_dataSample['tenure'].mean()}")
Average Tenure for customers, all data points considered: 29.9
However, what about cases where individuals may:
Wouldn't this improperly represent our estimated customer lifespans?
In the coding section later on, we will see that we can visualize an accurate depication of these customer's life expectancies (or churn) over time t, seen below:
Look at Table 5. First it's amazing to see a paper w/ >1,100 vented #Covid19 pts. But of these 1151 vented pts, *at a median follow-up time of 4.5d* only 38 went home, 282 died, while 831 were still hospitalized. 282/(38+282) = 88% mortality. But thats *quite* misleading👇(4/x) pic.twitter.com/za3S6qEBFj
— John W Scott, MD MPH (@DrJohnScott) April 23, 2020
Side note: If you want to follow some reliable epidemiologists I recommend the following Twitter handles:
@DrTomFrieden @EpiEllie @DrEricDing @DWUhlfelderLaw @ASlavitt @CT_Bergstrom
This case of misinterpreting #COVID19 values without the consideration of time or condition can miscontrue our results.
Hence why this talk isn't more focused on using healthcare or epidemiology data
A form of missing data problem in which time to event is not observed for reasons:
Censoring allows for considering time-based data problems with inclusion of event of interests, and moreover other factors.
The following is a visualization and expansion of what types of Censoring exist:
from lifelines.plotting import plot_lifetimes
ax = plot_lifetimes(churn_dataSample['tenure'],
event_observed = churn_dataSample['Churn - Yes']
)
ax.vlines(50, 0, 50, linestyles='--')
ax.set_xlabel("Time")
ax.set_ylabel("Person")
ax.set_title("Customer Tenure, at t=50")
Text(0.5, 1.0, 'Customer Tenure, at t=50')
There are two types of censoring.
Right Censoring:
Se the first top 5 lines for those data points who do go beyond some times t, but are right censored in the calculations
Left Censoring:
Note: Useful in cumulative density calculations.
If we calculate a statistic, like average, for either:
Then we would tend to underestimate our results.
So, the consideration of censoring both an effective and crucial part of Survival Analysis
The survival function $S(t)$ estimates the probability of surviving past some time $t$, for the observations going to $t \rightarrow \infty$.
As a definition:
For $T$ be a non-negative random lifetime taken from the population, the survival function $S(t)$ is defined as
$$S(t) = \text{Pr}(T>t) = 1-F(t)$$where $T$ is the response variable, where $T \geq 0$
Survival regression allows us to regress other feature against another variable--this case durations. This type of regression is different from other common regressions in that:
Benefits:
from lifelines import KaplanMeierFitter
## With some neat API usage...coming up
lifelines.statistics.logrank_test()
Kaplan-Meier Estimation allows us to analyze data with event of interest for some time t.
We calculate the Survival Function $S(t)$ by the following formula:
$\hat{S(t)} = \prod_{t_i < t} \tfrac{n_i-d_i}{n_i}$
where $d_i$ are the number of death events at time $t$ and $n_i$ is the number of subjects at risk of death just prior to time $t$.
Let's look at our Customer Churn data, from the Lifelines example library, with the Kaplan-Meier implementation to estimate Survival Curve
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
kmf.fit(durations = churn_data['tenure'], event_observed = churn_data['Churn - Yes'])
<lifelines.KaplanMeierFitter:"KM_estimate", fitted with 7043 total observations, 5174 right-censored observations>
kmf.plot(ci_show=True)
# kmf.plot(ci_show=True, at_risk_counts=True)
plt.title('Kaplan Meier Survival Curve')
plt.ylabel('Survival Rate')
plt.xlabel('Tenure')
plt.tight_layout()
plt.savefig(firstDemoImage)
plt.show()
print("We output the Event occurence table over time 'event_at'")
kmf.event_table.head()
We output the Event occurence table over time 'event_at'
removed | observed | censored | entrance | at_risk | |
---|---|---|---|---|---|
event_at | |||||
0 | 11 | 0 | 11 | 7043 | 7043 |
1 | 613 | 380 | 233 | 0 | 7032 |
2 | 238 | 123 | 115 | 0 | 6419 |
3 | 200 | 94 | 106 | 0 | 6181 |
4 | 176 | 83 | 93 | 0 | 5981 |
Remember that perceivably convoluted Kaplan Mier equation?
$\hat{S(t)} = \prod_{t_i < t} \tfrac{n_i-d_i}{n_i}$
Let's take a small sample of the first 3 rows to see how (readably) calculated:
event_atZero = (7043 - 0) / 7043
event_atOne = event_atZero * ((7032 - 380)/ 7032)
event_atTwo = event_atOne * ((6419 - 123)/ 6419)
demoEventsSurvival = [event_atZero, event_atOne, event_atTwo]
totalExamples = len(demoEventsSurvival)
for i in range(totalExamples):
survivalRate_rounded = round(demoEventsSurvival[i], 5)
print(f"Calculated S(t) = {survivalRate_rounded}, at timeline event_at = {i}")
Calculated S(t) = 1.0, at timeline event_at = 0 Calculated S(t) = 0.94596, at timeline event_at = 1 Calculated S(t) = 0.92783, at timeline event_at = 2
Just to confirm our calculations...
print('The following are our S(t) estimation calculations from our Kaplan Meir object:\n')
kmf.survival_function_.head()
The following are our S(t) estimation calculations from our Kaplan Meir object:
KM_estimate | |
---|---|
timeline | |
0.0 | 1.000000 |
1.0 | 0.945961 |
2.0 | 0.927835 |
3.0 | 0.913725 |
4.0 | 0.901045 |
print("As another part of the functionality with Lifelines are" +
"conveniently printing out statistical information as Dataframes\n")
kmf.confidence_interval_.head()
As another part of the functionality with Lifelines areconveniently printing out statistical information as Dataframes
KM_estimate_lower_0.95 | KM_estimate_upper_0.95 | |
---|---|---|
0.0 | 1.000000 | 1.000000 |
1.0 | 0.940418 | 0.951002 |
2.0 | 0.921506 | 0.933672 |
3.0 | 0.906857 | 0.920108 |
4.0 | 0.893733 | 0.907879 |
(Don't worry, we'll come to the end of this session soon...)
We notice that the Kaplan-Meir implementation focuses on categorical factors to produce the resulting Survival Rate Curve over time. Moreover, we were not able to do the following:
from lifelines import CoxPHFitter
cox_saRegression = CoxPHFitter()
selectedFeatures = ['tenure','SeniorCitizen', 'MonthlyCharges', 'Churn - Yes']
cox_saRegression.fit(churn_data[selectedFeatures],
duration_col='tenure', event_col='Churn - Yes')
cox_saRegression.print_summary()
model | lifelines.CoxPHFitter |
---|---|
duration col | 'tenure' |
event col | 'Churn - Yes' |
number of observations | 7043 |
number of events observed | 1869 |
partial log-likelihood | -15586.83 |
time fit was run | 2020-08-16 15:46:31 UTC |
coef | exp(coef) | se(coef) | coef lower 95% | coef upper 95% | exp(coef) lower 95% | exp(coef) upper 95% | z | p | -log2(p) | |
---|---|---|---|---|---|---|---|---|---|---|
SeniorCitizen | 0.47 | 1.60 | 0.05 | 0.36 | 0.58 | 1.44 | 1.78 | 8.61 | <0.005 | 56.89 |
MonthlyCharges | 0.00 | 1.00 | 0.00 | 0.00 | 0.01 | 1.00 | 1.01 | 5.92 | <0.005 | 28.19 |
Concordance | 0.55 |
---|---|
Log-likelihood ratio test | 132.41 on 2 df, -log2(p)=95.52 |
Note: If the Hazard Ratio is <1, then there is a lower hazard rate between some two groups. Else, it's higher.
cox_saRegression.plot()
plt.title('Confidence Interval for Cox Model')
plt.ylabel('Factors/Covariates')
plt.show()
cox_saRegression.summary
coef | exp(coef) | se(coef) | coef lower 95% | coef upper 95% | exp(coef) lower 95% | exp(coef) upper 95% | z | p | -log2(p) | |
---|---|---|---|---|---|---|---|---|---|---|
SeniorCitizen | 0.469313 | 1.598895 | 0.054524 | 0.362448 | 0.576178 | 1.436843 | 1.779225 | 8.607475 | 7.468755e-18 | 56.893838 |
MonthlyCharges | 0.004781 | 1.004792 | 0.000808 | 0.003197 | 0.006364 | 1.003202 | 1.006385 | 5.917166 | 3.275355e-09 | 28.185702 |
To visbally understand the impact of a factor, we plot survival curves for varying covariate values, while holding everything else equal. This is useful to understand the impact of a factor, given the model.
covariateElement = 'SeniorCitizen'
cox_saRegression.plot_covariate_groups(covariates='SeniorCitizen', values=[0, 1], cmap='coolwarm')
plt.title(f'Covariate Example: Cox Model by Covariate {covariateElement}')
plt.xlabel('Time')
plt.ylabel('Survival Rate')
plt.show()
This is an introduction to Survival Regression, but what about cases when:
In that case, instead of using the Cox Proportion Model, we could use the Aaelen's model or other parametric or customized models from Lifelines!
You can more about this in the Lifelines documentation.