Introduction to Survival Analysis

Survival Function and Hazard Rate

Survival Function
Definition

Let $T \geq 0$ be a non-negative Random Variable representing survival time with PDF $f (t)$ and CDF $F (t)$ . The survival function of the random variable is defined as $S (t) = 1 - F (t) = P (T > t)$

The survival function is a function that gives the probability that a patient, device, or other object of interest will survive past a certain time.

Hazard Function
Definition

Hazard Function

Let $T \geq 0$ be a non-negative Random Variable representing survival time with PDF $f (t)$ and CDF $F (t)$ . The hazard function $0 < λ (t) < \infty$ of the random variable $T$ is defined as $λ (t) = \frac{f (t)}{S (t)} = \lim_{Δ t \to 0} \frac{P (t \leq T < t + Δ t ∣ T \geq t)}{Δ t}$ where $S (t)$ is the Survival Function.

The hazard function is the instantaneous rate at which the event occurs at time $t$ , given survival up to time $t$ .

Cumulative Hazard Function

The hazard function can alternatively be represented in terms of the cumulative hazard function, defined as $Λ (t) = \int_{0}^{t} λ (u) d u$

Facts

The Survival Function $S (t)$ , the Cumulative Hazard Function $Λ (t)$ , the density (PDF) $f (t)$ , the Hazard Function $λ (t)$ , and the distribution function (CDF) of survival time $F (t)$ are related through $S (t) = \exp [- Λ (t)] = \frac{f (t)}{λ (t)} = 1 - F (t)$
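
The identities above can be checked numerically. The following is a minimal sketch, assuming the Weibull model that appears later in these notes; the function names and the parameter values are illustrative choices, not part of the notes.

```python
import math

def weibull_pdf(t, lam, alpha):
    # f(t) = alpha * lam * (lam t)^(alpha-1) * exp[-(lam t)^alpha]
    return alpha * lam * (lam * t) ** (alpha - 1) * math.exp(-((lam * t) ** alpha))

def weibull_survival(t, lam, alpha):
    # S(t) = exp[-(lam t)^alpha]
    return math.exp(-((lam * t) ** alpha))

def weibull_hazard(t, lam, alpha):
    # lambda(t) = alpha * lam * (lam t)^(alpha-1)
    return alpha * lam * (lam * t) ** (alpha - 1)

def cumulative_hazard(t, lam, alpha, steps=20000):
    # Lambda(t) = integral of the hazard over (0, t), midpoint rule
    h = t / steps
    return sum(weibull_hazard((k + 0.5) * h, lam, alpha) for k in range(steps)) * h

# Illustrative values (assumed for the example)
t, lam, alpha = 1.7, 0.5, 2.0
S = weibull_survival(t, lam, alpha)
via_cumulative_hazard = math.exp(-cumulative_hazard(t, lam, alpha))
via_density_ratio = weibull_pdf(t, lam, alpha) / weibull_hazard(t, lam, alpha)
print(S, via_cumulative_hazard, via_density_ratio)
```

All three printed values should agree up to numerical integration error.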

Types of Censoring

Right Censoring
Kinds

Type 1 Censoring
Definition

Type 1 Censoring

when $c_{1} = \dots = c_{n} = t_{c}$ ( $c_{i}$ ’s are constant)

Suppose that $T_{1}, T_{2}, \dots, T_{n}$ are i.i.d. random variables representing survival times with CDF $F$ , and $C_{1}, C_{2}, \dots, C_{n}$ are the censoring times. Let $δ_{i} = \begin{cases} 1 & T_{i} \leq C_{i} \text{ (uncensored)} \\ 0 & T_{i} > C_{i} \text{ (censored)} \end{cases}$ be the censoring indicator. We observe $(Y_{i}, δ_{i}), i = 1, \dots, n$ where $Y_{i} = \min (T_{i}, C_{i})$ .

In a type 1 censoring setting, the censoring times $C_{i}$ are fixed constants, not random variables.

The PDF of the observation $Y_{i}$ is derived as $\begin{aligned} P (Y_{i} = C_{i}, δ_{i} = 0) &= P (T_{i} > C_{i}) = S (y_{i}) \\ P (Y_{i} = T_{i}, δ_{i} = 1) &= f (y_{i}) \end{aligned}$

Likelihood of Type 1 Censoring Data

The Likelihood Function of type 1 censoring data is defined as $L (θ) = \prod_{i=1}^{n} f (y_{i})^{δ_{i}} S (y_{i})^{1 - δ_{i}} = \prod_{i=1}^{n} λ (y_{i})^{δ_{i}} S (y_{i}) = \prod_{i=1}^{n} λ (y_{i})^{δ_{i}} \exp [- Λ (y_{i})]$
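
As a worked instance of this likelihood, consider a constant hazard $λ (t) = λ$ , for which $\ln L (λ) = \sum_{i} [δ_{i} \ln λ - λ y_{i}]$ and the maximizer is $\hat{λ} = \sum_{i} δ_{i} / \sum_{i} y_{i}$ . The sketch below assumes this exponential model; the data and function names are made up for illustration.

```python
import math

def exp_censored_loglik(lam, y, delta):
    # log L = sum_i [ delta_i * log(lam) - lam * y_i ]  (exponential model)
    return sum(d * math.log(lam) - lam * yi for yi, d in zip(y, delta))

def exp_censored_mle(y, delta):
    # Closed-form maximizer: number of observed events / total time at risk
    return sum(delta) / sum(y)

# Hypothetical type 1 censored sample with common censoring time t_c = 2.0
y = [0.8, 2.0, 1.3, 2.0, 0.4, 2.0]
delta = [1, 0, 1, 0, 1, 0]

lam_hat = exp_censored_mle(y, delta)
print(lam_hat, exp_censored_loglik(lam_hat, y, delta))
```

Censored observations contribute to the denominator (time at risk) but not to the numerator (event count), which is how the likelihood uses their information.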

Type 2 Censoring
Definition

Type 2 Censoring

when $r = 3$

Suppose that $T_{1}, T_{2}, \dots, T_{n}$ are i.i.d. random variables representing survival times with CDF $F$ , and $C_{1}, C_{2}, \dots, C_{n}$ are the censoring times. Let $δ_{i} = \begin{cases} 1 & T_{i} \leq C_{i} \text{ (uncensored)} \\ 0 & T_{i} > C_{i} \text{ (censored)} \end{cases}$ be the censoring indicator. We observe $(Y_{i}, δ_{i}), i = 1, \dots, n$ where $Y_{i} = \min (T_{i}, C_{i})$ .

In a type 2 censoring setting, we observe only the first $r$ failures out of $n$ experimental units. In other words, for the order statistics $T_{(i)}$ of $T_{i}$ , we only observe $T_{(1)} \leq T_{(2)} \leq \dots \leq T_{(r)}$ . Here the common censoring time $C_{1} = C_{2} = \dots = C_{n} = T_{(r)}$ is not a constant but a random variable.

Likelihood of Type 2 Censoring Data

The likelihood function of type 2 censored data can be computed using the same equation used for type 1 censored data, but computing the joint PDF of the order statistics $T_{(1)} \leq \dots \leq T_{(r)}$ is easier.

The joint PDF of $T_{(1)} \leq \dots \leq T_{(r)}$ is derived as $L (θ) = \frac{n !}{(n - r) !} [\prod_{i=1}^{r} f (t_{(i)})] S (t_{(r)})^{n - r}$

Random Censoring
Definition

Random Censoring

Suppose that $T_{1}, T_{2}, \dots, T_{n}$ are i.i.d. random variables representing survival times with CDF $F$ , and $C_{1}, C_{2}, \dots, C_{n}$ are the censoring times, which can be either random variables or constants. Let $δ_{i} = \begin{cases} 1 & T_{i} \leq C_{i} \text{ (uncensored)} \\ 0 & T_{i} > C_{i} \text{ (censored)} \end{cases}$ be the censoring indicator.

In a random censoring setting, we only observe $(Y_{i}, δ_{i}), i = 1, \dots, n$ where $Y_{i} = \min (T_{i}, C_{i})$ , and the censoring times $C_{1}, \dots, C_{n}$ are i.i.d. random variables with PDF $g_{γ}$ and CDF $G_{γ}$ .

The PDF of the observation $Y_{i}$ is derived, using the independence of $T_{i}$ and $C_{i}$ , as $\begin{aligned} P (Y_{i} = t, δ_{i} = 0) &= P (C_{i} = t, T_{i} > C_{i}) = g_{γ} (t) S_{θ} (t) \\ P (Y_{i} = t, δ_{i} = 1) &= P (T_{i} = t, T_{i} \leq C_{i}) = f_{θ} (t) (1 - G_{γ} (t)) \end{aligned}$ where $θ$ is the parameter of interest, and $γ$ is the nuisance parameter.

Likelihood of Random Censoring Data

The Likelihood Function of random censored data is defined as $L (θ) = \prod_{i=1}^{n} [f_{θ} (y_{i}) (1 - G_{γ} (y_{i}))]^{δ_{i}} [S_{θ} (y_{i}) g_{γ} (y_{i})]^{1 - δ_{i}} = c \cdot \prod_{i=1}^{n} f_{θ} (y_{i})^{δ_{i}} S_{θ} (y_{i})^{1 - δ_{i}}$ where $c$ collects the factors involving $G_{γ}$ and $g_{γ}$ and is constant with respect to $θ$ .

Left Censoring
Definition

Suppose that $T_{1}, T_{2}, \dots, T_{n}$ are i.i.d. random variables representing survival times, and $C_{1}, C_{2}, \dots, C_{n}$ are the censoring times, which can be either random variables or constants. Let $δ_{i} = \begin{cases} 1 & T_{i} \geq C_{i} \text{ (uncensored)} \\ 0 & T_{i} < C_{i} \text{ (censored)} \end{cases}$ be the censoring indicator.

In a left censoring setting, we observe $(Y_{i}, δ_{i}), i = 1, \dots, n$ where $Y_{i} = \max (T_{i}, C_{i})$ . In other words, for a censored observation the event had already occurred before the subject came under observation.

Interval-Censored Data
Definition

Suppose that $T_{1}, T_{2}, \dots, T_{n}$ are i.i.d. random variables representing survival times. Interval-censored data are given as an interval, not an exact point in time: we only observe the interval $(L_{i}, R_{i}]$ that contains $T_{i}$ . Interval-censored data are divided into four cases.

Case 1 Interval-Censored Data (Current Status Data)

Data are given in the form of $[0, C_{i}]$ or $(C_{i}, \infty)$ , where $C_{i}$ is a fixed time point.

Case 2 Interval-Censored Data

Data are given in the form of $(L_{i}, R_{i}]$ where $0 < L_{i} < R_{i} < \infty$ .

Double Censored Data

Data are given in the form of $(L_{i}, R_{i}]$ where $0 \leq L_{i} \leq R_{i} \leq \infty$ .

If $R_{i} = \infty$ , the observation is right-censored; if $L_{i} = 0$ , it is left-censored; and if $L_{i} = R_{i}$ , it is uncensored.

Panel Data

Observations are made at discrete time points. The period between these observations can be viewed as an interval.

Mean Imputation Method
Definition

The mean imputation method substitutes the given interval of case 2 interval-censored data $(L_{i}, R_{i}]$ with the midpoint of the interval $Y_{i} = \frac{L_{i} + R_{i}}{2}$ if the interval is finite. If $R_{i} = \infty$ , the data $(L_{i}, \infty)$ is substituted with the left endpoint $Y_{i} := L_{i}$ and treated as right-censored data.
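
The rule can be sketched in a few lines. The function name `mean_impute` and the sample intervals are hypothetical; the returned indicator follows the convention used in these notes (1 for an observation treated as exact, 0 for right-censored).

```python
import math

def mean_impute(left, right):
    """Map an interval (left, right] to an imputed pair (y, delta)."""
    if math.isinf(right):
        # (L, inf): keep the left endpoint and mark as right-censored
        return left, 0
    # Finite interval: replace it with its midpoint, treated as an exact time
    return (left + right) / 2.0, 1

# Hypothetical intervals: finite, right-censored, and exact (L = R)
intervals = [(1.0, 3.0), (2.0, math.inf), (0.5, 0.5)]
imputed = [mean_impute(l, r) for l, r in intervals]
print(imputed)
```

After imputation the data have the form $(Y_{i}, δ_{i})$ , so methods for right-censored data can be applied.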

Parametric Models

Distributions

Survival Models based on Distributions
Kinds

Exponential Distribution

The Exponential Distribution assumes that the Hazard Function is constant over time. The distribution is memoryless: $P (T > t + s ∣ T > s) = P (T > t)$ for all $s, t > 0$ .

The PDF of survival time follows Exponential Distribution $f (t) = λ e^{- λ t}$ . The Hazard Function is $λ (t) = λ$ and the Survival Function is $S (t) = e^{- λ t}$ .

Gamma Distribution

The PDF of survival time follows Gamma Distribution $f (t) = \frac{λ^{α}}{Γ (α)} t^{α - 1} e^{- λ t}$ . The Hazard Function and Survival Function do not have closed-form expressions in general.

Weibull Distribution

The Weibull Distribution adds a shape parameter $α$ to the Exponential Distribution to control the shape of the Hazard Function. If $α > 1$ then the hazard function is increasing, if $α = 1$ then it is constant, and if $0 < α < 1$ then it is decreasing.

The PDF of survival time follows Weibull Distribution $f (t) = α λ (λ t)^{α - 1} \exp [- (λ t)^{α}]$ where $λ$ is a scale parameter and $α$ is a shape parameter.

The Hazard Function is $λ (t) = α λ (λ t)^{α - 1}$ and the Survival Function is $S (t) = \exp [- (λ t)^{α}]$ .

Rayleigh Distribution

The PDF of survival time follows Rayleigh distribution $f (t) = (λ_{0} + λ_{1} t) \exp (- λ_{0} t - \frac{1}{2} λ_{1} t^{2})$ . The Hazard Function is $λ (t) = λ_{0} + λ_{1} t$ and the Survival Function is $S (t) = \exp (- λ_{0} t - \frac{1}{2} λ_{1} t^{2})$ .

Log-Normal Distribution

We assume that the survival time follows a Log-Normal Distribution $T \sim Log-normal (μ, σ^{2})$ or $ln T \sim N (μ, σ^{2})$ . The Hazard Function of the distribution is hump-shaped.

The PDF of survival time follows Log-Normal Distribution $f (t) = \frac{1}{\sqrt{2 π} σ} \frac{1}{t} \exp (- \frac{(\ln t - μ)^{2}}{2 σ^{2}})$ . The Survival Function is $S (t) = 1 - Φ (\frac{\ln t - μ}{σ})$ where $Φ$ is the CDF of the standard normal distribution.

The Hazard Function doesn’t have closed form expression.

Gompertz Distribution

The Hazard Function is log-linear in time: $\ln λ (t) = λ_{0} + λ_{1} t \Leftrightarrow λ (t) = λ e^{γ t}$ where $λ := e^{λ_{0}}$ and $γ := λ_{1}$

Gompertz-Makeham Distribution

The Hazard Function adds a constant term $α$ to the Gompertz hazard: $λ (t) = α + λ e^{γ t}$

Survival Models based on Log-Lifetime
Definition

Assume that the logarithm of the survival time $T$ can be modeled with the Log-Linear Model $Y = \ln T = σ W + α$ where $σ$ is the Scale Parameter, $α$ is the Location Parameter, and $W$ is a random variable with a known distribution.

Kinds

Standard Gumbel Distribution

If $W$ follows the standard Gumbel distribution (extreme value distribution), then $f (y) = \frac{1}{σ} \exp [(\frac{y - α}{σ}) - \exp (\frac{y - α}{σ})]$ and $f (t) = \frac{1}{σ t} \exp [(\frac{\ln t - α}{σ}) - \exp (\frac{\ln t - α}{σ})]$ . If $α = - \ln λ$ and $σ = \frac{1}{ρ}$ , then $f (t) = ρ λ (λ t)^{ρ - 1} \exp [- (λ t)^{ρ}]$ , so $T$ follows a Weibull Distribution.
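
The reparametrization can be verified numerically by evaluating both forms of the density at a few points. This is only a sanity-check sketch; the function names and parameter values are assumptions made for the example.

```python
import math

def loglinear_gumbel_pdf(t, alpha, sigma):
    # Density of T = exp(sigma * W + alpha) with W standard Gumbel (minimum)
    w = (math.log(t) - alpha) / sigma
    return math.exp(w - math.exp(w)) / (sigma * t)

def weibull_pdf(t, lam, rho):
    # f(t) = rho * lam * (lam t)^(rho-1) * exp[-(lam t)^rho]
    return rho * lam * (lam * t) ** (rho - 1) * math.exp(-((lam * t) ** rho))

# Illustrative Weibull parameters and the matching log-linear parameters
lam, rho = 0.7, 1.8
alpha, sigma = -math.log(lam), 1.0 / rho

grid = [0.1, 0.5, 1.0, 2.0, 4.0]
max_gap = max(
    abs(loglinear_gumbel_pdf(t, alpha, sigma) - weibull_pdf(t, lam, rho))
    for t in grid
)
print(max_gap)  # should be at floating-point rounding level
```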

Normal Distribution

If $W$ follows the standard normal distribution, then $T$ follows a log-normal distribution $f (t) = \frac{1}{\sqrt{2 π}} \frac{ρ}{t} \exp [- \frac{ρ^{2}}{2} (\ln λ t)^{2}]$ where $α = - \ln λ$ and $σ = \frac{1}{ρ}$

Logistic Distribution

If $W$ follows a Logistic Distribution, then $T$ follows a Log-Logistic Distribution: $f (y) = \frac{1}{σ} \frac{\exp (\frac{y - α}{σ})}{[1 + \exp (\frac{y - α}{σ})]^{2}}$ and $f (t) = λ ρ \frac{(λ t)^{ρ - 1}}{[1 + (λ t)^{ρ}]^{2}}$ where $α = - \ln λ$ and $σ = \frac{1}{ρ}$ .

Gumbel Distribution

If $W$ follows a Gumbel Distribution (generalized extreme value distribution), then $f (t) = λ ρ (λ t)^{ρ k - 1} \frac{\exp [- (λ t)^{ρ}]}{Γ (k)}$ where $α = - \ln λ$ and $σ = \frac{1}{ρ}$ . This is the density of the generalized gamma distribution.

Special Cases

If $ρ = 1$ then $T \sim Γ (k, λ)$
If $k = 1$ then $T \sim Weibull (λ, ρ)$
If $ρ = k = 1$ then $T \sim Exp (λ)$
If $k \to \infty$ then $T \sim Lognormal (μ, σ^{2})$

Exponential-F Distribution

If $e^{W}$ follows an F-Distribution $F (2 m, 2 n)$ , then $T$ follows a generalized F-distribution.

Special Cases

If $m = n = 1$ then $T$ follows a Log-Logistic Distribution
If $n \to \infty$ then $T$ follows a generalized gamma distribution
If $m, n \to \infty$ then $T$ follows a Log-Normal Distribution

Survival Models with Surviving Fractions
Definition

Consider an event such as death from a specific cause or the incidence of a particular disease.
Define a binary variable $z$ where $z = 1$ indicates that an individual will eventually experience the event, and $z = 0$ indicates that the individual will never experience the event.
Let $T$ denote the time of occurrence of the event with PDF $f_{θ}$ and CDF $F_{θ}$ ; $T$ is defined only when $z = 1$ , and $T$ and $z$ are independent. We observe $(Y_{i}, δ_{i}), i = 1, \dots, n$ where $Y_{i} = \min (T_{i}, C_{i})$ and $δ_{i} = \begin{cases} 1 & T_{i} \leq C_{i} \text{ (uncensored)} \\ 0 & T_{i} > C_{i} \text{ (censored)} \end{cases}$ .

Let $p = P (z = 1)$ denote the probability that an individual eventually experiences the event.

No Covariate Case

$\begin{aligned} P (δ_{i} = 1, T_{i} = y_{i}, z_{i} = 1) &= p f_{θ} (y_{i}) \\ P (δ_{i} = 0, Y_{i} = y_{i}) &= P (z_{i} = 0) + P (T_{i} > y_{i}, z_{i} = 1) = (1 - p) + p S_{θ} (y_{i}) \end{aligned}$

The likelihood is defined as $L (θ) = \prod_{i=1}^{n} [p f_{θ} (y_{i})]^{δ_{i}} [(1 - p) + p S_{θ} (y_{i})]^{1 - δ_{i}}$
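
To make the structure of this likelihood concrete, the sketch below evaluates its logarithm under the assumption that event times among susceptible individuals are exponential, $f_{θ} (t) = λ e^{- λ t}$ . The exponential choice, the function name, and the data are illustrative assumptions only.

```python
import math

def cure_loglik(p, lam, y, delta):
    """Log-likelihood of the no-covariate surviving-fraction model.

    Assumes an exponential event-time distribution for susceptible
    individuals and 0 < p <= 1.
    """
    total = 0.0
    for yi, d in zip(y, delta):
        if d == 1:
            # Observed event: log[p * f(y_i)]
            total += math.log(p) + math.log(lam) - lam * yi
        else:
            # Censored: log[(1 - p) + p * S(y_i)]
            total += math.log((1 - p) + p * math.exp(-lam * yi))
    return total

# Hypothetical data
y = [0.5, 1.2, 3.0, 3.0, 0.9, 3.0]
delta = [1, 1, 0, 0, 1, 0]
print(cure_loglik(0.6, 0.8, y, delta))
```

With $p = 1$ the censored term reduces to $S_{θ} (y_{i})$ and the expression collapses to the ordinary right-censored likelihood.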

With Covariates Case

The likelihood is defined as $L (θ) = \prod_{i=1}^{n} [p (x_{i}) f_{θ} (y_{i} ∣ x_{i})]^{δ_{i}} [(1 - p (x_{i})) + p (x_{i}) S_{θ} (y_{i} ∣ x_{i})]^{1 - δ_{i}}$

Nonparametric Methods: One Sample

Life Tables

Empirical Survival Function
Definition

Let $T_{i}, i = 1, 2, \dots, n$ be i.i.d. random variables with the Survival Function $S (t)$ , then the empirical survival function is defined as $\hat{S} (t) = \frac{1}{n} \sum_{i=1}^{n} I (T_{i} > t)$ . Since each $I (T_{i} > t)$ is a Bernoulli random variable with success probability $S (t)$ , $n \hat{S} (t) \sim B (n, S (t))$

Reduced Sample Estimator
Definition

Notations

Assume the intervals have equal length and use the following notation:

$I_{i}$ : the $i$ -th interval $(τ_{i - 1}, τ_{i}]$ where $τ_{0} = 0$

$n_{i}$ : the number of individuals alive at the beginning of $I_{i}$

$d_{i}$ : the number of deaths during $I_{i}$

$l_{i}$ : the number of individuals censored during $I_{i}$

$P_{i}$ : $P (T > τ_{i} ∣ T > τ_{i - 1}) = P (\text{surviving through } I_{i} ∣ \text{alive at beginning of } I_{i})$ where $T$ is the survival time.

$q_{i} = 1 - P_{i}$

Estimation of Reduced Sample Estimator

The reduced sample method estimates the Survival Function as $\hat{S} (τ_{k}) = 1 - \frac{d_{k}^{*}}{n_{k}^{*}}$ where $d_{k}^{*} = \sum_{i=1}^{k} d_{i}$ is the number of deaths in the interval $(0, τ_{k}]$ , and $n_{k}^{*} = n_{1} - \sum_{i=1}^{k} l_{i}$ is the number of individuals not censored in $(0, τ_{k}]$ .

The drawback of the reduced sample method is that it ignores the information contained in censored observations, so it is usually a biased estimator that underestimates the Survival Function $S (t)$ .

Life Table Estimator
Definition

Notations

Assume the intervals have equal length and use the following notation:

$I_{i}$ : the $i$ -th interval $(τ_{i - 1}, τ_{i}]$ where $τ_{0} = 0$

$n_{i}$ : the number of individuals alive at the beginning of $I_{i}$

$d_{i}$ : the number of deaths during $I_{i}$

$l_{i}$ : the number of individuals censored during $I_{i}$

$P_{i}$ : $P (T > τ_{i} ∣ T > τ_{i - 1}) = P (\text{surviving through } I_{i} ∣ \text{alive at beginning of } I_{i})$ where $T$ is the survival time.

$q_{i} = 1 - P_{i}$

Estimation of Life Table Estimator

The life table estimator is derived from the expression $S (τ_{k}) = \prod_{i=1}^{k} P_{i}$ where $P_{i} = P (T > τ_{i} ∣ T > τ_{i - 1})$

The life table estimator estimates the Survival Function as $\hat{S} (τ_{k}) = \prod_{i=1}^{k} \hat{P}_{i}$ where $\hat{P}_{i} = 1 - \frac{d_{i}}{n_{i}^{'}}$ and $n_{i}^{'} = n_{i} - \frac{l_{i}}{2}$ is called the effective sample size.

We assume that, on average, those individuals who became censored during $I_{i}$ were at risk for half the interval.

Variance of Life Table Estimator

For a given $n_{i}^{'}$ , assume that $n_{i}^{'} \hat{P}_{i}$ approximately follows $B (n_{i}^{'}, P_{i})$ , where $P_{i} = \frac{S (τ_{i})}{S (τ_{i - 1})}$ . Since $\ln \hat{S} (τ_{k}) = \sum_{i=1}^{k} \ln \hat{P}_{i}$ , under the assumption of the independence of the $\ln \hat{P}_{i}$ ’s, the variance of the life table estimator is approximated as $Var [\ln \hat{S} (τ_{k})] \approx \sum_{i=1}^{k} Var [\ln \hat{P}_{i}]$ . Then, by the Delta Method, Greenwood’s formula is derived as $\hat{Var} (\hat{S} (τ_{k})) = \hat{S} (τ_{k})^{2} \sum_{i=1}^{k} \frac{d_{i}}{n_{i}^{'} (n_{i}^{'} - d_{i})}$

Confidence Interval for Life Table Estimator

Under the asymptotic normality of the estimator, we can use $\hat{S} (t) \pm z_{α /2} \sqrt{\hat{Var} (\hat{S} (t))}$ as a $100 (1 - α) \%$ confidence interval for $S (t)$ . However, this interval can extend outside $(0, 1)$ . To avoid this problem, the log-log transformation is used.

Log-log Transformation

To guarantee that the CI for $S (t)$ lies within $(0, 1)$ , use the log-log transformation. Let $V (t) = \ln (- \ln S (t))$ . Then, by the delta method, $\hat{Var} [\hat{V} (t)] \approx \frac{\hat{Var} (\hat{S} (t))}{[\ln \hat{S} (t)]^{2} \hat{S} (t)^{2}}$ , i.e. the CI for $V (t)$ is $\hat{V} (t) \pm z_{α /2} \sqrt{\hat{Var} (\hat{V} (t))}$ and the CI for $S (t)$ is calculated as $\exp [- \exp (\hat{V} (t) \pm z_{α /2} \sqrt{\hat{Var} (\hat{V} (t))})] \in (0, 1)$
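
A small sketch of the log-log interval, written with the standard error (square root of the variance) in the interval. The function name `loglog_ci`, the default $z = 1.96$ , and the input values are assumptions for the example.

```python
import math

def loglog_ci(s_hat, var_s, z=1.96):
    """Confidence interval for S(t) via V = ln(-ln S); requires 0 < s_hat < 1."""
    v = math.log(-math.log(s_hat))
    # Delta method: se(V_hat) = se(S_hat) / |S_hat * ln(S_hat)|
    se_v = math.sqrt(var_s) / abs(s_hat * math.log(s_hat))
    # exp(-exp(.)) is decreasing, so the upper end of V gives the lower end of S
    lower = math.exp(-math.exp(v + z * se_v))
    upper = math.exp(-math.exp(v - z * se_v))
    return lower, upper

# Hypothetical estimate near 1, where the plain interval overshoots
s_hat, var_s = 0.95, 0.002
plain_upper = s_hat + 1.96 * math.sqrt(var_s)
lo, hi = loglog_ci(s_hat, var_s)
print(plain_upper, lo, hi)
```

In this example the untransformed upper limit exceeds 1, while the transformed interval stays inside $(0, 1)$ and is asymmetric around the point estimate.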

Examples

Consider a life table

$I_{i}$	$n_{i}$	$d_{i}$	$l_{i}$	$n_{i}^{*} = n_{1} - \sum_{j=1}^{i} l_{j}$	$d_{i}^{*} = \sum_{j=1}^{i} d_{j}$	$\hat{S}_{R} = 1 - \frac{d_{i}^{*}}{n_{i}^{*}}$	$n_{i}^{'} = n_{i} - \frac{l_{i}}{2}$	$\hat{P}_{i} = 1 - \frac{d_{i}}{n_{i}^{'}}$	$\hat{S}_{L} = \prod_{j=1}^{i} \hat{P}_{j}$
$(0, 1]$	$126$	$47$	$19$	$107$	$47$	$0.56$	$116.5$	$0.60$	$0.60$
$(1, 2]$	$60$	$5$	$17$	$90$	$52$	$0.42$	$51.5$	$0.90$	$0.54$
$(2, 3]$	$38$	$2$	$15$	$75$	$54$	$0.28$	$30.5$	$0.93$	$0.50$
$(3, 4]$	$21$	$2$	$9$	$66$	$56$	$0.15$	$16.5$	$0.88$	$0.44$
$(4, 5]$	$10$	$0$	$6$	$60$	$56$	$0.07$	$7$	$1.00$	$0.44$
where $\hat{S}_{R}$ is the Reduced Sample Estimator and $\hat{S}_{L}$ is the life table estimator.
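
The two estimator columns of the table can be recomputed from the $n_{i}$ , $d_{i}$ , and $l_{i}$ columns. The following sketch uses the table's data; the function names are illustrative.

```python
# Columns n_i, d_i, l_i from the life table above
n = [126, 60, 38, 21, 10]
d = [47, 5, 2, 2, 0]
l = [19, 17, 15, 9, 6]

def reduced_sample(n, d, l):
    # S_R(tau_k) = 1 - (cumulative deaths) / (n_1 - cumulative censored)
    out, cum_d, cum_l = [], 0, 0
    for di, li in zip(d, l):
        cum_d += di
        cum_l += li
        out.append(1 - cum_d / (n[0] - cum_l))
    return out

def life_table(n, d, l):
    # S_L(tau_k) = prod_i (1 - d_i / n_i'), with n_i' = n_i - l_i / 2
    out, s = [], 1.0
    for ni, di, li in zip(n, d, l):
        s *= 1 - di / (ni - li / 2)
        out.append(s)
    return out

S_R = [round(v, 2) for v in reduced_sample(n, d, l)]
S_L = [round(v, 2) for v in life_table(n, d, l)]
print(S_R)
print(S_L)
```

The reduced sample estimates fall well below the life table estimates in later intervals, which illustrates the downward bias caused by discarding censored individuals.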



Kaplan-Meier Estimator

Kaplan-Meier Estimator
Definition

Kaplan-Meier Estimator

Consider a Random Censoring case $Y_{i} = \min (T_{i}, C_{i})$ . Assume that $y_{1} \leq y_{2} \leq \dots \leq y_{n}$ , and the distinct failure times are $τ_{1} < τ_{2} < \dots < τ_{k}$ where $k \leq n$ . Let $d_{j} := \sum_{i=1}^{n} I (y_{i} = τ_{j}, δ_{i} = 1)$ be the number of deaths at $τ_{j}$ , and $n_{j} := \sum_{i=1}^{n} I (y_{i} \geq τ_{j})$ be the number of individuals still at risk at $τ_{j}$ , where the set $\{y_{i} ∣ y_{i} \geq τ_{j}\}$ is called the risk set at $τ_{j}$ . We only observe $(Y_{i}, δ_{i}), i = 1, \dots, n$

The Kaplan-Meier estimator is derived from the expression $S (τ_{k}) = \prod_{j=1}^{k} P_{j}, P_{j} = P (T > τ_{j} ∣ T > τ_{j - 1})$ where $τ_{0} = 0$

General Case

As an estimator of $P_{j}$ , consider $\hat{P}_{j} = \hat{P} (T > τ_{j} ∣ T > τ_{j - 1}) = \frac{\hat{P} (T > τ_{j})}{\hat{P} (T > τ_{j - 1})} = \frac{n_{j + 1} / n_{1}}{n_{j} / n_{1}} = \frac{n_{j} - d_{j}}{n_{j}} = 1 - \frac{d_{j}}{n_{j}}$

The Kaplan-Meier estimator is defined with the estimated $\hat{P}_{j}$ ’s: $\hat{S} (t) = \prod_{j : τ_{j} \leq t} \hat{P}_{j} = \prod_{j : τ_{j} \leq t} (1 - \frac{d_{j}}{n_{j}})$ . The cumulative hazard function is estimated by the Nelson-Aalen Estimator following the same logic: $\hat{Λ} (t) = \sum_{j : τ_{j} \leq t} \frac{d_{j}}{n_{j}}$

No Ties Case

When there are no ties in the observations, $y_{1} < y_{2} < \dots < y_{n}$ , each observation can be indexed on its own: the number of deaths at $y_{j}$ equals the censoring indicator, $d_{j} = δ_{j}$ , and $n_{j} = n - j + 1$ . Thus, the Kaplan-Meier estimator is defined as $\hat{S} (t) = \prod_{j : y_{j} \leq t} (1 - \frac{δ_{j}}{n - j + 1}) = \prod_{j : y_{j} \leq t} (1 - \frac{1}{n - j + 1})^{δ_{j}}$
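
The general-case definitions translate directly into code. The sketch below computes $n_{j}$ , $d_{j}$ , the Kaplan-Meier estimate, and the Nelson-Aalen estimate at each distinct failure time; the function name and the small data set are made up for illustration.

```python
def kaplan_meier(y, delta):
    """Return rows (tau_j, n_j, d_j, S_hat(tau_j), Lambda_hat(tau_j))."""
    # Distinct failure times: observed times with delta = 1
    failure_times = sorted({t for t, d in zip(y, delta) if d == 1})
    s, cum_hazard, rows = 1.0, 0.0, []
    for tau in failure_times:
        n_j = sum(1 for t in y if t >= tau)  # size of the risk set
        d_j = sum(1 for t, d in zip(y, delta) if t == tau and d == 1)
        s *= 1 - d_j / n_j          # Kaplan-Meier product term
        cum_hazard += d_j / n_j     # Nelson-Aalen increment
        rows.append((tau, n_j, d_j, s, cum_hazard))
    return rows

# Hypothetical data: delta = 1 is an observed failure, 0 is censored
y = [3, 5, 5, 8, 10, 12]
delta = [1, 1, 0, 1, 0, 1]
rows = kaplan_meier(y, delta)
for row in rows:
    print(row)
```

Censored observations never create a step, but they leave the risk set after their censoring time, which changes $n_{j}$ at later failure times.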

Properties

Self-Consistency

An estimator $\hat{S} (t)$ is self-consistent if $\hat{S} (t) = \frac{1}{n} \sum_{i=1}^{n} [1 \cdot I (Y_{i} > t) + 0 \cdot I (Y_{i} \leq t, δ_{i} = 1) + \frac{\hat{S} (t)}{\hat{S} (Y_{i})} I (Y_{i} \leq t, δ_{i} = 0)] = \frac{1}{n} [N_{y} (t) + \sum_{i : y_{i} \leq t} (1 - δ_{i}) \frac{\hat{S} (t)}{\hat{S} (Y_{i})}]$ where $N_{y} (t) = \sum_{i=1}^{n} I (Y_{i} > t) = \# \{i : Y_{i} > t\}$

The Kaplan-Meier estimator is the unique self-consistent estimator for $t < Y_{(n)}$ where $Y_{(n)}$ is the largest observation.

Generalized MLE

The Kaplan-Meier estimator is the generalized maximum likelihood estimator of the Survival Function $S$ .

Strong Consistency

The Kaplan-Meier estimator converges uniformly, almost surely, to $S (t)$ : $\hat{S} (t) \xrightarrow{a.s.} S (t), \forall t \in R^{+}$

Proof

Consider a function $S^{*} (t) = P (Y > t)$ and decompose it to the sum of the subsurvival functions $S_{u}^{*} (t)$ and $S_{c}^{*} (t)$ . $S^{*} (t) = S_{u}^{*} (t) + S_{c}^{*} (t)$ where $S_{u}^{*} (t) = P (Y > t, δ = 1)$ is the uncensored case and $S_{c}^{*} (t) = P (Y > t, δ = 0)$ is the censored case.

Then, the survival function $S (t) = P (T > t)$ can be expressed as a function of the subsurvival functions. $S (t) = Ψ (S_{u}^{*}, S_{c}^{*}, t)$

Define the empirical subsurvival functions $\hat{S}_{u}^{*} (t) = \frac{1}{n} \sum_{i=1}^{n} I (Y_{i} > t, δ_{i} = 1)$ and $\hat{S}_{c}^{*} (t) = \frac{1}{n} \sum_{i=1}^{n} I (Y_{i} > t, δ_{i} = 0)$ . The Kaplan-Meier estimator can also be expressed as a function of the empirical subsurvival functions. $\hat{S} (t) = Ψ (\hat{S}_{u}^{*}, \hat{S}_{c}^{*}, t)$

By the Glivenko-Cantelli theorem, $\hat{S}_{u}^{*} (t) \xrightarrow{a.s.} S_{u}^{*} (t)$ and $\hat{S}_{c}^{*} (t) \xrightarrow{a.s.} S_{c}^{*} (t)$ for all $t \in R^{+}$ . Since $Ψ$ is a continuous function of $S_{u}^{*} (t)$ and $S_{c}^{*} (t)$ , $\hat{S} (t) = Ψ (\hat{S}_{u}^{*}, \hat{S}_{c}^{*}, t) \xrightarrow{a.s.} Ψ (S_{u}^{*}, S_{c}^{*}, t) = S (t), \forall t \in R^{+}$

Asymptotic Normality

The Kaplan-Meier estimator is asymptotically normal: $\sqrt{n} (\hat{S} (t) - S (t)) \xrightarrow{D} N (0, S (t)^{2} \int_{0}^{t} \frac{d F_{u} (x)}{(1 - H (x))^{2}})$ where $F_{u} (t) = P (T \leq t, δ = 1) = \int_{0}^{t} (1 - G (x)) d F (x)$ , $C \sim G$ , and $Y \sim H$ .

The variance of the estimator is estimated by Greenwood’s formula $σ_{S}^{2} (t) \approx \hat{S} (t)^{2} \sum_{j : τ_{j} \leq t} \frac{d_{j}}{n_{j} (n_{j} - d_{j})}$ . For the no ties case, the formula is $σ_{S}^{2} (t) \approx \hat{S} (t)^{2} \sum_{j : y_{j} \leq t} \frac{δ_{j}}{(n - j) (n - j + 1)}$
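
Greenwood’s formula only needs the pairs $(n_{j}, d_{j})$ at the failure times up to $t$ . A minimal sketch follows; the function name and the input pairs are hypothetical, and the formula is undefined when $n_{j} = d_{j}$ (the estimate has dropped to zero).

```python
import math

def greenwood(n_d_pairs):
    """Return (S_hat, Var_hat) from pairs (n_j, d_j) with tau_j <= t.

    Assumes d_j < n_j for every pair, otherwise the sum is undefined.
    """
    s, acc = 1.0, 0.0
    for n_j, d_j in n_d_pairs:
        s *= 1 - d_j / n_j                 # Kaplan-Meier product
        acc += d_j / (n_j * (n_j - d_j))   # Greenwood sum
    return s, s * s * acc

# Hypothetical risk-set sizes and death counts at three failure times
s_hat, var_hat = greenwood([(6, 1), (5, 1), (3, 1)])
se_hat = math.sqrt(var_hat)
print(s_hat, var_hat, se_hat)
```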

Examples

$n = 8$ case, where $\times$ marks an observation with $δ_{i} = 1$ and $\circ$ marks an observation with $δ_{i} = 0$

Facts

The Kaplan-Meier estimator is self-consistent and asymptotically normal, and it is the generalized MLE.

If there is no censoring, the Kaplan-Meier estimator reduces to the Empirical Survival Function.


Hazard Function Estimators

Nelson-Aalen Estimator
Definition

Suppose that $T_{1}, T_{2}, \dots, T_{n}$ are i.i.d. random variables representing survival times with CDF $F$ , and $C_{1}, C_{2}, \dots, C_{n}$ are the censoring times. Let $δ_{i} = \begin{cases} 1 & T_{i} \leq C_{i} \text{ (uncensored)} \\ 0 & T_{i} > C_{i} \text{ (censored)} \end{cases}$ be the censoring indicator. In a Random Censoring setting, we only observe $(Y_{i}, δ_{i}), i = 1, \dots, n$ where $Y_{i} = \min (T_{i}, C_{i})$ .

The Nelson-Aalen estimator estimates the cumulative hazard function as $\hat{Λ} (t) = \sum_{i : y_{i} \leq t} \frac{δ_{i}}{n - i + 1}$ where $y_{1} < y_{2} < \dots < y_{n}$ are the ordered observations (no ties). It is derived from the Kaplan-Meier Estimator.

Peterson Estimator
Definition

Suppose that $T_{1}, T_{2}, \dots, T_{n}$ are i.i.d. random variables representing survival times with CDF $F$ , and $C_{1}, C_{2}, \dots, C_{n}$ are the censoring times. Let $δ_{i} = \begin{cases} 1 & T_{i} \leq C_{i} \text{ (uncensored)} \\ 0 & T_{i} > C_{i} \text{ (censored)} \end{cases}$ be the censoring indicator. In a Random Censoring setting, we only observe $(Y_{i}, δ_{i}), i = 1, \dots, n$ where $Y_{i} = \min (T_{i}, C_{i})$ .

The Peterson estimator estimates the cumulative hazard function as $\hat{Λ} (t) = \sum_{i : y_{i} \leq t} - \ln (1 - \frac{δ_{i}}{n - i + 1})$ where $y_{1} < y_{2} < \dots < y_{n}$ are the ordered observations (no ties), i.e. $\hat{Λ} (t) = - \ln \hat{S} (t)$ with $\hat{S}$ the Kaplan-Meier Estimator.

Robust Estimators

Estimators for Survival Function
Definition

Mean of Survival Time

Without Censoring

$\hat{θ} = \int_{0}^{\infty} x d F_{n} (x) = \bar{x}$ where $F_{n}$ is the empirical CDF

With Censoring

$\hat{θ} = \int_{0}^{\infty} x d \hat{F} (x) = \int_{0}^{\infty} \hat{S} (t) d t = \sum_{i=1}^{n} S_{i} Y_{i}$ where $\hat{F}$ is calculated by the Kaplan-Meier Estimator and $S_{i}$ is the jump size at $Y_{i}$ .

The asymptotic variance of the parameter is obtained by $AVar (\hat{θ}) = \frac{1}{n} \int_{0}^{\infty} \frac{1}{(1 - H (s))^{2}} (\int_{s}^{\infty} S (t) d t)^{2} d F_{u} (s)$ and the asymptotic variance is estimated by $\hat{AVar} (\hat{θ}) = \sum_{i=1}^{n} (\int_{y_{i}}^{\infty} \hat{S} (t) d t)^{2} \frac{δ_{i}}{(n - i) (n - i + 1)}$ where $(1 - H (y_{i}))^{2}$ is estimated by $\frac{(n - i) (n - i + 1)}{n^{2}}$ and $d \hat{F}_{u} (y_{i}) = \frac{δ_{i}}{n}$

Median of Survival Time

A reasonable estimator for $θ$ is $\hat{θ} = \hat{S}^{- 1} (0.5)$ where $\hat{S}$ is the Kaplan-Meier Estimator

If $\hat{S}^{- 1} (0.5)$ does not have a unique solution, then $\hat{θ}$ is defined as the midpoint of the interval of solutions.

However, the estimator based on the step function $\hat{S} (t)$ tends to over-estimate the true parameter, so a linearly smoothed version $\hat{\hat{S}} (t)$ of the estimator is used to estimate $θ$ : $\hat{\hat{θ}} = \hat{\hat{S}}^{- 1} (0.5)$

The asymptotic variance of the estimated parameter is obtained by $AVar (\hat{θ}) = \frac{AVar (\hat{S} (θ))}{f^{2} (θ)}$ where $AVar (\hat{S} (θ))$ is estimated by Greenwood’s formula, and $f$ may be estimated by Kernel Estimation.

Bayes Estimator for Survival Function
Definition

The Bayes estimator of $S (t)$ with the Squared Error Loss and a Dirichlet process prior $P_{α}$ is given by $\hat{S}_{α} (t) = \frac{α (t, \infty) + N_{y} (t)}{α (0, \infty) + n} \prod_{i : y_{i} \leq t} [\frac{α [y_{i}, \infty) + (n - i + 1)}{α [y_{i}, \infty) + (n - i)}]^{1 - δ_{i}}$ where the squared loss is $L (\hat{S}, S) = \int_{0}^{\infty} (\hat{S} (t) - S (t))^{2} d w (t)$ , $w$ is any non-negative non-decreasing function, the parameter $α$ is a finite non-negative measure on $(0, \infty)$ , and $N_{y} (t) = \sum_{i=1}^{n} I (Y_{i} > t) = \# \{i : Y_{i} > t\}$ .

Examples

If $\frac{α (t, \infty)}{α (0, \infty)} = \exp [- λ_{0} t]$ , then, in many cases, $MLE [\hat{S}_{α} (t)] \leq MLE [\hat{S} (t)]$

Nonparametric Density Estimation

Kernel Estimation for Survival Analysis
Definition

The Kaplan-Meier estimator is a step function, so it is difficult to calculate its quantile function and Density Function. Kernel Density Estimation is used to turn it into a smooth function.

Let $T_{1}, T_{2}, \dots, T_{n}$ be i.i.d. survival times with a distribution $F$ , and $C_{1}, C_{2}, \dots, C_{n}$ be i.i.d. censoring times with a distribution $G$ . We can observe $(Y_{i}, δ_{i}), i = 1, 2, \dots, n$ where $Y_{i} = \min (T_{i}, C_{i})$ and $δ_{i} = I (T_{i} \leq C_{i})$ is the censoring indicator.

Without Censoring

For the complete data, the Kernel Density Estimation is defined as $\hat{f} (t) = \frac{1}{n} \sum_{i=1}^{n} K_{λ} (t - Y_{i}) = \frac{1}{n λ} \sum_{i=1}^{n} K (\frac{t - Y_{i}}{λ})$ where the kernel $K$ satisfies $\int_{- \infty}^{\infty} K (t) d t = 1$ and $\int_{- \infty}^{\infty} K^{2} (t) d t < \infty$ , $K_{λ} (x) := \frac{1}{λ} K (\frac{x}{λ})$ is the scaled kernel, and $λ$ is a smoothing parameter.

The kernel estimator for the Distribution Function is defined as $\hat{F} (t) = \int_{- \infty}^{t} \hat{f} (u) d u = \frac{1}{n} \sum_{i=1}^{n} W (\frac{t - Y_{i}}{λ})$ where $W (t) = \int_{- \infty}^{t} K (u) d u$

With Censoring

For the censored data, the weight of each observation is defined as its jump size in the Kaplan-Meier Estimator: $\hat{f} (t) = \sum_{i=1}^{n} S_{i} K_{λ} (t - Y_{i})$ and $\hat{F} (t) = \sum_{i=1}^{n} S_{i} W (\frac{t - Y_{i}}{λ})$ where $S_{i}$ is the jump size at $Y_{i}$ in the Kaplan-Meier Estimator.

Thus, the kernel estimator of the Survival Function is $1 - \sum_{i=1}^{n} S_{i} W (\frac{t - Y_{i}}{λ})$

Nonparametric Methods: Two Samples

Gehan Test
Definition

For the first sample, let $T_{1}, T_{2}, \dots, T_{m}$ be i.i.d. survival times with a distribution $F_{1}$ , and $C_{1}, C_{2}, \dots, C_{m}$ be i.i.d. censoring times with a distribution $G_{1}$ . We can observe $(X_{i}, δ_{i}), i = 1, 2, \dots, m$ where $X_{i} = \min (T_{i}, C_{i})$ and $δ_{i} = I (T_{i} \leq C_{i})$ is the censoring indicator.

For the second sample, let $U_{1}, U_{2}, \dots, U_{n}$ be i.i.d. survival times with a distribution $F_{2}$ , and $D_{1}, D_{2}, \dots, D_{n}$ be i.i.d. censoring times with a distribution $G_{2}$ . We can observe $(Y_{j}, ϵ_{j}), j = 1, 2, \dots, n$ where $Y_{j} = \min (U_{j}, D_{j})$ and $ϵ_{j} = I (U_{j} \leq D_{j})$ is the censoring indicator.

Gehan’s test is an extension of the Wilcoxon rank-sum (Mann-Whitney) test to censored data. The test statistic of Gehan’s test is defined as $U = \sum_{i=1}^{m} \sum_{j=1}^{n} U_{ij}$ where $U_{ij} = \begin{cases} 1 & t_{i} > u_{j} : (x_{i} > y_{j}, ϵ_{j} = 1) \text{ or } (x_{i} = y_{j}, δ_{i} = 0, ϵ_{j} = 1) \\ 0 & \text{otherwise} \\ - 1 & t_{i} < u_{j} : (x_{i} < y_{j}, δ_{i} = 1) \text{ or } (x_{i} = y_{j}, δ_{i} = 1, ϵ_{j} = 0) \end{cases}$ .

Under the null hypothesis $H_{0} : F_{1} = F_{2}, G_{1} = G_{2}$ , let $(z_{k}, ζ_{k}), k = 1, 2, \dots, n + m$ be the combined samples and define $U_{k l}$ , $U_{k}^{*}$ , and $U$ as $U_{k l} = \begin{cases} 1 & (z_{k} > z_{l}, ζ_{l} = 1) \text{ or } (z_{k} = z_{l}, ζ_{k} = 0, ζ_{l} = 1) \\ 0 & \text{otherwise} \\ - 1 & (z_{k} < z_{l}, ζ_{k} = 1) \text{ or } (z_{k} = z_{l}, ζ_{k} = 1, ζ_{l} = 0) \end{cases}$ , $U_{k}^{*} = \sum_{l = 1, l \neq k}^{m + n} U_{k l}$ , and $U = \sum_{k=1}^{m + n} U_{k}^{*} I (k \in I_{1})$ where $I_{1}$ is the index set of sample 1.

Then, the variance of $U$ is calculated as $Var (U) = \frac{m n}{(m + n) (m + n - 1)} \sum_{k=1}^{m + n} (U_{k}^{*})^{2}$ and $\frac{U}{\sqrt{Var (U)}} \xrightarrow{D} N (0, 1)$
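
The pooled-sample form of the statistic can be sketched as follows. Observations are represented as `(time, indicator)` pairs with indicator 1 for an uncensored time; this representation, the function names, and the data are assumptions made for the example, and the statistic is standardized by the square root of the variance.

```python
import math

def gehan_score(zk, ck, zl, cl):
    """U_kl for two pooled observations; indicator 1 means uncensored."""
    if (zk > zl and cl == 1) or (zk == zl and ck == 0 and cl == 1):
        return 1    # observation k definitely outlives observation l
    if (zk < zl and ck == 1) or (zk == zl and ck == 1 and cl == 0):
        return -1   # observation k definitely fails before observation l
    return 0        # ordering cannot be determined

def gehan_test(sample1, sample2):
    pooled = sample1 + sample2
    m, n = len(sample1), len(sample2)
    total = m + n
    # U_k^* = sum over l != k of U_kl
    u_star = [
        sum(gehan_score(*pooled[k], *pooled[l]) for l in range(total) if l != k)
        for k in range(total)
    ]
    u = sum(u_star[:m])  # sum of U_k^* over sample 1
    var_u = m * n / (total * (total - 1)) * sum(v * v for v in u_star)
    return u, var_u, u / math.sqrt(var_u)

# Hypothetical samples
sample1 = [(5, 1), (7, 0), (9, 1)]
sample2 = [(2, 1), (3, 1), (4, 0)]
u, var_u, z = gehan_test(sample1, sample2)
print(u, var_u, z)
```

Because $U_{k l} = - U_{l k}$ , comparisons within sample 1 cancel in the sum, so the pooled form gives the same $U$ as the double sum over pairs from the two samples.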

Hypothesis Test for a 2 by 2 Contingency Table
Definition

The hypothesis test for a $2 \times 2$ contingency table is used to determine whether there is a significant association between two categorical variables.

Consider a $2 \times 2$ Contingency Table

	Dead	Alive	Total
Population 1	$a$	$b$	$n_{1}$
Population 2	$c$	$d$	$n_{2}$
Total	$m_{1}$	$m_{2}$	$n$
i.e. $a \sim B (n_{1}, p_{1})$ and $c \sim B (n_{2}, p_{2})$

We want to test $H_{0} : p_{1} = p_{2}$ . Let $\hat{p}_{1} = \frac{a}{n_{1}}$ , $\hat{p}_{2} = \frac{c}{n_{2}}$ , $\hat{p} = \frac{m_{1}}{n}$ , and $B (r, n, p) = \binom{n}{r} p^{r} (1 - p)^{n - r}$

Uncorrected Chi-squared Test

The test statistic is defined as $T = \frac{n (a d - b c)^{2}}{n_{1} n_{2} m_{1} m_{2}}$ , which approximately follows $χ^{2} (1)$ under $H_{0}$ .

We reject $H_{0}$ if $T \geq χ_{α}^{2} (1)$

Yates’ Corrected Chi-squared Test

The test statistic is defined as $T = \frac{n ( ∣ a d - b c ∣ - \frac{n}{2} ) ^{2}}{n _{1} n _{2} m _{1} m _{2}} \sim χ^{2} (1)$

We reject $H_{0}$ if $T \geq χ_{α}^{2} (1)$
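
The chi-squared statistic with and without Yates’ correction can be compared on a small table. In the sketch below the counts are invented, the function name is illustrative, and clamping the corrected difference at zero is an added safeguard that is not part of the formula above.

```python
def chi2_2x2(a, b, c, d, yates=False):
    """Chi-squared statistic for a 2x2 table, optionally Yates-corrected."""
    n1, n2 = a + b, c + d          # row totals
    m1, m2 = a + c, b + d          # column totals
    n = n1 + n2
    diff = abs(a * d - b * c)
    if yates:
        # Continuity correction; clamped at 0 (assumption, avoids inflation)
        diff = max(diff - n / 2, 0.0)
    return n * diff ** 2 / (n1 * n2 * m1 * m2)

CRIT_05 = 3.841  # upper 5% point of the chi-squared distribution with 1 df

# Hypothetical counts: 15/20 deaths in population 1, 8/20 in population 2
t_plain = chi2_2x2(15, 5, 8, 12)
t_yates = chi2_2x2(15, 5, 8, 12, yates=True)
print(t_plain, t_plain >= CRIT_05)
print(t_yates, t_yates >= CRIT_05)
```

For these counts the uncorrected statistic exceeds the 5% critical value while the corrected one does not, which shows how the correction makes the test more conservative.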

Fisher’s Exact Test

The test statistic is defined as $T = \sum_{j = x_{1}}^{x_{2}} \frac{\binom{n_{1}}{j} \binom{n_{2}}{m_{1} - j}}{\binom{n}{m_{1}}}$ where $x_{1} = \max (0, m_{1} - n_{2})$ , $x_{2} = \min (n_{1}, m_{1})$

We reject $H_{0}$ if $T \leq α$

Corrected Chi-squared Test

The test statistic is defined as $T = \frac{n ( ∣ a d - b c ∣ - \frac{n}{4} ) ^{2}}{n _{1} n _{2} m _{1} m _{2}} \sim χ^{2} (1)$

We reject $H_{0}$ if $T \geq χ_{α}^{2} (1)$

Liddell’s Exact Test

The test statistic is defined as $T = \sum_{x = 0}^{n_{1}} \sum_{y = 0}^{n_{2}} B (x, n_{1}, \hat{p}) B (y, n_{2}, \hat{p}) I (\frac{x}{n_{1}} - \frac{y}{n_{2}} \geq ∣ a - c ∣)$

We reject $H_{0}$ if $T \leq α$

Exact Unconditional Test

The test statistic is defined as $T = \sup_{0 \leq p \leq 1} \sum_{x = 0}^{n_{1}} \sum_{y = 0}^{n_{2}} B (x, n_{1}, p) B (y, n_{2}, p) I (∣ z (x, y) ∣ \geq ∣ z (a, c) ∣)$ where $z (x, y) = \frac{\frac{x}{n_{1}} - \frac{y}{n_{2}}}{\sqrt{\frac{1}{n_{1}} (\frac{x}{n_{1}}) (1 - \frac{x}{n_{1}}) + \frac{1}{n_{2}} (\frac{y}{n_{2}}) (1 - \frac{y}{n_{2}})}}$

We reject $H_{0}$ if $T \leq α$

Approximate Unconditional Test

The test statistic is defined as $T = \sum_{x = 0}^{n_{1}} \sum_{y = 0}^{n_{2}} B (x, n_{1}, \hat{p}) B (y, n_{2}, \hat{p}) I (∣ z^{*} (x, y) ∣ \geq ∣ z^{*} (a, c) ∣)$ where $z^{*} (x, y) = \frac{\frac{x}{n_{1}} - \frac{y}{n_{2}}}{\sqrt{(\frac{1}{n_{1}} + \frac{1}{n_{2}}) \hat{p} (1 - \hat{p})}}$

We reject $H_{0}$ if $T \leq α$


Mantel-Haenszel Test
Definition

Consider a sequence of $2 \times 2$ contingency tables

Dead Alive
Treatment 1 $a_{i}$ $b_{i}$ $n_{i 1}$
Treatment 2 $c_{i}$ $d_{i}$ $n_{i 2}$
$m_{i 1}$ $m_{i 2}$ $n_{i}$
where $i = 1, 2, \dots, k$ is the indicator of hospital

Under the null hypothesis $H_{0} : p_{11} = p_{12}, p_{21} = p_{22}, \dots, p_{k 1} = p_{k 2}$ , where $p_{i 1}$ and $p_{i 2}$ are the death probabilities under treatments 1 and 2 in hospital $i$ (estimated by $\frac{a_{i}}{n_{i 1}}$ and $\frac{c_{i}}{n_{i 2}}$ ), the test statistic for the Mantel-Haenszel test is defined as $M H = \frac{\sum_{i = 1}^{k} (a_{i} - E (a_{i}))}{\sqrt{\sum_{i = 1}^{k} Var (a_{i})}} \xrightarrow{D} N (0, 1)$ where $E (a_{i}) = \frac{n_{i 1} m_{i 1}}{n_{i}}$ and $Var (a_{i}) = \frac{n_{i 1} n_{i 2} m_{i 1} m_{i 2}}{n_{i}^{2} (n_{i} - 1)}$ .

Log-Rank Test
Definition

The log-rank test is a test to compare the survival functions of two samples. The test uses the sequence of Mantel-Haenszel statistics of the $2 \times 2$ tables at each uncensored event time.

Examples

$z$	$n$	$m_{1}$	$n_{1}$	$a$	$E (a)$	$a - E (a)$	$Var (a)$
$3$	$10$	$1$	$5$	$1$	$0.50$	$0.50$	$0.250$
$5$	$9$	$1$	$4$	$1$	$0.44$	$0.56$	$0.247$
$7$	$8$	$1$	$3$	$1$	$0.38$	$0.62$	$0.234$
$12$	$6$	$1$	$1$	$0$	$0.17$	$- 0.17$	$0.139$
$18$	$5$	$1$	$1$	$1$	$0.20$	$0.80$	$0.160$
$19$	$4$	$1$	$0$	$0$	$0$	$0$	$0$
$20$	$3$	$1$	$0$	$0$	$0$	$0$	$0$
Here $z$ is the uncensored event time, $n$ the number at risk, $m_{1}$ the number of deaths, $n_{1}$ the number at risk in sample 1, and $a$ the number of deaths in sample 1.
The Mantel-Haenszel statistic of the data is calculated as
$M H = \frac{\sum_{i = 1}^{k} [a_{i} - E (a_{i})]}{\sqrt{\sum_{i = 1}^{k} Var (a_{i})}} \approx \frac{2.31}{\sqrt{1.03}} \approx 2.28$
and the one-sided p-value is approximately $0.011$
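
The table arithmetic can be checked with a short script. This is a minimal sketch, assuming each row is a $2 \times 2$ table with hypergeometric mean and variance and that the statistic is standardized by the square root of the summed variances; the helper name `table_moments` is made up for illustration.

```python
import math

# Rows of the example: (event time z, at risk n, deaths m1,
#                       at risk in sample 1 n1, deaths in sample 1 a)
rows = [
    (3, 10, 1, 5, 1),
    (5, 9, 1, 4, 1),
    (7, 8, 1, 3, 1),
    (12, 6, 1, 1, 0),
    (18, 5, 1, 1, 1),
    (19, 4, 1, 0, 0),
    (20, 3, 1, 0, 0),
]


def table_moments(n, m1, n1):
    """Hypergeometric mean and variance of a for one 2x2 table (hypothetical helper)."""
    n2, m2 = n - n1, n - m1
    mean = n1 * m1 / n
    var = n1 * n2 * m1 * m2 / (n ** 2 * (n - 1))
    return mean, var


num = 0.0      # sum of a - E(a)
var_sum = 0.0  # sum of Var(a)
for _, n, m1, n1, a in rows:
    mean, var = table_moments(n, m1, n1)
    num += a - mean
    var_sum += var

mh = num / math.sqrt(var_sum)
# One-sided p-value from the upper tail of the standard normal distribution
p_one_sided = 0.5 * math.erfc(mh / math.sqrt(2))
print(round(num, 2), round(var_sum, 2), round(mh, 2), round(p_one_sided, 3))
```

Rows with $n_{1} = 0$ contribute nothing to either sum, which is why the last two event times do not affect the statistic.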

Tarone-Ware Test
Definition

The Tarone-Ware test is a generalization of the Mantel-Haenszel Test. $T = \frac{\sum_{i = 1}^{k} w_{i} (a_{i} - E (a_{i}))}{\sqrt{\sum_{i = 1}^{k} w_{i}^{2} Var (a_{i})}} \xrightarrow{D} N (0, 1)$ where $w_{i}$ is the weight for the $i$ -th table.

If $w_{i} = 1$ , it is the MH statistic; if $w_{i} = n_{i}$ , it is the Gehan statistic; and if $w_{i} = \sqrt{n_{i}}$ , it is the Tarone-Ware statistic.
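
To see how the weights change the result, the sketch below evaluates the three weight choices on the small log-rank example table. It assumes the weighted statistic is standardized by $\sqrt{\sum w_{i}^{2} Var (a_{i})}$ ; the function name `weighted_statistic` is illustrative only.

```python
import math

# Per event time: (at risk n, deaths m1, at risk in sample 1 n1, deaths in sample 1 a)
tables = [(10, 1, 5, 1), (9, 1, 4, 1), (8, 1, 3, 1), (6, 1, 1, 0),
          (5, 1, 1, 1), (4, 1, 0, 0), (3, 1, 0, 0)]


def weighted_statistic(tables, weight):
    """Weighted sum of a - E(a), divided by the square root of its variance."""
    num = 0.0
    var = 0.0
    for n, m1, n1, a in tables:
        w = weight(n)
        mean = n1 * m1 / n
        v = n1 * (n - n1) * m1 * (n - m1) / (n ** 2 * (n - 1))
        num += w * (a - mean)
        var += w ** 2 * v
    return num / math.sqrt(var)


mantel_haenszel = weighted_statistic(tables, lambda n: 1.0)
gehan = weighted_statistic(tables, lambda n: float(n))
tarone_ware = weighted_statistic(tables, lambda n: math.sqrt(n))
print(mantel_haenszel, gehan, tarone_ware)
```

Weights proportional to $n_{i}$ emphasize early event times, when many subjects are still at risk, while $\sqrt{n_{i}}$ sits between that choice and equal weights.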

Nonparametric Methods: K Samples

Generalized Gehan Test
Definition

For the $i$ -th sample, where $i = 1, 2, \dots, K$ , let $T_{i 1}, T_{i 2}, \dots, T_{i n_{i}}$ be i.i.d. survival times with distribution $F_{i}$ , and $C_{i 1}, C_{i 2}, \dots, C_{i n_{i}}$ be i.i.d. censoring times with distribution $G_{i}$ . We observe $(X_{ij}, δ_{ij}), j = 1, 2, \dots, n_{i}$ where $X_{ij} = \min (T_{ij}, C_{ij})$ and $δ_{ij} = I (T_{ij} \leq C_{ij})$ is the censoring indicator.

The generalized Gehan test is an extension of the Gehan Test to more than two samples. Under the null hypothesis $H_{0} : F_{1} = F_{2} = \dots = F_{K}$ , the test statistic of the generalized Gehan test is defined as $W_{i} = \sum_{i' = 1, i' \neq i}^{K} \sum_{j = 1}^{n_{i}} \sum_{j' = 1}^{n_{i'}} U ((X_{ij}, δ_{ij}), (X_{i' j'}, δ_{i' j'}))$ where $U$ is the statistic of the Gehan Test.

Then, $W = (W_{1}, W_{2}, \dots, W_{K}) \to N_{K} (0, N^{3} Σ^{*})$ where $N = \sum_{i = 1}^{K} n_{i}$ , $\hat{Σ}^{*} = \frac{1}{N (N - 1)} \sum_{i = 1}^{K} \sum_{j = 1}^{n_{i}} (W_{ij}^{*})^{2} \begin{pmatrix} n_{1} (N - n_{1}) & & - n_{i} n_{j} \\ & \ddots & \\ - n_{i} n_{j} & & n_{K} (N - n_{K}) \end{pmatrix}$ , and $W_{ij}^{*} = \sum_{i' = 1, i' \neq i}^{K} \sum_{j' = 1}^{n_{i'}} U ((X_{ij}, δ_{ij}), (X_{i' j'}, δ_{i' j'}))$

Generalized Mantel-Haenszel Test
Definition

For the $i$ -th sample, where $i = 1, 2, \dots, K$ , let $T_{i 1}, T_{i 2}, \dots, T_{i n_{i}}$ be i.i.d. survival times with distribution $F_{i}$ , and $C_{i 1}, C_{i 2}, \dots, C_{i n_{i}}$ be i.i.d. censoring times with distribution $G_{i}$ . We observe $(X_{ij}, δ_{ij}), j = 1, 2, \dots, n_{i}$ where $X_{ij} = \min (T_{ij}, C_{ij})$ and $δ_{ij} = I (T_{ij} \leq C_{ij})$ is the censoring indicator.

The generalized Mantel-Haenszel test is an extension of the Mantel-Haenszel Test to more than two samples.

Let $(z_{i}, ζ_{i}), i = 1, 2, \dots, N$ be the combined sample.

For each uncensored time point, construct a $2 \times K$ table.

	$1$	$2$	$\dots$	$K$
Dead	$a_{i 1}$	$a_{i 2}$	$\dots$	$a_{i K}$	$m_{i 1}$
Alive			$\dots$		$m_{i 2}$
	$n_{i 1}$	$n_{i 2}$	$\dots$	$n_{i K}$	$N_{i}$

Under the null hypothesis $H_{0} : F_{1} = F_{2} = \dots = F_{K}$ , the test statistic of the generalized Mantel-Haenszel test is defined as $M H = (a_{(- 1)} - E (a_{(- 1)}))^{⊺} Σ_{(- 1)}^{- 1} (a_{(- 1)} - E (a_{(- 1)})) \xrightarrow{D} χ^{2} (K - 1)$ where $a_{(- 1)}$ and $Σ_{(- 1)}$ are the vector of length $K - 1$ and the $(K - 1) \times (K - 1)$ matrix obtained after deleting the first population (corner-point constraint).

The vector $a - E (a)$ and the matrix $Σ$ are defined as $a - E (a) = \sum_{i} (a_{i} - E (a_{i}))$ where the sum runs over the uncensored time points and $E (a_{i}) = (\frac{m_{i 1} n_{i 1}}{N_{i}}, \dots, \frac{m_{i 1} n_{i K}}{N_{i}})$ , and $Σ = \sum_{i} Cov (a_{i})$ where $Cov (a_{i}) = \frac{m_{i 1} m_{i 2}}{N_{i} - 1} \begin{pmatrix} \frac{n_{i 1}}{N_{i}} (1 - \frac{n_{i 1}}{N_{i}}) & & - \frac{n_{ik} n_{i l}}{N_{i}^{2}} \\ & \ddots & \\ - \frac{n_{ik} n_{i l}}{N_{i}^{2}} & & \frac{n_{i K}}{N_{i}} (1 - \frac{n_{i K}}{N_{i}}) \end{pmatrix}$

Nonparametric Methods: Regression

Cox Proportional Hazards Model

Cox Proportional Hazards Model
Definition

The Cox proportional hazards model assumes that covariates act multiplicatively on the Hazard Function.

Let $T_{1}, T_{2}, \dots, T_{n}$ be i.i.d. survival times, and $C_{1}, C_{2}, \dots, C_{n}$ be i.i.d. censoring times. We observe $(Y_{i}, δ_{i}), i = 1, 2, \dots, n$ , where $Y_{i} = \min (T_{i}, C_{i})$ and $δ_{i} = I (T_{i} \leq C_{i})$ is the censoring indicator, together with covariates $x_{i} = (x_{i 1}, \dots, x_{i p})$ . Then, the Cox proportional hazards model is defined as $λ (t : x) = λ_{0} (t) \exp (x^{⊺} β)$ where $λ_{0}$ is called the baseline hazard function, i.e. the hazard at $x = 0$

Conditional Likelihood

Let $Y_{1} < Y_{2} < \dots < Y_{n}$ (no ties), and let $R_{i}$ be the risk set at $Y_{i}$ . For each uncensored time $Y_{i}$ , $P \{ \text{a death in } [y_{i}, y_{i} + Δ) ∣ R_{i} \} \approx \sum_{j \in R_{i}} λ_{0} (y_{i}) \exp (x_{j}^{⊺} β) Δ$ Therefore, $P \{ \text{death of } i \text{ at time } y_{i} ∣ \text{one death in } R_{i} \text{ at time } y_{i} \} \approx \frac{λ_{0} (y_{i}) \exp (x_{i}^{⊺} β) Δ}{\sum_{j \in R_{i}} λ_{0} (y_{i}) \exp (x_{j}^{⊺} β) Δ} = \frac{\exp (x_{i}^{⊺} β)}{\sum_{j \in R_{i}} \exp (x_{j}^{⊺} β)}$ Taking the product of these conditional probabilities gives a conditional likelihood $L_{C} (β) = \prod_{i \in I_{u}} \frac{\exp (x_{i}^{⊺} β)}{\sum_{j \in R_{i}} \exp (x_{j}^{⊺} β)} = \prod_{i = 1}^{n} [\frac{\exp (x_{i}^{⊺} β)}{\sum_{j \in R_{i}} \exp (x_{j}^{⊺} β)}]^{δ_{i}}$ where $I_{u}$ is the index set of the uncensored observations.

$L_{C} (β)$ is not a likelihood in the usual sense. However, Cox suggested treating the conditional likelihood as an ordinary likelihood and maximizing it, as in Maximum Likelihood Estimation.

Since there is no analytic solution for the maximizer, an iterative method such as the Newton–Raphson method is used to estimate the coefficient $β$ .

The hazard ratio $\exp (\hat{β}_{j})$ is the multiplicative change in the Hazard Rate for a one-unit increase in the covariate $x_{j}$ , with the other covariates held fixed.
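
The estimation step can be sketched as follows for a single covariate with no ties. This is a minimal illustration and not a production fitter: the data are made up, and the function names `score_and_information` and `fit_cox` are hypothetical.

```python
import math


def score_and_information(beta, times, events, x):
    """First derivative and negative second derivative of ln L_C at beta."""
    score = 0.0
    info = 0.0
    for i, t_i in enumerate(times):
        if not events[i]:
            continue  # censored observations enter only through the risk sets
        risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
        weights = [math.exp(beta * x[j]) for j in risk]
        s0 = sum(weights)
        s1 = sum(w * x[j] for w, j in zip(weights, risk))
        s2 = sum(w * x[j] ** 2 for w, j in zip(weights, risk))
        score += x[i] - s1 / s0
        info += s2 / s0 - (s1 / s0) ** 2
    return score, info


def fit_cox(times, events, x, max_iter=50, tol=1e-12):
    """Newton-Raphson iterations starting from beta = 0."""
    beta = 0.0
    for _ in range(max_iter):
        score, info = score_and_information(beta, times, events, x)
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta


# Made-up data: observed times, event indicators (1 = death), binary covariate
times = [2, 3, 5, 7, 8, 11, 13, 16]
events = [1, 1, 0, 1, 1, 1, 0, 1]
x = [1, 0, 1, 1, 0, 0, 1, 0]

beta_hat = fit_cox(times, events, x)
hazard_ratio = math.exp(beta_hat)
print(beta_hat, hazard_ratio)
```

At the returned value the score is numerically zero, and the reciprocal of the information gives an estimate of the variance of $\hat{β}$ .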

Goodness-of-Fit Test

For testing the null hypothesis $H_{0} : β = 0$ , Cox suggested the Rao Test. $(\frac{\partial \ln L_{C} (0)}{\partial β})^{⊺} (- \frac{\partial^{2} \ln L_{C} (0)}{\partial β^{2}})^{- 1} (\frac{\partial \ln L_{C} (0)}{\partial β}) \overset{a}{\sim} χ^{2} (p)$

Asymptotic Normality of MLE

$\hat{β} \xrightarrow{D} N_{p} (β, I^{- 1} (β))$ where $I (β)$ is the observed Fisher Information

Estimation of Survival Function

Under the Cox proportional hazards model, $S (t; x) = S_{0} (t)^{\exp (x^{⊺} β)}$ To estimate $S (t; x)$ , we can substitute $\hat{β}$ for $β$ , but we still need to estimate $S_{0} (t)$ , $λ_{0} (t)$ , or $Λ_{0} (t)$ .

Breslow suggested the following estimators of $λ_{0} (t)$ and $S_{0} (t)$ : $\hat{λ}_{0, B} (t) = \frac{1}{(y_{u_{i}} - y_{u_{i - 1}}) \sum_{j \in R_{i}} \exp (x_{j}^{⊺} \hat{β})}$ if $Y_{u_{i - 1}} < t < Y_{u_{i}}$

$\hat{S}_{0, B} (t) = \prod_{i : y_{i} \leq t} (1 - \frac{δ_{i}}{\sum_{j \in R_{i}} \exp (x_{j}^{⊺} \hat{β})})$ If $\hat{β} = 0$ , then $\hat{S}_{0, B}$ is the Kaplan-Meier Estimator.

It has a few drawbacks:

$\hat{S}_{0, B} (t) \neq \exp (- \hat{Λ}_{0} (t))$

$\hat{S}_{0, B} (t)$ can take negative values.

Tsiatis suggested a non-negative version of $\hat{S}_{0, B} (t)$ : $\hat{S}_{0, T} (t) = \exp (- \hat{Λ}_{0, T} (t))$ where $\hat{Λ}_{0, T} (t) = \sum_{y_{i} \leq t} \frac{δ_{i}}{\sum_{j \in R_{i}} \exp (x_{j}^{⊺} \hat{β})}$

Link suggested using a linear smooth of $\hat{Λ}_{0, T} (t)$ .
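
The two baseline estimators can be compared numerically. The sketch below assumes a single covariate and no ties, takes the coefficient as given, and uses made-up data; `baseline_survival` is a hypothetical helper name.

```python
import math


def baseline_survival(times, events, x, beta, t):
    """Return (product-form estimate, exponential-form estimate) of S_0(t)."""
    product_form = 1.0
    cumulative_hazard = 0.0
    for i, t_i in enumerate(times):
        if t_i > t or not events[i]:
            continue
        # Sum of exp(x_j * beta) over the risk set at t_i
        denom = sum(math.exp(beta * x[j])
                    for j, t_j in enumerate(times) if t_j >= t_i)
        product_form *= 1.0 - 1.0 / denom
        cumulative_hazard += 1.0 / denom
    return product_form, math.exp(-cumulative_hazard)


# Made-up data: observed times, event indicators (1 = death), binary covariate
times = [2, 3, 5, 7, 8, 11, 13, 16]
events = [1, 1, 0, 1, 1, 1, 0, 1]
x = [1, 0, 1, 1, 0, 0, 1, 0]

# With beta = 0 the product form reduces to the Kaplan-Meier estimate
km_like, exp_form = baseline_survival(times, events, x, beta=0.0, t=8)

# A one-subject example in which the risk-set sum is below 1,
# so the product form becomes negative while the exponential form stays positive
negative_case, positive_case = baseline_survival([1.0], [1], [1.0], beta=-1.0, t=1.0)
print(km_like, exp_form, negative_case, positive_case)
```

The second call illustrates the drawback noted above: whenever the risk-set sum drops below 1, the corresponding factor of the product form is negative.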

Discrete or Grouped Data

When data are discrete or grouped, there are ties at each failure time. Denote the ordered distinct failure times by $Y_{1} < Y_{2} < \dots < Y_{r}$ and let $R_{i}$ be the risk set at $Y_{i}^{-}$ , $D_{i}$ be the death set at $Y_{i}$ , and $d_{i} = \# (D_{i})$ .

Cox suggested combining all possible choices of the $d_{i}$ deaths from the risk set. However, this is computationally infeasible when there are many ties. $L_{C} = \prod_{i = 1}^{r} [\frac{\prod_{j \in D_{i}} ψ_{j}}{\sum_{D_{i}^{*}} \prod_{j \in D_{i}^{*}} ψ_{j}}]$ where $ψ_{j} = \exp (x_{j}^{⊺} β)$ , and $D_{i}^{*}$ ranges over the subsets of $R_{i}$ of size $d_{i}$ .

Peto suggested an alternative likelihood in which, instead of enumerating all possible choices, every death at $Y_{i}$ receives the same risk-set contribution. $L_{C} = \prod_{i = 1}^{r} [\frac{\prod_{j \in D_{i}} ψ_{j}}{(\sum_{j \in R_{i}} ψ_{j})^{d_{i}}}]$

Time Dependent Covariates

In this case, the covariates depend on time. We observe $x_{i} (t)$ and the conditional likelihood is defined as $L_{C} (β) = \prod_{i \in I_{u}} \frac{\exp (x_{i} (y_{i})^{⊺} β)}{\sum_{j \in R_{i}} \exp (x_{j} (y_{i})^{⊺} β)}$ where $I_{u}$ is the index set of the uncensored observations, and $R_{i}$ is the risk set.

Facts

Any two individuals have hazard functions that are constant multiples of one another.

The Survival Functions of the Cox proportional hazards model form a family of Lehmann alternatives. $\forall S \in \mathcal{S}, \exists S_{0} \in \mathcal{S} \text{ s.t. } S = S_{0}^{γ}$ where $γ \in R^{+}$ .

If $p = 1$ , $x_{i} = I (i \in I_{1})$ , where $I_{1}$ is the index set of sample 1, and there are no ties, then the Cox test is exactly equal to the Mantel-Haenszel Test.


Linear Models

Accelerated Life Model
Definition

Let $T_{0}$ be the survival time at $x = 0$ , with density $f_{0} (t)$ , Hazard Function $λ_{0} (t)$ , and Survival Function $S_{0} (t)$ . Assume that the survival time of an individual with covariate $x$ is $T_{x} = T_{0} \exp (x^{⊺} β)$ If $x^{⊺} β < 0$ , then the covariate $x$ accelerates the time to failure. The model based on this assumption is called the accelerated failure time (AFT) model.

Under the AFT model, $S (t : x) = S_{0} (t \exp (- x^{⊺} β))$ and $λ (t : x) = λ_{0} (t \exp (- x^{⊺} β)) \exp (- x^{⊺} β)$
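
The hazard formula follows from $λ (t : x) = - \frac{d}{d t} \ln S (t : x)$ and can be checked numerically. The sketch below assumes a Weibull-type baseline $S_{0} (t) = \exp (- t^{K})$ with an arbitrary shape $K = 1.5$ ; the baseline and all names are illustrative choices, not part of the model definition.

```python
import math

K = 1.5  # hypothetical baseline shape, chosen only for illustration


def s0(t):
    """Baseline survival function S_0(t) = exp(-t^K)."""
    return math.exp(-t ** K)


def h0(t):
    """Baseline hazard of the same distribution."""
    return K * t ** (K - 1)


def aft_survival(t, xb):
    return s0(t * math.exp(-xb))


def aft_hazard(t, xb):
    return h0(t * math.exp(-xb)) * math.exp(-xb)


def numeric_hazard(t, xb, eps=1e-6):
    """Central-difference approximation of -d/dt ln S(t : x)."""
    upper = math.log(aft_survival(t + eps, xb))
    lower = math.log(aft_survival(t - eps, xb))
    return -(upper - lower) / (2 * eps)


xb = -0.5  # a negative linear predictor accelerates failure
print(aft_hazard(2.0, xb), numeric_hazard(2.0, xb))
print(aft_survival(2.0, xb), s0(2.0))
```

With a negative linear predictor the survival probability at a fixed time is lower than the baseline, which matches the "accelerated" interpretation.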

Let $Y = \ln T_{x}$ . Then $Y = \ln T_{0} + x^{⊺} β = α + x^{⊺} β + ϵ$ where $α = E (\ln T_{0})$ and $ϵ = \ln T_{0} - E (\ln T_{0})$ , so that $E (Y) = α + x^{⊺} β$ .

Assume that $ϵ = σW$ , where $W$ is a Random Variable representing the error term. Then, the AFT model becomes $Y \equiv \ln T_{x} = α + x^{⊺} β + σW$

Relationship between $T = T_{x}$ and $W$ : $F_{T} (t) = F_{W} (\frac{\ln t - μ}{σ})$ and $f_{T} (t) = \frac{1}{σ t} f_{W} (\frac{\ln t - μ}{σ})$ where $μ = α + x^{⊺} β$

If $W \sim N (0, 1)$ , then $\ln T \sim N (μ, σ^{2})$ , and if $W \sim Gumbel (0, 1)$ with $σ = 1$ , then $T \sim Exp (λ)$ where $λ = \exp [- (α + x^{⊺} β)]$ .

Miller Estimator
Definition

Let $T_{1}, T_{2}, \dots, T_{n}$ be i.i.d. survival times, and $C_{1}, C_{2}, \dots, C_{n}$ be i.i.d. censoring times. We observe $(Y_{i}, δ_{i}), i = 1, 2, \dots, n$ , where $Y_{i} = \min (T_{i}, C_{i})$ and $δ_{i} = I (T_{i} \leq C_{i})$ is the censoring indicator.

Suppose a Simple Linear Regression model $T_{i} = α + β x_{i} + ϵ_{i}, i = 1, 2, \dots, n$

With no censoring present, the least squares estimators of the parameters $α, β$ are obtained by minimizing $\frac{1}{n} \sum_{i = 1}^{n} ϵ_{i}^{2} = \frac{1}{n} \sum_{i = 1}^{n} (T_{i} - α - β x_{i})^{2} = \int_{- \infty}^{\infty} z^{2} d F_{n} (z)$ where $F_{n} (z) := \frac{1}{n} \sum_{i = 1}^{n} I (z_{i} \leq z)$ is the Empirical Distribution Function of the residuals $z_{1}, z_{2}, \dots, z_{n}$ with $z_{i} = y_{i} - α - β x_{i}$ .

With censoring present, Miller proposed to minimize $\int_{- \infty}^{\infty} z^{2} d \hat{F}_{n} (z) = \sum_{i = 1}^{n} \hat{w}_{i} (β) (Y_{i} - α - β x_{i})^{2}$ where $\hat{F}_{n}$ is the Kaplan-Meier Estimator based on $(z_{i}, δ_{i})$ and the weights $\hat{w}_{i} (β)$ are its jump sizes.

If the last observation is censored, then $\sum_{i = 1}^{n} \hat{w}_{i} (β) < 1$ . Hence, the last observation is treated as uncensored, so that $\sum_{i = 1}^{n} \hat{w}_{i} (β) = 1$ .

Buckley-James Estimator
Definition

Let $T_{1}, T_{2}, \dots, T_{n}$ be i.i.d. survival times, and $C_{1}, C_{2}, \dots, C_{n}$ be i.i.d. censoring times. We observe $(Y_{i}, δ_{i}), i = 1, 2, \dots, n$ , where $Y_{i} = \min (T_{i}, C_{i})$ and $δ_{i} = I (T_{i} \leq C_{i})$ is the censoring indicator.

If we could observe the true survival times $T_{i}$ , we could fit the model $E (T_{i}) = α + β x_{i}$ However, we cannot observe $T_{i}$ , only the censored $Y_{i}$ , and $E (Y_{i}) \neq α + β x_{i}$ Buckley and James proposed an Unbiased Estimator for $α + β x_{i}$ : $Y_{i}^{*} = Y_{i} δ_{i} + E (T_{i} ∣ T_{i} > Y_{i}) (1 - δ_{i})$ Since $Y_{i}^{*}$ cannot be observed either, it is estimated in turn by $\hat{y}_{i}^{*} = y_{i} δ_{i} + [\hat{β} x_{i} + \frac{\sum_{k : \hat{z}_{k} > \hat{z}_{i}} \hat{w}_{k} (\hat{β}) \hat{z}_{k}}{1 - \hat{F} (\hat{z}_{i})}] (1 - δ_{i})$ where $\hat{z}_{i} = y_{i} - \hat{β} x_{i}$ , $\hat{F}$ is the Kaplan-Meier Estimator based on $(\hat{z}_{i}, δ_{i})$ , and the weights $\hat{w}_{i} (\hat{β})$ are its jump sizes.

The variance of the estimator is estimated by $\hat{Var} (\hat{β}) = \frac{\hat{σ}_{u}^{2}}{\sum_{i : δ_{i} = 1} (x_{i} - \bar{x}_{u})^{2}}$ where $\hat{σ}_{u}^{2} = \frac{1}{n_{u} - 2} \sum_{i : δ_{i} = 1} [y_{i} - \bar{y}_{u} - \hat{β} (x_{i} - \bar{x}_{u})]^{2}$ , $n_{u} = \sum_{i = 1}^{n} δ_{i}$ , $\bar{x}_{u} = \frac{1}{n_{u}} \sum_{i : δ_{i} = 1} x_{i}$ , and $\bar{y}_{u} = \frac{1}{n_{u}} \sum_{i : δ_{i} = 1} y_{i}$

Koul-Susarla-Van Ryzin Estimator
Definition

Let $T_{1}, T_{2}, \dots, T_{n}$ be i.i.d. survival times, and $C_{1}, C_{2}, \dots, C_{n}$ be i.i.d. censoring times. We observe $(Y_{i}, δ_{i}), i = 1, 2, \dots, n$ , where $Y_{i} = \min (T_{i}, C_{i})$ and $δ_{i} = I (T_{i} \leq C_{i})$ is the censoring indicator.

Suppose a Simple Linear Regression model $E (T_{i}) = α + β x_{i}$

Koul, Susarla, and Van Ryzin proposed an Unbiased Estimator for $α + β x_{i}$ : $Y_{i}^{*} = \frac{δ_{i} Y_{i}}{1 - G (Y_{i})}$ Since the censoring distribution $G$ is unknown, it must be estimated. The authors suggested using the Kaplan-Meier Estimator of $G$ based on the data $(Y_{i}, 1 - δ_{i}), i = 1, 2, \dots, n$ .

Goodness-of-Fit Tests

Graphical Validity Tests for Survival Model
Definition

If the selected model holds, a suitably transformed plot of the data resembles a straight line; if the model fails, the plot is visibly curved.

There are two types of plots, survival plots $(\hat{S} (t), t)$ and hazard plots $(\hat{Λ} (t), t)$ .

One Sample

Exponential Distribution

$ln S (t) = - λ t$ $Λ (t) = λ t$

Weibull Distribution

$ln Λ (t) = ln λ + α ln t$

Log-Normal Distribution

$Φ^{- 1} (1 - S (t)) = - \frac{μ}{σ} + \frac{1}{σ} ln t$ where $Φ^{- 1}$ is the Probit.

Two to K Samples

For parametric models, repeat one sample methods on each sample.

The validity of the Cox Proportional Hazards Model can be checked by the Lehmann alternatives property of the Survival Function. $ln S_{i} (t) = γ_{ij} ln S_{j} (t)$

Regression

Linear Model

Ordinary residual $e_{i} = y_{i} - \hat{β} x_{i}$ may be used in model checking.

Cox Proportional Hazard Model

$ln \hat{S} (t) \approx - λ t$ where $\hat{S}$ is the Kaplan-Meier Estimator based on $(\hat{Λ}_{i}, δ_{i}), i = 1, 2, \dots, n$ and $\hat{Λ}_{i} = exp (x_{i}^{⊺} \hat{β}) \int_{0}^{Y_{i}} \hat{λ_{0}} (u) d u$ .

Also, the estimated cumulative hazard function and the covariates shouldn’t have any systematic pattern.

Goodness-of-Fit Tests for Survival Model
Definition

No Censoring Case

Kolmogorov–Smirnov Test
Definition

The Kolmogorov–Smirnov test (KS test) is a non-parametric test for the equality of continuous distribution functions.

The Kolmogorov–Smirnov test statistic for a given CDF $F$ is defined as $K S = \sup_{t} ∣ F_{n} (t) - F (t) ∣ = \max_{1 \leq i \leq n} \max [\frac{i}{n} - F (t_{(i)}), F (t_{(i)}) - \frac{i - 1}{n}]$ where $F_{n}$ is the Empirical Distribution Function based on the i.i.d. random variables $X_{i}, i = 1, 2, \dots, n$ , and $t_{(1)} \leq \dots \leq t_{(n)}$ are the ordered observations.
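
The maximum over the ordered observations is straightforward to compute. The sketch below is a minimal illustration that tests a made-up sample against the Uniform(0, 1) CDF; the function names are hypothetical.

```python
def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF and cdf, checked at each jump."""
    ordered = sorted(sample)
    n = len(ordered)
    return max(
        max((i + 1) / n - cdf(t), cdf(t) - i / n)  # i is 0-based here
        for i, t in enumerate(ordered)
    )


def uniform_cdf(t):
    """CDF of the Uniform(0, 1) distribution."""
    return min(1.0, max(0.0, t))


ks = ks_statistic([0.7, 0.1, 0.8, 0.4], uniform_cdf)
print(ks)
```

Both one-sided gaps are needed because the empirical CDF jumps at each observation, so the supremum can occur just before or exactly at a jump.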

Cramer-von Mises Test
Definition

The Cramer-von Mises test is a non-parametric test for the equality of continuous distribution functions.

The Cramer-von Mises test statistic for a given CDF $F$ is defined as $C V = n \int_{- \infty}^{\infty} [F_{n} (t) - F (t)]^{2} d F (t) = \frac{1}{12 n} + \sum_{i = 1}^{n} [\frac{2 i - 1}{2 n} - F (t_{(i)})]^{2}$ where $F_{n}$ is the Empirical Distribution Function based on the i.i.d. random variables $X_{i}, i = 1, 2, \dots, n$ , and $t_{(1)} \leq \dots \leq t_{(n)}$ are the ordered observations.
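
The closed-form sum can be evaluated directly. This is a minimal sketch using a made-up sample and the Uniform(0, 1) CDF as the hypothesized distribution; the function names are hypothetical.

```python
def cvm_statistic(sample, cdf):
    """Computational form 1/(12n) + sum of squared gaps at the ordered points."""
    ordered = sorted(sample)
    n = len(ordered)
    total = 1.0 / (12 * n)
    for i, t in enumerate(ordered):  # i is 0-based, so 2i + 1 equals 2(i + 1) - 1
        total += ((2 * i + 1) / (2 * n) - cdf(t)) ** 2
    return total


def uniform_cdf(t):
    """CDF of the Uniform(0, 1) distribution."""
    return min(1.0, max(0.0, t))


cvm = cvm_statistic([0.7, 0.1, 0.8, 0.4], uniform_cdf)
print(cvm)
```

The statistic attains its minimum $\frac{1}{12 n}$ when every ordered observation sits exactly at $F (t_{(i)}) = \frac{2 i - 1}{2 n}$ .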

Censoring Case

Generalized Kolmogorov–Smirnov Test

The generalized Kolmogorov–Smirnov test uses the Kaplan-Meier Estimator in place of the Empirical Distribution Function used in the Kolmogorov–Smirnov Test: $G K S = \sqrt{n} \sup_{t} ∣ \hat{F} (t) - F (t) ∣$ where $\hat{F} = 1 - \hat{S}$ is the Kaplan-Meier Estimator of the CDF based on the censored observations $(Y_{i}, δ_{i}), i = 1, 2, \dots, n$ .

Generalized Cramer-von Mises Test

The generalized Cramer-von Mises test uses the Kaplan-Meier Estimator in place of the Empirical Distribution Function used in the Cramer-von Mises Test: $GC V = n \int_{0}^{1} [\hat{F} (t) - F (t)]^{2} d t$ where $\hat{F} = 1 - \hat{S}$ is the Kaplan-Meier Estimator of the CDF based on the censored observations $(Y_{i}, δ_{i}), i = 1, 2, \dots, n$ .

Miscellaneous Topics

Multivariate Survival Model
Kinds

Copula Model
Definition

Let $T_{1}, T_{2}, \dots, T_{n}$ be i.i.d. $k$ -dimensional survival time vectors with joint CDF $F (t_{1}, t_{2}, \dots, t_{k})$ , joint PDF $f (t_{1}, t_{2}, \dots, t_{k})$ , and joint Survival Function $S (t_{1}, t_{2}, \dots, t_{k})$ . Let $S_{j} (t_{j}) = P (T_{j} > t_{j})$ be the marginal survival function and $F_{j} (t_{j}) = P (T_{j} \leq t_{j})$ the marginal CDF of the $j$ -th component, and let $C_{1}, C_{2}, \dots, C_{n}$ be i.i.d. censoring times.

A copula model defines the joint survival function as a copula function whose arguments are the marginal survival functions.

When $k = 2$ , $S (t_{1}, t_{2}) = K (S_{1} (t_{1}), S_{2} (t_{2}))$ where $K$ is the copula function with uniform marginals.

Copula Functions

Clayton’s copula function $S (t_{1}, t_{2}) = (S_{1} (t_{1})^{- ϕ} + S_{2} (t_{2})^{- ϕ} - 1)^{- 1/ ϕ}$ where $ϕ \in R^{+}$

Crowder’s copula function $S (t_{1}, t_{2}) = (1 + λ_{1} t_{1}^{γ_{1}} + λ_{2} t_{2}^{γ_{2}})^{- ν}$

Hougaard’s copula function $S (t_{1}, t_{2}) = (1 + ϕ Λ_{1} (t_{1}) + ϕ Λ_{2} (t_{2}))^{- 1/ ϕ}$
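
As a quick numerical illustration of the Clayton construction, the sketch below evaluates the joint survival probability from two marginal survival probabilities. It assumes the symmetric form in which both margins are raised to the power $- ϕ$ ; the function name and the input values are made up.

```python
def clayton_survival(s1, s2, phi):
    """Joint survival probability from marginal survival probabilities s1 and s2."""
    return (s1 ** (-phi) + s2 ** (-phi) - 1.0) ** (-1.0 / phi)


# Setting one margin to 1 returns the other margin (uniform marginals)
margin_check = clayton_survival(0.7, 1.0, 2.0)

# As phi approaches 0 the joint survival approaches the independence product
near_independent = clayton_survival(0.7, 0.5, 1e-6)

# A larger phi gives positive dependence: joint survival exceeds the product
dependent = clayton_survival(0.7, 0.5, 2.0)
print(margin_check, near_independent, dependent)
```

The parameter $ϕ$ therefore controls the strength of association between the two survival times, with independence as the limiting case.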


Competing Risks Model
Definition

The competing risks (multiple modes of failure) model is designed to accommodate multiple causes of the same event.

Let $(T_{i}, K_{i}), i = 1, 2, \dots, n$ be the survival time and failure mode (competing risk type) for each individual, where $K_{i} \in \{ 1, 2, \dots, k \}$

The mode-specific hazard function is defined as $λ_{j} (t) = \lim_{Δ t \to 0} \frac{P (T < t + Δ t, K = j ∣ T > t)}{Δ t}$ where $j = 1, 2, \dots, k$

The marginal hazard function and marginal cumulative hazard function are defined as $λ (t) = \sum_{j = 1}^{k} λ_{j} (t)$ and $Λ (t) = \sum_{j = 1}^{k} Λ_{j} (t)$

Likelihood

We observe $(Y_{i}, δ_{i}, K_{i}), i = 1, 2, \dots, n$ , where $Y_{i} = \min (T_{i}, C_{i})$ , $δ_{i} = I (T_{i} \leq C_{i})$ is the censoring indicator, and $K_{i}$ is the failure mode.

The likelihood function is defined as $L = \prod_{i = 1}^{n} [\prod_{j = 1}^{k} f_{j} (y_{i})^{δ_{ij}}] S (y_{i})^{1 - δ_{i}}$ where $δ_{ij} = I (T_{i} \leq C_{i}, K_{i} = j)$

Estimators

The Nelson-Aalen Estimator for the competing risks model is defined as $\hat{Λ}_{j} (t) = \sum_{i : y_{i} \leq t} \frac{δ_{ij}}{n_{i}}$ where $n_{i}$ is the number at risk at $y_{i}$ and $j = 1, 2, \dots, k$

The survival function is estimated by $\hat{S} (t) = \exp [- \sum_{j = 1}^{k} \hat{Λ}_{j} (t)]$

The sub-distribution function is estimated by $\hat{F}_{j} (t) = \sum_{i : y_{i} \leq t} \hat{S} (y_{i}) \frac{δ_{ij}}{n_{i}}$ where $j = 1, 2, \dots, k$
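
The mode-specific Nelson-Aalen sums and the resulting overall survival estimate can be sketched as below. The example assumes no tied observation times, codes censoring as mode 0, and uses made-up data; `cause_specific_nelson_aalen` is a hypothetical helper name.

```python
import math


def cause_specific_nelson_aalen(times, modes, t, k):
    """Cumulative hazard estimate at t for each failure mode 1..k.

    modes[i] is 0 for a censored observation and j in 1..k for failure mode j.
    Assumes distinct observation times.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    n = len(times)
    cumulative = [0.0] * (k + 1)
    for rank, i in enumerate(order):
        if times[i] > t:
            break
        at_risk = n - rank  # subjects with observed time >= times[i]
        if modes[i] > 0:
            cumulative[modes[i]] += 1.0 / at_risk
    return cumulative[1:]


# Made-up data: five subjects, two failure modes, one censored observation
times = [1, 2, 3, 4, 5]
modes = [1, 2, 0, 1, 2]

hazards = cause_specific_nelson_aalen(times, modes, t=5, k=2)
overall_survival = math.exp(-sum(hazards))
print(hazards, overall_survival)
```

Because each failure is counted under exactly one mode, the mode-specific sums add up to the ordinary Nelson-Aalen estimate for failure from any cause.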

Examples

When $k = 2$ , $λ_{j} (t) = lim_{Δ t \to 0} \frac{P ( T _{j} < t + Δ t ∣ T _{1} > t , T _{2} > t )}{Δ t}$ where $j = 1, 2$

$λ_{1∣2} (t_{1} ∣ t_{2}) = lim_{Δ t \to 0} \frac{P ( T _{1} < t _{1} + Δ t ∣ T _{1} > t _{1} , T _{2} = t _{2} )}{Δ t}$ where $t_{1} > t_{2}$

$λ_{2∣1} (t_{2} ∣ t_{1}) = lim_{Δ t \to 0} \frac{P ( T _{2} < t _{2} + Δ t ∣ T _{1} = t _{1} , T _{2} > t _{2} )}{Δ t}$ where $t_{1} < t_{2}$

Basic Issues in Clinical Trials

Observational Study
Definition

An observational study draws inferences from a sample to a population in a setting where the independent variable is not under the control of the researcher.

Types

Case-Control Study
Definition

A case-control study is a retrospective study that compares two groups of people: those with a specific outcome (cases) and similar people without the outcome (controls).

Examples

Suppose that researchers want to study the relationship between smoking and lung cancer. They identify 100 lung cancer patients (cases) and 100 matched individuals without lung cancer (controls). They then collect data on past smoking habits for both groups and compare the prevalence of smoking between cases and controls.

Cohort Study
Definition

A cohort study is a prospective study that follows a group of individuals (cohort) over time to determine the incidence of a specific outcome.

Examples

Suppose that researchers want to study the relationship between smoking and lung cancer. They follow a group of 10,000 people for 20 years, comparing smokers to non-smokers to determine the incidence of lung cancer.

Cross-Sectional Study
Definition

A cross-sectional study analyzes data from a population at a specific point in time. It provides a snapshot of the prevalence of an outcome in a population.

Examples

Suppose that researchers want to study the relationship between smoking and lung cancer. They survey 1,000 people in a city, collecting data on their current smoking habits and presence of lung cancer.

Relative Risk
Definition

Consider a $2 \times 2$ Contingency Table

	Event	Non-event
Group 1	$n_{11}$	$n_{12}$	$n_{1.}$
Group 2	$n_{21}$	$n_{22}$	$n_{2.}$
	$n_{.1}$	$n_{.2}$	$n_{..}$

Point Estimation

The relative risk (RR) is estimated by $RR = \frac{\text{experimental event rate}}{\text{control event rate}} = \frac{n_{11} / n_{1.}}{n_{21} / n_{2.}}$

Confidence Interval

The confidence interval for the relative risk is obtained by the Delta Method. $\hat{Var} (\ln RR) = \frac{1}{n_{11}} (1 - \frac{n_{11}}{n_{1.}}) + \frac{1}{n_{21}} (1 - \frac{n_{21}}{n_{2.}})$ The $100 (1 - α) \%$ confidence interval for $RR$ is defined as $\exp [\ln RR \pm z_{α /2} \sqrt{\hat{Var} (\ln RR)}]$
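
A small numerical sketch of the point estimate and interval follows. It assumes the delta-method variance $\frac{1}{n_{11}} - \frac{1}{n_{1.}} + \frac{1}{n_{21}} - \frac{1}{n_{2.}}$ and a 95% interval with $z = 1.96$ ; the counts and the function name are made up.

```python
import math


def relative_risk_ci(n11, n12, n21, n22, z=1.96):
    """Relative risk with a delta-method confidence interval on the log scale."""
    n1 = n11 + n12  # row total of group 1
    n2 = n21 + n22  # row total of group 2
    rr = (n11 / n1) / (n21 / n2)
    var_log_rr = (1 / n11) * (1 - n11 / n1) + (1 / n21) * (1 - n21 / n2)
    half_width = z * math.sqrt(var_log_rr)
    return rr, math.exp(math.log(rr) - half_width), math.exp(math.log(rr) + half_width)


# Made-up counts: 20 of 100 events in group 1, 10 of 100 events in group 2
rr, lower, upper = relative_risk_ci(20, 80, 10, 90)
print(rr, lower, upper)
```

The interval is symmetric on the log scale, so it is asymmetric around the point estimate after exponentiation.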

Facts

Relative risk is appropriate for a Cohort Study or an experimental study; it cannot be used for a Case-Control Study.


Odds Ratio
Definition

Consider a $2 \times 2$ Contingency Table

	Event	Non-event
Group 1	$n_{11}$	$n_{12}$	$n_{1.}$
Group 2	$n_{21}$	$n_{22}$	$n_{2.}$
	$n_{.1}$	$n_{.2}$	$n_{..}$

Point Estimation

The odds ratio (OR) is estimated by $OR = \frac{\text{odds in case group}}{\text{odds in control group}} = \frac{(n_{11} / n_{.1}) / (n_{21} / n_{.1})}{(n_{12} / n_{.2}) / (n_{22} / n_{.2})} = \frac{n_{11} n_{22}}{n_{21} n_{12}}$

Confidence Interval

The confidence interval for the odds ratio is obtained by the Delta Method. $\hat{Var} (\ln OR) = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}$ The $100 (1 - α) \%$ confidence interval for $OR$ is defined as $\exp [\ln OR \pm z_{α /2} \sqrt{\hat{Var} (\ln OR)}]$
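
The same kind of calculation for the odds ratio is sketched below, assuming a 95% interval with $z = 1.96$ and the standard error taken as the square root of the variance above; the counts and the function name are made up.

```python
import math


def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Odds ratio with a delta-method confidence interval on the log scale."""
    odds_ratio = (n11 * n22) / (n21 * n12)
    var_log_or = 1 / n11 + 1 / n12 + 1 / n21 + 1 / n22
    half_width = z * math.sqrt(var_log_or)
    return (odds_ratio,
            math.exp(math.log(odds_ratio) - half_width),
            math.exp(math.log(odds_ratio) + half_width))


# Made-up counts: 20 events and 80 non-events in group 1, 10 and 90 in group 2
odds_ratio, lower, upper = odds_ratio_ci(20, 80, 10, 90)
print(odds_ratio, lower, upper)
```

Small cell counts inflate the variance quickly, since each cell enters through its reciprocal.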

Facts

Odds ratio can be used for Case-Control Study.


Tests of Association
Kinds

Independence Test for Two Discrete Variables
Definition

Let $A$ and $B$ be categorical variables with categories $A_{1}, \dots, A_{r}$ and $B_{1}, \dots, B_{c}$ , and consider the null hypothesis $H_{0}$ : $A$ and $B$ are independent. Then the test Statistic, which asymptotically follows a Chi-squared Distribution, is defined as $Q = \sum_{j = 1}^{c} \sum_{i = 1}^{r} \frac{(X_{ij} - n \hat{p}_{ij})^{2}}{n \hat{p}_{ij}} \xrightarrow{D} χ^{2} ((r - 1) (c - 1))$ where $n = \sum_{j} \sum_{i} X_{ij}$ , $\hat{p}_{ij} = \hat{p}_{i .} \hat{p}_{. j} = \frac{X_{i .}}{n} \frac{X_{. j}}{n}$ , $X_{i .} = \sum_{j} X_{ij}$ , and $X_{. j} = \sum_{i} X_{ij}$
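
The statistic is a sum over cells of squared deviations from the expected counts $n \hat{p}_{ij}$ . A minimal sketch with a made-up $2 \times 2$ table follows; the function name is hypothetical.

```python
def chi_square_independence(table):
    """Return (Q, degrees of freedom) for an r x c table of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    q = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # n * p_i. * p_.j
            q += (observed - expected) ** 2 / expected
    degrees = (len(row_totals) - 1) * (len(col_totals) - 1)
    return q, degrees


# Made-up counts
q, degrees = chi_square_independence([[10, 20], [20, 10]])
print(q, degrees)
```

Here every expected count equals 15, so each of the four cells contributes $25 / 15$ to the statistic.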

Fisher’s Exact Test

The test statistic is defined as $T = \sum_{j = x_{1}}^{x_{2}} \frac{\binom{n_{1}}{j} \binom{n_{2}}{m_{1} - j}}{\binom{n}{m_{1}}}$ where $x_{1} = \max (0, m_{1} - n_{2})$ , $x_{2} = \min (n_{1}, m_{1})$

We reject $H_{0}$ if $T \leq α$

McNemar's Test
Definition

McNemar’s test is used to analyze paired nominal data (the same subjects are observed under both conditions), particularly in before-and-after studies or matched-pair designs.

The test statistic is defined as $Q = \frac{(n_{12} - n_{21})^{2}}{n_{12} + n_{21}} \sim χ^{2} (1)$

Examples

	After treatment / No insomnia	After treatment / Insomnia
Before treatment / No insomnia	45	15
Before treatment / Insomnia	25	15

Consider the null hypothesis $H_{0}$ : the treatment is not effective. $Q = \frac{(15 - 25)^{2}}{15 + 25} = 2.5 < χ_{0.05}^{2} (1) = 3.84$ We cannot reject the null hypothesis at the $5 \%$ level.
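
The arithmetic of the example can be reproduced in a few lines. The sketch below also computes a p-value, using the fact that a $χ^{2} (1)$ variable is the square of a standard normal variable; the function name is hypothetical.

```python
import math


def mcnemar(n12, n21):
    """McNemar statistic and its chi-square(1) upper-tail p-value."""
    q = (n12 - n21) ** 2 / (n12 + n21)
    # P(chi2_1 > q) = P(|Z| > sqrt(q)) = erfc(sqrt(q / 2))
    p_value = math.erfc(math.sqrt(q / 2))
    return q, p_value


# Discordant pairs from the insomnia table: 15 and 25
q, p_value = mcnemar(15, 25)
print(q, p_value)
```

Only the discordant pairs enter the statistic; the 45 and 15 subjects whose status did not change carry no information about the treatment effect.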

Confusion Matrix
Definition

	Predicted Positive (PP)	Predicted Negative (PN)
Actual Positive	True Positive (TP)	False Negative (FN)
Actual Negative	False Positive (FP)	True Negative (TN)

Metrics

Accuracy
Definition

$\frac{TP + TN}{TP + TN + FP + FN}$

Recall
Definition

$\frac{TP}{TP + FN}$

Recall (also called Sensitivity or True positive rate) is the proportion of correctly predicted cases among all actual positive cases.

Precision
Definition

$\frac{TP}{TP + FP}$

Precision is the proportion of actually positive cases among the cases predicted as positive.

Specificity
Definition

$\frac{TN}{FP + TN}$

Specificity is the proportion of correctly predicted cases among all actual negative cases.

Type 1 Error
Definition

$\frac{FP}{FP + TN}$

Type 2 Error
Definition

$\frac{FN}{TP + FN}$

F-Score
Definition

F1 Score

$F_{1} = \frac{2}{\text{recall}^{- 1} + \text{precision}^{- 1}} = 2 \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$ The harmonic mean of Precision and Recall.

F-beta score

$F_{β} = (1 + β^{2}) \frac{\text{precision} \cdot \text{recall}}{(β^{2} \cdot \text{precision}) + \text{recall}}$ where $β > 0$

Recall is considered $β$ times as important as Precision.

Positive Predictive Value
Definition

$\frac{\text{Sensitivity} \times \text{Prevalence}}{(\text{Sensitivity} \times \text{Prevalence}) + ((1 - \text{Specificity}) \times (1 - \text{Prevalence}))}$ where $\text{Prevalence} = \frac{\text{positive cases}}{\text{total population}}$

Positive predictive value (in medical statistics and epidemiology) is the proportion of actually positive cases among the cases predicted as positive.
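
The metrics above can all be derived from the four cell counts. The sketch below uses made-up counts and a hypothetical function name; it also checks that the prevalence-based formula for the positive predictive value agrees with Precision when the prevalence is taken from the same table.

```python
def classification_metrics(tp, fn, fp, tn, beta=1.0):
    """Compute the confusion-matrix metrics from the four cell counts."""
    total = tp + fn + fp + tn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (fp + tn)
    prevalence = (tp + fn) / total
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "type_1_error": fp / (fp + tn),
        "type_2_error": fn / (tp + fn),
        "f_beta": (1 + beta ** 2) * precision * recall
                  / (beta ** 2 * precision + recall),
        "ppv": recall * prevalence
               / (recall * prevalence + (1 - specificity) * (1 - prevalence)),
    }


# Made-up counts: TP = 40, FN = 10, FP = 20, TN = 130
metrics = classification_metrics(tp=40, fn=10, fp=20, tn=130)
print(metrics)
```

When the prevalence in the target population differs from the one in the table, the prevalence-based formula is the one to use, since Precision computed from the table would no longer apply.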

Receiver Operating Characteristic Curve
Definition

A receiver operating characteristic curve (ROC curve) is a plot of the True Positive Rate against the False Positive Rate at each threshold setting.

AUC

The area under the ROC curve is called AUC (Area under curve)
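
A minimal sketch of building the curve and integrating it with the trapezoidal rule follows. It assumes all scores are distinct and that labels are coded 1 (positive) and 0 (negative); the scores, labels, and function names are made up.

```python
def roc_curve(scores, labels):
    """ROC points (false positive rate, true positive rate), distinct scores assumed."""
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    # Lower the threshold one observation at a time, from the highest score down
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a piecewise-linear curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_curve(scores, labels)
auc = area_under_curve(points)
print(points, auc)
```

With distinct scores the area equals the fraction of positive-negative pairs in which the positive case has the higher score.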