Survival Analysis (Time-to-Event) with Cox Proportional Hazards
This notebook demonstrates how to perform survival analysis (time-to-event analysis) using Cox proportional hazards regression in MARVELous. Survival analysis is used when the outcome of interest is the time until an event occurs, such as:
Time to disease onset
Time to death (mortality)
Time to disease progression
Time to treatment response
Cox Proportional Hazards Model
The Cox proportional hazards model is the most commonly used method for analysing survival data. It estimates hazard ratios (HR) that quantify how genetic variants affect the instantaneous risk of an event occurring.
Key features:
Handles censored data (individuals who haven’t experienced the event yet)
Adjusts for covariates (age, sex, etc.)
Provides interpretable hazard ratios
Does not require specifying a baseline hazard function
[1]:
# imports
import os
import numpy as np
import pandas as pd
from marvel.example_data import examples
from marvel.association import class_test
from marvel.utils.utils import load_custom
from marvel.extraction.aggregation import extract_genes
test_mod = load_custom()
/Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/utils/utils.py:455: Warning: The environmental variable `MARVEL_TEST_DEFS` is not set. Trying to recover using the default configuration files in /Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/association/tests.py.
warnings.warn(
Understanding Survival Data
Survival data requires two pieces of information for each individual:
Event indicator: A binary variable (0/1) indicating whether the event occurred
0 = censored (event has not occurred by the end of follow-up)
1 = event occurred
Time-to-event: The time from baseline to either:
The event occurrence (for those with event = 1)
The end of follow-up (for those with event = 0, censored)
Censoring is a key concept: individuals who haven’t experienced the event by the end of the study are “censored.” The Cox model correctly accounts for this partial information rather than discarding these individuals.
[2]:
# Extract genetic information
summary, gen_df = extract_genes(
input_file=examples.get_path('plink_example', 'example_genotypes.bed'),
genes=examples.get_data('dummy_variants'),
cat_column='gene',
summarise=True,
neg_geno=None,
sum_geno=1,
incl_var=False,
)
gen_df
[2]:
| PCSK9 | ANGPTL3 | HMGCR | NPC1L1 | CETP | PPARA | |
|---|---|---|---|---|---|---|
| ID_000000 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| ID_000001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ID_000002 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ID_000003 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN |
| ID_000004 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... |
| ID_000995 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ID_000996 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 |
| ID_000997 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ID_000998 | 0.0 | NaN | 0.0 | 1.0 | NaN | 0.0 |
| ID_000999 | 0.0 | NaN | 0.0 | 0.0 | NaN | 0.0 |
1000 rows × 6 columns
[3]:
# Create synthetic survival data for demonstration
rng = np.random.default_rng(42)
n_samples = len(gen_df)
# Simulate cardiovascular disease (CVD) survival data
survival_data = pd.DataFrame({
'cvd_event': rng.binomial(1, 0.3, n_samples), # ~30% event rate
'follow_up_time': rng.exponential(5.0, n_samples) + 0.5, # years of follow-up
}, index=gen_df.index)
# Round follow-up time for readability
survival_data['follow_up_time'] = survival_data['follow_up_time'].round(2)
survival_data.head(10)
[3]:
| cvd_event | follow_up_time | |
|---|---|---|
| ID_000000 | 1 | 0.74 |
| ID_000001 | 0 | 2.99 |
| ID_000002 | 1 | 1.16 |
| ID_000003 | 0 | 1.58 |
| ID_000004 | 0 | 3.13 |
| ID_000005 | 1 | 2.70 |
| ID_000006 | 1 | 14.95 |
| ID_000007 | 1 | 2.93 |
| ID_000008 | 0 | 15.54 |
| ID_000009 | 0 | 4.87 |
In this synthetic data:
cvd_event: 1 = CVD occurred, 0 = no CVD by end of follow-up (censored)follow_up_time: Time in years from baseline to event or censoring
Preparing Data for Analysis
For survival analysis, we need:
Genetic exposures (carrier status for variants/genes)
Event indicator (0/1)
Time-to-event (continuous, positive)
Covariates (optional, for adjusted analyses)
[4]:
# Create covariate data
cov_data = pd.DataFrame({
'age': rng.normal(60, 10, n_samples).round(1),
'sex': rng.binomial(1, 0.5, n_samples),
}, index=gen_df.index)
# Merge all data sources by index
full_df = gen_df.join(survival_data).join(cov_data)
# Reset index to create 'id' column
full_df = full_df.reset_index()
full_df.rename(columns={full_df.columns[0]: 'id'}, inplace=True)
# Remove any missingness (listwise deletion)
full_df.dropna(inplace=True)
# Separate covariates for the analysis
cov_df = full_df[['id', 'age', 'sex']].copy()
full_df.head()
[4]:
| id | PCSK9 | ANGPTL3 | HMGCR | NPC1L1 | CETP | PPARA | cvd_event | follow_up_time | age | sex | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ID_000001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 2.99 | 65.5 | 0 |
| 4 | ID_000004 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 3.13 | 60.9 | 0 |
| 6 | ID_000006 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 14.95 | 63.0 | 0 |
| 8 | ID_000008 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0 | 15.54 | 54.7 | 0 |
| 10 | ID_000010 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 2.99 | 55.5 | 1 |
Running Cox Proportional Hazards Analysis
For survival analysis, we need to specify:
The event column in the
test_dictunder the ‘survival’ categoryA
survival_time_mapthat maps each event column to its corresponding time column
We’ll test whether PCSK9 variant carriers have different cardiovascular disease risk.
[5]:
# Initialize the tester
tester = class_test.DifferencesTester(test_module=test_mod.AllTests(), min_group_size=1)
# Define survival test - specify the event column
test_dict = {
'survival': {'cvd_event': 'Cox-PH'},
}
# Map event column to time column
survival_time_map = {
'cvd_event': 'follow_up_time'
}
# Define covariate models
models = {
'Unadjusted': 'None',
'Adjusted': 'age;sex'
}
# Run the analysis for PCSK9 (exposure must be a single string)
base_table, coef_table = tester.run(
data=full_df,
exposure='PCSK9',
models=models,
test_dict=test_dict,
cov=cov_df,
id_column='id',
survival_time_map=survival_time_map
)
coef_table
[5]:
| Model | Model name | Stratification name | Variable | Exposure | Cases | Samples | Exposed | Non-exposed | Point Estimates | Standard Error | Test statistic | P-values | Formatted | Formatted (exp) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cox-PH | Unadjusted | <NA> | cvd_event | PCSK9 | 133 | 425 | 28 | 397 | 0.155602 | 0.329591 | <NA> | 0.636851 | 0.16 (-0.49; 0.80) | 1.17 (0.61; 2.23) |
| 0 | Cox-PH | Adjusted | <NA> | cvd_event | PCSK9 | 133 | 425 | 28 | 397 | 0.031146 | 0.339110 | <NA> | 0.926821 | 0.03 (-0.63; 0.70) | 1.03 (0.53; 2.01) |
| 1 | Cox-PH | Adjusted | <NA> | cvd_event | age | 133 | 425 | 28 | 397 | 0.017668 | 0.008762 | <NA> | 0.043766 | 0.02 (0.00; 0.03) | 1.02 (1.00; 1.04) |
| 2 | Cox-PH | Adjusted | <NA> | cvd_event | sex | 133 | 425 | 28 | 397 | 0.097287 | 0.177686 | <NA> | 0.584020 | 0.10 (-0.25; 0.45) | 1.10 (0.78; 1.56) |
Interpreting Results
The output table contains:
Point Estimates: Log hazard ratio (coefficient)
Formatted (exp): Hazard ratio (HR) with 95% confidence interval
P-values: Statistical significance
Understanding Hazard Ratios:
HR = 1.0: No effect (carriers and non-carriers have the same risk)
HR > 1.0: Increased hazard (carriers have higher risk)
Example: HR = 1.5 means 50% increased risk for carriers
HR < 1.0: Decreased hazard (protective effect)
Example: HR = 0.7 means 30% reduced risk for carriers
Statistical significance:
Check if the 95% CI includes 1.0
If CI includes 1.0, the association is not statistically significant
P-value < 0.05 indicates statistical significance at the conventional threshold
Adjustment for covariates:
The “Adjusted” model controls for age and sex, providing the hazard ratio for genetic exposure independent of these factors.
Testing Multiple Genes
You can test multiple genetic exposures simultaneously:
[6]:
# Get all available genes
all_genes = list(gen_df.columns)
# Run analysis for each gene with age/sex adjustment
# DifferencesTester.run() takes a single exposure at a time
coef_tables = []
for gene in all_genes:
_, coef = tester.run(
data=full_df,
exposure=gene,
models={'Adjusted': 'age;sex'},
test_dict=test_dict,
cov=cov_df,
id_column='id',
survival_time_map=survival_time_map
)
coef_tables.append(coef)
coef_table_all = pd.concat(coef_tables, ignore_index=True)
# Show only gene-level results (not covariate coefficients)
coef_table_all[coef_table_all['Exposure'].isin(all_genes)]
[6]:
| Model | Model name | Stratification name | Variable | Exposure | Cases | Samples | Exposed | Non-exposed | Point Estimates | Standard Error | Test statistic | P-values | Formatted | Formatted (exp) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cox-PH | Adjusted | NaN | cvd_event | PCSK9 | 133 | 425 | 28 | 397 | 0.031146 | 0.339110 | NaN | 0.926821 | 0.03 (-0.63; 0.70) | 1.03 (0.53; 2.01) |
| 3 | Cox-PH | Adjusted | NaN | cvd_event | ANGPTL3 | 133 | 425 | 15 | 410 | -0.696982 | 0.592864 | NaN | 0.239747 | -0.70 (-1.86; 0.47) | 0.50 (0.16; 1.59) |
| 6 | Cox-PH | Adjusted | NaN | cvd_event | HMGCR | 133 | 425 | 27 | 398 | -0.107793 | 0.349959 | NaN | 0.758071 | -0.11 (-0.79; 0.58) | 0.90 (0.45; 1.78) |
| 9 | Cox-PH | Adjusted | NaN | cvd_event | NPC1L1 | 133 | 425 | 41 | 384 | 0.149359 | 0.316789 | NaN | 0.637300 | 0.15 (-0.47; 0.77) | 1.16 (0.62; 2.16) |
| 12 | Cox-PH | Adjusted | NaN | cvd_event | CETP | 133 | 425 | 17 | 408 | -1.277074 | 0.725259 | NaN | 0.078263 | -1.28 (-2.70; 0.14) | 0.28 (0.07; 1.16) |
| 15 | Cox-PH | Adjusted | NaN | cvd_event | PPARA | 133 | 425 | 23 | 402 | 0.364442 | 0.346378 | NaN | 0.292730 | 0.36 (-0.31; 1.04) | 1.44 (0.73; 2.84) |
Using Configuration Files
For production analyses, use configuration files instead of notebooks. Here’s a complete example:
# survival_analysis.cnf
[ExpInput]
carriers /path/to/extracted_carriers.tsv.gz
[PhenoInput]
phenotypes /path/to/phenotypes.tsv
covariates /path/to/covariates.tsv
[SurvTests]
death;follow_up_years Cox-PH
cvd_event;time_to_cvd Cox-PH
cancer;age_at_diagnosis Cox-PH
[Covs]
Unadjusted None
Age_Sex age;sex
Full age;sex;bmi;smoking;PC1;PC2;PC3;PC4
[Options]
extract_variants False
association_analysis True
id_column IID
output_path /results/survival
In the configuration file, the [SurvTests] section uses the format:
Format:
event_column;time_column<tab>test_nameNote the semicolon (
;) separator between event and time columnsThe configuration parser automatically creates the
survival_time_mapfrom this syntax
Then run:
python marvel/main.py survival_analysis.cnf -v
Results will be saved to compressed TSV files with hazard ratios and confidence intervals.