Survival Analysis (Time-to-Event) with Cox Proportional Hazards

This notebook demonstrates how to perform survival analysis (time-to-event analysis) using Cox proportional hazards regression in MARVELous. Survival analysis is used when the outcome of interest is the time until an event occurs, such as:

Time to disease onset
Time to death (mortality)
Time to disease progression
Time to treatment response

Cox Proportional Hazards Model

The Cox proportional hazards model is the most commonly used method for analysing survival data. It estimates hazard ratios (HR) that quantify how genetic variants affect the instantaneous risk of an event occurring.

Key features:

Handles censored data (individuals who haven’t experienced the event yet)
Adjusts for covariates (age, sex, etc.)
Provides interpretable hazard ratios
Does not require specifying a baseline hazard function

[1]:

# imports
import os
import numpy as np
import pandas as pd
from marvel.example_data import examples
from marvel.association import class_test
from marvel.utils.utils import load_custom
from marvel.extraction.aggregation import extract_genes

test_mod = load_custom()

/Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/utils/utils.py:455: Warning: The environmental variable `MARVEL_TEST_DEFS` is not set. Trying to recover using the default configuration files in /Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/association/tests.py.
  warnings.warn(

Understanding Survival Data

Survival data requires two pieces of information for each individual:

Event indicator: A binary variable (0/1) indicating whether the event occurred
- 0 = censored (event has not occurred by the end of follow-up)
- 1 = event occurred
Time-to-event: The time from baseline to either:
- The event occurrence (for those with event = 1)
- The end of follow-up (for those with event = 0, censored)

Censoring is a key concept: individuals who haven’t experienced the event by the end of the study are “censored.” The Cox model correctly accounts for this partial information rather than discarding these individuals.

[2]:

# Extract genetic information
summary, gen_df = extract_genes(
    input_file=examples.get_path('plink_example', 'example_genotypes.bed'),
    genes=examples.get_data('dummy_variants'),
    cat_column='gene',
    summarise=True,
    neg_geno=None,
    sum_geno=1,
    incl_var=False,
)

gen_df

[2]:

	PCSK9	ANGPTL3	HMGCR	NPC1L1	CETP	PPARA
ID_000000	0.0	NaN	0.0	0.0	0.0	0.0
ID_000001	0.0	0.0	0.0	0.0	0.0	0.0
ID_000002	NaN	0.0	0.0	0.0	0.0	0.0
ID_000003	0.0	0.0	0.0	0.0	NaN	NaN
ID_000004	0.0	0.0	0.0	0.0	0.0	0.0
...	...	...	...	...	...	...
ID_000995	0.0	1.0	0.0	0.0	0.0	0.0
ID_000996	0.0	0.0	NaN	0.0	0.0	0.0
ID_000997	0.0	0.0	0.0	0.0	0.0	0.0
ID_000998	0.0	NaN	0.0	1.0	NaN	0.0
ID_000999	0.0	NaN	0.0	0.0	NaN	0.0

1000 rows × 6 columns

[3]:

# Create synthetic survival data for demonstration
rng = np.random.default_rng(42)
n_samples = len(gen_df)

# Simulate cardiovascular disease (CVD) survival data
survival_data = pd.DataFrame({
    'cvd_event': rng.binomial(1, 0.3, n_samples),  # ~30% event rate
    'follow_up_time': rng.exponential(5.0, n_samples) + 0.5,  # years of follow-up
}, index=gen_df.index)

# Round follow-up time for readability
survival_data['follow_up_time'] = survival_data['follow_up_time'].round(2)

survival_data.head(10)

[3]:

	cvd_event	follow_up_time
ID_000000	1	0.74
ID_000001	0	2.99
ID_000002	1	1.16
ID_000003	0	1.58
ID_000004	0	3.13
ID_000005	1	2.70
ID_000006	1	14.95
ID_000007	1	2.93
ID_000008	0	15.54
ID_000009	0	4.87

In this synthetic data:

cvd_event: 1 = CVD occurred, 0 = no CVD by end of follow-up (censored)
follow_up_time: Time in years from baseline to event or censoring

Preparing Data for Analysis

For survival analysis, we need:

Genetic exposures (carrier status for variants/genes)
Event indicator (0/1)
Time-to-event (continuous, positive)
Covariates (optional, for adjusted analyses)

[4]:

# Create covariate data
cov_data = pd.DataFrame({
    'age': rng.normal(60, 10, n_samples).round(1),
    'sex': rng.binomial(1, 0.5, n_samples),
}, index=gen_df.index)

# Merge all data sources by index
full_df = gen_df.join(survival_data).join(cov_data)

# Reset index to create 'id' column
full_df = full_df.reset_index()
full_df.rename(columns={full_df.columns[0]: 'id'}, inplace=True)

# Remove any missingness (listwise deletion)
full_df.dropna(inplace=True)

# Separate covariates for the analysis
cov_df = full_df[['id', 'age', 'sex']].copy()

full_df.head()

[4]:

	id	ANGPTL3	HMGCR	cvd_event	follow_up_time	age	sex
1	ID_000001	0.0	0.0	0	2.99	65.5	0
4	ID_000004	0.0	0.0	0	3.13	60.9	0
6	ID_000006	0.0	0.0	1	14.95	63.0	0
8	ID_000008	0.0	1.0	0	15.54	54.7	0
10	ID_000010	1.0	0.0	0	2.99	55.5	1

Running Cox Proportional Hazards Analysis

For survival analysis, we need to specify:

The event column in the test_dict under the ‘survival’ category
A survival_time_map that maps each event column to its corresponding time column

We’ll test whether PCSK9 variant carriers have different cardiovascular disease risk.

[5]:

# Initialize the tester
tester = class_test.DifferencesTester(test_module=test_mod.AllTests(), min_group_size=1)

# Define survival test - specify the event column
test_dict = {
    'survival': {'cvd_event': 'Cox-PH'},
}

# Map event column to time column
survival_time_map = {
    'cvd_event': 'follow_up_time'
}

# Define covariate models
models = {
    'Unadjusted': 'None',
    'Adjusted': 'age;sex'
}

# Run the analysis for PCSK9 (exposure must be a single string)
base_table, coef_table = tester.run(
    data=full_df,
    exposure='PCSK9',
    models=models,
    test_dict=test_dict,
    cov=cov_df,
    id_column='id',
    survival_time_map=survival_time_map
)

coef_table

[5]:

	Model	Model name	Stratification name	Variable	Exposure	Cases	Samples	Exposed	Non-exposed	Point Estimates	Standard Error	Test statistic	P-values	Formatted	Formatted (exp)
0	Cox-PH	Unadjusted	<NA>	cvd_event	PCSK9	133	425	28	397	0.155602	0.329591	<NA>	0.636851	0.16 (-0.49; 0.80)	1.17 (0.61; 2.23)
0	Cox-PH	Adjusted	<NA>	cvd_event	PCSK9	133	425	28	397	0.031146	0.339110	<NA>	0.926821	0.03 (-0.63; 0.70)	1.03 (0.53; 2.01)
1	Cox-PH	Adjusted	<NA>	cvd_event	age	133	425	28	397	0.017668	0.008762	<NA>	0.043766	0.02 (0.00; 0.03)	1.02 (1.00; 1.04)
2	Cox-PH	Adjusted	<NA>	cvd_event	sex	133	425	28	397	0.097287	0.177686	<NA>	0.584020	0.10 (-0.25; 0.45)	1.10 (0.78; 1.56)

Interpreting Results

The output table contains:

Point Estimates: Log hazard ratio (coefficient)
Formatted (exp): Hazard ratio (HR) with 95% confidence interval
P-values: Statistical significance

Understanding Hazard Ratios:

HR = 1.0: No effect (carriers and non-carriers have the same risk)
HR > 1.0: Increased hazard (carriers have higher risk)
- Example: HR = 1.5 means 50% increased risk for carriers
HR < 1.0: Decreased hazard (protective effect)
- Example: HR = 0.7 means 30% reduced risk for carriers

Statistical significance:

Check if the 95% CI includes 1.0
If CI includes 1.0, the association is not statistically significant
P-value < 0.05 indicates statistical significance at the conventional threshold

Adjustment for covariates:

The “Adjusted” model controls for age and sex, providing the hazard ratio for genetic exposure independent of these factors.

Testing Multiple Genes

You can test multiple genetic exposures simultaneously:

[6]:

# Get all available genes
all_genes = list(gen_df.columns)

# Run analysis for each gene with age/sex adjustment
# DifferencesTester.run() takes a single exposure at a time
coef_tables = []
for gene in all_genes:
    _, coef = tester.run(
        data=full_df,
        exposure=gene,
        models={'Adjusted': 'age;sex'},
        test_dict=test_dict,
        cov=cov_df,
        id_column='id',
        survival_time_map=survival_time_map
    )
    coef_tables.append(coef)

coef_table_all = pd.concat(coef_tables, ignore_index=True)

# Show only gene-level results (not covariate coefficients)
coef_table_all[coef_table_all['Exposure'].isin(all_genes)]

[6]:

	Model	Model name	Stratification name	Variable	Exposure	Cases	Samples	Exposed	Non-exposed	Point Estimates	Standard Error	Test statistic	P-values	Formatted	Formatted (exp)
0	Cox-PH	Adjusted	NaN	cvd_event	PCSK9	133	425	28	397	0.031146	0.339110	NaN	0.926821	0.03 (-0.63; 0.70)	1.03 (0.53; 2.01)
3	Cox-PH	Adjusted	NaN	cvd_event	ANGPTL3	133	425	15	410	-0.696982	0.592864	NaN	0.239747	-0.70 (-1.86; 0.47)	0.50 (0.16; 1.59)
6	Cox-PH	Adjusted	NaN	cvd_event	HMGCR	133	425	27	398	-0.107793	0.349959	NaN	0.758071	-0.11 (-0.79; 0.58)	0.90 (0.45; 1.78)
9	Cox-PH	Adjusted	NaN	cvd_event	NPC1L1	133	425	41	384	0.149359	0.316789	NaN	0.637300	0.15 (-0.47; 0.77)	1.16 (0.62; 2.16)
12	Cox-PH	Adjusted	NaN	cvd_event	CETP	133	425	17	408	-1.277074	0.725259	NaN	0.078263	-1.28 (-2.70; 0.14)	0.28 (0.07; 1.16)
15	Cox-PH	Adjusted	NaN	cvd_event	PPARA	133	425	23	402	0.364442	0.346378	NaN	0.292730	0.36 (-0.31; 1.04)	1.44 (0.73; 2.84)

Using Configuration Files

For production analyses, use configuration files instead of notebooks. Here’s a complete example:

# survival_analysis.cnf

[ExpInput]
carriers    /path/to/extracted_carriers.tsv.gz

[PhenoInput]
phenotypes  /path/to/phenotypes.tsv
covariates  /path/to/covariates.tsv

[SurvTests]
death;follow_up_years   Cox-PH
cvd_event;time_to_cvd   Cox-PH
cancer;age_at_diagnosis Cox-PH

[Covs]
Unadjusted  None
Age_Sex age;sex
Full    age;sex;bmi;smoking;PC1;PC2;PC3;PC4

[Options]
extract_variants    False
association_analysis    True
id_column   IID
output_path /results/survival

In the configuration file, the [SurvTests] section uses the format:

Format: event_column;time_column<tab>test_name
Note the semicolon (;) separator between event and time columns
The configuration parser automatically creates the survival_time_map from this syntax

Then run:

python marvel/main.py survival_analysis.cnf -v

Results will be saved to compressed TSV files with hazard ratios and confidence intervals.