Survival Analysis (Time-to-Event) with Cox Proportional Hazards

This notebook demonstrates how to perform survival analysis (time-to-event analysis) using Cox proportional hazards regression in MARVELous. Survival analysis is used when the outcome of interest is the time until an event occurs, such as:

  • Time to disease onset

  • Time to death (mortality)

  • Time to disease progression

  • Time to treatment response

Cox Proportional Hazards Model

The Cox proportional hazards model is the most commonly used method for analysing survival data. It estimates hazard ratios (HR) that quantify how genetic variants affect the instantaneous risk of an event occurring.

Key features:

  • Handles censored data (individuals who haven’t experienced the event yet)

  • Adjusts for covariates (age, sex, etc.)

  • Provides interpretable hazard ratios

  • Does not require specifying a baseline hazard function

[1]:
# imports
import os
import numpy as np
import pandas as pd
from marvel.example_data import examples
from marvel.association import class_test
from marvel.utils.utils import load_custom
from marvel.extraction.aggregation import extract_genes

test_mod = load_custom()
/Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/utils/utils.py:455: Warning: The environmental variable `MARVEL_TEST_DEFS` is not set. Trying to recover using the default configuration files in /Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/association/tests.py.
  warnings.warn(

Understanding Survival Data

Survival data requires two pieces of information for each individual:

  1. Event indicator: A binary variable (0/1) indicating whether the event occurred

    • 0 = censored (event has not occurred by the end of follow-up)

    • 1 = event occurred

  2. Time-to-event: The time from baseline to either:

    • The event occurrence (for those with event = 1)

    • The end of follow-up (for those with event = 0, censored)

Censoring is a key concept: individuals who haven’t experienced the event by the end of the study are “censored.” The Cox model correctly accounts for this partial information rather than discarding these individuals.

[2]:
# Extract genetic information
summary, gen_df = extract_genes(
    input_file=examples.get_path('plink_example', 'example_genotypes.bed'),
    genes=examples.get_data('dummy_variants'),
    cat_column='gene',
    summarise=True,
    neg_geno=None,
    sum_geno=1,
    incl_var=False,
)

gen_df
[2]:
PCSK9 ANGPTL3 HMGCR NPC1L1 CETP PPARA
ID_000000 0.0 NaN 0.0 0.0 0.0 0.0
ID_000001 0.0 0.0 0.0 0.0 0.0 0.0
ID_000002 NaN 0.0 0.0 0.0 0.0 0.0
ID_000003 0.0 0.0 0.0 0.0 NaN NaN
ID_000004 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ...
ID_000995 0.0 1.0 0.0 0.0 0.0 0.0
ID_000996 0.0 0.0 NaN 0.0 0.0 0.0
ID_000997 0.0 0.0 0.0 0.0 0.0 0.0
ID_000998 0.0 NaN 0.0 1.0 NaN 0.0
ID_000999 0.0 NaN 0.0 0.0 NaN 0.0

1000 rows × 6 columns

[3]:
# Create synthetic survival data for demonstration
rng = np.random.default_rng(42)
n_samples = len(gen_df)

# Simulate cardiovascular disease (CVD) survival data
survival_data = pd.DataFrame({
    'cvd_event': rng.binomial(1, 0.3, n_samples),  # ~30% event rate
    'follow_up_time': rng.exponential(5.0, n_samples) + 0.5,  # years of follow-up
}, index=gen_df.index)

# Round follow-up time for readability
survival_data['follow_up_time'] = survival_data['follow_up_time'].round(2)

survival_data.head(10)
[3]:
cvd_event follow_up_time
ID_000000 1 0.74
ID_000001 0 2.99
ID_000002 1 1.16
ID_000003 0 1.58
ID_000004 0 3.13
ID_000005 1 2.70
ID_000006 1 14.95
ID_000007 1 2.93
ID_000008 0 15.54
ID_000009 0 4.87

In this synthetic data:

  • cvd_event: 1 = CVD occurred, 0 = no CVD by end of follow-up (censored)

  • follow_up_time: Time in years from baseline to event or censoring

Preparing Data for Analysis

For survival analysis, we need:

  1. Genetic exposures (carrier status for variants/genes)

  2. Event indicator (0/1)

  3. Time-to-event (continuous, positive)

  4. Covariates (optional, for adjusted analyses)

[4]:
# Create covariate data
cov_data = pd.DataFrame({
    'age': rng.normal(60, 10, n_samples).round(1),
    'sex': rng.binomial(1, 0.5, n_samples),
}, index=gen_df.index)

# Merge all data sources by index
full_df = gen_df.join(survival_data).join(cov_data)

# Reset index to create 'id' column
full_df = full_df.reset_index()
full_df.rename(columns={full_df.columns[0]: 'id'}, inplace=True)

# Remove any missingness (listwise deletion)
full_df.dropna(inplace=True)

# Separate covariates for the analysis
cov_df = full_df[['id', 'age', 'sex']].copy()

full_df.head()
[4]:
id PCSK9 ANGPTL3 HMGCR NPC1L1 CETP PPARA cvd_event follow_up_time age sex
1 ID_000001 0.0 0.0 0.0 0.0 0.0 0.0 0 2.99 65.5 0
4 ID_000004 0.0 0.0 0.0 0.0 0.0 0.0 0 3.13 60.9 0
6 ID_000006 0.0 0.0 0.0 0.0 0.0 0.0 1 14.95 63.0 0
8 ID_000008 0.0 0.0 1.0 0.0 0.0 0.0 0 15.54 54.7 0
10 ID_000010 0.0 1.0 0.0 0.0 0.0 0.0 0 2.99 55.5 1

Running Cox Proportional Hazards Analysis

For survival analysis, we need to specify:

  1. The event column in the test_dict under the ‘survival’ category

  2. A survival_time_map that maps each event column to its corresponding time column

We’ll test whether PCSK9 variant carriers have different cardiovascular disease risk.

[5]:
# Initialize the tester
tester = class_test.DifferencesTester(test_module=test_mod.AllTests(), min_group_size=1)

# Define survival test - specify the event column
test_dict = {
    'survival': {'cvd_event': 'Cox-PH'},
}

# Map event column to time column
survival_time_map = {
    'cvd_event': 'follow_up_time'
}

# Define covariate models
models = {
    'Unadjusted': 'None',
    'Adjusted': 'age;sex'
}

# Run the analysis for PCSK9 (exposure must be a single string)
base_table, coef_table = tester.run(
    data=full_df,
    exposure='PCSK9',
    models=models,
    test_dict=test_dict,
    cov=cov_df,
    id_column='id',
    survival_time_map=survival_time_map
)

coef_table
[5]:
Model Model name Stratification name Variable Exposure Cases Samples Exposed Non-exposed Point Estimates Standard Error Test statistic P-values Formatted Formatted (exp)
0 Cox-PH Unadjusted <NA> cvd_event PCSK9 133 425 28 397 0.155602 0.329591 <NA> 0.636851 0.16 (-0.49; 0.80) 1.17 (0.61; 2.23)
0 Cox-PH Adjusted <NA> cvd_event PCSK9 133 425 28 397 0.031146 0.339110 <NA> 0.926821 0.03 (-0.63; 0.70) 1.03 (0.53; 2.01)
1 Cox-PH Adjusted <NA> cvd_event age 133 425 28 397 0.017668 0.008762 <NA> 0.043766 0.02 (0.00; 0.03) 1.02 (1.00; 1.04)
2 Cox-PH Adjusted <NA> cvd_event sex 133 425 28 397 0.097287 0.177686 <NA> 0.584020 0.10 (-0.25; 0.45) 1.10 (0.78; 1.56)

Interpreting Results

The output table contains:

  • Point Estimates: Log hazard ratio (coefficient)

  • Formatted (exp): Hazard ratio (HR) with 95% confidence interval

  • P-values: Statistical significance

Understanding Hazard Ratios:

  • HR = 1.0: No effect (carriers and non-carriers have the same risk)

  • HR > 1.0: Increased hazard (carriers have higher risk)

    • Example: HR = 1.5 means 50% increased risk for carriers

  • HR < 1.0: Decreased hazard (protective effect)

    • Example: HR = 0.7 means 30% reduced risk for carriers

Statistical significance:

  • Check if the 95% CI includes 1.0

  • If CI includes 1.0, the association is not statistically significant

  • P-value < 0.05 indicates statistical significance at the conventional threshold

Adjustment for covariates:

The “Adjusted” model controls for age and sex, providing the hazard ratio for genetic exposure independent of these factors.

Testing Multiple Genes

You can test multiple genetic exposures simultaneously:

[6]:
# Get all available genes
all_genes = list(gen_df.columns)

# Run analysis for each gene with age/sex adjustment
# DifferencesTester.run() takes a single exposure at a time
coef_tables = []
for gene in all_genes:
    _, coef = tester.run(
        data=full_df,
        exposure=gene,
        models={'Adjusted': 'age;sex'},
        test_dict=test_dict,
        cov=cov_df,
        id_column='id',
        survival_time_map=survival_time_map
    )
    coef_tables.append(coef)

coef_table_all = pd.concat(coef_tables, ignore_index=True)

# Show only gene-level results (not covariate coefficients)
coef_table_all[coef_table_all['Exposure'].isin(all_genes)]
[6]:
Model Model name Stratification name Variable Exposure Cases Samples Exposed Non-exposed Point Estimates Standard Error Test statistic P-values Formatted Formatted (exp)
0 Cox-PH Adjusted NaN cvd_event PCSK9 133 425 28 397 0.031146 0.339110 NaN 0.926821 0.03 (-0.63; 0.70) 1.03 (0.53; 2.01)
3 Cox-PH Adjusted NaN cvd_event ANGPTL3 133 425 15 410 -0.696982 0.592864 NaN 0.239747 -0.70 (-1.86; 0.47) 0.50 (0.16; 1.59)
6 Cox-PH Adjusted NaN cvd_event HMGCR 133 425 27 398 -0.107793 0.349959 NaN 0.758071 -0.11 (-0.79; 0.58) 0.90 (0.45; 1.78)
9 Cox-PH Adjusted NaN cvd_event NPC1L1 133 425 41 384 0.149359 0.316789 NaN 0.637300 0.15 (-0.47; 0.77) 1.16 (0.62; 2.16)
12 Cox-PH Adjusted NaN cvd_event CETP 133 425 17 408 -1.277074 0.725259 NaN 0.078263 -1.28 (-2.70; 0.14) 0.28 (0.07; 1.16)
15 Cox-PH Adjusted NaN cvd_event PPARA 133 425 23 402 0.364442 0.346378 NaN 0.292730 0.36 (-0.31; 1.04) 1.44 (0.73; 2.84)

Using Configuration Files

For production analyses, use configuration files instead of notebooks. Here’s a complete example:

# survival_analysis.cnf

[ExpInput]
carriers    /path/to/extracted_carriers.tsv.gz

[PhenoInput]
phenotypes  /path/to/phenotypes.tsv
covariates  /path/to/covariates.tsv

[SurvTests]
death;follow_up_years   Cox-PH
cvd_event;time_to_cvd   Cox-PH
cancer;age_at_diagnosis Cox-PH

[Covs]
Unadjusted  None
Age_Sex age;sex
Full    age;sex;bmi;smoking;PC1;PC2;PC3;PC4

[Options]
extract_variants    False
association_analysis    True
id_column   IID
output_path /results/survival

In the configuration file, the [SurvTests] section uses the format:

  • Format: event_column;time_column<tab>test_name

  • Note the semicolon (;) separator between event and time columns

  • The configuration parser automatically creates the survival_time_map from this syntax

Then run:

python marvel/main.py survival_analysis.cnf -v

Results will be saved to compressed TSV files with hazard ratios and confidence intervals.