Association Testing

This notebook demonstrates the DifferencesTester API for associating genetic variant carriership with phenotypic outcomes. It covers:

  • Continuous, categorical, and binary outcome tests

  • Multi-model covariate adjustment

  • Listwise vs pairwise missingness handling

  • Survival (time-to-event) analysis with Cox proportional hazards regression

[1]:
# imports
import os
import numpy as np
from marvel.example_data import examples
from marvel.association import class_test
from marvel.utils.utils import load_custom
from marvel.utils.missingness import MissingnessHandler
from marvel.extraction.aggregation import extract_genes
test_mod = load_custom()
/Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/utils/utils.py:455: Warning: The environmental variable `MARVEL_TEST_DEFS` is not set. Trying to recover using the default configuration files in /Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/association/tests.py.
  warnings.warn(

Load the data

First, we’ll load the example data by extracting gene-level carrier counts from the PLINK example files, similar to the process shown in the aggregation notebook. We want to determine whether carriers differ from non-carriers for a few outcomes, which are loaded as well. In addition, we will correct for age and sex in the association analyses.

[ ]:
# Extract genetic information
summary, gen_df = extract_genes(
    input_file = examples.get_path('plink_example', 'example_genotypes.bed'),
    genes = examples.get_data('dummy_variants'),
    cat_column = 'gene',
    summarise = True,
    neg_geno = None,
    sum_geno = 1,
    incl_var = False,
)

# Show example
summary
[3]:
# Load outcomes
pheno = examples.get_data('outcome_data')
pheno
[3]:
id hypertension diabetes cvd sbp ldl_cholesterol crp_level disease_severity treatment_response risk_category follow_up_time
0 ID_000000 0.0 0.0 0.0 135.899046 4.628664 11.963852 medium severe B 12.02
1 ID_000001 1.0 0.0 0.0 129.255089 4.825001 33.911247 high control B 11.68
2 ID_000002 1.0 1.0 1.0 NaN 5.671148 29.237849 high control B 11.92
3 ID_000003 0.0 NaN 0.0 99.510664 5.284872 14.347175 high severe C 1.40
4 ID_000004 0.0 0.0 0.0 138.179160 4.546807 40.532594 high control A 0.43
... ... ... ... ... ... ... ... ... ... ... ...
995 ID_000995 0.0 0.0 1.0 144.427028 5.230045 76.510642 medium control B 0.77
996 ID_000996 1.0 1.0 0.0 115.734230 5.147892 28.864107 low control A 7.98
997 ID_000997 0.0 0.0 0.0 NaN 3.459606 11.016325 low severe A 12.29
998 ID_000998 1.0 0.0 0.0 130.183969 4.165561 12.189264 high control B 7.17
999 ID_000999 0.0 0.0 1.0 117.945068 4.583163 33.683725 low NaN D 1.93

1000 rows × 11 columns

[4]:
# Load covariates
cov_df = examples.get_data('cov_data')
cov_df
[4]:
id age sex bmi smoking_status alcohol_consumption PC1 PC2 PC3 PC4 batch
0 ID_000000 59.720649 0.0 21.422981 1.0 0.0 -0.792276 0.763319 -2.848527 0.187020 NaN
1 ID_000001 67.088467 0.0 23.285183 1.0 2.0 -0.412052 0.068425 -0.075304 0.586625 batch1
2 ID_000002 54.541104 0.0 29.997524 0.0 1.0 1.960718 -0.946065 0.990386 0.445479 batch1
3 ID_000003 58.368992 1.0 25.503487 0.0 2.0 2.672157 0.904750 0.105501 -0.049818 batch1
4 ID_000004 75.751361 0.0 21.167230 0.0 1.0 1.562669 0.210285 NaN -0.424643 batch1
... ... ... ... ... ... ... ... ... ... ... ...
995 ID_000995 37.323449 1.0 29.385406 1.0 2.0 -0.916293 -2.119301 0.574164 1.812890 NaN
996 ID_000996 48.741477 0.0 20.147595 0.0 1.0 1.039441 0.662346 -1.723685 -0.396235 batch1
997 ID_000997 75.561878 1.0 26.974480 1.0 2.0 0.290605 2.128468 0.855002 -0.091420 batch1
998 ID_000998 56.062975 0.0 25.329474 2.0 0.0 1.982513 -2.600536 -0.741593 0.990803 batch2
999 ID_000999 83.959874 0.0 30.452003 2.0 0.0 2.392298 -1.235706 -0.374105 -0.400736 batch2

1000 rows × 11 columns

Handle missingness and prepare data

While missing data can be handled during the association analyses, it is recommended to address it beforehand where possible. Categorical string outcomes must also be encoded as integers before being passed to DifferencesTester.

The cell below merges the genetic, phenotype, and covariate data, encodes the categorical outcomes, and splits off the covariate dataframe used in subsequent tests.

[5]:
# merge the data
full_df = gen_df.merge(pheno, how='inner', left_index=True, right_on='id')
full_df = full_df.merge(cov_df, how='inner', on='id')

# Preserve raw merged dataframe BEFORE any dropna — used for missingness demo below
raw_merged = full_df.copy()

# remove missingness
full_df.dropna(inplace=True)

# Refactor categorical outcomes
full_df['disease_severity'] = full_df['disease_severity'].map({'low': 0, 'medium': 1, 'high': 2})
full_df['treatment_response'] = full_df['treatment_response'].map({'control': 0, 'mild': 1, 'severe': 2})

# Separate covariates
cov_df = full_df[['id', 'age', 'sex']].copy()

# show subset of data
variables = ['id'] + list(gen_df.columns) + ['sbp', 'ldl_cholesterol', 'disease_severity', 'treatment_response', 'diabetes', 'cvd']
full_df[variables]
[5]:
id PCSK9 ANGPTL3 HMGCR NPC1L1 CETP PPARA sbp ldl_cholesterol disease_severity treatment_response diabetes cvd
1 ID_000001 0.0 0.0 0.0 0.0 0.0 0.0 129.255089 4.825001 2 0 0.0 0.0
6 ID_000006 0.0 0.0 0.0 0.0 0.0 0.0 114.460843 5.286435 2 0 0.0 0.0
10 ID_000010 0.0 1.0 0.0 0.0 0.0 0.0 104.559427 3.931334 0 1 0.0 0.0
18 ID_000018 0.0 0.0 0.0 0.0 0.0 0.0 103.252133 5.640530 0 2 1.0 0.0
19 ID_000019 1.0 0.0 0.0 0.0 0.0 0.0 138.537174 5.344172 0 0 0.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
974 ID_000974 0.0 0.0 0.0 0.0 0.0 0.0 136.558160 4.361229 2 0 0.0 0.0
979 ID_000979 0.0 0.0 0.0 0.0 0.0 0.0 132.521270 4.758025 1 2 0.0 1.0
987 ID_000987 0.0 0.0 0.0 0.0 0.0 0.0 117.765460 5.753577 1 0 0.0 0.0
988 ID_000988 0.0 0.0 1.0 0.0 0.0 0.0 127.541766 2.930257 0 0 0.0 1.0
990 ID_000990 0.0 0.0 0.0 1.0 0.0 0.0 130.591184 6.235300 0 2 0.0 0.0

209 rows × 13 columns

[6]:
tester = class_test.DifferencesTester(test_module=test_mod.AllTests(), min_group_size = 1)

test_dict = {
    'continuous': {'sbp': 'OLS;T', 'ldl_cholesterol' : 'OLS;T'},
    'binary': {'diabetes': 'GLM-Binom', 'cvd' : 'GLM-Binom;FISHER'},
    'categorical' : {'disease_severity' : 'CHISQ', 'treatment_response' : 'CHISQ'},
}

models = {
    'Unadjusted': 'None',
    'Adjusted': 'age;sex'
}

base_table, coef_table = tester.run(
    data=full_df,
    exposure=['PCSK9'],
    models=models,
    test_dict=test_dict,
    cov=cov_df,
    id_column = 'id'
)
base_table.drop_duplicates()
[6]:
0.0 1.0 Difference
N (%) or Mean (SD) Median (Q1; Q3) Missing (%) N (%) or Mean (SD) Median (Q1; Q3) Missing (%) OLS OLS_Stat T T_Stat GLM-Binom GLM-Binom_Stat FISHER FISHER_Stat CHISQ CHISQ_Stat
Variable
Total sample size_Unadjusted 190 NaN NaN 19 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
sbp_Unadjusted 119.35 (15.17) 120.46 (110.23; 129.89) 0 (0.00) 120.98 (16.94) 122.12 (110.83; 127.83) 0 (0.00) 0.661163 5.832272e-183 0.661163 -0.438940 NaN NaN NaN NaN NaN NaN
ldl_cholesterol_Unadjusted 5.11 (1.11) 5.18 (4.42; 5.75) 0 (0.00) 5.21 (0.79) 5.21 (4.76; 5.88) 0 (0.00) 0.711937 2.623666e-139 0.711937 -0.369763 NaN NaN NaN NaN NaN NaN
diabetes_Unadjusted 29 (15.26) NaN 0 (0.00) 5 (26.32) NaN 0 (0.00) NaN NaN NaN NaN 0.220503 1.942967e-17 NaN NaN NaN NaN
cvd_Unadjusted 83 (43.68) NaN 0 (0.00) 10 (52.63) NaN 0 (0.00) NaN NaN NaN NaN 0.456123 8.248046e-02 0.477321 1.432396 NaN NaN
disease_severity: 0_Unadjusted 92 (48.42) NaN 0 (0.00) 11 (57.89) NaN 0 (0.00) NaN NaN NaN NaN NaN NaN NaN NaN 0.272478 2.600396
disease_severity: 1_Unadjusted 59 (31.05) NaN 0 (0.00) 7 (36.84) NaN 0 (0.00) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
disease_severity: 2_Unadjusted 39 (20.53) NaN 0 (0.00) 1 (5.26) NaN 0 (0.00) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
treatment_response: 0_Unadjusted 122 (64.21) NaN 0 (0.00) 12 (63.16) NaN 0 (0.00) NaN NaN NaN NaN NaN NaN NaN NaN 0.636175 0.904563
treatment_response: 1_Unadjusted 36 (18.95) NaN 0 (0.00) 5 (26.32) NaN 0 (0.00) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
treatment_response: 2_Unadjusted 32 (16.84) NaN 0 (0.00) 2 (10.53) NaN 0 (0.00) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
sbp_Adjusted 119.35 (15.17) 120.46 (110.23; 129.89) 0 (0.00) 120.98 (16.94) 122.12 (110.83; 127.83) 0 (0.00) 0.641524 6.937978e-70 0.661163 -0.438940 NaN NaN NaN NaN NaN NaN
ldl_cholesterol_Adjusted 5.11 (1.11) 5.18 (4.42; 5.75) 0 (0.00) 5.21 (0.79) 5.21 (4.76; 5.88) 0 (0.00) 0.752053 9.567225e-38 0.711937 -0.369763 NaN NaN NaN NaN NaN NaN
diabetes_Adjusted 29 (15.26) NaN 0 (0.00) 5 (26.32) NaN 0 (0.00) NaN NaN NaN NaN 0.226903 3.512443e-03 NaN NaN NaN NaN
cvd_Adjusted 83 (43.68) NaN 0 (0.00) 10 (52.63) NaN 0 (0.00) NaN NaN NaN NaN 0.457350 6.220539e-01 0.477321 1.432396 NaN NaN
[7]:
coef_table
[7]:
Model Model name Stratification name Variable Exposure Cases Samples Exposed Non-exposed Point Estimates Standard Error Test statistic P-values Formatted Formatted (exp)
0 OLS Unadjusted <NA> sbp Intercept 0.0 209.0 19.0 190.0 119.351553 1.118009 NaN 5.832272e-183 119.35 (117.16; 121.54) 6819002940268249165055820291392844669167365204...
1 OLS Unadjusted <NA> sbp PCSK9 0.0 209.0 19.0 190.0 1.627597 3.708015 NaN 6.611626e-01 1.63 (-5.64; 8.90) 5.09 (0.00; 7296.66)
0 T Unadjusted <NA> sbp PCSK9 0.0 209.0 19.0 190.0 NaN NaN -0.438940 6.611626e-01 NaN NaN
0 OLS Unadjusted <NA> ldl_cholesterol Intercept 0.0 209.0 19.0 190.0 5.109830 0.078967 NaN 2.623666e-139 5.11 (4.96; 5.26) 165.64 (141.89; 193.37)
1 OLS Unadjusted <NA> ldl_cholesterol PCSK9 0.0 209.0 19.0 190.0 0.096843 0.261904 NaN 7.119368e-01 0.10 (-0.42; 0.61) 1.10 (0.66; 1.84)
0 T Unadjusted <NA> ldl_cholesterol PCSK9 0.0 209.0 19.0 190.0 NaN NaN -0.369763 7.119368e-01 NaN NaN
0 GLM-Binom Unadjusted <NA> diabetes Intercept 34.0 209.0 19.0 190.0 -1.714109 0.201727 NaN 1.942967e-17 -1.71 (-2.11; -1.32) 0.18 (0.12; 0.27)
1 GLM-Binom Unadjusted <NA> diabetes PCSK9 34.0 209.0 19.0 190.0 0.684489 0.558679 NaN 2.205027e-01 0.68 (-0.41; 1.78) 1.98 (0.66; 5.93)
0 GLM-Binom Unadjusted <NA> cvd Intercept 93.0 209.0 19.0 190.0 -0.253988 0.146267 NaN 8.248046e-02 -0.25 (-0.54; 0.03) 0.78 (0.58; 1.03)
1 GLM-Binom Unadjusted <NA> cvd PCSK9 93.0 209.0 19.0 190.0 0.359349 0.482188 NaN 4.561228e-01 0.36 (-0.59; 1.30) 1.43 (0.56; 3.69)
0 FISHER Unadjusted <NA> cvd PCSK9 93.0 209.0 19.0 190.0 NaN NaN 1.432396 4.773212e-01 NaN NaN
0 CHISQ Unadjusted <NA> disease_severity PCSK9 NaN NaN NaN NaN NaN NaN 2.600396 2.724778e-01 NaN NaN
0 CHISQ Unadjusted <NA> treatment_response PCSK9 NaN NaN NaN NaN NaN NaN 0.904563 6.361750e-01 NaN NaN
0 OLS Adjusted <NA> sbp Intercept 0.0 209.0 19.0 190.0 119.196798 4.386169 NaN 6.937978e-70 119.20 (110.60; 127.79) 5841327625673460048707196120176087049213956698...
1 OLS Adjusted <NA> sbp PCSK9 0.0 209.0 19.0 190.0 1.731601 3.713803 NaN 6.415239e-01 1.73 (-5.55; 9.01) 5.65 (0.00; 8188.79)
2 OLS Adjusted <NA> sbp age 0.0 209.0 19.0 190.0 -0.027879 0.072927 NaN 7.026475e-01 -0.03 (-0.17; 0.12) 0.97 (0.84; 1.12)
3 OLS Adjusted <NA> sbp sex 0.0 209.0 19.0 190.0 2.883533 2.166641 NaN 1.847077e-01 2.88 (-1.36; 7.13) 17.88 (0.26; 1248.96)
0 T Adjusted <NA> sbp PCSK9 0.0 209.0 19.0 190.0 NaN NaN -0.438940 6.611626e-01 NaN NaN
0 OLS Adjusted <NA> ldl_cholesterol Intercept 0.0 209.0 19.0 190.0 4.940407 0.309923 NaN 9.567225e-38 4.94 (4.33; 5.55) 139.83 (76.17; 256.68)
1 OLS Adjusted <NA> ldl_cholesterol PCSK9 0.0 209.0 19.0 190.0 0.083016 0.262414 NaN 7.520534e-01 0.08 (-0.43; 0.60) 1.09 (0.65; 1.82)
2 OLS Adjusted <NA> ldl_cholesterol age 0.0 209.0 19.0 190.0 0.004640 0.005153 NaN 3.688913e-01 0.00 (-0.01; 0.01) 1.00 (0.99; 1.01)
3 OLS Adjusted <NA> ldl_cholesterol sex 0.0 209.0 19.0 190.0 -0.148852 0.153093 NaN 3.320483e-01 -0.15 (-0.45; 0.15) 0.86 (0.64; 1.16)
0 T Adjusted <NA> ldl_cholesterol PCSK9 0.0 209.0 19.0 190.0 NaN NaN -0.369763 7.119368e-01 NaN NaN
0 GLM-Binom Adjusted <NA> diabetes Intercept 34.0 209.0 19.0 190.0 -2.390569 0.818990 NaN 3.512443e-03 -2.39 (-4.00; -0.79) 0.09 (0.02; 0.46)
1 GLM-Binom Adjusted <NA> diabetes PCSK9 34.0 209.0 19.0 190.0 0.679312 0.562170 NaN 2.269031e-01 0.68 (-0.42; 1.78) 1.97 (0.66; 5.94)
2 GLM-Binom Adjusted <NA> diabetes age 34.0 209.0 19.0 190.0 0.006939 0.013147 NaN 5.976202e-01 0.01 (-0.02; 0.03) 1.01 (0.98; 1.03)
3 GLM-Binom Adjusted <NA> diabetes sex 34.0 209.0 19.0 190.0 0.459180 0.399071 NaN 2.498874e-01 0.46 (-0.32; 1.24) 1.58 (0.72; 3.46)
0 GLM-Binom Adjusted <NA> cvd Intercept 93.0 209.0 19.0 190.0 -0.282808 0.573716 NaN 6.220539e-01 -0.28 (-1.41; 0.84) 0.75 (0.24; 2.32)
1 GLM-Binom Adjusted <NA> cvd PCSK9 93.0 209.0 19.0 190.0 0.358869 0.482859 NaN 4.573501e-01 0.36 (-0.59; 1.31) 1.43 (0.56; 3.69)
2 GLM-Binom Adjusted <NA> cvd age 93.0 209.0 19.0 190.0 0.000277 0.009536 NaN 9.768179e-01 0.00 (-0.02; 0.02) 1.00 (0.98; 1.02)
3 GLM-Binom Adjusted <NA> cvd sex 93.0 209.0 19.0 190.0 0.022824 0.283299 NaN 9.357882e-01 0.02 (-0.53; 0.58) 1.02 (0.59; 1.78)
0 FISHER Adjusted <NA> cvd PCSK9 93.0 209.0 19.0 190.0 NaN NaN 1.432396 4.773212e-01 NaN NaN
0 CHISQ Adjusted <NA> disease_severity PCSK9 NaN NaN NaN NaN NaN NaN 2.600396 2.724778e-01 NaN NaN
0 CHISQ Adjusted <NA> treatment_response PCSK9 NaN NaN NaN NaN NaN NaN 0.904563 6.361750e-01 NaN NaN

Multi-model covariate comparison

The coef_table above contains results for both the Unadjusted (no covariates) and Adjusted (age + sex) models. Sorting by exposure, outcome, and model places the two rows for each exposure–outcome pair adjacent to each other, making it easy to see whether adjusting for covariates changes the association.

Columns of interest:

  • Model nameUnadjusted or Adjusted

  • coef / Point Estimates — effect size

  • P-values — statistical significance

A smaller p-value after adjustment may indicate that the raw association was confounded by age or sex. A larger p-value after adjustment is common when the covariate explains some of the outcome variance.

[8]:
# Exposure name used in this analysis
exposure_col_name = 'PCSK9'

# Filter to the exposure rows (exclude intercept and covariate rows),
# then sort by outcome (Variable) and model so that Unadjusted and Adjusted
# rows for the same outcome appear adjacent.
(
    coef_table
    .rename(columns={'Model name': 'model', 'Variable': 'outcome', 'Point Estimates': 'coef', 'P-values': 'pvalue'})
    .query("Exposure == @exposure_col_name")
    .sort_values(['outcome', 'model'])
    [['model', 'outcome', 'coef', 'pvalue']]
    .head(12)
)
[8]:
model outcome coef pvalue
1 Adjusted cvd 0.358869 0.457350
0 Adjusted cvd NaN 0.477321
1 Unadjusted cvd 0.359349 0.456123
0 Unadjusted cvd NaN 0.477321
1 Adjusted diabetes 0.679312 0.226903
1 Unadjusted diabetes 0.684489 0.220503
0 Adjusted disease_severity NaN 0.272478
0 Unadjusted disease_severity NaN 0.272478
1 Adjusted ldl_cholesterol 0.083016 0.752053
0 Adjusted ldl_cholesterol NaN 0.711937
1 Unadjusted ldl_cholesterol 0.096843 0.711937
0 Unadjusted ldl_cholesterol NaN 0.711937

Pairwise vs listwise missingness

MARVEL’s MissingnessHandler supports two strategies for dealing with missing data:

  • Listwise deletion (conservative): drops every sample that has a missing value in any of the specified columns. The same reduced sample set is used for all downstream tests — consistent but potentially wasteful.

  • Pairwise deletion (maximises N): drops only the samples that are missing the specific columns required for each individual test (exposure + outcome + covariates). Different tests may therefore use different sample counts.

The demo below operates on raw_merged — the merged dataframe before any dropna call — so that the comparison is meaningful. If the data happen to have identical missingness patterns across all variables (making listwise == pairwise for every outcome), synthetic missingness is introduced for illustration.

[9]:
# Variables used for the missingness demo
exposure_miss = 'PCSK9'
outcomes_miss = ['sbp', 'hypertension']
covariates_miss = ['age', 'sex']

all_cols = [exposure_miss] + outcomes_miss + covariates_miss

# Check listwise vs pairwise counts on the raw merged data
n_listwise_raw = len(
    MissingnessHandler(listwise=True, verbose=False).handle_listwise(
        data=raw_merged, columns=all_cols
    )
)
n_pairwise_raw = {
    outcome: len(
        MissingnessHandler(listwise=False, verbose=False).handle_pairwise(
            data=raw_merged,
            exposure=exposure_miss,
            outcome=outcome,
            covariates=covariates_miss,
        )
    )
    for outcome in outcomes_miss
}

# Check whether pairwise counts differ from listwise — if not, introduce synthetic
# missingness so the comparison is meaningful
counts_differ = any(n != n_listwise_raw for n in n_pairwise_raw.values())

if counts_differ:
    raw_merged_demo = raw_merged
    print("Using real data (pairwise and listwise counts already differ).")
else:
    # --- Introducing synthetic missingness for demo purposes ---
    # Existing data had identical listwise/pairwise counts, so we inject
    # controlled NaNs to create a divergent comparison.
    rng = np.random.default_rng(42)
    raw_merged_demo = raw_merged.copy()
    raw_merged_demo.loc[raw_merged_demo.index[::5], 'sbp'] = np.nan
    raw_merged_demo.loc[raw_merged_demo.index[::7], 'hypertension'] = np.nan
    print("Synthetic missingness introduced for demo purposes.")
    print("  sbp NaN every 5th row, hypertension NaN every 7th row.")
print()
Using real data (pairwise and listwise counts already differ).

[10]:
# --- Listwise deletion ---
# Drops any sample missing in ANY of the specified columns.
# One fixed sample size applies to all tests.
n_listwise = len(
    MissingnessHandler(listwise=True, verbose=False).handle_listwise(
        data=raw_merged_demo,
        columns=all_cols,
    )
)
print(f"Listwise retained N (applied once for all outcomes): {n_listwise}")
print()

# --- Pairwise deletion ---
# Drops only samples missing for the specific exposure + outcome + covariates.
# Each outcome may retain a different N.
pairwise_handler = MissingnessHandler(listwise=False, verbose=False)
print("Pairwise retained N (per outcome):")
for outcome in outcomes_miss:
    n_pw = len(
        pairwise_handler.handle_pairwise(
            data=raw_merged_demo,
            exposure=exposure_miss,
            outcome=outcome,
            covariates=covariates_miss,
        )
    )
    print(f"  {outcome}: {n_pw}")
Listwise retained N (applied once for all outcomes): 743

Pairwise retained N (per outcome):
  sbp: 788
  hypertension: 787

Survival analysis

MARVEL supports Cox proportional hazards regression via the 'Cox-PH' test key. This requires:

  • An event indicator column (binary 0/1) — here cvd (cardiovascular disease event).

  • A follow-up time column containing positive values — here follow_up_time (years, clipped to a minimum of 0.1).

  • A survival_time_map mapping the event column to the time column, passed to tester.run().

The survival_df below is assembled from the raw merged data to ensure both cvd and follow_up_time are present alongside the genetic exposure and covariates. Rows with missing values in any required column are dropped before running.

[11]:
# Build a survival dataframe that includes the time column.
# raw_merged contains all columns from the merge (genotypes + phenotypes + covariates),
# including follow_up_time from phenotypes.tsv.
survival_cols = ['id', 'PCSK9', 'cvd', 'follow_up_time', 'age', 'sex']
survival_df = raw_merged[survival_cols].dropna().copy()

print(f"Survival analysis dataframe: {len(survival_df)} samples")
print(f"CVD events: {int(survival_df['cvd'].sum())} ({survival_df['cvd'].mean():.1%})")
print(f"Follow-up time range: {survival_df['follow_up_time'].min():.2f}{survival_df['follow_up_time'].max():.2f} years")
survival_df.head()
Survival analysis dataframe: 804 samples
CVD events: 355 (44.2%)
Follow-up time range: 0.10 – 37.35 years
[11]:
id PCSK9 cvd follow_up_time age sex
0 ID_000000 0.0 0.0 12.02 59.720649 0.0
1 ID_000001 0.0 0.0 11.68 67.088467 0.0
3 ID_000003 0.0 0.0 1.40 58.368992 1.0
4 ID_000004 0.0 0.0 0.43 75.751361 0.0
5 ID_000005 0.0 1.0 7.26 62.097949 0.0
[12]:
# Covariates dataframe for the survival model
survival_cov_df = survival_df[['id', 'age', 'sex']].copy()

# survival_time_map: event column -> time column
survival_time_map = {'cvd': 'follow_up_time'}

# Test dictionary: 'survival' type with Cox-PH test
survival_test_dict = {
    'survival': {'cvd': 'Cox-PH'},
}

# Covariate models
survival_models = {
    'Unadjusted': 'None',
    'Adjusted': 'age;sex',
}

# Run survival analysis
survival_base, survival_coef = tester.run(
    data=survival_df,
    exposure='PCSK9',
    models=survival_models,
    test_dict=survival_test_dict,
    cov=survival_cov_df,
    id_column='id',
    survival_time_map=survival_time_map,
)
survival_base
[12]:
0.0 1.0 Difference
N (%) or Mean (SD) Median (Q1; Q3) Missing (%) N (%) or Mean (SD) Median (Q1; Q3) Missing (%) Cox-PH
Variable
Total sample size_Unadjusted 742 NaN NaN 62 NaN NaN NaN
follow_up_time_Unadjusted 5.11 (5.15) 3.45 (1.48; 7.01) 0 (0.00) 6.29 (6.59) 4.12 (1.81; 8.53) 0 (0.00) NaN
cvd_Unadjusted 334 (45.01) NaN 0 (0.00) 21 (33.87) NaN 0 (0.00) 0.046489
cvd_Adjusted 334 (45.01) NaN 0 (0.00) 21 (33.87) NaN 0 (0.00) 0.053219
[13]:
# Cox-PH coefficient table: hazard ratios and p-values per model
survival_coef[['Model name', 'Variable', 'Point Estimates', 'P-values', 'Formatted (exp)']]
[13]:
Model name Variable Point Estimates P-values Formatted (exp)
0 Unadjusted cvd -0.448305 0.046489 0.64 (0.41; 0.99)
0 Adjusted cvd -0.435733 0.053219 0.65 (0.42; 1.01)
1 Adjusted cvd -0.001516 0.679030 1.00 (0.99; 1.01)
2 Adjusted cvd 0.145628 0.172169 1.16 (0.94; 1.43)