Association Testing
This notebook demonstrates the DifferencesTester API for associating genetic variant carriership with phenotypic outcomes. It covers:
Continuous, categorical, and binary outcome tests
Multi-model covariate adjustment
Listwise vs pairwise missingness handling
Survival (time-to-event) analysis with Cox proportional hazards regression
[1]:
# imports
import os
import numpy as np
from marvel.example_data import examples
from marvel.association import class_test
from marvel.utils.utils import load_custom
from marvel.utils.missingness import MissingnessHandler
from marvel.extraction.aggregation import extract_genes
test_mod = load_custom()
/Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/utils/utils.py:455: Warning: The environmental variable `MARVEL_TEST_DEFS` is not set. Trying to recover using the default configuration files in /Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/association/tests.py.
warnings.warn(
Load the data
First, we’ll load the example data by extracting gene-level carrier counts from the PLINK example files, similar to the process shown in the aggregation notebook. We want to determine whether carriers differ from non-carriers for a few outcomes, which are loaded as well. In addition, we will correct for age and sex in the association analyses.
[ ]:
# Extract genetic information
summary, gen_df = extract_genes(
input_file = examples.get_path('plink_example', 'example_genotypes.bed'),
genes = examples.get_data('dummy_variants'),
cat_column = 'gene',
summarise = True,
neg_geno = None,
sum_geno = 1,
incl_var = False,
)
# Show example
summary
[3]:
# Load outcomes
pheno = examples.get_data('outcome_data')
pheno
[3]:
| id | hypertension | diabetes | cvd | sbp | ldl_cholesterol | crp_level | disease_severity | treatment_response | risk_category | follow_up_time | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID_000000 | 0.0 | 0.0 | 0.0 | 135.899046 | 4.628664 | 11.963852 | medium | severe | B | 12.02 |
| 1 | ID_000001 | 1.0 | 0.0 | 0.0 | 129.255089 | 4.825001 | 33.911247 | high | control | B | 11.68 |
| 2 | ID_000002 | 1.0 | 1.0 | 1.0 | NaN | 5.671148 | 29.237849 | high | control | B | 11.92 |
| 3 | ID_000003 | 0.0 | NaN | 0.0 | 99.510664 | 5.284872 | 14.347175 | high | severe | C | 1.40 |
| 4 | ID_000004 | 0.0 | 0.0 | 0.0 | 138.179160 | 4.546807 | 40.532594 | high | control | A | 0.43 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | ID_000995 | 0.0 | 0.0 | 1.0 | 144.427028 | 5.230045 | 76.510642 | medium | control | B | 0.77 |
| 996 | ID_000996 | 1.0 | 1.0 | 0.0 | 115.734230 | 5.147892 | 28.864107 | low | control | A | 7.98 |
| 997 | ID_000997 | 0.0 | 0.0 | 0.0 | NaN | 3.459606 | 11.016325 | low | severe | A | 12.29 |
| 998 | ID_000998 | 1.0 | 0.0 | 0.0 | 130.183969 | 4.165561 | 12.189264 | high | control | B | 7.17 |
| 999 | ID_000999 | 0.0 | 0.0 | 1.0 | 117.945068 | 4.583163 | 33.683725 | low | NaN | D | 1.93 |
1000 rows × 11 columns
[4]:
# Load covariates
cov_df = examples.get_data('cov_data')
cov_df
[4]:
| id | age | sex | bmi | smoking_status | alcohol_consumption | PC1 | PC2 | PC3 | PC4 | batch | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID_000000 | 59.720649 | 0.0 | 21.422981 | 1.0 | 0.0 | -0.792276 | 0.763319 | -2.848527 | 0.187020 | NaN |
| 1 | ID_000001 | 67.088467 | 0.0 | 23.285183 | 1.0 | 2.0 | -0.412052 | 0.068425 | -0.075304 | 0.586625 | batch1 |
| 2 | ID_000002 | 54.541104 | 0.0 | 29.997524 | 0.0 | 1.0 | 1.960718 | -0.946065 | 0.990386 | 0.445479 | batch1 |
| 3 | ID_000003 | 58.368992 | 1.0 | 25.503487 | 0.0 | 2.0 | 2.672157 | 0.904750 | 0.105501 | -0.049818 | batch1 |
| 4 | ID_000004 | 75.751361 | 0.0 | 21.167230 | 0.0 | 1.0 | 1.562669 | 0.210285 | NaN | -0.424643 | batch1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | ID_000995 | 37.323449 | 1.0 | 29.385406 | 1.0 | 2.0 | -0.916293 | -2.119301 | 0.574164 | 1.812890 | NaN |
| 996 | ID_000996 | 48.741477 | 0.0 | 20.147595 | 0.0 | 1.0 | 1.039441 | 0.662346 | -1.723685 | -0.396235 | batch1 |
| 997 | ID_000997 | 75.561878 | 1.0 | 26.974480 | 1.0 | 2.0 | 0.290605 | 2.128468 | 0.855002 | -0.091420 | batch1 |
| 998 | ID_000998 | 56.062975 | 0.0 | 25.329474 | 2.0 | 0.0 | 1.982513 | -2.600536 | -0.741593 | 0.990803 | batch2 |
| 999 | ID_000999 | 83.959874 | 0.0 | 30.452003 | 2.0 | 0.0 | 2.392298 | -1.235706 | -0.374105 | -0.400736 | batch2 |
1000 rows × 11 columns
Handle missingness and prepare data
While missing data can be handled during the association analyses, it is recommended to address it beforehand where possible. Categorical string outcomes must also be encoded as integers before being passed to DifferencesTester.
The cell below merges the genetic, phenotype, and covariate data, encodes the categorical outcomes, and splits off the covariate dataframe used in subsequent tests.
[5]:
# merge the data
full_df = gen_df.merge(pheno, how='inner', left_index=True, right_on='id')
full_df = full_df.merge(cov_df, how='inner', on='id')
# Preserve raw merged dataframe BEFORE any dropna — used for missingness demo below
raw_merged = full_df.copy()
# remove missingness
full_df.dropna(inplace=True)
# Refactor categorical outcomes
full_df['disease_severity'] = full_df['disease_severity'].map({'low': 0, 'medium': 1, 'high': 2})
full_df['treatment_response'] = full_df['treatment_response'].map({'control': 0, 'mild': 1, 'severe': 2})
# Separate covariates
cov_df = full_df[['id', 'age', 'sex']].copy()
# show subset of data
variables = ['id'] + list(gen_df.columns) + ['sbp', 'ldl_cholesterol', 'disease_severity', 'treatment_response', 'diabetes', 'cvd']
full_df[variables]
[5]:
| id | PCSK9 | ANGPTL3 | HMGCR | NPC1L1 | CETP | PPARA | sbp | ldl_cholesterol | disease_severity | treatment_response | diabetes | cvd | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ID_000001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 129.255089 | 4.825001 | 2 | 0 | 0.0 | 0.0 |
| 6 | ID_000006 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 114.460843 | 5.286435 | 2 | 0 | 0.0 | 0.0 |
| 10 | ID_000010 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 104.559427 | 3.931334 | 0 | 1 | 0.0 | 0.0 |
| 18 | ID_000018 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 103.252133 | 5.640530 | 0 | 2 | 1.0 | 0.0 |
| 19 | ID_000019 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 138.537174 | 5.344172 | 0 | 0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 974 | ID_000974 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 136.558160 | 4.361229 | 2 | 0 | 0.0 | 0.0 |
| 979 | ID_000979 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 132.521270 | 4.758025 | 1 | 2 | 0.0 | 1.0 |
| 987 | ID_000987 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117.765460 | 5.753577 | 1 | 0 | 0.0 | 0.0 |
| 988 | ID_000988 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 127.541766 | 2.930257 | 0 | 0 | 0.0 | 1.0 |
| 990 | ID_000990 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 130.591184 | 6.235300 | 0 | 2 | 0.0 | 0.0 |
209 rows × 13 columns
[6]:
tester = class_test.DifferencesTester(test_module=test_mod.AllTests(), min_group_size = 1)
test_dict = {
'continuous': {'sbp': 'OLS;T', 'ldl_cholesterol' : 'OLS;T'},
'binary': {'diabetes': 'GLM-Binom', 'cvd' : 'GLM-Binom;FISHER'},
'categorical' : {'disease_severity' : 'CHISQ', 'treatment_response' : 'CHISQ'},
}
models = {
'Unadjusted': 'None',
'Adjusted': 'age;sex'
}
base_table, coef_table = tester.run(
data=full_df,
exposure=['PCSK9'],
models=models,
test_dict=test_dict,
cov=cov_df,
id_column = 'id'
)
base_table.drop_duplicates()
[6]:
| 0.0 | 1.0 | Difference | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| N (%) or Mean (SD) | Median (Q1; Q3) | Missing (%) | N (%) or Mean (SD) | Median (Q1; Q3) | Missing (%) | OLS | OLS_Stat | T | T_Stat | GLM-Binom | GLM-Binom_Stat | FISHER | FISHER_Stat | CHISQ | CHISQ_Stat | |
| Variable | ||||||||||||||||
| Total sample size_Unadjusted | 190 | NaN | NaN | 19 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| sbp_Unadjusted | 119.35 (15.17) | 120.46 (110.23; 129.89) | 0 (0.00) | 120.98 (16.94) | 122.12 (110.83; 127.83) | 0 (0.00) | 0.661163 | 5.832272e-183 | 0.661163 | -0.438940 | NaN | NaN | NaN | NaN | NaN | NaN |
| ldl_cholesterol_Unadjusted | 5.11 (1.11) | 5.18 (4.42; 5.75) | 0 (0.00) | 5.21 (0.79) | 5.21 (4.76; 5.88) | 0 (0.00) | 0.711937 | 2.623666e-139 | 0.711937 | -0.369763 | NaN | NaN | NaN | NaN | NaN | NaN |
| diabetes_Unadjusted | 29 (15.26) | NaN | 0 (0.00) | 5 (26.32) | NaN | 0 (0.00) | NaN | NaN | NaN | NaN | 0.220503 | 1.942967e-17 | NaN | NaN | NaN | NaN |
| cvd_Unadjusted | 83 (43.68) | NaN | 0 (0.00) | 10 (52.63) | NaN | 0 (0.00) | NaN | NaN | NaN | NaN | 0.456123 | 8.248046e-02 | 0.477321 | 1.432396 | NaN | NaN |
| disease_severity: 0_Unadjusted | 92 (48.42) | NaN | 0 (0.00) | 11 (57.89) | NaN | 0 (0.00) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.272478 | 2.600396 |
| disease_severity: 1_Unadjusted | 59 (31.05) | NaN | 0 (0.00) | 7 (36.84) | NaN | 0 (0.00) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| disease_severity: 2_Unadjusted | 39 (20.53) | NaN | 0 (0.00) | 1 (5.26) | NaN | 0 (0.00) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| treatment_response: 0_Unadjusted | 122 (64.21) | NaN | 0 (0.00) | 12 (63.16) | NaN | 0 (0.00) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.636175 | 0.904563 |
| treatment_response: 1_Unadjusted | 36 (18.95) | NaN | 0 (0.00) | 5 (26.32) | NaN | 0 (0.00) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| treatment_response: 2_Unadjusted | 32 (16.84) | NaN | 0 (0.00) | 2 (10.53) | NaN | 0 (0.00) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| sbp_Adjusted | 119.35 (15.17) | 120.46 (110.23; 129.89) | 0 (0.00) | 120.98 (16.94) | 122.12 (110.83; 127.83) | 0 (0.00) | 0.641524 | 6.937978e-70 | 0.661163 | -0.438940 | NaN | NaN | NaN | NaN | NaN | NaN |
| ldl_cholesterol_Adjusted | 5.11 (1.11) | 5.18 (4.42; 5.75) | 0 (0.00) | 5.21 (0.79) | 5.21 (4.76; 5.88) | 0 (0.00) | 0.752053 | 9.567225e-38 | 0.711937 | -0.369763 | NaN | NaN | NaN | NaN | NaN | NaN |
| diabetes_Adjusted | 29 (15.26) | NaN | 0 (0.00) | 5 (26.32) | NaN | 0 (0.00) | NaN | NaN | NaN | NaN | 0.226903 | 3.512443e-03 | NaN | NaN | NaN | NaN |
| cvd_Adjusted | 83 (43.68) | NaN | 0 (0.00) | 10 (52.63) | NaN | 0 (0.00) | NaN | NaN | NaN | NaN | 0.457350 | 6.220539e-01 | 0.477321 | 1.432396 | NaN | NaN |
[7]:
coef_table
[7]:
| Model | Model name | Stratification name | Variable | Exposure | Cases | Samples | Exposed | Non-exposed | Point Estimates | Standard Error | Test statistic | P-values | Formatted | Formatted (exp) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OLS | Unadjusted | <NA> | sbp | Intercept | 0.0 | 209.0 | 19.0 | 190.0 | 119.351553 | 1.118009 | NaN | 5.832272e-183 | 119.35 (117.16; 121.54) | 6819002940268249165055820291392844669167365204... |
| 1 | OLS | Unadjusted | <NA> | sbp | PCSK9 | 0.0 | 209.0 | 19.0 | 190.0 | 1.627597 | 3.708015 | NaN | 6.611626e-01 | 1.63 (-5.64; 8.90) | 5.09 (0.00; 7296.66) |
| 0 | T | Unadjusted | <NA> | sbp | PCSK9 | 0.0 | 209.0 | 19.0 | 190.0 | NaN | NaN | -0.438940 | 6.611626e-01 | NaN | NaN |
| 0 | OLS | Unadjusted | <NA> | ldl_cholesterol | Intercept | 0.0 | 209.0 | 19.0 | 190.0 | 5.109830 | 0.078967 | NaN | 2.623666e-139 | 5.11 (4.96; 5.26) | 165.64 (141.89; 193.37) |
| 1 | OLS | Unadjusted | <NA> | ldl_cholesterol | PCSK9 | 0.0 | 209.0 | 19.0 | 190.0 | 0.096843 | 0.261904 | NaN | 7.119368e-01 | 0.10 (-0.42; 0.61) | 1.10 (0.66; 1.84) |
| 0 | T | Unadjusted | <NA> | ldl_cholesterol | PCSK9 | 0.0 | 209.0 | 19.0 | 190.0 | NaN | NaN | -0.369763 | 7.119368e-01 | NaN | NaN |
| 0 | GLM-Binom | Unadjusted | <NA> | diabetes | Intercept | 34.0 | 209.0 | 19.0 | 190.0 | -1.714109 | 0.201727 | NaN | 1.942967e-17 | -1.71 (-2.11; -1.32) | 0.18 (0.12; 0.27) |
| 1 | GLM-Binom | Unadjusted | <NA> | diabetes | PCSK9 | 34.0 | 209.0 | 19.0 | 190.0 | 0.684489 | 0.558679 | NaN | 2.205027e-01 | 0.68 (-0.41; 1.78) | 1.98 (0.66; 5.93) |
| 0 | GLM-Binom | Unadjusted | <NA> | cvd | Intercept | 93.0 | 209.0 | 19.0 | 190.0 | -0.253988 | 0.146267 | NaN | 8.248046e-02 | -0.25 (-0.54; 0.03) | 0.78 (0.58; 1.03) |
| 1 | GLM-Binom | Unadjusted | <NA> | cvd | PCSK9 | 93.0 | 209.0 | 19.0 | 190.0 | 0.359349 | 0.482188 | NaN | 4.561228e-01 | 0.36 (-0.59; 1.30) | 1.43 (0.56; 3.69) |
| 0 | FISHER | Unadjusted | <NA> | cvd | PCSK9 | 93.0 | 209.0 | 19.0 | 190.0 | NaN | NaN | 1.432396 | 4.773212e-01 | NaN | NaN |
| 0 | CHISQ | Unadjusted | <NA> | disease_severity | PCSK9 | NaN | NaN | NaN | NaN | NaN | NaN | 2.600396 | 2.724778e-01 | NaN | NaN |
| 0 | CHISQ | Unadjusted | <NA> | treatment_response | PCSK9 | NaN | NaN | NaN | NaN | NaN | NaN | 0.904563 | 6.361750e-01 | NaN | NaN |
| 0 | OLS | Adjusted | <NA> | sbp | Intercept | 0.0 | 209.0 | 19.0 | 190.0 | 119.196798 | 4.386169 | NaN | 6.937978e-70 | 119.20 (110.60; 127.79) | 5841327625673460048707196120176087049213956698... |
| 1 | OLS | Adjusted | <NA> | sbp | PCSK9 | 0.0 | 209.0 | 19.0 | 190.0 | 1.731601 | 3.713803 | NaN | 6.415239e-01 | 1.73 (-5.55; 9.01) | 5.65 (0.00; 8188.79) |
| 2 | OLS | Adjusted | <NA> | sbp | age | 0.0 | 209.0 | 19.0 | 190.0 | -0.027879 | 0.072927 | NaN | 7.026475e-01 | -0.03 (-0.17; 0.12) | 0.97 (0.84; 1.12) |
| 3 | OLS | Adjusted | <NA> | sbp | sex | 0.0 | 209.0 | 19.0 | 190.0 | 2.883533 | 2.166641 | NaN | 1.847077e-01 | 2.88 (-1.36; 7.13) | 17.88 (0.26; 1248.96) |
| 0 | T | Adjusted | <NA> | sbp | PCSK9 | 0.0 | 209.0 | 19.0 | 190.0 | NaN | NaN | -0.438940 | 6.611626e-01 | NaN | NaN |
| 0 | OLS | Adjusted | <NA> | ldl_cholesterol | Intercept | 0.0 | 209.0 | 19.0 | 190.0 | 4.940407 | 0.309923 | NaN | 9.567225e-38 | 4.94 (4.33; 5.55) | 139.83 (76.17; 256.68) |
| 1 | OLS | Adjusted | <NA> | ldl_cholesterol | PCSK9 | 0.0 | 209.0 | 19.0 | 190.0 | 0.083016 | 0.262414 | NaN | 7.520534e-01 | 0.08 (-0.43; 0.60) | 1.09 (0.65; 1.82) |
| 2 | OLS | Adjusted | <NA> | ldl_cholesterol | age | 0.0 | 209.0 | 19.0 | 190.0 | 0.004640 | 0.005153 | NaN | 3.688913e-01 | 0.00 (-0.01; 0.01) | 1.00 (0.99; 1.01) |
| 3 | OLS | Adjusted | <NA> | ldl_cholesterol | sex | 0.0 | 209.0 | 19.0 | 190.0 | -0.148852 | 0.153093 | NaN | 3.320483e-01 | -0.15 (-0.45; 0.15) | 0.86 (0.64; 1.16) |
| 0 | T | Adjusted | <NA> | ldl_cholesterol | PCSK9 | 0.0 | 209.0 | 19.0 | 190.0 | NaN | NaN | -0.369763 | 7.119368e-01 | NaN | NaN |
| 0 | GLM-Binom | Adjusted | <NA> | diabetes | Intercept | 34.0 | 209.0 | 19.0 | 190.0 | -2.390569 | 0.818990 | NaN | 3.512443e-03 | -2.39 (-4.00; -0.79) | 0.09 (0.02; 0.46) |
| 1 | GLM-Binom | Adjusted | <NA> | diabetes | PCSK9 | 34.0 | 209.0 | 19.0 | 190.0 | 0.679312 | 0.562170 | NaN | 2.269031e-01 | 0.68 (-0.42; 1.78) | 1.97 (0.66; 5.94) |
| 2 | GLM-Binom | Adjusted | <NA> | diabetes | age | 34.0 | 209.0 | 19.0 | 190.0 | 0.006939 | 0.013147 | NaN | 5.976202e-01 | 0.01 (-0.02; 0.03) | 1.01 (0.98; 1.03) |
| 3 | GLM-Binom | Adjusted | <NA> | diabetes | sex | 34.0 | 209.0 | 19.0 | 190.0 | 0.459180 | 0.399071 | NaN | 2.498874e-01 | 0.46 (-0.32; 1.24) | 1.58 (0.72; 3.46) |
| 0 | GLM-Binom | Adjusted | <NA> | cvd | Intercept | 93.0 | 209.0 | 19.0 | 190.0 | -0.282808 | 0.573716 | NaN | 6.220539e-01 | -0.28 (-1.41; 0.84) | 0.75 (0.24; 2.32) |
| 1 | GLM-Binom | Adjusted | <NA> | cvd | PCSK9 | 93.0 | 209.0 | 19.0 | 190.0 | 0.358869 | 0.482859 | NaN | 4.573501e-01 | 0.36 (-0.59; 1.31) | 1.43 (0.56; 3.69) |
| 2 | GLM-Binom | Adjusted | <NA> | cvd | age | 93.0 | 209.0 | 19.0 | 190.0 | 0.000277 | 0.009536 | NaN | 9.768179e-01 | 0.00 (-0.02; 0.02) | 1.00 (0.98; 1.02) |
| 3 | GLM-Binom | Adjusted | <NA> | cvd | sex | 93.0 | 209.0 | 19.0 | 190.0 | 0.022824 | 0.283299 | NaN | 9.357882e-01 | 0.02 (-0.53; 0.58) | 1.02 (0.59; 1.78) |
| 0 | FISHER | Adjusted | <NA> | cvd | PCSK9 | 93.0 | 209.0 | 19.0 | 190.0 | NaN | NaN | 1.432396 | 4.773212e-01 | NaN | NaN |
| 0 | CHISQ | Adjusted | <NA> | disease_severity | PCSK9 | NaN | NaN | NaN | NaN | NaN | NaN | 2.600396 | 2.724778e-01 | NaN | NaN |
| 0 | CHISQ | Adjusted | <NA> | treatment_response | PCSK9 | NaN | NaN | NaN | NaN | NaN | NaN | 0.904563 | 6.361750e-01 | NaN | NaN |
Multi-model covariate comparison
The coef_table above contains results for both the Unadjusted (no covariates) and Adjusted (age + sex) models. Sorting by exposure, outcome, and model places the two rows for each exposure–outcome pair adjacent to each other, making it easy to see whether adjusting for covariates changes the association.
Columns of interest:
Model name —
UnadjustedorAdjustedcoef / Point Estimates — effect size
P-values — statistical significance
A smaller p-value after adjustment may indicate that the raw association was confounded by age or sex. A larger p-value after adjustment is common when the covariate explains some of the outcome variance.
[8]:
# Exposure name used in this analysis
exposure_col_name = 'PCSK9'
# Filter to the exposure rows (exclude intercept and covariate rows),
# then sort by outcome (Variable) and model so that Unadjusted and Adjusted
# rows for the same outcome appear adjacent.
(
coef_table
.rename(columns={'Model name': 'model', 'Variable': 'outcome', 'Point Estimates': 'coef', 'P-values': 'pvalue'})
.query("Exposure == @exposure_col_name")
.sort_values(['outcome', 'model'])
[['model', 'outcome', 'coef', 'pvalue']]
.head(12)
)
[8]:
| model | outcome | coef | pvalue | |
|---|---|---|---|---|
| 1 | Adjusted | cvd | 0.358869 | 0.457350 |
| 0 | Adjusted | cvd | NaN | 0.477321 |
| 1 | Unadjusted | cvd | 0.359349 | 0.456123 |
| 0 | Unadjusted | cvd | NaN | 0.477321 |
| 1 | Adjusted | diabetes | 0.679312 | 0.226903 |
| 1 | Unadjusted | diabetes | 0.684489 | 0.220503 |
| 0 | Adjusted | disease_severity | NaN | 0.272478 |
| 0 | Unadjusted | disease_severity | NaN | 0.272478 |
| 1 | Adjusted | ldl_cholesterol | 0.083016 | 0.752053 |
| 0 | Adjusted | ldl_cholesterol | NaN | 0.711937 |
| 1 | Unadjusted | ldl_cholesterol | 0.096843 | 0.711937 |
| 0 | Unadjusted | ldl_cholesterol | NaN | 0.711937 |
Pairwise vs listwise missingness
MARVEL’s MissingnessHandler supports two strategies for dealing with missing data:
Listwise deletion (conservative): drops every sample that has a missing value in any of the specified columns. The same reduced sample set is used for all downstream tests — consistent but potentially wasteful.
Pairwise deletion (maximises N): drops only the samples that are missing the specific columns required for each individual test (exposure + outcome + covariates). Different tests may therefore use different sample counts.
The demo below operates on raw_merged — the merged dataframe before any dropna call — so that the comparison is meaningful. If the data happen to have identical missingness patterns across all variables (making listwise == pairwise for every outcome), synthetic missingness is introduced for illustration.
[9]:
# Variables used for the missingness demo
exposure_miss = 'PCSK9'
outcomes_miss = ['sbp', 'hypertension']
covariates_miss = ['age', 'sex']
all_cols = [exposure_miss] + outcomes_miss + covariates_miss
# Check listwise vs pairwise counts on the raw merged data
n_listwise_raw = len(
MissingnessHandler(listwise=True, verbose=False).handle_listwise(
data=raw_merged, columns=all_cols
)
)
n_pairwise_raw = {
outcome: len(
MissingnessHandler(listwise=False, verbose=False).handle_pairwise(
data=raw_merged,
exposure=exposure_miss,
outcome=outcome,
covariates=covariates_miss,
)
)
for outcome in outcomes_miss
}
# Check whether pairwise counts differ from listwise — if not, introduce synthetic
# missingness so the comparison is meaningful
counts_differ = any(n != n_listwise_raw for n in n_pairwise_raw.values())
if counts_differ:
raw_merged_demo = raw_merged
print("Using real data (pairwise and listwise counts already differ).")
else:
# --- Introducing synthetic missingness for demo purposes ---
# Existing data had identical listwise/pairwise counts, so we inject
# controlled NaNs to create a divergent comparison.
rng = np.random.default_rng(42)
raw_merged_demo = raw_merged.copy()
raw_merged_demo.loc[raw_merged_demo.index[::5], 'sbp'] = np.nan
raw_merged_demo.loc[raw_merged_demo.index[::7], 'hypertension'] = np.nan
print("Synthetic missingness introduced for demo purposes.")
print(" sbp NaN every 5th row, hypertension NaN every 7th row.")
print()
Using real data (pairwise and listwise counts already differ).
[10]:
# --- Listwise deletion ---
# Drops any sample missing in ANY of the specified columns.
# One fixed sample size applies to all tests.
n_listwise = len(
MissingnessHandler(listwise=True, verbose=False).handle_listwise(
data=raw_merged_demo,
columns=all_cols,
)
)
print(f"Listwise retained N (applied once for all outcomes): {n_listwise}")
print()
# --- Pairwise deletion ---
# Drops only samples missing for the specific exposure + outcome + covariates.
# Each outcome may retain a different N.
pairwise_handler = MissingnessHandler(listwise=False, verbose=False)
print("Pairwise retained N (per outcome):")
for outcome in outcomes_miss:
n_pw = len(
pairwise_handler.handle_pairwise(
data=raw_merged_demo,
exposure=exposure_miss,
outcome=outcome,
covariates=covariates_miss,
)
)
print(f" {outcome}: {n_pw}")
Listwise retained N (applied once for all outcomes): 743
Pairwise retained N (per outcome):
sbp: 788
hypertension: 787
Survival analysis
MARVEL supports Cox proportional hazards regression via the 'Cox-PH' test key. This requires:
An event indicator column (binary 0/1) — here
cvd(cardiovascular disease event).A follow-up time column containing positive values — here
follow_up_time(years, clipped to a minimum of 0.1).A
survival_time_mapmapping the event column to the time column, passed totester.run().
The survival_df below is assembled from the raw merged data to ensure both cvd and follow_up_time are present alongside the genetic exposure and covariates. Rows with missing values in any required column are dropped before running.
[11]:
# Build a survival dataframe that includes the time column.
# raw_merged contains all columns from the merge (genotypes + phenotypes + covariates),
# including follow_up_time from phenotypes.tsv.
survival_cols = ['id', 'PCSK9', 'cvd', 'follow_up_time', 'age', 'sex']
survival_df = raw_merged[survival_cols].dropna().copy()
print(f"Survival analysis dataframe: {len(survival_df)} samples")
print(f"CVD events: {int(survival_df['cvd'].sum())} ({survival_df['cvd'].mean():.1%})")
print(f"Follow-up time range: {survival_df['follow_up_time'].min():.2f} – {survival_df['follow_up_time'].max():.2f} years")
survival_df.head()
Survival analysis dataframe: 804 samples
CVD events: 355 (44.2%)
Follow-up time range: 0.10 – 37.35 years
[11]:
| id | PCSK9 | cvd | follow_up_time | age | sex | |
|---|---|---|---|---|---|---|
| 0 | ID_000000 | 0.0 | 0.0 | 12.02 | 59.720649 | 0.0 |
| 1 | ID_000001 | 0.0 | 0.0 | 11.68 | 67.088467 | 0.0 |
| 3 | ID_000003 | 0.0 | 0.0 | 1.40 | 58.368992 | 1.0 |
| 4 | ID_000004 | 0.0 | 0.0 | 0.43 | 75.751361 | 0.0 |
| 5 | ID_000005 | 0.0 | 1.0 | 7.26 | 62.097949 | 0.0 |
[12]:
# Covariates dataframe for the survival model
survival_cov_df = survival_df[['id', 'age', 'sex']].copy()
# survival_time_map: event column -> time column
survival_time_map = {'cvd': 'follow_up_time'}
# Test dictionary: 'survival' type with Cox-PH test
survival_test_dict = {
'survival': {'cvd': 'Cox-PH'},
}
# Covariate models
survival_models = {
'Unadjusted': 'None',
'Adjusted': 'age;sex',
}
# Run survival analysis
survival_base, survival_coef = tester.run(
data=survival_df,
exposure='PCSK9',
models=survival_models,
test_dict=survival_test_dict,
cov=survival_cov_df,
id_column='id',
survival_time_map=survival_time_map,
)
survival_base
[12]:
| 0.0 | 1.0 | Difference | |||||
|---|---|---|---|---|---|---|---|
| N (%) or Mean (SD) | Median (Q1; Q3) | Missing (%) | N (%) or Mean (SD) | Median (Q1; Q3) | Missing (%) | Cox-PH | |
| Variable | |||||||
| Total sample size_Unadjusted | 742 | NaN | NaN | 62 | NaN | NaN | NaN |
| follow_up_time_Unadjusted | 5.11 (5.15) | 3.45 (1.48; 7.01) | 0 (0.00) | 6.29 (6.59) | 4.12 (1.81; 8.53) | 0 (0.00) | NaN |
| cvd_Unadjusted | 334 (45.01) | NaN | 0 (0.00) | 21 (33.87) | NaN | 0 (0.00) | 0.046489 |
| cvd_Adjusted | 334 (45.01) | NaN | 0 (0.00) | 21 (33.87) | NaN | 0 (0.00) | 0.053219 |
[13]:
# Cox-PH coefficient table: hazard ratios and p-values per model
survival_coef[['Model name', 'Variable', 'Point Estimates', 'P-values', 'Formatted (exp)']]
[13]:
| Model name | Variable | Point Estimates | P-values | Formatted (exp) | |
|---|---|---|---|---|---|
| 0 | Unadjusted | cvd | -0.448305 | 0.046489 | 0.64 (0.41; 0.99) |
| 0 | Adjusted | cvd | -0.435733 | 0.053219 | 0.65 (0.42; 1.01) |
| 1 | Adjusted | cvd | -0.001516 | 0.679030 | 1.00 (0.99; 1.01) |
| 2 | Adjusted | cvd | 0.145628 | 0.172169 | 1.16 (0.94; 1.43) |