Association Testing

This notebook demonstrates the DifferencesTester API for associating genetic variant carriership with phenotypic outcomes. It covers:

Continuous, categorical, and binary outcome tests
Multi-model covariate adjustment
Listwise vs pairwise missingness handling
Survival (time-to-event) analysis with Cox proportional hazards regression

[1]:

# imports
import os
import numpy as np
from marvel.example_data import examples
from marvel.association import class_test
from marvel.utils.utils import load_custom
from marvel.utils.missingness import MissingnessHandler
from marvel.extraction.aggregation import extract_genes
test_mod = load_custom()

/Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/utils/utils.py:455: Warning: The environmental variable `MARVEL_TEST_DEFS` is not set. Trying to recover using the default configuration files in /Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/association/tests.py.
  warnings.warn(

Load the data

First, we’ll load the example data by extracting gene-level carrier counts from the PLINK example files, similar to the process shown in the aggregation notebook. We want to determine whether carriers differ from non-carriers for a few outcomes, which are loaded as well. In addition, we will correct for age and sex in the association analyses.

[ ]:

# Extract genetic information
summary, gen_df = extract_genes(
    input_file = examples.get_path('plink_example', 'example_genotypes.bed'),
    genes = examples.get_data('dummy_variants'),
    cat_column = 'gene',
    summarise = True,
    neg_geno = None,
    sum_geno = 1,
    incl_var = False,
)

# Show example
summary

[3]:

# Load outcomes
pheno = examples.get_data('outcome_data')
pheno

[3]:

	id	hypertension	diabetes	cvd	sbp	ldl_cholesterol	crp_level	disease_severity	treatment_response	risk_category	follow_up_time
0	ID_000000	0.0	0.0	0.0	135.899046	4.628664	11.963852	medium	severe	B	12.02
1	ID_000001	1.0	0.0	0.0	129.255089	4.825001	33.911247	high	control	B	11.68
2	ID_000002	1.0	1.0	1.0	NaN	5.671148	29.237849	high	control	B	11.92
3	ID_000003	0.0	NaN	0.0	99.510664	5.284872	14.347175	high	severe	C	1.40
4	ID_000004	0.0	0.0	0.0	138.179160	4.546807	40.532594	high	control	A	0.43
...	...	...	...	...	...	...	...	...	...	...	...
995	ID_000995	0.0	0.0	1.0	144.427028	5.230045	76.510642	medium	control	B	0.77
996	ID_000996	1.0	1.0	0.0	115.734230	5.147892	28.864107	low	control	A	7.98
997	ID_000997	0.0	0.0	0.0	NaN	3.459606	11.016325	low	severe	A	12.29
998	ID_000998	1.0	0.0	0.0	130.183969	4.165561	12.189264	high	control	B	7.17
999	ID_000999	0.0	0.0	1.0	117.945068	4.583163	33.683725	low	NaN	D	1.93

1000 rows × 11 columns

[4]:

# Load covariates
cov_df = examples.get_data('cov_data')
cov_df

[4]:

	id	age	sex	bmi	smoking_status	alcohol_consumption	PC1	PC2	PC3	PC4	batch
0	ID_000000	59.720649	0.0	21.422981	1.0	0.0	-0.792276	0.763319	-2.848527	0.187020	NaN
1	ID_000001	67.088467	0.0	23.285183	1.0	2.0	-0.412052	0.068425	-0.075304	0.586625	batch1
2	ID_000002	54.541104	0.0	29.997524	0.0	1.0	1.960718	-0.946065	0.990386	0.445479	batch1
3	ID_000003	58.368992	1.0	25.503487	0.0	2.0	2.672157	0.904750	0.105501	-0.049818	batch1
4	ID_000004	75.751361	0.0	21.167230	0.0	1.0	1.562669	0.210285	NaN	-0.424643	batch1
...	...	...	...	...	...	...	...	...	...	...	...
995	ID_000995	37.323449	1.0	29.385406	1.0	2.0	-0.916293	-2.119301	0.574164	1.812890	NaN
996	ID_000996	48.741477	0.0	20.147595	0.0	1.0	1.039441	0.662346	-1.723685	-0.396235	batch1
997	ID_000997	75.561878	1.0	26.974480	1.0	2.0	0.290605	2.128468	0.855002	-0.091420	batch1
998	ID_000998	56.062975	0.0	25.329474	2.0	0.0	1.982513	-2.600536	-0.741593	0.990803	batch2
999	ID_000999	83.959874	0.0	30.452003	2.0	0.0	2.392298	-1.235706	-0.374105	-0.400736	batch2

1000 rows × 11 columns

Handle missingness and prepare data

While missing data can be handled during the association analyses, it is recommended to address it beforehand where possible. Categorical string outcomes must also be encoded as integers before being passed to DifferencesTester.

The cell below merges the genetic, phenotype, and covariate data, encodes the categorical outcomes, and splits off the covariate dataframe used in subsequent tests.

[5]:

# merge the data
full_df = gen_df.merge(pheno, how='inner', left_index=True, right_on='id')
full_df = full_df.merge(cov_df, how='inner', on='id')

# Preserve raw merged dataframe BEFORE any dropna — used for missingness demo below
raw_merged = full_df.copy()

# remove missingness
full_df.dropna(inplace=True)

# Refactor categorical outcomes
full_df['disease_severity'] = full_df['disease_severity'].map({'low': 0, 'medium': 1, 'high': 2})
full_df['treatment_response'] = full_df['treatment_response'].map({'control': 0, 'mild': 1, 'severe': 2})

# Separate covariates
cov_df = full_df[['id', 'age', 'sex']].copy()

# show subset of data
variables = ['id'] + list(gen_df.columns) + ['sbp', 'ldl_cholesterol', 'disease_severity', 'treatment_response', 'diabetes', 'cvd']
full_df[variables]

[5]:

	id	PCSK9	ANGPTL3	HMGCR	NPC1L1	CETP	PPARA	sbp	ldl_cholesterol	disease_severity	treatment_response	diabetes	cvd
1	ID_000001	0.0	0.0	0.0	0.0	0.0	0.0	129.255089	4.825001	2	0	0.0	0.0
6	ID_000006	0.0	0.0	0.0	0.0	0.0	0.0	114.460843	5.286435	2	0	0.0	0.0
10	ID_000010	0.0	1.0	0.0	0.0	0.0	0.0	104.559427	3.931334	0	1	0.0	0.0
18	ID_000018	0.0	0.0	0.0	0.0	0.0	0.0	103.252133	5.640530	0	2	1.0	0.0
19	ID_000019	1.0	0.0	0.0	0.0	0.0	0.0	138.537174	5.344172	0	0	0.0	1.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...
974	ID_000974	0.0	0.0	0.0	0.0	0.0	0.0	136.558160	4.361229	2	0	0.0	0.0
979	ID_000979	0.0	0.0	0.0	0.0	0.0	0.0	132.521270	4.758025	1	2	0.0	1.0
987	ID_000987	0.0	0.0	0.0	0.0	0.0	0.0	117.765460	5.753577	1	0	0.0	0.0
988	ID_000988	0.0	0.0	1.0	0.0	0.0	0.0	127.541766	2.930257	0	0	0.0	1.0
990	ID_000990	0.0	0.0	0.0	1.0	0.0	0.0	130.591184	6.235300	0	2	0.0	0.0

209 rows × 13 columns

[6]:

tester = class_test.DifferencesTester(test_module=test_mod.AllTests(), min_group_size = 1)

test_dict = {
    'continuous': {'sbp': 'OLS;T', 'ldl_cholesterol' : 'OLS;T'},
    'binary': {'diabetes': 'GLM-Binom', 'cvd' : 'GLM-Binom;FISHER'},
    'categorical' : {'disease_severity' : 'CHISQ', 'treatment_response' : 'CHISQ'},
}

models = {
    'Unadjusted': 'None',
    'Adjusted': 'age;sex'
}

base_table, coef_table = tester.run(
    data=full_df,
    exposure=['PCSK9'],
    models=models,
    test_dict=test_dict,
    cov=cov_df,
    id_column = 'id'
)
base_table.drop_duplicates()

[6]:

	0.0			1.0			Difference
	N (%) or Mean (SD)	Median (Q1; Q3)	Missing (%)	N (%) or Mean (SD)	Median (Q1; Q3)	Missing (%)	OLS	OLS_Stat	T	T_Stat	GLM-Binom	GLM-Binom_Stat	FISHER	FISHER_Stat	CHISQ	CHISQ_Stat
Variable
Total sample size_Unadjusted	190	NaN	NaN	19	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
sbp_Unadjusted	119.35 (15.17)	120.46 (110.23; 129.89)	0 (0.00)	120.98 (16.94)	122.12 (110.83; 127.83)	0 (0.00)	0.661163	5.832272e-183	0.661163	-0.438940	NaN	NaN	NaN	NaN	NaN	NaN
ldl_cholesterol_Unadjusted	5.11 (1.11)	5.18 (4.42; 5.75)	0 (0.00)	5.21 (0.79)	5.21 (4.76; 5.88)	0 (0.00)	0.711937	2.623666e-139	0.711937	-0.369763	NaN	NaN	NaN	NaN	NaN	NaN
diabetes_Unadjusted	29 (15.26)	NaN	0 (0.00)	5 (26.32)	NaN	0 (0.00)	NaN	NaN	NaN	NaN	0.220503	1.942967e-17	NaN	NaN	NaN	NaN
cvd_Unadjusted	83 (43.68)	NaN	0 (0.00)	10 (52.63)	NaN	0 (0.00)	NaN	NaN	NaN	NaN	0.456123	8.248046e-02	0.477321	1.432396	NaN	NaN
disease_severity: 0_Unadjusted	92 (48.42)	NaN	0 (0.00)	11 (57.89)	NaN	0 (0.00)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.272478	2.600396
disease_severity: 1_Unadjusted	59 (31.05)	NaN	0 (0.00)	7 (36.84)	NaN	0 (0.00)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
disease_severity: 2_Unadjusted	39 (20.53)	NaN	0 (0.00)	1 (5.26)	NaN	0 (0.00)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
treatment_response: 0_Unadjusted	122 (64.21)	NaN	0 (0.00)	12 (63.16)	NaN	0 (0.00)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.636175	0.904563
treatment_response: 1_Unadjusted	36 (18.95)	NaN	0 (0.00)	5 (26.32)	NaN	0 (0.00)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
treatment_response: 2_Unadjusted	32 (16.84)	NaN	0 (0.00)	2 (10.53)	NaN	0 (0.00)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
sbp_Adjusted	119.35 (15.17)	120.46 (110.23; 129.89)	0 (0.00)	120.98 (16.94)	122.12 (110.83; 127.83)	0 (0.00)	0.641524	6.937978e-70	0.661163	-0.438940	NaN	NaN	NaN	NaN	NaN	NaN
ldl_cholesterol_Adjusted	5.11 (1.11)	5.18 (4.42; 5.75)	0 (0.00)	5.21 (0.79)	5.21 (4.76; 5.88)	0 (0.00)	0.752053	9.567225e-38	0.711937	-0.369763	NaN	NaN	NaN	NaN	NaN	NaN
diabetes_Adjusted	29 (15.26)	NaN	0 (0.00)	5 (26.32)	NaN	0 (0.00)	NaN	NaN	NaN	NaN	0.226903	3.512443e-03	NaN	NaN	NaN	NaN
cvd_Adjusted	83 (43.68)	NaN	0 (0.00)	10 (52.63)	NaN	0 (0.00)	NaN	NaN	NaN	NaN	0.457350	6.220539e-01	0.477321	1.432396	NaN	NaN

[7]:

coef_table

[7]:

	Model	Model name	Stratification name	Variable	Exposure	Cases	Samples	Exposed	Non-exposed	Point Estimates	Standard Error	Test statistic	P-values	Formatted	Formatted (exp)
0	OLS	Unadjusted	<NA>	sbp	Intercept	0.0	209.0	19.0	190.0	119.351553	1.118009	NaN	5.832272e-183	119.35 (117.16; 121.54)	6819002940268249165055820291392844669167365204...
1	OLS	Unadjusted	<NA>	sbp	PCSK9	0.0	209.0	19.0	190.0	1.627597	3.708015	NaN	6.611626e-01	1.63 (-5.64; 8.90)	5.09 (0.00; 7296.66)
0	T	Unadjusted	<NA>	sbp	PCSK9	0.0	209.0	19.0	190.0	NaN	NaN	-0.438940	6.611626e-01	NaN	NaN
0	OLS	Unadjusted	<NA>	ldl_cholesterol	Intercept	0.0	209.0	19.0	190.0	5.109830	0.078967	NaN	2.623666e-139	5.11 (4.96; 5.26)	165.64 (141.89; 193.37)
1	OLS	Unadjusted	<NA>	ldl_cholesterol	PCSK9	0.0	209.0	19.0	190.0	0.096843	0.261904	NaN	7.119368e-01	0.10 (-0.42; 0.61)	1.10 (0.66; 1.84)
0	T	Unadjusted	<NA>	ldl_cholesterol	PCSK9	0.0	209.0	19.0	190.0	NaN	NaN	-0.369763	7.119368e-01	NaN	NaN
0	GLM-Binom	Unadjusted	<NA>	diabetes	Intercept	34.0	209.0	19.0	190.0	-1.714109	0.201727	NaN	1.942967e-17	-1.71 (-2.11; -1.32)	0.18 (0.12; 0.27)
1	GLM-Binom	Unadjusted	<NA>	diabetes	PCSK9	34.0	209.0	19.0	190.0	0.684489	0.558679	NaN	2.205027e-01	0.68 (-0.41; 1.78)	1.98 (0.66; 5.93)
0	GLM-Binom	Unadjusted	<NA>	cvd	Intercept	93.0	209.0	19.0	190.0	-0.253988	0.146267	NaN	8.248046e-02	-0.25 (-0.54; 0.03)	0.78 (0.58; 1.03)
1	GLM-Binom	Unadjusted	<NA>	cvd	PCSK9	93.0	209.0	19.0	190.0	0.359349	0.482188	NaN	4.561228e-01	0.36 (-0.59; 1.30)	1.43 (0.56; 3.69)
0	FISHER	Unadjusted	<NA>	cvd	PCSK9	93.0	209.0	19.0	190.0	NaN	NaN	1.432396	4.773212e-01	NaN	NaN
0	CHISQ	Unadjusted	<NA>	disease_severity	PCSK9	NaN	NaN	NaN	NaN	NaN	NaN	2.600396	2.724778e-01	NaN	NaN
0	CHISQ	Unadjusted	<NA>	treatment_response	PCSK9	NaN	NaN	NaN	NaN	NaN	NaN	0.904563	6.361750e-01	NaN	NaN
0	OLS	Adjusted	<NA>	sbp	Intercept	0.0	209.0	19.0	190.0	119.196798	4.386169	NaN	6.937978e-70	119.20 (110.60; 127.79)	5841327625673460048707196120176087049213956698...
1	OLS	Adjusted	<NA>	sbp	PCSK9	0.0	209.0	19.0	190.0	1.731601	3.713803	NaN	6.415239e-01	1.73 (-5.55; 9.01)	5.65 (0.00; 8188.79)
2	OLS	Adjusted	<NA>	sbp	age	0.0	209.0	19.0	190.0	-0.027879	0.072927	NaN	7.026475e-01	-0.03 (-0.17; 0.12)	0.97 (0.84; 1.12)
3	OLS	Adjusted	<NA>	sbp	sex	0.0	209.0	19.0	190.0	2.883533	2.166641	NaN	1.847077e-01	2.88 (-1.36; 7.13)	17.88 (0.26; 1248.96)
0	T	Adjusted	<NA>	sbp	PCSK9	0.0	209.0	19.0	190.0	NaN	NaN	-0.438940	6.611626e-01	NaN	NaN
0	OLS	Adjusted	<NA>	ldl_cholesterol	Intercept	0.0	209.0	19.0	190.0	4.940407	0.309923	NaN	9.567225e-38	4.94 (4.33; 5.55)	139.83 (76.17; 256.68)
1	OLS	Adjusted	<NA>	ldl_cholesterol	PCSK9	0.0	209.0	19.0	190.0	0.083016	0.262414	NaN	7.520534e-01	0.08 (-0.43; 0.60)	1.09 (0.65; 1.82)
2	OLS	Adjusted	<NA>	ldl_cholesterol	age	0.0	209.0	19.0	190.0	0.004640	0.005153	NaN	3.688913e-01	0.00 (-0.01; 0.01)	1.00 (0.99; 1.01)
3	OLS	Adjusted	<NA>	ldl_cholesterol	sex	0.0	209.0	19.0	190.0	-0.148852	0.153093	NaN	3.320483e-01	-0.15 (-0.45; 0.15)	0.86 (0.64; 1.16)
0	T	Adjusted	<NA>	ldl_cholesterol	PCSK9	0.0	209.0	19.0	190.0	NaN	NaN	-0.369763	7.119368e-01	NaN	NaN
0	GLM-Binom	Adjusted	<NA>	diabetes	Intercept	34.0	209.0	19.0	190.0	-2.390569	0.818990	NaN	3.512443e-03	-2.39 (-4.00; -0.79)	0.09 (0.02; 0.46)
1	GLM-Binom	Adjusted	<NA>	diabetes	PCSK9	34.0	209.0	19.0	190.0	0.679312	0.562170	NaN	2.269031e-01	0.68 (-0.42; 1.78)	1.97 (0.66; 5.94)
2	GLM-Binom	Adjusted	<NA>	diabetes	age	34.0	209.0	19.0	190.0	0.006939	0.013147	NaN	5.976202e-01	0.01 (-0.02; 0.03)	1.01 (0.98; 1.03)
3	GLM-Binom	Adjusted	<NA>	diabetes	sex	34.0	209.0	19.0	190.0	0.459180	0.399071	NaN	2.498874e-01	0.46 (-0.32; 1.24)	1.58 (0.72; 3.46)
0	GLM-Binom	Adjusted	<NA>	cvd	Intercept	93.0	209.0	19.0	190.0	-0.282808	0.573716	NaN	6.220539e-01	-0.28 (-1.41; 0.84)	0.75 (0.24; 2.32)
1	GLM-Binom	Adjusted	<NA>	cvd	PCSK9	93.0	209.0	19.0	190.0	0.358869	0.482859	NaN	4.573501e-01	0.36 (-0.59; 1.31)	1.43 (0.56; 3.69)
2	GLM-Binom	Adjusted	<NA>	cvd	age	93.0	209.0	19.0	190.0	0.000277	0.009536	NaN	9.768179e-01	0.00 (-0.02; 0.02)	1.00 (0.98; 1.02)
3	GLM-Binom	Adjusted	<NA>	cvd	sex	93.0	209.0	19.0	190.0	0.022824	0.283299	NaN	9.357882e-01	0.02 (-0.53; 0.58)	1.02 (0.59; 1.78)
0	FISHER	Adjusted	<NA>	cvd	PCSK9	93.0	209.0	19.0	190.0	NaN	NaN	1.432396	4.773212e-01	NaN	NaN
0	CHISQ	Adjusted	<NA>	disease_severity	PCSK9	NaN	NaN	NaN	NaN	NaN	NaN	2.600396	2.724778e-01	NaN	NaN
0	CHISQ	Adjusted	<NA>	treatment_response	PCSK9	NaN	NaN	NaN	NaN	NaN	NaN	0.904563	6.361750e-01	NaN	NaN

Multi-model covariate comparison

The coef_table above contains results for both the Unadjusted (no covariates) and Adjusted (age + sex) models. Sorting by exposure, outcome, and model places the two rows for each exposure–outcome pair adjacent to each other, making it easy to see whether adjusting for covariates changes the association.

Columns of interest:

Model name — Unadjusted or Adjusted
coef / Point Estimates — effect size
P-values — statistical significance

A smaller p-value after adjustment may indicate that the raw association was confounded by age or sex. A larger p-value after adjustment is common when the covariate explains some of the outcome variance.

[8]:

# Exposure name used in this analysis
exposure_col_name = 'PCSK9'

# Filter to the exposure rows (exclude intercept and covariate rows),
# then sort by outcome (Variable) and model so that Unadjusted and Adjusted
# rows for the same outcome appear adjacent.
(
    coef_table
    .rename(columns={'Model name': 'model', 'Variable': 'outcome', 'Point Estimates': 'coef', 'P-values': 'pvalue'})
    .query("Exposure == @exposure_col_name")
    .sort_values(['outcome', 'model'])
    [['model', 'outcome', 'coef', 'pvalue']]
    .head(12)
)

[8]:

	model	outcome	coef	pvalue
1	Adjusted	cvd	0.358869	0.457350
0	Adjusted	cvd	NaN	0.477321
1	Unadjusted	cvd	0.359349	0.456123
0	Unadjusted	cvd	NaN	0.477321
1	Adjusted	diabetes	0.679312	0.226903
1	Unadjusted	diabetes	0.684489	0.220503
0	Adjusted	disease_severity	NaN	0.272478
0	Unadjusted	disease_severity	NaN	0.272478
1	Adjusted	ldl_cholesterol	0.083016	0.752053
0	Adjusted	ldl_cholesterol	NaN	0.711937
1	Unadjusted	ldl_cholesterol	0.096843	0.711937
0	Unadjusted	ldl_cholesterol	NaN	0.711937

Pairwise vs listwise missingness

MARVEL’s MissingnessHandler supports two strategies for dealing with missing data:

Listwise deletion (conservative): drops every sample that has a missing value in any of the specified columns. The same reduced sample set is used for all downstream tests — consistent but potentially wasteful.
Pairwise deletion (maximises N): drops only the samples that are missing the specific columns required for each individual test (exposure + outcome + covariates). Different tests may therefore use different sample counts.

The demo below operates on raw_merged — the merged dataframe before any dropna call — so that the comparison is meaningful. If the data happen to have identical missingness patterns across all variables (making listwise == pairwise for every outcome), synthetic missingness is introduced for illustration.

[9]:

# Variables used for the missingness demo
exposure_miss = 'PCSK9'
outcomes_miss = ['sbp', 'hypertension']
covariates_miss = ['age', 'sex']

all_cols = [exposure_miss] + outcomes_miss + covariates_miss

# Check listwise vs pairwise counts on the raw merged data
n_listwise_raw = len(
    MissingnessHandler(listwise=True, verbose=False).handle_listwise(
        data=raw_merged, columns=all_cols
    )
)
n_pairwise_raw = {
    outcome: len(
        MissingnessHandler(listwise=False, verbose=False).handle_pairwise(
            data=raw_merged,
            exposure=exposure_miss,
            outcome=outcome,
            covariates=covariates_miss,
        )
    )
    for outcome in outcomes_miss
}

# Check whether pairwise counts differ from listwise — if not, introduce synthetic
# missingness so the comparison is meaningful
counts_differ = any(n != n_listwise_raw for n in n_pairwise_raw.values())

if counts_differ:
    raw_merged_demo = raw_merged
    print("Using real data (pairwise and listwise counts already differ).")
else:
    # --- Introducing synthetic missingness for demo purposes ---
    # Existing data had identical listwise/pairwise counts, so we inject
    # controlled NaNs to create a divergent comparison.
    rng = np.random.default_rng(42)
    raw_merged_demo = raw_merged.copy()
    raw_merged_demo.loc[raw_merged_demo.index[::5], 'sbp'] = np.nan
    raw_merged_demo.loc[raw_merged_demo.index[::7], 'hypertension'] = np.nan
    print("Synthetic missingness introduced for demo purposes.")
    print("  sbp NaN every 5th row, hypertension NaN every 7th row.")
print()

Using real data (pairwise and listwise counts already differ).

[10]:

# --- Listwise deletion ---
# Drops any sample missing in ANY of the specified columns.
# One fixed sample size applies to all tests.
n_listwise = len(
    MissingnessHandler(listwise=True, verbose=False).handle_listwise(
        data=raw_merged_demo,
        columns=all_cols,
    )
)
print(f"Listwise retained N (applied once for all outcomes): {n_listwise}")
print()

# --- Pairwise deletion ---
# Drops only samples missing for the specific exposure + outcome + covariates.
# Each outcome may retain a different N.
pairwise_handler = MissingnessHandler(listwise=False, verbose=False)
print("Pairwise retained N (per outcome):")
for outcome in outcomes_miss:
    n_pw = len(
        pairwise_handler.handle_pairwise(
            data=raw_merged_demo,
            exposure=exposure_miss,
            outcome=outcome,
            covariates=covariates_miss,
        )
    )
    print(f"  {outcome}: {n_pw}")

Listwise retained N (applied once for all outcomes): 743

Pairwise retained N (per outcome):
  sbp: 788
  hypertension: 787

Survival analysis

MARVEL supports Cox proportional hazards regression via the 'Cox-PH' test key. This requires:

An event indicator column (binary 0/1) — here cvd (cardiovascular disease event).
A follow-up time column containing positive values — here follow_up_time (years, clipped to a minimum of 0.1).
A survival_time_map mapping the event column to the time column, passed to tester.run().

The survival_df below is assembled from the raw merged data to ensure both cvd and follow_up_time are present alongside the genetic exposure and covariates. Rows with missing values in any required column are dropped before running.

[11]:

# Build a survival dataframe that includes the time column.
# raw_merged contains all columns from the merge (genotypes + phenotypes + covariates),
# including follow_up_time from phenotypes.tsv.
survival_cols = ['id', 'PCSK9', 'cvd', 'follow_up_time', 'age', 'sex']
survival_df = raw_merged[survival_cols].dropna().copy()

print(f"Survival analysis dataframe: {len(survival_df)} samples")
print(f"CVD events: {int(survival_df['cvd'].sum())} ({survival_df['cvd'].mean():.1%})")
print(f"Follow-up time range: {survival_df['follow_up_time'].min():.2f} – {survival_df['follow_up_time'].max():.2f} years")
survival_df.head()

Survival analysis dataframe: 804 samples
CVD events: 355 (44.2%)
Follow-up time range: 0.10 – 37.35 years

[11]:

	id	cvd	follow_up_time	age	sex
0	ID_000000	0.0	12.02	59.720649	0.0
1	ID_000001	0.0	11.68	67.088467	0.0
3	ID_000003	0.0	1.40	58.368992	1.0
4	ID_000004	0.0	0.43	75.751361	0.0
5	ID_000005	1.0	7.26	62.097949	0.0

[12]:

# Covariates dataframe for the survival model
survival_cov_df = survival_df[['id', 'age', 'sex']].copy()

# survival_time_map: event column -> time column
survival_time_map = {'cvd': 'follow_up_time'}

# Test dictionary: 'survival' type with Cox-PH test
survival_test_dict = {
    'survival': {'cvd': 'Cox-PH'},
}

# Covariate models
survival_models = {
    'Unadjusted': 'None',
    'Adjusted': 'age;sex',
}

# Run survival analysis
survival_base, survival_coef = tester.run(
    data=survival_df,
    exposure='PCSK9',
    models=survival_models,
    test_dict=survival_test_dict,
    cov=survival_cov_df,
    id_column='id',
    survival_time_map=survival_time_map,
)
survival_base

[12]:

	0.0			1.0			Difference
	N (%) or Mean (SD)	Median (Q1; Q3)	Missing (%)	N (%) or Mean (SD)	Median (Q1; Q3)	Missing (%)	Cox-PH
Variable
Total sample size_Unadjusted	742	NaN	NaN	62	NaN	NaN	NaN
follow_up_time_Unadjusted	5.11 (5.15)	3.45 (1.48; 7.01)	0 (0.00)	6.29 (6.59)	4.12 (1.81; 8.53)	0 (0.00)	NaN
cvd_Unadjusted	334 (45.01)	NaN	0 (0.00)	21 (33.87)	NaN	0 (0.00)	0.046489
cvd_Adjusted	334 (45.01)	NaN	0 (0.00)	21 (33.87)	NaN	0 (0.00)	0.053219

[13]:

# Cox-PH coefficient table: hazard ratios and p-values per model
survival_coef[['Model name', 'Variable', 'Point Estimates', 'P-values', 'Formatted (exp)']]

[13]:

	Model name	Variable	Point Estimates	P-values	Formatted (exp)
0	Unadjusted	cvd	-0.448305	0.046489	0.64 (0.41; 0.99)
0	Adjusted	cvd	-0.435733	0.053219	0.65 (0.42; 1.01)
1	Adjusted	cvd	-0.001516	0.679030	1.00 (0.99; 1.01)
2	Adjusted	cvd	0.145628	0.172169	1.16 (0.94; 1.43)