Command-line interface example
This notebook demonstrates the primary way to run MARVELous: via the marvelous command-line interface. The steps are:
Prepare input files (bundled example data)
Create a configuration file using
create_configRun
marvelous <config_file>from the shell (shown here using Jupyter’s!magic)Inspect the output files
Clean up the temporary directory
Outside a notebook you would run the same command directly in your terminal:
marvelous path/to/example.cnf --verbose
Setup
[1]:
import glob
import os
import re
import shutil
import tempfile
from marvel.example_data import examples
from marvel.constants import CONFIG_LHS as CFLhs
from marvel.utils.config_tools import create_config
# Trigger generation and get the directory containing all example data files
input_dir = os.path.dirname(examples.get_path('vcf_example', 'example_genotypes_chr1.vcf.gz'))
# Temporary working directory — removed at the end of this notebook
output_dir = tempfile.mkdtemp(prefix="marvel_cli_example_")
config_file = os.path.join(output_dir, "example.cnf")
print(f"Working directory: {output_dir}")
/usr/local/lib/python3.12/site-packages/marvel/utils/utils.py:455: Warning: The environmental variable `MARVEL_TEST_DEFS` is not set. Trying to recover using the default configuration files in /usr/local/lib/python3.12/site-packages/marvel/association/tests.py.
warnings.warn(
Working directory: /tmp/marvel_cli_example_ixe91nub
Step 1: Prepare input files
MARVELous ships with a small set of synthetic example datasets. Here we locate the VCF files, variant list, and phenotype/covariate files that will be used for the analysis.
[2]:
# One VCF file per chromosome — build a {chrom: path} mapping
geno_files = glob.glob(os.path.join(input_dir, "example*vcf.gz"))
geno_input = {
re.search(r"chr\d+", f, re.IGNORECASE).group(0): f
for f in geno_files
}
pheno_file = os.path.join(input_dir, "phenotypes.tsv")
cov_file = os.path.join(input_dir, "covariates.tsv")
# Statistical tests per outcome type
con_tests = {
"sbp" : ["OLS", "KW"],
"ldl_cholesterol" : ["OLS", "AOV"],
"crp_level" : ["OLS", "KW", "AOV"],
}
cat_tests = {
"disease_severity" : ["CHISQ"],
"treatment_response" : ["CHISQ"],
"risk_category" : ["CHISQ"],
}
bin_tests = {
"hypertension" : ["GLM-Binom", "CHISQ"],
"diabetes" : ["GLM-Binom", "FISHER"],
"cvd" : ["GLM-Binom", "CHISQ", "FISHER"],
}
print(f"Found {len(geno_input)} VCF file(s): {sorted(geno_input)}")
Found 5 VCF file(s): ['chr1', 'chr16', 'chr22', 'chr5', 'chr7']
Step 2: Create the configuration file
create_config writes a .cnf file with all the paths and options that marvelous will read. Using this function is recommended because it performs basic validation before writing.
[3]:
create_config(
path=config_file,
extract_variants=True,
geno_input=geno_input,
variant_input={"variants": os.path.join(input_dir, "gene_variants.tsv")},
variant_output={CFLhs.VarOutput: os.path.join(output_dir, "variant_carriers")},
association_analysis=True,
pheno_input={CFLhs.PhenoFile: pheno_file, CFLhs.CovFile: cov_file},
con_tests=con_tests,
cat_tests=cat_tests,
bin_tests=bin_tests,
covs={"Unadjusted": "None", "Adjusted": ["age", "sex"]},
id_column="id",
raise_on_error=True,
ref_column="a1",
alt_column="a2",
cat_column="gene",
incl_var=False,
output_path=output_dir,
)
print(f"Configuration file written to: {config_file}")
# Display the raw configuration file so you can see what was written
with open(config_file) as fh:
print(fh.read())
Configuration file written to: /tmp/marvel_cli_example_ixe91nub/example.cnf
[GenoInput]
chr7 /tmp/marvel_example_zc9hxuff/example_genotypes_chr7.vcf.gz
chr5 /tmp/marvel_example_zc9hxuff/example_genotypes_chr5.vcf.gz
chr16 /tmp/marvel_example_zc9hxuff/example_genotypes_chr16.vcf.gz
chr22 /tmp/marvel_example_zc9hxuff/example_genotypes_chr22.vcf.gz
chr1 /tmp/marvel_example_zc9hxuff/example_genotypes_chr1.vcf.gz
[VarInput]
variants /tmp/marvel_example_zc9hxuff/gene_variants.tsv
[Output]
VarOutput /tmp/marvel_cli_example_ixe91nub/variant_carriers
[ExpInput]
VarOutput /tmp/marvel_cli_example_ixe91nub/variant_carriers
[PhenoInput]
phenotypes /tmp/marvel_example_zc9hxuff/phenotypes.tsv
covariates /tmp/marvel_example_zc9hxuff/covariates.tsv
[ConTests]
sbp OLS;KW
ldl_cholesterol OLS;AOV
crp_level OLS;KW;AOV
[CatTests]
disease_severity CHISQ
treatment_response CHISQ
risk_category CHISQ
[BinTests]
hypertension GLM-Binom;CHISQ
diabetes GLM-Binom;FISHER
cvd GLM-Binom;CHISQ;FISHER
[SurvTests]
[Covs]
Unadjusted None
Adjusted age;sex
[Stratification]
[Options]
id_column id
raise_on_error True
ref_column a1
alt_column a2
cat_column gene
incl_var False
output_path /tmp/marvel_cli_example_ixe91nub
extract_variants True
association_analysis True
Step 3: Run the pipeline from the command line
The ! prefix in Jupyter executes the command in a shell, exactly as if you typed it in your terminal. Outside a notebook you would run:
marvelous path/to/example.cnf --verbose
MARVELous will:
Extract variant carriers from the VCF files
Aggregate carriers to gene level
Run association tests for all specified outcomes and covariate models
[4]:
!marvelous "{config_file}" --verbose
/usr/local/lib/python3.12/site-packages/marvel/utils/utils.py:455: Warning: The environmental variable `MARVEL_TEST_DEFS` is not set. Trying to recover using the default configuration files in /usr/local/lib/python3.12/site-packages/marvel/association/tests.py.
warnings.warn(
/usr/local/lib/python3.12/site-packages/marvel/utils/utils.py:455: Warning: The environmental variable `MARVEL_TEST_DEFS` is not set. Trying to recover using the default configuration files in /usr/local/lib/python3.12/site-packages/marvel/association/tests.py.
warnings.warn(
=== MARVEL (marvel v0.4.1a0) ===
[info] 22:46:10 > config_file value: /tmp/marvel_cli_example_ixe91nub/example.cnf
[info] 22:46:10 > dry_run value: False
[info] 22:46:10 > log_file value: None
[info] 22:46:10 > outpath value: None
[info] 22:46:11 > verbose value: True
[info] 22:46:11 > ================================================================================
[info] 22:46:11 > MARVELous Pipeline Starting
[info] 22:46:11 > ================================================================================
[info] 22:46:11 >
[info] 22:46:11 > STEP 1: VARIANT EXTRACTION
[info] 22:46:11 > --------------------------------------------------------------------------------
[info] 22:46:11 > Starting variant extraction
[info] 22:46:11 > Set up 5 extraction task(s)
[info] 22:46:11 > Gene aggregation enabled - using two-phase approach
[info] 22:46:11 > Checkpoint directory: ./tmp_check
[info] 22:46:11 > Will skip tasks with existing output files
[info] 22:46:11 > Starting parallel extraction with 2 workers
[info] 22:46:12 > [1/5] Completed: chr7_variants
[info] 22:46:12 > [2/5] Completed: chr5_variants
[info] 22:46:12 > [3/5] Completed: chr16_variants
[info] 22:46:12 > [4/5] Completed: chr22_variants
[info] 22:46:13 > [5/5] Completed: chr1_variants
[info] 22:46:13 > Aggregating 30 variants to genes...
[info] 22:46:13 > Successful: 5
[info] 22:46:13 > Saved summary: /tmp/marvel_cli_example_ixe91nub/variant_carriers_summary.tsv.gz
[info] 22:46:13 > Saved carriers: /tmp/marvel_cli_example_ixe91nub/variant_carriers_carriers.tsv.gz
[info] 22:46:13 > Extraction complete: 1 file(s) created
[info] 22:46:13 >
[info] 22:46:13 > STEP 2: ASSOCIATION TESTING
[info] 22:46:13 > --------------------------------------------------------------------------------
[info] 22:46:13 > Starting association testing
[info] 22:46:13 > Testing 6 exposure(s)
[info] 22:46:13 > Checking missingness thresholds (max allowed: 50.0%)
[info] 22:46:13 > Missingness filtering complete. Retained 9 outcomes, 6 exposures
[info] 22:46:13 > Pre-computing stratification model: 'overall'
[info] 22:46:13 > Computing valid IDs for listwise deletion
[info] 22:46:13 > Processing file: /tmp/marvel_example_zc9hxuff/phenotypes.tsv
[info] 22:46:14 > ID extraction: 625 valid IDs from 1000 samples (375 removed, 37.50%)
[info] 22:46:14 > Final valid IDs across all files: 625
[info] 22:46:14 > Phenotype file: 625 valid samples (checked 9 outcomes + 0 time columns)
[info] 22:46:14 > Processing file: /tmp/marvel_example_zc9hxuff/covariates.tsv
[info] 22:46:14 > ID extraction: 950 valid IDs from 1000 samples (50 removed, 5.00%)
[info] 22:46:14 > Final valid IDs across all files: 950
[info] 22:46:14 > Covariate file: 950 valid samples
[info] 22:46:14 > Processing file: /tmp/marvel_cli_example_ixe91nub/variant_carriers_carriers.tsv.gz
[info] 22:46:14 > Final valid IDs across all files: 1000
[info] 22:46:14 > Exposure files: 1000 valid samples
[info] 22:46:14 > Listwise deletion: 595 valid samples (intersection)
[info] 22:46:14 > Pre-computed 1 stratification model(s)
[info] 22:46:14 > Testing exposure: ANGPTL3
[info] 22:46:14 > Loaded: {0: 956, 1: 44}
[info] 22:46:14 > Stratification model: 'overall'
[info] 22:46:14 > Cached exposure 'ANGPTL3' in DataManager
[info] 22:46:25 > Completed: 58 results rows generated
[info] 22:46:25 > Baseline table saved: /tmp/marvel_cli_example_ixe91nub/ANGPTL3_overall_baseline.tsv
[info] 22:46:25 > Results saved: /tmp/marvel_cli_example_ixe91nub/ANGPTL3_results.tsv.gz (58 rows from 1 stratum/strata)
[info] 22:46:25 > Testing exposure: CETP
[info] 22:46:25 > Loaded: {0: 950, 1: 50}
[info] 22:46:25 > Stratification model: 'overall'
[info] 22:46:25 > Cached exposure 'CETP' in DataManager
[info] 22:46:36 > Completed: 58 results rows generated
[info] 22:46:36 > Baseline table saved: /tmp/marvel_cli_example_ixe91nub/CETP_overall_baseline.tsv
[info] 22:46:36 > Results saved: /tmp/marvel_cli_example_ixe91nub/CETP_results.tsv.gz (58 rows from 1 stratum/strata)
[info] 22:46:36 > Testing exposure: HMGCR
[info] 22:46:36 > Loaded: {0: 947, 1: 53}
[info] 22:46:36 > Stratification model: 'overall'
[info] 22:46:36 > Cached exposure 'HMGCR' in DataManager
[info] 22:46:47 > Completed: 58 results rows generated
[info] 22:46:47 > Baseline table saved: /tmp/marvel_cli_example_ixe91nub/HMGCR_overall_baseline.tsv
[info] 22:46:47 > Results saved: /tmp/marvel_cli_example_ixe91nub/HMGCR_results.tsv.gz (58 rows from 1 stratum/strata)
[info] 22:46:47 > Testing exposure: NPC1L1
[info] 22:46:47 > Loaded: {0: 926, 1: 72, 2: 2}
[info] 22:46:47 > Stratification model: 'overall'
[info] 22:46:47 > Cached exposure 'NPC1L1' in DataManager
[info] 22:46:57 > Completed: 58 results rows generated
[info] 22:46:57 > Baseline table saved: /tmp/marvel_cli_example_ixe91nub/NPC1L1_overall_baseline.tsv
[info] 22:46:57 > Results saved: /tmp/marvel_cli_example_ixe91nub/NPC1L1_results.tsv.gz (58 rows from 1 stratum/strata)
[info] 22:46:57 > Testing exposure: PCSK9
[info] 22:46:57 > Loaded: {0: 955, 1: 44, 2: 1}
[info] 22:46:57 > Stratification model: 'overall'
[info] 22:46:57 > Cached exposure 'PCSK9' in DataManager
[info] 22:47:09 > Completed: 58 results rows generated
[info] 22:47:09 > Baseline table saved: /tmp/marvel_cli_example_ixe91nub/PCSK9_overall_baseline.tsv
[info] 22:47:09 > Results saved: /tmp/marvel_cli_example_ixe91nub/PCSK9_results.tsv.gz (58 rows from 1 stratum/strata)
[info] 22:47:09 > Testing exposure: PPARA
[info] 22:47:10 > Loaded: {0: 947, 1: 52, 2: 1}
[info] 22:47:10 > Stratification model: 'overall'
[info] 22:47:10 > Cached exposure 'PPARA' in DataManager
[info] 22:47:22 > Completed: 58 results rows generated
[info] 22:47:22 > Baseline table saved: /tmp/marvel_cli_example_ixe91nub/PPARA_overall_baseline.tsv
[info] 22:47:22 > Results saved: /tmp/marvel_cli_example_ixe91nub/PPARA_results.tsv.gz (58 rows from 1 stratum/strata)
[info] 22:47:22 > DataManager cache statistics (stratification 'overall'):
[info] 22:47:22 > Outcomes: 99 hits, 9 misses, hit rate: 91.67%
[info] 22:47:22 > Covariates: 53 hits, 1 misses, hit rate: 98.15%
[info] 22:47:22 > Exposure reuses: 0
[info] 22:47:22 > Association testing complete
[info] 22:47:22 >
[info] 22:47:22 > ================================================================================
[info] 22:47:22 > MARVELous Pipeline Completed Successfully
[info] 22:47:22 > ================================================================================
[info] 22:47:22 >
[info] 22:47:22 > Pipeline Summary:
[info] 22:47:22 > config: {'extract_variants': True, 'association_analysis': True, 'output_path': '/tmp/marvel_cli_example_ixe91nub'}
[info] 22:47:22 > extraction: {'files_created': 1, 'successful_tasks': 5, 'failed_tasks': 0}
[info] 22:47:22 > association: {'completed': 6, 'failed': 0, 'skipped': 0}
Step 4: Inspect the output
MARVELous writes two types of output files:
``*_carriers.tsv.gz`` — one row per sample, one column per gene (carrier counts)
``*_results.tsv.gz`` — association test results (one row per gene × outcome × model)
See the Output files section of the documentation for a full description of every column.
[5]:
import pandas as pd
# List all files written to the output directory
output_files = sorted(
os.path.relpath(p, output_dir)
for p in glob.glob(os.path.join(output_dir, "**"), recursive=True)
if os.path.isfile(p)
)
print("Output files:")
for f in output_files:
print(f" {f}")
Output files:
ANGPTL3_overall_baseline.tsv
ANGPTL3_results.tsv.gz
CETP_overall_baseline.tsv
CETP_results.tsv.gz
HMGCR_overall_baseline.tsv
HMGCR_results.tsv.gz
NPC1L1_overall_baseline.tsv
NPC1L1_results.tsv.gz
PCSK9_overall_baseline.tsv
PCSK9_results.tsv.gz
PPARA_overall_baseline.tsv
PPARA_results.tsv.gz
example.cnf
variant_carriers_carriers.tsv.gz
variant_carriers_summary.tsv.gz
[6]:
# Load the gene-level carrier matrix
carriers_file = glob.glob(os.path.join(output_dir, "*_carriers.tsv.gz"))[0]
carriers = pd.read_csv(carriers_file, sep="\t")
print(f"Carrier matrix shape: {carriers.shape}")
carriers.head()
Carrier matrix shape: (1000, 7)
[6]:
| id | PCSK9 | ANGPTL3 | HMGCR | NPC1L1 | CETP | PPARA | |
|---|---|---|---|---|---|---|---|
| 0 | ID_000000 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ID_000001 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ID_000002 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ID_000003 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ID_000004 | 0 | 0 | 0 | 0 | 0 | 0 |
[7]:
# Load one association results file and show the top hits
results_files = glob.glob(os.path.join(output_dir, "*_results.tsv.gz"))
results = pd.read_csv(results_files[0], sep="\t")
print(f"Results shape: {results.shape}")
results.head(10)
Results shape: (58, 15)
[7]:
| Model | Model name | Stratification name | Variable | Exposure | Cases | Samples | Exposed | Non-exposed | Point Estimates | Standard Error | Test statistic | P-values | Formatted | Formatted (exp) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OLS | Unadjusted | overall | sbp | Intercept | 0.0 | 595.0 | 43.0 | 552.0 | 120.180721 | 0.651921 | NaN | 0.000000e+00 | 120.18 (118.90; 121.46) | 1562513444283125989701933543434768202131569939... |
| 1 | OLS | Unadjusted | overall | sbp | NPC1L1 | 0.0 | 595.0 | 43.0 | 552.0 | 0.696543 | 2.425043 | NaN | 7.740373e-01 | 0.70 (-4.06; 5.45) | 2.01 (0.02; 232.65) |
| 2 | KW | Unadjusted | overall | sbp | NPC1L1 | 0.0 | 595.0 | 43.0 | 552.0 | NaN | NaN | 0.086862 | 7.682057e-01 | NaN | NaN |
| 3 | OLS | Unadjusted | overall | ldl_cholesterol | Intercept | 0.0 | 595.0 | 43.0 | 552.0 | 5.132331 | 0.047475 | NaN | 0.000000e+00 | 5.13 (5.04; 5.23) | 169.41 (154.36; 185.93) |
| 4 | OLS | Unadjusted | overall | ldl_cholesterol | NPC1L1 | 0.0 | 595.0 | 43.0 | 552.0 | -0.016067 | 0.176601 | NaN | 9.275404e-01 | -0.02 (-0.36; 0.33) | 0.98 (0.70; 1.39) |
| 5 | AOV | Unadjusted | overall | ldl_cholesterol | NPC1L1 | 0.0 | 595.0 | 43.0 | 552.0 | NaN | NaN | 0.008277 | 9.275404e-01 | NaN | NaN |
| 6 | OLS | Unadjusted | overall | crp_level | Intercept | 0.0 | 595.0 | 43.0 | 552.0 | 37.746194 | 0.852583 | NaN | 3.878451e-190 | 37.75 (36.08; 39.42) | 24715186084767916.00 (4647764053864573.00; 131... |
| 7 | OLS | Unadjusted | overall | crp_level | NPC1L1 | 0.0 | 595.0 | 43.0 | 552.0 | -0.471179 | 3.171472 | NaN | 8.819452e-01 | -0.47 (-6.69; 5.74) | 0.62 (0.00; 312.56) |
| 8 | KW | Unadjusted | overall | crp_level | NPC1L1 | 0.0 | 595.0 | 43.0 | 552.0 | NaN | NaN | 0.003367 | 9.537299e-01 | NaN | NaN |
| 9 | AOV | Unadjusted | overall | crp_level | NPC1L1 | 0.0 | 595.0 | 43.0 | 552.0 | NaN | NaN | 0.022072 | 8.819452e-01 | NaN | NaN |
Step 5: Cleanup
Remove the temporary directory created at the start of this notebook.
[8]:
shutil.rmtree(output_dir)
print(f"Removed temporary directory: {output_dir}")
Removed temporary directory: /tmp/marvel_cli_example_ixe91nub