Command-line interface example

This notebook demonstrates the primary way to run MARVELous: via the marvelous command-line interface. The steps are:

  1. Prepare input files (bundled example data)

  2. Create a configuration file using create_config

  3. Run marvelous <config_file> from the shell (shown here using Jupyter’s ! magic)

  4. Inspect the output files

  5. Clean up the temporary directory

Outside a notebook you would run the same command directly in your terminal:

marvelous path/to/example.cnf --verbose

Setup

[1]:
import glob
import os
import re
import shutil
import tempfile

from marvel.example_data import examples
from marvel.constants import CONFIG_LHS as CFLhs
from marvel.utils.config_tools import create_config

# Trigger generation and get the directory containing all example data files
input_dir = os.path.dirname(examples.get_path('vcf_example', 'example_genotypes_chr1.vcf.gz'))

# Temporary working directory — removed at the end of this notebook
output_dir = tempfile.mkdtemp(prefix="marvel_cli_example_")
config_file = os.path.join(output_dir, "example.cnf")

print(f"Working directory: {output_dir}")
/usr/local/lib/python3.12/site-packages/marvel/utils/utils.py:455: Warning: The environmental variable `MARVEL_TEST_DEFS` is not set. Trying to recover using the default configuration files in /usr/local/lib/python3.12/site-packages/marvel/association/tests.py.
  warnings.warn(
Working directory: /tmp/marvel_cli_example_ixe91nub

Step 1: Prepare input files

MARVELous ships with a small set of synthetic example datasets. Here we locate the VCF files, variant list, and phenotype/covariate files that will be used for the analysis.

[2]:
# One VCF file per chromosome — build a {chrom: path} mapping
geno_files = glob.glob(os.path.join(input_dir, "example*vcf.gz"))
geno_input = {
    re.search(r"chr\d+", f, re.IGNORECASE).group(0): f
    for f in geno_files
}

pheno_file = os.path.join(input_dir, "phenotypes.tsv")
cov_file   = os.path.join(input_dir, "covariates.tsv")

# Statistical tests per outcome type
con_tests = {
    "sbp"             : ["OLS", "KW"],
    "ldl_cholesterol" : ["OLS", "AOV"],
    "crp_level"       : ["OLS", "KW", "AOV"],
}
cat_tests = {
    "disease_severity"   : ["CHISQ"],
    "treatment_response" : ["CHISQ"],
    "risk_category"      : ["CHISQ"],
}
bin_tests = {
    "hypertension" : ["GLM-Binom", "CHISQ"],
    "diabetes"     : ["GLM-Binom", "FISHER"],
    "cvd"          : ["GLM-Binom", "CHISQ", "FISHER"],
}

print(f"Found {len(geno_input)} VCF file(s): {sorted(geno_input)}")
Found 5 VCF file(s): ['chr1', 'chr16', 'chr22', 'chr5', 'chr7']

Step 2: Create the configuration file

create_config writes a .cnf file with all the paths and options that marvelous will read. Using this function is recommended because it performs basic validation before writing.

[3]:
create_config(
    path=config_file,
    extract_variants=True,
    geno_input=geno_input,
    variant_input={"variants": os.path.join(input_dir, "gene_variants.tsv")},
    variant_output={CFLhs.VarOutput: os.path.join(output_dir, "variant_carriers")},
    association_analysis=True,
    pheno_input={CFLhs.PhenoFile: pheno_file, CFLhs.CovFile: cov_file},
    con_tests=con_tests,
    cat_tests=cat_tests,
    bin_tests=bin_tests,
    covs={"Unadjusted": "None", "Adjusted": ["age", "sex"]},
    id_column="id",
    raise_on_error=True,
    ref_column="a1",
    alt_column="a2",
    cat_column="gene",
    incl_var=False,
    output_path=output_dir,
)

print(f"Configuration file written to: {config_file}")

# Display the raw configuration file so you can see what was written
with open(config_file) as fh:
    print(fh.read())
Configuration file written to: /tmp/marvel_cli_example_ixe91nub/example.cnf
[GenoInput]
chr7    /tmp/marvel_example_zc9hxuff/example_genotypes_chr7.vcf.gz
chr5    /tmp/marvel_example_zc9hxuff/example_genotypes_chr5.vcf.gz
chr16   /tmp/marvel_example_zc9hxuff/example_genotypes_chr16.vcf.gz
chr22   /tmp/marvel_example_zc9hxuff/example_genotypes_chr22.vcf.gz
chr1    /tmp/marvel_example_zc9hxuff/example_genotypes_chr1.vcf.gz
[VarInput]
variants        /tmp/marvel_example_zc9hxuff/gene_variants.tsv
[Output]
VarOutput       /tmp/marvel_cli_example_ixe91nub/variant_carriers
[ExpInput]
VarOutput       /tmp/marvel_cli_example_ixe91nub/variant_carriers
[PhenoInput]
phenotypes      /tmp/marvel_example_zc9hxuff/phenotypes.tsv
covariates      /tmp/marvel_example_zc9hxuff/covariates.tsv
[ConTests]
sbp     OLS;KW
ldl_cholesterol OLS;AOV
crp_level       OLS;KW;AOV
[CatTests]
disease_severity        CHISQ
treatment_response      CHISQ
risk_category   CHISQ
[BinTests]
hypertension    GLM-Binom;CHISQ
diabetes        GLM-Binom;FISHER
cvd     GLM-Binom;CHISQ;FISHER
[SurvTests]
[Covs]
Unadjusted      None
Adjusted        age;sex
[Stratification]
[Options]
id_column       id
raise_on_error  True
ref_column      a1
alt_column      a2
cat_column      gene
incl_var        False
output_path     /tmp/marvel_cli_example_ixe91nub
extract_variants        True
association_analysis    True

Step 3: Run the pipeline from the command line

The ! prefix in Jupyter executes the command in a shell, exactly as if you typed it in your terminal. Outside a notebook you would run:

marvelous path/to/example.cnf --verbose

MARVELous will:

  1. Extract variant carriers from the VCF files

  2. Aggregate carriers to gene level

  3. Run association tests for all specified outcomes and covariate models

[4]:
!marvelous "{config_file}" --verbose
/usr/local/lib/python3.12/site-packages/marvel/utils/utils.py:455: Warning: The environmental variable `MARVEL_TEST_DEFS` is not set. Trying to recover using the default configuration files in /usr/local/lib/python3.12/site-packages/marvel/association/tests.py.
  warnings.warn(
/usr/local/lib/python3.12/site-packages/marvel/utils/utils.py:455: Warning: The environmental variable `MARVEL_TEST_DEFS` is not set. Trying to recover using the default configuration files in /usr/local/lib/python3.12/site-packages/marvel/association/tests.py.
  warnings.warn(
=== MARVEL (marvel v0.4.1a0) ===
[info] 22:46:10 > config_file value: /tmp/marvel_cli_example_ixe91nub/example.cnf
[info] 22:46:10 > dry_run value: False
[info] 22:46:10 > log_file value: None
[info] 22:46:10 > outpath value: None
[info] 22:46:11 > verbose value: True
[info] 22:46:11 > ================================================================================
[info] 22:46:11 > MARVELous Pipeline Starting
[info] 22:46:11 > ================================================================================
[info] 22:46:11 >
[info] 22:46:11 > STEP 1: VARIANT EXTRACTION
[info] 22:46:11 > --------------------------------------------------------------------------------
[info] 22:46:11 > Starting variant extraction
[info] 22:46:11 > Set up 5 extraction task(s)
[info] 22:46:11 > Gene aggregation enabled - using two-phase approach
[info] 22:46:11 > Checkpoint directory: ./tmp_check
[info] 22:46:11 > Will skip tasks with existing output files
[info] 22:46:11 > Starting parallel extraction with 2 workers
[info] 22:46:12 > [1/5] Completed: chr7_variants
[info] 22:46:12 > [2/5] Completed: chr5_variants
[info] 22:46:12 > [3/5] Completed: chr16_variants
[info] 22:46:12 > [4/5] Completed: chr22_variants
[info] 22:46:13 > [5/5] Completed: chr1_variants
[info] 22:46:13 > Aggregating 30 variants to genes...
[info] 22:46:13 > Successful: 5
[info] 22:46:13 > Saved summary: /tmp/marvel_cli_example_ixe91nub/variant_carriers_summary.tsv.gz
[info] 22:46:13 > Saved carriers: /tmp/marvel_cli_example_ixe91nub/variant_carriers_carriers.tsv.gz
[info] 22:46:13 > Extraction complete: 1 file(s) created
[info] 22:46:13 >
[info] 22:46:13 > STEP 2: ASSOCIATION TESTING
[info] 22:46:13 > --------------------------------------------------------------------------------
[info] 22:46:13 > Starting association testing
[info] 22:46:13 > Testing 6 exposure(s)
[info] 22:46:13 > Checking missingness thresholds (max allowed: 50.0%)
[info] 22:46:13 > Missingness filtering complete. Retained 9 outcomes, 6 exposures
[info] 22:46:13 > Pre-computing stratification model: 'overall'
[info] 22:46:13 > Computing valid IDs for listwise deletion
[info] 22:46:13 > Processing file: /tmp/marvel_example_zc9hxuff/phenotypes.tsv
[info] 22:46:14 > ID extraction: 625 valid IDs from 1000 samples (375 removed, 37.50%)
[info] 22:46:14 > Final valid IDs across all files: 625
[info] 22:46:14 > Phenotype file: 625 valid samples (checked 9 outcomes + 0 time columns)
[info] 22:46:14 > Processing file: /tmp/marvel_example_zc9hxuff/covariates.tsv
[info] 22:46:14 > ID extraction: 950 valid IDs from 1000 samples (50 removed, 5.00%)
[info] 22:46:14 > Final valid IDs across all files: 950
[info] 22:46:14 > Covariate file: 950 valid samples
[info] 22:46:14 > Processing file: /tmp/marvel_cli_example_ixe91nub/variant_carriers_carriers.tsv.gz
[info] 22:46:14 > Final valid IDs across all files: 1000
[info] 22:46:14 > Exposure files: 1000 valid samples
[info] 22:46:14 > Listwise deletion: 595 valid samples (intersection)
[info] 22:46:14 > Pre-computed 1 stratification model(s)
[info] 22:46:14 > Testing exposure: ANGPTL3
[info] 22:46:14 >   Loaded: {0: 956, 1: 44}
[info] 22:46:14 >   Stratification model: 'overall'
[info] 22:46:14 > Cached exposure 'ANGPTL3' in DataManager
[info] 22:46:25 >    Completed: 58 results rows generated
[info] 22:46:25 > Baseline table saved: /tmp/marvel_cli_example_ixe91nub/ANGPTL3_overall_baseline.tsv
[info] 22:46:25 > Results saved: /tmp/marvel_cli_example_ixe91nub/ANGPTL3_results.tsv.gz (58 rows from 1 stratum/strata)
[info] 22:46:25 > Testing exposure: CETP
[info] 22:46:25 >   Loaded: {0: 950, 1: 50}
[info] 22:46:25 >   Stratification model: 'overall'
[info] 22:46:25 > Cached exposure 'CETP' in DataManager
[info] 22:46:36 >    Completed: 58 results rows generated
[info] 22:46:36 > Baseline table saved: /tmp/marvel_cli_example_ixe91nub/CETP_overall_baseline.tsv
[info] 22:46:36 > Results saved: /tmp/marvel_cli_example_ixe91nub/CETP_results.tsv.gz (58 rows from 1 stratum/strata)
[info] 22:46:36 > Testing exposure: HMGCR
[info] 22:46:36 >   Loaded: {0: 947, 1: 53}
[info] 22:46:36 >   Stratification model: 'overall'
[info] 22:46:36 > Cached exposure 'HMGCR' in DataManager
[info] 22:46:47 >    Completed: 58 results rows generated
[info] 22:46:47 > Baseline table saved: /tmp/marvel_cli_example_ixe91nub/HMGCR_overall_baseline.tsv
[info] 22:46:47 > Results saved: /tmp/marvel_cli_example_ixe91nub/HMGCR_results.tsv.gz (58 rows from 1 stratum/strata)
[info] 22:46:47 > Testing exposure: NPC1L1
[info] 22:46:47 >   Loaded: {0: 926, 1: 72, 2: 2}
[info] 22:46:47 >   Stratification model: 'overall'
[info] 22:46:47 > Cached exposure 'NPC1L1' in DataManager
[info] 22:46:57 >    Completed: 58 results rows generated
[info] 22:46:57 > Baseline table saved: /tmp/marvel_cli_example_ixe91nub/NPC1L1_overall_baseline.tsv
[info] 22:46:57 > Results saved: /tmp/marvel_cli_example_ixe91nub/NPC1L1_results.tsv.gz (58 rows from 1 stratum/strata)
[info] 22:46:57 > Testing exposure: PCSK9
[info] 22:46:57 >   Loaded: {0: 955, 1: 44, 2: 1}
[info] 22:46:57 >   Stratification model: 'overall'
[info] 22:46:57 > Cached exposure 'PCSK9' in DataManager
[info] 22:47:09 >    Completed: 58 results rows generated
[info] 22:47:09 > Baseline table saved: /tmp/marvel_cli_example_ixe91nub/PCSK9_overall_baseline.tsv
[info] 22:47:09 > Results saved: /tmp/marvel_cli_example_ixe91nub/PCSK9_results.tsv.gz (58 rows from 1 stratum/strata)
[info] 22:47:09 > Testing exposure: PPARA
[info] 22:47:10 >   Loaded: {0: 947, 1: 52, 2: 1}
[info] 22:47:10 >   Stratification model: 'overall'
[info] 22:47:10 > Cached exposure 'PPARA' in DataManager
[info] 22:47:22 >    Completed: 58 results rows generated
[info] 22:47:22 > Baseline table saved: /tmp/marvel_cli_example_ixe91nub/PPARA_overall_baseline.tsv
[info] 22:47:22 > Results saved: /tmp/marvel_cli_example_ixe91nub/PPARA_results.tsv.gz (58 rows from 1 stratum/strata)
[info] 22:47:22 > DataManager cache statistics (stratification 'overall'):
[info] 22:47:22 >   Outcomes: 99 hits, 9 misses, hit rate: 91.67%
[info] 22:47:22 >   Covariates: 53 hits, 1 misses, hit rate: 98.15%
[info] 22:47:22 >   Exposure reuses: 0
[info] 22:47:22 > Association testing complete
[info] 22:47:22 >
[info] 22:47:22 > ================================================================================
[info] 22:47:22 > MARVELous Pipeline Completed Successfully
[info] 22:47:22 > ================================================================================
[info] 22:47:22 >
[info] 22:47:22 > Pipeline Summary:
[info] 22:47:22 >   config: {'extract_variants': True, 'association_analysis': True, 'output_path': '/tmp/marvel_cli_example_ixe91nub'}
[info] 22:47:22 >   extraction: {'files_created': 1, 'successful_tasks': 5, 'failed_tasks': 0}
[info] 22:47:22 >   association: {'completed': 6, 'failed': 0, 'skipped': 0}

Step 4: Inspect the output

MARVELous writes two types of output files:

  • ``*_carriers.tsv.gz`` — one row per sample, one column per gene (carrier counts)

  • ``*_results.tsv.gz`` — association test results (one row per gene × outcome × model)

See the Output files section of the documentation for a full description of every column.

[5]:
import pandas as pd

# List all files written to the output directory
output_files = sorted(
    os.path.relpath(p, output_dir)
    for p in glob.glob(os.path.join(output_dir, "**"), recursive=True)
    if os.path.isfile(p)
)
print("Output files:")
for f in output_files:
    print(f"  {f}")
Output files:
  ANGPTL3_overall_baseline.tsv
  ANGPTL3_results.tsv.gz
  CETP_overall_baseline.tsv
  CETP_results.tsv.gz
  HMGCR_overall_baseline.tsv
  HMGCR_results.tsv.gz
  NPC1L1_overall_baseline.tsv
  NPC1L1_results.tsv.gz
  PCSK9_overall_baseline.tsv
  PCSK9_results.tsv.gz
  PPARA_overall_baseline.tsv
  PPARA_results.tsv.gz
  example.cnf
  variant_carriers_carriers.tsv.gz
  variant_carriers_summary.tsv.gz
[6]:
# Load the gene-level carrier matrix
carriers_file = glob.glob(os.path.join(output_dir, "*_carriers.tsv.gz"))[0]
carriers = pd.read_csv(carriers_file, sep="\t")
print(f"Carrier matrix shape: {carriers.shape}")
carriers.head()
Carrier matrix shape: (1000, 7)
[6]:
id PCSK9 ANGPTL3 HMGCR NPC1L1 CETP PPARA
0 ID_000000 0 0 0 0 0 0
1 ID_000001 0 0 0 0 0 0
2 ID_000002 0 0 0 0 0 0
3 ID_000003 0 0 0 0 0 0
4 ID_000004 0 0 0 0 0 0
[7]:
# Load one association results file and show the top hits
results_files = glob.glob(os.path.join(output_dir, "*_results.tsv.gz"))
results = pd.read_csv(results_files[0], sep="\t")
print(f"Results shape: {results.shape}")
results.head(10)
Results shape: (58, 15)
[7]:
Model Model name Stratification name Variable Exposure Cases Samples Exposed Non-exposed Point Estimates Standard Error Test statistic P-values Formatted Formatted (exp)
0 OLS Unadjusted overall sbp Intercept 0.0 595.0 43.0 552.0 120.180721 0.651921 NaN 0.000000e+00 120.18 (118.90; 121.46) 1562513444283125989701933543434768202131569939...
1 OLS Unadjusted overall sbp NPC1L1 0.0 595.0 43.0 552.0 0.696543 2.425043 NaN 7.740373e-01 0.70 (-4.06; 5.45) 2.01 (0.02; 232.65)
2 KW Unadjusted overall sbp NPC1L1 0.0 595.0 43.0 552.0 NaN NaN 0.086862 7.682057e-01 NaN NaN
3 OLS Unadjusted overall ldl_cholesterol Intercept 0.0 595.0 43.0 552.0 5.132331 0.047475 NaN 0.000000e+00 5.13 (5.04; 5.23) 169.41 (154.36; 185.93)
4 OLS Unadjusted overall ldl_cholesterol NPC1L1 0.0 595.0 43.0 552.0 -0.016067 0.176601 NaN 9.275404e-01 -0.02 (-0.36; 0.33) 0.98 (0.70; 1.39)
5 AOV Unadjusted overall ldl_cholesterol NPC1L1 0.0 595.0 43.0 552.0 NaN NaN 0.008277 9.275404e-01 NaN NaN
6 OLS Unadjusted overall crp_level Intercept 0.0 595.0 43.0 552.0 37.746194 0.852583 NaN 3.878451e-190 37.75 (36.08; 39.42) 24715186084767916.00 (4647764053864573.00; 131...
7 OLS Unadjusted overall crp_level NPC1L1 0.0 595.0 43.0 552.0 -0.471179 3.171472 NaN 8.819452e-01 -0.47 (-6.69; 5.74) 0.62 (0.00; 312.56)
8 KW Unadjusted overall crp_level NPC1L1 0.0 595.0 43.0 552.0 NaN NaN 0.003367 9.537299e-01 NaN NaN
9 AOV Unadjusted overall crp_level NPC1L1 0.0 595.0 43.0 552.0 NaN NaN 0.022072 8.819452e-01 NaN NaN

Step 5: Cleanup

Remove the temporary directory created at the start of this notebook.

[8]:
shutil.rmtree(output_dir)
print(f"Removed temporary directory: {output_dir}")
Removed temporary directory: /tmp/marvel_cli_example_ixe91nub