Configuration Reference
MARVELous uses INI-style configuration files (.cnf) to control all aspects of
the pipeline. This page provides a complete reference for all configuration sections
and options.
File Format
Configuration files use a simple INI format with sections enclosed in square brackets and tab-separated key-value pairs:
[SectionName]
key1 value1
key2 value2
Note
Keys and values must be separated by tabs, not spaces.
Lines starting with # are treated as comments.
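The format rules above can be sketched as a small reader. This is an illustrative sketch only, not MARVELous's own parser: the function name `parse_cnf` is hypothetical, and it assumes each key appears at most once per section.

```python
def parse_cnf(text):
    """Parse tab-separated INI-style text into {section: {key: value}}."""
    sections = {}
    current = None
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Skip blank lines and comment lines starting with '#'.
        if not line or line.startswith("#"):
            continue
        # A section header is enclosed in square brackets.
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections.setdefault(current, {})
            continue
        if current is None:
            raise ValueError(f"line {lineno}: entry found before any [Section]")
        # Keys and values must be separated by a tab, not spaces.
        key, sep, value = line.partition("\t")
        if not sep:
            raise ValueError(f"line {lineno}: key and value must be tab-separated")
        sections[current][key.strip()] = value.strip()
    return sections


example = "# demo configuration\n[Options]\nid_column\tIID\nn_jobs\t4\n"
config = parse_cnf(example)
print(config)
```

All values are returned as strings; converting them to booleans or numbers is left to the caller.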
Configuration Sections
[GenoInput]
Specifies genetic input files (VCF, BGEN, or PLINK format) for variant extraction.
Format: identifier<tab>path
Each entry maps a unique identifier to a genetic file path. The identifier is used internally to track which file each variant was extracted from.
Supported file types:
.vcf, .vcf.gz - VCF files
.bgen - BGEN files
.bed, .bim, .fam - PLINK files
Example:
[GenoInput]
chr20 /path/to/chr20.vcf.gz
chr21 /path/to/chr21.vcf.gz
chr22 /path/to/chr22.vcf.gz
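As a rough illustration of how a file type could be inferred from the supported extensions listed above, consider the sketch below. The helper `genetic_file_type` is hypothetical and not part of MARVELous, and the case-sensitive matching is an assumption.

```python
# Supported extensions as listed in this section, mapped to their format.
SUPPORTED_EXTENSIONS = {
    ".vcf": "VCF",
    ".vcf.gz": "VCF",
    ".bgen": "BGEN",
    ".bed": "PLINK",
    ".bim": "PLINK",
    ".fam": "PLINK",
}


def genetic_file_type(path):
    """Return the format name for a genetic file path, based on its extension."""
    for suffix, file_type in SUPPORTED_EXTENSIONS.items():
        if path.endswith(suffix):
            return file_type
    raise ValueError(f"unsupported genetic file extension: {path}")


print(genetic_file_type("/path/to/chr20.vcf.gz"))
```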
[VarInput]
Specifies variant definition files that list which variants to extract.
Format: identifier<tab>path
Variant files should be tab-separated with columns for chromosome, position,
reference allele, and alternate allele. The column names are configurable
via the [Options] section.
Example:
[VarInput]
variants /path/to/variant_list.tsv
Variant file format:
chr pos a1 a2 gene
20 1234567 A G GENE1
20 2345678 C T GENE1
21 3456789 G A GENE2
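To make the layout concrete, the sketch below reads a variant file like the one above and groups variant IDs by the gene column. It is an assumption-laden illustration, not the pipeline's extraction code: the function name `variants_by_category` is hypothetical, the default column names are taken from the example header, and the `chr:pos:ref:alt` ID layout follows the separator description in the [Options] section.

```python
import csv
import io
from collections import defaultdict


def variants_by_category(handle, chr_col="chr", pos_col="pos",
                         ref_col="a1", alt_col="a2", cat_col="gene", sep=":"):
    """Group variant IDs (chr:pos:ref:alt) by the category/gene column."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        variant_id = sep.join(
            [row[chr_col], row[pos_col], row[ref_col], row[alt_col]]
        )
        groups[row[cat_col]].append(variant_id)
    return dict(groups)


# In-memory stand-in for a tab-separated variant file.
demo = io.StringIO(
    "chr\tpos\ta1\ta2\tgene\n"
    "20\t1234567\tA\tG\tGENE1\n"
    "20\t2345678\tC\tT\tGENE1\n"
    "21\t3456789\tG\tA\tGENE2\n"
)
grouped = variants_by_category(demo)
print(grouped)
```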
[ExpInput]
Specifies pre-extracted exposure/variant carrier files for association analysis. Use this when you have already extracted variants and want to run association testing only.
Format: identifier<tab>path
Example:
[ExpInput]
exposures /path/to/extracted_carriers.tsv.gz
[PhenoInput]
Specifies phenotype and covariate input files.
Required keys:
phenotypes - Path to file containing outcome variables
covariates - Path to file containing covariate variables (required if using covariate models)
Example:
[PhenoInput]
phenotypes /path/to/phenotypes.tsv
covariates /path/to/covariates.tsv
Both files should be tab-separated with an ID column matching the genetic data.
[ConTests]
Defines continuous outcomes and the statistical tests to perform.
Format: outcome_name<tab>test1;test2;test3
Tests are separated by semicolons. See Statistical Tests Guide for available tests.
Example:
[ConTests]
blood_pressure OLS;KW;MWU;T
cholesterol OLS;KW;MWU;T
bmi OLS
[CatTests]
Defines categorical outcomes and the statistical tests to perform.
Format: outcome_name<tab>test1;test2
Example:
[CatTests]
disease_severity CHISQ
treatment_response CHISQ
[BinTests]
Defines binary outcomes and the statistical tests to perform.
Format: outcome_name<tab>test1;test2;test3
Example:
[BinTests]
diabetes GLM-Binom;CHISQ;FISHER
hypertension GLM-Binom;CHISQ
mortality FISHER
[SurvTests]
Defines survival (time-to-event) outcomes and statistical tests.
Format: event_column;time_column<tab>test1;test2
Note the special syntax: event and time columns are separated by a semicolon.
Requirements:
Event column must be binary (0 = censored, 1 = event)
Time column must be continuous and positive
Both columns must exist in the phenotypes file
Example:
[SurvTests]
death;follow_up_years Cox-PH
cvd_event;time_to_cvd Cox-PH
cancer;age_at_diagnosis Cox-PH
In this example:
death is the event indicator, follow_up_years is the time variable
Multiple survival outcomes can be specified
All covariate models from [Covs] will be applied
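The [SurvTests] syntax and its requirements can be sketched as follows. Both helpers (`parse_surv_entry` and `check_surv_columns`) are hypothetical names introduced for illustration and do not reflect how MARVELous validates its input internally.

```python
def parse_surv_entry(line):
    """Split 'event;time<tab>test1;test2' into (event, time, [tests])."""
    outcome, sep, tests = line.partition("\t")
    if not sep:
        raise ValueError("outcome and tests must be tab-separated")
    parts = [part.strip() for part in outcome.split(";")]
    if len(parts) != 2 or not all(parts):
        raise ValueError("expected 'event_column;time_column' before the tab")
    event_col, time_col = parts
    test_names = [name.strip() for name in tests.split(";") if name.strip()]
    return event_col, time_col, test_names


def check_surv_columns(events, times):
    """Check that events are coded 0/1 and that all times are positive."""
    return all(e in (0, 1) for e in events) and all(t > 0 for t in times)


entry = parse_surv_entry("death;follow_up_years\tCox-PH")
print(entry)
print(check_surv_columns([0, 1, 1], [2.5, 0.7, 10.0]))
```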
[Covs]
Defines covariate models for adjusted analyses. Each model specifies a set of covariates to include in regression analyses.
Format: model_name<tab>covariate1;covariate2;covariate3
Use None for an unadjusted model.
Example:
[Covs]
Unadjusted None
Age_Sex age;sex
Full age;sex;bmi;smoking_status;PC1;PC2;PC3;PC4
Each model is run separately, allowing comparison of adjusted and unadjusted results.
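A minimal sketch of how a [Covs] entry could be interpreted is shown below, treating None as an empty covariate list. The function name `parse_cov_model` is hypothetical and the code is not taken from MARVELous.

```python
def parse_cov_model(line):
    """Split 'model_name<tab>cov1;cov2' into (model_name, [covariates])."""
    name, sep, spec = line.partition("\t")
    if not sep:
        raise ValueError("model name and covariates must be tab-separated")
    spec = spec.strip()
    # 'None' denotes an unadjusted model with no covariates.
    if spec == "None":
        return name.strip(), []
    return name.strip(), [cov.strip() for cov in spec.split(";") if cov.strip()]


models = [
    parse_cov_model("Unadjusted\tNone"),
    parse_cov_model("Age_Sex\tage;sex"),
]
print(models)
```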
[Stratification]
Defines filter models for running separate association analyses on subgroups of individuals. Each filter model specifies criteria to select a subset of individuals from the phenotype file. The analysis is then run independently for each filter model, similar to how each covariate model is run separately.
Format: filter_name<tab>column=value1,value2;column2=value3
Semicolons (;) separate criteria across different columns (AND logic)
Commas (,) separate multiple values within a column (OR logic)
Prefix values with ! to exclude instead of include
Note
Filtering applies to association testing only, not to variant extraction.
Example:
[Stratification]
females sex=female
males sex=male
young_females sex=female;age_group=young,middle
In this example:
females: include only individuals where sex equals female
males: include only individuals where sex equals male
young_females: include individuals where sex equals female AND age_group equals young or middle
To exclude values, prefix them with !:
[Stratification]
non_smokers smoking=!current
By default, an unfiltered “overall” analysis is also run alongside each filter model.
This can be disabled with the stratify_overall option (see [Options]).
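The filter syntax of this section can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline's implementation: `parse_filter` and `matches` are hypothetical names, rows are plain dictionaries of strings, and the handling of a column that mixes included and excluded values is a guess.

```python
def parse_filter(spec):
    """Parse 'col=v1,v2;col2=!v3' into [(column, include_set, exclude_set)]."""
    criteria = []
    for criterion in spec.split(";"):
        column, sep, values = criterion.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in criterion: {criterion}")
        include, exclude = set(), set()
        for value in values.split(","):
            value = value.strip()
            # A leading '!' marks a value to exclude instead of include.
            if value.startswith("!"):
                exclude.add(value[1:])
            else:
                include.add(value)
        criteria.append((column.strip(), include, exclude))
    return criteria


def matches(row, criteria):
    """Return True if the row satisfies every criterion (AND across columns)."""
    for column, include, exclude in criteria:
        value = row[column]
        # Within a column, any listed value is accepted (OR logic).
        if include and value not in include:
            return False
        if value in exclude:
            return False
    return True


rows = [
    {"sex": "female", "age_group": "young", "smoking": "never"},
    {"sex": "female", "age_group": "old", "smoking": "current"},
    {"sex": "male", "age_group": "middle", "smoking": "former"},
]
young_females = [r for r in rows if matches(r, parse_filter("sex=female;age_group=young,middle"))]
non_smokers = [r for r in rows if matches(r, parse_filter("smoking=!current"))]
print(len(young_females), len(non_smokers))
```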
[Output]
Specifies output file paths.
Available keys:
VarOutput - Base path for variant extraction output files
TestOutput - (Legacy) Path for test output
Example:
[Output]
VarOutput /path/to/results/extracted_variants
This creates:
Carrier matrix - /path/to/results/extracted_variants_carriers.tsv.gz
Extraction summary - /path/to/results/extracted_variants_summary.tsv.gz
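The naming convention above amounts to appending a suffix to the VarOutput base path. The sketch below mirrors the two file names shown; the helper `extraction_outputs` is hypothetical.

```python
def extraction_outputs(var_output):
    """Derive the two extraction output paths from the VarOutput base path."""
    return {
        "carriers": f"{var_output}_carriers.tsv.gz",
        "summary": f"{var_output}_summary.tsv.gz",
    }


outputs = extraction_outputs("/path/to/results/extracted_variants")
print(outputs["carriers"])
print(outputs["summary"])
```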
[Options]
Controls pipeline behavior with various configuration options.
Core Pipeline Options
| Option | Type | Default | Description |
|---|---|---|---|
| extract_variants | bool | | Enable variant extraction step |
| association_analysis | bool | | Enable association testing step |
| id_column | str | | Name of the sample ID column |
| output_path | str | | Directory for output files |
| | str | | Directory for temporary files |
Column Configuration
These options configure column names in your variant input files:
| Option | Type | Default | Description |
|---|---|---|---|
| | str | | Chromosome column name |
| | str | | Position column name |
| ref_column | str | | Reference allele column name |
| alt_column | str | | Alternate allele column name |
| | str | | Variant ID column name |
| | str | | Combined chr:pos column name |
| cat_column | str | | Category/gene column for aggregation |
| | str | | Separator for variant IDs (chr:pos:ref:alt) |
Missingness Handling
| Option | Type | Default | Description |
|---|---|---|---|
| | bool | | |
| max_missingness | float | | Maximum allowed missingness proportion (0-1) |
| | bool | | Raise error if covariates exceed missingness threshold |
Extraction Options
| Option | Type | Default | Description |
|---|---|---|---|
| | bool | | Include individual variants in output (alongside gene aggregations) |
| | bool | | Also check for ref/alt swapped variants |
| | str | | Genomic region to extract (e.g., |
| | bool | | Boolean. When |
| | bool / | | What negative genotypes should be set to. |
| | bool / | | What genotypes larger than 1 should be set to. |
Testing Options
| Option | Type | Default | Description |
|---|---|---|---|
| | int | | Minimum samples required per exposure group |
| | bool | | Raise exception on test failures |
| stratify_overall | bool | | Also run an unfiltered “overall” analysis alongside filtered models |
| | str | | Semicolon-separated list of specific exposures to test |
Parallel Execution
| Option | Type | Default | Description |
|---|---|---|---|
| n_jobs | int | | Number of parallel jobs (-1 = all CPUs) |
| | str | | Directory for checkpointing |
| | bool | | Force re-extraction even if checkpoints exist |
Example:
[Options]
extract_variants True
association_analysis True
id_column sample_id
output_path ./results
cat_column gene
missingness_strategy True
max_missingness 0.3
n_jobs 4
Complete Example
A full reference configuration file that combines all sections and lists every
available option with its default value can be downloaded here:
full_config.cnf
Configuration for Specific Use Cases
Extraction Only
To run variant extraction without association testing:
[GenoInput]
chr1 /data/chr1.vcf.gz
[VarInput]
variants /data/variants.tsv
[Output]
VarOutput /results/extracted
[Options]
extract_variants True
association_analysis False
id_column IID
Association Only
To run association testing on pre-extracted data:
[ExpInput]
carriers /data/extracted_carriers.tsv.gz
[PhenoInput]
phenotypes /data/outcomes.tsv
covariates /data/covariates.tsv
[BinTests]
disease GLM-Binom;FISHER
[Covs]
Adjusted age;sex
[Options]
extract_variants False
association_analysis True
id_column IID
output_path /results
Stratified Analysis
To run association testing separately for subgroups of individuals:
[ExpInput]
carriers /data/extracted_carriers.tsv.gz
[PhenoInput]
phenotypes /data/outcomes.tsv
covariates /data/covariates.tsv
[BinTests]
disease GLM-Binom;FISHER
[Covs]
Adjusted age;sex
[Stratification]
females sex=female
males sex=male
non_smokers smoking=!current
[Options]
extract_variants False
association_analysis True
id_column IID
output_path /results
stratify_overall True
This runs the association analysis four times: once unfiltered (“overall”), once for
females only, once for males only, and once excluding current smokers. Results include
a Filter name column identifying which filter model was applied.
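The count of four runs follows from three filter models plus the unfiltered analysis. A tiny sketch of that bookkeeping is given below; `planned_runs` is a hypothetical helper, and placing the overall analysis first is an assumption about ordering only.

```python
def planned_runs(filter_names, stratify_overall=True):
    """List the analyses to run: the optional 'overall' run plus one per filter."""
    runs = ["overall"] if stratify_overall else []
    return runs + list(filter_names)


runs = planned_runs(["females", "males", "non_smokers"])
print(runs)
```

With `stratify_overall` disabled, only the three filter models would remain.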
Creation of configuration file
The configuration file can be written by hand following the formatting rules described above. A helper function, illustrated below, is also available to create it. The example ends with a check of the written file, which verifies that:
at least one of extract_variants or association_analysis is enabled
genetic files have supported formats (based on their extension)
the variant output identifiers are as expected
a covariate file is supplied if adjusted models are specified
survival outcomes are correctly formatted
the specified tests are recognised
the specified input files and columns are present
# imports
from marvel.utils.config_tools import (
    create_config,
    ConfigParser,
    check_config_file,
)

# set some dictionaries, one per configuration section
geno_input = {
    'chr21': 'path/to/chr21.vcf.gz',
    'chr22': 'path/to/chr22.vcf.gz',
}
pheno_input = {
    'phenotypes': 'path/to/outcomes.tsv',
    'covariates': 'path/to/covariates.tsv',
}
con_tests = {
    'sbp': ['OLS', 'KW'],
    'ldl_cholesterol': ['OLS', 'AOV'],
    'crp_level': ['OLS', 'KW', 'AOV'],
}
cat_tests = {
    'disease_severity': ['CHISQ'],
    'treatment_response': ['CHISQ'],
    'risk_category': ['CHISQ'],
}
bin_tests = {
    'hypertension': ['GLM-Binom', 'CHISQ'],
    'diabetes': ['GLM-Binom', 'FISHER'],
    'cvd': ['GLM-Binom', 'CHISQ', 'FISHER'],
}
covs = {
    'Unadjusted': 'None',
    'Adjusted': ['age', 'sex'],
}

# create the configuration file, which will be written to the `path`
create_config(
    path='path/to/config.cnf',
    extract_variants=True,
    geno_input=geno_input,
    variant_input={'variants': 'path/to/variants.tsv'},
    variant_output={'VarOutput': '/path/to/prefix'},
    association_analysis=True,
    pheno_input=pheno_input,
    con_tests=con_tests,
    cat_tests=cat_tests,
    bin_tests=bin_tests,
    covs=covs,
    id_column='id',
    raise_on_error=True,
    ref_column='a1',
    alt_column='a2',
    cat_column='gene',
    incl_var=False,
    output_path='path/to/output',
)

# do some basic checks on the configuration file
config = ConfigParser('path/to/config.cnf')
check_config_file(config)
See Also
Usage Guide - Command-line usage
Statistical Tests Guide - Statistical tests reference
Advanced Features - Advanced features