Configuration Reference

MARVELous uses INI-style configuration files (.cnf) to control all aspects of the pipeline. This page provides a complete reference for all configuration sections and options.

File Format

Configuration files use a simple INI format with sections enclosed in square brackets and tab-separated key-value pairs:

[SectionName]
key1 value1
key2 value2

Note

Keys and values must be separated by tabs, not spaces. Lines starting with # are treated as comments.
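The format rules above can be illustrated with a minimal reader. This is only a sketch of the stated rules (bracketed sections, tab-separated pairs, # comments); `parse_cnf` is a hypothetical name and not part of MARVELous, whose own parser may behave differently.

```python
def parse_cnf(text):
    """Parse tab-separated INI-style text into {section: {key: value}}."""
    sections = {}
    current = None
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # A bracketed line opens a new section.
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections.setdefault(current, {})
            continue
        if current is None:
            raise ValueError(f"line {lineno}: entry outside of any section")
        # Keys and values must be separated by a tab, not spaces.
        key, sep, value = line.partition("\t")
        if not sep:
            raise ValueError(f"line {lineno}: key and value must be tab-separated")
        sections[current][key.strip()] = value.strip()
    return sections


example = "# comment\n[GenoInput]\nchr20\t/path/to/chr20.vcf.gz\n[Options]\nn_jobs\t4\n"
config = parse_cnf(example)
print(config["Options"]["n_jobs"])
```

Note that all values are returned as strings here; converting them to booleans or numbers is left out of the sketch.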

Configuration Sections

[GenoInput]

Specifies genetic input files (VCF, BGEN, or PLINK format) for variant extraction.

Format: identifier<tab>path

Each entry maps a unique identifier to a genetic file path. The identifier is used internally to track which file variants were extracted from.

Supported file types:

.vcf, .vcf.gz - VCF files
.bgen - BGEN files
.bed, .bim, .fam - PLINK files
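As a rough illustration of how a file type could be inferred from the extensions listed above, consider the sketch below. `genetic_file_type` is a hypothetical helper, and the case-sensitive matching is an assumption; the pipeline's actual detection logic is not shown on this page.

```python
def genetic_file_type(path):
    """Map a genetic file path to its format, based on the extension only."""
    if path.endswith((".vcf", ".vcf.gz")):
        return "VCF"
    if path.endswith(".bgen"):
        return "BGEN"
    if path.endswith((".bed", ".bim", ".fam")):
        return "PLINK"
    raise ValueError(f"unsupported genetic file extension: {path}")


print(genetic_file_type("/path/to/chr20.vcf.gz"))
```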

Example:

[GenoInput]
chr20        /path/to/chr20.vcf.gz
chr21        /path/to/chr21.vcf.gz
chr22        /path/to/chr22.vcf.gz

[VarInput]

Specifies variant definition files that list which variants to extract.

Format: identifier<tab>path

Variant files should be tab-separated with columns for chromosome, position, reference allele, and alternate allele. The column names are configurable via the [Options] section.

Example:

[VarInput]
variants     /path/to/variant_list.tsv

Variant file format:

chr  pos     a1      a2      gene
 1234567 A       G       GENE1
 2345678 C       T       GENE1
 3456789 G       A       GENE2
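To show how the configurable column names and `var_sep` fit together, the sketch below reads a variant file and builds chr:pos:ref:alt identifiers. `read_variant_ids` is a hypothetical helper, the parameter defaults mirror the column options in [Options], and the chromosome value 20 in the sample data is made up for illustration.

```python
import csv
import io


def read_variant_ids(handle, chr_column="chr", pos_column="pos",
                     ref_column="a1", alt_column="a2", var_sep=":"):
    """Return one chr<sep>pos<sep>ref<sep>alt identifier per row."""
    reader = csv.DictReader(handle, delimiter="\t")
    required = [chr_column, pos_column, ref_column, alt_column]
    missing = [c for c in required if c not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"missing columns in variant file: {missing}")
    return [var_sep.join(row[c] for c in required) for row in reader]


# Hypothetical file content; the chromosome values are illustrative only.
sample = "chr\tpos\ta1\ta2\tgene\n20\t1234567\tA\tG\tGENE1\n20\t2345678\tC\tT\tGENE1\n"
ids = read_variant_ids(io.StringIO(sample))
print(ids)
```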

[ExpInput]

Specifies pre-extracted exposure/variant carrier files for association analysis. Use this when you have already extracted variants and want to run association testing only.

Format: identifier<tab>path

Example:

[ExpInput]
exposures    /path/to/extracted_carriers.tsv.gz

[PhenoInput]

Specifies phenotype and covariate input files.

Required keys:

phenotypes - Path to file containing outcome variables
covariates - Path to file containing covariate variables (required if using covariate models)

Example:

[PhenoInput]
phenotypes   /path/to/phenotypes.tsv
covariates   /path/to/covariates.tsv

Both files should be tab-separated with an ID column matching the genetic data.

[ConTests]

Defines continuous outcomes and the statistical tests to perform.

Format: outcome_name<tab>test1;test2;test3

Tests are separated by semicolons. See Statistical Tests Guide for available tests.

Example:

[ConTests]
blood_pressure       OLS;KW;MWU;T
cholesterol  OLS;KW;MWU;T
bmi  OLS
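The outcome<tab>test1;test2 syntax shared by the test sections can be split as in the sketch below. `parse_test_entry` is a hypothetical name used for illustration, not a MARVELous function.

```python
def parse_test_entry(line):
    """Split 'outcome<TAB>test1;test2' into (outcome, [tests])."""
    outcome, sep, tests = line.partition("\t")
    if not sep:
        raise ValueError("outcome and tests must be tab-separated")
    # Tests are separated by semicolons; empty fragments are dropped.
    return outcome.strip(), [t.strip() for t in tests.split(";") if t.strip()]


entries = ["blood_pressure\tOLS;KW;MWU;T", "bmi\tOLS"]
con_tests = dict(parse_test_entry(e) for e in entries)
print(con_tests)
```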

[CatTests]

Defines categorical outcomes and the statistical tests to perform.

Format: outcome_name<tab>test1;test2

Example:

[CatTests]
disease_severity     CHISQ
treatment_response   CHISQ

[BinTests]

Defines binary outcomes and the statistical tests to perform.

Format: outcome_name<tab>test1;test2;test3

Example:

[BinTests]
diabetes     GLM-Binom;CHISQ;FISHER
hypertension GLM-Binom;CHISQ
mortality    FISHER

[SurvTests]

Defines survival (time-to-event) outcomes and statistical tests.

Format: event_column;time_column<tab>test1;test2

Note the special syntax: event and time columns are separated by a semicolon.

Requirements:

Event column must be binary (0 = censored, 1 = event)
Time column must be continuous and positive
Both columns must exist in the phenotypes file

Example:

[SurvTests]
death;follow_up_years        Cox-PH
cvd_event;time_to_cvd        Cox-PH
cancer;age_at_diagnosis      Cox-PH

In this example:

death is the event indicator, follow_up_years is the time variable
Multiple survival outcomes can be specified
All covariate models from [Covs] will be applied
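The survival syntax and its requirements can be sketched as follows. Both function names are hypothetical, and the checks simply restate the requirements listed above (binary event, positive time); they are not the pipeline's own validation code.

```python
def parse_surv_entry(line):
    """Split 'event;time<TAB>tests' into (event_column, time_column, [tests])."""
    outcome, sep, tests = line.partition("\t")
    if not sep:
        raise ValueError("outcome and tests must be tab-separated")
    parts = [p.strip() for p in outcome.split(";")]
    if len(parts) != 2:
        raise ValueError("survival outcome must be 'event_column;time_column'")
    event_column, time_column = parts
    return event_column, time_column, [t for t in tests.strip().split(";") if t]


def check_surv_rows(rows, event_column, time_column):
    """Raise if the event is not 0/1 or the time is not positive."""
    for row in rows:
        if row[event_column] not in (0, 1):
            raise ValueError(f"{event_column} must be 0 (censored) or 1 (event)")
        if not row[time_column] > 0:
            raise ValueError(f"{time_column} must be positive")


event, time, tests = parse_surv_entry("death;follow_up_years\tCox-PH")
check_surv_rows([{"death": 1, "follow_up_years": 4.2}], event, time)
print(event, time, tests)
```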

[Covs]

Defines covariate models for adjusted analyses. Each model specifies a set of covariates to include in regression analyses.

Format: model_name<tab>covariate1;covariate2;covariate3

Use None for an unadjusted model.

Example:

[Covs]
Unadjusted   None
Age_Sex      age;sex
Full age;sex;bmi;smoking_status;PC1;PC2;PC3;PC4

Each model is run separately, allowing comparison of adjusted and unadjusted results.
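A covariate model value could be turned into a list of covariates as sketched below, with None mapping to an empty list for the unadjusted model. `parse_cov_model` is a hypothetical helper, and representing "unadjusted" as an empty list is an assumption made for this example.

```python
def parse_cov_model(value):
    """Turn 'age;sex' into ['age', 'sex']; 'None' means no covariates."""
    value = value.strip()
    if value == "None":
        return []
    return [c.strip() for c in value.split(";") if c.strip()]


covs = {"Unadjusted": "None", "Age_Sex": "age;sex"}
models = {name: parse_cov_model(spec) for name, spec in covs.items()}
print(models)
```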

[Stratification]

Defines filter models for running separate association analyses on subgroups of individuals. Each filter model specifies criteria to select a subset of individuals from the phenotype file. The analysis is then run independently for each filter model, similar to how each covariate model is run separately.

Format: filter_name<tab>column=value1,value2;column2=value3

Semicolons (;) separate criteria across different columns (AND logic)
Commas (,) separate multiple values within a column (OR logic)
Prefix values with ! to exclude instead of include

Note

Filtering applies to association testing only, not to variant extraction.

Example:

[Stratification]
females      sex=female
males        sex=male
young_females        sex=female;age_group=young,middle

In this example:

females: include only individuals where sex equals female
males: include only individuals where sex equals male
young_females: include individuals where sex equals female AND age_group equals young or middle

To exclude values, prefix them with !:

[Stratification]
non_smokers  smoking=!current

By default, an unfiltered “overall” analysis is also run alongside each filter model. This can be disabled with the stratify_overall option (see [Options]).
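The filter syntax described in this section (AND across columns, OR within a column, ! for exclusion) can be sketched in plain Python. The function names are hypothetical, and how the pipeline treats a mix of included and excluded values within one column is an assumption here.

```python
def parse_filter(spec):
    """Parse 'col=v1,v2;col2=!v3' into a list of (column, include, exclude)."""
    criteria = []
    for part in spec.split(";"):
        column, sep, raw_values = part.partition("=")
        if not sep:
            raise ValueError(f"criterion without '=': {part!r}")
        values = [v.strip() for v in raw_values.split(",")]
        include = {v for v in values if not v.startswith("!")}
        exclude = {v[1:] for v in values if v.startswith("!")}
        criteria.append((column.strip(), include, exclude))
    return criteria


def matches(row, criteria):
    """True if the row satisfies every criterion (AND across columns)."""
    for column, include, exclude in criteria:
        value = str(row[column])
        if include and value not in include:  # OR within a column
            return False
        if value in exclude:
            return False
    return True


rows = [
    {"sex": "female", "age_group": "young", "smoking": "never"},
    {"sex": "female", "age_group": "old", "smoking": "current"},
    {"sex": "male", "age_group": "middle", "smoking": "former"},
]
young_females = [r for r in rows if matches(r, parse_filter("sex=female;age_group=young,middle"))]
non_smokers = [r for r in rows if matches(r, parse_filter("smoking=!current"))]
print(len(young_females), len(non_smokers))
```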

[Output]

Specifies output file paths.

Available keys:

VarOutput - Base path for variant extraction output files
TestOutput - (Legacy) Path for test output

Example:

[Output]
VarOutput    /path/to/results/extracted_variants

This creates:

Carrier matrix - /path/to/results/extracted_variants_carriers.tsv.gz
Extraction summary - /path/to/results/extracted_variants_summary.tsv.gz
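The relationship between the VarOutput base path and the two files can be written out as a small sketch. `extraction_outputs` is a hypothetical helper; the suffixes are taken from the example above.

```python
def extraction_outputs(var_output):
    """Derive the two extraction output paths from the VarOutput base path."""
    return {
        "carriers": f"{var_output}_carriers.tsv.gz",
        "summary": f"{var_output}_summary.tsv.gz",
    }


paths = extraction_outputs("/path/to/results/extracted_variants")
print(paths["carriers"])
```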

[Options]

Controls pipeline behavior with various configuration options.

Core Pipeline Options

Option	Type	Default	Description
`extract_variants`	bool	`True`	Enable variant extraction step
`association_analysis`	bool	`True`	Enable association testing step
`id_column`	str	`id`	Name of the sample ID column
`output_path`	str	`.`	Directory for output files
`tmp_dir`	str	`./tmp`	Directory for temporary files

Column Configuration

These options configure column names in your variant input files:

Option	Type	Default	Description
`chr_column`	str	`chr`	Chromosome column name
`pos_column`	str	`pos`	Position column name
`ref_column`	str	`a1`	Reference allele column name
`alt_column`	str	`a2`	Alternate allele column name
`var_column`	str	`ID`	Variant ID column name
`chr_pos_column`	str	`chr_pos`	Combined chr:pos column name
`cat_column`	str	`None`	Category/gene column for aggregation
`var_sep`	str	`:`	Separator for variant IDs (chr:pos:ref:alt)

Missingness Handling

Option	Type	Default	Description
`missingness_strategy`	bool	`True`	`True` for listwise deletion, `False` for pairwise
`max_missingness`	float	`0.5`	Maximum allowed missingness proportion (0-1)
`cov_miss_error`	bool	`True`	Raise error if covariates exceed missingness threshold
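As a rough illustration of how `max_missingness` and `cov_miss_error` might interact, the sketch below computes the missing proportion per covariate and compares it to the threshold. The function names, the definition of "missing" (None or empty string), and the strict greater-than comparison are all assumptions; the pipeline's behaviour when `cov_miss_error` is `False` is not documented here.

```python
def missing_fraction(values):
    """Proportion of entries that are missing (None or empty string)."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or v == "")
    return missing / len(values)


def check_covariates(columns, max_missingness=0.5, cov_miss_error=True):
    """Return covariates above the threshold; raise if cov_miss_error is set."""
    too_sparse = [
        name for name, values in columns.items()
        if missing_fraction(values) > max_missingness
    ]
    if too_sparse and cov_miss_error:
        raise ValueError(f"covariates exceed max_missingness: {too_sparse}")
    return too_sparse


columns = {"age": [50, None, 61, 47], "bmi": [None, None, None, 22.5]}
flagged = check_covariates(columns, max_missingness=0.5, cov_miss_error=False)
print(flagged)
```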

Extraction Options

Option	Type	Default	Description
`incl_var`	bool	`True`	Include individual variants in output (alongside gene aggregations)
`reverse`	bool	`True`	Also check for ref/alt swapped variants
`region`	str	`None`	Genomic region to extract (e.g., `20:1000000-2000000`)
`prefilter_regions`	bool	`False`	When `True`, restricts the input variant list to chromosomes/positions present in the genetic file before extraction. This skips variants that are not in the file and speeds up large extractions.
`neg_geno`	bool / `None` / str / int / float	`True`	What negative genotypes should be set to. `True` = keep unchanged, `None` = set to `NA`, any other value = set to this value
`sum_geno`	bool / `None` / str / int / float	`True`	What genotypes larger than 1 should be set to. `True` = keep unchanged, `None` = set to `NA`, any other value = set to this value
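The three-way semantics of `neg_geno` and `sum_geno` (`True` keeps the value, `None` sets it to `NA`, anything else is used as the replacement) can be sketched as below. `recode_genotype` is a hypothetical function, and representing `NA` as Python `None` is an assumption for this example.

```python
def _apply_setting(value, setting):
    """Apply a neg_geno/sum_geno style setting to a single genotype value."""
    if setting is True:   # keep unchanged
        return value
    if setting is None:   # set to NA (represented as None in this sketch)
        return None
    return setting        # any other value replaces the genotype


def recode_genotype(value, neg_geno=True, sum_geno=True):
    """Recode negative genotypes and genotypes larger than 1."""
    if value is None:
        return None
    if value < 0:
        return _apply_setting(value, neg_geno)
    if value > 1:
        return _apply_setting(value, sum_geno)
    return value


# Collapse genotypes above 1 to carrier status 1 and blank out negative values.
recoded = [recode_genotype(g, neg_geno=None, sum_geno=1) for g in [0, 1, 2, -1]]
print(recoded)
```

The identity check `setting is True` matters: a replacement value of 1 compares equal to `True` in Python, so an equality test would wrongly treat it as "keep unchanged".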

Testing Options

Option	Type	Default	Description
`min_group_size`	int	`1`	Minimum samples required per exposure group
`raise_on_error`	bool	`True`	Raise exception on test failures
`stratify_overall`	bool	`True`	Also run an unfiltered “overall” analysis alongside filtered models
`exposures`	str	`None`	Semicolon-separated list of specific exposures to test

Parallel Execution

Option	Type	Default	Description
`n_jobs`	int	`-1`	Number of parallel jobs (-1 = all CPUs)
`checkpoint_dir`	str	`./tmp_check`	Directory for checkpointing
`force_rerun`	bool	`False`	Force re-extraction even if checkpoints exist

Example:

[Options]
extract_variants     True
association_analysis True
id_column    sample_id
output_path  ./results
cat_column   gene
missingness_strategy True
max_missingness      0.3
n_jobs       4

Complete Example

A full reference configuration file that combines all sections and lists every available option with its default value can be downloaded here: full_config.cnf

Configuration for Specific Use Cases

Extraction Only

To run variant extraction without association testing:

[GenoInput]
chr1 /data/chr1.vcf.gz

[VarInput]
variants     /data/variants.tsv

[Output]
VarOutput    /results/extracted

[Options]
extract_variants     True
association_analysis False
id_column    IID

Association Only

To run association testing on pre-extracted data:

[ExpInput]
carriers     /data/extracted_carriers.tsv.gz

[PhenoInput]
phenotypes   /data/outcomes.tsv
covariates   /data/covariates.tsv

[BinTests]
disease      GLM-Binom;FISHER

[Covs]
Adjusted     age;sex

[Options]
extract_variants     False
association_analysis True
id_column    IID
output_path  /results

Stratified Analysis

To run association testing separately for subgroups of individuals:

[ExpInput]
carriers     /data/extracted_carriers.tsv.gz

[PhenoInput]
phenotypes   /data/outcomes.tsv
covariates   /data/covariates.tsv

[BinTests]
disease      GLM-Binom;FISHER

[Covs]
Adjusted     age;sex

[Stratification]
females      sex=female
males        sex=male
non_smokers  smoking=!current

[Options]
extract_variants     False
association_analysis True
id_column    IID
output_path  /results
stratify_overall     True

This runs the association analysis four times: once unfiltered (“overall”), once for females only, once for males only, and once excluding current smokers. Results include a Filter name column identifying which filter model was applied.
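The count of four can be reproduced by enumerating filter models against covariate models, as in the sketch below. `planned_runs` is a hypothetical helper, and the literal label "overall" and the ordering of runs are assumptions made for illustration.

```python
from itertools import product


def planned_runs(filters, cov_models, stratify_overall=True):
    """List (filter_model, covariate_model) pairs that would be analysed."""
    filter_names = (["overall"] if stratify_overall else []) + list(filters)
    return list(product(filter_names, cov_models))


runs = planned_runs(["females", "males", "non_smokers"], ["Adjusted"])
print(len(runs))
```

Setting `stratify_overall=False` in this sketch drops the unfiltered run, leaving three.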

Creation of configuration file

The configuration file can be written by hand, following the formatting rules described above. Alternatively, a helper function is available to create it, as illustrated below. The example also runs a check of the written file, which verifies that:

at least one of extract_variants or association_analysis is enabled
genetic files have supported formats (based on their extension)
the variant output identifiers are as expected
a covariate file is supplied if adjusted models are specified
survival outcomes are formatted correctly (event_column;time_column)
the specified tests are recognised
specified input files and columns are present

# imports
from marvel.utils.config_tools import (
    create_config,
    ConfigParser,
    check_config_file,
)

# set some dictionaries
geno_input = {
    'chr21': 'path/to/chr21.vcf.gz',
    'chr22': 'path/to/chr22.vcf.gz',
}
pheno_input = {
    'phenotypes': 'path/to/outcomes.tsv',
    'covariates': 'path/to/covariates.tsv',
}
con_tests = {
    'sbp'             : ['OLS', 'KW'],
    'ldl_cholesterol' : ['OLS', 'AOV'],
    'crp_level'       : ['OLS', 'KW', 'AOV'],
}
cat_tests = {
    'disease_severity'   : ['CHISQ'],
    'treatment_response' : ['CHISQ'],
    'risk_category'      : ['CHISQ'],
}
bin_tests = {
    'hypertension' : ['GLM-Binom', 'CHISQ'],
    'diabetes'     : ['GLM-Binom', 'FISHER'],
    'cvd'          : ['GLM-Binom', 'CHISQ', 'FISHER'],
}
covs = {
    'Unadjusted' : 'None',
    'Adjusted'   : ['age', 'sex'],
}

# create the configuration file, which will be written to the `path`
create_config(
    path='path/to/config.cnf',
    extract_variants=True,
    geno_input=geno_input,
    variant_input={'variants': 'path/to/variants.tsv'},
    variant_output={'VarOutput': '/path/to/prefix'},
    association_analysis=True,
    pheno_input=pheno_input,
    con_tests=con_tests,
    cat_tests=cat_tests,
    bin_tests=bin_tests,
    covs=covs,
    id_column='id',
    raise_on_error=True,
    ref_column='a1',
    alt_column='a2',
    cat_column='gene',
    incl_var=False,
    output_path='path/to/output',
)

# do some basic checks on the configuration file
config = ConfigParser('path/to/config.cnf')
check_config_file(config)
