Configuration Reference

MARVELous uses INI-style configuration files (.cnf) to control all aspects of the pipeline. This page provides a complete reference for all configuration sections and options.

File Format

Configuration files use a simple INI format with sections enclosed in square brackets and tab-separated key-value pairs:

[SectionName]
key1 value1
key2 value2

Note

Keys and values must be separated by tabs, not spaces. Lines starting with # are treated as comments.
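To make the format rules concrete, here is a minimal Python sketch that reads such a file into nested dictionaries. The function name parse_cnf is hypothetical and not part of MARVELous. The sketch assumes only what is described above: bracketed section headers, tab-separated keys and values, and # comments.

```python
def parse_cnf(text):
    """Parse tab-separated, INI-style text into {section: {key: value}}.

    Hypothetical helper for illustration only; not the MARVELous parser.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections.setdefault(current, {})
            continue
        if current is None:
            raise ValueError(f"entry outside of a section: {raw!r}")
        # Split on the first tab only; extra alignment tabs are stripped below.
        key, sep, value = line.partition("\t")
        if not sep:
            raise ValueError(f"expected a tab between key and value: {raw!r}")
        sections[current][key.strip()] = value.strip()
    return sections


example = "# a comment\n[Options]\nn_jobs\t4\nid_column\tsample_id\n"
parsed = parse_cnf(example)
print(parsed)
```

A line that separates key and value with spaces raises a ValueError in this sketch, which mirrors the tab requirement in the note above.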

Configuration Sections

[GenoInput]

Specifies genetic input files (VCF, BGEN, or PLINK format) for variant extraction.

Format: identifier<tab>path

Each entry maps a unique identifier to a genetic file path. The identifier is used internally to track which file variants were extracted from.

Supported file types:

  • .vcf, .vcf.gz - VCF files

  • .bgen - BGEN files

  • .bed, .bim, .fam - PLINK files

Example:

[GenoInput]
chr20        /path/to/chr20.vcf.gz
chr21        /path/to/chr21.vcf.gz
chr22        /path/to/chr22.vcf.gz
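The extension-to-format mapping listed above can be sketched in a few lines of Python. The function name detect_geno_format and the returned labels are assumptions made for this illustration; how MARVELous resolves file types internally is not shown on this page.

```python
# Longer suffixes come first so that ".vcf.gz" is matched before ".vcf".
SUFFIX_TO_FORMAT = (
    (".vcf.gz", "VCF"),
    (".vcf", "VCF"),
    (".bgen", "BGEN"),
    (".bed", "PLINK"),
    (".bim", "PLINK"),
    (".fam", "PLINK"),
)


def detect_geno_format(path):
    """Return 'VCF', 'BGEN', or 'PLINK' based on the file extension."""
    lowered = path.lower()
    for suffix, file_format in SUFFIX_TO_FORMAT:
        if lowered.endswith(suffix):
            return file_format
    raise ValueError(f"unsupported genetic file extension: {path!r}")


print(detect_geno_format("/path/to/chr20.vcf.gz"))
```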

[VarInput]

Specifies variant definition files that list which variants to extract.

Format: identifier<tab>path

Variant files should be tab-separated with columns for chromosome, position, reference allele, and alternate allele. The column names are configurable via the [Options] section.

Example:

[VarInput]
variants     /path/to/variant_list.tsv

Variant file format:

chr  pos     a1      a2      gene
20   1234567 A       G       GENE1
20   2345678 C       T       GENE1
21   3456789 G       A       GENE2
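As a rough illustration of how such a file maps to variant identifiers, the sketch below reads the example rows with Python's csv module and joins chromosome, position, and alleles with the default separator. The function name variant_ids is hypothetical. The keyword defaults mirror the column options documented in [Options], and the chr:pos:ref:alt layout follows the var_sep description there.

```python
import csv
import io

VARIANT_TSV = (
    "chr\tpos\ta1\ta2\tgene\n"
    "20\t1234567\tA\tG\tGENE1\n"
    "20\t2345678\tC\tT\tGENE1\n"
    "21\t3456789\tG\tA\tGENE2\n"
)


def variant_ids(handle, chr_column="chr", pos_column="pos",
                ref_column="a1", alt_column="a2", var_sep=":"):
    """Build chr:pos:ref:alt identifiers from a tab-separated variant file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        var_sep.join([row[chr_column], row[pos_column],
                      row[ref_column], row[alt_column]])
        for row in reader
    ]


ids = variant_ids(io.StringIO(VARIANT_TSV))
print(ids)
```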

[ExpInput]

Specifies pre-extracted exposure/variant carrier files for association analysis. Use this when you have already extracted variants and want to run association testing only.

Format: identifier<tab>path

Example:

[ExpInput]
exposures    /path/to/extracted_carriers.tsv.gz

[PhenoInput]

Specifies phenotype and covariate input files.

Keys:

  • phenotypes - Path to file containing outcome variables (required)

  • covariates - Path to file containing covariate variables (required only when using covariate models)

Example:

[PhenoInput]
phenotypes   /path/to/phenotypes.tsv
covariates   /path/to/covariates.tsv

Both files should be tab-separated and contain a sample ID column (set with id_column in [Options]) whose values match the sample IDs in the genetic data.

[ConTests]

Defines continuous outcomes and the statistical tests to perform.

Format: outcome_name<tab>test1;test2;test3

Tests are separated by semicolons. See Statistical Tests Guide for available tests.

Example:

[ConTests]
blood_pressure       OLS;KW;MWU;T
cholesterol  OLS;KW;MWU;T
bmi          OLS

[CatTests]

Defines categorical outcomes and the statistical tests to perform.

Format: outcome_name<tab>test1;test2

Example:

[CatTests]
disease_severity     CHISQ
treatment_response   CHISQ

[BinTests]

Defines binary outcomes and the statistical tests to perform.

Format: outcome_name<tab>test1;test2;test3

Example:

[BinTests]
diabetes     GLM-Binom;CHISQ;FISHER
hypertension GLM-Binom;CHISQ
mortality    FISHER

[SurvTests]

Defines survival (time-to-event) outcomes and statistical tests.

Format: event_column;time_column<tab>test1;test2

Note the special syntax: event and time columns are separated by a semicolon.

Requirements:

  • Event column must be binary (0 = censored, 1 = event)

  • Time column must be continuous and positive

  • Both columns must exist in the phenotypes file

Example:

[SurvTests]
death;follow_up_years        Cox-PH
cvd_event;time_to_cvd        Cox-PH
cancer;age_at_diagnosis      Cox-PH

In this example:

  • death is the event indicator, follow_up_years is the time variable

  • Multiple survival outcomes can be specified

  • All covariate models from [Covs] will be applied
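The entry syntax and the column requirements above can be sketched as two small Python helpers. Both function names are hypothetical, and the checks only restate the listed requirements; they are not the validation MARVELous itself performs.

```python
def parse_surv_entry(line):
    """Split 'event;time<tab>test1;test2' into (event, time, [tests])."""
    outcome, sep, tests = line.partition("\t")
    if not sep:
        raise ValueError("expected a tab between the outcome and the tests")
    columns = [part.strip() for part in outcome.split(";")]
    if len(columns) != 2 or not all(columns):
        raise ValueError("expected 'event_column;time_column' before the tab")
    event_column, time_column = columns
    test_names = [name.strip() for name in tests.split(";") if name.strip()]
    return event_column, time_column, test_names


def check_survival_values(events, times):
    """Raise ValueError unless events are 0/1 and times are positive."""
    if any(event not in (0, 1) for event in events):
        raise ValueError("event column must contain only 0 (censored) or 1 (event)")
    if any(time <= 0 for time in times):
        raise ValueError("time column must be strictly positive")


entry = parse_surv_entry("death;follow_up_years\tCox-PH")
print(entry)
check_survival_values([0, 1, 1], [2.5, 0.8, 10.0])  # passes silently
```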

[Covs]

Defines covariate models for adjusted analyses. Each model specifies a set of covariates to include in regression analyses.

Format: model_name<tab>covariate1;covariate2;covariate3

Use None for an unadjusted model.

Example:

[Covs]
Unadjusted   None
Age_Sex      age;sex
Full         age;sex;bmi;smoking_status;PC1;PC2;PC3;PC4

Each model is run separately, allowing comparison of adjusted and unadjusted results.
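One way to picture the [Covs] section is as a mapping from model name to a list of covariates, with None giving an empty list. The sketch below is an assumption about a convenient in-memory representation; parse_cov_models is a hypothetical name, not a MARVELous function.

```python
def parse_cov_models(lines):
    """Map each 'model_name<tab>cov1;cov2' line to {model_name: [covariates]}."""
    models = {}
    for line in lines:
        name, sep, spec = line.partition("\t")
        if not sep:
            raise ValueError(f"expected a tab in covariate model line: {line!r}")
        spec = spec.strip()
        if spec == "None":
            models[name.strip()] = []  # unadjusted model: no covariates
        else:
            models[name.strip()] = [c.strip() for c in spec.split(";") if c.strip()]
    return models


cov_models = parse_cov_models([
    "Unadjusted\tNone",
    "Age_Sex\tage;sex",
    "Full\tage;sex;bmi;smoking_status;PC1;PC2;PC3;PC4",
])
for model_name, covariates in cov_models.items():
    print(model_name, covariates)
```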

[Stratification]

Defines filter models for running separate association analyses on subgroups of individuals. Each filter model specifies criteria to select a subset of individuals from the phenotype file. The analysis is then run independently for each filter model, similar to how each covariate model is run separately.

Format: filter_name<tab>column=value1,value2;column2=value3

  • Semicolons (;) separate criteria across different columns (AND logic)

  • Commas (,) separate multiple values within a column (OR logic)

  • Prefix values with ! to exclude instead of include

Note

Filtering applies to association testing only, not to variant extraction.

Example:

[Stratification]
females      sex=female
males        sex=male
young_females        sex=female;age_group=young,middle

In this example:

  • females: include only individuals where sex equals female

  • males: include only individuals where sex equals male

  • young_females: include individuals where sex equals female AND age_group equals young or middle

To exclude values, prefix them with !:

[Stratification]
non_smokers  smoking=!current

By default, an unfiltered “overall” analysis is also run alongside each filter model. This can be disabled with the stratify_overall option (see [Options]).
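The filter syntax can be illustrated with a short Python sketch that parses a filter specification and applies it to rows held as dictionaries. The names parse_filter and row_matches are hypothetical. The sketch also assumes that, when a column mixes plain and !-prefixed values, a row must match one of the plain values and none of the excluded ones; the documentation above does not define that mixed case.

```python
def parse_filter(spec):
    """Turn 'col=a,b;col2=!c' into {column: (include_set, exclude_set)}."""
    criteria = {}
    for criterion in spec.split(";"):
        column, sep, values = criterion.partition("=")
        if not sep:
            raise ValueError(f"expected column=value in {criterion!r}")
        include, exclude = set(), set()
        for value in values.split(","):
            value = value.strip()
            if value.startswith("!"):
                exclude.add(value[1:])
            else:
                include.add(value)
        criteria[column.strip()] = (include, exclude)
    return criteria


def row_matches(row, criteria):
    """AND across columns; OR across the included values of one column."""
    for column, (include, exclude) in criteria.items():
        value = row[column]
        if value in exclude:
            return False
        if include and value not in include:
            return False
    return True


rows = [
    {"id": "s1", "sex": "female", "age_group": "young", "smoking": "never"},
    {"id": "s2", "sex": "female", "age_group": "old", "smoking": "current"},
    {"id": "s3", "sex": "male", "age_group": "middle", "smoking": "former"},
]


def select(spec):
    """Return the ids of the rows that satisfy the filter specification."""
    criteria = parse_filter(spec)
    return [row["id"] for row in rows if row_matches(row, criteria)]


print(select("sex=female;age_group=young,middle"))
print(select("smoking=!current"))
```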

[Output]

Specifies output file paths.

Available keys:

  • VarOutput - Base path for variant extraction output files

  • TestOutput - (Legacy) Path for test output

Example:

[Output]
VarOutput    /path/to/results/extracted_variants

This creates:

  • Carrier matrix - /path/to/results/extracted_variants_carriers.tsv.gz

  • Extraction summary - /path/to/results/extracted_variants_summary.tsv.gz
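The naming pattern above amounts to appending a fixed suffix to the VarOutput base path. A minimal sketch, with the hypothetical helper name extraction_outputs and suffixes taken from the two file names listed above:

```python
def extraction_outputs(var_output):
    """Derive the two extraction output paths from the VarOutput base path."""
    return {
        "carriers": f"{var_output}_carriers.tsv.gz",
        "summary": f"{var_output}_summary.tsv.gz",
    }


outputs = extraction_outputs("/path/to/results/extracted_variants")
print(outputs["carriers"])
print(outputs["summary"])
```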

[Options]

Controls pipeline behavior with various configuration options.

Core Pipeline Options

Option                Type   Default   Description
extract_variants      bool   True      Enable variant extraction step
association_analysis  bool   True      Enable association testing step
id_column             str    id        Name of the sample ID column
output_path           str    .         Directory for output files
tmp_dir               str    ./tmp     Directory for temporary files

Column Configuration

These options configure column names in your variant input files:

Option          Type  Default  Description
chr_column      str   chr      Chromosome column name
pos_column      str   pos      Position column name
ref_column      str   a1       Reference allele column name
alt_column      str   a2       Alternate allele column name
var_column      str   ID       Variant ID column name
chr_pos_column  str   chr_pos  Combined chr:pos column name
cat_column      str   None     Category/gene column for aggregation
var_sep         str   :        Separator for variant IDs (chr:pos:ref:alt)

Missingness Handling

Option                Type   Default  Description
missingness_strategy  bool   True     True for listwise deletion, False for pairwise
max_missingness       float  0.5      Maximum allowed missingness proportion (0-1)
cov_miss_error        bool   True     Raise error if covariates exceed missingness threshold
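For readers unfamiliar with the terms, the sketch below illustrates the general statistical meaning of listwise and pairwise deletion and of a missingness proportion, using plain Python and None for missing values. All function names are hypothetical, and this is not a description of how MARVELous implements these options internally.

```python
def missing_fraction(values):
    """Proportion of entries that are missing (None)."""
    return sum(value is None for value in values) / len(values)


def listwise_rows(rows, columns):
    """Listwise deletion: keep rows with no missing value in any listed column."""
    return [row for row in rows if all(row[c] is not None for c in columns)]


def pairwise_rows(rows, column_a, column_b):
    """Pairwise deletion: keep rows that are complete for this pair only."""
    return listwise_rows(rows, [column_a, column_b])


rows = [
    {"outcome": 1.2, "age": 50, "bmi": None},
    {"outcome": 0.7, "age": None, "bmi": 24.0},
    {"outcome": 1.9, "age": 61, "bmi": 27.5},
    {"outcome": None, "age": 45, "bmi": 22.1},
]

bmi_missing = missing_fraction([row["bmi"] for row in rows])
complete = listwise_rows(rows, ["outcome", "age", "bmi"])
outcome_age = pairwise_rows(rows, "outcome", "age")
print(bmi_missing, len(complete), len(outcome_age))
```

Listwise deletion keeps fewer rows here because a single missing value in any column removes the whole row, whereas the pairwise variant only considers the two columns under analysis.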

Extraction Options

Option             Type                             Default  Description
incl_var           bool                             True     Include individual variants in output (alongside gene aggregations)
reverse            bool                             True     Also check for ref/alt swapped variants
region             str                              None     Genomic region to extract (e.g., 20:1000000-2000000)
prefilter_regions  bool                             False    When True, pre-filters the input variant list to chromosomes/positions present in the genetic file before extraction. Use to skip variants not in the file and speed up large extractions.
neg_geno           bool / None / str / int / float  True     Value that negative genotypes are set to. True = keep unchanged, None = set to NA, any other value = set to that value
sum_geno           bool / None / str / int / float  True     Value that genotypes larger than 1 are set to. True = keep unchanged, None = set to NA, any other value = set to that value
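The neg_geno and sum_geno semantics can be sketched as a small recoding function. The name recode_genotypes is hypothetical, None stands in for NA, and the sketch only follows the rules in the descriptions above (True keeps the value, None sets it to NA, anything else replaces it).

```python
def recode_genotypes(genotypes, neg_geno=True, sum_geno=True):
    """Recode negative genotypes and genotypes above 1.

    True keeps a value unchanged, None replaces it with NA (None here),
    and any other setting is used as the replacement value.
    """
    recoded = []
    for genotype in genotypes:
        if genotype is None:
            recoded.append(None)  # already missing, nothing to recode
        elif genotype < 0 and neg_geno is not True:
            recoded.append(neg_geno)
        elif genotype > 1 and sum_geno is not True:
            recoded.append(sum_geno)
        else:
            recoded.append(genotype)
    return recoded


raw = [-1, 0, 1, 2, None]
print(recode_genotypes(raw))                               # defaults keep values
print(recode_genotypes(raw, neg_geno=None, sum_geno=1))    # NA and cap at 1
```

The identity test against True (rather than an equality test) is deliberate: it lets a replacement value of 1 be distinguished from the keep-unchanged setting.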

Testing Options

Option            Type  Default  Description
min_group_size    int   1        Minimum samples required per exposure group
raise_on_error    bool  True     Raise exception on test failures
stratify_overall  bool  True     Also run an unfiltered “overall” analysis alongside filtered models
exposures         str   None     Semicolon-separated list of specific exposures to test

Parallel Execution

Option          Type  Default      Description
n_jobs          int   -1           Number of parallel jobs (-1 = all CPUs)
checkpoint_dir  str   ./tmp_check  Directory for checkpointing
force_rerun     bool  False        Force re-extraction even if checkpoints exist
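The -1 convention for n_jobs is common in Python tooling. A minimal sketch of how such a value could be resolved to a worker count is shown below; resolve_n_jobs is a hypothetical name, and rejecting values below 1 (other than -1) is an assumption, since the behaviour for those values is not documented here.

```python
import os


def resolve_n_jobs(n_jobs=-1):
    """Translate the n_jobs setting into a concrete number of workers."""
    if n_jobs == -1:
        # os.cpu_count() can return None, so fall back to a single worker.
        return os.cpu_count() or 1
    if n_jobs < 1:
        raise ValueError("n_jobs must be -1 or a positive integer")
    return n_jobs


print(resolve_n_jobs(4))
print(resolve_n_jobs(-1))
```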

Example:

[Options]
extract_variants     True
association_analysis True
id_column    sample_id
output_path  ./results
cat_column   gene
missingness_strategy True
max_missingness      0.3
n_jobs       4

Complete Example

A full reference configuration file that combines all sections and lists every available option with its default value can be downloaded here: full_config.cnf

Configuration for Specific Use Cases

Extraction Only

To run variant extraction without association testing:

[GenoInput]
chr1         /data/chr1.vcf.gz

[VarInput]
variants     /data/variants.tsv

[Output]
VarOutput    /results/extracted

[Options]
extract_variants     True
association_analysis False
id_column    IID

Association Only

To run association testing on pre-extracted data:

[ExpInput]
carriers     /data/extracted_carriers.tsv.gz

[PhenoInput]
phenotypes   /data/outcomes.tsv
covariates   /data/covariates.tsv

[BinTests]
disease      GLM-Binom;FISHER

[Covs]
Adjusted     age;sex

[Options]
extract_variants     False
association_analysis True
id_column    IID
output_path  /results

Stratified Analysis

To run association testing separately for subgroups of individuals:

[ExpInput]
carriers     /data/extracted_carriers.tsv.gz

[PhenoInput]
phenotypes   /data/outcomes.tsv
covariates   /data/covariates.tsv

[BinTests]
disease      GLM-Binom;FISHER

[Covs]
Adjusted     age;sex

[Stratification]
females      sex=female
males        sex=male
non_smokers  smoking=!current

[Options]
extract_variants     False
association_analysis True
id_column    IID
output_path  /results
stratify_overall     True

This runs the association analysis four times: once unfiltered (“overall”), once for females only, once for males only, and once excluding current smokers. Results include a Filter name column identifying which filter model was applied.

Creating a Configuration File

The configuration file can be written by hand following the formatting rules described above. Alternatively, a helper function is available to write the file for you, as illustrated below. The example also runs a check on the written file, which verifies that:

  • at least one of extract_variants and association_analysis is enabled

  • genetic files have supported formats (based on their extension)

  • the variant output identifiers are as expected

  • a covariate file is supplied if adjusted models are specified

  • survival outcomes are correctly formatted

  • the specified tests are recognised

  • the specified input files and columns are present

# imports
from marvel.utils.config_tools import (
    create_config,
    ConfigParser,
    check_config_file,
)

# define the inputs, tests, and covariate models as dictionaries
geno_input = {
    'chr21': 'path/to/chr21.vcf.gz',
    'chr22': 'path/to/chr22.vcf.gz',
}
pheno_input = {
    'phenotypes': 'path/to/outcomes.tsv',
    'covariates': 'path/to/covariates.tsv',
}
con_tests = {
    'sbp'             : ['OLS', 'KW'],
    'ldl_cholesterol' : ['OLS', 'AOV'],
    'crp_level'       : ['OLS', 'KW', 'AOV'],
}
cat_tests = {
    'disease_severity'   : ['CHISQ'],
    'treatment_response' : ['CHISQ'],
    'risk_category'      : ['CHISQ'],
}
bin_tests = {
    'hypertension' : ['GLM-Binom', 'CHISQ'],
    'diabetes'     : ['GLM-Binom', 'FISHER'],
    'cvd'          : ['GLM-Binom', 'CHISQ', 'FISHER'],
}
covs = {
    'Unadjusted' : 'None',
    'Adjusted'   : ['age', 'sex'],
}

# create the configuration file, which is written to `path`
create_config(
    path='path/to/config.cnf',
    extract_variants=True,
    geno_input=geno_input,
    variant_input={'variants': 'path/to/variants.tsv'},
    variant_output={'VarOutput': '/path/to/prefix'},
    association_analysis=True,
    pheno_input=pheno_input,
    con_tests=con_tests,
    cat_tests=cat_tests,
    bin_tests=bin_tests,
    covs=covs,
    id_column='id',
    raise_on_error=True,
    ref_column='a1',
    alt_column='a2',
    cat_column='gene',
    incl_var=False,
    output_path='path/to/output',
)

# do some basic checks on the configuration file
config = ConfigParser('path/to/config.cnf')
check_config_file(config)

See Also