Advanced Features

This page covers advanced MARVELous features including gene aggregation, missingness handling, caching, and performance optimization.

Genotype Processing

After variants are aggregated to gene level, MARVELous applies optional post-processing to handle special genotype values. Two parameters control this behaviour: neg_geno (negative values) and sum_geno (values greater than 1).

Negative genotype values

Negative genotypes arise when a sample’s genotype cannot be determined:

PLINK encodes missing/no-call genotypes as -1 (or -9 in some contexts).
VCF missing genotypes are already represented as NaN during reading.
BGEN dosage values are always non-negative, so this is not relevant.

After gene aggregation, a sample that carries a missing variant in a gene will produce a negative gene-level sum (e.g. -1).

Values greater than 1

A gene-level sum greater than 1 arises when:

A sample carries more than one qualifying variant in the same gene.
A BGEN dosage is fractional and the sum across variants exceeds 1.

These multi-carrier samples can be biologically real and often informative, but some analyses require binary (carrier / non-carrier) status.

`neg_geno` / `sum_geno` reference table

Value	Effect
`True` (default)	Keep the raw value unchanged.
`None`	Replace with `NA` (treat as missing).
integer / string (e.g. `0` or `1`)	Replace all matching values with the given value.

Common use cases

Keep raw values (default)

The default True preserves every value as extracted — no configuration needed:

[Options]
neg_geno     True
sum_geno     True

Treat missing genotypes as missing data

Set neg_geno=None to convert PLINK -1 no-calls to NA:

[Options]
neg_geno     None

Treat missing genotypes as non-carriers

Set neg_geno=0 to treat no-call samples as non-carriers:

[Options]
neg_geno     0

Binary carrier status

Set sum_geno=1 to cap gene counts at 1, converting every carrier to a binary (0/1) indicator regardless of how many variants they carry:

[Options]
sum_geno     1

Note

neg_geno and sum_geno can be combined freely. For example, setting neg_geno=0 and sum_geno=1 recodes missing as non-carrier and caps counts at 1.
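
The recoding rules from the reference table can be sketched in Python. This is a minimal illustration under stated assumptions, not MARVELous's own implementation: the helper name recode_gene_value is hypothetical, and None stands in for NA.

```python
import math


def recode_gene_value(value, neg_geno=True, sum_geno=True):
    """Recode one gene-level value (hypothetical helper; None stands in for NA)."""
    # Values that are already missing stay missing.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    # Negative sums: keep the raw value when the option is True,
    # otherwise substitute the configured value (None means NA).
    if value < 0:
        return value if neg_geno is True else neg_geno
    # Sums above 1 follow the same rule, controlled by sum_geno.
    if value > 1:
        return value if sum_geno is True else sum_geno
    return value


raw = [-1, 0, 1, 3]
defaults = [recode_gene_value(v) for v in raw]
binary = [recode_gene_value(v, neg_geno=0, sum_geno=1) for v in raw]
as_missing = [recode_gene_value(v, neg_geno=None) for v in raw]
print(defaults, binary, as_missing)
```

The check uses `is True` rather than a plain truth test so that an integer setting such as 1 is treated as a replacement value and not as "keep".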

Two-Phase Extraction

When aggregating variants by gene, MARVELous extracts in two phases so that variants spread across several genetic files are combined before they are aggregated.

Note

Aggregating variants is generally referred to as aggregating by gene in this documentation. However, any type of aggregation specified by the cat_column is supported.

How It Works

Phase 1: Variant Extraction

Extract all variants from all genetic files
Combine variant carriers across files
Each variant tracked individually

Phase 2: Gene Aggregation

Aggregate combined variants by gene/category
Apply consistent aggregation logic
Preserve both variant-level and gene-level data

This approach ensures that variants split across multiple genetic files (e.g., different chromosomes) are correctly aggregated to the same category.
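
The two phases can be sketched with plain dictionaries. The function names, the data layout ({variant: {sample: count}}) and the split of variants across two files are assumptions made for illustration; they are not taken from the MARVELous code base.

```python
from collections import defaultdict


def combine_variants(per_file):
    """Phase 1: merge {variant: {sample: count}} mappings from several files."""
    combined = defaultdict(dict)
    for carriers in per_file:
        for variant, samples in carriers.items():
            combined[variant].update(samples)
    return dict(combined)


def aggregate_by_gene(combined, variant_to_gene):
    """Phase 2: sum variant counts per sample within each gene/category."""
    genes = defaultdict(lambda: defaultdict(int))
    for variant, samples in combined.items():
        gene = variant_to_gene[variant]
        for sample, count in samples.items():
            genes[gene][sample] += count
    return {gene: dict(samples) for gene, samples in genes.items()}


# Two hypothetical genetic files, each holding part of the variant list.
file_a = {"22:16050984:A:G": {"sample1": 1, "sample2": 1}}
file_b = {
    "22:16051107:C:T": {"sample1": 1, "sample2": 0},
    "22:16051249:G:A": {"sample1": 1, "sample2": 0},
}
variant_to_gene = {
    "22:16050984:A:G": "BRCA1",
    "22:16051107:C:T": "BRCA1",
    "22:16051249:G:A": "BRCA2",
}

variants = combine_variants([file_a, file_b])
gene_counts = aggregate_by_gene(variants, variant_to_gene)
print(gene_counts)
```

Because aggregation only runs after all files have been merged, the two BRCA1 variants contribute to the same gene column even though they come from different files.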

Configuration

Enable aggregation by setting cat_column:

[Options]
cat_column   gene
incl_var     True

The cat_column specifies which column in the variant file contains the gene/category identifier. Setting incl_var to True preserves individual variant columns alongside category aggregations.

Variant File Format

The variant file must include the aggregation column:

chr  pos     a1      a2      gene
22   16050984        A       G       BRCA1
22   16051107        C       T       BRCA1
22   16051249        G       A       BRCA2

Output

With gene aggregation enabled, the carrier file includes both:

Gene columns (aggregated carrier counts)
Variant columns (if incl_var=True)

id   BRCA1   BRCA2   22:16050984:A:G 22:16051107:C:T 22:16051249:G:A
sample1      2       1       1       1       1
sample2      1       0       1       0       0

Missingness Handling

MARVELous handles missing data through the MissingnessHandler class, which supports two deletion strategies and a per-variable missingness threshold.

Strategies

Listwise Deletion (missingness_strategy=True)

Remove samples with ANY missing values across all relevant columns
Applied once at pipeline start
Most conservative approach
Ensures consistent sample set across all tests

Pairwise Deletion (missingness_strategy=False)

Remove samples with missing values only for current test
Applied per exposure-outcome-covariate combination
Maximizes sample size per test
Different sample sets for different tests
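
The difference between the two strategies can be illustrated with a small sketch. The helper complete_ids and the example column names are hypothetical, and None stands in for a missing value; the MissingnessHandler class itself may work differently.

```python
def complete_ids(rows, columns):
    """Return the IDs of samples with no missing (None) value in `columns`."""
    return {
        sample_id
        for sample_id, row in rows.items()
        if all(row.get(column) is not None for column in columns)
    }


rows = {
    "s1": {"GENE1": 1, "bmi": 22.0, "ldl": 3.1, "age": 50},
    "s2": {"GENE1": 0, "bmi": None, "ldl": 2.7, "age": 61},
    "s3": {"GENE1": 0, "bmi": 27.5, "ldl": None, "age": 47},
}

# Listwise: one sample set, complete across every relevant column.
listwise = complete_ids(rows, ["GENE1", "bmi", "ldl", "age"])

# Pairwise: a separate sample set per exposure-outcome-covariate combination.
pairwise_bmi = complete_ids(rows, ["GENE1", "bmi", "age"])
pairwise_ldl = complete_ids(rows, ["GENE1", "ldl", "age"])
print(listwise, pairwise_bmi, pairwise_ldl)
```

In this toy data listwise deletion keeps one sample for every test, while pairwise deletion keeps two samples per test, but a different pair for each outcome.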

Configuration

[Options]
missingness_strategy True
max_missingness      0.3
cov_miss_error       True

Parameters:

missingness_strategy: True for listwise, False for pairwise
max_missingness: Maximum allowed proportion of missing values (0-1)
cov_miss_error: Raise error if covariates exceed threshold (vs. warning)

Missingness Threshold

Variables exceeding max_missingness are handled differently by type:

Outcomes: Automatically removed from analysis
Exposures: Automatically removed from analysis
Covariates: Error if cov_miss_error=True, warning if False
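
A rough sketch of the threshold rule follows. The function name, the role labels and the choice to keep a covariate after a warning are assumptions for illustration only.

```python
import warnings


def filter_by_missingness(fractions, roles, max_missingness=0.3, cov_miss_error=True):
    """Return the variables that remain after applying the threshold (hypothetical helper)."""
    kept = []
    for name, fraction in fractions.items():
        if fraction <= max_missingness:
            kept.append(name)
            continue
        if roles[name] == "covariate":
            message = f"covariate {name!r} has {fraction:.0%} missing values"
            if cov_miss_error:
                raise ValueError(message)
            warnings.warn(message)
            kept.append(name)  # assumption: the covariate stays in the model
        # Outcomes and exposures above the threshold are dropped silently.
    return kept


fractions = {"bmi": 0.10, "ldl": 0.50, "GENE1": 0.40, "age": 0.05}
roles = {"bmi": "outcome", "ldl": "outcome", "GENE1": "exposure", "age": "covariate"}
kept = filter_by_missingness(fractions, roles)
print(kept)
```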

Workflow

The missingness handling workflow:

Check threshold - Filter outcomes/exposures exceeding max_missingness
Compute valid IDs (listwise only) - Find samples complete across all variables
Initialize DataManager - Set up with valid IDs for filtering
Per-test handling - Apply pairwise deletion for NaN values in the variables of the current test

DataManager Caching

The DataManager class provides intelligent caching to optimize memory and I/O performance.

Cache Architecture

MARVELous uses three caches:

Outcome Cache (LRU): stores recently loaded outcome columns (default capacity: 20 items), reducing file reads when testing multiple exposures.
Covariate Cache (LRU): stores recently loaded covariate sets, keyed by the sorted covariate list, so tests that share a model reuse the cached covariates.
Exposure Cache (Single): stores the exposure currently being tested; it is reused across all outcomes and cleared when moving to the next exposure.

Cache Statistics

After association testing, cache statistics are logged:

[info] 14:49:37 > DataManager cache statistics (stratification 'overall'):
[info] 14:49:37 >   Outcomes: 6 hits, 2 misses, hit rate: 75.00%
[info] 14:49:37 >   Covariates: 0 hits, 0 misses, hit rate: 0.00%
[info] 14:49:37 >   Exposure reuses: 0

High hit rates indicate effective caching.
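
For readers unfamiliar with LRU caching, the idea behind the outcome cache and its hit-rate statistic can be sketched as below. The LRUCache class and its loader callback are illustrative assumptions and do not reproduce the DataManager API.

```python
from collections import OrderedDict


class LRUCache:
    """Least-recently-used cache with hit/miss counters (illustrative only)."""

    def __init__(self, capacity=20):
        self.capacity = capacity
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, loader):
        if key in self._items:
            self.hits += 1
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        self.misses += 1
        value = loader(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
        return value

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return 100.0 * self.hits / total if total else 0.0


cache = LRUCache(capacity=20)
for exposure in range(4):              # four exposures ...
    for outcome in ("bmi", "ldl"):     # ... each tested against two outcomes
        cache.get(outcome, lambda name: f"column:{name}")

print(f"Outcomes: {cache.hits} hits, {cache.misses} misses, hit rate: {cache.hit_rate:.2f}%")
```

Each outcome is read once and then served from the cache for the remaining exposures, which is the pattern behind a log line such as "6 hits, 2 misses".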

Configuration

Cache sizes are set in marvel/constants.py. For most analyses the defaults are sufficient. Advanced users can tune cache sizes via the marvel.pipeline module API.

Parallel Execution

MARVELous supports parallel execution for variant extraction.

Configuration

[Options]
n_jobs       -1
checkpoint_dir       ./checkpoints
force_rerun  False

Parameters:

n_jobs: Number of parallel jobs (-1 = all CPUs)
checkpoint_dir: Directory for checkpoint files
force_rerun: Ignore checkpoints and re-run everything
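
The n_jobs convention (-1 meaning all CPUs) can be sketched with the standard library. The helper names are hypothetical, and a thread pool is used here only to keep the example short; how MARVELous schedules its workers is not described on this page.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_n_jobs(n_jobs):
    """Translate n_jobs into a worker count; -1 means all available CPUs."""
    if n_jobs == -1:
        return os.cpu_count() or 1
    if n_jobs < 1:
        raise ValueError("n_jobs must be -1 or a positive integer")
    return n_jobs


def extract_all(files, extract_one, n_jobs=-1):
    """Run extract_one over every file in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=resolve_n_jobs(n_jobs)) as pool:
        return list(pool.map(extract_one, files))


# Stand-in extraction function: upper-cases the (hypothetical) file name.
results = extract_all(["chr21.bgen", "chr22.bgen"], str.upper, n_jobs=2)
print(results)
```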

Checkpointing

Extraction tasks are checkpointed:

Before starting, check if checkpoint exists
If exists and force_rerun=False, load from checkpoint
After completion, save checkpoint

This allows resuming interrupted extractions:

# First run (interrupted)
marvelous config.cnf -v
# Ctrl+C or failure

# Resume from checkpoints
marvelous config.cnf -v

To start fresh:

[Options]
force_rerun  True
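
The check-load-save cycle described in this section can be sketched as follows. The function name, the use of pickle and the one-file-per-task layout are assumptions for illustration; the actual checkpoint format is not documented here.

```python
import pickle
import tempfile
from pathlib import Path


def run_with_checkpoint(task_id, compute, checkpoint_dir, force_rerun=False):
    """Return a cached result if a checkpoint exists, otherwise compute and save it."""
    path = Path(checkpoint_dir) / f"{task_id}.pkl"
    if path.exists() and not force_rerun:
        with path.open("rb") as handle:
            return pickle.load(handle)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []


def slow_extract():
    calls.append(1)  # count how often the real work runs
    return {"22:16050984:A:G": ["sample1"]}


with tempfile.TemporaryDirectory() as tmp:
    first = run_with_checkpoint("chr22", slow_extract, tmp)
    second = run_with_checkpoint("chr22", slow_extract, tmp)  # loaded from disk
    third = run_with_checkpoint("chr22", slow_extract, tmp, force_rerun=True)

print(len(calls))
```

The second call skips the extraction entirely, while force_rerun=True repeats it and overwrites the checkpoint.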

Memory Optimization

Tips for analyzing large datasets:

Region Filtering

Extract variants from specific genomic regions:

[Options]
region       22:16000000-17000000

Format: chromosome:start-end

This filters the genetic file to only consider variants in the specified region.
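
A small sketch of how a region string in this format can be parsed and applied. The helper names are hypothetical, and treating both bounds as inclusive is an assumption.

```python
def parse_region(region):
    """Split 'chromosome:start-end' into (chromosome, start, end)."""
    chrom, _, span = region.partition(":")
    start, _, end = span.partition("-")
    return chrom, int(start), int(end)


def in_region(chrom, pos, region):
    """True if the position lies inside the region (bounds assumed inclusive)."""
    region_chrom, start, end = parse_region(region)
    return chrom == region_chrom and start <= pos <= end


region = "22:16000000-17000000"
parsed = parse_region(region)
inside = in_region("22", 16050984, region)
outside = in_region("22", 17500000, region)
print(parsed, inside, outside)
```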

Variant Pre-filtering by Region

MARVELous can pre-filter the input variants list before extraction to skip variants not present in a genetic file. This is a performance optimisation that uses the genetic file’s own index — no external BED file is needed.

Configuration

[Options]
prefilter_regions    True

Parameters:

prefilter_regions: Boolean (default False). When True, restricts the variants list to chromosomes and positions present in the genetic file before extraction begins.

When to use

Enable prefilter_regions when:

The variants list is large and many variants are absent from the genetic file.
You want to reduce unnecessary I/O and speed up large extractions.

Note

prefilter_regions is an optimisation flag only. It does not change which variants are extracted — only how quickly the extraction runs. Variants absent from the genetic file are skipped regardless of this setting.

Column-Based Loading

DataManager loads only required columns:

Phenotype file: ID + specific outcome
Covariate file: ID + specific covariates
Exposure file: ID + specific exposure

This avoids loading entire files into memory.
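
The idea of keeping only the ID column plus the requested columns can be sketched with the standard csv module. The helper load_columns, the tab delimiter and the column names are assumptions; DataManager's own reader is not shown on this page.

```python
import csv
import io


def load_columns(handle, id_column, columns, delimiter="\t"):
    """Read a delimited file but keep only the ID column and the requested columns."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    wanted = [id_column, *columns]
    return [{name: row[name] for name in wanted} for row in reader]


# In-memory stand-in for a phenotype file with three outcomes.
phenotypes = io.StringIO(
    "id\tbmi\tldl\thdl\n"
    "s1\t22.0\t3.1\t1.2\n"
    "s2\t27.5\t2.7\t1.0\n"
)
bmi_rows = load_columns(phenotypes, "id", ["bmi"])
print(bmi_rows)
```

Rows are streamed one at a time, so only the selected columns are held in memory once reading finishes.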

Process in Batches

When an analysis includes a large number of exposures, process them in batches by specifying subsets:

[Options]
exposures    GENE1;GENE2;GENE3;GENE4;GENE5

Run multiple configurations with different exposure subsets.
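
Splitting a long exposure list into values for the exposures option can be scripted. The helper below is a hypothetical convenience, not part of MARVELous; it only builds the semicolon-separated strings used in the example above.

```python
def exposure_batches(exposures, batch_size):
    """Yield semicolon-separated exposure strings, batch_size names at a time."""
    for start in range(0, len(exposures), batch_size):
        yield ";".join(exposures[start:start + batch_size])


genes = ["GENE1", "GENE2", "GENE3", "GENE4", "GENE5"]
batches = list(exposure_batches(genes, 2))
for value in batches:
    print(f"exposures    {value}")
```

Each printed line can be placed in the [Options] section of a separate configuration file.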

Variant ID Configuration

Customize variant ID format:

[Options]
var_sep      :

The default variant ID format is chr:pos:ref:alt using : as separator.

With different separators:

[Options]
var_sep      _

Produces IDs like 22_16050984_A_G.
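
As a sketch, the ID is the four fields joined by the separator. The function name make_variant_id is hypothetical.

```python
def make_variant_id(chrom, pos, ref, alt, var_sep=":"):
    """Join chromosome, position and alleles with the configured separator."""
    return var_sep.join([str(chrom), str(pos), ref, alt])


default_id = make_variant_id(22, 16050984, "A", "G")
underscore_id = make_variant_id(22, 16050984, "A", "G", var_sep="_")
print(default_id, underscore_id)
```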

Allele Matching

MARVELous can match variants with swapped ref/alt alleles:

[Options]
reverse      True

When True, a variant 22:16050984:A:G also matches 22:16050984:G:A in the genetic data.
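
One way to picture the effect of reverse is as a list of IDs that are accepted as a match. The helper candidate_ids is an illustrative assumption and says nothing about how the matching is implemented internally.

```python
def candidate_ids(chrom, pos, ref, alt, reverse=False, var_sep=":"):
    """Return the IDs treated as a match; adds the swapped-allele ID when reverse=True."""
    ids = [var_sep.join([str(chrom), str(pos), ref, alt])]
    if reverse:
        ids.append(var_sep.join([str(chrom), str(pos), alt, ref]))
    return ids


strict = candidate_ids(22, 16050984, "A", "G")
swapped = candidate_ids(22, 16050984, "A", "G", reverse=True)
print(strict, swapped)
```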

Programmatic Usage

MARVELous can also be used as a Python library for advanced workflows. The PipelineConfig and MARVELousPipeline classes allow constructing and running pipelines directly from Python code. See the marvel.pipeline module documentation for details.
