Advanced Features
This page covers advanced MARVELous features including gene aggregation, missingness handling, caching, and performance optimization.
Genotype Processing
After variants are aggregated to gene level, MARVELous applies optional post-processing
to handle special genotype values. Two parameters control this behaviour: neg_geno
(negative values) and sum_geno (values greater than 1).
Negative genotype values
Negative genotypes arise when a sample’s genotype cannot be determined:
PLINK encodes missing/no-call genotypes as -1 (or -9 in some contexts).
VCF missing genotypes are already represented as NaN during reading.
BGEN dosage values are always non-negative, so this is not relevant.
After gene aggregation, a sample with a missing genotype for a variant in a gene
will produce a negative gene-level sum (e.g. -1).
Values greater than 1
A gene-level sum greater than 1 arises when:
A sample carries more than one qualifying variant in the same gene.
A BGEN dosage is fractional and the sum across variants exceeds 1.
These multi-carrier samples can be biologically real and often informative, but some analyses require binary (carrier / non-carrier) status.
neg_geno / sum_geno reference table
| Value | Effect |
|---|---|
| True | Keep the raw value unchanged. |
| None | Replace with NA. |
| integer / string (e.g. 0 or 1) | Replace all matching values with the given value. |
Common use cases
Keep raw values (default)
The default True preserves every value as extracted — no configuration needed:
[Options]
neg_geno True
sum_geno True
Treat missing genotypes as missing data
Set neg_geno=None to convert PLINK -1 no-calls to NA:
[Options]
neg_geno None
Treat missing genotypes as non-carriers
Set neg_geno=0 to treat no-call samples as non-carriers:
[Options]
neg_geno 0
Binary carrier status
Set sum_geno=1 to cap gene counts at 1, converting every carrier to a
binary (0/1) indicator regardless of how many variants they carry:
[Options]
sum_geno 1
Note
neg_geno and sum_geno can be combined freely. For example,
setting neg_geno=0 and sum_geno=1 recodes missing as non-carrier
and caps counts at 1.
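The recoding rules in the reference table can be illustrated with a minimal Python sketch. It is not MARVELous's own implementation: the function names `recode` and `postprocess` are hypothetical, NA is represented by a float NaN, and the settings are passed as Python values rather than read from a configuration file.

```python
import math

NA = float("nan")  # stand-in for a missing value in this sketch


def recode(value, setting):
    """Apply one neg_geno/sum_geno setting to a single matching value."""
    if setting is True:   # keep the raw value unchanged
        return value
    if setting is None:   # replace with NA
        return NA
    return setting        # replace with the given value


def postprocess(gene_sums, neg_geno=True, sum_geno=True):
    """Recode negative values (neg_geno) and values above 1 (sum_geno)."""
    out = []
    for value in gene_sums:
        if value < 0:
            value = recode(value, neg_geno)
        elif value > 1:
            value = recode(value, sum_geno)
        out.append(value)
    return out


raw = [-1, 0, 1, 2]
print(postprocess(raw))                          # [-1, 0, 1, 2]
print(postprocess(raw, neg_geno=0, sum_geno=1))  # [0, 0, 1, 1]
```

The check `setting is True` is deliberate: in Python `True == 1`, so an equality test would confuse the default with `sum_geno=1`.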
Two-Phase Extraction
When aggregating variants by gene, MARVELous uses a two-phase extraction approach for optimal accuracy.
Note
Aggregating variants is generally referred to as aggregating by gene in
this documentation. However, any type of aggregation specified by the
cat_column is supported.
How It Works
Phase 1: Variant Extraction
Extract all variants from all genetic files
Combine variant carriers across files
Each variant tracked individually
Phase 2: Gene Aggregation
Aggregate combined variants by gene/category
Apply consistent aggregation logic
Preserve both variant-level and gene-level data
This approach ensures that variants split across multiple genetic files (e.g., different chromosomes) are correctly aggregated to the same category.
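The two phases can be sketched in plain Python. This is an illustration only: the data layout (one dictionary of carriers per genetic file) and the function names are assumptions, not the structures MARVELous uses internally.

```python
from collections import defaultdict

# Hypothetical per-file results: {variant_id: {sample_id: carrier_count}}
file_a = {"22:16050984:A:G": {"sample1": 1, "sample2": 1}}
file_b = {
    "22:16051107:C:T": {"sample1": 1},
    "22:16051249:G:A": {"sample1": 1},
}
variant_to_gene = {
    "22:16050984:A:G": "BRCA1",
    "22:16051107:C:T": "BRCA1",
    "22:16051249:G:A": "BRCA2",
}


def combine_variants(per_file_results):
    """Phase 1: merge variant-level carriers from every genetic file."""
    combined = defaultdict(lambda: defaultdict(int))
    for result in per_file_results:
        for variant, carriers in result.items():
            for sample, count in carriers.items():
                combined[variant][sample] += count
    return combined


def aggregate_by_category(combined, variant_to_category):
    """Phase 2: sum the combined variant counts within each category."""
    totals = defaultdict(lambda: defaultdict(int))
    for variant, carriers in combined.items():
        category = variant_to_category[variant]
        for sample, count in carriers.items():
            totals[category][sample] += count
    return totals


combined = combine_variants([file_a, file_b])
genes = aggregate_by_category(combined, variant_to_gene)
print(dict(genes["BRCA1"]))  # {'sample1': 2, 'sample2': 1}
print(dict(genes["BRCA2"]))  # {'sample1': 1}
```

Because aggregation runs only after all files are combined, the two BRCA1 variants are summed together even though they come from different files.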
Configuration
Enable aggregation by setting cat_column:
[Options]
cat_column gene
incl_var True
The cat_column specifies which column in the variant file contains the
gene/category identifier. Setting incl_var to True preserves individual
variant columns alongside category aggregations.
Variant File Format
The variant file must include the aggregation column:
chr pos a1 a2 gene
22 16050984 A G BRCA1
22 16051107 C T BRCA1
22 16051249 G A BRCA2
Output
With gene aggregation enabled, the carrier file includes both:
Gene columns (aggregated carrier counts)
Variant columns (if incl_var=True)
id BRCA1 BRCA2 22:16050984:A:G 22:16051107:C:T 22:16051249:G:A
sample1 2 1 1 1 1
sample2 1 0 1 0 0
Missingness Handling
MARVELous provides sophisticated handling of missing data through the
MissingnessHandler class.
Strategies
Listwise Deletion (missingness_strategy=True)
Remove samples with ANY missing values across all relevant columns
Applied once at pipeline start
Most conservative approach
Ensures consistent sample set across all tests
Pairwise Deletion (missingness_strategy=False)
Remove samples with missing values only for current test
Applied per exposure-outcome-covariate combination
Maximizes sample size per test
Different sample sets for different tests
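The difference between the two strategies is easiest to see on a small example. The sketch below uses a hypothetical sample table with `None` marking missing values; the helper `complete_ids` is illustrative and is not part of the MissingnessHandler API.

```python
# Hypothetical sample table; None marks a missing value.
data = {
    "s1": {"GENE1": 1, "bmi": 24.0, "height": 1.80, "age": 50},
    "s2": {"GENE1": 0, "bmi": None, "height": 1.65, "age": 61},
    "s3": {"GENE1": 0, "bmi": 31.5, "height": None, "age": 47},
}


def complete_ids(data, columns):
    """Return IDs of samples with no missing value in the given columns."""
    return {
        sample
        for sample, row in data.items()
        if all(row[col] is not None for col in columns)
    }


# Listwise: one sample set, computed once over every relevant column.
listwise = complete_ids(data, ["GENE1", "bmi", "height", "age"])

# Pairwise: a separate sample set per exposure-outcome-covariate combination.
pairwise_bmi = complete_ids(data, ["GENE1", "bmi", "age"])
pairwise_height = complete_ids(data, ["GENE1", "height", "age"])

print(sorted(listwise))         # ['s1']
print(sorted(pairwise_bmi))     # ['s1', 's3']
print(sorted(pairwise_height))  # ['s1', 's2']
```

Listwise deletion keeps one sample for every test, while pairwise deletion keeps two samples per test at the cost of testing different sample sets.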
Configuration
[Options]
missingness_strategy True
max_missingness 0.3
cov_miss_error True
Parameters:
missingness_strategy: True for listwise, False for pairwise
max_missingness: Maximum allowed proportion of missing values (0-1)
cov_miss_error: Raise error if covariates exceed threshold (vs. warning)
Missingness Threshold
Variables exceeding max_missingness are handled differently by type:
Outcomes: Automatically removed from analysis
Exposures: Automatically removed from analysis
Covariates: Error if cov_miss_error=True, warning if False
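A rough sketch of this threshold check follows. The function names and the column/role dictionaries are hypothetical, and one behaviour is an assumption the text does not state: a covariate that only triggers a warning is kept in the model here.

```python
import warnings


def missing_fraction(values):
    """Proportion of entries that are missing (None) in a column."""
    return sum(v is None for v in values) / len(values)


def apply_threshold(columns, roles, max_missingness=0.3, cov_miss_error=True):
    """Return the names of columns kept after the missingness check.

    columns: {name: list of values}
    roles:   {name: 'outcome' | 'exposure' | 'covariate'}
    """
    kept = []
    for name, values in columns.items():
        if missing_fraction(values) <= max_missingness:
            kept.append(name)
        elif roles[name] == "covariate":
            message = f"covariate {name!r} exceeds max_missingness"
            if cov_miss_error:
                raise ValueError(message)
            warnings.warn(message)
            kept.append(name)  # assumption: a warned covariate stays in the model
        # outcomes and exposures above the threshold are removed from the analysis
    return kept


columns = {"bmi": [24.0, None, None, 31.5], "GENE1": [1, 0, 0, 0]}
roles = {"bmi": "outcome", "GENE1": "exposure"}
print(apply_threshold(columns, roles))  # ['GENE1']
```

Here `bmi` is missing for half the samples, which exceeds the 0.3 threshold, so it is dropped.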
Workflow
The missingness handling workflow:
Check threshold - Filter outcomes/exposures exceeding max_missingness
Compute valid IDs (listwise only) - Find samples complete across all variables
Initialize DataManager - Set up with valid IDs for filtering
Per-test handling - Apply pairwise deletion for NaN in specific test
DataManager Caching
The DataManager class provides intelligent caching to optimize memory
and I/O performance.
Cache Architecture
MARVELous uses three caches:
Outcome Cache (LRU)
Stores recently loaded outcome columns
Default capacity: 20 items
Reduces file reads when testing multiple exposures
Covariate Cache (LRU)
Stores recently loaded covariate sets
Keyed by sorted covariate list
Same model reuses cached covariates
Exposure Cache (Single)
Stores current exposure being tested
Cleared when moving to next exposure
Reused across all outcomes
Cache Statistics
After association testing, cache statistics are logged:
[info] 14:49:37 > DataManager cache statistics (stratification 'overall'):
[info] 14:49:37 > Outcomes: 6 hits, 2 misses, hit rate: 75.00%
[info] 14:49:37 > Covariates: 0 hits, 0 misses, hit rate: 0.00%
[info] 14:49:37 > Exposure reuses: 0
High hit rates indicate effective caching.
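To make the hit and miss counts concrete, here is a minimal LRU cache sketch built on `collections.OrderedDict`. It is not the DataManager implementation; the class name, the `get(key, loader)` interface, and the placeholder loader are all assumptions made for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Small least-recently-used cache that counts hits and misses."""

    def __init__(self, capacity=20):
        self.capacity = capacity
        self.items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, loader):
        """Return the cached value, calling loader(key) on a miss."""
        if key in self.items:
            self.hits += 1
            self.items.move_to_end(key)     # mark as most recently used
            return self.items[key]
        self.misses += 1
        value = loader(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used
        return value

    def hit_rate(self):
        """Hit rate in percent; 0.0 when the cache was never queried."""
        total = self.hits + self.misses
        return 100.0 * self.hits / total if total else 0.0


def load_outcome(name):
    """Placeholder for reading one outcome column from disk."""
    return f"<column {name}>"


cache = LRUCache(capacity=20)

# Two outcomes tested against four exposures: 2 misses, then 6 hits.
for exposure in ["GENE1", "GENE2", "GENE3", "GENE4"]:
    for outcome in ["bmi", "height"]:
        cache.get(outcome, load_outcome)

print(f"{cache.hits} hits, {cache.misses} misses, hit rate: {cache.hit_rate():.2f}%")
# 6 hits, 2 misses, hit rate: 75.00%
```

Each outcome is read from disk once and then served from memory for the remaining exposures, which is the pattern behind a high outcome hit rate.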
Configuration
Cache sizes are set in marvel/constants.py. For most analyses the
defaults are sufficient. Advanced users can tune cache sizes via the
marvel.pipeline module API.
Parallel Execution
MARVELous supports parallel execution for variant extraction.
Configuration
[Options]
n_jobs -1
checkpoint_dir ./checkpoints
force_rerun False
Parameters:
n_jobs: Number of parallel jobs (-1 = all CPUs)
checkpoint_dir: Directory for checkpoint files
force_rerun: Ignore checkpoints and re-run everything
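As an illustration of the n_jobs convention, the sketch below resolves -1 to the number of available CPUs and runs a placeholder extraction over several files. The helper names and file names are hypothetical, and a thread pool is used only to keep the example portable; MARVELous may parallelise differently.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_n_jobs(n_jobs):
    """Translate n_jobs into a worker count; -1 means every available CPU."""
    if n_jobs == -1:
        return os.cpu_count() or 1
    if n_jobs < 1:
        raise ValueError("n_jobs must be -1 or a positive integer")
    return n_jobs


def extract(genetic_file):
    """Placeholder for extracting variants from one genetic file."""
    return f"extracted {genetic_file}"


files = ["chr21.bgen", "chr22.bgen"]  # hypothetical file names
with ThreadPoolExecutor(max_workers=resolve_n_jobs(-1)) as pool:
    results = list(pool.map(extract, files))

print(results)  # ['extracted chr21.bgen', 'extracted chr22.bgen']
```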
Checkpointing
Extraction tasks are checkpointed:
Before starting, check if checkpoint exists
If exists and force_rerun=False, load from checkpoint
After completion, save checkpoint
This allows resuming interrupted extractions:
# First run (interrupted)
marvelous config.cnf -v
# Ctrl+C or failure
# Resume from checkpoints
marvelous config.cnf -v
To start fresh:
[Options]
force_rerun True
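The checkpoint logic described in this section can be sketched as follows. This is an assumption-laden illustration: the function name, the pickle format, and the one-file-per-task naming are hypothetical and may not match the files MARVELous writes to checkpoint_dir.

```python
import pickle
from pathlib import Path


def run_with_checkpoint(task_name, task, checkpoint_dir, force_rerun=False):
    """Run task() unless a checkpoint for task_name can be loaded instead."""
    directory = Path(checkpoint_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{task_name}.pkl"  # hypothetical file naming

    if path.exists() and not force_rerun:
        with path.open("rb") as handle:
            return pickle.load(handle)     # resume: skip the work

    result = task()
    with path.open("wb") as handle:
        pickle.dump(result, handle)        # save only after completion
    return result
```

Because the checkpoint is written only after the task finishes, an interrupted task leaves no checkpoint behind and is simply repeated on the next run.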
Memory Optimization
Tips for analyzing large datasets:
Region Filtering
Extract variants from specific genomic regions:
[Options]
region 22:16000000-17000000
Format: chromosome:start-end
This filters the genetic file to only consider variants in the specified region.
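For reference, a region string in this format can be parsed and applied with a few lines of Python. The helper names are hypothetical, and treating both bounds as inclusive is an assumption, since the text does not say how the end position is handled.

```python
def parse_region(region):
    """Split 'chromosome:start-end' into (chromosome, start, end)."""
    chromosome, _, span = region.partition(":")
    start, _, end = span.partition("-")
    return chromosome, int(start), int(end)


def in_region(chromosome, position, region):
    """True if the position lies inside the region (bounds assumed inclusive)."""
    region_chr, start, end = parse_region(region)
    return chromosome == region_chr and start <= position <= end


print(parse_region("22:16000000-17000000"))               # ('22', 16000000, 17000000)
print(in_region("22", 16050984, "22:16000000-17000000"))  # True
print(in_region("21", 16050984, "22:16000000-17000000"))  # False
```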
Variant Pre-filtering by Region
MARVELous can pre-filter the input variants list before extraction to skip variants not present in a genetic file. This is a performance optimisation that uses the genetic file’s own index — no external BED file is needed.
Configuration
[Options]
prefilter_regions True
Parameters:
prefilter_regions: Boolean (default False). When True, restricts the variants list to chromosomes and positions present in the genetic file before extraction begins.
When to use
Enable prefilter_regions when:
The variants list is large and many variants are absent from the genetic file.
You want to reduce unnecessary I/O and speed up large extractions.
Note
prefilter_regions is an optimisation flag only. It does not change which
variants are extracted — only how quickly the extraction runs. Variants absent
from the genetic file are skipped regardless of this setting.
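Conceptually, the pre-filter narrows the variants list before any genotype data is read. In the sketch below the genetic file's index is represented as a set of (chromosome, position) pairs; that representation and the function name are assumptions, not how MARVELous reads the index.

```python
def prefilter(variants, indexed_positions):
    """Keep only variants whose (chromosome, position) is in the file's index."""
    return [v for v in variants if (v[0], v[1]) in indexed_positions]


# Requested variants as (chromosome, position, a1, a2) tuples.
variants = [
    ("22", 16050984, "A", "G"),
    ("22", 99999999, "C", "T"),  # not in this genetic file
    ("21", 5000, "G", "A"),      # different chromosome
]
index = {("22", 16050984), ("22", 16051107)}

print(prefilter(variants, index))  # [('22', 16050984, 'A', 'G')]
```

Only one of the three requested variants needs to be looked up in the genetic file; the other two would have been skipped during extraction anyway.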
Column-Based Loading
DataManager loads only required columns:
Phenotype file: ID + specific outcome
Covariate file: ID + specific covariates
Exposure file: ID + specific exposure
This avoids loading entire files into memory.
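The idea of keeping only the needed columns can be shown with the standard `csv` module. The function name, the tab delimiter, and the file layout are assumptions for illustration; DataManager's actual loading code is not shown here.

```python
import csv
import io


def load_columns(handle, id_column, columns):
    """Read a delimited file but keep only the ID and the requested columns."""
    wanted = [id_column, *columns]
    reader = csv.DictReader(handle, delimiter="\t")
    return [{name: row[name] for name in wanted} for row in reader]


# In-memory stand-in for a phenotype file with a hypothetical layout.
phenotypes = io.StringIO(
    "id\tbmi\theight\tweight\n"
    "s1\t24.0\t1.80\t78\n"
    "s2\t31.5\t1.65\t86\n"
)

print(load_columns(phenotypes, "id", ["bmi"]))
# [{'id': 's1', 'bmi': '24.0'}, {'id': 's2', 'bmi': '31.5'}]
```

Every line is still read, but only the ID and the requested outcome are retained in memory. With pandas, the `usecols` argument of `read_csv` serves the same purpose.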
Process in Batches
When there are many exposures, process them in batches by specifying a subset per run:
[Options]
exposures GENE1;GENE2;GENE3;GENE4;GENE5
Run multiple configurations with different exposure subsets.
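A small helper can generate the exposures line for each batch. The function name and batch size are hypothetical; only the semicolon-separated format comes from the example above.

```python
def exposure_batches(exposures, batch_size):
    """Yield ';'-joined exposure lists, one per configuration file."""
    for start in range(0, len(exposures), batch_size):
        yield ";".join(exposures[start:start + batch_size])


genes = ["GENE1", "GENE2", "GENE3", "GENE4", "GENE5"]
for value in exposure_batches(genes, batch_size=2):
    print(f"exposures {value}")
# exposures GENE1;GENE2
# exposures GENE3;GENE4
# exposures GENE5
```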
Variant ID Configuration
Customize variant ID format:
[Options]
var_sep :
The default variant ID format is chr:pos:ref:alt using : as separator.
With different separators:
[Options]
var_sep _
Produces IDs like 22_16050984_A_G.
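The ID format amounts to joining the four fields with the chosen separator, as this minimal sketch shows. The function `variant_id` is hypothetical and only mirrors the format described above.

```python
def variant_id(chrom, pos, ref, alt, var_sep=":"):
    """Build a variant ID as chr<sep>pos<sep>ref<sep>alt."""
    return var_sep.join([str(chrom), str(pos), ref, alt])


print(variant_id(22, 16050984, "A", "G"))               # 22:16050984:A:G
print(variant_id(22, 16050984, "A", "G", var_sep="_"))  # 22_16050984_A_G
```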
Allele Matching
MARVELous can match variants with swapped ref/alt alleles:
[Options]
reverse True
When True, a variant 22:16050984:A:G also matches 22:16050984:G:A
in the genetic data.
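One way to picture the reverse option is as a second lookup ID with the alleles swapped. The sketch below is illustrative only; `candidate_ids` is a hypothetical helper, and MARVELous may implement the matching differently.

```python
def candidate_ids(chrom, pos, a1, a2, reverse=False, var_sep=":"):
    """IDs to look up in the genetic data for one requested variant."""
    ids = [var_sep.join([chrom, str(pos), a1, a2])]
    if reverse:
        ids.append(var_sep.join([chrom, str(pos), a2, a1]))  # swapped alleles
    return ids


print(candidate_ids("22", 16050984, "A", "G"))
# ['22:16050984:A:G']
print(candidate_ids("22", 16050984, "A", "G", reverse=True))
# ['22:16050984:A:G', '22:16050984:G:A']
```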
Programmatic Usage
MARVELous can also be used as a Python library for advanced workflows.
The PipelineConfig and MARVELousPipeline classes allow constructing
and running pipelines directly from Python code. See the marvel.pipeline module
documentation for details.
See Also
Configuration Reference - Configuration reference
marvel API - API documentation