Advanced Features

This page covers advanced MARVELous features including gene aggregation, missingness handling, caching, and performance optimization.

Genotype Processing

After variants are aggregated to gene level, MARVELous applies optional post-processing to handle special genotype values. Two parameters control this behaviour: neg_geno (negative values) and sum_geno (values greater than 1).

Negative genotype values

Negative genotypes arise when a sample’s genotype cannot be determined:

  • PLINK encodes missing/no-call genotypes as -1 (or -9 in some contexts).

  • VCF missing genotypes are already represented as NaN during reading.

  • BGEN dosage values are always non-negative, so this is not relevant.

After gene aggregation, a sample that carries a missing variant in a gene produces a negative gene-level sum (e.g. -1).

Values greater than 1

A gene-level sum greater than 1 arises when:

  • A sample carries more than one qualifying variant in the same gene.

  • A BGEN dosage is fractional and the sum across variants exceeds 1.

These multi-carrier samples can be biologically real and often informative, but some analyses require binary (carrier / non-carrier) status.

neg_geno / sum_geno reference table

Value                            Effect
True (default)                   Keep the raw value unchanged.
None                             Replace with NA (treat as missing).
integer / string (e.g. 0 or 1)   Replace all matching values with the given value.

Common use cases

Keep raw values (default)

The default True preserves every value as extracted — no configuration needed:

[Options]
neg_geno     True
sum_geno     True

Treat missing genotypes as missing data

Set neg_geno=None to convert PLINK -1 no-calls to NA:

[Options]
neg_geno     None

Treat missing genotypes as non-carriers

Set neg_geno=0 to treat no-call samples as non-carriers:

[Options]
neg_geno     0

Binary carrier status

Set sum_geno=1 to cap gene counts at 1, converting every carrier to a binary (0/1) indicator regardless of how many variants they carry:

[Options]
sum_geno     1

Note

neg_geno and sum_geno can be combined freely. For example, setting neg_geno=0 and sum_geno=1 recodes missing as non-carrier and caps counts at 1.
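The recoding rules in the reference table can be sketched in a few lines of Python. This is a minimal illustration of the documented semantics, not the MARVELous implementation: the function name recode_gene_values is hypothetical, and it assumes that "matching values" means sums below 0 for neg_geno and sums above 1 for sum_geno.

```python
import math

def recode_gene_values(values, neg_geno=True, sum_geno=True):
    """Apply neg_geno / sum_geno style recoding to gene-level sums.

    Hypothetical helper for illustration; not part of the MARVELous API.
    """
    def replacement(option, original):
        # Identity check on purpose: in Python `1 == True`, so an equality
        # test would confuse sum_geno=1 with the default True.
        if option is True:
            return original      # keep the raw value
        if option is None:
            return math.nan      # treat as missing
        return option            # replace with the given value

    recoded = []
    for value in values:
        if isinstance(value, float) and math.isnan(value):
            recoded.append(value)                          # already missing
        elif value < 0:
            recoded.append(replacement(neg_geno, value))   # no-call sums
        elif value > 1:
            recoded.append(replacement(sum_geno, value))   # multi-carrier sums
        else:
            recoded.append(value)
    return recoded

raw = [-1, 0, 1, 2, 3]
print(recode_gene_values(raw))                          # [-1, 0, 1, 2, 3]
print(recode_gene_values(raw, neg_geno=0, sum_geno=1))  # [0, 0, 1, 1, 1]
```

The second call corresponds to combining neg_geno=0 with sum_geno=1: no-calls become non-carriers and every carrier is capped at 1.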

Two-Phase Extraction

When aggregating variants by gene, MARVELous uses a two-phase extraction approach: variants from every genetic file are combined first, and per-gene totals are computed only afterwards.

Note

This documentation generally refers to variant aggregation as aggregating by gene. However, any grouping defined by cat_column is supported.

How It Works

Phase 1: Variant Extraction

  • Extract all variants from all genetic files

  • Combine variant carriers across files

  • Each variant tracked individually

Phase 2: Gene Aggregation

  • Aggregate combined variants by gene/category

  • Apply consistent aggregation logic

  • Preserve both variant-level and gene-level data

This approach ensures that variants split across multiple genetic files (e.g., different chromosomes) are correctly aggregated to the same category.
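The two phases can be sketched with plain dictionaries. The data layout, the function names combine_variants and aggregate_by_category, and the split of variants across two files are all assumptions made for illustration; the real pipeline may organise this differently.

```python
from collections import defaultdict

# Per-file carrier counts as {variant_id: {sample_id: count}}.
# The file split and the values are made up for illustration.
file_a = {"22:16050984:A:G": {"sample1": 1, "sample2": 1}}
file_b = {
    "22:16051107:C:T": {"sample1": 1, "sample2": 0},
    "22:16051249:G:A": {"sample1": 1, "sample2": 0},
}
variant_to_gene = {
    "22:16050984:A:G": "BRCA1",
    "22:16051107:C:T": "BRCA1",
    "22:16051249:G:A": "BRCA2",
}

def combine_variants(*files):
    """Phase 1: merge variant-level carriers from every file."""
    combined = {}
    for carriers in files:
        for variant, samples in carriers.items():
            combined.setdefault(variant, {}).update(samples)
    return combined

def aggregate_by_category(combined, categories):
    """Phase 2: sum variant counts per sample within each category."""
    totals = defaultdict(lambda: defaultdict(int))
    for variant, samples in combined.items():
        for sample, count in samples.items():
            totals[categories[variant]][sample] += count
    return {gene: dict(samples) for gene, samples in totals.items()}

variants = combine_variants(file_a, file_b)
genes = aggregate_by_category(variants, variant_to_gene)
print(genes["BRCA1"])  # {'sample1': 2, 'sample2': 1}
print(genes["BRCA2"])  # {'sample1': 1, 'sample2': 0}
```

Because aggregation runs only after all files are merged, the two BRCA1 variants contribute to the same total even though they come from different files.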

Configuration

Enable aggregation by setting cat_column:

[Options]
cat_column   gene
incl_var     True

The cat_column specifies which column in the variant file contains the gene/category identifier. Setting incl_var to True preserves individual variant columns alongside category aggregations.

Variant File Format

The variant file must include the aggregation column:

chr  pos       a1  a2  gene
22   16050984  A   G   BRCA1
22   16051107  C   T   BRCA1
22   16051249  G   A   BRCA2

Output

With gene aggregation enabled, the carrier file includes both:

  • Gene columns (aggregated carrier counts)

  • Variant columns (if incl_var=True)

id       BRCA1  BRCA2  22:16050984:A:G  22:16051107:C:T  22:16051249:G:A
sample1  2      1      1                1                1
sample2  1      0      1                0                0

Missingness Handling

MARVELous handles missing data through the MissingnessHandler class, which supports listwise and pairwise deletion together with a per-variable missingness threshold.

Strategies

Listwise Deletion (missingness_strategy=True)

  • Remove samples with ANY missing values across all relevant columns

  • Applied once at pipeline start

  • Most conservative approach

  • Ensures consistent sample set across all tests

Pairwise Deletion (missingness_strategy=False)

  • Remove samples with missing values only for current test

  • Applied per exposure-outcome-covariate combination

  • Maximizes sample size per test

  • Different sample sets for different tests
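The difference between the two strategies is easiest to see on a toy dataset. The sketch below uses made-up samples and column names, with None standing in for a missing value; complete_ids is a hypothetical helper, not a MissingnessHandler method.

```python
# Toy data: None marks a missing value. All names are illustrative.
data = {
    "s1": {"outcome_a": 1.0, "outcome_b": 0.0, "GENE1": 1, "age": 54},
    "s2": {"outcome_a": None, "outcome_b": 1.0, "GENE1": 0, "age": 61},
    "s3": {"outcome_a": 0.0, "outcome_b": None, "GENE1": 2, "age": 47},
    "s4": {"outcome_a": 1.0, "outcome_b": 1.0, "GENE1": 0, "age": None},
}

def complete_ids(data, columns):
    """Return the samples with no missing value in any of `columns`."""
    return [
        sample_id
        for sample_id, row in data.items()
        if all(row[column] is not None for column in columns)
    ]

# Listwise: one sample set, complete across every relevant column.
listwise = complete_ids(data, ["outcome_a", "outcome_b", "GENE1", "age"])

# Pairwise: a separate sample set per exposure-outcome-covariate test.
pairwise_a = complete_ids(data, ["outcome_a", "GENE1", "age"])
pairwise_b = complete_ids(data, ["outcome_b", "GENE1", "age"])

print(listwise)    # ['s1']
print(pairwise_a)  # ['s1', 's3']
print(pairwise_b)  # ['s1', 's2']
```

Listwise deletion keeps a single sample for both tests, while pairwise deletion keeps two samples per test but a different pair each time.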

Configuration

[Options]
missingness_strategy True
max_missingness      0.3
cov_miss_error       True

Parameters:

  • missingness_strategy: True for listwise, False for pairwise

  • max_missingness: Maximum allowed proportion of missing values (0-1)

  • cov_miss_error: Raise error if covariates exceed threshold (vs. warning)

Missingness Threshold

Variables exceeding max_missingness are handled differently by type:

  • Outcomes: Automatically removed from analysis

  • Exposures: Automatically removed from analysis

  • Covariates: Error if cov_miss_error=True, warning if False
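A rough sketch of the threshold rules follows. The helper apply_threshold and its data layout are hypothetical, and two details are assumptions rather than documented behaviour: a variable exactly at the threshold is kept, and a covariate that only triggers a warning stays in the analysis.

```python
def missing_fraction(values):
    """Proportion of entries that are missing (None)."""
    return sum(value is None for value in values) / len(values)

def apply_threshold(columns, roles, max_missingness=0.3, cov_miss_error=True):
    """Return (kept column names, warnings). Hypothetical helper."""
    kept, warnings = [], []
    for name, values in columns.items():
        fraction = missing_fraction(values)
        if fraction <= max_missingness:
            kept.append(name)
        elif roles[name] == "covariate":
            message = f"covariate {name!r}: {fraction:.0%} missing"
            if cov_miss_error:
                raise ValueError(message)
            warnings.append(message)
            kept.append(name)  # assumption: a warned covariate is still used
        # Outcomes and exposures above the threshold are dropped silently.
    return kept, warnings

columns = {
    "outcome_a": [1, None, 0, 1],     # 25% missing
    "outcome_b": [None, None, 1, 0],  # 50% missing
    "age": [50, 60, None, None],      # 50% missing
}
roles = {"outcome_a": "outcome", "outcome_b": "outcome", "age": "covariate"}

kept, warnings = apply_threshold(columns, roles, cov_miss_error=False)
print(kept)      # ['outcome_a', 'age']
print(warnings)  # ["covariate 'age': 50% missing"]

try:
    apply_threshold(columns, roles, cov_miss_error=True)
except ValueError as error:
    print("error:", error)  # error: covariate 'age': 50% missing
```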

Workflow

The missingness handling workflow:

  1. Check threshold - Filter outcomes/exposures exceeding max_missingness

  2. Compute valid IDs (listwise only) - Find samples complete across all variables

  3. Initialize DataManager - Set up with valid IDs for filtering

  4. Per-test handling - Apply pairwise deletion for NaN values in the current test

DataManager Caching

The DataManager class caches loaded columns to reduce repeated file reads and keep memory use bounded.

Cache Architecture

MARVELous uses three caches:

  1. Outcome Cache (LRU)

       • Stores recently loaded outcome columns

       • Default capacity: 20 items

       • Reduces file reads when testing multiple exposures

  2. Covariate Cache (LRU)

       • Stores recently loaded covariate sets

       • Keyed by sorted covariate list

       • Same model reuses cached covariates

  3. Exposure Cache (Single)

       • Stores current exposure being tested

       • Cleared when moving to next exposure

       • Reused across all outcomes
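For readers unfamiliar with LRU caching, the sketch below shows the general technique with hit and miss counters. The LRUCache class is a generic illustration built on collections.OrderedDict, not the DataManager code, and the loader function and column names are made up.

```python
from collections import OrderedDict

class LRUCache:
    """Generic least-recently-used cache with hit/miss counters."""

    def __init__(self, capacity=20):
        self.capacity = capacity
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, loader):
        if key in self._items:
            self.hits += 1
            self._items.move_to_end(key)     # mark as most recently used
            return self._items[key]
        self.misses += 1
        value = loader(key)                  # e.g. read one column from disk
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used
        return value

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

def load_column(name):
    """Stand-in for an expensive file read."""
    return f"<column {name}>"

outcome_cache = LRUCache(capacity=20)

# Four exposures tested against the same two outcomes.
for exposure in ["GENE1", "GENE2", "GENE3", "GENE4"]:
    for outcome in ["outcome_a", "outcome_b"]:
        outcome_cache.get(outcome, load_column)

print(outcome_cache.hits, outcome_cache.misses)  # 6 2
print(f"{outcome_cache.hit_rate:.2%}")           # 75.00%

# A covariate set can be keyed independently of the order it was listed in.
covariate_key = tuple(sorted(["sex", "age", "pc1"]))
print(covariate_key)  # ('age', 'pc1', 'sex')
```

Each outcome is read from disk once and then served from the cache for the remaining exposures, which is why the hit rate grows with the number of exposures tested.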

Cache Statistics

After association testing, cache statistics are logged:

[info] 14:49:37 > DataManager cache statistics (stratification 'overall'):
[info] 14:49:37 >   Outcomes: 6 hits, 2 misses, hit rate: 75.00%
[info] 14:49:37 >   Covariates: 0 hits, 0 misses, hit rate: 0.00%
[info] 14:49:37 >   Exposure reuses: 0

High hit rates indicate effective caching.

Configuration

Cache sizes are set in marvel/constants.py. For most analyses the defaults are sufficient. Advanced users can tune cache sizes via the marvel.pipeline module API.

Parallel Execution

MARVELous supports parallel execution for variant extraction.

Configuration

[Options]
n_jobs       -1
checkpoint_dir       ./checkpoints
force_rerun  False

Parameters:

  • n_jobs: Number of parallel jobs (-1 = all CPUs)

  • checkpoint_dir: Directory for checkpoint files

  • force_rerun: Ignore checkpoints and re-run everything

Checkpointing

Extraction tasks are checkpointed:

  1. Before starting, check if checkpoint exists

  2. If exists and force_rerun=False, load from checkpoint

  3. After completion, save checkpoint

This allows resuming interrupted extractions:

# First run (interrupted)
marvelous config.cnf -v
# Ctrl+C or failure

# Resume from checkpoints
marvelous config.cnf -v

To start fresh:

[Options]
force_rerun  True
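The check, load and save steps described in this section follow a common pattern, sketched below. The function run_with_checkpoint, the use of pickle, and the one-file-per-task naming are assumptions for illustration; the actual checkpoint format used by MARVELous is not specified here.

```python
import pickle
import tempfile
from pathlib import Path

def run_with_checkpoint(task_name, task, checkpoint_dir, force_rerun=False):
    """Run `task()` once and reuse its pickled result on later calls."""
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    path = checkpoint_dir / f"{task_name}.pkl"

    # Steps 1 and 2: reuse an existing checkpoint unless told otherwise.
    if path.exists() and not force_rerun:
        with path.open("rb") as handle:
            return pickle.load(handle)

    # Step 3: do the work, then save the result for next time.
    result = task()
    with path.open("wb") as handle:
        pickle.dump(result, handle)
    return result

calls = []

def extract_chr22():
    """Stand-in for an expensive extraction task."""
    calls.append("chr22")
    return {"22:16050984:A:G": ["sample1", "sample2"]}

with tempfile.TemporaryDirectory() as tmp:
    first = run_with_checkpoint("chr22", extract_chr22, tmp)
    second = run_with_checkpoint("chr22", extract_chr22, tmp)  # from disk
    third = run_with_checkpoint("chr22", extract_chr22, tmp, force_rerun=True)

print(len(calls))  # 2
```

The task runs on the first call and again when force_rerun=True, but not on the second call, which loads the saved result instead.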

Memory Optimization

Tips for analyzing large datasets:

Region Filtering

Extract variants from specific genomic regions:

[Options]
region       22:16000000-17000000

Format: chromosome:start-end

This filters the genetic file to only consider variants in the specified region.
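As an illustration of the chromosome:start-end format, the following sketch parses a region string and tests whether a position falls inside it. The helper names are hypothetical, and treating both bounds as inclusive is an assumption.

```python
def parse_region(region):
    """Split 'chromosome:start-end' into (chromosome, start, end)."""
    chrom, _, span = region.partition(":")
    start, _, end = span.partition("-")
    return chrom, int(start), int(end)

def in_region(chrom, pos, region):
    """True if (chrom, pos) lies inside the region (bounds assumed inclusive)."""
    region_chrom, start, end = parse_region(region)
    return chrom == region_chrom and start <= pos <= end

region = "22:16000000-17000000"
print(parse_region(region))                 # ('22', 16000000, 17000000)
print(in_region("22", 16050984, region))    # True
print(in_region("22", 17500000, region))    # False
print(in_region("21", 16050984, region))    # False
```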

Variant Pre-filtering by Region

MARVELous can pre-filter the input variants list before extraction to skip variants not present in a genetic file. This is a performance optimisation that uses the genetic file’s own index — no external BED file is needed.

Configuration

[Options]
prefilter_regions    True

Parameters:

  • prefilter_regions: Boolean (default False). When True, restricts the variants list to chromosomes and positions present in the genetic file before extraction begins.

When to use

Enable prefilter_regions when:

  • The variants list is large and many variants are absent from the genetic file.

  • You want to reduce unnecessary I/O and speed up large extractions.

Note

prefilter_regions is an optimisation flag only. It does not change which variants are extracted — only how quickly the extraction runs. Variants absent from the genetic file are skipped regardless of this setting.
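The point of the note can be shown with a toy model: filtering the variants list against the file's index reduces the number of lookups but leaves the extracted result unchanged. The data structures and the functions prefilter and extract are hypothetical stand-ins, not the MARVELous internals.

```python
# Variants are (chr, pos, a1, a2) tuples; all data here is made up.
genetic_file = {
    ("22", 16050984, "A", "G"): [1, 1],
    ("22", 16051107, "C", "T"): [1, 0],
}
requested = [
    ("22", 16050984, "A", "G"),
    ("22", 16051107, "C", "T"),
    ("21", 9411245, "G", "A"),   # chromosome not in this file
    ("22", 99999999, "T", "C"),  # position not in this file
]

def prefilter(variants, file_index):
    """Keep only variants whose (chr, pos) appears in the file's index."""
    return [variant for variant in variants if variant[:2] in file_index]

def extract(variants, genetic_file):
    """Look up every requested variant; absent ones are skipped."""
    found = {v: genetic_file[v] for v in variants if v in genetic_file}
    return found, len(variants)  # result and number of lookups attempted

file_index = {key[:2] for key in genetic_file}

full, full_lookups = extract(requested, genetic_file)
fast, fast_lookups = extract(prefilter(requested, file_index), genetic_file)

print(full == fast)                 # True
print(full_lookups, fast_lookups)   # 4 2
```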

Column-Based Loading

DataManager loads only required columns:

  • Phenotype file: ID + specific outcome

  • Covariate file: ID + specific covariates

  • Exposure file: ID + specific exposure

This avoids loading entire files into memory.
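The idea of keeping only the needed columns can be sketched with the standard csv module. This is not how DataManager is implemented; load_columns is a hypothetical helper, the tab delimiter is an assumption, and each row is still parsed in full here, with only the requested columns retained in memory.

```python
import csv
import io

def load_columns(handle, id_column, wanted, delimiter="\t"):
    """Stream a delimited file and keep only the ID and requested columns."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return {
        row[id_column]: {column: row[column] for column in wanted}
        for row in reader
    }

# An in-memory stand-in for a phenotype file with several outcomes.
phenotypes = io.StringIO(
    "id\toutcome_a\toutcome_b\toutcome_c\n"
    "sample1\t1\t0\t1\n"
    "sample2\t0\t1\t1\n"
)

loaded = load_columns(phenotypes, "id", ["outcome_a"])
print(loaded)  # {'sample1': {'outcome_a': '1'}, 'sample2': {'outcome_a': '0'}}
```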

Process in Batches

When there are many exposures, split them into batches by listing a subset in each configuration:

[Options]
exposures    GENE1;GENE2;GENE3;GENE4;GENE5

Run multiple configurations with different exposure subsets.
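A small script can generate the exposures line for each batch. The helper exposure_batches and the batch size are illustrative; only the semicolon-separated exposures format is taken from the configuration shown above.

```python
def exposure_batches(exposures, batch_size):
    """Yield consecutive slices of at most `batch_size` exposures."""
    for start in range(0, len(exposures), batch_size):
        yield exposures[start:start + batch_size]

exposures = ["GENE1", "GENE2", "GENE3", "GENE4", "GENE5"]

config_lines = [
    "exposures    " + ";".join(batch)
    for batch in exposure_batches(exposures, batch_size=2)
]
for line in config_lines:
    print(line)
# exposures    GENE1;GENE2
# exposures    GENE3;GENE4
# exposures    GENE5
```

Each printed line would go into the [Options] section of a separate configuration file.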

Variant ID Configuration

Customize variant ID format:

[Options]
var_sep      :

The default variant ID format is chr:pos:ref:alt using : as separator.

With a different separator:

[Options]
var_sep      _

This produces IDs like 22_16050984_A_G.
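The ID format amounts to joining four fields with the configured separator. The function make_variant_id below is a hypothetical illustration of that rule.

```python
def make_variant_id(chrom, pos, ref, alt, var_sep=":"):
    """Join chromosome, position and alleles with the chosen separator."""
    return var_sep.join([str(chrom), str(pos), ref, alt])

print(make_variant_id(22, 16050984, "A", "G"))               # 22:16050984:A:G
print(make_variant_id(22, 16050984, "A", "G", var_sep="_"))  # 22_16050984_A_G
```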

Allele Matching

MARVELous can match variants with swapped ref/alt alleles:

[Options]
reverse      True

When True, a variant 22:16050984:A:G also matches 22:16050984:G:A in the genetic data.
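The matching rule can be sketched as trying the ID as written first and the allele-swapped ID second. The functions candidate_ids and find_match are hypothetical, and the sketch covers only ID matching; it says nothing about how counts are oriented after a swapped match.

```python
def candidate_ids(variant_id, reverse=False, var_sep=":"):
    """Return the IDs to try: the original, plus the allele-swapped form."""
    chrom, pos, ref, alt = variant_id.split(var_sep)
    ids = [variant_id]
    if reverse and ref != alt:
        ids.append(var_sep.join([chrom, pos, alt, ref]))
    return ids

def find_match(variant_id, available, reverse=False, var_sep=":"):
    """Return the first candidate ID present in `available`, else None."""
    for candidate in candidate_ids(variant_id, reverse, var_sep):
        if candidate in available:
            return candidate
    return None

# IDs present in the genetic data (made up for illustration).
available = {"22:16050984:G:A", "22:16051107:C:T"}

print(find_match("22:16050984:A:G", available))                # None
print(find_match("22:16050984:A:G", available, reverse=True))  # 22:16050984:G:A
print(find_match("22:16051107:C:T", available, reverse=True))  # 22:16051107:C:T
```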

Programmatic Usage

MARVELous can also be used as a Python library for advanced workflows. The PipelineConfig and MARVELousPipeline classes allow constructing and running pipelines directly from Python code. See the marvel.pipeline module documentation for details.

See Also