Advanced Features
=================

This page covers advanced MARVELous features including gene aggregation, missingness handling, caching, and performance optimization.

Genotype Processing
-------------------

After variants are aggregated to gene level, MARVEL applies optional post-processing to handle special genotype values. Two parameters control this behaviour: ``neg_geno`` (negative values) and ``sum_geno`` (values greater than 1).

Negative genotype values
^^^^^^^^^^^^^^^^^^^^^^^^

Negative genotypes arise when a sample's genotype cannot be determined:

- **PLINK** encodes missing/no-call genotypes as ``-1`` (or ``-9`` in some contexts).
- **VCF** missing genotypes are already represented as ``NaN`` during reading.
- **BGEN** dosage values are always non-negative, so this is not relevant.

After gene aggregation, a sample that carries a missing variant in a gene will produce a negative gene-level sum (e.g. ``-1``).

Values greater than 1
^^^^^^^^^^^^^^^^^^^^^

A gene-level sum greater than 1 arises when:

- A sample carries more than one qualifying variant in the same gene.
- A BGEN dosage is fractional and the sum across variants exceeds 1.

These multi-carrier samples can be biologically real and are often informative, but some analyses require binary (carrier / non-carrier) status.

``neg_geno`` / ``sum_geno`` reference table
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Value
     - Effect
   * - ``True`` *(default)*
     - Keep the raw value unchanged.
   * - ``None``
     - Replace with ``NA`` (treat as missing).
   * - integer / string (e.g. ``0`` or ``1``)
     - Replace all matching values with the given value.

Common use cases
^^^^^^^^^^^^^^^^

**Keep raw values (default)**

The default ``True`` preserves every value as extracted, so no configuration is needed:

.. code-block:: ini

   [Options]
   neg_geno True
   sum_geno True

**Treat missing genotypes as missing data**

Set ``neg_geno=None`` to convert PLINK ``-1`` no-calls to ``NA``:

.. code-block:: ini

   [Options]
   neg_geno None

**Treat missing genotypes as non-carriers**

Set ``neg_geno=0`` to treat no-call samples as non-carriers:

.. code-block:: ini

   [Options]
   neg_geno 0

**Binary carrier status**

Set ``sum_geno=1`` to cap gene counts at 1, converting every carrier to a binary (0/1) indicator regardless of how many variants they carry:

.. code-block:: ini

   [Options]
   sum_geno 1

.. note::

   ``neg_geno`` and ``sum_geno`` can be combined freely. For example, setting ``neg_geno=0`` and ``sum_geno=1`` recodes missing as non-carrier and caps counts at 1.

Two-Phase Extraction
--------------------

When aggregating variants by gene, MARVELous uses a two-phase extraction approach so that variants from different genetic files are aggregated together correctly.

.. note::

   Aggregating variants is generally referred to as aggregating by gene in this documentation. However, any type of aggregation specified by the ``cat_column`` is supported.

How It Works
^^^^^^^^^^^^

**Phase 1: Variant Extraction**

- Extract all variants from all genetic files
- Combine variant carriers across files
- Track each variant individually

**Phase 2: Gene Aggregation**

- Aggregate combined variants by gene/category
- Apply consistent aggregation logic
- Preserve both variant-level and gene-level data

This approach ensures that variants split across multiple genetic files (e.g., different chromosomes) are correctly aggregated to the same category.

Configuration
^^^^^^^^^^^^^

Enable aggregation by setting ``cat_column``:

.. code-block:: ini

   [Options]
   cat_column gene
   incl_var True

The ``cat_column`` option specifies which column in the variant file contains the gene/category identifier. Setting ``incl_var`` to ``True`` preserves individual variant columns alongside category aggregations.

Variant File Format
^^^^^^^^^^^^^^^^^^^

The variant file must include the aggregation column:

.. code-block:: text

   chr  pos       a1  a2  gene
   22   16050984  A   G   BRCA1
   22   16051107  C   T   BRCA1
   22   16051249  G   A   BRCA2

Output
^^^^^^

With gene aggregation enabled, the carrier file includes both:

- Gene columns (aggregated carrier counts)
- Variant columns (if ``incl_var=True``)

.. code-block:: text

   id       BRCA1  BRCA2  22:16050984:A:G  22:16051107:C:T  22:16051249:G:A
   sample1  2      1      1                1                1
   sample2  1      0      1                0                0

Missingness Handling
--------------------

MARVELous handles missing data through the ``MissingnessHandler`` class.

Strategies
^^^^^^^^^^

**Listwise Deletion** (``missingness_strategy=True``)

- Remove samples with ANY missing values across all relevant columns
- Applied once at pipeline start
- Most conservative approach
- Ensures consistent sample set across all tests

**Pairwise Deletion** (``missingness_strategy=False``)

- Remove samples with missing values only for current test
- Applied per exposure-outcome-covariate combination
- Maximizes sample size per test
- Different sample sets for different tests

Configuration
^^^^^^^^^^^^^

.. code-block:: ini

   [Options]
   missingness_strategy True
   max_missingness 0.3
   cov_miss_error True

Parameters:

- ``missingness_strategy``: ``True`` for listwise, ``False`` for pairwise
- ``max_missingness``: Maximum allowed proportion of missing values (0-1)
- ``cov_miss_error``: Raise an error if covariates exceed the threshold (instead of a warning)

Missingness Threshold
^^^^^^^^^^^^^^^^^^^^^

Variables exceeding ``max_missingness`` are handled differently by type:

- **Outcomes**: Automatically removed from analysis
- **Exposures**: Automatically removed from analysis
- **Covariates**: Error if ``cov_miss_error=True``, warning if ``False``

Workflow
^^^^^^^^

The missingness handling workflow:

1. **Check threshold** - Filter outcomes/exposures exceeding ``max_missingness``
2. **Compute valid IDs** (listwise only) - Find samples complete across all variables
3. **Initialize DataManager** - Set up with valid IDs for filtering
4. **Per-test handling** - Apply pairwise deletion for NaN in specific test

DataManager Caching
-------------------

The ``DataManager`` class caches loaded data to reduce memory use and repeated file reads.

Cache Architecture
^^^^^^^^^^^^^^^^^^

MARVELous uses three caches:

1. **Outcome Cache** (LRU)

   - Stores recently loaded outcome columns
   - Default capacity: 20 items
   - Reduces file reads when testing multiple exposures

2. **Covariate Cache** (LRU)

   - Stores recently loaded covariate sets
   - Keyed by sorted covariate list
   - Same model reuses cached covariates

3. **Exposure Cache** (Single)

   - Stores current exposure being tested
   - Cleared when moving to next exposure
   - Reused across all outcomes

Cache Statistics
^^^^^^^^^^^^^^^^

After association testing, cache statistics are logged:

.. code-block:: text

   [info] 14:49:37 > DataManager cache statistics (stratification 'overall'):
   [info] 14:49:37 > Outcomes: 6 hits, 2 misses, hit rate: 75.00%
   [info] 14:49:37 > Covariates: 0 hits, 0 misses, hit rate: 0.00%
   [info] 14:49:37 > Exposure reuses: 0

High hit rates indicate effective caching.

Configuration
^^^^^^^^^^^^^

Cache sizes are set in ``marvel/constants.py``. For most analyses the defaults are sufficient. Advanced users can tune cache sizes via the :doc:`api/pipeline` API.

Parallel Execution
------------------

MARVELous supports parallel execution for variant extraction.

Configuration
^^^^^^^^^^^^^

.. code-block:: ini

   [Options]
   n_jobs -1
   checkpoint_dir ./checkpoints
   force_rerun False

Parameters:

- ``n_jobs``: Number of parallel jobs (``-1`` = all CPUs)
- ``checkpoint_dir``: Directory for checkpoint files
- ``force_rerun``: Ignore checkpoints and re-run everything

Checkpointing
^^^^^^^^^^^^^

Extraction tasks are checkpointed:

1. Before starting, check if a checkpoint exists
2. If it exists and ``force_rerun=False``, load from the checkpoint
3. After completion, save a checkpoint

This allows resuming interrupted extractions:

.. code-block:: bash

   # First run (interrupted)
   marvelous config.cnf -v
   # Ctrl+C or failure

   # Resume from checkpoints
   marvelous config.cnf -v

To start fresh:

.. code-block:: ini

   [Options]
   force_rerun True

Memory Optimization
-------------------

Tips for analyzing large datasets:

Region Filtering
^^^^^^^^^^^^^^^^

Extract variants from specific genomic regions:

.. code-block:: ini

   [Options]
   region 22:16000000-17000000

Format: ``chromosome:start-end``

This filters the genetic file to only consider variants in the specified region.

Variant Pre-filtering by Region
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

MARVELous can pre-filter the input variants list before extraction to skip variants not present in a genetic file. This is a performance optimisation that uses the genetic file's own index, so no external BED file is needed.

Configuration
"""""""""""""

.. code-block:: ini

   [Options]
   prefilter_regions True

Parameters:

- ``prefilter_regions``: Boolean (default ``False``). When ``True``, restricts the variants list to chromosomes and positions present in the genetic file before extraction begins.

When to use
"""""""""""

Enable ``prefilter_regions`` when:

- The variants list is large and many variants are absent from the genetic file.
- You want to reduce unnecessary I/O and speed up large extractions.

.. note::

   ``prefilter_regions`` is an optimisation flag only. It does not change which variants are extracted, only how quickly the extraction runs. Variants absent from the genetic file are skipped regardless of this setting.

Column-Based Loading
^^^^^^^^^^^^^^^^^^^^

DataManager loads only required columns:

- Phenotype file: ID + specific outcome
- Covariate file: ID + specific covariates
- Exposure file: ID + specific exposure

This avoids loading entire files into memory.

Process in Batches
^^^^^^^^^^^^^^^^^^

When there are very many exposures, process them in batches by specifying subsets:

.. code-block:: ini

   [Options]
   exposures GENE1;GENE2;GENE3;GENE4;GENE5

Run multiple configurations with different exposure subsets.

Variant ID Configuration
------------------------

Customize variant ID format:

.. code-block:: ini

   [Options]
   var_sep :

The default variant ID format is ``chr:pos:ref:alt`` using ``:`` as separator. With a different separator:

.. code-block:: ini

   [Options]
   var_sep _

Produces IDs like ``22_16050984_A_G``.

Allele Matching
---------------

MARVELous can match variants with swapped ref/alt alleles:

.. code-block:: ini

   [Options]
   reverse True

When ``True``, a variant ``22:16050984:A:G`` also matches ``22:16050984:G:A`` in the genetic data.

Programmatic Usage
^^^^^^^^^^^^^^^^^^

MARVELous can also be used as a Python library for advanced workflows. The ``PipelineConfig`` and ``MARVELousPipeline`` classes allow constructing and running pipelines directly from Python code. See the :doc:`api/pipeline` documentation for details.

See Also
--------

- :doc:`configuration` - Configuration reference
- :doc:`api` - API documentation
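
For readers who want to see the ``neg_geno`` / ``sum_geno`` rules from the Genotype Processing section in executable form, the following is a minimal sketch of how the reference table could be applied to a single gene-level value. It is not the MARVELous implementation: the function name ``recode_value`` is hypothetical, and the sketch assumes that values between 0 and 1 are never recoded and that ``NA`` is represented as ``NaN``.

.. code-block:: python

   import math


   def recode_value(value, neg_geno=True, sum_geno=True):
       """Recode one gene-level value (hypothetical helper, not the MARVELous API)."""
       if isinstance(value, float) and math.isnan(value):
           return value  # already missing, nothing to recode
       if value < 0:
           rule = neg_geno  # negative sums follow neg_geno
       elif value > 1:
           rule = sum_geno  # sums above 1 follow sum_geno
       else:
           return value  # assumption: values in [0, 1] are left untouched
       if rule is True:
           return value  # True (default): keep the raw value
       if rule is None:
           return math.nan  # None: treat as missing
       return rule  # any other setting: replace with the given value


   gene_sums = [0, 1, 2, -1, 3]
   recoded = [recode_value(v, neg_geno=0, sum_geno=1) for v in gene_sums]
   print(recoded)  # [0, 1, 1, 0, 1]

With ``neg_geno=0`` and ``sum_geno=1``, the no-call (``-1``) becomes a non-carrier and the multi-carrier sums are capped at 1, matching the combined setting described in the note on combining both options.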