Advanced Features
=================

This page covers advanced MARVELous features including gene aggregation,
missingness handling, caching, and performance optimization.

Genotype Processing
-------------------

After variants are aggregated to gene level, MARVELous applies optional post-processing
to handle special genotype values. Two parameters control this behaviour: ``neg_geno``
(negative values) and ``sum_geno`` (values greater than 1).

Negative genotype values
^^^^^^^^^^^^^^^^^^^^^^^^

Negative genotypes arise when a sample's genotype cannot be determined:

- **PLINK** encodes missing/no-call genotypes as ``-1`` (or ``-9`` in some contexts).
- **VCF** missing genotypes are already represented as ``NaN`` during reading.
- **BGEN** dosage values are always non-negative, so this is not relevant.

After gene aggregation, a sample whose genotype is missing for a variant in a
gene will produce a negative gene-level sum (e.g. ``-1``).

Values greater than 1
^^^^^^^^^^^^^^^^^^^^^

A gene-level sum greater than 1 arises when:

- A sample carries more than one qualifying variant in the same gene.
- A BGEN dosage is fractional and the sum across variants exceeds 1.

These multi-carrier samples can be biologically real and often informative,
but some analyses require binary (carrier / non-carrier) status.

``neg_geno`` / ``sum_geno`` reference table
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Value
     - Effect
   * - ``True`` *(default)*
     - Keep the raw value unchanged.
   * - ``None``
     - Replace with ``NA`` (treat as missing).
   * - integer / string (e.g. ``0`` or ``1``)
     - Replace all matching values with the given value.
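
The table can be read as a small recoding rule applied to each gene-level value.
The sketch below is illustrative only: ``recode_gene_values`` and
``_apply_option`` are hypothetical helpers, not part of the MARVELous API, and
Python's ``None`` stands in for ``NA``.

.. code-block:: python

   def _apply_option(value, option):
       """Apply one neg_geno / sum_geno setting to a single value."""
       if option is True:   # default: keep the raw value
           return value
       if option is None:   # treat as missing
           return None
       return option        # replace with the configured value


   def recode_gene_values(values, neg_geno=True, sum_geno=True):
       """Recode negative values via neg_geno and values > 1 via sum_geno."""
       recoded = []
       for value in values:
           if value is None:
               recoded.append(None)
           elif value < 0:
               recoded.append(_apply_option(value, neg_geno))
           elif value > 1:
               recoded.append(_apply_option(value, sum_geno))
           else:
               recoded.append(value)
       return recoded


   print(recode_gene_values([-1, 0, 1, 2], neg_geno=0, sum_geno=1))  # [0, 0, 1, 1]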

Common use cases
^^^^^^^^^^^^^^^^

**Keep raw values (default)**

The default ``True`` preserves every value as extracted — no configuration needed:

.. code-block:: ini

   [Options]
   neg_geno	True
   sum_geno	True

**Treat missing genotypes as missing data**

Set ``neg_geno=None`` to convert PLINK ``-1`` no-calls to ``NA``:

.. code-block:: ini

   [Options]
   neg_geno	None

**Treat missing genotypes as non-carriers**

Set ``neg_geno=0`` to treat no-call samples as non-carriers:

.. code-block:: ini

   [Options]
   neg_geno	0

**Binary carrier status**

Set ``sum_geno=1`` to cap gene counts at 1, converting every carrier to a
binary (0/1) indicator regardless of how many variants they carry:

.. code-block:: ini

   [Options]
   sum_geno	1

.. note::

   ``neg_geno`` and ``sum_geno`` can be combined freely. For example,
   setting ``neg_geno=0`` and ``sum_geno=1`` recodes missing as non-carrier
   and caps counts at 1.


Two-Phase Extraction
--------------------

When aggregating variants by gene, MARVELous uses a two-phase extraction
approach: variants are first extracted and combined across all genetic files,
and only then aggregated by gene.

.. note::

    Aggregating variants is generally referred to as aggregating by gene in
    this documentation. However, any type of aggregation specified by the
    ``cat_column`` is supported.


How It Works
^^^^^^^^^^^^

**Phase 1: Variant Extraction**

- Extract all variants from all genetic files
- Combine variant carriers across files
- Each variant tracked individually

**Phase 2: Gene Aggregation**

- Aggregate combined variants by gene/category
- Apply consistent aggregation logic
- Preserve both variant-level and gene-level data

This approach ensures that variants split across multiple genetic files
(e.g., different chromosomes) are correctly aggregated to the same category.
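
The two phases can be sketched in plain Python. This is a minimal illustration
under assumed data structures, not the MARVELous implementation:
``combine_files`` and ``aggregate_by_category`` are hypothetical names, and
each genetic file is represented as a ``{variant_id: {sample: count}}`` mapping.

.. code-block:: python

   from collections import defaultdict


   def combine_files(per_file_carriers):
       """Phase 1: merge {variant_id: {sample: count}} mappings from all files."""
       combined = {}
       for carriers in per_file_carriers:
           for variant_id, samples in carriers.items():
               combined.setdefault(variant_id, {}).update(samples)
       return combined


   def aggregate_by_category(combined, variant_to_category):
       """Phase 2: sum per-variant counts into per-category counts."""
       totals = defaultdict(lambda: defaultdict(int))
       for variant_id, samples in combined.items():
           category = variant_to_category[variant_id]
           for sample, count in samples.items():
               totals[category][sample] += count
       return {category: dict(samples) for category, samples in totals.items()}


   # Two hypothetical genetic files, each holding part of the variants.
   file_a = {"22:16050984:A:G": {"sample1": 1, "sample2": 1}}
   file_b = {
       "22:16051107:C:T": {"sample1": 1, "sample2": 0},
       "22:16051249:G:A": {"sample1": 1, "sample2": 0},
   }
   variant_to_category = {
       "22:16050984:A:G": "BRCA1",
       "22:16051107:C:T": "BRCA1",
       "22:16051249:G:A": "BRCA2",
   }

   variants = combine_files([file_a, file_b])
   genes = aggregate_by_category(variants, variant_to_category)

Because aggregation runs only after all files are combined, the ``BRCA1``
count for ``sample1`` includes variants from both files.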

Configuration
^^^^^^^^^^^^^

Enable aggregation by setting ``cat_column``:

.. code-block:: ini

   [Options]
   cat_column	gene
   incl_var	True

The ``cat_column`` specifies which column in the variant file contains the
gene/category identifier. Setting ``incl_var`` to ``True`` preserves individual
variant columns alongside category aggregations.

Variant File Format
^^^^^^^^^^^^^^^^^^^

The variant file must include the aggregation column:

.. code-block:: text

   chr	pos	a1	a2	gene
   22	16050984	A	G	BRCA1
   22	16051107	C	T	BRCA1
   22	16051249	G	A	BRCA2

Output
^^^^^^

With gene aggregation enabled, the carrier file includes both:

- Gene columns (aggregated carrier counts)
- Variant columns (if ``incl_var=True``)

.. code-block:: text

   id	BRCA1	BRCA2	22:16050984:A:G	22:16051107:C:T	22:16051249:G:A
   sample1	2	1	1	1	1
   sample2	1	0	1	0	0


Missingness Handling
--------------------

MARVELous handles missing data through the ``MissingnessHandler`` class, which
supports listwise and pairwise deletion and a configurable missingness threshold.

Strategies
^^^^^^^^^^

**Listwise Deletion** (``missingness_strategy=True``)

- Remove samples with ANY missing values across all relevant columns
- Applied once at pipeline start
- Most conservative approach
- Ensures consistent sample set across all tests

**Pairwise Deletion** (``missingness_strategy=False``)

- Remove samples with missing values only for current test
- Applied per exposure-outcome-covariate combination
- Maximizes sample size per test
- Different sample sets for different tests
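
The difference between the two strategies can be shown on a toy dataset. This
is a sketch, not the ``MissingnessHandler`` code: ``complete_ids`` is a
hypothetical helper, the column names are invented, and ``None`` marks a
missing value.

.. code-block:: python

   def complete_ids(rows, columns):
       """Return IDs of samples with no missing (None) value in `columns`."""
       return {
           sample_id
           for sample_id, row in rows.items()
           if all(row[col] is not None for col in columns)
       }


   rows = {
       "s1": {"bmi": 24.1, "ldl": 3.2, "age": 51},
       "s2": {"bmi": None, "ldl": 2.9, "age": 47},
       "s3": {"bmi": 27.5, "ldl": None, "age": 63},
   }

   # Listwise: one sample set, complete across every relevant column.
   listwise = complete_ids(rows, ["bmi", "ldl", "age"])

   # Pairwise: a separate sample set per test, using only that test's columns.
   pairwise_bmi = complete_ids(rows, ["bmi", "age"])
   pairwise_ldl = complete_ids(rows, ["ldl", "age"])

Here listwise deletion keeps only ``s1`` for every test, while pairwise
deletion keeps two samples for each of the two tests.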

Configuration
^^^^^^^^^^^^^

.. code-block:: ini

   [Options]
   missingness_strategy	True
   max_missingness	0.3
   cov_miss_error	True

Parameters:

- ``missingness_strategy``: ``True`` for listwise, ``False`` for pairwise
- ``max_missingness``: Maximum allowed proportion of missing values (0-1)
- ``cov_miss_error``: Raise error if covariates exceed threshold (vs. warning)

Missingness Threshold
^^^^^^^^^^^^^^^^^^^^^

Variables exceeding ``max_missingness`` are handled differently by type:

- **Outcomes**: Automatically removed from analysis
- **Exposures**: Automatically removed from analysis
- **Covariates**: Error if ``cov_miss_error=True``, warning if ``False``
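
A hedged sketch of the threshold rule follows. ``filter_by_missingness`` and
the ``roles`` mapping are hypothetical, and keeping a covariate after a warning
is an assumption; the actual behaviour is defined by ``MissingnessHandler``.

.. code-block:: python

   import warnings


   def missing_fraction(values):
       """Proportion of missing (None) entries in a column."""
       return sum(v is None for v in values) / len(values)


   def filter_by_missingness(columns, roles, max_missingness=0.3,
                             cov_miss_error=True):
       """Drop outcomes/exposures above the threshold; error or warn for covariates."""
       kept = []
       for name, values in columns.items():
           if missing_fraction(values) <= max_missingness:
               kept.append(name)
           elif roles[name] == "covariate":
               message = f"covariate {name!r} exceeds max_missingness"
               if cov_miss_error:
                   raise ValueError(message)
               warnings.warn(message)
               kept.append(name)  # assumption: covariate is kept after warning
           # outcomes and exposures above the threshold are dropped
       return kept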


Workflow
^^^^^^^^

The missingness handling workflow:

1. **Check threshold** - Filter outcomes/exposures exceeding ``max_missingness``
2. **Compute valid IDs** (listwise only) - Find samples complete across all variables
3. **Initialize DataManager** - Set up with valid IDs for filtering
4. **Per-test handling** - Apply pairwise deletion for NaN in specific test


DataManager Caching
-------------------

The ``DataManager`` class provides intelligent caching to optimize memory
and I/O performance.

Cache Architecture
^^^^^^^^^^^^^^^^^^

MARVELous uses three caches:

1. **Outcome Cache** (LRU)

   - Stores recently loaded outcome columns
   - Default capacity: 20 items
   - Reduces file reads when testing multiple exposures

2. **Covariate Cache** (LRU)

   - Stores recently loaded covariate sets
   - Keyed by sorted covariate list
   - Same model reuses cached covariates

3. **Exposure Cache** (Single)

   - Stores current exposure being tested
   - Cleared when moving to next exposure
   - Reused across all outcomes

Cache Statistics
^^^^^^^^^^^^^^^^

After association testing, cache statistics are logged:

.. code-block:: text

    [info] 14:49:37 > DataManager cache statistics (stratification 'overall'):
    [info] 14:49:37 >   Outcomes: 6 hits, 2 misses, hit rate: 75.00%
    [info] 14:49:37 >   Covariates: 0 hits, 0 misses, hit rate: 0.00%
    [info] 14:49:37 >   Exposure reuses: 0

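The hit rate is ``hits / (hits + misses)``; in the example above, 6 hits and
2 misses give 75%. A high hit rate means most requests were served from the
cache instead of being re-read from disk.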
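
For readers unfamiliar with LRU caching, the following is a generic sketch of
an LRU cache with hit and miss counters. It is not the ``DataManager`` code:
the ``LRUCache`` class, its ``get`` signature, and the ``load_outcome``
function are assumptions made for illustration.

.. code-block:: python

   from collections import OrderedDict


   class LRUCache:
       """Least-recently-used cache that counts hits and misses."""

       def __init__(self, capacity=20):
           self.capacity = capacity
           self._items = OrderedDict()
           self.hits = 0
           self.misses = 0

       def get(self, key, loader):
           """Return the cached value, calling loader(key) on a miss."""
           if key in self._items:
               self.hits += 1
               self._items.move_to_end(key)  # mark as most recently used
               return self._items[key]
           self.misses += 1
           value = loader(key)
           self._items[key] = value
           if len(self._items) > self.capacity:
               self._items.popitem(last=False)  # evict least recently used
           return value

       @property
       def hit_rate(self):
           total = self.hits + self.misses
           return self.hits / total if total else 0.0


   def load_outcome(name):
       """Hypothetical stand-in for reading one outcome column from disk."""
       return f"column:{name}"


   cache = LRUCache(capacity=2)
   for name in ["ldl", "hdl", "ldl", "ldl", "hdl", "ldl", "hdl", "ldl"]:
       cache.get(name, load_outcome)

Two distinct outcomes requested eight times give 2 misses and 6 hits, the same
75% hit rate as in the log excerpt.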

Configuration
^^^^^^^^^^^^^

Cache sizes are set in ``marvel/constants.py``. For most analyses the
defaults are sufficient. Advanced users can tune cache sizes via the
:doc:`api/pipeline` API.

Parallel Execution
------------------

MARVELous supports parallel execution for variant extraction.

Configuration
^^^^^^^^^^^^^

.. code-block:: ini

   [Options]
   n_jobs	-1
   checkpoint_dir	./checkpoints
   force_rerun	False

Parameters:

- ``n_jobs``: Number of parallel jobs (``-1`` = all CPUs)
- ``checkpoint_dir``: Directory for checkpoint files
- ``force_rerun``: Ignore checkpoints and re-run everything

Checkpointing
^^^^^^^^^^^^^

Extraction tasks are checkpointed:

1. Before starting, check whether a checkpoint exists
2. If it exists and ``force_rerun=False``, load the result from the checkpoint
3. After completion, save a checkpoint

This allows resuming interrupted extractions:

.. code-block:: bash

   # First run (interrupted)
   marvelous config.cnf -v
   # Ctrl+C or failure

   # Resume from checkpoints
   marvelous config.cnf -v

To start fresh:

.. code-block:: ini

   [Options]
   force_rerun	True
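
The checkpoint logic follows a common load-or-compute pattern, sketched below.
``run_with_checkpoint`` is a hypothetical helper, and the use of ``pickle``
and one ``.pkl`` file per task are assumptions; the on-disk format MARVELous
actually uses is not described here.

.. code-block:: python

   import pickle
   from pathlib import Path


   def run_with_checkpoint(task_name, task, checkpoint_dir, force_rerun=False):
       """Load a saved result if present, otherwise run the task and save it."""
       checkpoint = Path(checkpoint_dir) / f"{task_name}.pkl"
       if checkpoint.exists() and not force_rerun:
           with checkpoint.open("rb") as handle:
               return pickle.load(handle)
       result = task()
       checkpoint.parent.mkdir(parents=True, exist_ok=True)
       with checkpoint.open("wb") as handle:
           pickle.dump(result, handle)
       return result

A second call with the same ``task_name`` returns the saved result without
calling ``task`` again, unless ``force_rerun=True``.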


Memory Optimization
-------------------

Tips for analyzing large datasets:

Region Filtering
^^^^^^^^^^^^^^^^

Extract variants from specific genomic regions:

.. code-block:: ini

   [Options]
   region	22:16000000-17000000

Format: ``chromosome:start-end``

This filters the genetic file to only consider variants in the specified region.
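
As an illustration of the ``chromosome:start-end`` format, a region string can
be parsed and applied as follows. ``parse_region`` and ``in_region`` are
hypothetical helpers, and treating both bounds as inclusive is an assumption.

.. code-block:: python

   def parse_region(region):
       """Split 'chromosome:start-end' into (chromosome, start, end)."""
       chrom, _, span = region.partition(":")
       start, _, end = span.partition("-")
       return chrom, int(start), int(end)


   def in_region(chrom, pos, region):
       """Return True if (chrom, pos) falls inside the region (bounds inclusive)."""
       region_chrom, start, end = parse_region(region)
       return str(chrom) == region_chrom and start <= pos <= end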

Variant Pre-filtering by Region
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

MARVELous can pre-filter the input variants list before extraction to skip
variants not present in a genetic file. This is a performance optimisation that
uses the genetic file's own index — no external BED file is needed.

Configuration
"""""""""""""

.. code-block:: ini

   [Options]
   prefilter_regions	True

Parameters:

- ``prefilter_regions``: Boolean (default ``False``). When ``True``, restricts
  the variants list to chromosomes and positions present in the genetic file
  before extraction begins.

When to use
"""""""""""

Enable ``prefilter_regions`` when:

- The variants list is large and many variants are absent from the genetic file.
- You want to reduce unnecessary I/O and speed up large extractions.

.. note::

   ``prefilter_regions`` is an optimisation flag only. It does not change which
   variants are extracted — only how quickly the extraction runs. Variants absent
   from the genetic file are skipped regardless of this setting.


Column-Based Loading
^^^^^^^^^^^^^^^^^^^^

``DataManager`` loads only the required columns:

- Phenotype file: ID + specific outcome
- Covariate file: ID + specific covariates
- Exposure file: ID + specific exposure

This avoids loading entire files into memory.
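
The idea of keeping only the needed columns can be sketched with the standard
library. ``load_columns`` is a hypothetical helper, and the tab delimiter and
column names are assumptions; this is not how ``DataManager`` is implemented.

.. code-block:: python

   import csv
   import io


   def load_columns(handle, columns, delimiter="\t"):
       """Read a delimited file but keep only the requested columns."""
       reader = csv.DictReader(handle, delimiter=delimiter)
       return [{col: row[col] for col in columns} for row in reader]


   # Toy phenotype file with three outcomes; only one is needed for the test.
   phenotypes = io.StringIO(
       "id\tldl\thdl\tbmi\n"
       "sample1\t3.2\t1.1\t24.1\n"
       "sample2\t2.9\t1.4\t27.5\n"
   )
   rows = load_columns(phenotypes, ["id", "ldl"])

Values are kept as strings here; type conversion is left out for brevity.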

Process in Batches
^^^^^^^^^^^^^^^^^^

When there are many exposures, process them in batches by specifying subsets:

.. code-block:: ini

   [Options]
   exposures	GENE1;GENE2;GENE3;GENE4;GENE5

Run multiple configurations with different exposure subsets.
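
The semicolon-separated ``exposures`` values for each configuration can be
generated with a small script. ``exposure_batches`` is a hypothetical helper;
writing the resulting values into separate configuration files is left to the
user.

.. code-block:: python

   def exposure_batches(exposures, batch_size):
       """Yield ';'-joined exposure subsets of at most batch_size names."""
       for start in range(0, len(exposures), batch_size):
           yield ";".join(exposures[start:start + batch_size])


   all_exposures = ["GENE1", "GENE2", "GENE3", "GENE4", "GENE5"]
   batches = list(exposure_batches(all_exposures, batch_size=2))
   # batches == ["GENE1;GENE2", "GENE3;GENE4", "GENE5"]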

Variant ID Configuration
------------------------

Customize variant ID format:

.. code-block:: ini

   [Options]
   var_sep	:

The default variant ID format is ``chr:pos:ref:alt``, with ``:`` as the separator.

With a different separator:

.. code-block:: ini

   [Options]
   var_sep	_

Produces IDs like ``22_16050984_A_G``.
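
The effect of ``var_sep`` amounts to joining the four fields with the chosen
separator. ``make_variant_id`` below is a hypothetical helper written for
illustration, not a MARVELous function.

.. code-block:: python

   def make_variant_id(chrom, pos, ref, alt, var_sep=":"):
       """Join chromosome, position, and alleles with the given separator."""
       return var_sep.join([str(chrom), str(pos), ref, alt])


   print(make_variant_id(22, 16050984, "A", "G"))               # 22:16050984:A:G
   print(make_variant_id(22, 16050984, "A", "G", var_sep="_"))  # 22_16050984_A_G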


Allele Matching
---------------

MARVELous can match variants with swapped ref/alt alleles:

.. code-block:: ini

   [Options]
   reverse	True

When ``True``, a variant ``22:16050984:A:G`` also matches ``22:16050984:G:A``
in the genetic data.
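
Conceptually, ``reverse`` adds the allele-swapped ID to the set of IDs that are
looked up. The sketch below uses a hypothetical ``candidate_ids`` helper and
covers ID matching only; how genotype counts are treated for a swapped match
is not addressed here.

.. code-block:: python

   def candidate_ids(variant_id, reverse=False, var_sep=":"):
       """Return the IDs to look up for one variant, optionally allele-swapped."""
       chrom, pos, ref, alt = variant_id.split(var_sep)
       ids = [variant_id]
       if reverse:
           ids.append(var_sep.join([chrom, pos, alt, ref]))
       return ids


   print(candidate_ids("22:16050984:A:G", reverse=True))
   # ['22:16050984:A:G', '22:16050984:G:A']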


Programmatic Usage
------------------

MARVELous can also be used as a Python library for advanced workflows.
The ``PipelineConfig`` and ``MARVELousPipeline`` classes allow constructing
and running pipelines directly from Python code. See the :doc:`api/pipeline`
documentation for details.


See Also
--------

- :doc:`configuration` - Configuration reference
- :doc:`api` - API documentation