marvel.extraction package

marvel.extraction.core module

Core classes and helpers shared by several other modules

class marvel.extraction.core.AlleleComparator

Bases: object

Compare variant alleles

static compare(variant: Variant, alleles: List[str], ref: None | str = None, warning: bool = False) bool

Compare variant alleles with expected alleles

Parameters:
  • variant (Variant) – Genetic variant to compare

  • alleles (list[str]) – Expected alleles

  • ref (str, optional) – Reference allele for explicit matching

  • warning (bool) – Whether to warn on mismatch

Returns:

True if alleles match

Return type:

bool

class marvel.extraction.core.AlleleInverter(var_sep: str = ':')

Bases: object

Handles allele inversion operations

add_inversed_column(df: DataFrame, var_column: str = 'ID') DataFrame

Add column with inversed alleles to dataframe.

Parameters:
  • df (pd.DataFrame) – Dataframe with variant IDs

  • var_column (str) – Column name containing variant IDs

Returns:

Dataframe with added inversed allele column

Return type:

pd.DataFrame

reorder_alleles(variant_id: str) str

Inverse the order of alleles in a variant ID.

Parameters:

variant_id (str) – Variant ID string (e.g., ‘chr1:12345:A:T’)

Returns:

Variant ID with inversed alleles (e.g., ‘chr1:12345:T:A’)

Return type:

str
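
As an illustration of the ID format above, the swap can be sketched in plain Python. This is a minimal sketch rather than the class's implementation: the function name reorder_alleles_sketch and the error raised for IDs with fewer than four fields are assumptions.

```python
def reorder_alleles_sketch(variant_id: str, var_sep: str = ":") -> str:
    """Swap the last two separator-delimited fields (ref and alt) of a variant ID."""
    parts = variant_id.split(var_sep)
    if len(parts) < 4:
        # Assumed behaviour: the documentation does not say how malformed IDs are handled.
        raise ValueError(f"expected at least four fields, got {variant_id!r}")
    # Keep everything before the alleles, then swap ref and alt.
    *prefix, ref, alt = parts
    return var_sep.join([*prefix, alt, ref])


print(reorder_alleles_sketch("chr1:12345:A:T"))  # chr1:12345:T:A
```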

class marvel.extraction.core.GenotypeCounter(id_column: str = 'id', count_na: bool = True)

Bases: object

Class for counting and summarizing genotypes across samples

count(df: DataFrame, method: str = 'any') DataFrame

Count genotypes across samples for each variant/gene

Parameters:
  • df (pd.DataFrame) – Dataframe with samples as rows and variants/genes as columns. Index should contain sample IDs.

  • method ({'any', 'sum'}) –

    Counting method:

    • ‘any’: Count samples with any non-zero genotype (carriers)

    • ‘sum’: Sum all genotype values across samples

Returns:

Summary dataframe with counts for each variant/gene

Return type:

pd.DataFrame

Raises:
  • ValueError – If method is not ‘any’ or ‘sum’

  • TypeError – If df is not a DataFrame
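
The difference between the two counting methods can be shown on toy data. The sketch below uses plain Python lists instead of a DataFrame to stay dependency-free; count_sketch, the variant IDs, and the choice to skip missing calls (None) are assumptions for illustration, not the behaviour of GenotypeCounter.

```python
from typing import Dict, List, Optional

# Hypothetical toy data: variant -> genotype per sample (None marks a missing call).
genotypes: Dict[str, List[Optional[float]]] = {
    "1:100:A:T": [0, 1, 2, None],
    "1:200:G:C": [0, 0, 1, 0],
}


def count_sketch(values: List[Optional[float]], method: str = "any") -> float:
    """Count carriers ('any') or sum genotype values ('sum'), skipping missing calls."""
    if method not in ("any", "sum"):
        raise ValueError("method must be 'any' or 'sum'")
    observed = [v for v in values if v is not None]
    if method == "any":
        # A carrier is any sample with a non-zero genotype.
        return sum(1 for v in observed if v != 0)
    return sum(observed)


for variant, values in genotypes.items():
    print(variant, count_sketch(values, "any"), count_sketch(values, "sum"))
```

For the first variant, 'any' gives 2 carriers while 'sum' gives 3, because the homozygous sample contributes 2 to the sum.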

count_by_category(df: DataFrame, categories: DataFrame, var_column: str = 'ID', cat_column: str = None, method: str = 'any') DataFrame

Count genotypes grouped by categories (e.g., genes)

Parameters:
  • df (pd.DataFrame) – Dataframe with samples as rows and variants as columns

  • categories (pd.DataFrame) – Dataframe mapping variants to categories

  • var_column (str) – Column name in categories containing variant IDs

  • cat_column (str) – Column name in categories containing category/gene names

  • method ({'any', 'sum'}) – Counting method (see count() method)

Returns:

Summary dataframe with counts for each category

Return type:

pd.DataFrame

class marvel.extraction.core.GenotypeProcessor

Bases: object

Handles genotype value processing and validation

static process_genotypes(df: DataFrame, gene_columns: list[str], neg_geno: str | int | bool | None = True, sum_geno: str | int | bool | None = True) DataFrame

Process genotype values according to specified rules.

Parameters:
  • df (pd.DataFrame) – Dataframe with genotype columns

  • gene_columns (list[str]) – Column names to process

  • neg_geno (str, int, bool, or None) –

    How to handle negative genotype values (e.g. PLINK no-call, coded as -1). Negative values arise when a genotype cannot be determined for a sample.

    • True (default): keep the raw value unchanged.

    • None: replace with NA (treat as missing).

    • integer/string (e.g. 0): replace all negative values with that value.

  • sum_geno (str, int, bool, or None) –

    How to handle genotype values greater than 1 (e.g. a sample carrying two variants in the same gene, or a BGEN dosage above 1).

    • True (default): keep the raw value unchanged.

    • None: replace with NA.

    • integer/string (e.g. 1): replace all values > 1 with that value. Setting sum_geno=1 converts gene carriers to binary (0/1) status.

Returns:

Processed dataframe

Return type:

pd.DataFrame
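
The neg_geno and sum_geno rules can be sketched on a single column of values. This is an illustrative reading of the parameter descriptions above, not the library code: apply_rules and _replace are hypothetical names, missing values are represented by None, and False is not handled specially because the documentation does not describe it.

```python
from typing import List, Optional, Union

Rule = Union[str, int, bool, None]


def _replace(value: float, rule: Rule) -> Optional[float]:
    if rule is True:      # keep the raw value unchanged
        return value
    if rule is None:      # treat as missing
        return None
    return float(rule)    # int or numeric string, e.g. 0 or "1"


def apply_rules(values: List[Optional[float]], neg_geno: Rule = True,
                sum_geno: Rule = True) -> List[Optional[float]]:
    """Apply the neg_geno / sum_geno rules to one column of genotype values."""
    out: List[Optional[float]] = []
    for v in values:
        if v is None:                 # already missing: leave as is
            out.append(None)
        elif v < 0:                   # e.g. a no-call coded as -1
            out.append(_replace(v, neg_geno))
        elif v > 1:                   # e.g. two variants in one gene, or dosage above 1
            out.append(_replace(v, sum_geno))
        else:
            out.append(v)
    return out


raw = [-1, 0, 1, 2, 1.5]
print(apply_rules(raw))                             # defaults keep raw values
print(apply_rules(raw, neg_geno=None, sum_geno=1))  # binary carrier status with missing no-calls
```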

class marvel.extraction.core.Variant(rsid: str, chrom: str, pos: int, ref: str, alt: List[str], sample_data: List[Tuple[str, float]])

Bases: object

Genetic variant with sample data

Parameters:
  • rsid (str) – Variant ID

  • chrom (str) – Chromosome number

  • pos (int) – Basepair position

  • ref (str) – Reference allele

  • alt (list[str]) – Alternative alleles

  • sample_data (list) – Genotypes of individuals for this variant

alt: List[str]
chrom: str
pos: int
ref: str
rsid: str
sample_data: List[Tuple[str, float]]

class marvel.extraction.core.VariantProcessor(chr_column: str = 'chr', pos_column: str = 'pos', chr_pos: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', var_sep: str = ':')

Bases: object

Processes variant dataframes

process(variants: DataFrame) DataFrame

Pre-process variants dataframe

Parameters:

variants (pd.DataFrame) – Variants dataframe

Returns:

variants – Processed variants dataframe

Return type:

pd.DataFrame

Raises:

InputValidationError – If required columns not present

marvel.extraction.core.count_gen(df: DataFrame, method: str = 'any', id_column: str = 'id', count_na: bool = True) DataFrame

Count genotypes across samples for each variant/gene

This is a convenience wrapper around the GenotypeCounter class.

Parameters:
  • df (pd.DataFrame) – Dataframe with samples as rows and variants/genes as columns. Index should contain sample IDs.

  • method ({'any', 'sum'}) –

    Counting method:

    • ‘any’: Count samples with any non-zero genotype (carriers)

    • ‘sum’: Sum all genotype values across samples

  • id_column (str) – Name of the ID column to exclude from counting

  • count_na (bool) – Whether to count NA/missing values

Returns:

Summary dataframe with counts for each variant/gene. Columns include: variant, count, total, frequency. If count_na=True, also includes: na_count, valid_samples.

Return type:

pd.DataFrame

Examples

>>> # Count carriers for each variant
>>> summary = count_gen(carrier_df, method='any')
>>> # Sum genotypes across samples
>>> summary = count_gen(carrier_df, method='sum')
>>> # Count without tracking NA values
>>> summary = count_gen(carrier_df, count_na=False)

marvel.extraction.extraction module

A module to extract genetic variants from genetic files

class marvel.extraction.extraction.BgenReaderClass

Bases: GeneticFileReader

Reader for BGEN files

get_regions(geno_file) dict[str, tuple[int, int]] | None

Return per-chromosome position ranges from BGEN file.

read(geno_file: str | BgenReader) Generator[Variant, None, None]

Read BGEN file

Parameters:

geno_file (str or BgenReader) – Loaded genetic input file, or path to it

Yields:

Variant – An object representing a genetic variant.

class marvel.extraction.extraction.ExtractionTask(geno_id: str, geno_path: str, var_id: str, var_path: str, identifier: str)

Bases: object

A single extraction task.

geno_id: str
geno_path: str
identifier: str
var_id: str
var_path: str

class marvel.extraction.extraction.GeneticFileReader

Bases: ABC

Abstract base class for genetic file readers

get_regions(geno_file) dict[str, tuple[int, int]] | None

Return per-chromosome position ranges in the file, or None.

abstract read(geno_file) Generator[Variant, None, None]

Read genetic file and yield variants

class marvel.extraction.extraction.PlinkReaderClass

Bases: GeneticFileReader

Reader for PLINK files

get_regions(geno_file) dict[str, tuple[int, int]] | None

Return per-chromosome position ranges from PLINK BIM file.

read(geno_file: str | PyPlink) Generator[Variant, None, None]

Read PLINK file

Parameters:

geno_file (str or pyplink.PyPlink) – Loaded genetic input file, or path to it

Yields:

Variant – An object representing a genetic variant.

class marvel.extraction.extraction.VariantExtraction(logger=None, verbose=False)

Bases: object

Handles extraction of variant carriers from genetic data files.

Supports:
  • Single genetic file × single variant file

  • Multiple genetic files × single variant file

  • Single genetic file × multiple variant files

  • Multiple genetic files × multiple variant files

execute_tasks(extraction_function: Callable[[...], Tuple[DataFrame, DataFrame]], raise_on_error: bool = True, *args, **kwargs) Tuple[DataFrame, DataFrame, dict[str, list[ExtractionTask]]]

Execute all extraction tasks using the provided extraction function.

Parameters:
  • extraction_function (Callable) – Function that performs the extraction.

  • raise_on_error (bool) – If True, raise exception on first error. If False, log and continue

  • *args – Additional arguments passed to the called function

  • **kwargs – Additional keyword arguments passed to the called function

Returns:

Summary dataframe, carrier dataframe, and dictionary with ‘successful’, ‘failed’, and ‘skipped’ task lists

Return type:

tuple[pd.DataFrame, pd.DataFrame, dict]

Raises:

ValueError – If no tasks have been set up
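
The raise_on_error switch and the shape of the results dictionary can be sketched as follows. This is a simplified stand-in, not the method itself: run_all and flaky are hypothetical names, tasks are plain strings, and print replaces the logger.

```python
from typing import Callable, Dict, List


def run_all(tasks: List[str], func: Callable[[str], int],
            raise_on_error: bool = True) -> Dict[str, List[str]]:
    """Run func on every task, either failing fast or recording failures."""
    results: Dict[str, List[str]] = {"successful": [], "failed": [], "skipped": []}
    for task in tasks:
        try:
            func(task)
        except Exception as exc:
            if raise_on_error:
                raise                                  # stop at the first error
            print(f"task {task!r} failed: {exc}")      # stand-in for logging
            results["failed"].append(task)
        else:
            results["successful"].append(task)
    return results


def flaky(task: str) -> int:
    # Hypothetical extraction function that fails for one task.
    if task == "chr2":
        raise RuntimeError("unreadable file")
    return len(task)


print(run_all(["chr1", "chr2", "chr3"], flaky, raise_on_error=False))
```

With raise_on_error=False the run completes and chr2 ends up in the 'failed' list; with the default, the RuntimeError propagates immediately.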

execute_tasks_parallel(extraction_function: Callable[[...], Tuple[DataFrame, DataFrame]], n_jobs: int = 1, raise_on_error: bool = True, checkpoint_dir: str = None, force_rerun: bool = False, *args, **kwargs) Tuple[DataFrame, DataFrame, dict[str, list[ExtractionTask]]]

Execute all extraction tasks in parallel, grouped by geno_file, with checkpoint/resume support.

Parameters:
  • extraction_function (Callable) – Function that performs the extraction

  • n_jobs (int) – Number of parallel workers. If 1, runs sequentially. If -1, uses all available CPUs.

  • raise_on_error (bool) – If True, raise exception on first error. If False, log and continue

  • checkpoint_dir (str, optional) – Directory to save intermediate results. If provided, tasks will save results after completion and skip already completed tasks on re-run.

  • force_rerun (bool) – If True, ignore existing checkpoint files and re-run all tasks

  • *args – Additional arguments passed to extraction_function

  • **kwargs – Additional keyword arguments passed to extraction_function

Returns:

Summary dataframe, carrier dataframe, and results dictionary

Return type:

tuple[pd.DataFrame, pd.DataFrame, dict]
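
The checkpoint/resume idea can be sketched with the standard library. Everything below is an assumption made for illustration: the use of a thread pool, the one-JSON-file-per-task layout, and the names run_with_checkpoints, marker, and work. The grouping by geno_file is not shown.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, List


def run_with_checkpoints(task_ids: List[str], checkpoint_dir: str, n_jobs: int = 2,
                         force_rerun: bool = False) -> Dict[str, List[str]]:
    """Run tasks in a thread pool, skipping those with an existing checkpoint file."""
    ckpt = Path(checkpoint_dir)
    ckpt.mkdir(parents=True, exist_ok=True)
    results: Dict[str, List[str]] = {"successful": [], "failed": [], "skipped": []}

    def marker(task_id: str) -> Path:
        return ckpt / f"{task_id}.json"   # hypothetical checkpoint file name

    pending = []
    for task_id in task_ids:
        if marker(task_id).exists() and not force_rerun:
            results["skipped"].append(task_id)    # finished in an earlier run
        else:
            pending.append(task_id)

    def work(task_id: str) -> str:
        # Stand-in for the real extraction; writes the checkpoint when done.
        marker(task_id).write_text(json.dumps({"task": task_id}))
        return task_id

    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        results["successful"] = list(pool.map(work, pending))
    return results


with tempfile.TemporaryDirectory() as tmp:
    first = run_with_checkpoints(["a", "b"], tmp)
    second = run_with_checkpoints(["a", "b", "c"], tmp)          # resumes: only "c" runs
    third = run_with_checkpoints(["a", "b"], tmp, force_rerun=True)
    print(first["successful"], second["skipped"], third["skipped"])
```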

setup_tasks(geno_files: dict[str, str], var_files: dict[str, str]) list[ExtractionTask]

Set up all extraction tasks for the given file combinations.

Parameters:
  • geno_files (dict[str, str]) – Dictionary mapping genetic file IDs to paths

  • var_files (dict[str, str]) – Dictionary mapping variant file IDs to paths

Returns:

List of all extraction tasks to be performed

Return type:

list[ExtractionTask]

Raises:

Various exceptions from validation
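
Setting up one task per combination of genetic file and variant file amounts to a Cartesian product of the two dictionaries. The sketch below assumes this reading; Task is a hypothetical stand-in that mirrors the ExtractionTask fields, and the identifier format and file names are invented for the example.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List


@dataclass
class Task:
    """Hypothetical stand-in mirroring the ExtractionTask fields."""
    geno_id: str
    geno_path: str
    var_id: str
    var_path: str
    identifier: str


def build_tasks(geno_files: Dict[str, str], var_files: Dict[str, str]) -> List[Task]:
    """Create one task per (genetic file, variant file) pair."""
    return [
        # The identifier format is an assumption, not taken from the library.
        Task(g_id, g_path, v_id, v_path, identifier=f"{g_id}_{v_id}")
        for (g_id, g_path), (v_id, v_path) in product(geno_files.items(), var_files.items())
    ]


tasks = build_tasks(
    {"chr1": "chr1.vcf.gz", "chr2": "chr2.vcf.gz"},
    {"lof": "lof.tsv", "missense": "missense.tsv"},
)
print([t.identifier for t in tasks])
```

Two genetic files and two variant files give four tasks, which corresponds to the "multiple × multiple" case listed for VariantExtraction.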

class marvel.extraction.extraction.VariantExtractor(id_column: str = 'id', chr_column: str = 'chr', pos_column: str = 'pos', chr_pos_column: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', var_sep: str = ':')

Bases: object

Main class for extracting variant carriers

extract(geno_file: str | VCF | BgenReader | PyPlink, variants: DataFrame, region: str | None = None, prefilter_regions: bool = False, **kwargs) DataFrame

Extract variant carriers from genetic file

Parameters:
  • geno_file (str or genetic file object) – Path or loaded genetic file

  • variants (pd.DataFrame) – Variants to extract

  • region (str, optional) – VCF region to consider

  • prefilter_regions (bool, optional) – Pre-filter input variants by chromosome/position ranges in the genetic file before extraction. Default False.

  • **kwargs – Additional arguments for allele comparison

Returns:

Carrier matrix (samples × variants)

Return type:

pd.DataFrame
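
Pre-filtering by region can be pictured as dropping input variants that lie outside the per-chromosome ranges reported by get_regions(). The sketch below is an assumption-laden illustration: prefilter is a hypothetical helper, each range is taken to be an inclusive (min_pos, max_pos) pair, and variants are plain (chrom, pos) tuples rather than dataframe rows.

```python
from typing import Dict, List, Tuple

Regions = Dict[str, Tuple[int, int]]


def prefilter(variants: List[Tuple[str, int]], regions: Regions) -> List[Tuple[str, int]]:
    """Keep (chrom, pos) pairs that fall inside the file's per-chromosome range."""
    kept = []
    for chrom, pos in variants:
        if chrom in regions:
            start, end = regions[chrom]
            if start <= pos <= end:       # inclusive bounds are an assumption
                kept.append((chrom, pos))
    return kept


regions = {"1": (10_000, 50_000)}   # same shape as the get_regions() return type
wanted = [("1", 12_345), ("1", 99_999), ("2", 12_345)]
print(prefilter(wanted, regions))
```

Only the first variant survives: the second is beyond the range on chromosome 1, and chromosome 2 is absent from the file.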

class marvel.extraction.extraction.VcfReaderClass

Bases: GeneticFileReader

Reader for VCF files

get_regions(geno_file) dict[str, tuple[int, int]] | None

Return per-chromosome position ranges from VCF file.

read(geno_file: str | VCF, region: str | None = None) Generator[Variant, None, None]

Read VCF file.

When a GP FORMAT field is present (e.g. VCFs produced by qctool from BGEN files), dosage is computed from genotype probabilities using the same formula as BgenReaderClass:

dosage = P(het) + 2 * P(hom_alt)

This ensures mathematical equivalence between the BGEN and VCF code paths when the VCF originates from a BGEN source. When GP is absent the reader falls back to hard GT calls.

Parameters:
  • geno_file (str or cyvcf2.cyvcf2.VCF) – Loaded genetic input file, or path to it

  • region (str or None) – Region within the VCF file to consider

Yields:

Variant – An object representing a genetic variant.
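
The dosage formula above can be checked on a few probability triples. This sketch assumes a biallelic, diploid site with GP ordered as (P(hom_ref), P(het), P(hom_alt)); dosage_from_gp is a hypothetical helper, not part of the reader.

```python
from typing import Sequence


def dosage_from_gp(gp: Sequence[float]) -> float:
    """Expected alt-allele count from (P(hom_ref), P(het), P(hom_alt))."""
    p_hom_ref, p_het, p_hom_alt = gp
    return p_het + 2 * p_hom_alt


print(dosage_from_gp((0.0, 1.0, 0.0)))     # 1.0: certain heterozygote
print(dosage_from_gp((0.0, 0.0, 1.0)))     # 2.0: certain homozygous alt
print(dosage_from_gp((0.25, 0.5, 0.25)))   # 1.0: uncertain call, expected one alt allele
```

A hard GT call corresponds to the special case where one of the three probabilities is 1, which is why the GT fallback yields 0, 1, or 2.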

marvel.extraction.extraction.extract_carriers(geno_file: str | VCF | BgenReader | PyPlink, variants: str | DataFrame, id_column: str = 'id', chr_column: str = 'chr', pos_column: str = 'pos', chr_pos_column: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', var_sep: str = ':', region: str | None = None, summarise: bool = True, prefilter_regions: bool = False, **comparison_kwargs) Tuple[DataFrame, DataFrame]

Extract variant carriers from genetic file

Parameters:
  • geno_file (str or genetic file object) – Path or loaded genetic file

  • variants (str or pd.DataFrame) – Path to or loaded dataframe of variants to extract

  • id_column (str) – Individual ID column name

  • chr_column (str) – Chromosome column name

  • pos_column (str) – Position column name

  • chr_pos_column (str) – Chromosome:position column name

  • ref_column (str) – Reference allele column name

  • alt_column (str) – Alternate allele column name

  • var_column (str) – Variant ID column name

  • var_sep (str) – Separator for variant ID components

  • region (str, optional) – VCF region to consider

  • summarise (bool) – Whether to return summary and detailed results

  • prefilter_regions (bool, optional) – Pre-filter input variants by chromosome/position ranges in the genetic file before extraction. Default False.

  • **comparison_kwargs – Additional arguments for allele comparison

Returns:

Carrier matrix, optionally with empty summary

Return type:

pd.DataFrame or tuple[pd.DataFrame, pd.DataFrame]

marvel.extraction.aggregation module

Extract variant carriers in specific genes/categories

class marvel.extraction.aggregation.GeneExtractor(id_column: str = 'id', cat_column: str = None, chr_column: str = 'chr', pos_column: str = 'pos', chr_pos: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', var_sep: str = ':')

Bases: object

Main class for extracting gene-level carrier information

extract(input_file: str | DataFrame, genes: str | DataFrame, reverse: bool = True, incl_var: bool = True, summarise: bool = True, method: str = 'any', count_na: bool = True, neg_geno: str | int | bool | None = True, sum_geno: str | int | bool | None = True, **kwargs) Tuple[DataFrame, DataFrame]

Extract gene-level carrier information.

Parameters:
  • input_file (str or pd.DataFrame) – Path to genetic data file or carrier dataframe

  • genes (str or pd.DataFrame) – Path to or loaded dataframe with variant-gene mapping

  • id_column (str) – Individual ID column name

  • reverse (bool) – Whether to check inversed alleles

  • incl_var (bool) – Whether to include variant columns in output

  • summarise (bool) – Whether to create summary report

  • method ({'any', 'sum'}) –

    Counting method:

    • ‘any’: Count samples with any non-zero genotype (carriers)

    • ‘sum’: Sum all genotype values across samples

  • count_na (bool) – Whether to count NA/missing values

  • neg_geno (str, int, bool, or None) –

    How to handle negative genotype values (e.g. PLINK no-call, coded as -1). Negative values arise when a genotype cannot be determined for a sample.

    • True (default): keep the raw value unchanged.

    • None: replace with NA (treat as missing).

    • integer/string (e.g. 0): replace all negative values with that value.

  • sum_geno (str, int, bool, or None) –

    How to handle genotype values greater than 1 (e.g. a sample carrying two variants in the same gene, or a BGEN dosage above 1).

    • True (default): keep the raw value unchanged.

    • None: replace with NA.

    • integer/string (e.g. 1): replace all values > 1 with that value. Setting sum_geno=1 converts gene carriers to binary (0/1) status.

  • **kwargs – Additional arguments for extract_carriers

Returns:

Summary dataframe and gene carriers dataframe

Return type:

tuple[pd.DataFrame, pd.DataFrame]

class marvel.extraction.aggregation.VariantToGeneAggregator(chr_column: str = 'chr', pos_column: str = 'pos', chr_pos: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', var_sep: str = ':', reverse: bool = True)

Bases: object

Aggregates variant-level data to gene-level

aggregate(var_carriers: DataFrame, genes: DataFrame, cat_column: str, var_column: str = 'ID') DataFrame

Aggregate variant carriers to gene carriers.

Parameters:
  • var_carriers (pd.DataFrame) – Dataframe with variant carrier information

  • genes (pd.DataFrame) – Dataframe mapping variants to genes

  • cat_column (str) – Column in genes indicating gene name

  • var_column (str) – Column in genes indicating variant ID

Returns:

Dataframe with gene-level carrier information

Return type:

pd.DataFrame
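
Gene-level aggregation can be sketched as a per-sample sum over the variants mapped to each gene. This is one plausible reading, suggested by the sum_geno description (a sample carrying two variants in the same gene gets a value above 1); the toy data, gene names, and aggregate_sketch are hypothetical, and plain dictionaries replace the dataframes.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical toy inputs: genotypes per variant (one value per sample) and a variant -> gene map.
var_carriers: Dict[str, List[int]] = {
    "1:100:A:T": [0, 1, 0],
    "1:200:G:C": [0, 1, 1],
    "2:300:C:G": [1, 0, 0],
}
variant_to_gene = {"1:100:A:T": "GENE1", "1:200:G:C": "GENE1", "2:300:C:G": "GENE2"}


def aggregate_sketch(carriers: Dict[str, List[int]],
                     mapping: Dict[str, str]) -> Dict[str, List[int]]:
    """Sum variant genotypes per sample within each gene."""
    n_samples = len(next(iter(carriers.values())))
    genes: Dict[str, List[int]] = defaultdict(lambda: [0] * n_samples)
    for variant, values in carriers.items():
        gene = mapping.get(variant)
        if gene is None:          # variant not mapped to any gene: ignore it
            continue
        genes[gene] = [a + b for a, b in zip(genes[gene], values)]
    return dict(genes)


print(aggregate_sketch(var_carriers, variant_to_gene))
```

The second sample carries both GENE1 variants and therefore gets a gene-level value of 2, the situation that sum_geno=1 would collapse to binary carrier status.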

marvel.extraction.aggregation.extract_genes(input_file: str | DataFrame, genes: str | DataFrame, cat_column: str, id_column: str = 'id', chr_column: str = 'chr', pos_column: str = 'pos', chr_pos: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', var_sep: str = ':', reverse: bool = True, incl_var: bool = True, summarise: bool = True, neg_geno: str | int | bool | None = True, sum_geno: str | int | bool | None = True, **kwargs) Tuple[DataFrame, DataFrame]

Extract gene-level carrier information from genetic data.

This function aggregates variant-level carrier information to gene-level, optionally checking inversed alleles and processing genotype values.

Parameters:
  • input_file (str or pd.DataFrame) – Path to genetic data file or carrier dataframe from extract_carriers

  • genes (str or pd.DataFrame) – Path to or loaded dataframe mapping variants to genes

  • cat_column (str) – Column name indicating gene for each variant

  • id_column (str) – Column name for sample IDs

  • chr_column (str) – Chromosome column name

  • pos_column (str) – Position column name

  • chr_pos (str) – Chromosome:position column name

  • ref_column (str) – Reference allele column name

  • alt_column (str) – Alternate allele column name

  • var_column (str) – Column name indicating variant IDs

  • var_sep (str) – Separator for variant ID components (chr:pos:ref:alt)

  • reverse (bool) – Whether to check inversed alleles (e.g., A:T vs T:A)

  • incl_var (bool) – Whether to include variant columns in output

  • summarise (bool) – Whether to create summary report

  • method ({'any', 'sum'}) –

    Counting method:

    • ‘any’: Count samples with any non-zero genotype (carriers)

    • ‘sum’: Sum all genotype values across samples

  • count_na (bool) – Whether to count NA/missing values

  • neg_geno (str, int, bool, or None) –

    How to handle negative genotype values (e.g. PLINK no-call, coded as -1). Negative values arise when a genotype cannot be determined for a sample.

    • True (default): keep the raw value unchanged.

    • None: replace with NA (treat as missing).

    • integer/string (e.g. 0): replace all negative values with that value.

  • sum_geno (str, int, bool, or None) –

    How to handle genotype values greater than 1 (e.g. a sample carrying two variants in the same gene, or a BGEN dosage above 1).

    • True (default): keep the raw value unchanged.

    • None: replace with NA.

    • integer/string (e.g. 1): replace all values > 1 with that value. Setting sum_geno=1 converts gene carriers to binary (0/1) status.

  • **kwargs – Additional arguments passed to extract_carriers

Returns:

Summary dataframe and gene carriers dataframe

Return type:

tuple[pd.DataFrame, pd.DataFrame]

Examples

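>>> # cat_column is required; 'gene' is an assumed column name in the mapping file
>>> sum_df, gen_carriers = extract_genes(
...     input_file='genotypes.vcf',
...     genes='gene_variants.tsv',
...     cat_column='gene',
...     reverse=True,
...     incl_var=False
... )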