`marvel.extraction` package

`marvel.extraction.core` module

Core module to be used in several other modules

class marvel.extraction.core.AlleleComparator

Bases: object

Compare variant alleles

static compare(variant: Variant, alleles: List[str], ref: None | str = None, warning: bool = False) → bool

Compare variant alleles with expected alleles

Parameters:

variant (Variant) – Genetic variant to compare
alleles (list[str]) – Expected alleles
ref (str, optional) – Reference allele for explicit matching
warning (bool) – Whether to warn on mismatch

Returns:

True if alleles match

Return type:

bool

class marvel.extraction.core.AlleleInverter(var_sep: str = ':')

Bases: object

Handles allele inversion operations

add_inversed_column(df: DataFrame, var_column: str = 'ID') → DataFrame

Add column with inversed alleles to dataframe.

Parameters:

df (pd.DataFrame) – Dataframe with variant IDs
var_column (str) – Column name containing variant IDs

Returns:

Dataframe with added inversed allele column

Return type:

pd.DataFrame

reorder_alleles(variant_id: str) → str

Inverse the order of alleles in a variant ID.

Parameters: variant_id (str) – Variant ID string (e.g., ‘chr1:12345:A:T’)
Returns: Variant ID with inversed alleles (e.g., ‘chr1:12345:T:A’)
Return type: str
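
The inversion described above can be sketched in a few lines of plain Python. This is an illustration only, not the library's implementation: the function name `swap_alleles` is hypothetical, and it assumes the variant ID ends with the reference and alternate alleles as its last two fields.

```python
def swap_alleles(variant_id: str, var_sep: str = ":") -> str:
    """Swap the last two fields (assumed to be the alleles) of a variant ID."""
    parts = variant_id.split(var_sep)
    if len(parts) < 4:
        raise ValueError(f"expected at least 4 fields, got {variant_id!r}")
    # Exchange ref and alt; chromosome and position stay untouched.
    parts[-2], parts[-1] = parts[-1], parts[-2]
    return var_sep.join(parts)


inverted = swap_alleles("chr1:12345:A:T")
```

Applying the swap twice returns the original ID, which is a convenient sanity check when matching variants in both orientations.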

class marvel.extraction.core.GenotypeCounter(id_column: str = 'id', count_na: bool = True)

Bases: object

Class for counting and summarizing genotypes across samples

count(df: DataFrame, method: str = 'any') → DataFrame

Count genotypes across samples for each variant/gene

Parameters:

df (pd.DataFrame) – Dataframe with samples as rows and variants/genes as columns. Index should contain sample IDs.
method ({'any', 'sum'}) – Counting method:
- ‘any’: Count samples with any non-zero genotype (carriers)
- ‘sum’: Sum all genotype values across samples

Returns:

Summary dataframe with counts for each variant/gene

Return type:

pd.DataFrame

Raises:

ValueError – If method is not ‘any’ or ‘sum’
TypeError – If df is not a DataFrame
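
To make the difference between the two counting methods concrete, here is a minimal pure-Python sketch. It is not the `GenotypeCounter` implementation: the function `count_genotypes` is hypothetical, it works on a plain dict of columns rather than a DataFrame, and it assumes missing genotypes (`None`) are ignored when counting.

```python
from typing import Dict, List, Optional


def count_genotypes(
    columns: Dict[str, List[Optional[float]]], method: str = "any"
) -> Dict[str, float]:
    """Summarise each variant column with the 'any' or 'sum' method."""
    if method not in ("any", "sum"):
        raise ValueError("method must be 'any' or 'sum'")
    counts = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]  # drop missing values
        if method == "any":
            # Number of samples with a non-zero genotype (carriers).
            counts[name] = sum(1 for v in observed if v != 0)
        else:
            # Total of all genotype values across samples.
            counts[name] = sum(observed)
    return counts


genotypes = {
    "chr1:100:A:T": [0, 1, 2, None],
    "chr2:200:G:C": [0, 0, 1, 0],
}
carriers = count_genotypes(genotypes, method="any")
totals = count_genotypes(genotypes, method="sum")
```

In this toy data the first variant has two carriers but a genotype sum of three, because one sample carries two copies.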

count_by_category(df: DataFrame, categories: DataFrame, var_column: str = 'ID', cat_column: str = None, method: str = 'any') → DataFrame

Count genotypes grouped by categories (e.g., genes)

Parameters:

df (pd.DataFrame) – Dataframe with samples as rows and variants as columns
categories (pd.DataFrame) – Dataframe mapping variants to categories
var_column (str) – Column name in categories containing variant IDs
cat_column (str) – Column name in categories containing category/gene names
method ({'any', 'sum'}) – Counting method (see count() method)

Returns:

Summary dataframe with counts for each category

Return type:

pd.DataFrame

class marvel.extraction.core.GenotypeProcessor

Bases: object

Handles genotype value processing and validation

Process genotype values according to specified rules.

Parameters:

df (pd.DataFrame) – Dataframe with genotype columns
gene_columns (list[str]) – Column names to process
neg_geno (str, int, bool, or None) –
How to handle negative genotype values (e.g. PLINK no-call, coded as -1). Negative values arise when a genotype cannot be determined for a sample.
- True (default): keep the raw value unchanged.
- None: replace with NA (treat as missing).
- integer/string (e.g. 0): replace all negative values with that value.
sum_geno (str, int, bool, or None) –
How to handle genotype values greater than 1 (e.g. a sample carrying two variants in the same gene, or a BGEN dosage above 1).
- True (default): keep the raw value unchanged.
- None: replace with NA.
- integer/string (e.g. 1): replace all values > 1 with that value. Setting sum_geno=1 converts gene carriers to binary (0/1) status.

Returns:

Processed dataframe

Return type:

pd.DataFrame
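
The `neg_geno` and `sum_geno` rules can be read as a per-value decision. The sketch below is a hypothetical helper (`process_value` is not part of the package) that applies the documented rules to a single genotype value. It assumes missing values are represented as `None` and that each value is checked against its original magnitude.

```python
def process_value(value, neg_geno=True, sum_geno=True):
    """Apply the documented neg_geno / sum_geno rules to one genotype value."""
    if value is None:
        return None  # already missing
    if value < 0:
        rule = neg_geno
    elif value > 1:
        rule = sum_geno
    else:
        return value  # 0..1 is never modified
    # Identity check on purpose: in Python 1 == True, so `rule == True`
    # would wrongly treat sum_geno=1 as "keep unchanged".
    if rule is True:
        return value
    # None means "replace with NA"; any other rule is the replacement value.
    return rule


raw = [-1, 0, 1, 2]
unchanged = [process_value(v) for v in raw]
binary = [process_value(v, neg_geno=None, sum_geno=1) for v in raw]
```

With the defaults the values pass through untouched, while `neg_geno=None, sum_geno=1` yields a binary carrier status with no-calls marked as missing.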

class marvel.extraction.core.Variant(rsid: str, chrom: str, pos: int, ref: str, alt: List[str], sample_data: List[Tuple[str, float]])

Bases: object

Genetic variant with sample data

Parameters:

rsid (str) – Variant ID
chrom (str) – Chromosome number
pos (int) – Basepair position
ref (str) – Reference allele
alt (list[str]) – Alternative alleles
sample_data (list) – Genotypes of individuals for this variant

alt: List[str]

chrom: str

pos: int

ref: str

rsid: str

sample_data: List[Tuple[str, float]]

class marvel.extraction.core.VariantProcessor(chr_column: str = 'chr', pos_column: str = 'pos', chr_pos: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', var_sep: str = ':')

Bases: object

Processes variant dataframes

process(variants: DataFrame) → DataFrame

Pre-process variants dataframe

Parameters: variants (pd.DataFrame) – Variants dataframe
Returns: variants – Processed variants dataframe
Return type: pd.DataFrame
Raises: InputValidationError – If required columns are not present

marvel.extraction.core.count_gen(df: DataFrame, method: str = 'any', id_column: str = 'id', count_na: bool = True) → DataFrame

Count genotypes across samples for each variant/gene

This is a convenience wrapper around the GenotypeCounter class.

Parameters:

df (pd.DataFrame) – Dataframe with samples as rows and variants/genes as columns. Index should contain sample IDs.
method ({'any', 'sum'}) – Counting method:
- ‘any’: Count samples with any non-zero genotype (carriers)
- ‘sum’: Sum all genotype values across samples
id_column (str) – Name of the ID column to exclude from counting
count_na (bool) – Whether to count NA/missing values

Returns:

Summary dataframe with counts for each variant/gene. Columns include: variant, count, total, frequency. If count_na=True, also includes: na_count, valid_samples.

Return type:

pd.DataFrame

Examples

>>> # Count carriers for each variant
>>> summary = count_gen(carrier_df, method='any')

>>> # Sum genotypes across samples
>>> summary = count_gen(carrier_df, method='sum')

>>> # Count without tracking NA values
>>> summary = count_gen(carrier_df, count_na=False)

`marvel.extraction.extraction` module

A module to extract genetic variants from genetic files

class marvel.extraction.extraction.BgenReaderClass

Bases: GeneticFileReader

Reader for BGEN files

get_regions(geno_file) → dict[str, tuple[int, int]] | None: Return per-chromosome position ranges from BGEN file.

read(geno_file: str | BgenReader) → Generator[Variant, None, None]

Read BGEN file

Parameters: geno_file (str or BgenReader) – Genetic input file, or path to it
Yields: Variant – An object representing a genetic variant.

class marvel.extraction.extraction.ExtractionTask(geno_id: str, geno_path: str, var_id: str, var_path: str, identifier: str)

Bases: object

A single extraction task.

geno_id: str

geno_path: str

identifier: str

var_id: str

var_path: str

class marvel.extraction.extraction.GeneticFileReader

Bases: ABC

Abstract base class for genetic file readers

get_regions(geno_file) → dict[str, tuple[int, int]] | None: Return per-chromosome position ranges in the file, or None.

abstract read(geno_file) → Generator[Variant, None, None]: Read genetic file and yield variants

class marvel.extraction.extraction.PlinkReaderClass

Bases: GeneticFileReader

Reader for PLINK files

get_regions(geno_file) → dict[str, tuple[int, int]] | None: Return per-chromosome position ranges from PLINK BIM file.

read(geno_file: str | PyPlink) → Generator[Variant, None, None]

Read PLINK file

Parameters: geno_file (str or pyplink.PyPlink) – Genetic input file, or path to it
Yields: Variant – An object representing a genetic variant.

class marvel.extraction.extraction.VariantExtraction(logger=None, verbose=False)

Bases: object

Handles extraction of variant carriers from genetic data files.

Supports:

Single genetic file × single variant file
Multiple genetic files × single variant file
Single genetic file × multiple variant files
Multiple genetic files × multiple variant files

execute_tasks(extraction_function: Callable[[...], Tuple[DataFrame, DataFrame]], raise_on_error: bool = True, *args, **kwargs) → Tuple[DataFrame, DataFrame, dict[str, list[ExtractionTask]]]

Execute all extraction tasks using the provided extraction function.

Parameters:

extraction_function (Callable) – Function that performs the extraction.
raise_on_error (bool) – If True, raise exception on first error. If False, log and continue
*args – Additional arguments passed to the called function
**kwargs – Additional keyword arguments passed to the called function

Returns:

Summary dataframe, carrier dataframe, and dictionary with ‘successful’, ‘failed’, and ‘skipped’ task lists

Return type:

tuple[pd.DataFrame, pd.DataFrame, dict]

Raises:

ValueError – If no tasks have been set up

execute_tasks_parallel(extraction_function: Callable[[...], Tuple[DataFrame, DataFrame]], n_jobs: int = 1, raise_on_error: bool = True, checkpoint_dir: str = None, force_rerun: bool = False, *args, **kwargs) → Tuple[DataFrame, DataFrame, dict[str, list[ExtractionTask]]]

Execute all extraction tasks in parallel, grouped by geno_file, with checkpoint/resume support.

Parameters:

extraction_function (Callable) – Function that performs the extraction
n_jobs (int) – Number of parallel workers. If 1, runs sequentially. If -1, uses all available CPUs.
raise_on_error (bool) – If True, raise exception on first error. If False, log and continue
checkpoint_dir (str, optional) – Directory to save intermediate results. If provided, tasks will save results after completion and skip already completed tasks on re-run.
force_rerun (bool) – If True, ignore existing checkpoint files and re-run all tasks
*args – Additional arguments passed to extraction_function
**kwargs – Additional keyword arguments passed to extraction_function

Returns:

Summary dataframe, carrier dataframe, and results dictionary

Return type:

tuple[pd.DataFrame, pd.DataFrame, dict]
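
The grouping and checkpoint/resume behaviour described for `execute_tasks_parallel` can be illustrated with the standard library alone. This is a rough sketch under explicit assumptions, not the package's logic: the helpers `group_by_geno` and `split_by_checkpoint` are hypothetical, tasks are reduced to `(geno_path, identifier)` pairs, and a finished task is assumed to leave a marker file named `<identifier>.done` (the real checkpoint format is not documented here).

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def group_by_geno(tasks):
    """Group task identifiers by the genetic file they read."""
    groups = defaultdict(list)
    for geno_path, identifier in tasks:
        groups[geno_path].append(identifier)
    return dict(groups)


def split_by_checkpoint(identifiers, checkpoint_dir, force_rerun=False):
    """Return (to_run, skipped) based on hypothetical '<identifier>.done' markers."""
    to_run, skipped = [], []
    for identifier in identifiers:
        marker = Path(checkpoint_dir) / f"{identifier}.done"
        if marker.exists() and not force_rerun:
            skipped.append(identifier)  # result already on disk
        else:
            to_run.append(identifier)
    return to_run, skipped


groups = group_by_geno(
    [("chr1.bgen", "chr1_lof"), ("chr1.bgen", "chr1_missense"), ("chr2.bgen", "chr2_lof")]
)

with tempfile.TemporaryDirectory() as checkpoint_dir:
    # Pretend one task finished in an earlier run.
    (Path(checkpoint_dir) / "chr1_lof.done").touch()
    resumed = split_by_checkpoint(["chr1_lof", "chr1_missense"], checkpoint_dir)
    forced = split_by_checkpoint(
        ["chr1_lof", "chr1_missense"], checkpoint_dir, force_rerun=True
    )
```

Grouping by genetic file lets each worker process all tasks for one file together, and `force_rerun=True` simply ignores any existing markers.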

setup_tasks(geno_files: dict[str, str], var_files: dict[str, str]) → list[ExtractionTask]

Set up all extraction tasks for the given file combinations.

Parameters:

geno_files (dict[str, str]) – Dictionary mapping genetic file IDs to paths
var_files (dict[str, str]) – Dictionary mapping variant file IDs to paths

Returns:

List of all extraction tasks to be performed

Return type:

list[ExtractionTask]

Raises:

Various exceptions raised during validation
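
One plausible reading of "all extraction tasks for the given file combinations" is a cross product of genetic files and variant files. The sketch below shows that idea only. It assumes every genetic file is paired with every variant file and invents an identifier format (`<geno_id>_<var_id>`); neither assumption is confirmed by the documentation, and `Task` and `build_tasks` are hypothetical stand-ins for `ExtractionTask` and `setup_tasks`.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List


@dataclass(frozen=True)
class Task:
    """Stand-in mirroring the documented ExtractionTask fields."""
    geno_id: str
    geno_path: str
    var_id: str
    var_path: str
    identifier: str


def build_tasks(geno_files: Dict[str, str], var_files: Dict[str, str]) -> List[Task]:
    """Pair every genetic file with every variant file (assumed behaviour)."""
    return [
        Task(geno_id, geno_path, var_id, var_path, identifier=f"{geno_id}_{var_id}")
        for (geno_id, geno_path), (var_id, var_path) in product(
            geno_files.items(), var_files.items()
        )
    ]


tasks = build_tasks(
    {"chr1": "chr1.bgen", "chr2": "chr2.bgen"},
    {"lof": "lof.tsv"},
)
```

Two genetic files and one variant file give two tasks, which corresponds to the "multiple genetic files × single variant file" case listed for `VariantExtraction`.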

class marvel.extraction.extraction.VariantExtractor(id_column: str = 'id', chr_column: str = 'chr', pos_column: str = 'pos', chr_pos_column: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', var_sep: str = ':')

Bases: object

Main class for extracting variant carriers

extract(geno_file: str | VCF | BgenReader | PyPlink, variants: DataFrame, region: str | None = None, prefilter_regions: bool = False, **kwargs) → DataFrame

Extract variant carriers from genetic file

Parameters:

geno_file (str or genetic file object) – Path or loaded genetic file
variants (pd.DataFrame) – Variants to extract
region (str, optional) – VCF region to consider
prefilter_regions (bool, optional) – Pre-filter input variants by chromosome/position ranges in the genetic file before extraction. Default False.
**kwargs – Additional arguments for allele comparison

Returns:

Carrier matrix (samples × variants)

Return type:

pd.DataFrame
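
The `prefilter_regions` option pairs naturally with the `dict[str, tuple[int, int]] | None` value returned by the readers' `get_regions` methods. A minimal sketch of such a filter follows. The helper `prefilter` is hypothetical, variants are simplified to `(chrom, pos)` tuples, and the range bounds are assumed to be inclusive.

```python
from typing import Dict, List, Optional, Tuple


def prefilter(
    variants: List[Tuple[str, int]],
    regions: Optional[Dict[str, Tuple[int, int]]],
) -> List[Tuple[str, int]]:
    """Keep variants whose position lies inside the file's range for that chromosome."""
    if regions is None:
        return list(variants)  # no range information: keep everything
    kept = []
    for chrom, pos in variants:
        bounds = regions.get(chrom)
        # Inclusive bounds are an assumption of this sketch.
        if bounds is not None and bounds[0] <= pos <= bounds[1]:
            kept.append((chrom, pos))
    return kept


regions = {"1": (1_000, 5_000)}
candidates = [("1", 900), ("1", 1_000), ("1", 4_200), ("2", 3_000)]
inside = prefilter(candidates, regions)
```

Dropping out-of-range variants up front avoids scanning a genetic file for variants it cannot contain.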

class marvel.extraction.extraction.VcfReaderClass

Bases: GeneticFileReader

Reader for VCF files

get_regions(geno_file) → dict[str, tuple[int, int]] | None: Return per-chromosome position ranges from VCF file.

read(geno_file: str | VCF, region: str | None = None) → Generator[Variant, None, None]

Read VCF file.

When a GP FORMAT field is present (e.g. VCFs produced by qctool from BGEN files), dosage is computed from genotype probabilities using the same formula as BgenReaderClass:

dosage = P(het) + 2 * P(hom_alt)

This ensures mathematical equivalence between the BGEN and VCF code paths when the VCF originates from a BGEN source. When GP is absent the reader falls back to hard GT calls.

Parameters:

geno_file (str or cyvcf2.cyvcf2.VCF) – Genetic input file, or path to it
region (str or None) – Region within the vcf-file to consider

Yields:

Variant – An object representing a genetic variant.
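
The dosage formula given for `VcfReaderClass.read` is easy to check by hand. The sketch below uses a hypothetical helper, `dosage_from_gp`, and assumes a biallelic diploid site whose GP triple is ordered (hom_ref, het, hom_alt), as the VCF specification orders genotype probabilities.

```python
from typing import Sequence


def dosage_from_gp(gp: Sequence[float]) -> float:
    """Expected alternate-allele count: P(het) + 2 * P(hom_alt)."""
    p_hom_ref, p_het, p_hom_alt = gp  # p_hom_ref does not contribute
    return p_het + 2.0 * p_hom_alt


certain_het = dosage_from_gp((0.0, 1.0, 0.0))
certain_hom_alt = dosage_from_gp((0.0, 0.0, 1.0))
uncertain = dosage_from_gp((0.25, 0.5, 0.25))
```

A confidently called heterozygote gives a dosage of 1 and a homozygous alternate gives 2, matching hard GT calls, while uncertain probabilities produce fractional or averaged dosages.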

marvel.extraction.extraction.extract_carriers(geno_file: str | VCF | BgenReader | PyPlink, variants: str | DataFrame, id_column: str = 'id', chr_column: str = 'chr', pos_column: str = 'pos', chr_pos_column: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', var_sep: str = ':', region: str | None = None, summarise: bool = True, prefilter_regions: bool = False, **comparison_kwargs) → Tuple[DataFrame, DataFrame]

Extract variant carriers from genetic file

Parameters:

geno_file (str or genetic file object) – Path or loaded genetic file
variants (str or pd.DataFrame) – Path to or loaded dataframe of variants to extract
id_column (str) – Individual ID column name
chr_column (str) – Chromosome column name
pos_column (str) – Position column name
chr_pos_column (str) – Chromosome:position column name
ref_column (str) – Reference allele column name
alt_column (str) – Alternate allele column name
var_column (str) – Variant ID column name
var_sep (str) – Separator for variant ID components
region (str, optional) – VCF region to consider
summarise (bool) – Whether to return summary and detailed results
prefilter_regions (bool, optional) – Pre-filter input variants by chromosome/position ranges in the genetic file before extraction. Default False.
**comparison_kwargs – Additional arguments for allele comparison

Returns:

Carrier matrix, optionally with empty summary

Return type:

pd.DataFrame or tuple[pd.DataFrame, pd.DataFrame]

`marvel.extraction.aggregation` module

Extract variant carriers in specific genes/categories

class marvel.extraction.aggregation.GeneExtractor(id_column: str = 'id', cat_column: str = None, chr_column: str = 'chr', pos_column: str = 'pos', chr_pos: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', var_sep: str = ':')

Bases: object

Main class for extracting gene-level carrier information

Extract gene-level carrier information.

Parameters:

input_file (str or pd.DataFrame) – Path to genetic data file or carrier dataframe
genes (str or pd.DataFrame) – Path to or loaded dataframe with variant-gene mapping
id_column (str) – Individual ID column name
reverse (bool) – Whether to check inversed alleles
incl_var (bool) – Whether to include variant columns in output
summarise (bool) – Whether to create summary report
method ({'any', 'sum'}) – Counting method:
- ‘any’: Count samples with any non-zero genotype (carriers)
- ‘sum’: Sum all genotype values across samples
count_na (bool) – Whether to count NA/missing values
neg_geno (str, int, bool, or None) –
How to handle negative genotype values (e.g. PLINK no-call, coded as -1). Negative values arise when a genotype cannot be determined for a sample.
- True (default): keep the raw value unchanged.
- None: replace with NA (treat as missing).
- integer/string (e.g. 0): replace all negative values with that value.
sum_geno (str, int, bool, or None) –
How to handle genotype values greater than 1 (e.g. a sample carrying two variants in the same gene, or a BGEN dosage above 1).
- True (default): keep the raw value unchanged.
- None: replace with NA.
- integer/string (e.g. 1): replace all values > 1 with that value. Setting sum_geno=1 converts gene carriers to binary (0/1) status.
**kwargs – Additional arguments for extract_carriers

Returns:

Summary dataframe and gene carriers dataframe

Return type:

tuple[pd.DataFrame, pd.DataFrame]

class marvel.extraction.aggregation.VariantToGeneAggregator(chr_column: str = 'chr', pos_column: str = 'pos', chr_pos: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', var_sep: str = ':', reverse: bool = True)

Bases: object

Aggregates variant-level data to gene-level

aggregate(var_carriers: DataFrame, genes: DataFrame, cat_column: str, var_column: str = 'ID') → DataFrame

Aggregate variant carriers to gene carriers.

Parameters:

var_carriers (pd.DataFrame) – Dataframe with variant carrier information
genes (pd.DataFrame) – Dataframe mapping variants to genes
cat_column (str) – Column in genes indicating gene name
var_column (str) – Column in genes indicating variant ID

Returns:

Dataframe with gene-level carrier information

Return type:

pd.DataFrame
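
Aggregating variant carriers to gene carriers can be pictured as summing each sample's genotypes over the variants mapped to a gene. The sketch below is a pure-Python illustration, not `VariantToGeneAggregator` itself: `aggregate_to_genes` is a hypothetical name, nested dicts stand in for DataFrames, and summation is assumed because the documentation mentions gene values above 1 for samples carrying two variants in the same gene.

```python
from typing import Dict


def aggregate_to_genes(
    var_carriers: Dict[str, Dict[str, int]],
    variant_to_gene: Dict[str, str],
) -> Dict[str, Dict[str, int]]:
    """Sum each sample's variant genotypes per gene."""
    gene_carriers = {}
    for sample, genotypes in var_carriers.items():
        per_gene: Dict[str, int] = {}
        for variant, value in genotypes.items():
            gene = variant_to_gene.get(variant)
            if gene is None:
                continue  # variant not mapped to any gene
            per_gene[gene] = per_gene.get(gene, 0) + value
        gene_carriers[sample] = per_gene
    return gene_carriers


mapping = {"1:100:A:T": "GENE1", "1:250:G:C": "GENE1", "2:300:C:G": "GENE2"}
carriers = {
    "s1": {"1:100:A:T": 1, "1:250:G:C": 1, "2:300:C:G": 0},
    "s2": {"1:100:A:T": 0, "1:250:G:C": 0, "2:300:C:G": 1},
}
gene_level = aggregate_to_genes(carriers, mapping)
```

Sample `s1` ends up with a value of 2 for `GENE1`, which is exactly the situation the `sum_geno` option is meant to handle.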

marvel.extraction.aggregation.extract_genes(input_file: str | DataFrame, genes: str | DataFrame, cat_column: str, id_column: str = 'id', chr_column: str = 'chr', pos_column: str = 'pos', chr_pos: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', var_sep: str = ':', reverse: bool = True, incl_var: bool = True, summarise: bool = True, neg_geno: str | int | bool | None = True, sum_geno: str | int | bool | None = True, **kwargs) → Tuple[DataFrame, DataFrame]

Extract gene-level carrier information from genetic data.

This function aggregates variant-level carrier information to gene-level, optionally checking inversed alleles and processing genotype values.

Parameters:

input_file (str or pd.DataFrame) – Path to genetic data file or carrier dataframe from extract_carriers
genes (str or pd.DataFrame) – Path to or loaded dataframe mapping variants to genes
cat_column (str) – Column name indicating gene for each variant
id_column (str) – Column name for sample IDs
chr_column (str) – Chromosome column name
pos_column (str) – Position column name
chr_pos (str) – Chromosome:position column name
ref_column (str) – Reference allele column name
alt_column (str) – Alternate allele column name
var_column (str) – Column name indicating variant IDs
var_sep (str) – Separator for variant ID components (chr:pos:ref:alt)
reverse (bool) – Whether to check inversed alleles (e.g., A:T vs T:A)
incl_var (bool) – Whether to include variant columns in output
summarise (bool) – Whether to create summary report
method ({'any', 'sum'}) – Counting method:
- ‘any’: Count samples with any non-zero genotype (carriers)
- ‘sum’: Sum all genotype values across samples
count_na (bool) – Whether to count NA/missing values
neg_geno (str, int, bool, or None) –
How to handle negative genotype values (e.g. PLINK no-call, coded as -1). Negative values arise when a genotype cannot be determined for a sample.
- True (default): keep the raw value unchanged.
- None: replace with NA (treat as missing).
- integer/string (e.g. 0): replace all negative values with that value.
sum_geno (str, int, bool, or None) –
How to handle genotype values greater than 1 (e.g. a sample carrying two variants in the same gene, or a BGEN dosage above 1).
- True (default): keep the raw value unchanged.
- None: replace with NA.
- integer/string (e.g. 1): replace all values > 1 with that value. Setting sum_geno=1 converts gene carriers to binary (0/1) status.
**kwargs – Additional arguments passed to extract_carriers

Returns:

Summary dataframe and gene carriers dataframe

Return type:

tuple[pd.DataFrame, pd.DataFrame]

Examples

>>> sum_df, gen_carriers = extract_genes(
...     input_file='genotypes.vcf',
...     genes='gene_variants.tsv',
...     cat_column='gene',  # required argument; 'gene' is a placeholder column name
...     reverse=True,
...     incl_var=False
... )
