marvel.extraction package
marvel.extraction.core module
Core module to be used in several other modules
- class marvel.extraction.core.AlleleComparator
Bases: object
Compare variant alleles
- class marvel.extraction.core.AlleleInverter(var_sep: str = ':')
Bases: object
Handles allele inversion operations
- add_inversed_column(df: DataFrame, var_column: str = 'ID') DataFrame
Add column with inversed alleles to dataframe.
- Parameters:
df (pd.DataFrame) – Dataframe with variant IDs
var_column (str) – Column name containing variant IDs
- Returns:
Dataframe with added inversed allele column
- Return type:
pd.DataFrame
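To illustrate what inversing alleles means for a single variant ID, here is a minimal sketch. It assumes IDs follow the chr:pos:ref:alt layout joined by var_sep. `invert_variant_id` is a hypothetical helper and not part of marvel; the documented method works on a whole dataframe column and may differ in detail.

```python
def invert_variant_id(var_id: str, var_sep: str = ":") -> str:
    """Swap the ref and alt components of a chr:pos:ref:alt variant ID.

    Hypothetical helper for illustration only; not the marvel implementation.
    """
    parts = var_id.split(var_sep)
    if len(parts) != 4:
        raise ValueError(f"expected 4 components separated by {var_sep!r}, got {var_id!r}")
    chrom, pos, ref, alt = parts
    # Keep chromosome and position, exchange the two alleles.
    return var_sep.join([chrom, pos, alt, ref])
```

Applied to "1:12345:A:T" this yields "1:12345:T:A", which is the kind of value the added column would hold under the stated assumption.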
- class marvel.extraction.core.GenotypeCounter(id_column: str = 'id', count_na: bool = True)
Bases: object
Class for counting and summarizing genotypes across samples
- count(df: DataFrame, method: str = 'any') DataFrame
Count genotypes across samples for each variant/gene
- Parameters:
df (pd.DataFrame) – Dataframe with samples as rows and variants/genes as columns. Index should contain sample IDs.
method ({'any', 'sum'}) – Counting method:
- ‘any’: Count samples with any non-zero genotype (carriers)
- ‘sum’: Sum all genotype values across samples
- Returns:
Summary dataframe with counts for each variant/gene
- Return type:
pd.DataFrame
- Raises:
ValueError – If method is not ‘any’ or ‘sum’
TypeError – If df is not a DataFrame
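The difference between the two counting methods can be sketched in plain Python. This is an illustration under assumptions, not the GenotypeCounter implementation: genotypes are held in nested dicts rather than a dataframe, `count_genotypes` is a hypothetical name, and missing values (None) are simply skipped.

```python
def count_genotypes(genotypes: dict, method: str = "any") -> dict:
    """Count per-variant genotypes across samples.

    genotypes maps sample ID -> {variant ID -> genotype value or None}.
    Hypothetical sketch; missing values (None) are skipped by assumption.
    """
    if method not in ("any", "sum"):
        raise ValueError("method must be 'any' or 'sum'")
    counts: dict = {}
    for sample_values in genotypes.values():
        for variant, value in sample_values.items():
            counts.setdefault(variant, 0)
            if value is None:
                continue
            if method == "any":
                # A carrier is any sample with a non-zero genotype.
                counts[variant] += 1 if value != 0 else 0
            else:
                # 'sum' adds up the raw genotype values.
                counts[variant] += value
    return counts


example = {
    "s1": {"v1": 0, "v2": 2},
    "s2": {"v1": 1, "v2": 1},
    "s3": {"v1": None, "v2": 0},
}
carriers = count_genotypes(example, method="any")
totals = count_genotypes(example, method="sum")
```

With this input, ‘any’ counts two carriers of v2, while ‘sum’ gives 3 for v2 because the homozygous sample contributes 2.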
- count_by_category(df: DataFrame, categories: DataFrame, var_column: str = 'ID', cat_column: str = None, method: str = 'any') DataFrame
Count genotypes grouped by categories (e.g., genes)
- Parameters:
df (pd.DataFrame) – Dataframe with samples as rows and variants as columns
categories (pd.DataFrame) – Dataframe mapping variants to categories
var_column (str) – Column name in categories containing variant IDs
cat_column (str) – Column name in categories containing category/gene names
method ({'any', 'sum'}) – Counting method (see count() method)
- Returns:
Summary dataframe with counts for each category
- Return type:
pd.DataFrame
- class marvel.extraction.core.GenotypeProcessor
Bases: object
Handles genotype value processing and validation
- static process_genotypes(df: DataFrame, gene_columns: list[str], neg_geno: str | int | bool | None = True, sum_geno: str | int | bool | None = True) DataFrame
Process genotype values according to specified rules.
- Parameters:
df (pd.DataFrame) – Dataframe with genotype columns
neg_geno (str, int, bool, or None) – How to handle negative genotype values (e.g. PLINK no-call, coded as -1). Negative values arise when a genotype cannot be determined for a sample.
- True (default): keep the raw value unchanged.
- None: replace with NA (treat as missing).
- integer/string (e.g. 0): replace all negative values with that value.
sum_geno (str, int, bool, or None) – How to handle genotype values greater than 1 (e.g. a sample carrying two variants in the same gene, or a BGEN dosage above 1).
- True (default): keep the raw value unchanged.
- None: replace with NA.
- integer/string (e.g. 1): replace all values > 1 with that value. Setting sum_geno=1 converts gene carriers to binary (0/1) status.
- Returns:
Processed dataframe
- Return type:
pd.DataFrame
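The neg_geno and sum_geno rules can be read as a per-value decision. The sketch below applies them to a single genotype value; `process_value` is a hypothetical helper operating on scalars, whereas the documented method works on dataframe columns. NA is represented here by None, and the undocumented value False is not handled specially.

```python
def process_value(value, neg_geno=True, sum_geno=True):
    """Apply the documented neg_geno / sum_geno rules to one genotype value.

    Hypothetical scalar sketch; NA is represented by None.
    """
    if value is None:
        return None
    if value < 0:
        # True keeps the raw value; None becomes NA; anything else replaces it.
        return value if neg_geno is True else neg_geno
    if value > 1:
        return value if sum_geno is True else sum_geno
    return value


raw = [-1, 0, 1, 2]
unchanged = [process_value(v) for v in raw]
binary = [process_value(v, neg_geno=None, sum_geno=1) for v in raw]
```

The check uses `is True` rather than equality so that an integer replacement of 1 is not mistaken for the keep-as-is default.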
- class marvel.extraction.core.Variant(rsid: str, chrom: str, pos: int, ref: str, alt: List[str], sample_data: List[Tuple[str, float]])
Bases: object
Genetic variant with sample data
- Parameters:
- class marvel.extraction.core.VariantProcessor(chr_column: str = 'chr', pos_column: str = 'pos', chr_pos: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', var_sep: str = ':')
Bases: object
Processes variant dataframes
- process(variants: DataFrame) DataFrame
Pre-process variants dataframe
- Parameters:
variants (pd.DataFrame) – Variants dataframe
- Returns:
variants – Processed variants dataframe
- Return type:
pd.DataFrame
- Raises:
InputValidationError – If required columns not present
- marvel.extraction.core.count_gen(df: DataFrame, method: str = 'any', id_column: str = 'id', count_na: bool = True) DataFrame
Count genotypes across samples for each variant/gene
This is a convenience wrapper around the GenotypeCounter class.
- Parameters:
df (pd.DataFrame) – Dataframe with samples as rows and variants/genes as columns. Index should contain sample IDs.
method ({'any', 'sum'}) – Counting method:
- ‘any’: Count samples with any non-zero genotype (carriers)
- ‘sum’: Sum all genotype values across samples
id_column (str) – Name of the ID column to exclude from counting
count_na (bool) – Whether to count NA/missing values
- Returns:
Summary dataframe with counts for each variant/gene. Columns include: variant, count, total, frequency. If count_na=True, also includes: na_count, valid_samples.
- Return type:
pd.DataFrame
Examples
>>> # Count carriers for each variant
>>> summary = count_gen(carrier_df, method='any')
>>> # Sum genotypes across samples
>>> summary = count_gen(carrier_df, method='sum')
>>> # Count without tracking NA values
>>> summary = count_gen(carrier_df, count_na=False)
marvel.extraction.extraction module
A module to extract genetic variants from genetic files
- class marvel.extraction.extraction.BgenReaderClass
Bases: GeneticFileReader
Reader for BGEN files
- class marvel.extraction.extraction.ExtractionTask(geno_id: str, geno_path: str, var_id: str, var_path: str, identifier: str)
Bases: object
A single extraction task.
- class marvel.extraction.extraction.GeneticFileReader
Bases: ABC
Abstract base class for genetic file readers
- class marvel.extraction.extraction.PlinkReaderClass
Bases: GeneticFileReader
Reader for PLINK files
- class marvel.extraction.extraction.VariantExtraction(logger=None, verbose=False)
Bases: object
Handles extraction of variant carriers from genetic data files.
- Supports:
Single genetic file x single variant file
Multiple genetic files x single variant file
Single genetic file x multiple variant files
Multiple genetic files x multiple variant files
- execute_tasks(extraction_function: Callable[[...], Tuple[DataFrame, DataFrame]], raise_on_error: bool = True, *args, **kwargs) Tuple[DataFrame, DataFrame, dict[str, list[ExtractionTask]]]
Execute all extraction tasks using the provided extraction function.
- Parameters:
extraction_function (Callable) – Function that performs the extraction.
raise_on_error (bool) – If True, raise an exception on the first error. If False, log the error and continue.
*args – Additional arguments passed to the called function
**kwargs – Additional keyword arguments passed to the called function
- Returns:
Summary dataframe, carrier dataframe, and dictionary with ‘successful’, ‘failed’, and ‘skipped’ task lists
- Return type:
Tuple[pd.DataFrame, pd.DataFrame, dict[str, list[ExtractionTask]]]
- Raises:
ValueError – If no tasks have been set up
- execute_tasks_parallel(extraction_function: Callable[[...], Tuple[DataFrame, DataFrame]], n_jobs: int = 1, raise_on_error: bool = True, checkpoint_dir: str = None, force_rerun: bool = False, *args, **kwargs) Tuple[DataFrame, DataFrame, dict[str, list[ExtractionTask]]]
Execute all extraction tasks in parallel, grouped by geno_file, with checkpoint/resume support.
- Parameters:
extraction_function (Callable) – Function that performs the extraction
n_jobs (int) – Number of parallel workers. If 1, runs sequentially. If -1, uses all available CPUs.
raise_on_error (bool) – If True, raise an exception on the first error. If False, log the error and continue.
checkpoint_dir (str, optional) – Directory to save intermediate results. If provided, tasks will save results after completion and skip already completed tasks on re-run.
force_rerun (bool) – If True, ignore existing checkpoint files and re-run all tasks
*args – Additional arguments passed to extraction_function
**kwargs – Additional keyword arguments passed to extraction_function
- Returns:
Summary dataframe, carrier dataframe, and results dictionary
- Return type:
Tuple[pd.DataFrame, pd.DataFrame, dict[str, list[ExtractionTask]]]
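The checkpoint/resume behaviour described for checkpoint_dir and force_rerun can be sketched as follows. This is a simplified, sequential illustration under assumptions: `run_with_checkpoints` is a hypothetical function, results are stored as one JSON file per task, and tasks are identified by plain strings. The actual on-disk format and parallel scheduling used by marvel are not documented here.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoints(task_ids, work, checkpoint_dir,
                         force_rerun=False, raise_on_error=True):
    """Run work(task_id) per task, skipping tasks with a saved result.

    Hypothetical sketch: one JSON checkpoint file per task (an assumption).
    """
    checkpoint_path = Path(checkpoint_dir)
    checkpoint_path.mkdir(parents=True, exist_ok=True)
    results = {"successful": [], "failed": [], "skipped": []}
    outputs = {}
    for task_id in task_ids:
        marker = checkpoint_path / f"{task_id}.json"
        if marker.exists() and not force_rerun:
            # Resume: reuse the stored result instead of recomputing.
            outputs[task_id] = json.loads(marker.read_text())
            results["skipped"].append(task_id)
            continue
        try:
            output = work(task_id)
        except Exception:
            if raise_on_error:
                raise
            results["failed"].append(task_id)
            continue
        marker.write_text(json.dumps(output))
        outputs[task_id] = output
        results["successful"].append(task_id)
    return outputs, results


with tempfile.TemporaryDirectory() as tmp:
    _, first = run_with_checkpoints(["a", "b"], lambda t: {"task": t}, tmp)
    _, second = run_with_checkpoints(["a", "b"], lambda t: {"task": t}, tmp)
    _, forced = run_with_checkpoints(["a", "b"], lambda t: {"task": t}, tmp,
                                     force_rerun=True)
```

On the second call both tasks land in the ‘skipped’ list because their checkpoint files already exist; with force_rerun=True they are executed again.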
- setup_tasks(geno_files: dict[str, str], var_files: dict[str, str]) list[ExtractionTask]
Set up all extraction tasks for the given file combinations.
- Parameters:
geno_files (dict[str, str]) – Dictionary mapping genetic file IDs to paths
var_files (dict[str, str]) – Dictionary mapping variant file IDs to paths
- Returns:
List of all extraction tasks to be performed
- Return type:
list[ExtractionTask]
- Raises:
Various exceptions raised during validation
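One way to picture the supported file combinations is as a cross product of genetic files and variant files. The sketch below assumes exactly that; the `Task` dataclass mirrors the documented ExtractionTask fields, but `build_tasks` and the "genoid_varid" identifier format are hypothetical, and the validation performed by setup_tasks is omitted.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Task:
    """Stand-in for ExtractionTask with the same documented fields."""
    geno_id: str
    geno_path: str
    var_id: str
    var_path: str
    identifier: str


def build_tasks(geno_files: dict, var_files: dict) -> list:
    """Pair every genetic file with every variant file (assumed behaviour)."""
    return [
        # The identifier format below is an assumption for illustration.
        Task(geno_id, geno_path, var_id, var_path, f"{geno_id}_{var_id}")
        for (geno_id, geno_path), (var_id, var_path)
        in product(geno_files.items(), var_files.items())
    ]


tasks = build_tasks(
    {"chr1": "chr1.vcf.gz", "chr2": "chr2.vcf.gz"},
    {"setA": "setA.tsv", "setB": "setB.tsv"},
)
```

Two genetic files and two variant files give four tasks, covering the "multiple x multiple" case from the list of supported combinations.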
- class marvel.extraction.extraction.VariantExtractor(id_column: str = 'id', chr_column: str = 'chr', pos_column: str = 'pos', chr_pos_column: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', var_sep: str = ':')
Bases: object
Main class for extracting variant carriers
- extract(geno_file: str | VCF | BgenReader | PyPlink, variants: DataFrame, region: str | None = None, prefilter_regions: bool = False, **kwargs) DataFrame
Extract variant carriers from genetic file
- Parameters:
geno_file (str or genetic file object) – Path or loaded genetic file
variants (pd.DataFrame) – Variants to extract
region (str, optional) – VCF region to consider
prefilter_regions (bool, optional) – Pre-filter input variants by chromosome/position ranges in the genetic file before extraction. Default False.
**kwargs – Additional arguments for allele comparison
- Returns:
Carrier matrix (samples × variants)
- Return type:
pd.DataFrame
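The idea behind prefilter_regions can be sketched with the per-chromosome range shape returned by get_regions (dict[str, tuple[int, int]]). The code below is an assumption-laden illustration: `prefilter_variants` is a hypothetical helper, variants are plain dicts with "chr" and "pos" keys instead of dataframe rows, and range bounds are treated as inclusive.

```python
def prefilter_variants(variants: list, regions: dict) -> list:
    """Keep variants whose position falls inside the file's range for its chromosome.

    Hypothetical sketch; bounds are assumed to be inclusive.
    """
    kept = []
    for variant in variants:
        bounds = regions.get(variant["chr"])
        if bounds is None:
            # Chromosome not present in the genetic file at all.
            continue
        start, end = bounds
        if start <= variant["pos"] <= end:
            kept.append(variant)
    return kept


regions = {"1": (1000, 2000)}
variants = [
    {"chr": "1", "pos": 1500},
    {"chr": "1", "pos": 2500},
    {"chr": "2", "pos": 1500},
]
filtered = prefilter_variants(variants, regions)
```

Only the first variant survives: the second lies outside the range for chromosome 1, and chromosome 2 has no range at all.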
- class marvel.extraction.extraction.VcfReaderClass
Bases: GeneticFileReader
Reader for VCF files
- get_regions(geno_file) dict[str, tuple[int, int]] | None
Return per-chromosome position ranges from VCF file.
- read(geno_file: str | VCF, region: str | None = None) Generator[Variant, None, None]
Read VCF file.
When a GP FORMAT field is present (e.g. VCFs produced by qctool from BGEN files), dosage is computed from genotype probabilities using the same formula as BgenReaderClass:
dosage = P(het) + 2 * P(hom_alt)
This ensures mathematical equivalence between the BGEN and VCF code paths when the VCF originates from a BGEN source. When GP is absent, the reader falls back to hard GT calls.
- Parameters:
geno_file (str or cyvcf2.cyvcf2.VCF) – Genetic input file, or path to it
region (str or None) – Region within the VCF file to consider
- Yields:
Variant – An object representing a genetic variant.
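The dosage formula and the GT fallback can be written out for a single sample. This sketch assumes a biallelic, diploid variant with GP ordered as (P(hom_ref), P(het), P(hom_alt)) and a GT given as a pair of allele indices; `dosage` is a hypothetical helper, and counting non-reference alleles is one plausible reading of "hard GT calls", not a confirmed detail of the reader.

```python
from typing import Optional, Sequence


def dosage_from_gp(gp: Sequence[float]) -> float:
    """dosage = P(het) + 2 * P(hom_alt) for GP = (P(hom_ref), P(het), P(hom_alt))."""
    _p_hom_ref, p_het, p_hom_alt = gp
    return p_het + 2 * p_hom_alt


def dosage(gp: Optional[Sequence[float]], gt: Sequence[int]) -> float:
    """Use GP when available, otherwise fall back to the hard genotype call.

    Hypothetical sketch; the fallback counts alleles with index > 0.
    """
    if gp is not None:
        return dosage_from_gp(gp)
    return float(sum(1 for allele in gt if allele > 0))


from_probabilities = dosage((0.0, 0.5, 0.5), (0, 1))
from_hard_call = dosage(None, (0, 1))
```

Note that the two paths can disagree for the same sample: the probabilities above give 1.5, while the heterozygous hard call gives 1.0.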
- marvel.extraction.extraction.extract_carriers(geno_file: str | VCF | BgenReader | PyPlink, variants: str | DataFrame, id_column: str = 'id', chr_column: str = 'chr', pos_column: str = 'pos', chr_pos_column: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', var_sep: str = ':', region: str | None = None, summarise: bool = True, prefilter_regions: bool = False, **comparison_kwargs) Tuple[DataFrame, DataFrame]
Extract variant carriers from genetic file
- Parameters:
geno_file (str or genetic file object) – Path or loaded genetic file
variants (str or pd.DataFrame) – Path to or loaded dataframe of variants to extract
id_column (str) – Individual ID column name
chr_column (str) – Chromosome column name
pos_column (str) – Position column name
chr_pos_column (str) – Chromosome:position column name
ref_column (str) – Reference allele column name
alt_column (str) – Alternate allele column name
var_column (str) – Variant ID column name
var_sep (str) – Separator for variant ID components
region (str, optional) – VCF region to consider
summarise (bool) – Whether to return summary and detailed results
prefilter_regions (bool, optional) – Pre-filter input variants by chromosome/position ranges in the genetic file before extraction. Default False.
**comparison_kwargs – Additional arguments for allele comparison
- Returns:
Carrier matrix, optionally with empty summary
- Return type:
pd.DataFrame or tuple[pd.DataFrame, pd.DataFrame]
marvel.extraction.aggregation module
Extract variant carriers in specific genes/categories
- class marvel.extraction.aggregation.GeneExtractor(id_column: str = 'id', cat_column: str = None, chr_column: str = 'chr', pos_column: str = 'pos', chr_pos: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', var_sep: str = ':')
Bases: object
Main class for extracting gene-level carrier information
- extract(input_file: str | DataFrame, genes: str | DataFrame, reverse: bool = True, incl_var: bool = True, summarise: bool = True, method: str = 'any', count_na: bool = True, neg_geno: str | int | bool | None = True, sum_geno: str | int | bool | None = True, **kwargs) Tuple[DataFrame, DataFrame]
Extract gene-level carrier information.
- Parameters:
input_file (str or pd.DataFrame) – Path to genetic data file or carrier dataframe
genes (str or pd.DataFrame) – Path to or loaded dataframe with variant-gene mapping
id_column (str) – Individual ID column name
reverse (bool) – Whether to check inversed alleles
incl_var (bool) – Whether to include variant columns in output
summarise (bool) – Whether to create summary report
method ({'any', 'sum'}) – Counting method:
- ‘any’: Count samples with any non-zero genotype (carriers)
- ‘sum’: Sum all genotype values across samples
count_na (bool) – Whether to count NA/missing values
neg_geno (str, int, bool, or None) – How to handle negative genotype values (e.g. PLINK no-call, coded as -1). Negative values arise when a genotype cannot be determined for a sample.
- True (default): keep the raw value unchanged.
- None: replace with NA (treat as missing).
- integer/string (e.g. 0): replace all negative values with that value.
sum_geno (str, int, bool, or None) – How to handle genotype values greater than 1 (e.g. a sample carrying two variants in the same gene, or a BGEN dosage above 1).
- True (default): keep the raw value unchanged.
- None: replace with NA.
- integer/string (e.g. 1): replace all values > 1 with that value. Setting sum_geno=1 converts gene carriers to binary (0/1) status.
**kwargs – Additional arguments for extract_carriers
- Returns:
Summary dataframe and gene carriers dataframe
- Return type:
tuple[pd.DataFrame, pd.DataFrame]
- class marvel.extraction.aggregation.VariantToGeneAggregator(chr_column: str = 'chr', pos_column: str = 'pos', chr_pos: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', var_sep: str = ':', reverse: bool = True)
Bases: object
Aggregates variant-level data to gene-level
- marvel.extraction.aggregation.extract_genes(input_file: str | DataFrame, genes: str | DataFrame, cat_column: str, id_column: str = 'id', chr_column: str = 'chr', pos_column: str = 'pos', chr_pos: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', var_sep: str = ':', reverse: bool = True, incl_var: bool = True, summarise: bool = True, neg_geno: str | int | bool | None = True, sum_geno: str | int | bool | None = True, **kwargs) Tuple[DataFrame, DataFrame]
Extract gene-level carrier information from genetic data.
This function aggregates variant-level carrier information to gene-level, optionally checking inversed alleles and processing genotype values.
- Parameters:
input_file (str or pd.DataFrame) – Path to genetic data file or carrier dataframe from extract_carriers
genes (str or pd.DataFrame) – Path to or loaded dataframe mapping variants to genes
cat_column (str) – Column name indicating gene for each variant
id_column (str) – Column name for sample IDs
chr_column (str) – Chromosome column name
pos_column (str) – Position column name
chr_pos (str) – Chromosome:position column name
ref_column (str) – Reference allele column name
alt_column (str) – Alternate allele column name
var_column (str) – Column name indicating variant IDs
var_sep (str) – Separator for variant ID components (chr:pos:ref:alt)
reverse (bool) – Whether to check inversed alleles (e.g., A:T vs T:A)
incl_var (bool) – Whether to include variant columns in output
summarise (bool) – Whether to create summary report
method ({'any', 'sum'}) – Counting method:
- ‘any’: Count samples with any non-zero genotype (carriers)
- ‘sum’: Sum all genotype values across samples
count_na (bool) – Whether to count NA/missing values
neg_geno (str, int, bool, or None) – How to handle negative genotype values (e.g. PLINK no-call, coded as -1). Negative values arise when a genotype cannot be determined for a sample.
- True (default): keep the raw value unchanged.
- None: replace with NA (treat as missing).
- integer/string (e.g. 0): replace all negative values with that value.
sum_geno (str, int, bool, or None) – How to handle genotype values greater than 1 (e.g. a sample carrying two variants in the same gene, or a BGEN dosage above 1).
- True (default): keep the raw value unchanged.
- None: replace with NA.
- integer/string (e.g. 1): replace all values > 1 with that value. Setting sum_geno=1 converts gene carriers to binary (0/1) status.
**kwargs – Additional arguments passed to extract_carriers
- Returns:
Summary dataframe and gene carriers dataframe
- Return type:
tuple[pd.DataFrame, pd.DataFrame]
Examples
>>> sum_df, gen_carriers = extract_genes(
...     input_file='genotypes.vcf',
...     genes='gene_variants.tsv',
...     reverse=True,
...     incl_var=False
... )
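To make the variant-to-gene aggregation and the effect of sum_geno concrete, here is a plain-Python sketch. It assumes gene-level values are obtained by summing a sample's variant genotypes within each gene, which is consistent with the note that a sample carrying two variants in the same gene yields a value above 1. `aggregate_to_genes` is a hypothetical helper using nested dicts rather than dataframes, and missing values are skipped by assumption.

```python
def aggregate_to_genes(carriers: dict, variant_to_gene: dict, sum_geno=True) -> dict:
    """Sum variant genotypes per gene for each sample.

    carriers maps sample ID -> {variant ID -> genotype or None}.
    Hypothetical sketch; unmapped variants and missing values are skipped.
    """
    gene_level = {}
    for sample_id, genotypes in carriers.items():
        totals: dict = {}
        for variant, value in genotypes.items():
            gene = variant_to_gene.get(variant)
            if gene is None or value is None:
                continue
            totals[gene] = totals.get(gene, 0) + value
        if sum_geno is not True:
            # Replace values above 1, e.g. sum_geno=1 gives binary carrier status.
            totals = {g: (sum_geno if v > 1 else v) for g, v in totals.items()}
        gene_level[sample_id] = totals
    return gene_level


carriers = {
    "s1": {"1:100:A:T": 1, "1:200:G:C": 1},
    "s2": {"1:100:A:T": 0, "1:200:G:C": 1},
}
mapping = {"1:100:A:T": "GENE1", "1:200:G:C": "GENE1"}
raw_counts = aggregate_to_genes(carriers, mapping)
binary_status = aggregate_to_genes(carriers, mapping, sum_geno=1)
```

Sample s1 carries two variants in GENE1, so its raw value is 2; with sum_geno=1 it is reduced to 1, matching the binary carrier status described for that setting.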