Variant Extraction
This notebook demonstrates how to extract variant carriers from genetic files using extract_carriers and extract_genes. It covers:
Extracting carriers from a VCF file
Aggregating variant carriers to gene level from a BGEN file
Pre-filtering variants by genomic region with
prefilter_regionsControlling genotype value handling with
neg_genoandsum_geno
[1]:
# imports
import os
from marvel.extraction.extraction import extract_carriers
from marvel.extraction.aggregation import extract_genes
from marvel.example_data import examples
/Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/utils/utils.py:455: Warning: The environmental variable `MARVEL_TEST_DEFS` is not set. Trying to recover using the default configuration files in /Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/association/tests.py.
warnings.warn(
Load the data
First, we’ll load the example data. We’re interested in variants across six genes and want to extract all carriers from a VCF file. These variants are defined in the variants dataframe. For illustration purposes, not all variants will be present in the VCF file, and one variant has different alleles from the one we’re looking for.
[2]:
# Load variants to extract
variants = examples.get_data('dummy_variants')
# Show example
variants
[2]:
| gene | rsid | chr | pos | chr_pos | a1 | a2 | |
|---|---|---|---|---|---|---|---|
| 0 | PCSK9 | rs00001 | chr1 | 22843509 | chr1:22843509 | A | C |
| 1 | PCSK9 | rs00002 | chr1 | 18589438 | chr1:18589438 | T | G |
| 2 | PCSK9 | rs00003 | chr1 | 47423257 | chr1:47423257 | A | T |
| 3 | PCSK9 | rs00004 | chr1 | 1138210 | chr1:1138210 | C | A |
| 4 | PCSK9 | rs00005 | chr1 | 5648006 | chr1:5648006 | A | G |
| 5 | PCSK9 | rs00006 | chr1 | 44328361 | chr1:44328361 | A | G |
| 6 | ANGPTL3 | rs00007 | chr1 | 35045263 | chr1:35045263 | G | C |
| 7 | ANGPTL3 | rs00008 | chr1 | 9474009 | chr1:9474009 | A | T |
| 8 | ANGPTL3 | rs00009 | chr1 | 25132332 | chr1:25132332 | A | C |
| 9 | ANGPTL3 | rs00010 | chr1 | 42576549 | chr1:42576549 | C | A |
| 10 | ANGPTL3 | rs00011 | chr1 | 30247857 | chr1:30247857 | C | A |
| 11 | ANGPTL3 | rs00012 | chr1 | 28845057 | chr1:28845057 | G | T |
| 12 | HMGCR | rs00013 | chr5 | 34047537 | chr5:34047537 | G | C |
| 13 | HMGCR | rs00014 | chr5 | 35835987 | chr5:35835987 | A | T |
| 14 | HMGCR | rs00015 | chr5 | 45536080 | chr5:45536080 | A | G |
| 15 | HMGCR | rs00016 | chr5 | 1569066 | chr5:1569066 | G | T |
| 16 | HMGCR | rs00017 | chr5 | 46013857 | chr5:46013857 | A | T |
| 17 | HMGCR | rs00018 | chr5 | 18521034 | chr5:18521034 | T | C |
| 18 | NPC1L1 | rs00019 | chr7 | 4561666 | chr7:4561666 | G | A |
| 19 | NPC1L1 | rs00020 | chr7 | 36749874 | chr7:36749874 | G | A |
| 20 | NPC1L1 | rs00021 | chr7 | 47201525 | chr7:47201525 | A | C |
| 21 | NPC1L1 | rs00022 | chr7 | 24661435 | chr7:24661435 | G | T |
| 22 | NPC1L1 | rs00023 | chr7 | 1107036 | chr7:1107036 | C | T |
| 23 | NPC1L1 | rs00024 | chr7 | 4600516 | chr7:4600516 | T | C |
| 24 | CETP | rs00025 | chr16 | 19054093 | chr16:19054093 | C | A |
| 25 | CETP | rs00026 | chr16 | 24959866 | chr16:24959866 | A | T |
| 26 | CETP | rs00027 | chr16 | 47073671 | chr16:47073671 | C | A |
| 27 | CETP | rs00028 | chr16 | 34925485 | chr16:34925485 | A | T |
| 28 | CETP | rs00029 | chr16 | 12048442 | chr16:12048442 | C | T |
| 29 | CETP | rs00030 | chr16 | 29686549 | chr16:29686549 | G | C |
| 30 | PPARA | rs00031 | chr22 | 8039064 | chr22:8039064 | G | T |
| 31 | PPARA | rs00032 | chr22 | 28629784 | chr22:28629784 | T | A |
| 32 | PPARA | rs00033 | chr22 | 1437663 | chr22:1437663 | C | A |
| 33 | PPARA | rs00034 | chr22 | 30905065 | chr22:30905065 | A | C |
| 34 | PPARA | rs00035 | chr22 | 41350377 | chr22:41350377 | G | T |
| 35 | PPARA | rs00036 | chr22 | 37632612 | chr22:37632612 | G | C |
[3]:
# Load VCF-file of chromosome 1
vex = examples.get_data('vcf_example')
# Print the IDs of the individuals
print(f'VCF-file contains {len(vex.samples)} samples and {vex.num_records} genetic variants')
VCF-file contains 1000 samples and 11 genetic variants
Extract the carriers
Below we extract carriers from the VCF file. One variant exists in the file but with alleles that differ from our input definition — setting warning=True reports this mismatch. The variant is still included in the output, matched by position, with its column name showing both allele sets.
[4]:
# Extract carriers
summary, tmp = extract_carriers(vex, variants, warning = True)
# Show output
tmp
/Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/extraction/core.py:374: UserWarning: Variant chr1:1138210 has non-matching alleles
warnings.warn(
[4]:
| chr1:1138210:C:A_chr1:1138210:G:C | chr1:5648006:A:G | chr1:9474009:A:T | chr1:18589438:T:G | chr1:22843509:A:C | chr1:25132332:A:C | chr1:28845057:G:T | chr1:30247857:C:A | chr1:35045263:G:C | chr1:42576549:C:A | chr1:44328361:A:G | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ID_000000 | <NA> | 0 | <NA> | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ID_000001 | <NA> | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ID_000002 | <NA> | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ID_000003 | <NA> | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ID_000004 | <NA> | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ID_000995 | <NA> | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ID_000996 | <NA> | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ID_000997 | <NA> | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ID_000998 | <NA> | 0 | <NA> | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ID_000999 | <NA> | 0 | 0 | 0 | 0 | <NA> | 0 | 0 | 0 | 0 | 0 |
1000 rows × 11 columns
[5]:
summary
[5]:
| variant | count | total | frequency | na_count | valid_samples | |
|---|---|---|---|---|---|---|
| 0 | chr1:18589438:T:G | 17 | 1000 | 0.017 | 24 | 976 |
| 1 | chr1:25132332:A:C | 14 | 1000 | 0.014 | 28 | 972 |
| 2 | chr1:9474009:A:T | 13 | 1000 | 0.013 | 26 | 974 |
| 3 | chr1:44328361:A:G | 13 | 1000 | 0.013 | 30 | 970 |
| 4 | chr1:22843509:A:C | 12 | 1000 | 0.012 | 23 | 977 |
| 5 | chr1:30247857:C:A | 7 | 1000 | 0.007 | 18 | 982 |
| 6 | chr1:35045263:G:C | 5 | 1000 | 0.005 | 28 | 972 |
| 7 | chr1:42576549:C:A | 5 | 1000 | 0.005 | 23 | 977 |
| 8 | chr1:5648006:A:G | 4 | 1000 | 0.004 | 19 | 981 |
| 9 | chr1:1138210:C:A_chr1:1138210:G:C | 0 | 1000 | 0.000 | 1000 | 0 |
| 10 | chr1:28845057:G:T | 0 | 1000 | 0.000 | 27 | 973 |
Extract carriers of all genes
The variants dataframe spans six genes. To aggregate variant carriers to gene level in one step, we pass the BGEN file directly to extract_genes. We use the BGEN file here since it holds all chromosomes in a single file, whereas the example VCF files are split by chromosome.
Handling negative genotypes and multi-carrier values
Two parameters control what happens to special genotype values after gene aggregation:
``neg_geno`` — negative genotype values (e.g. PLINK no-call coded as -1): | Value | Effect | |——-|——–| | True (default) | Keep the raw value unchanged | | None | Replace with NA (treat as missing) | | integer / string (e.g. 0) | Replace all negative values with that value |
``sum_geno`` — genotype values greater than 1 (a sample carries more than one variant in the gene): | Value | Effect | |——-|——–| | True (default) | Keep the raw value unchanged | | None | Replace with NA | | integer / string (e.g. 1) | Replace all values > 1 with that value — use sum_geno=1 for binary (0/1) carrier status |
Note: BGEN dosage values are always non-negative, so
neg_genohas no effect on the examples below. For PLINK data (.bed), no-call genotypes are encoded as-1and will appear in the gene sum unlessneg_genois set.
[ ]:
summary, gen_df = extract_genes(
input_file = examples.get_path('bgen_example', 'example_genotypes.bgen'),
genes = variants,
cat_column = 'gene',
method = 'any',
count_na = True,
sum_geno = True,
incl_var = False,
)
gen_df
[7]:
summary
[7]:
| variant | count | total | frequency | na_count | valid_samples | |
|---|---|---|---|---|---|---|
| 0 | NPC1L1 | 183 | 1000 | 0.183 | 0 | 1000 |
| 1 | CETP | 182 | 1000 | 0.182 | 0 | 1000 |
| 2 | ANGPTL3 | 175 | 1000 | 0.175 | 0 | 1000 |
| 3 | HMGCR | 169 | 1000 | 0.169 | 0 | 1000 |
| 4 | PPARA | 169 | 1000 | 0.169 | 0 | 1000 |
| 5 | PCSK9 | 131 | 1000 | 0.131 | 0 | 1000 |
Some individuals carry variants at more than one position within a gene, resulting in values greater than one. For analyses that require binary carrier status (0 = non-carrier, 1 = carrier), set sum_geno=1 to cap these values.
[ ]:
summary, gen_df = extract_genes(
input_file = examples.get_path('bgen_example', 'example_genotypes.bgen'),
genes = variants,
cat_column = 'gene',
method = 'any',
count_na = True,
sum_geno = 1,
incl_var = False,
)
gen_df
Pre-filtering variants with prefilter_regions
When working with large genetic files that contain only a subset of chromosomes or positions, you can use prefilter_regions=True to pre-filter the input variant list before extraction begins. This avoids iterating over millions of positions for variants that fall outside the genetic file’s range.
The demo below uses the full dummy_variants dataframe — which spans chromosomes 1, 5, 7, 16, and 22 — and extracts against the chr22-only VCF. Without pre-filtering, all 36 variants are passed to the extraction loop (only 6 of them are on chr22). With prefilter_regions=True, the variant list is reduced to chr22 variants before extraction starts, producing the same carrier output with fewer candidates to iterate over.
[ ]:
# Load the full multi-chromosome dummy_variants fixture
dummy_variants = examples.get_data('dummy_variants')
# Get the chr22 VCF path via the public API
chr22_vcf_path = examples.get_path('vcf_example', 'example_genotypes_chr22.vcf.gz')
total_variants = len(dummy_variants)
chr22_expected = len(
dummy_variants[dummy_variants['chr'].astype(str).str.strip('chr') == '22']
)
print(f"Total input variants (all chromosomes): {total_variants}")
print(f"Expected chr22 candidates after prefilter_regions=True: {chr22_expected}")
print(f"Variants that will be skipped with prefilter_regions=True: {total_variants - chr22_expected}")
[ ]:
from cyvcf2 import VCF
# Extract WITHOUT prefilter_regions (all 36 variants passed to extraction loop)
vcf_no_filter = VCF(chr22_vcf_path)
summary_no_filter, carriers_no_filter = extract_carriers(
vcf_no_filter, dummy_variants, prefilter_regions=False
)
# Extract WITH prefilter_regions=True (only chr22 variants passed to extraction loop)
vcf_prefiltered = VCF(chr22_vcf_path)
summary_prefiltered, carriers_prefiltered = extract_carriers(
vcf_prefiltered, dummy_variants, prefilter_regions=True
)
print("Columns in carriers_no_filter:", list(carriers_no_filter.columns))
print("Columns in carriers_prefiltered:", list(carriers_prefiltered.columns))
[11]:
import pandas as pd
# Verify the carrier outputs are identical regardless of prefilter_regions setting
shared_cols = sorted(set(carriers_no_filter.columns) & set(carriers_prefiltered.columns))
pd.testing.assert_frame_equal(
carriers_no_filter[shared_cols].sort_values(shared_cols).reset_index(drop=True),
carriers_prefiltered[shared_cols].sort_values(shared_cols).reset_index(drop=True),
)
print("Carrier outputs are identical with and without prefilter_regions=True")
# Show a sample of the result
carriers_prefiltered.head()
Carrier outputs are identical with and without prefilter_regions=True
[11]:
| chr22:1437663:C:A | chr22:8039064:G:T | chr22:28629784:T:A | chr22:30905065:A:C | chr22:37632612:G:C | |
|---|---|---|---|---|---|
| ID_000000 | 0 | 0 | 0 | 0 | 0 |
| ID_000001 | 0 | 0 | 0 | 0 | 0 |
| ID_000002 | 0 | 0 | 0 | 0 | 0 |
| ID_000003 | 0 | 0 | <NA> | 0 | 0 |
| ID_000004 | 0 | 0 | 0 | 0 | 0 |
BGEN extraction
BGEN files store all chromosomes in a single file, unlike VCFs that are typically split by chromosome. The extract_carriers() function accepts a BgenReader object directly. No region argument is needed — the function iterates over all variants in the BGEN file and matches against the input variant list.
The example BGEN file contains all six genes from dummy_variants, so the result will include carriers for all of them.
[12]:
# Load the BGEN file — returns a BgenReader object directly
bgen_file = examples.get_data('bgen_example')
# Extract carriers from BGEN; no region argument required
summary_bgen, carriers_bgen = extract_carriers(bgen_file, dummy_variants)
print(f"BGEN extraction found {carriers_bgen.shape[1]} variants across {carriers_bgen.shape[0]} samples")
carriers_bgen.head()
BGEN extraction found 31 variants across 1000 samples
[12]:
| chr22:1437663:C:A | chr22:8039064:G:T | chr22:28629784:T:A | chr22:30905065:A:C | chr22:37632612:G:C | chr16:12048442:C:T | chr16:19054093:C:A | chr16:24959866:A:T | chr16:29686549:G:C | chr16:34925485:A:T | ... | chr7:1107036:C:T | chr7:4561666:G:A | chr7:4600516:T:C | chr7:24661435:G:T | chr7:36749874:G:A | chr5:1569066:G:T | chr5:18521034:T:C | chr5:34047537:G:C | chr5:35835987:A:T | chr5:45536080:A:G | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID_000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ID_000001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ID_000002 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ID_000003 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ID_000004 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 31 columns
[13]:
summary_bgen
[13]:
| variant | count | total | frequency | na_count | valid_samples | |
|---|---|---|---|---|---|---|
| 0 | chr7:4561666:G:A | 49 | 1000 | 0.049 | 0 | 1000 |
| 1 | chr16:12048442:C:T | 48 | 1000 | 0.048 | 0 | 1000 |
| 2 | chr5:1569066:G:T | 47 | 1000 | 0.047 | 0 | 1000 |
| 3 | chr7:24661435:G:T | 45 | 1000 | 0.045 | 0 | 1000 |
| 4 | chr7:4600516:T:C | 45 | 1000 | 0.045 | 0 | 1000 |
| 5 | chr1:44328361:A:G | 43 | 1000 | 0.043 | 0 | 1000 |
| 6 | chr16:24959866:A:T | 43 | 1000 | 0.043 | 0 | 1000 |
| 7 | chr22:1437663:C:A | 42 | 1000 | 0.042 | 0 | 1000 |
| 8 | chr1:25132332:A:C | 42 | 1000 | 0.042 | 0 | 1000 |
| 9 | chr1:18589438:T:G | 41 | 1000 | 0.041 | 0 | 1000 |
| 10 | chr5:18521034:T:C | 41 | 1000 | 0.041 | 0 | 1000 |
| 11 | chr22:8039064:G:T | 40 | 1000 | 0.040 | 0 | 1000 |
| 12 | chr1:9474009:A:T | 39 | 1000 | 0.039 | 0 | 1000 |
| 13 | chr16:34925485:A:T | 38 | 1000 | 0.038 | 0 | 1000 |
| 14 | chr22:28629784:T:A | 37 | 1000 | 0.037 | 0 | 1000 |
| 15 | chr16:19054093:C:A | 36 | 1000 | 0.036 | 0 | 1000 |
| 16 | chr22:30905065:A:C | 36 | 1000 | 0.036 | 0 | 1000 |
| 17 | chr1:22843509:A:C | 35 | 1000 | 0.035 | 0 | 1000 |
| 18 | chr1:35045263:G:C | 33 | 1000 | 0.033 | 0 | 1000 |
| 19 | chr5:34047537:G:C | 32 | 1000 | 0.032 | 0 | 1000 |
| 20 | chr22:37632612:G:C | 32 | 1000 | 0.032 | 0 | 1000 |
| 21 | chr5:45536080:A:G | 30 | 1000 | 0.030 | 0 | 1000 |
| 22 | chr7:1107036:C:T | 29 | 1000 | 0.029 | 0 | 1000 |
| 23 | chr16:29686549:G:C | 29 | 1000 | 0.029 | 0 | 1000 |
| 24 | chr1:42576549:C:A | 28 | 1000 | 0.028 | 0 | 1000 |
| 25 | chr7:36749874:G:A | 27 | 1000 | 0.027 | 0 | 1000 |
| 26 | chr1:28845057:G:T | 27 | 1000 | 0.027 | 0 | 1000 |
| 27 | chr5:35835987:A:T | 27 | 1000 | 0.027 | 0 | 1000 |
| 28 | chr1:30247857:C:A | 25 | 1000 | 0.025 | 0 | 1000 |
| 29 | chr1:5648006:A:G | 23 | 1000 | 0.023 | 0 | 1000 |
| 30 | chr1:1138210:C:A_chr1:1138210:G:C | 0 | 1000 | 0.000 | 1000 | 0 |