Variant Extraction

This notebook demonstrates how to extract variant carriers from genetic files using extract_carriers and extract_genes. It covers:

  • Extracting carriers from a VCF file

  • Aggregating variant carriers to gene level from a BGEN file

  • Pre-filtering variants by genomic region with prefilter_regions

  • Controlling genotype value handling with neg_geno and sum_geno

[1]:
# imports
import os
from marvel.extraction.extraction import extract_carriers
from marvel.extraction.aggregation import extract_genes
from marvel.example_data import examples
/Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/utils/utils.py:455: Warning: The environmental variable `MARVEL_TEST_DEFS` is not set. Trying to recover using the default configuration files in /Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/association/tests.py.
  warnings.warn(

Load the data

First, we’ll load the example data. We’re interested in variants across six genes and want to extract all carriers from a VCF file. These variants are defined in the variants dataframe. For illustration purposes, not all variants will be present in the VCF file, and one variant has different alleles from the one we’re looking for.

[2]:
# Load variants to extract
variants = examples.get_data('dummy_variants')

# Show example
variants
[2]:
gene rsid chr pos chr_pos a1 a2
0 PCSK9 rs00001 chr1 22843509 chr1:22843509 A C
1 PCSK9 rs00002 chr1 18589438 chr1:18589438 T G
2 PCSK9 rs00003 chr1 47423257 chr1:47423257 A T
3 PCSK9 rs00004 chr1 1138210 chr1:1138210 C A
4 PCSK9 rs00005 chr1 5648006 chr1:5648006 A G
5 PCSK9 rs00006 chr1 44328361 chr1:44328361 A G
6 ANGPTL3 rs00007 chr1 35045263 chr1:35045263 G C
7 ANGPTL3 rs00008 chr1 9474009 chr1:9474009 A T
8 ANGPTL3 rs00009 chr1 25132332 chr1:25132332 A C
9 ANGPTL3 rs00010 chr1 42576549 chr1:42576549 C A
10 ANGPTL3 rs00011 chr1 30247857 chr1:30247857 C A
11 ANGPTL3 rs00012 chr1 28845057 chr1:28845057 G T
12 HMGCR rs00013 chr5 34047537 chr5:34047537 G C
13 HMGCR rs00014 chr5 35835987 chr5:35835987 A T
14 HMGCR rs00015 chr5 45536080 chr5:45536080 A G
15 HMGCR rs00016 chr5 1569066 chr5:1569066 G T
16 HMGCR rs00017 chr5 46013857 chr5:46013857 A T
17 HMGCR rs00018 chr5 18521034 chr5:18521034 T C
18 NPC1L1 rs00019 chr7 4561666 chr7:4561666 G A
19 NPC1L1 rs00020 chr7 36749874 chr7:36749874 G A
20 NPC1L1 rs00021 chr7 47201525 chr7:47201525 A C
21 NPC1L1 rs00022 chr7 24661435 chr7:24661435 G T
22 NPC1L1 rs00023 chr7 1107036 chr7:1107036 C T
23 NPC1L1 rs00024 chr7 4600516 chr7:4600516 T C
24 CETP rs00025 chr16 19054093 chr16:19054093 C A
25 CETP rs00026 chr16 24959866 chr16:24959866 A T
26 CETP rs00027 chr16 47073671 chr16:47073671 C A
27 CETP rs00028 chr16 34925485 chr16:34925485 A T
28 CETP rs00029 chr16 12048442 chr16:12048442 C T
29 CETP rs00030 chr16 29686549 chr16:29686549 G C
30 PPARA rs00031 chr22 8039064 chr22:8039064 G T
31 PPARA rs00032 chr22 28629784 chr22:28629784 T A
32 PPARA rs00033 chr22 1437663 chr22:1437663 C A
33 PPARA rs00034 chr22 30905065 chr22:30905065 A C
34 PPARA rs00035 chr22 41350377 chr22:41350377 G T
35 PPARA rs00036 chr22 37632612 chr22:37632612 G C
[3]:
# Load VCF-file of chromosome 1
vex = examples.get_data('vcf_example')

# Print the IDs of the individuals
print(f'VCF-file contains {len(vex.samples)} samples and {vex.num_records} genetic variants')
VCF-file contains 1000 samples and 11 genetic variants

Extract the carriers

Below we extract carriers from the VCF file. One variant exists in the file but with alleles that differ from our input definition — setting warning=True reports this mismatch. The variant is still included in the output, matched by position, with its column name showing both allele sets.

[4]:
# Extract carriers
summary, tmp = extract_carriers(vex, variants, warning = True)

# Show output
tmp
/Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/extraction/core.py:374: UserWarning: Variant chr1:1138210 has non-matching alleles
  warnings.warn(
[4]:
chr1:1138210:C:A_chr1:1138210:G:C chr1:5648006:A:G chr1:9474009:A:T chr1:18589438:T:G chr1:22843509:A:C chr1:25132332:A:C chr1:28845057:G:T chr1:30247857:C:A chr1:35045263:G:C chr1:42576549:C:A chr1:44328361:A:G
ID_000000 <NA> 0 <NA> 0 0 0 0 0 0 0 0
ID_000001 <NA> 0 0 0 0 0 0 0 0 0 0
ID_000002 <NA> 0 0 0 0 0 0 0 0 0 0
ID_000003 <NA> 0 0 0 0 0 0 0 0 0 0
ID_000004 <NA> 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
ID_000995 <NA> 0 1 0 0 0 0 0 0 0 0
ID_000996 <NA> 0 0 0 0 0 0 0 0 0 0
ID_000997 <NA> 0 0 0 0 0 0 0 0 0 0
ID_000998 <NA> 0 <NA> 0 0 0 0 0 0 0 0
ID_000999 <NA> 0 0 0 0 <NA> 0 0 0 0 0

1000 rows × 11 columns

[5]:
summary
[5]:
variant count total frequency na_count valid_samples
0 chr1:18589438:T:G 17 1000 0.017 24 976
1 chr1:25132332:A:C 14 1000 0.014 28 972
2 chr1:9474009:A:T 13 1000 0.013 26 974
3 chr1:44328361:A:G 13 1000 0.013 30 970
4 chr1:22843509:A:C 12 1000 0.012 23 977
5 chr1:30247857:C:A 7 1000 0.007 18 982
6 chr1:35045263:G:C 5 1000 0.005 28 972
7 chr1:42576549:C:A 5 1000 0.005 23 977
8 chr1:5648006:A:G 4 1000 0.004 19 981
9 chr1:1138210:C:A_chr1:1138210:G:C 0 1000 0.000 1000 0
10 chr1:28845057:G:T 0 1000 0.000 27 973

Extract carriers of all genes

The variants dataframe spans six genes. To aggregate variant carriers to gene level in one step, we pass the BGEN file directly to extract_genes. We use the BGEN file here since it holds all chromosomes in a single file, whereas the example VCF files are split by chromosome.

Handling negative genotypes and multi-carrier values

Two parameters control what happens to special genotype values after gene aggregation:

``neg_geno`` — negative genotype values (e.g. PLINK no-call coded as -1): | Value | Effect | |——-|——–| | True (default) | Keep the raw value unchanged | | None | Replace with NA (treat as missing) | | integer / string (e.g. 0) | Replace all negative values with that value |

``sum_geno`` — genotype values greater than 1 (a sample carries more than one variant in the gene): | Value | Effect | |——-|——–| | True (default) | Keep the raw value unchanged | | None | Replace with NA | | integer / string (e.g. 1) | Replace all values > 1 with that value — use sum_geno=1 for binary (0/1) carrier status |

Note: BGEN dosage values are always non-negative, so neg_geno has no effect on the examples below. For PLINK data (.bed), no-call genotypes are encoded as -1 and will appear in the gene sum unless neg_geno is set.

[ ]:
summary, gen_df = extract_genes(
    input_file = examples.get_path('bgen_example', 'example_genotypes.bgen'),
    genes = variants,
    cat_column = 'gene',
    method = 'any',
    count_na = True,
    sum_geno = True,
    incl_var = False,
)
gen_df
[7]:
summary
[7]:
variant count total frequency na_count valid_samples
0 NPC1L1 183 1000 0.183 0 1000
1 CETP 182 1000 0.182 0 1000
2 ANGPTL3 175 1000 0.175 0 1000
3 HMGCR 169 1000 0.169 0 1000
4 PPARA 169 1000 0.169 0 1000
5 PCSK9 131 1000 0.131 0 1000

Some individuals carry variants at more than one position within a gene, resulting in values greater than one. For analyses that require binary carrier status (0 = non-carrier, 1 = carrier), set sum_geno=1 to cap these values.

[ ]:
summary, gen_df = extract_genes(
    input_file = examples.get_path('bgen_example', 'example_genotypes.bgen'),
    genes = variants,
    cat_column = 'gene',
    method = 'any',
    count_na = True,
    sum_geno = 1,
    incl_var = False,
)
gen_df

Pre-filtering variants with prefilter_regions

When working with large genetic files that contain only a subset of chromosomes or positions, you can use prefilter_regions=True to pre-filter the input variant list before extraction begins. This avoids iterating over millions of positions for variants that fall outside the genetic file’s range.

The demo below uses the full dummy_variants dataframe — which spans chromosomes 1, 5, 7, 16, and 22 — and extracts against the chr22-only VCF. Without pre-filtering, all 36 variants are passed to the extraction loop (only 6 of them are on chr22). With prefilter_regions=True, the variant list is reduced to chr22 variants before extraction starts, producing the same carrier output with fewer candidates to iterate over.

[ ]:
# Load the full multi-chromosome dummy_variants fixture
dummy_variants = examples.get_data('dummy_variants')

# Get the chr22 VCF path via the public API
chr22_vcf_path = examples.get_path('vcf_example', 'example_genotypes_chr22.vcf.gz')

total_variants = len(dummy_variants)
chr22_expected = len(
    dummy_variants[dummy_variants['chr'].astype(str).str.strip('chr') == '22']
)

print(f"Total input variants (all chromosomes): {total_variants}")
print(f"Expected chr22 candidates after prefilter_regions=True: {chr22_expected}")
print(f"Variants that will be skipped with prefilter_regions=True: {total_variants - chr22_expected}")
[ ]:
from cyvcf2 import VCF

# Extract WITHOUT prefilter_regions (all 36 variants passed to extraction loop)
vcf_no_filter = VCF(chr22_vcf_path)
summary_no_filter, carriers_no_filter = extract_carriers(
    vcf_no_filter, dummy_variants, prefilter_regions=False
)

# Extract WITH prefilter_regions=True (only chr22 variants passed to extraction loop)
vcf_prefiltered = VCF(chr22_vcf_path)
summary_prefiltered, carriers_prefiltered = extract_carriers(
    vcf_prefiltered, dummy_variants, prefilter_regions=True
)

print("Columns in carriers_no_filter:", list(carriers_no_filter.columns))
print("Columns in carriers_prefiltered:", list(carriers_prefiltered.columns))
[11]:
import pandas as pd

# Verify the carrier outputs are identical regardless of prefilter_regions setting
shared_cols = sorted(set(carriers_no_filter.columns) & set(carriers_prefiltered.columns))
pd.testing.assert_frame_equal(
    carriers_no_filter[shared_cols].sort_values(shared_cols).reset_index(drop=True),
    carriers_prefiltered[shared_cols].sort_values(shared_cols).reset_index(drop=True),
)
print("Carrier outputs are identical with and without prefilter_regions=True")

# Show a sample of the result
carriers_prefiltered.head()
Carrier outputs are identical with and without prefilter_regions=True
[11]:
chr22:1437663:C:A chr22:8039064:G:T chr22:28629784:T:A chr22:30905065:A:C chr22:37632612:G:C
ID_000000 0 0 0 0 0
ID_000001 0 0 0 0 0
ID_000002 0 0 0 0 0
ID_000003 0 0 <NA> 0 0
ID_000004 0 0 0 0 0

BGEN extraction

BGEN files store all chromosomes in a single file, unlike VCFs that are typically split by chromosome. The extract_carriers() function accepts a BgenReader object directly. No region argument is needed — the function iterates over all variants in the BGEN file and matches against the input variant list.

The example BGEN file contains all six genes from dummy_variants, so the result will include carriers for all of them.

[12]:
# Load the BGEN file — returns a BgenReader object directly
bgen_file = examples.get_data('bgen_example')

# Extract carriers from BGEN; no region argument required
summary_bgen, carriers_bgen = extract_carriers(bgen_file, dummy_variants)

print(f"BGEN extraction found {carriers_bgen.shape[1]} variants across {carriers_bgen.shape[0]} samples")
carriers_bgen.head()
BGEN extraction found 31 variants across 1000 samples
[12]:
chr22:1437663:C:A chr22:8039064:G:T chr22:28629784:T:A chr22:30905065:A:C chr22:37632612:G:C chr16:12048442:C:T chr16:19054093:C:A chr16:24959866:A:T chr16:29686549:G:C chr16:34925485:A:T ... chr7:1107036:C:T chr7:4561666:G:A chr7:4600516:T:C chr7:24661435:G:T chr7:36749874:G:A chr5:1569066:G:T chr5:18521034:T:C chr5:34047537:G:C chr5:35835987:A:T chr5:45536080:A:G
ID_000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
ID_000001 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
ID_000002 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
ID_000003 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
ID_000004 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 31 columns

[13]:
summary_bgen
[13]:
variant count total frequency na_count valid_samples
0 chr7:4561666:G:A 49 1000 0.049 0 1000
1 chr16:12048442:C:T 48 1000 0.048 0 1000
2 chr5:1569066:G:T 47 1000 0.047 0 1000
3 chr7:24661435:G:T 45 1000 0.045 0 1000
4 chr7:4600516:T:C 45 1000 0.045 0 1000
5 chr1:44328361:A:G 43 1000 0.043 0 1000
6 chr16:24959866:A:T 43 1000 0.043 0 1000
7 chr22:1437663:C:A 42 1000 0.042 0 1000
8 chr1:25132332:A:C 42 1000 0.042 0 1000
9 chr1:18589438:T:G 41 1000 0.041 0 1000
10 chr5:18521034:T:C 41 1000 0.041 0 1000
11 chr22:8039064:G:T 40 1000 0.040 0 1000
12 chr1:9474009:A:T 39 1000 0.039 0 1000
13 chr16:34925485:A:T 38 1000 0.038 0 1000
14 chr22:28629784:T:A 37 1000 0.037 0 1000
15 chr16:19054093:C:A 36 1000 0.036 0 1000
16 chr22:30905065:A:C 36 1000 0.036 0 1000
17 chr1:22843509:A:C 35 1000 0.035 0 1000
18 chr1:35045263:G:C 33 1000 0.033 0 1000
19 chr5:34047537:G:C 32 1000 0.032 0 1000
20 chr22:37632612:G:C 32 1000 0.032 0 1000
21 chr5:45536080:A:G 30 1000 0.030 0 1000
22 chr7:1107036:C:T 29 1000 0.029 0 1000
23 chr16:29686549:G:C 29 1000 0.029 0 1000
24 chr1:42576549:C:A 28 1000 0.028 0 1000
25 chr7:36749874:G:A 27 1000 0.027 0 1000
26 chr1:28845057:G:T 27 1000 0.027 0 1000
27 chr5:35835987:A:T 27 1000 0.027 0 1000
28 chr1:30247857:C:A 25 1000 0.025 0 1000
29 chr1:5648006:A:G 23 1000 0.023 0 1000
30 chr1:1138210:C:A_chr1:1138210:G:C 0 1000 0.000 1000 0