Variant Extraction

This notebook demonstrates how to extract variant carriers from genetic files using extract_carriers and extract_genes. It covers:

Extracting carriers from a VCF file
Aggregating variant carriers to gene level from a BGEN file
Pre-filtering variants by genomic region with prefilter_regions
Controlling genotype value handling with neg_geno and sum_geno

[1]:

# imports
import os
from marvel.extraction.extraction import extract_carriers
from marvel.extraction.aggregation import extract_genes
from marvel.example_data import examples

/Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/utils/utils.py:455: Warning: The environmental variable `MARVEL_TEST_DEFS` is not set. Trying to recover using the default configuration files in /Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/association/tests.py.
  warnings.warn(

Load the data

First, we’ll load the example data. We’re interested in variants across six genes and want to extract all carriers from a VCF file. These variants are defined in the variants dataframe. For illustration purposes, not all variants will be present in the VCF file, and one variant has different alleles from the one we’re looking for.

[2]:

# Load variants to extract
variants = examples.get_data('dummy_variants')

# Show example
variants

[2]:

	gene	rsid	chr	pos	chr_pos	a1	a2
0	PCSK9	rs00001	chr1	22843509	chr1:22843509	A	C
1	PCSK9	rs00002	chr1	18589438	chr1:18589438	T	G
2	PCSK9	rs00003	chr1	47423257	chr1:47423257	A	T
3	PCSK9	rs00004	chr1	1138210	chr1:1138210	C	A
4	PCSK9	rs00005	chr1	5648006	chr1:5648006	A	G
5	PCSK9	rs00006	chr1	44328361	chr1:44328361	A	G
6	ANGPTL3	rs00007	chr1	35045263	chr1:35045263	G	C
7	ANGPTL3	rs00008	chr1	9474009	chr1:9474009	A	T
8	ANGPTL3	rs00009	chr1	25132332	chr1:25132332	A	C
9	ANGPTL3	rs00010	chr1	42576549	chr1:42576549	C	A
10	ANGPTL3	rs00011	chr1	30247857	chr1:30247857	C	A
11	ANGPTL3	rs00012	chr1	28845057	chr1:28845057	G	T
12	HMGCR	rs00013	chr5	34047537	chr5:34047537	G	C
13	HMGCR	rs00014	chr5	35835987	chr5:35835987	A	T
14	HMGCR	rs00015	chr5	45536080	chr5:45536080	A	G
15	HMGCR	rs00016	chr5	1569066	chr5:1569066	G	T
16	HMGCR	rs00017	chr5	46013857	chr5:46013857	A	T
17	HMGCR	rs00018	chr5	18521034	chr5:18521034	T	C
18	NPC1L1	rs00019	chr7	4561666	chr7:4561666	G	A
19	NPC1L1	rs00020	chr7	36749874	chr7:36749874	G	A
20	NPC1L1	rs00021	chr7	47201525	chr7:47201525	A	C
21	NPC1L1	rs00022	chr7	24661435	chr7:24661435	G	T
22	NPC1L1	rs00023	chr7	1107036	chr7:1107036	C	T
23	NPC1L1	rs00024	chr7	4600516	chr7:4600516	T	C
24	CETP	rs00025	chr16	19054093	chr16:19054093	C	A
25	CETP	rs00026	chr16	24959866	chr16:24959866	A	T
26	CETP	rs00027	chr16	47073671	chr16:47073671	C	A
27	CETP	rs00028	chr16	34925485	chr16:34925485	A	T
28	CETP	rs00029	chr16	12048442	chr16:12048442	C	T
29	CETP	rs00030	chr16	29686549	chr16:29686549	G	C
30	PPARA	rs00031	chr22	8039064	chr22:8039064	G	T
31	PPARA	rs00032	chr22	28629784	chr22:28629784	T	A
32	PPARA	rs00033	chr22	1437663	chr22:1437663	C	A
33	PPARA	rs00034	chr22	30905065	chr22:30905065	A	C
34	PPARA	rs00035	chr22	41350377	chr22:41350377	G	T
35	PPARA	rs00036	chr22	37632612	chr22:37632612	G	C

[3]:

# Load VCF-file of chromosome 1
vex = examples.get_data('vcf_example')

# Print the IDs of the individuals
print(f'VCF-file contains {len(vex.samples)} samples and {vex.num_records} genetic variants')

VCF-file contains 1000 samples and 11 genetic variants

Extract the carriers

Below we extract carriers from the VCF file. One variant exists in the file but with alleles that differ from our input definition — setting warning=True reports this mismatch. The variant is still included in the output, matched by position, with its column name showing both allele sets.

[4]:

# Extract carriers
summary, tmp = extract_carriers(vex, variants, warning = True)

# Show output
tmp

/Users/marionvanvugt/Documents/work.nosync/projects/marvel/marvel/extraction/core.py:374: UserWarning: Variant chr1:1138210 has non-matching alleles
  warnings.warn(

[4]:

	chr1:1138210:C:A_chr1:1138210:G:C	chr1:5648006:A:G	chr1:9474009:A:T	chr1:18589438:T:G	chr1:22843509:A:C	chr1:25132332:A:C	chr1:28845057:G:T	chr1:30247857:C:A	chr1:35045263:G:C	chr1:42576549:C:A	chr1:44328361:A:G
ID_000000	<NA>	0	<NA>	0	0	0	0	0	0	0	0
ID_000001	<NA>	0	0	0	0	0	0	0	0	0	0
ID_000002	<NA>	0	0	0	0	0	0	0	0	0	0
ID_000003	<NA>	0	0	0	0	0	0	0	0	0	0
ID_000004	<NA>	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...
ID_000995	<NA>	0	1	0	0	0	0	0	0	0	0
ID_000996	<NA>	0	0	0	0	0	0	0	0	0	0
ID_000997	<NA>	0	0	0	0	0	0	0	0	0	0
ID_000998	<NA>	0	<NA>	0	0	0	0	0	0	0	0
ID_000999	<NA>	0	0	0	0	<NA>	0	0	0	0	0

1000 rows × 11 columns

[5]:

summary

[5]:

	variant	count	total	frequency	na_count	valid_samples
0	chr1:18589438:T:G	17	1000	0.017	24	976
1	chr1:25132332:A:C	14	1000	0.014	28	972
2	chr1:9474009:A:T	13	1000	0.013	26	974
3	chr1:44328361:A:G	13	1000	0.013	30	970
4	chr1:22843509:A:C	12	1000	0.012	23	977
5	chr1:30247857:C:A	7	1000	0.007	18	982
6	chr1:35045263:G:C	5	1000	0.005	28	972
7	chr1:42576549:C:A	5	1000	0.005	23	977
8	chr1:5648006:A:G	4	1000	0.004	19	981
9	chr1:1138210:C:A_chr1:1138210:G:C	0	1000	0.000	1000	0
10	chr1:28845057:G:T	0	1000	0.000	27	973

Extract carriers of all genes

The variants dataframe spans six genes. To aggregate variant carriers to gene level in one step, we pass the BGEN file directly to extract_genes. We use the BGEN file here since it holds all chromosomes in a single file, whereas the example VCF files are split by chromosome.

Handling negative genotypes and multi-carrier values

Two parameters control what happens to special genotype values after gene aggregation:

Note: BGEN dosage values are always non-negative, so neg_geno has no effect on the examples below. For PLINK data (.bed), no-call genotypes are encoded as -1 and will appear in the gene sum unless neg_geno is set.

[ ]:

summary, gen_df = extract_genes(
    input_file = examples.get_path('bgen_example', 'example_genotypes.bgen'),
    genes = variants,
    cat_column = 'gene',
    method = 'any',
    count_na = True,
    sum_geno = True,
    incl_var = False,
)
gen_df

[7]:

summary

[7]:

	variant	count	total	frequency	valid_samples
0	NPC1L1	183	1000	0.183	1000
1	CETP	182	1000	0.182	1000
2	ANGPTL3	175	1000	0.175	1000
3	HMGCR	169	1000	0.169	1000
4	PPARA	169	1000	0.169	1000
5	PCSK9	131	1000	0.131	1000

Some individuals carry variants at more than one position within a gene, resulting in values greater than one. For analyses that require binary carrier status (0 = non-carrier, 1 = carrier), set sum_geno=1 to cap these values.

[ ]:

summary, gen_df = extract_genes(
    input_file = examples.get_path('bgen_example', 'example_genotypes.bgen'),
    genes = variants,
    cat_column = 'gene',
    method = 'any',
    count_na = True,
    sum_geno = 1,
    incl_var = False,
)
gen_df

Pre-filtering variants with `prefilter_regions`

When working with large genetic files that contain only a subset of chromosomes or positions, you can use prefilter_regions=True to pre-filter the input variant list before extraction begins. This avoids iterating over millions of positions for variants that fall outside the genetic file’s range.

The demo below uses the full dummy_variants dataframe — which spans chromosomes 1, 5, 7, 16, and 22 — and extracts against the chr22-only VCF. Without pre-filtering, all 36 variants are passed to the extraction loop (only 6 of them are on chr22). With prefilter_regions=True, the variant list is reduced to chr22 variants before extraction starts, producing the same carrier output with fewer candidates to iterate over.

[ ]:

# Load the full multi-chromosome dummy_variants fixture
dummy_variants = examples.get_data('dummy_variants')

# Get the chr22 VCF path via the public API
chr22_vcf_path = examples.get_path('vcf_example', 'example_genotypes_chr22.vcf.gz')

total_variants = len(dummy_variants)
chr22_expected = len(
    dummy_variants[dummy_variants['chr'].astype(str).str.strip('chr') == '22']
)

print(f"Total input variants (all chromosomes): {total_variants}")
print(f"Expected chr22 candidates after prefilter_regions=True: {chr22_expected}")
print(f"Variants that will be skipped with prefilter_regions=True: {total_variants - chr22_expected}")

[ ]:

from cyvcf2 import VCF

# Extract WITHOUT prefilter_regions (all 36 variants passed to extraction loop)
vcf_no_filter = VCF(chr22_vcf_path)
summary_no_filter, carriers_no_filter = extract_carriers(
    vcf_no_filter, dummy_variants, prefilter_regions=False
)

# Extract WITH prefilter_regions=True (only chr22 variants passed to extraction loop)
vcf_prefiltered = VCF(chr22_vcf_path)
summary_prefiltered, carriers_prefiltered = extract_carriers(
    vcf_prefiltered, dummy_variants, prefilter_regions=True
)

print("Columns in carriers_no_filter:", list(carriers_no_filter.columns))
print("Columns in carriers_prefiltered:", list(carriers_prefiltered.columns))

[11]:

import pandas as pd

# Verify the carrier outputs are identical regardless of prefilter_regions setting
shared_cols = sorted(set(carriers_no_filter.columns) & set(carriers_prefiltered.columns))
pd.testing.assert_frame_equal(
    carriers_no_filter[shared_cols].sort_values(shared_cols).reset_index(drop=True),
    carriers_prefiltered[shared_cols].sort_values(shared_cols).reset_index(drop=True),
)
print("Carrier outputs are identical with and without prefilter_regions=True")

# Show a sample of the result
carriers_prefiltered.head()

Carrier outputs are identical with and without prefilter_regions=True

[11]:

	chr22:1437663:C:A	chr22:8039064:G:T	chr22:28629784:T:A	chr22:30905065:A:C	chr22:37632612:G:C
ID_000000	0	0	0	0	0
ID_000001	0	0	0	0	0
ID_000002	0	0	0	0	0
ID_000003	0	0	<NA>	0	0
ID_000004	0	0	0	0	0

BGEN extraction

BGEN files store all chromosomes in a single file, unlike VCFs that are typically split by chromosome. The extract_carriers() function accepts a BgenReader object directly. No region argument is needed — the function iterates over all variants in the BGEN file and matches against the input variant list.

The example BGEN file contains all six genes from dummy_variants, so the result will include carriers for all of them.

[12]:

# Load the BGEN file — returns a BgenReader object directly
bgen_file = examples.get_data('bgen_example')

# Extract carriers from BGEN; no region argument required
summary_bgen, carriers_bgen = extract_carriers(bgen_file, dummy_variants)

print(f"BGEN extraction found {carriers_bgen.shape[1]} variants across {carriers_bgen.shape[0]} samples")
carriers_bgen.head()

BGEN extraction found 31 variants across 1000 samples

[12]:

	chr22:28629784:T:A	chr16:34925485:A:T	...
ID_000000	0.0	0.0	...
ID_000001	0.0	0.0	...
ID_000002	0.0	0.0	...
ID_000003	2.0	2.0	...
ID_000004	0.0	0.0	...

5 rows × 31 columns

[13]:

summary_bgen

[13]:

	variant	count	total	frequency	na_count	valid_samples
0	chr7:4561666:G:A	49	1000	0.049	0	1000
1	chr16:12048442:C:T	48	1000	0.048	0	1000
2	chr5:1569066:G:T	47	1000	0.047	0	1000
3	chr7:24661435:G:T	45	1000	0.045	0	1000
4	chr7:4600516:T:C	45	1000	0.045	0	1000
5	chr1:44328361:A:G	43	1000	0.043	0	1000
6	chr16:24959866:A:T	43	1000	0.043	0	1000
7	chr22:1437663:C:A	42	1000	0.042	0	1000
8	chr1:25132332:A:C	42	1000	0.042	0	1000
9	chr1:18589438:T:G	41	1000	0.041	0	1000
10	chr5:18521034:T:C	41	1000	0.041	0	1000
11	chr22:8039064:G:T	40	1000	0.040	0	1000
12	chr1:9474009:A:T	39	1000	0.039	0	1000
13	chr16:34925485:A:T	38	1000	0.038	0	1000
14	chr22:28629784:T:A	37	1000	0.037	0	1000
15	chr16:19054093:C:A	36	1000	0.036	0	1000
16	chr22:30905065:A:C	36	1000	0.036	0	1000
17	chr1:22843509:A:C	35	1000	0.035	0	1000
18	chr1:35045263:G:C	33	1000	0.033	0	1000
19	chr5:34047537:G:C	32	1000	0.032	0	1000
20	chr22:37632612:G:C	32	1000	0.032	0	1000
21	chr5:45536080:A:G	30	1000	0.030	0	1000
22	chr7:1107036:C:T	29	1000	0.029	0	1000
23	chr16:29686549:G:C	29	1000	0.029	0	1000
24	chr1:42576549:C:A	28	1000	0.028	0	1000
25	chr7:36749874:G:A	27	1000	0.027	0	1000
26	chr1:28845057:G:T	27	1000	0.027	0	1000
27	chr5:35835987:A:T	27	1000	0.027	0	1000
28	chr1:30247857:C:A	25	1000	0.025	0	1000
29	chr1:5648006:A:G	23	1000	0.023	0	1000
30	chr1:1138210:C:A_chr1:1138210:G:C	0	1000	0.000	1000	0