Gene/Category Aggregation

This notebook demonstrates how to aggregate individual variant carriers to gene (or category) level using extract_genes. It covers:

The indirect approach: extract variant carriers first, then aggregate to genes
The direct approach: pass a genetic file path to extract_genes directly
Controlling how negative and multi-carrier genotype values are handled (neg_geno, sum_geno)

[1]:

# imports
import os
import pandas as pd
from marvel.extraction.extraction import extract_carriers
from marvel.extraction.aggregation import extract_genes
from marvel.example_data import examples

/Users/marionvanvugt/Documents/work/projects/marvel/marvel/utils/utils.py:455: Warning: The environmental variable `MARVEL_TEST_DEFS` is not set. Trying to recover using the default configuration files in /Users/marionvanvugt/Documents/work/projects/marvel/marvel/association/tests.py.
  warnings.warn(

Load the data

First, we’ll load the example data. We’re interested in seven genetic variants and want to extract all carriers from a PLINK dataset. These variants are defined in the variants dataframe. For illustration purposes, the PLINK file will not contain all variants we’re interested in, and one variant has different alleles from the one we’re looking for.

[2]:

# Load variants to extract
variants = examples.get_data('dummy_variants')

# Show example
variants

[2]:

	gene	rsid	chr	pos	chr_pos	a1	a2
0	PCSK9	rs00001	chr1	22843509	chr1:22843509	A	C
1	PCSK9	rs00002	chr1	18589438	chr1:18589438	T	G
2	PCSK9	rs00003	chr1	47423257	chr1:47423257	A	T
3	PCSK9	rs00004	chr1	1138210	chr1:1138210	C	A
4	PCSK9	rs00005	chr1	5648006	chr1:5648006	A	G
5	PCSK9	rs00006	chr1	44328361	chr1:44328361	A	G
6	ANGPTL3	rs00007	chr1	35045263	chr1:35045263	G	C
7	ANGPTL3	rs00008	chr1	9474009	chr1:9474009	A	T
8	ANGPTL3	rs00009	chr1	25132332	chr1:25132332	A	C
9	ANGPTL3	rs00010	chr1	42576549	chr1:42576549	C	A
10	ANGPTL3	rs00011	chr1	30247857	chr1:30247857	C	A
11	ANGPTL3	rs00012	chr1	28845057	chr1:28845057	G	T
12	HMGCR	rs00013	chr5	34047537	chr5:34047537	G	C
13	HMGCR	rs00014	chr5	35835987	chr5:35835987	A	T
14	HMGCR	rs00015	chr5	45536080	chr5:45536080	A	G
15	HMGCR	rs00016	chr5	1569066	chr5:1569066	G	T
16	HMGCR	rs00017	chr5	46013857	chr5:46013857	A	T
17	HMGCR	rs00018	chr5	18521034	chr5:18521034	T	C
18	NPC1L1	rs00019	chr7	4561666	chr7:4561666	G	A
19	NPC1L1	rs00020	chr7	36749874	chr7:36749874	G	A
20	NPC1L1	rs00021	chr7	47201525	chr7:47201525	A	C
21	NPC1L1	rs00022	chr7	24661435	chr7:24661435	G	T
22	NPC1L1	rs00023	chr7	1107036	chr7:1107036	C	T
23	NPC1L1	rs00024	chr7	4600516	chr7:4600516	T	C
24	CETP	rs00025	chr16	19054093	chr16:19054093	C	A
25	CETP	rs00026	chr16	24959866	chr16:24959866	A	T
26	CETP	rs00027	chr16	47073671	chr16:47073671	C	A
27	CETP	rs00028	chr16	34925485	chr16:34925485	A	T
28	CETP	rs00029	chr16	12048442	chr16:12048442	C	T
29	CETP	rs00030	chr16	29686549	chr16:29686549	G	C
30	PPARA	rs00031	chr22	8039064	chr22:8039064	G	T
31	PPARA	rs00032	chr22	28629784	chr22:28629784	T	A
32	PPARA	rs00033	chr22	1437663	chr22:1437663	C	A
33	PPARA	rs00034	chr22	30905065	chr22:30905065	A	C
34	PPARA	rs00035	chr22	41350377	chr22:41350377	G	T
35	PPARA	rs00036	chr22	37632612	chr22:37632612	G	C

Extract the variant carriers

There are two ways to extract gene-level carriers: an indirect approach and a direct approach. The indirect approach extracts variant carriers first and then aggregates them to genes. The direct approach passes a file path straight to extract_genes, which handles both steps internally.

[3]:

# Load the plink-file
bed, fam, bim = examples.get_data('plink_example')

# Extract variant carriers
_, variant_carriers = extract_carriers(bed, variants, warning = False)
variant_carriers

[3]:

	chr1:1138210:C:A	chr1:5648006:A:G	chr1:9474009:A:T	chr1:18589438:T:G	chr1:22843509:A:C	chr1:25132332:A:C	chr1:28845057:G:T	chr1:30247857:C:A	chr1:35045263:G:C	chr1:42576549:C:A	...	chr16:24959866:A:T	chr16:29686549:G:C	chr16:34925485:A:T	chr16:47073671:C:A	chr22:1437663:C:A	chr22:8039064:G:T	chr22:28629784:T:A	chr22:30905065:A:C	chr22:37632612:G:C	chr22:41350377:G:T
ID_000000	0	0	-1	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
ID_000001	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
ID_000002	-1	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
ID_000003	0	0	0	0	0	0	0	0	0	0	...	0	0	-1	0	0	0	-1	0	0	0
ID_000004	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
ID_000995	0	0	1	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
ID_000996	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
ID_000997	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
ID_000998	0	0	-1	0	0	0	0	0	0	0	...	0	-1	0	0	0	0	0	0	0	0
ID_000999	0	0	0	0	0	-1	0	0	0	0	...	-1	0	0	0	0	0	0	0	0	0

1000 rows × 36 columns

Extract the gene carriers

The variant carrier dataframe returned above can be passed directly to extract_genes to aggregate carriers to gene level.

[4]:

# Summarise to genes using variant carriers as input
_, gene_out = extract_genes(
    input_file = variant_carriers,
    genes = variants,
    incl_var = False,
    cat_column = 'gene',
)
gene_out

[4]:

	PCSK9	ANGPTL3	HMGCR	NPC1L1	CETP	PPARA
ID_000000	0.0	NaN	0.0	0.0	0.0	0.0
ID_000001	0.0	0.0	0.0	0.0	0.0	0.0
ID_000002	NaN	0.0	0.0	0.0	0.0	0.0
ID_000003	0.0	0.0	0.0	0.0	NaN	NaN
ID_000004	0.0	0.0	0.0	0.0	0.0	0.0
...	...	...	...	...	...	...
ID_000995	0.0	1.0	0.0	0.0	0.0	0.0
ID_000996	0.0	0.0	NaN	0.0	0.0	0.0
ID_000997	0.0	0.0	0.0	0.0	0.0	0.0
ID_000998	0.0	NaN	0.0	1.0	NaN	0.0
ID_000999	0.0	NaN	0.0	0.0	NaN	0.0

1000 rows × 6 columns

Controlling genotype value handling: `neg_geno` and `sum_geno`

Two parameters control what happens to special genotype values after gene aggregation.

In the example above (PLINK input), the default neg_geno=True means PLINK no-call values (-1) appear unchanged in the gene sum as NaN after aggregation (a negative sum is treated as missing by pandas arithmetic). To explicitly replace them with NA, pass neg_geno=None; to treat no-calls as non-carriers, pass neg_geno=0.

The direct approach passes a file path to extract_genes, which performs variant extraction and gene aggregation in a single step.

[ ]:

# Summarise genes pointing to the plink-path
summary, results = extract_genes(
    input_file = examples.get_path('plink_example', 'example_genotypes.bed'),
    genes = variants, incl_var = True, cat_column = 'gene',
)

[6]:

summary.head()

[6]:

	variant	count	total	frequency	na_count	valid_samples
0	NPC1L1	88	1000	0.088	125	875
1	PCSK9	65	1000	0.065	125	875
2	HMGCR	53	1000	0.053	131	869
3	PPARA	51	1000	0.051	135	865
4	CETP	50	1000	0.050	155	845

[7]:

results

[7]:

	chr1:1138210:C:A	chr1:5648006:A:G	chr1:9474009:A:T	chr1:18589438:T:G	chr1:22843509:A:C	chr1:25132332:A:C	chr1:28845057:G:T	chr1:30247857:C:A	chr1:35045263:G:C	chr1:42576549:C:A	...	chr22:28629784:T:A	chr22:30905065:A:C	chr22:37632612:G:C	chr22:41350377:G:T	PCSK9	ANGPTL3	HMGCR	NPC1L1	CETP	PPARA
ID_000000	0	0	-1	0	0	0	0	0	0	0	...	0	0	0	0	0.0	NaN	0.0	0.0	0.0	0.0
ID_000001	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	0.0
ID_000002	-1	0	0	0	0	0	0	0	0	0	...	0	0	0	0	NaN	0.0	0.0	0.0	0.0	0.0
ID_000003	0	0	0	0	0	0	0	0	0	0	...	-1	0	0	0	0.0	0.0	0.0	0.0	NaN	NaN
ID_000004	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
ID_000995	0	0	1	0	0	0	0	0	0	0	...	0	0	0	0	0.0	1.0	0.0	0.0	0.0	0.0
ID_000996	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0.0	0.0	NaN	0.0	0.0	0.0
ID_000997	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	0.0
ID_000998	0	0	-1	0	0	0	0	0	0	0	...	0	0	0	0	0.0	NaN	0.0	1.0	NaN	0.0
ID_000999	0	0	0	0	0	-1	0	0	0	0	...	0	0	0	0	0.0	NaN	0.0	0.0	NaN	0.0

1000 rows × 42 columns

Gene/Category Aggregation

Load the data

Extract the variant carriers

Extract the gene carriers

Controlling genotype value handling: neg_geno and sum_geno

Controlling genotype value handling: `neg_geno` and `sum_geno`