Gene/Category Aggregation

This notebook demonstrates how to aggregate individual variant carriers to gene (or category) level using extract_genes. It covers:

  • The indirect approach: extract variant carriers first, then aggregate to genes

  • The direct approach: pass a genetic file path to extract_genes directly

  • Controlling how negative and multi-carrier genotype values are handled (neg_geno, sum_geno)

[1]:
# imports
import os
import pandas as pd
from marvel.extraction.extraction import extract_carriers
from marvel.extraction.aggregation import extract_genes
from marvel.example_data import examples
/Users/marionvanvugt/Documents/work/projects/marvel/marvel/utils/utils.py:455: Warning: The environmental variable `MARVEL_TEST_DEFS` is not set. Trying to recover using the default configuration files in /Users/marionvanvugt/Documents/work/projects/marvel/marvel/association/tests.py.
  warnings.warn(

Load the data

First, we’ll load the example data. We’re interested in seven genetic variants and want to extract all carriers from a PLINK dataset. These variants are defined in the variants dataframe. For illustration purposes, the PLINK file will not contain all variants we’re interested in, and one variant has different alleles from the one we’re looking for.

[2]:
# Load variants to extract
variants = examples.get_data('dummy_variants')

# Show example
variants
[2]:
gene rsid chr pos chr_pos a1 a2
0 PCSK9 rs00001 chr1 22843509 chr1:22843509 A C
1 PCSK9 rs00002 chr1 18589438 chr1:18589438 T G
2 PCSK9 rs00003 chr1 47423257 chr1:47423257 A T
3 PCSK9 rs00004 chr1 1138210 chr1:1138210 C A
4 PCSK9 rs00005 chr1 5648006 chr1:5648006 A G
5 PCSK9 rs00006 chr1 44328361 chr1:44328361 A G
6 ANGPTL3 rs00007 chr1 35045263 chr1:35045263 G C
7 ANGPTL3 rs00008 chr1 9474009 chr1:9474009 A T
8 ANGPTL3 rs00009 chr1 25132332 chr1:25132332 A C
9 ANGPTL3 rs00010 chr1 42576549 chr1:42576549 C A
10 ANGPTL3 rs00011 chr1 30247857 chr1:30247857 C A
11 ANGPTL3 rs00012 chr1 28845057 chr1:28845057 G T
12 HMGCR rs00013 chr5 34047537 chr5:34047537 G C
13 HMGCR rs00014 chr5 35835987 chr5:35835987 A T
14 HMGCR rs00015 chr5 45536080 chr5:45536080 A G
15 HMGCR rs00016 chr5 1569066 chr5:1569066 G T
16 HMGCR rs00017 chr5 46013857 chr5:46013857 A T
17 HMGCR rs00018 chr5 18521034 chr5:18521034 T C
18 NPC1L1 rs00019 chr7 4561666 chr7:4561666 G A
19 NPC1L1 rs00020 chr7 36749874 chr7:36749874 G A
20 NPC1L1 rs00021 chr7 47201525 chr7:47201525 A C
21 NPC1L1 rs00022 chr7 24661435 chr7:24661435 G T
22 NPC1L1 rs00023 chr7 1107036 chr7:1107036 C T
23 NPC1L1 rs00024 chr7 4600516 chr7:4600516 T C
24 CETP rs00025 chr16 19054093 chr16:19054093 C A
25 CETP rs00026 chr16 24959866 chr16:24959866 A T
26 CETP rs00027 chr16 47073671 chr16:47073671 C A
27 CETP rs00028 chr16 34925485 chr16:34925485 A T
28 CETP rs00029 chr16 12048442 chr16:12048442 C T
29 CETP rs00030 chr16 29686549 chr16:29686549 G C
30 PPARA rs00031 chr22 8039064 chr22:8039064 G T
31 PPARA rs00032 chr22 28629784 chr22:28629784 T A
32 PPARA rs00033 chr22 1437663 chr22:1437663 C A
33 PPARA rs00034 chr22 30905065 chr22:30905065 A C
34 PPARA rs00035 chr22 41350377 chr22:41350377 G T
35 PPARA rs00036 chr22 37632612 chr22:37632612 G C

Extract the variant carriers

There are two ways to extract gene-level carriers: an indirect approach and a direct approach. The indirect approach extracts variant carriers first and then aggregates them to genes. The direct approach passes a file path straight to extract_genes, which handles both steps internally.

[3]:
# Load the plink-file
bed, fam, bim = examples.get_data('plink_example')

# Extract variant carriers
_, variant_carriers = extract_carriers(bed, variants, warning = False)
variant_carriers
[3]:
chr1:1138210:C:A chr1:5648006:A:G chr1:9474009:A:T chr1:18589438:T:G chr1:22843509:A:C chr1:25132332:A:C chr1:28845057:G:T chr1:30247857:C:A chr1:35045263:G:C chr1:42576549:C:A ... chr16:24959866:A:T chr16:29686549:G:C chr16:34925485:A:T chr16:47073671:C:A chr22:1437663:C:A chr22:8039064:G:T chr22:28629784:T:A chr22:30905065:A:C chr22:37632612:G:C chr22:41350377:G:T
ID_000000 0 0 -1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
ID_000001 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
ID_000002 -1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
ID_000003 0 0 0 0 0 0 0 0 0 0 ... 0 0 -1 0 0 0 -1 0 0 0
ID_000004 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ID_000995 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
ID_000996 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
ID_000997 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
ID_000998 0 0 -1 0 0 0 0 0 0 0 ... 0 -1 0 0 0 0 0 0 0 0
ID_000999 0 0 0 0 0 -1 0 0 0 0 ... -1 0 0 0 0 0 0 0 0 0

1000 rows × 36 columns

Extract the gene carriers

The variant carrier dataframe returned above can be passed directly to extract_genes to aggregate carriers to gene level.

[4]:
# Summarise to genes using variant carriers as input
_, gene_out = extract_genes(
    input_file = variant_carriers,
    genes = variants,
    incl_var = False,
    cat_column = 'gene',
)
gene_out
[4]:
PCSK9 ANGPTL3 HMGCR NPC1L1 CETP PPARA
ID_000000 0.0 NaN 0.0 0.0 0.0 0.0
ID_000001 0.0 0.0 0.0 0.0 0.0 0.0
ID_000002 NaN 0.0 0.0 0.0 0.0 0.0
ID_000003 0.0 0.0 0.0 0.0 NaN NaN
ID_000004 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ...
ID_000995 0.0 1.0 0.0 0.0 0.0 0.0
ID_000996 0.0 0.0 NaN 0.0 0.0 0.0
ID_000997 0.0 0.0 0.0 0.0 0.0 0.0
ID_000998 0.0 NaN 0.0 1.0 NaN 0.0
ID_000999 0.0 NaN 0.0 0.0 NaN 0.0

1000 rows × 6 columns

Controlling genotype value handling: neg_geno and sum_geno

Two parameters control what happens to special genotype values after gene aggregation.

``neg_geno`` — negative genotype values (e.g. PLINK no-call coded as -1): | Value | Effect | |——-|——–| | True (default) | Keep the raw value unchanged | | None | Replace with NA (treat as missing) | | integer / string (e.g. 0) | Replace all negative values with that value |

``sum_geno`` — genotype values greater than 1 (a sample carries multiple variants in the same gene): | Value | Effect | |——-|——–| | True (default) | Keep the raw value unchanged | | None | Replace with NA | | integer / string (e.g. 1) | Replace all values > 1 with that value — use sum_geno=1 for binary (0/1) carrier status |

In the example above (PLINK input), the default neg_geno=True means PLINK no-call values (-1) appear unchanged in the gene sum as NaN after aggregation (a negative sum is treated as missing by pandas arithmetic). To explicitly replace them with NA, pass neg_geno=None; to treat no-calls as non-carriers, pass neg_geno=0.

The direct approach passes a file path to extract_genes, which performs variant extraction and gene aggregation in a single step.

[ ]:
# Summarise genes pointing to the plink-path
summary, results = extract_genes(
    input_file = examples.get_path('plink_example', 'example_genotypes.bed'),
    genes = variants, incl_var = True, cat_column = 'gene',
)
[6]:
summary.head()
[6]:
variant count total frequency na_count valid_samples
0 NPC1L1 88 1000 0.088 125 875
1 PCSK9 65 1000 0.065 125 875
2 HMGCR 53 1000 0.053 131 869
3 PPARA 51 1000 0.051 135 865
4 CETP 50 1000 0.050 155 845
[7]:
results
[7]:
chr1:1138210:C:A chr1:5648006:A:G chr1:9474009:A:T chr1:18589438:T:G chr1:22843509:A:C chr1:25132332:A:C chr1:28845057:G:T chr1:30247857:C:A chr1:35045263:G:C chr1:42576549:C:A ... chr22:28629784:T:A chr22:30905065:A:C chr22:37632612:G:C chr22:41350377:G:T PCSK9 ANGPTL3 HMGCR NPC1L1 CETP PPARA
ID_000000 0 0 -1 0 0 0 0 0 0 0 ... 0 0 0 0 0.0 NaN 0.0 0.0 0.0 0.0
ID_000001 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
ID_000002 -1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 NaN 0.0 0.0 0.0 0.0 0.0
ID_000003 0 0 0 0 0 0 0 0 0 0 ... -1 0 0 0 0.0 0.0 0.0 0.0 NaN NaN
ID_000004 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ID_000995 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0.0 1.0 0.0 0.0 0.0 0.0
ID_000996 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0.0 0.0 NaN 0.0 0.0 0.0
ID_000997 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
ID_000998 0 0 -1 0 0 0 0 0 0 0 ... 0 0 0 0 0.0 NaN 0.0 1.0 NaN 0.0
ID_000999 0 0 0 0 0 -1 0 0 0 0 ... 0 0 0 0 0.0 NaN 0.0 0.0 NaN 0.0

1000 rows × 42 columns