Gene/Category Aggregation
This notebook demonstrates how to aggregate individual variant carriers to gene (or category) level using extract_genes. It covers:
The indirect approach: extract variant carriers first, then aggregate to genes
The direct approach: pass a genetic file path to
extract_genesdirectlyControlling how negative and multi-carrier genotype values are handled (
neg_geno,sum_geno)
[1]:
# imports
import os
import pandas as pd
from marvel.extraction.extraction import extract_carriers
from marvel.extraction.aggregation import extract_genes
from marvel.example_data import examples
/Users/marionvanvugt/Documents/work/projects/marvel/marvel/utils/utils.py:455: Warning: The environmental variable `MARVEL_TEST_DEFS` is not set. Trying to recover using the default configuration files in /Users/marionvanvugt/Documents/work/projects/marvel/marvel/association/tests.py.
warnings.warn(
Load the data
First, we’ll load the example data. We’re interested in seven genetic variants and want to extract all carriers from a PLINK dataset. These variants are defined in the variants dataframe. For illustration purposes, the PLINK file will not contain all variants we’re interested in, and one variant has different alleles from the one we’re looking for.
[2]:
# Load variants to extract
variants = examples.get_data('dummy_variants')
# Show example
variants
[2]:
| gene | rsid | chr | pos | chr_pos | a1 | a2 | |
|---|---|---|---|---|---|---|---|
| 0 | PCSK9 | rs00001 | chr1 | 22843509 | chr1:22843509 | A | C |
| 1 | PCSK9 | rs00002 | chr1 | 18589438 | chr1:18589438 | T | G |
| 2 | PCSK9 | rs00003 | chr1 | 47423257 | chr1:47423257 | A | T |
| 3 | PCSK9 | rs00004 | chr1 | 1138210 | chr1:1138210 | C | A |
| 4 | PCSK9 | rs00005 | chr1 | 5648006 | chr1:5648006 | A | G |
| 5 | PCSK9 | rs00006 | chr1 | 44328361 | chr1:44328361 | A | G |
| 6 | ANGPTL3 | rs00007 | chr1 | 35045263 | chr1:35045263 | G | C |
| 7 | ANGPTL3 | rs00008 | chr1 | 9474009 | chr1:9474009 | A | T |
| 8 | ANGPTL3 | rs00009 | chr1 | 25132332 | chr1:25132332 | A | C |
| 9 | ANGPTL3 | rs00010 | chr1 | 42576549 | chr1:42576549 | C | A |
| 10 | ANGPTL3 | rs00011 | chr1 | 30247857 | chr1:30247857 | C | A |
| 11 | ANGPTL3 | rs00012 | chr1 | 28845057 | chr1:28845057 | G | T |
| 12 | HMGCR | rs00013 | chr5 | 34047537 | chr5:34047537 | G | C |
| 13 | HMGCR | rs00014 | chr5 | 35835987 | chr5:35835987 | A | T |
| 14 | HMGCR | rs00015 | chr5 | 45536080 | chr5:45536080 | A | G |
| 15 | HMGCR | rs00016 | chr5 | 1569066 | chr5:1569066 | G | T |
| 16 | HMGCR | rs00017 | chr5 | 46013857 | chr5:46013857 | A | T |
| 17 | HMGCR | rs00018 | chr5 | 18521034 | chr5:18521034 | T | C |
| 18 | NPC1L1 | rs00019 | chr7 | 4561666 | chr7:4561666 | G | A |
| 19 | NPC1L1 | rs00020 | chr7 | 36749874 | chr7:36749874 | G | A |
| 20 | NPC1L1 | rs00021 | chr7 | 47201525 | chr7:47201525 | A | C |
| 21 | NPC1L1 | rs00022 | chr7 | 24661435 | chr7:24661435 | G | T |
| 22 | NPC1L1 | rs00023 | chr7 | 1107036 | chr7:1107036 | C | T |
| 23 | NPC1L1 | rs00024 | chr7 | 4600516 | chr7:4600516 | T | C |
| 24 | CETP | rs00025 | chr16 | 19054093 | chr16:19054093 | C | A |
| 25 | CETP | rs00026 | chr16 | 24959866 | chr16:24959866 | A | T |
| 26 | CETP | rs00027 | chr16 | 47073671 | chr16:47073671 | C | A |
| 27 | CETP | rs00028 | chr16 | 34925485 | chr16:34925485 | A | T |
| 28 | CETP | rs00029 | chr16 | 12048442 | chr16:12048442 | C | T |
| 29 | CETP | rs00030 | chr16 | 29686549 | chr16:29686549 | G | C |
| 30 | PPARA | rs00031 | chr22 | 8039064 | chr22:8039064 | G | T |
| 31 | PPARA | rs00032 | chr22 | 28629784 | chr22:28629784 | T | A |
| 32 | PPARA | rs00033 | chr22 | 1437663 | chr22:1437663 | C | A |
| 33 | PPARA | rs00034 | chr22 | 30905065 | chr22:30905065 | A | C |
| 34 | PPARA | rs00035 | chr22 | 41350377 | chr22:41350377 | G | T |
| 35 | PPARA | rs00036 | chr22 | 37632612 | chr22:37632612 | G | C |
Extract the variant carriers
There are two ways to extract gene-level carriers: an indirect approach and a direct approach. The indirect approach extracts variant carriers first and then aggregates them to genes. The direct approach passes a file path straight to extract_genes, which handles both steps internally.
[3]:
# Load the plink-file
bed, fam, bim = examples.get_data('plink_example')
# Extract variant carriers
_, variant_carriers = extract_carriers(bed, variants, warning = False)
variant_carriers
[3]:
| chr1:1138210:C:A | chr1:5648006:A:G | chr1:9474009:A:T | chr1:18589438:T:G | chr1:22843509:A:C | chr1:25132332:A:C | chr1:28845057:G:T | chr1:30247857:C:A | chr1:35045263:G:C | chr1:42576549:C:A | ... | chr16:24959866:A:T | chr16:29686549:G:C | chr16:34925485:A:T | chr16:47073671:C:A | chr22:1437663:C:A | chr22:8039064:G:T | chr22:28629784:T:A | chr22:30905065:A:C | chr22:37632612:G:C | chr22:41350377:G:T | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID_000000 | 0 | 0 | -1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ID_000001 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ID_000002 | -1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ID_000003 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | -1 | 0 | 0 | 0 | -1 | 0 | 0 | 0 |
| ID_000004 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ID_000995 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ID_000996 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ID_000997 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ID_000998 | 0 | 0 | -1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | -1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ID_000999 | 0 | 0 | 0 | 0 | 0 | -1 | 0 | 0 | 0 | 0 | ... | -1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1000 rows × 36 columns
Extract the gene carriers
The variant carrier dataframe returned above can be passed directly to extract_genes to aggregate carriers to gene level.
[4]:
# Summarise to genes using variant carriers as input
_, gene_out = extract_genes(
input_file = variant_carriers,
genes = variants,
incl_var = False,
cat_column = 'gene',
)
gene_out
[4]:
| PCSK9 | ANGPTL3 | HMGCR | NPC1L1 | CETP | PPARA | |
|---|---|---|---|---|---|---|
| ID_000000 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| ID_000001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ID_000002 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ID_000003 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN |
| ID_000004 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... |
| ID_000995 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ID_000996 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 |
| ID_000997 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ID_000998 | 0.0 | NaN | 0.0 | 1.0 | NaN | 0.0 |
| ID_000999 | 0.0 | NaN | 0.0 | 0.0 | NaN | 0.0 |
1000 rows × 6 columns
Controlling genotype value handling: neg_geno and sum_geno
Two parameters control what happens to special genotype values after gene aggregation.
``neg_geno`` — negative genotype values (e.g. PLINK no-call coded as -1): | Value | Effect | |——-|——–| | True (default) | Keep the raw value unchanged | | None | Replace with NA (treat as missing) | | integer / string (e.g. 0) | Replace all negative values with that value |
``sum_geno`` — genotype values greater than 1 (a sample carries multiple variants in the same gene): | Value | Effect | |——-|——–| | True (default) | Keep the raw value unchanged | | None | Replace with NA | | integer / string (e.g. 1) | Replace all values > 1 with that value — use sum_geno=1 for binary (0/1) carrier status |
In the example above (PLINK input), the default neg_geno=True means PLINK no-call values (-1) appear unchanged in the gene sum as NaN after aggregation (a negative sum is treated as missing by pandas arithmetic). To explicitly replace them with NA, pass neg_geno=None; to treat no-calls as non-carriers, pass neg_geno=0.
The direct approach passes a file path to extract_genes, which performs variant extraction and gene aggregation in a single step.
[ ]:
# Summarise genes pointing to the plink-path
summary, results = extract_genes(
input_file = examples.get_path('plink_example', 'example_genotypes.bed'),
genes = variants, incl_var = True, cat_column = 'gene',
)
[6]:
summary.head()
[6]:
| variant | count | total | frequency | na_count | valid_samples | |
|---|---|---|---|---|---|---|
| 0 | NPC1L1 | 88 | 1000 | 0.088 | 125 | 875 |
| 1 | PCSK9 | 65 | 1000 | 0.065 | 125 | 875 |
| 2 | HMGCR | 53 | 1000 | 0.053 | 131 | 869 |
| 3 | PPARA | 51 | 1000 | 0.051 | 135 | 865 |
| 4 | CETP | 50 | 1000 | 0.050 | 155 | 845 |
[7]:
results
[7]:
| chr1:1138210:C:A | chr1:5648006:A:G | chr1:9474009:A:T | chr1:18589438:T:G | chr1:22843509:A:C | chr1:25132332:A:C | chr1:28845057:G:T | chr1:30247857:C:A | chr1:35045263:G:C | chr1:42576549:C:A | ... | chr22:28629784:T:A | chr22:30905065:A:C | chr22:37632612:G:C | chr22:41350377:G:T | PCSK9 | ANGPTL3 | HMGCR | NPC1L1 | CETP | PPARA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID_000000 | 0 | 0 | -1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| ID_000001 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ID_000002 | -1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ID_000003 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -1 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN |
| ID_000004 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ID_000995 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ID_000996 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 |
| ID_000997 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ID_000998 | 0 | 0 | -1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.0 | NaN | 0.0 | 1.0 | NaN | 0.0 |
| ID_000999 | 0 | 0 | 0 | 0 | 0 | -1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.0 | NaN | 0.0 | 0.0 | NaN | 0.0 |
1000 rows × 42 columns