Getting Started

Banner

MARVELous - Marion’s Analysis of Rare Variant Effects in Large data

version: 0.4.1a0

The MARVELous package is a toolkit to extract carriers of certain genetic variants and associate phenotypes with these carriers.

There is online documentation for MARVELous.

Installation instructions

At present, MARVELous is undergoing development and no packages exist yet on PyPi. Therefore it is recommended that you install in either of the two ways below. Please keep in mind that all development is carried out on MacOS and Python v3.12.

Using pip

Note: cyvcf2 is listed here but VCF example files (used in some tests) require tabix for CSI indexing, which is NOT bundled with the pip wheel. Install via conda to get tabix automatically: conda install -c bioconda cyvcf2 Or install htslib separately: conda install -c bioconda htslib PLINK and BGEN examples work without tabix. You can install using pip from the root of the cloned repository, first clone and cd into the repository root:

git clone git@gitlab.com:mvvugt/marvel.git
cd marvel

Install the dependencies:

python -m pip install --upgrade -r requirements.txt

Then install using pip

python -m pip install .

Or for an editable (developer) install run the command below from the root of the repository. The difference with this is that you can just to a git pull to update, or switch branches without re-installing:

python -m pip install -e .

Using Conda

There are also conda yaml environment files in ./resources/conda/envs that have the same contents as requirements.txt but for conda packages, so all the pre-requisites.

A new conda environment can be built using the command:

conda env create --file ./resources/conda/envs/conda_create.yaml

To add to an existing environment use:

conda env update --file ./resources/conda/envs/conda_update.yaml

Next the package can be installed:

make install

Verify installation

marvelous --version
marvelous --help
Development

For development work, install the package in editable mode with Git commit hooks configured:

make install-dev

This command installs the package in editable mode and configures Git commit hooks, allowing you to run git pull to update the repository or switch branches without reinstalling.

Alternatively, you can install manually:

python -m pip install -e .
python .setup_git_hooks.py
Git Hooks Configuration

When setting up a development environment, the setup-hooks command configures Git hooks to enforce conventional commit message formatting and spell check using codespell.

To view the commit message format requirements, run:

./.githooks/commit-msg -help

For frequent use, add this function to your shell configuration (~/.bashrc or ~/.zshrc):

commit-format-help() {
    local git_root
    git_root=$(git rev-parse --show-toplevel 2>/dev/null)

    if [ -z "$git_root" ]; then
        echo "Error: Not inside a git repository"
        return 1
    fi

    local hook_path="$git_root/.githooks/commit-msg"

    if [ ! -f "$hook_path" ]; then
        echo "Error: commit-msg hook not found"
        return 1
    fi

    "$hook_path" --help
}

If you have already installed the package in editable mode without running _setup_git_hooks.py, you can configure the hooks manually at any time by running:

_setup_git_hooks.py

Run the tests

After installation you will be able to run the tests. If you have cloned the repository you can run the command below from the root of the repository:

pytest ./tests

Get help

To get help, type marvelous -h

Quick start example

Generate example data files:

marvelous example generate-data --outdir ./marvel_example

Create a configuration file pointing to them using create_config():

from marvel.utils.config_tools import create_config

create_config(
    path='./marvel_example/example.cnf',
    geno_input={'chr1': './marvel_example/example_genotypes_chr1.vcf.gz'},
    variant_input={'variants': './marvel_example/gene_variants.tsv'},
    variant_output={'VarOutput': './marvel_example/variant_carriers'},
    pheno_input={
        'phenotypes': './marvel_example/phenotypes.tsv',
        'covariates': './marvel_example/covariates.tsv',
    },
    con_tests={'sbp': ['OLS', 'KW']},
    id_column='id',
)

Then run the analysis:

marvelous marvel_example/example.cnf -v --log-file marvel_example/example.log

For more examples, see resources/examples.

Change log

version 0.4.0
  • Refactored package completely to a class structure

  • FEAT - Survival (time-to-event) analysis support

  • FEAT - Stratified analyses (renamed from filtering; [Stratification] section)

  • FEAT - Variant input pre-filtering on genomic regions (prefilter_regions)

  • FIX - Pairwise missingness handling

Quick Start

This section walks through a minimal example to demonstrate MARVELous functionality.

Step 1: Prepare Your Data

MARVELous consist of two main parts, for which different requirements exist.

For variant extraction MARVELous requires:

  1. Genetic data - VCF, BGEN, or PLINK files

  2. Variant list - Tab-separated file specifying variants to extract

For association analyses MARVELous requires:

  1. Exposure data - Output of variant extraction or similar

  2. Phenotype data - Tab-separated file with outcome variables

  3. Covariate data (optional) - Tab-separated file with covariates

Both steps can also be performed at once, which would logically require the genetic data, variant list, phenotype data, and optionally covariate data.

Example variant list (variants.tsv):

chr  pos     a1      a2      gene
22   1437663 C       A       GENE1
22   28629784        C       T       GENE1
22   37632612        G       C       GENE2
22   37638692        G       C       GENE2

Example phenotype file (phenotypes.tsv):

id   blood_pressure  diabetes
ID_000001    120.5   0
ID_000002    135.2   1
ID_000003    128.7   0

Step 2: Create Configuration File

Create config.cnf replacing the paths with correct absolute paths:

[GenoInput]
chr22       /path/to/example_genotypes_chr22.vcf.gz

[VarInput]
variants    /path/to/variants.tsv

[PhenoInput]
phenotypes  /path/to/phenotypes.tsv

[ConTests]
blood_pressure      OLS;KW

[BinTests]
diabetes    GLM-Binom;CHISQ

[Covs]
Unadjusted  None

[Output]
VarOutput   /path/to/results/carriers

[Options]
extract_variants    True
association_analysis        True
id_column   id
cat_column  gene
incl_var    True
output_path /path/to/results

Step 3: Validate Configuration

Run a dry run to check your configuration:

marvelous config.cnf --dry-run -v

If successful, you’ll see something like:

=== MARVEL (marvel v0.4.0a0) ===
[info] 13:46:10 > config_file value: config.cnf
[info] 13:46:10 > dry_run value: True
[info] 13:46:10 > log_file value: None
[info] 13:46:10 > outpath value: None
[info] 13:46:10 > verbose value: True
[info] 13:46:10 > Dry run mode - validating configuration only
[info] 13:46:10 > Configuration is valid!
[info] 13:46:10 > Configuration summary: {'extract_variants': True, 'association_analysis': True, 'output_path': './results'}

Step 4: Run Analysis

Run the full pipeline:

marvelous config.cnf -v

You’ll see output similar to:

=== MARVEL (marvel v0.4.0a0) ===
[info] 14:49:35 > config_file value: config.cnf
[info] 14:49:35 > dry_run value: False
[info] 14:49:35 > log_file value: None
[info] 14:49:35 > outpath value: None
[info] 14:49:35 > verbose value: True
[info] 14:49:35 > ================================================================================
[info] 14:49:35 > MARVELous Pipeline Starting
[info] 14:49:35 > ================================================================================
[info] 14:49:35 >
[info] 14:49:35 > STEP 1: VARIANT EXTRACTION
[info] 14:49:35 > --------------------------------------------------------------------------------
[info] 14:49:35 > Starting variant extraction
[info] 14:49:35 > Set up 1 extraction task(s)
[info] 14:49:35 > Gene aggregation enabled - using two-phase approach
[info] 14:49:35 > Checkpoint directory: ./tmp_check
[info] 14:49:35 > Will skip tasks with existing output files
[info] 14:49:35 > Starting parallel extraction with 12 workers
[info] 14:49:36 > [1/1] Loaded from checkpoint: chr22_variants
[info] 14:49:36 > Loaded 1 tasks from checkpoints
[info] 14:49:37 > Aggregating 2 variants to genes...
[info] 14:49:37 > Successful: 1
[info] 14:49:37 > Saved summary: ./results/carriers_summary.tsv.gz
[info] 14:49:37 > Saved carriers: ./results/carriers_carriers.tsv.gz
[info] 14:49:37 > Extraction complete: 1 file(s) created
[info] 14:49:37 >
[info] 14:49:37 > STEP 2: ASSOCIATION TESTING
[info] 14:49:37 > --------------------------------------------------------------------------------
[info] 14:49:37 > Starting association testing
[info] 14:49:37 > Testing 5 exposure(s)
[info] 14:49:37 > Checking missingness thresholds (max allowed: 50.0%)
[warning] 14:49:37 > Found 1 exposure(s) with missingness > 50.0%:
    - 22:28629784:C:T_chr22:28629784:T:A: 1000/1000 missing (100.00%)
    These exposures will be excluded from analysis.
[info] 14:49:37 > Removed 1 exposure(s) with high missingness: 22:28629784:C:T_chr22:28629784:T:A
[info] 14:49:37 > Missingness filtering complete. Retained 2 outcomes, 4 exposures
[info] 14:49:37 > Pre-computing stratification model: 'overall'
[info] 14:49:37 > Computing valid IDs for listwise deletion
[info] 14:49:37 > Processing file: /Users/marionvanvugt/Documents/work.nosync/projects/marvel/phenotypes.tsv
[info] 14:49:37 > Final valid IDs across all files: 3
[info] 14:49:37 > Phenotype file: 3 valid samples (checked 2 outcomes + 0 time columns)
[info] 14:49:37 > Processing file: ./results/carriers_carriers.tsv.gz
[info] 14:49:37 > ID extraction: 948 valid IDs from 1000 samples (52 removed, 5.20%)
[info] 14:49:37 > Final valid IDs across all files: 948
[info] 14:49:37 > Exposure files: 948 valid samples
[info] 14:49:37 > Listwise deletion: 3 valid samples (intersection)
[info] 14:49:37 > Pre-computed 1 stratification model(s)
[info] 14:49:37 > Testing exposure: 22:1437663:C:A
[info] 14:49:37 >   Loaded: {0.0: 958, 1.0: 16}
[info] 14:49:37 >   Stratification model: 'overall'
[info] 14:49:37 > Cached exposure '22:1437663:C:A' in DataManager
[info] 14:49:37 >    Completed: 4 results rows generated
[info] 14:49:37 > Baseline table saved: ./results/22_1437663_C_A_overall_baseline.tsv
[info] 14:49:37 > Results saved: ./results/22_1437663_C_A_results.tsv.gz (4 rows from 1 stratum/strata)
[info] 14:49:37 > Testing exposure: 22:37632612:G:C
[info] 14:49:37 >   Loaded: {0.0: 968, 1.0: 4}
[info] 14:49:37 >   Stratification model: 'overall'
[info] 14:49:37 > Cached exposure '22:37632612:G:C' in DataManager
[info] 14:49:37 >    Completed: 4 results rows generated
[info] 14:49:37 > Baseline table saved: ./results/22_37632612_G_C_overall_baseline.tsv
[info] 14:49:37 > Results saved: ./results/22_37632612_G_C_results.tsv.gz (4 rows from 1 stratum/strata)
[info] 14:49:37 > Testing exposure: GENE1
[info] 14:49:37 >   Loaded: {0.0: 984, 1.0: 16}
[info] 14:49:37 >   Stratification model: 'overall'
[info] 14:49:37 > Cached exposure 'GENE1' in DataManager
[info] 14:49:37 >    Completed: 4 results rows generated
[info] 14:49:37 > Baseline table saved: ./results/GENE1_overall_baseline.tsv
[info] 14:49:37 > Results saved: ./results/GENE1_results.tsv.gz (4 rows from 1 stratum/strata)
[info] 14:49:37 > Testing exposure: GENE2
[info] 14:49:37 >   Loaded: {0.0: 996, 1.0: 4}
[info] 14:49:37 >   Stratification model: 'overall'
[info] 14:49:37 > Cached exposure 'GENE2' in DataManager
[info] 14:49:37 >    Completed: 4 results rows generated
[info] 14:49:37 > Baseline table saved: ./results/GENE2_overall_baseline.tsv
[info] 14:49:37 > Results saved: ./results/GENE2_results.tsv.gz (4 rows from 1 stratum/strata)
[info] 14:49:37 > DataManager cache statistics (stratification 'overall'):
[info] 14:49:37 >   Outcomes: 6 hits, 2 misses, hit rate: 75.00%
[info] 14:49:37 >   Covariates: 0 hits, 0 misses, hit rate: 0.00%
[info] 14:49:37 >   Exposure reuses: 0
[info] 14:49:37 > Association testing complete
[info] 14:49:37 >
[info] 14:49:37 > ================================================================================
[info] 14:49:37 > MARVELous Pipeline Completed Successfully
[info] 14:49:37 > ================================================================================
[warning] 14:49:37 > Warning summary (Python warnings, repeated):
[warning] 14:49:37 >   [FutureWarning] (8x) The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
[warning] 14:49:37 >   [UserWarning] (4x) Could not determine exposed/non exposed numbers for ['diabetes']: (1, 'N (%) or Mean (SD)')
[info] 14:49:37 >
[info] 14:49:37 > Pipeline Summary:
[info] 14:49:37 >   config: {'extract_variants': True, 'association_analysis': True, 'output_path': './results'}
[info] 14:49:37 >   extraction: {'files_created': 1, 'successful_tasks': 1, 'failed_tasks': 0}
[info] 14:49:37 >   association: {'completed': 4, 'failed': 0, 'skipped': 0}

Step 5: Check Results

Output files are created in the specified output_path. The pipeline creates one file with the carriership per individuals (carriers_carriers.tsv.gz), a genotype summary (carriers_summary.tsv.gz) and a results (XXX_results.tsv.gz) and baseline file (XXX_baseline.tsv) per exposure tested. In this case, there is therefore a results and baseline file for the two variants identified in the genetics file and the two aggregates, the genes.:

ls ./results/

# Output:
# carriers_carriers.tsv.gz
# carriers_summary.tsv.gz
# 22_1437663_C_A_overall_baseline.tsv
# 22_1437663_C_A_results.tsv.gz
# 22_37632612_G_C_overall_baseline.tsv
# 22_37632612_G_C_results.tsv.gz
# GENE1_results.tsv.gz
# GENE1_baseline.tsv
# GENE2_results.tsv.gz
# GENE2_baseline.tsv

View association results:

zcat ./results/GENE1_results.tsv.gz
Model       Model name      Stratification name     Variable        Exposure        Cases   ...
NA  Unadjusted      overall blood_pressure  GENE1   NA      ...
GLM-Binom   Unadjusted      overall diabetes        Intercept       1.0     ...
GLM-Binom   Unadjusted      overall diabetes        GENE1   1.0     ...
CHISQ       Unadjusted      overall diabetes        GENE1   1.0     ...

Next Steps

Getting Help