Getting Started

MARVELous - Marion’s Analysis of Rare Variant Effects in Large data
version: 0.4.1a0
The MARVELous package is a toolkit to extract carriers of certain genetic variants and associate phenotypes with these carriers.
There is online documentation for MARVELous.
Installation instructions
At present, MARVELous is undergoing development and no packages exist yet on PyPi. Therefore it is recommended that you install in either of the two ways below. Please keep in mind that all development is carried out on MacOS and Python v3.12.
Using pip
Note: cyvcf2 is listed here but VCF example files (used in some tests) require
tabixfor CSI indexing, which is NOT bundled with the pip wheel. Install via conda to get tabix automatically: conda install -c bioconda cyvcf2 Or install htslib separately: conda install -c bioconda htslib PLINK and BGEN examples work without tabix. You can install using pip from the root of the cloned repository, first clone and cd into the repository root:
git clone git@gitlab.com:mvvugt/marvel.git
cd marvel
Install the dependencies:
python -m pip install --upgrade -r requirements.txt
Then install using pip
python -m pip install .
Or for an editable (developer) install run the command below from the root of
the repository. The difference with this is that you can just to a git pull
to update, or switch branches without re-installing:
python -m pip install -e .
Using Conda
There are also conda yaml environment files in ./resources/conda/envs that
have the same contents as requirements.txt but for conda packages, so all the
pre-requisites.
A new conda environment can be built using the command:
conda env create --file ./resources/conda/envs/conda_create.yaml
To add to an existing environment use:
conda env update --file ./resources/conda/envs/conda_update.yaml
Next the package can be installed:
make install
Verify installation
marvelous --version
marvelous --help
Development
For development work, install the package in editable mode with Git commit hooks configured:
make install-dev
This command installs the package in editable mode and configures Git commit
hooks, allowing you to run git pull to update the repository or switch
branches without reinstalling.
Alternatively, you can install manually:
python -m pip install -e .
python .setup_git_hooks.py
Git Hooks Configuration
When setting up a development environment, the setup-hooks command
configures Git hooks to enforce conventional commit message formatting and
spell check using codespell.
To view the commit message format requirements, run:
./.githooks/commit-msg -help
For frequent use, add this function to your shell configuration (~/.bashrc
or ~/.zshrc):
commit-format-help() {
local git_root
git_root=$(git rev-parse --show-toplevel 2>/dev/null)
if [ -z "$git_root" ]; then
echo "Error: Not inside a git repository"
return 1
fi
local hook_path="$git_root/.githooks/commit-msg"
if [ ! -f "$hook_path" ]; then
echo "Error: commit-msg hook not found"
return 1
fi
"$hook_path" --help
}
If you have already installed the package in editable mode without running
_setup_git_hooks.py, you can configure the hooks manually at any time by
running:
_setup_git_hooks.py
Run the tests
After installation you will be able to run the tests. If you have cloned the repository you can run the command below from the root of the repository:
pytest ./tests
Get help
To get help, type marvelous -h
Quick start example
Generate example data files:
marvelous example generate-data --outdir ./marvel_example
Create a configuration file pointing to them using create_config():
from marvel.utils.config_tools import create_config
create_config(
path='./marvel_example/example.cnf',
geno_input={'chr1': './marvel_example/example_genotypes_chr1.vcf.gz'},
variant_input={'variants': './marvel_example/gene_variants.tsv'},
variant_output={'VarOutput': './marvel_example/variant_carriers'},
pheno_input={
'phenotypes': './marvel_example/phenotypes.tsv',
'covariates': './marvel_example/covariates.tsv',
},
con_tests={'sbp': ['OLS', 'KW']},
id_column='id',
)
Then run the analysis:
marvelous marvel_example/example.cnf -v --log-file marvel_example/example.log
For more examples, see resources/examples.
Change log
version 0.4.0
Refactored package completely to a class structure
FEAT - Survival (time-to-event) analysis support
FEAT - Stratified analyses (renamed from filtering;
[Stratification]section)FEAT - Variant input pre-filtering on genomic regions (
prefilter_regions)FIX - Pairwise missingness handling
Quick Start
This section walks through a minimal example to demonstrate MARVELous functionality.
Step 1: Prepare Your Data
MARVELous consist of two main parts, for which different requirements exist.
For variant extraction MARVELous requires:
Genetic data - VCF, BGEN, or PLINK files
Variant list - Tab-separated file specifying variants to extract
For association analyses MARVELous requires:
Exposure data - Output of variant extraction or similar
Phenotype data - Tab-separated file with outcome variables
Covariate data (optional) - Tab-separated file with covariates
Both steps can also be performed at once, which would logically require the genetic data, variant list, phenotype data, and optionally covariate data.
Example variant list (variants.tsv):
chr pos a1 a2 gene
22 1437663 C A GENE1
22 28629784 C T GENE1
22 37632612 G C GENE2
22 37638692 G C GENE2
Example phenotype file (phenotypes.tsv):
id blood_pressure diabetes
ID_000001 120.5 0
ID_000002 135.2 1
ID_000003 128.7 0
Step 2: Create Configuration File
Create config.cnf replacing the paths with correct absolute paths:
[GenoInput]
chr22 /path/to/example_genotypes_chr22.vcf.gz
[VarInput]
variants /path/to/variants.tsv
[PhenoInput]
phenotypes /path/to/phenotypes.tsv
[ConTests]
blood_pressure OLS;KW
[BinTests]
diabetes GLM-Binom;CHISQ
[Covs]
Unadjusted None
[Output]
VarOutput /path/to/results/carriers
[Options]
extract_variants True
association_analysis True
id_column id
cat_column gene
incl_var True
output_path /path/to/results
Step 3: Validate Configuration
Run a dry run to check your configuration:
marvelous config.cnf --dry-run -v
If successful, you’ll see something like:
=== MARVEL (marvel v0.4.0a0) ===
[info] 13:46:10 > config_file value: config.cnf
[info] 13:46:10 > dry_run value: True
[info] 13:46:10 > log_file value: None
[info] 13:46:10 > outpath value: None
[info] 13:46:10 > verbose value: True
[info] 13:46:10 > Dry run mode - validating configuration only
[info] 13:46:10 > Configuration is valid!
[info] 13:46:10 > Configuration summary: {'extract_variants': True, 'association_analysis': True, 'output_path': './results'}
Step 4: Run Analysis
Run the full pipeline:
marvelous config.cnf -v
You’ll see output similar to:
=== MARVEL (marvel v0.4.0a0) ===
[info] 14:49:35 > config_file value: config.cnf
[info] 14:49:35 > dry_run value: False
[info] 14:49:35 > log_file value: None
[info] 14:49:35 > outpath value: None
[info] 14:49:35 > verbose value: True
[info] 14:49:35 > ================================================================================
[info] 14:49:35 > MARVELous Pipeline Starting
[info] 14:49:35 > ================================================================================
[info] 14:49:35 >
[info] 14:49:35 > STEP 1: VARIANT EXTRACTION
[info] 14:49:35 > --------------------------------------------------------------------------------
[info] 14:49:35 > Starting variant extraction
[info] 14:49:35 > Set up 1 extraction task(s)
[info] 14:49:35 > Gene aggregation enabled - using two-phase approach
[info] 14:49:35 > Checkpoint directory: ./tmp_check
[info] 14:49:35 > Will skip tasks with existing output files
[info] 14:49:35 > Starting parallel extraction with 12 workers
[info] 14:49:36 > [1/1] Loaded from checkpoint: chr22_variants
[info] 14:49:36 > Loaded 1 tasks from checkpoints
[info] 14:49:37 > Aggregating 2 variants to genes...
[info] 14:49:37 > Successful: 1
[info] 14:49:37 > Saved summary: ./results/carriers_summary.tsv.gz
[info] 14:49:37 > Saved carriers: ./results/carriers_carriers.tsv.gz
[info] 14:49:37 > Extraction complete: 1 file(s) created
[info] 14:49:37 >
[info] 14:49:37 > STEP 2: ASSOCIATION TESTING
[info] 14:49:37 > --------------------------------------------------------------------------------
[info] 14:49:37 > Starting association testing
[info] 14:49:37 > Testing 5 exposure(s)
[info] 14:49:37 > Checking missingness thresholds (max allowed: 50.0%)
[warning] 14:49:37 > Found 1 exposure(s) with missingness > 50.0%:
- 22:28629784:C:T_chr22:28629784:T:A: 1000/1000 missing (100.00%)
These exposures will be excluded from analysis.
[info] 14:49:37 > Removed 1 exposure(s) with high missingness: 22:28629784:C:T_chr22:28629784:T:A
[info] 14:49:37 > Missingness filtering complete. Retained 2 outcomes, 4 exposures
[info] 14:49:37 > Pre-computing stratification model: 'overall'
[info] 14:49:37 > Computing valid IDs for listwise deletion
[info] 14:49:37 > Processing file: /Users/marionvanvugt/Documents/work.nosync/projects/marvel/phenotypes.tsv
[info] 14:49:37 > Final valid IDs across all files: 3
[info] 14:49:37 > Phenotype file: 3 valid samples (checked 2 outcomes + 0 time columns)
[info] 14:49:37 > Processing file: ./results/carriers_carriers.tsv.gz
[info] 14:49:37 > ID extraction: 948 valid IDs from 1000 samples (52 removed, 5.20%)
[info] 14:49:37 > Final valid IDs across all files: 948
[info] 14:49:37 > Exposure files: 948 valid samples
[info] 14:49:37 > Listwise deletion: 3 valid samples (intersection)
[info] 14:49:37 > Pre-computed 1 stratification model(s)
[info] 14:49:37 > Testing exposure: 22:1437663:C:A
[info] 14:49:37 > Loaded: {0.0: 958, 1.0: 16}
[info] 14:49:37 > Stratification model: 'overall'
[info] 14:49:37 > Cached exposure '22:1437663:C:A' in DataManager
[info] 14:49:37 > Completed: 4 results rows generated
[info] 14:49:37 > Baseline table saved: ./results/22_1437663_C_A_overall_baseline.tsv
[info] 14:49:37 > Results saved: ./results/22_1437663_C_A_results.tsv.gz (4 rows from 1 stratum/strata)
[info] 14:49:37 > Testing exposure: 22:37632612:G:C
[info] 14:49:37 > Loaded: {0.0: 968, 1.0: 4}
[info] 14:49:37 > Stratification model: 'overall'
[info] 14:49:37 > Cached exposure '22:37632612:G:C' in DataManager
[info] 14:49:37 > Completed: 4 results rows generated
[info] 14:49:37 > Baseline table saved: ./results/22_37632612_G_C_overall_baseline.tsv
[info] 14:49:37 > Results saved: ./results/22_37632612_G_C_results.tsv.gz (4 rows from 1 stratum/strata)
[info] 14:49:37 > Testing exposure: GENE1
[info] 14:49:37 > Loaded: {0.0: 984, 1.0: 16}
[info] 14:49:37 > Stratification model: 'overall'
[info] 14:49:37 > Cached exposure 'GENE1' in DataManager
[info] 14:49:37 > Completed: 4 results rows generated
[info] 14:49:37 > Baseline table saved: ./results/GENE1_overall_baseline.tsv
[info] 14:49:37 > Results saved: ./results/GENE1_results.tsv.gz (4 rows from 1 stratum/strata)
[info] 14:49:37 > Testing exposure: GENE2
[info] 14:49:37 > Loaded: {0.0: 996, 1.0: 4}
[info] 14:49:37 > Stratification model: 'overall'
[info] 14:49:37 > Cached exposure 'GENE2' in DataManager
[info] 14:49:37 > Completed: 4 results rows generated
[info] 14:49:37 > Baseline table saved: ./results/GENE2_overall_baseline.tsv
[info] 14:49:37 > Results saved: ./results/GENE2_results.tsv.gz (4 rows from 1 stratum/strata)
[info] 14:49:37 > DataManager cache statistics (stratification 'overall'):
[info] 14:49:37 > Outcomes: 6 hits, 2 misses, hit rate: 75.00%
[info] 14:49:37 > Covariates: 0 hits, 0 misses, hit rate: 0.00%
[info] 14:49:37 > Exposure reuses: 0
[info] 14:49:37 > Association testing complete
[info] 14:49:37 >
[info] 14:49:37 > ================================================================================
[info] 14:49:37 > MARVELous Pipeline Completed Successfully
[info] 14:49:37 > ================================================================================
[warning] 14:49:37 > Warning summary (Python warnings, repeated):
[warning] 14:49:37 > [FutureWarning] (8x) The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
[warning] 14:49:37 > [UserWarning] (4x) Could not determine exposed/non exposed numbers for ['diabetes']: (1, 'N (%) or Mean (SD)')
[info] 14:49:37 >
[info] 14:49:37 > Pipeline Summary:
[info] 14:49:37 > config: {'extract_variants': True, 'association_analysis': True, 'output_path': './results'}
[info] 14:49:37 > extraction: {'files_created': 1, 'successful_tasks': 1, 'failed_tasks': 0}
[info] 14:49:37 > association: {'completed': 4, 'failed': 0, 'skipped': 0}
Step 5: Check Results
Output files are created in the specified output_path. The pipeline
creates one file with the carriership per individuals
(carriers_carriers.tsv.gz), a genotype summary
(carriers_summary.tsv.gz) and a results (XXX_results.tsv.gz) and
baseline file (XXX_baseline.tsv) per exposure tested. In this case, there
is therefore a results and baseline file for the two variants identified in
the genetics file and the two aggregates, the genes.:
ls ./results/
# Output:
# carriers_carriers.tsv.gz
# carriers_summary.tsv.gz
# 22_1437663_C_A_overall_baseline.tsv
# 22_1437663_C_A_results.tsv.gz
# 22_37632612_G_C_overall_baseline.tsv
# 22_37632612_G_C_results.tsv.gz
# GENE1_results.tsv.gz
# GENE1_baseline.tsv
# GENE2_results.tsv.gz
# GENE2_baseline.tsv
View association results:
zcat ./results/GENE1_results.tsv.gz
Model Model name Stratification name Variable Exposure Cases ...
NA Unadjusted overall blood_pressure GENE1 NA ...
GLM-Binom Unadjusted overall diabetes Intercept 1.0 ...
GLM-Binom Unadjusted overall diabetes GENE1 1.0 ...
CHISQ Unadjusted overall diabetes GENE1 1.0 ...
Next Steps
Configuration Reference - Complete configuration reference
Usage Guide - Detailed usage guide
Statistical Tests Guide - Statistical tests reference
Advanced Features - Advanced features like gene aggregation and missingness handling
Getting Help
Documentation: https://mvvugt.gitlab.io/marvel/index.html
Help:
marvelous --help