Getting Started
===============

.. include:: ../../README.md
   :parser: myst_parser.sphinx_

Quick Start
-----------

This section walks through a minimal example to demonstrate MARVELous functionality.

Step 1: Prepare Your Data
^^^^^^^^^^^^^^^^^^^^^^^^^

MARVELous consists of two main parts, each with its own input requirements.

For **variant extraction** MARVELous requires:

1. **Genetic data** - VCF, BGEN, or PLINK files
2. **Variant list** - Tab-separated file specifying variants to extract

For **association analyses** MARVELous requires:

1. **Exposure data** - Output of **variant extraction** or similar
2. **Phenotype data** - Tab-separated file with outcome variables
3. **Covariate data** (optional) - Tab-separated file with covariates

Both steps can also be run in a single invocation, which requires the genetic data, variant list, phenotype data, and optionally covariate data.

**Example variant list** (``variants.tsv``):

.. code-block:: text

   chr  pos       a1  a2  gene
   22   1437663   C   A   GENE1
   22   28629784  C   T   GENE1
   22   37632612  G   C   GENE2
   22   37638692  G   C   GENE2

**Example phenotype file** (``phenotypes.tsv``):

.. code-block:: text

   id         blood_pressure  diabetes
   ID_000001  120.5           0
   ID_000002  135.2           1
   ID_000003  128.7           0

Step 2: Create Configuration File
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Create ``config.cnf``, replacing the example paths with the correct absolute paths:

.. code-block:: ini

   [GenoInput]
   chr22 /path/to/example_genotypes_chr22.vcf.gz

   [VarInput]
   variants /path/to/variants.tsv

   [PhenoInput]
   phenotypes /path/to/phenotypes.tsv

   [ConTests]
   blood_pressure OLS;KW

   [BinTests]
   diabetes GLM-Binom;CHISQ

   [Covs]
   Unadjusted None

   [Output]
   VarOutput /path/to/results/carriers

   [Options]
   extract_variants True
   association_analysis True
   id_column id
   cat_column gene
   incl_var True
   output_path /path/to/results

Step 3: Validate Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Perform a dry run to check your configuration:

.. code-block:: bash

   marvelous config.cnf --dry-run -v

If successful, you'll see something like:

.. code-block:: text

   === MARVEL (marvel v0.4.0a0) ===
   [info] 13:46:10 > config_file value: config.cnf
   [info] 13:46:10 > dry_run value: True
   [info] 13:46:10 > log_file value: None
   [info] 13:46:10 > outpath value: None
   [info] 13:46:10 > verbose value: True
   [info] 13:46:10 > Dry run mode - validating configuration only
   [info] 13:46:10 > Configuration is valid!
   [info] 13:46:10 > Configuration summary: {'extract_variants': True, 'association_analysis': True, 'output_path': './results'}

Step 4: Run Analysis
^^^^^^^^^^^^^^^^^^^^

Run the full pipeline:

.. code-block:: bash

   marvelous config.cnf -v

You'll see output similar to:

.. code-block:: text

   === MARVEL (marvel v0.4.0a0) ===
   [info] 14:49:35 > config_file value: config.cnf
   [info] 14:49:35 > dry_run value: False
   [info] 14:49:35 > log_file value: None
   [info] 14:49:35 > outpath value: None
   [info] 14:49:35 > verbose value: True
   [info] 14:49:35 > ================================================================================
   [info] 14:49:35 > MARVELous Pipeline Starting
   [info] 14:49:35 > ================================================================================
   [info] 14:49:35 >
   [info] 14:49:35 > STEP 1: VARIANT EXTRACTION
   [info] 14:49:35 > --------------------------------------------------------------------------------
   [info] 14:49:35 > Starting variant extraction
   [info] 14:49:35 > Set up 1 extraction task(s)
   [info] 14:49:35 > Gene aggregation enabled - using two-phase approach
   [info] 14:49:35 > Checkpoint directory: ./tmp_check
   [info] 14:49:35 > Will skip tasks with existing output files
   [info] 14:49:35 > Starting parallel extraction with 12 workers
   [info] 14:49:36 > [1/1] Loaded from checkpoint: chr22_variants
   [info] 14:49:36 > Loaded 1 tasks from checkpoints
   [info] 14:49:37 > Aggregating 2 variants to genes...
   [info] 14:49:37 > Successful: 1
   [info] 14:49:37 > Saved summary: ./results/carriers_summary.tsv.gz
   [info] 14:49:37 > Saved carriers: ./results/carriers_carriers.tsv.gz
   [info] 14:49:37 > Extraction complete: 1 file(s) created
   [info] 14:49:37 >
   [info] 14:49:37 > STEP 2: ASSOCIATION TESTING
   [info] 14:49:37 > --------------------------------------------------------------------------------
   [info] 14:49:37 > Starting association testing
   [info] 14:49:37 > Testing 5 exposure(s)
   [info] 14:49:37 > Checking missingness thresholds (max allowed: 50.0%)
   [warning] 14:49:37 > Found 1 exposure(s) with missingness > 50.0%:
     - 22:28629784:C:T_chr22:28629784:T:A: 1000/1000 missing (100.00%)
   These exposures will be excluded from analysis.
   [info] 14:49:37 > Removed 1 exposure(s) with high missingness: 22:28629784:C:T_chr22:28629784:T:A
   [info] 14:49:37 > Missingness filtering complete. Retained 2 outcomes, 4 exposures
   [info] 14:49:37 > Pre-computing stratification model: 'overall'
   [info] 14:49:37 > Computing valid IDs for listwise deletion
   [info] 14:49:37 > Processing file: /Users/marionvanvugt/Documents/work.nosync/projects/marvel/phenotypes.tsv
   [info] 14:49:37 > Final valid IDs across all files: 3
   [info] 14:49:37 > Phenotype file: 3 valid samples (checked 2 outcomes + 0 time columns)
   [info] 14:49:37 > Processing file: ./results/carriers_carriers.tsv.gz
   [info] 14:49:37 > ID extraction: 948 valid IDs from 1000 samples (52 removed, 5.20%)
   [info] 14:49:37 > Final valid IDs across all files: 948
   [info] 14:49:37 > Exposure files: 948 valid samples
   [info] 14:49:37 > Listwise deletion: 3 valid samples (intersection)
   [info] 14:49:37 > Pre-computed 1 stratification model(s)
   [info] 14:49:37 > Testing exposure: 22:1437663:C:A
   [info] 14:49:37 > Loaded: {0.0: 958, 1.0: 16}
   [info] 14:49:37 > Stratification model: 'overall'
   [info] 14:49:37 > Cached exposure '22:1437663:C:A' in DataManager
   [info] 14:49:37 > Completed: 4 results rows generated
   [info] 14:49:37 > Baseline table saved: ./results/22_1437663_C_A_overall_baseline.tsv
   [info] 14:49:37 > Results saved: ./results/22_1437663_C_A_results.tsv.gz (4 rows from 1 stratum/strata)
   [info] 14:49:37 > Testing exposure: 22:37632612:G:C
   [info] 14:49:37 > Loaded: {0.0: 968, 1.0: 4}
   [info] 14:49:37 > Stratification model: 'overall'
   [info] 14:49:37 > Cached exposure '22:37632612:G:C' in DataManager
   [info] 14:49:37 > Completed: 4 results rows generated
   [info] 14:49:37 > Baseline table saved: ./results/22_37632612_G_C_overall_baseline.tsv
   [info] 14:49:37 > Results saved: ./results/22_37632612_G_C_results.tsv.gz (4 rows from 1 stratum/strata)
   [info] 14:49:37 > Testing exposure: GENE1
   [info] 14:49:37 > Loaded: {0.0: 984, 1.0: 16}
   [info] 14:49:37 > Stratification model: 'overall'
   [info] 14:49:37 > Cached exposure 'GENE1' in DataManager
   [info] 14:49:37 > Completed: 4 results rows generated
   [info] 14:49:37 > Baseline table saved: ./results/GENE1_overall_baseline.tsv
   [info] 14:49:37 > Results saved: ./results/GENE1_results.tsv.gz (4 rows from 1 stratum/strata)
   [info] 14:49:37 > Testing exposure: GENE2
   [info] 14:49:37 > Loaded: {0.0: 996, 1.0: 4}
   [info] 14:49:37 > Stratification model: 'overall'
   [info] 14:49:37 > Cached exposure 'GENE2' in DataManager
   [info] 14:49:37 > Completed: 4 results rows generated
   [info] 14:49:37 > Baseline table saved: ./results/GENE2_overall_baseline.tsv
   [info] 14:49:37 > Results saved: ./results/GENE2_results.tsv.gz (4 rows from 1 stratum/strata)
   [info] 14:49:37 > DataManager cache statistics (stratification 'overall'):
   [info] 14:49:37 > Outcomes: 6 hits, 2 misses, hit rate: 75.00%
   [info] 14:49:37 > Covariates: 0 hits, 0 misses, hit rate: 0.00%
   [info] 14:49:37 > Exposure reuses: 0
   [info] 14:49:37 > Association testing complete
   [info] 14:49:37 >
   [info] 14:49:37 > ================================================================================
   [info] 14:49:37 > MARVELous Pipeline Completed Successfully
   [info] 14:49:37 > ================================================================================
   [warning] 14:49:37 > Warning summary (Python warnings, repeated):
   [warning] 14:49:37 > [FutureWarning] (8x) The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
   [warning] 14:49:37 > [UserWarning] (4x) Could not determine exposed/non exposed numbers for ['diabetes']: (1, 'N (%) or Mean (SD)')
   [info] 14:49:37 >
   [info] 14:49:37 > Pipeline Summary:
   [info] 14:49:37 > config: {'extract_variants': True, 'association_analysis': True, 'output_path': './results'}
   [info] 14:49:37 > extraction: {'files_created': 1, 'successful_tasks': 1, 'failed_tasks': 0}
   [info] 14:49:37 > association: {'completed': 4, 'failed': 0, 'skipped': 0}

Step 5: Check Results
^^^^^^^^^^^^^^^^^^^^^

Output files are written to the specified ``output_path``. The pipeline creates one file with the carrier status of each individual (``carriers_carriers.tsv.gz``), a genotype summary (``carriers_summary.tsv.gz``), and, for each exposure tested, a results file (``XXX_results.tsv.gz``) and a baseline file (``XXX_overall_baseline.tsv`` in this run, as reported in the log above). In this example there is therefore a results file and a baseline file for each of the two variants found in the genetic data and for each of the two gene-level aggregates:

.. code-block:: bash

   ls ./results/
   # Output:
   # carriers_carriers.tsv.gz
   # carriers_summary.tsv.gz
   # 22_1437663_C_A_overall_baseline.tsv
   # 22_1437663_C_A_results.tsv.gz
   # 22_37632612_G_C_overall_baseline.tsv
   # 22_37632612_G_C_results.tsv.gz
   # GENE1_results.tsv.gz
   # GENE1_overall_baseline.tsv
   # GENE2_results.tsv.gz
   # GENE2_overall_baseline.tsv

View association results:

.. code-block:: bash

   zcat ./results/GENE1_results.tsv.gz

.. code-block:: text

   Model      Model name  Stratification name  Variable        Exposure   Cases  ...
   NA         Unadjusted  overall              blood_pressure  GENE1      NA     ...
   GLM-Binom  Unadjusted  overall              diabetes        Intercept  1.0    ...
   GLM-Binom  Unadjusted  overall              diabetes        GENE1      1.0    ...
   CHISQ      Unadjusted  overall              diabetes        GENE1      1.0    ...

Next Steps
----------

- :doc:`configuration` - Complete configuration reference
- :doc:`usage` - Detailed usage guide
- :doc:`tests_guide` - Statistical tests reference
- :doc:`advanced` - Advanced features like gene aggregation and missingness handling

Getting Help
------------

- **Documentation**: https://mvvugt.gitlab.io/marvel/index.html
- **Issues**: https://gitlab.com/mvvugt/marvel/-/issues
- **Help**: ``marvelous --help``
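
Reading Results in Python
-------------------------

As a follow-up to Step 5, the gzipped results files can also be read programmatically instead of with ``zcat``. The sketch below is not part of MARVELous: the helper names ``read_results`` and ``rows_for_exposure`` are hypothetical, and it assumes the results file is tab-separated with a header row containing the ``Exposure`` column shown in the Step 5 output. It uses only the Python standard library.

.. code-block:: python

   import csv
   import gzip


   def read_results(path):
       """Read a gzipped, tab-separated results file into a list of dicts.

       Assumption: the file has a header row and tab-delimited columns,
       as suggested by the ``.tsv.gz`` extension.
       """
       # "rt" opens the gzip stream in text mode; newline="" lets the csv
       # module handle line endings itself.
       with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
           return list(csv.DictReader(handle, delimiter="\t"))


   def rows_for_exposure(rows, exposure):
       """Keep only rows whose 'Exposure' column equals ``exposure``.

       This drops, for example, the 'Intercept' rows of regression models.
       """
       return [row for row in rows if row.get("Exposure") == exposure]

For the example run, ``rows_for_exposure(read_results("./results/GENE1_results.tsv.gz"), "GENE1")`` would return the rows for the ``GENE1`` exposure, each as a dictionary keyed by column name. Adjust the path to match your own ``output_path``.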