Getting Started
===============

.. include:: ../../README.md
   :parser: myst_parser.sphinx_


Quick Start
-----------

This section walks through a minimal example to demonstrate MARVELous functionality.

Step 1: Prepare Your Data
^^^^^^^^^^^^^^^^^^^^^^^^^

MARVELous consist of two main parts, for which different requirements exist.

For **variant extraction** MARVELous requires:

1. **Genetic data** - VCF, BGEN, or PLINK files
2. **Variant list** - Tab-separated file specifying variants to extract

For **association analyses** MARVELous requires:

1. **Exposure data** - Output of **variant extraction** or similar
2. **Phenotype data** - Tab-separated file with outcome variables
3. **Covariate data** (optional) - Tab-separated file with covariates

Both steps can also be performed at once, which would logically require the
genetic data, variant list, phenotype data, and optionally covariate data.

**Example variant list** (``variants.tsv``):

.. code-block:: text

   chr	pos	a1	a2	gene
   22	1437663	C	A	GENE1
   22	28629784	C	T	GENE1
   22	37632612	G	C	GENE2
   22	37638692	G	C	GENE2

**Example phenotype file** (``phenotypes.tsv``):

.. code-block:: text

   id	blood_pressure	diabetes
   ID_000001	120.5	0
   ID_000002	135.2	1
   ID_000003	128.7	0


Step 2: Create Configuration File
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Create ``config.cnf`` replacing the paths with correct absolute paths:

.. code-block:: ini

    [GenoInput]
    chr22	/path/to/example_genotypes_chr22.vcf.gz

    [VarInput]
    variants	/path/to/variants.tsv

    [PhenoInput]
    phenotypes	/path/to/phenotypes.tsv

    [ConTests]
    blood_pressure	OLS;KW

    [BinTests]
    diabetes	GLM-Binom;CHISQ

    [Covs]
    Unadjusted	None

    [Output]
    VarOutput	/path/to/results/carriers

    [Options]
    extract_variants	True
    association_analysis	True
    id_column	id
    cat_column	gene
    incl_var	True
    output_path	/path/to/results

Step 3: Validate Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Run a dry run to check your configuration:

.. code-block:: bash

   marvelous config.cnf --dry-run -v

If successful, you'll see something like:

.. code-block:: text

    === MARVEL (marvel v0.4.0a0) ===
    [info] 13:46:10 > config_file value: config.cnf
    [info] 13:46:10 > dry_run value: True
    [info] 13:46:10 > log_file value: None
    [info] 13:46:10 > outpath value: None
    [info] 13:46:10 > verbose value: True
    [info] 13:46:10 > Dry run mode - validating configuration only
    [info] 13:46:10 > Configuration is valid!
    [info] 13:46:10 > Configuration summary: {'extract_variants': True, 'association_analysis': True, 'output_path': './results'}


Step 4: Run Analysis
^^^^^^^^^^^^^^^^^^^^

Run the full pipeline:

.. code-block:: bash

   marvelous config.cnf -v

You'll see output similar to:

.. code-block:: text

    === MARVEL (marvel v0.4.0a0) ===
    [info] 14:49:35 > config_file value: config.cnf
    [info] 14:49:35 > dry_run value: False
    [info] 14:49:35 > log_file value: None
    [info] 14:49:35 > outpath value: None
    [info] 14:49:35 > verbose value: True
    [info] 14:49:35 > ================================================================================
    [info] 14:49:35 > MARVELous Pipeline Starting
    [info] 14:49:35 > ================================================================================
    [info] 14:49:35 > 
    [info] 14:49:35 > STEP 1: VARIANT EXTRACTION
    [info] 14:49:35 > --------------------------------------------------------------------------------
    [info] 14:49:35 > Starting variant extraction
    [info] 14:49:35 > Set up 1 extraction task(s)
    [info] 14:49:35 > Gene aggregation enabled - using two-phase approach
    [info] 14:49:35 > Checkpoint directory: ./tmp_check
    [info] 14:49:35 > Will skip tasks with existing output files
    [info] 14:49:35 > Starting parallel extraction with 12 workers
    [info] 14:49:36 > [1/1] Loaded from checkpoint: chr22_variants
    [info] 14:49:36 > Loaded 1 tasks from checkpoints
    [info] 14:49:37 > Aggregating 2 variants to genes...
    [info] 14:49:37 > Successful: 1
    [info] 14:49:37 > Saved summary: ./results/carriers_summary.tsv.gz
    [info] 14:49:37 > Saved carriers: ./results/carriers_carriers.tsv.gz
    [info] 14:49:37 > Extraction complete: 1 file(s) created
    [info] 14:49:37 > 
    [info] 14:49:37 > STEP 2: ASSOCIATION TESTING
    [info] 14:49:37 > --------------------------------------------------------------------------------
    [info] 14:49:37 > Starting association testing
    [info] 14:49:37 > Testing 5 exposure(s)
    [info] 14:49:37 > Checking missingness thresholds (max allowed: 50.0%)
    [warning] 14:49:37 > Found 1 exposure(s) with missingness > 50.0%:
        - 22:28629784:C:T_chr22:28629784:T:A: 1000/1000 missing (100.00%)
        These exposures will be excluded from analysis.
    [info] 14:49:37 > Removed 1 exposure(s) with high missingness: 22:28629784:C:T_chr22:28629784:T:A
    [info] 14:49:37 > Missingness filtering complete. Retained 2 outcomes, 4 exposures
    [info] 14:49:37 > Pre-computing stratification model: 'overall'
    [info] 14:49:37 > Computing valid IDs for listwise deletion
    [info] 14:49:37 > Processing file: /Users/marionvanvugt/Documents/work.nosync/projects/marvel/phenotypes.tsv
    [info] 14:49:37 > Final valid IDs across all files: 3
    [info] 14:49:37 > Phenotype file: 3 valid samples (checked 2 outcomes + 0 time columns)
    [info] 14:49:37 > Processing file: ./results/carriers_carriers.tsv.gz
    [info] 14:49:37 > ID extraction: 948 valid IDs from 1000 samples (52 removed, 5.20%)
    [info] 14:49:37 > Final valid IDs across all files: 948
    [info] 14:49:37 > Exposure files: 948 valid samples
    [info] 14:49:37 > Listwise deletion: 3 valid samples (intersection)
    [info] 14:49:37 > Pre-computed 1 stratification model(s)
    [info] 14:49:37 > Testing exposure: 22:1437663:C:A
    [info] 14:49:37 >   Loaded: {0.0: 958, 1.0: 16}
    [info] 14:49:37 >   Stratification model: 'overall'
    [info] 14:49:37 > Cached exposure '22:1437663:C:A' in DataManager
    [info] 14:49:37 >    Completed: 4 results rows generated
    [info] 14:49:37 > Baseline table saved: ./results/22_1437663_C_A_overall_baseline.tsv
    [info] 14:49:37 > Results saved: ./results/22_1437663_C_A_results.tsv.gz (4 rows from 1 stratum/strata)
    [info] 14:49:37 > Testing exposure: 22:37632612:G:C
    [info] 14:49:37 >   Loaded: {0.0: 968, 1.0: 4}
    [info] 14:49:37 >   Stratification model: 'overall'
    [info] 14:49:37 > Cached exposure '22:37632612:G:C' in DataManager
    [info] 14:49:37 >    Completed: 4 results rows generated
    [info] 14:49:37 > Baseline table saved: ./results/22_37632612_G_C_overall_baseline.tsv
    [info] 14:49:37 > Results saved: ./results/22_37632612_G_C_results.tsv.gz (4 rows from 1 stratum/strata)
    [info] 14:49:37 > Testing exposure: GENE1
    [info] 14:49:37 >   Loaded: {0.0: 984, 1.0: 16}
    [info] 14:49:37 >   Stratification model: 'overall'
    [info] 14:49:37 > Cached exposure 'GENE1' in DataManager
    [info] 14:49:37 >    Completed: 4 results rows generated
    [info] 14:49:37 > Baseline table saved: ./results/GENE1_overall_baseline.tsv
    [info] 14:49:37 > Results saved: ./results/GENE1_results.tsv.gz (4 rows from 1 stratum/strata)
    [info] 14:49:37 > Testing exposure: GENE2
    [info] 14:49:37 >   Loaded: {0.0: 996, 1.0: 4}
    [info] 14:49:37 >   Stratification model: 'overall'
    [info] 14:49:37 > Cached exposure 'GENE2' in DataManager
    [info] 14:49:37 >    Completed: 4 results rows generated
    [info] 14:49:37 > Baseline table saved: ./results/GENE2_overall_baseline.tsv
    [info] 14:49:37 > Results saved: ./results/GENE2_results.tsv.gz (4 rows from 1 stratum/strata)
    [info] 14:49:37 > DataManager cache statistics (stratification 'overall'):
    [info] 14:49:37 >   Outcomes: 6 hits, 2 misses, hit rate: 75.00%
    [info] 14:49:37 >   Covariates: 0 hits, 0 misses, hit rate: 0.00%
    [info] 14:49:37 >   Exposure reuses: 0
    [info] 14:49:37 > Association testing complete
    [info] 14:49:37 > 
    [info] 14:49:37 > ================================================================================
    [info] 14:49:37 > MARVELous Pipeline Completed Successfully
    [info] 14:49:37 > ================================================================================
    [warning] 14:49:37 > Warning summary (Python warnings, repeated):
    [warning] 14:49:37 >   [FutureWarning] (8x) The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
    [warning] 14:49:37 >   [UserWarning] (4x) Could not determine exposed/non exposed numbers for ['diabetes']: (1, 'N (%) or Mean (SD)')
    [info] 14:49:37 > 
    [info] 14:49:37 > Pipeline Summary:
    [info] 14:49:37 >   config: {'extract_variants': True, 'association_analysis': True, 'output_path': './results'}
    [info] 14:49:37 >   extraction: {'files_created': 1, 'successful_tasks': 1, 'failed_tasks': 0}
    [info] 14:49:37 >   association: {'completed': 4, 'failed': 0, 'skipped': 0}


Step 5: Check Results
^^^^^^^^^^^^^^^^^^^^^

Output files are created in the specified ``output_path``. The pipeline
creates one file with the carriership per individuals
(``carriers_carriers.tsv.gz``), a genotype summary
(``carriers_summary.tsv.gz``) and a results (``XXX_results.tsv.gz``) and
baseline file (``XXX_baseline.tsv``) per exposure tested. In this case, there
is therefore a results and baseline file for the two variants identified in
the genetics file and the two aggregates, the genes.:

.. code-block:: bash

   ls ./results/

   # Output:
   # carriers_carriers.tsv.gz
   # carriers_summary.tsv.gz
   # 22_1437663_C_A_overall_baseline.tsv
   # 22_1437663_C_A_results.tsv.gz
   # 22_37632612_G_C_overall_baseline.tsv
   # 22_37632612_G_C_results.tsv.gz
   # GENE1_results.tsv.gz
   # GENE1_baseline.tsv
   # GENE2_results.tsv.gz
   # GENE2_baseline.tsv

View association results:

.. code-block:: bash

   zcat ./results/GENE1_results.tsv.gz

.. code-block:: text

    Model	Model name	Stratification name	Variable	Exposure	Cases	...
    NA	Unadjusted	overall	blood_pressure	GENE1	NA	...
    GLM-Binom	Unadjusted	overall	diabetes	Intercept	1.0	...
    GLM-Binom	Unadjusted	overall	diabetes	GENE1	1.0	...
    CHISQ	Unadjusted	overall	diabetes	GENE1	1.0	...


Next Steps
----------

- :doc:`configuration` - Complete configuration reference
- :doc:`usage` - Detailed usage guide
- :doc:`tests_guide` - Statistical tests reference
- :doc:`advanced` - Advanced features like gene aggregation and missingness handling


Getting Help
------------

- **Documentation**: https://mvvugt.gitlab.io/marvel/index.html
- **Issues**: https://gitlab.com/mvvugt/marvel/-/issues
- **Help**: ``marvelous --help``