Configuration Reference
=======================

MARVELous uses INI-style configuration files (``.cnf``) to control all aspects of the pipeline. This page provides a complete reference for all configuration sections and options.

File Format
-----------

Configuration files use a simple INI format with sections enclosed in square brackets and tab-separated key-value pairs:

.. code-block:: ini

   [SectionName]
   key1	value1
   key2	value2

.. note::
   Keys and values must be separated by **tabs**, not spaces.

Lines starting with ``#`` are treated as comments.

Configuration Sections
----------------------

.. _config-genoinput:

[GenoInput]
^^^^^^^^^^^

Specifies genetic input files (VCF, BGEN, or PLINK format) for variant extraction.

**Format:** ``identifier<TAB>path``

Each entry maps a unique identifier to a genetic file path. The identifier is used internally to track which file each variant was extracted from.

**Supported file types:**

- ``.vcf``, ``.vcf.gz`` - VCF files
- ``.bgen`` - BGEN files
- ``.bed``, ``.bim``, ``.fam`` - PLINK files

**Example:**

.. code-block:: ini

   [GenoInput]
   chr20	/path/to/chr20.vcf.gz
   chr21	/path/to/chr21.vcf.gz
   chr22	/path/to/chr22.vcf.gz

.. _config-varinput:

[VarInput]
^^^^^^^^^^

Specifies variant definition files that list which variants to extract.

**Format:** ``identifier<TAB>path``

Variant files should be tab-separated with columns for chromosome, position, reference allele, and alternate allele. The column names are configurable via the ``[Options]`` section.

**Example:**

.. code-block:: ini

   [VarInput]
   variants	/path/to/variant_list.tsv

**Variant file format:**

.. code-block:: text

   chr	pos	a1	a2	gene
   20	1234567	A	G	GENE1
   20	2345678	C	T	GENE1
   21	3456789	G	A	GENE2

.. _config-expinput:

[ExpInput]
^^^^^^^^^^

Specifies pre-extracted exposure/variant carrier files for association analysis. Use this when you have already extracted variants and want to run association testing only.

**Format:** ``identifier<TAB>path``

**Example:**

.. code-block:: ini

   [ExpInput]
   exposures	/path/to/extracted_carriers.tsv.gz

.. _config-phenoinput:

[PhenoInput]
^^^^^^^^^^^^

Specifies phenotype and covariate input files.

**Required keys:**

- ``phenotypes`` - Path to file containing outcome variables
- ``covariates`` - Path to file containing covariate variables (required if using covariate models)

**Example:**

.. code-block:: ini

   [PhenoInput]
   phenotypes	/path/to/phenotypes.tsv
   covariates	/path/to/covariates.tsv

Both files should be tab-separated with an ID column matching the genetic data.

.. _config-contests:

[ConTests]
^^^^^^^^^^

Defines continuous outcomes and the statistical tests to perform.

**Format:** ``outcome_name<TAB>test1;test2;test3``

Tests are separated by semicolons. See :doc:`tests_guide` for available tests.

**Example:**

.. code-block:: ini

   [ConTests]
   blood_pressure	OLS;KW;MWU;T
   cholesterol	OLS;KW;MWU;T
   bmi	OLS

.. _config-cattests:

[CatTests]
^^^^^^^^^^

Defines categorical outcomes and the statistical tests to perform.

**Format:** ``outcome_name<TAB>test1;test2``

**Example:**

.. code-block:: ini

   [CatTests]
   disease_severity	CHISQ
   treatment_response	CHISQ

.. _config-bintests:

[BinTests]
^^^^^^^^^^

Defines binary outcomes and the statistical tests to perform.

**Format:** ``outcome_name<TAB>test1;test2;test3``

**Example:**

.. code-block:: ini

   [BinTests]
   diabetes	GLM-Binom;CHISQ;FISHER
   hypertension	GLM-Binom;CHISQ
   mortality	FISHER

.. _config-survtests:

[SurvTests]
^^^^^^^^^^^

Defines survival (time-to-event) outcomes and statistical tests.

**Format:** ``event_column;time_column<TAB>test1;test2``

Note the special syntax: the event and time columns together form the key and are separated by a semicolon.

**Requirements:**

- Event column must be binary (0 = censored, 1 = event)
- Time column must be continuous and positive
- Both columns must exist in the phenotypes file

**Example:**

.. code-block:: ini

   [SurvTests]
   death;follow_up_years	Cox-PH
   cvd_event;time_to_cvd	Cox-PH
   cancer;age_at_diagnosis	Cox-PH

In this example:

- ``death`` is the event indicator, ``follow_up_years`` is the time variable
- Multiple survival outcomes can be specified
- All covariate models from ``[Covs]`` will be applied

.. _config-covs:

[Covs]
^^^^^^

Defines covariate models for adjusted analyses. Each model specifies a set of covariates to include in regression analyses.

**Format:** ``model_name<TAB>covariate1;covariate2;covariate3``

Use ``None`` for an unadjusted model.

**Example:**

.. code-block:: ini

   [Covs]
   Unadjusted	None
   Age_Sex	age;sex
   Full	age;sex;bmi;smoking_status;PC1;PC2;PC3;PC4

Each model is run separately, allowing comparison of adjusted and unadjusted results.

.. _config-filter:

[Stratification]
^^^^^^^^^^^^^^^^

Defines filter models for running separate association analyses on subgroups of individuals.

Each filter model specifies criteria to select a subset of individuals from the phenotype file. The analysis is then run independently for each filter model, similar to how each covariate model is run separately.

**Format:** ``filter_name<TAB>column=value1,value2;column2=value3``

- Semicolons (``;``) separate criteria across different columns (AND logic)
- Commas (``,``) separate multiple values within a column (OR logic)
- Prefix values with ``!`` to exclude instead of include

.. note::
   Filtering applies to **association testing only**, not to variant extraction.

**Example:**

.. code-block:: ini

   [Stratification]
   females	sex=female
   males	sex=male
   young_females	sex=female;age_group=young,middle

In this example:

- ``females``: include only individuals where ``sex`` equals ``female``
- ``males``: include only individuals where ``sex`` equals ``male``
- ``young_females``: include individuals where ``sex`` equals ``female`` AND ``age_group`` equals ``young`` or ``middle``

To exclude values, prefix them with ``!``:

.. code-block:: ini

   [Stratification]
   non_smokers	smoking=!current

By default, an unfiltered "overall" analysis is also run alongside each filter model. This can be disabled with the ``stratify_overall`` option (see :ref:`config-options`).

.. _config-output:

[Output]
^^^^^^^^

Specifies output file paths.

**Available keys:**

- ``VarOutput`` - Base path for variant extraction output files
- ``TestOutput`` - (Legacy) Path for test output

**Example:**

.. code-block:: ini

   [Output]
   VarOutput	/path/to/results/extracted_variants

This creates:

- Carrier matrix - ``/path/to/results/extracted_variants_carriers.tsv.gz``
- Extraction summary - ``/path/to/results/extracted_variants_summary.tsv.gz``

.. _config-options:

[Options]
^^^^^^^^^

Controls pipeline behavior with various configuration options.

Core Pipeline Options
"""""""""""""""""""""

.. list-table::
   :header-rows: 1
   :widths: 25 15 15 45

   * - Option
     - Type
     - Default
     - Description
   * - ``extract_variants``
     - bool
     - ``True``
     - Enable variant extraction step
   * - ``association_analysis``
     - bool
     - ``True``
     - Enable association testing step
   * - ``id_column``
     - str
     - ``id``
     - Name of the sample ID column
   * - ``output_path``
     - str
     - ``.``
     - Directory for output files
   * - ``tmp_dir``
     - str
     - ``./tmp``
     - Directory for temporary files

Column Configuration
""""""""""""""""""""

These options configure column names in your variant input files:

.. list-table::
   :header-rows: 1
   :widths: 25 15 15 45

   * - Option
     - Type
     - Default
     - Description
   * - ``chr_column``
     - str
     - ``chr``
     - Chromosome column name
   * - ``pos_column``
     - str
     - ``pos``
     - Position column name
   * - ``ref_column``
     - str
     - ``a1``
     - Reference allele column name
   * - ``alt_column``
     - str
     - ``a2``
     - Alternate allele column name
   * - ``var_column``
     - str
     - ``ID``
     - Variant ID column name
   * - ``chr_pos_column``
     - str
     - ``chr_pos``
     - Combined chr:pos column name
   * - ``cat_column``
     - str
     - ``None``
     - Category/gene column for aggregation
   * - ``var_sep``
     - str
     - ``:``
     - Separator for variant IDs (chr:pos:ref:alt)

Missingness Handling
""""""""""""""""""""

.. list-table::
   :header-rows: 1
   :widths: 25 15 15 45

   * - Option
     - Type
     - Default
     - Description
   * - ``missingness_strategy``
     - bool
     - ``True``
     - ``True`` for listwise deletion, ``False`` for pairwise
   * - ``max_missingness``
     - float
     - ``0.5``
     - Maximum allowed missingness proportion (0-1)
   * - ``cov_miss_error``
     - bool
     - ``True``
     - Raise error if covariates exceed missingness threshold

Extraction Options
""""""""""""""""""

.. list-table::
   :header-rows: 1
   :widths: 25 15 15 45

   * - Option
     - Type
     - Default
     - Description
   * - ``incl_var``
     - bool
     - ``True``
     - Include individual variants in output (alongside gene aggregations)
   * - ``reverse``
     - bool
     - ``True``
     - Also check for ref/alt swapped variants
   * - ``region``
     - str
     - ``None``
     - Genomic region to extract (e.g., ``20:1000000-2000000``)
   * - ``prefilter_regions``
     - bool
     - ``False``
     - When ``True``, pre-filters the input variant list to the
       chromosomes/positions present in the genetic file before extraction.
       Use it to skip variants not in the file and speed up large extractions.
   * - ``neg_geno``
     - bool / ``None`` / str / int / float
     - ``True``
     - What negative genotypes should be set to. ``True`` = keep unchanged,
       ``None`` = set to ``NA``, any other value = set to this value
   * - ``sum_geno``
     - bool / ``None`` / str / int / float
     - ``True``
     - What genotypes larger than 1 should be set to. ``True`` = keep unchanged,
       ``None`` = set to ``NA``, any other value = set to this value

Testing Options
"""""""""""""""

.. list-table::
   :header-rows: 1
   :widths: 25 15 15 45

   * - Option
     - Type
     - Default
     - Description
   * - ``min_group_size``
     - int
     - ``1``
     - Minimum samples required per exposure group
   * - ``raise_on_error``
     - bool
     - ``True``
     - Raise exception on test failures
   * - ``stratify_overall``
     - bool
     - ``True``
     - Also run an unfiltered "overall" analysis alongside filtered models
   * - ``exposures``
     - str
     - ``None``
     - Semicolon-separated list of specific exposures to test

Parallel Execution
""""""""""""""""""

.. list-table::
   :header-rows: 1
   :widths: 25 15 15 45

   * - Option
     - Type
     - Default
     - Description
   * - ``n_jobs``
     - int
     - ``-1``
     - Number of parallel jobs (-1 = all CPUs)
   * - ``checkpoint_dir``
     - str
     - ``./tmp_check``
     - Directory for checkpointing
   * - ``force_rerun``
     - bool
     - ``False``
     - Force re-extraction even if checkpoints exist

**Example:**

.. code-block:: ini

   [Options]
   extract_variants	True
   association_analysis	True
   id_column	sample_id
   output_path	./results
   cat_column	gene
   missingness_strategy	True
   max_missingness	0.3
   n_jobs	4

Complete Example
----------------

Below is a working example combining all sections. The full reference configuration file, listing every available option with its default value, can be downloaded here:

:download:`full_config.cnf <../../resources/examples/full_config.cnf>`

.. include:: ../../resources/examples/full_config.cnf
   :literal:

Configuration for Specific Use Cases
------------------------------------

Extraction Only
^^^^^^^^^^^^^^^

To run variant extraction without association testing:

.. code-block:: ini

   [GenoInput]
   chr1	/data/chr1.vcf.gz

   [VarInput]
   variants	/data/variants.tsv

   [Output]
   VarOutput	/results/extracted

   [Options]
   extract_variants	True
   association_analysis	False
   id_column	IID

Association Only
^^^^^^^^^^^^^^^^

To run association testing on pre-extracted data:

.. code-block:: ini

   [ExpInput]
   carriers	/data/extracted_carriers.tsv.gz

   [PhenoInput]
   phenotypes	/data/outcomes.tsv
   covariates	/data/covariates.tsv

   [BinTests]
   disease	GLM-Binom;FISHER

   [Covs]
   Adjusted	age;sex

   [Options]
   extract_variants	False
   association_analysis	True
   id_column	IID
   output_path	/results

Stratified Analysis
^^^^^^^^^^^^^^^^^^^

To run association testing separately for subgroups of individuals:

.. code-block:: ini

   [ExpInput]
   carriers	/data/extracted_carriers.tsv.gz

   [PhenoInput]
   phenotypes	/data/outcomes.tsv
   covariates	/data/covariates.tsv

   [BinTests]
   disease	GLM-Binom;FISHER

   [Covs]
   Adjusted	age;sex

   [Stratification]
   females	sex=female
   males	sex=male
   non_smokers	smoking=!current

   [Options]
   extract_variants	False
   association_analysis	True
   id_column	IID
   output_path	/results
   stratify_overall	True

This runs the association analysis four times: once unfiltered ("overall"), once for females only, once for males only, and once excluding current smokers. Results include a ``Filter name`` column identifying which filter model was applied.

Creation of configuration file
------------------------------

The configuration file can be written by hand following the formatting rules described above. To make this easier, a helper function is available, as illustrated below.
The example also checks the written file, verifying that:

- at least one of ``extract_variants`` and ``association_analysis`` is enabled
- genetic files have supported formats (based on their extension)
- the variant output identifiers are as expected
- a covariate file is supplied if adjusted models are specified
- survival outcomes are formatted correctly
- the specified tests are recognised
- specified input files and columns are present

.. code-block:: python

   # Import the configuration helpers
   from marvel.utils.config_tools import (
       create_config,
       ConfigParser,
       check_config_file,
   )

   # Genetic input files, keyed by identifier ([GenoInput])
   geno_input = {
       'chr21': 'path/to/chr21.vcf.gz',
       'chr22': 'path/to/chr22.vcf.gz',
   }
   # Phenotype and covariate files ([PhenoInput])
   pheno_input = {
       'phenotypes': 'path/to/outcomes.tsv',
       'covariates': 'path/to/covariates.tsv',
   }
   # Continuous outcomes and their tests ([ConTests])
   con_tests = {
       'sbp': ['OLS', 'KW'],
       'ldl_cholesterol': ['OLS', 'AOV'],
       'crp_level': ['OLS', 'KW', 'AOV'],
   }
   # Categorical outcomes and their tests ([CatTests])
   cat_tests = {
       'disease_severity': ['CHISQ'],
       'treatment_response': ['CHISQ'],
       'risk_category': ['CHISQ'],
   }
   # Binary outcomes and their tests ([BinTests])
   bin_tests = {
       'hypertension': ['GLM-Binom', 'CHISQ'],
       'diabetes': ['GLM-Binom', 'FISHER'],
       'cvd': ['GLM-Binom', 'CHISQ', 'FISHER'],
   }
   # Covariate models ([Covs]); 'None' requests an unadjusted model
   covs = {
       'Unadjusted': 'None',
       'Adjusted': ['age', 'sex'],
   }

   # Create the configuration file, which is written to `path`
   create_config(
       path='path/to/config.cnf',
       extract_variants=True,
       geno_input=geno_input,
       variant_input={'variants': 'path/to/variants.tsv'},
       variant_output={'VarOutput': '/path/to/prefix'},
       association_analysis=True,
       pheno_input=pheno_input,
       con_tests=con_tests,
       cat_tests=cat_tests,
       bin_tests=bin_tests,
       covs=covs,
       id_column='id',
       raise_on_error=True,
       ref_column='a1',
       alt_column='a2',
       cat_column='gene',
       incl_var=False,
       output_path='path/to/output',
   )

   # Run the basic checks listed above on the written configuration file
   config = ConfigParser('path/to/config.cnf')
   check_config_file(config)

See Also
--------

- :doc:`usage` - Command-line usage
- :doc:`tests_guide` - Statistical tests reference
- :doc:`advanced` - Advanced features
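
Appendix: Stratification Filter Logic
-------------------------------------

The ``[Stratification]`` rules described earlier (semicolons combine columns with AND, commas combine values with OR, and a ``!`` prefix excludes a value) can be illustrated with a small, self-contained sketch. This is not MARVELous's own parser: the function names ``parse_filter`` and ``matches`` and the sample rows are hypothetical and exist only for illustration. How the pipeline handles missing values or a mix of included and excluded values in one column is not documented here, so the behaviour below is an assumption.

.. code-block:: python

   def parse_filter(spec):
       """Parse 'col=v1,v2;col2=!v3' into {column: (include_set, exclude_set)}."""
       criteria = {}
       for criterion in spec.split(';'):          # ';' separates columns (AND)
           column, _, values = criterion.partition('=')
           include, exclude = set(), set()
           for value in values.split(','):        # ',' separates values (OR)
               value = value.strip()
               if value.startswith('!'):          # '!' marks an excluded value
                   exclude.add(value[1:])
               else:
                   include.add(value)
           criteria[column.strip()] = (include, exclude)
       return criteria


   def matches(row, criteria):
       """Return True if the row satisfies every column criterion."""
       for column, (include, exclude) in criteria.items():
           value = str(row[column])
           if value in exclude:
               return False
           # Assumption: an empty include set means "anything not excluded"
           if include and value not in include:
               return False
       return True


   # Hypothetical phenotype rows, standing in for the phenotypes file
   rows = [
       {'id': 's1', 'sex': 'female', 'age_group': 'young', 'smoking': 'never'},
       {'id': 's2', 'sex': 'female', 'age_group': 'old', 'smoking': 'current'},
       {'id': 's3', 'sex': 'male', 'age_group': 'middle', 'smoking': 'former'},
   ]

   young_females = parse_filter('sex=female;age_group=young,middle')
   non_smokers = parse_filter('smoking=!current')

   selected_young_females = [r['id'] for r in rows if matches(r, young_females)]
   selected_non_smokers = [r['id'] for r in rows if matches(r, non_smokers)]

With these sample rows, ``young_females`` keeps only ``s1``, while ``non_smokers`` keeps ``s1`` and ``s3``.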