marvel.utils package

marvel.utils.config_tools module

Tools to deal with configuration files

marvel.utils.config_tools.ConfigParser(path: str | None = None, encoding: str = 'utf-8') dict

Parses a configuration file into structured data and optionally assigns the result to a user-supplied mapper instance.

Parameters:
  • path (str | None) – Path to the configuration file to be parsed, or None to use the example configuration file.

  • required_headers (set | list(set) | None) – Set or list of sets with required headers in the configuration file.

  • encoding (str) – Encoding, default is utf-8.

Raises:
  • ValueError – If a parsed line does not contain a tab delimiter.

  • InputValidationError – If a key-value pair is duplicated within the same section.
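
The parsing rules described above can be illustrated with a minimal sketch. It assumes a layout of `[Header]` lines followed by tab-separated key/value lines; `parse_config_text` is a hypothetical helper written for this illustration, not the package's own parser, and it raises ValueError for duplicates where the documented function raises InputValidationError.

```python
def parse_config_text(text: str) -> dict:
    """Hypothetical stand-in: parse '[Header]' sections with tab-separated entries."""
    config = {}
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped.startswith("[") and stripped.endswith("]"):
            section = stripped[1:-1]
            config.setdefault(section, {})
            continue
        if section is None:
            raise ValueError(f"Entry found before any section header: {line!r}")
        if "\t" not in line:
            raise ValueError(f"No tab delimiter in line: {line!r}")
        key, value = line.split("\t", 1)
        if key in config[section]:
            # Assumption: a repeated key counts as a duplicate. The documented
            # parser raises InputValidationError; ValueError is used here.
            raise ValueError(f"Duplicate key {key!r} in section [{section}]")
        config[section][key] = value
    return config


example = "[Options]\nextract_variants\tTrue\nn_jobs\t4\n"
parsed = parse_config_text(example)
print(parsed)
```

Values are kept as strings in this sketch; any type conversion is left to the caller.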

class marvel.utils.config_tools.PipelineConfig(extract_variants: bool = True, association_analysis: bool = True, n_jobs: int = -1, checkpoint_dir: str = './tmp_check', force_rerun: bool = False, raise_on_error: bool = True, id_column: str = 'id', output_path: str = '.', min_group_size: int = 1, tmp_dir: str = './tmp', listwise: bool = True, max_missingness: float = 0.5, cov_miss_error: bool = True, chr_column: str = 'chr', pos_column: str = 'pos', chr_pos_column: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', cat_column: str | None = None, var_sep: str = ':', file_sep: str = '\t', incl_var: bool = True, reverse: bool = True, region: str | None = None, neg_geno: object = True, sum_geno: object = True, stratify_models: Dict[str, Dict[str, List[str]] | None] | None = None, stratify_overall: bool = True, prefilter_regions: bool = False, geno_files: Dict[str, str] = None, var_files: Dict[str, str] = None, exp_files: Dict[str, str] = None, pheno_file: str | None = None, cov_file: str | None = None, variant_output: Dict[str, str] | None = None, exposures: List[str] | None = None, outcomes: Dict[str, dict] = None, covariate_models: Dict[str, List[str] | None] = None)

Bases: object

Configuration for MARVELous pipeline

alt_column: str = 'a2'
association_analysis: bool = True
cat_column: str | None = None
checkpoint_dir: str = './tmp_check'
chr_column: str = 'chr'
chr_pos_column: str = 'chr_pos'
cov_file: str | None = None
cov_miss_error: bool = True
covariate_models: Dict[str, List[str] | None] = None
exp_files: Dict[str, str] = None
exposures: List[str] | None = None
extract_variants: bool = True
file_sep: str = '\t'
force_rerun: bool = False
classmethod from_config_dict(config: dict) PipelineConfig

Create PipelineConfig from configuration dictionary

Parameters:

config (dict) – Parsed configuration dictionary

Returns:

Configuration object

Return type:

PipelineConfig

geno_files: Dict[str, str] = None
id_column: str = 'id'
incl_var: bool = True
listwise: bool = True
max_missingness: float = 0.5
min_group_size: int = 1
n_jobs: int = -1
neg_geno: object = True
outcomes: Dict[str, dict] = None
output_path: str = '.'
pheno_file: str | None = None
pos_column: str = 'pos'
prefilter_regions: bool = False
raise_on_error: bool = True
ref_column: str = 'a1'
region: str | None = None
reverse: bool = True
stratify_models: Dict[str, Dict[str, List[str]] | None] | None = None
stratify_overall: bool = True
sum_geno: object = True
tmp_dir: str = './tmp'
validate()

Validate configuration

var_column: str = 'ID'
var_files: Dict[str, str] = None
var_sep: str = ':'
variant_output: Dict[str, str] | None = None
marvel.utils.config_tools.check_config_file(config: dict | str, key_sep: str = ';') bool

QC of the configuration file

Running this QC first performs basic checks on whether the pipeline can be executed successfully. This prevents avoidable errors from surfacing only after a long run time.

Parameters:
  • config (dict or str) – Dictionary of the config-file or path to the file.

  • key_sep (str) – Separator for the RHS values

Returns:

True if configuration file passed QC

Return type:

bool

Raises:
  • FileNotFoundError – If configuration file is not found

  • TypeError – If genetic input files are not supported

  • ConfigHeaderMissingError – If headers are missing

  • InputValidationError – If neither extract_variants nor association_analysis is True

marvel.utils.config_tools.cnf_extract_outcome_tests(config: dict, header_name)
marvel.utils.config_tools.cnf_extract_survival_tests(config: dict, header_name)

Extract survival test definitions from config

The LHS key is event_col;time_col (semicolon-separated pair). The RHS is the test list (same as other sections).

Returns a dict keyed by event_col with value:

{ColType: 'survival', Tests: [...], TimeCol: time_col}
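
As a rough illustration of the mapping described above, the sketch below turns one parsed section into the documented structure. The helper name `extract_survival_tests`, the string dictionary keys, and the column and test names (`death`, `followup_years`, `cox`, `logrank`) are assumptions made for this example only.

```python
def extract_survival_tests(section: dict, key_sep: str = ";") -> dict:
    """Hypothetical sketch: map 'event_col;time_col' keys to survival test definitions."""
    out = {}
    for lhs, rhs in section.items():
        # The LHS must contain exactly two parts; anything else raises ValueError on unpacking.
        event_col, time_col = [part.strip() for part in lhs.split(key_sep)]
        tests = [t.strip() for t in rhs.split(key_sep) if t.strip()]
        out[event_col] = {"ColType": "survival", "Tests": tests, "TimeCol": time_col}
    return out


section = {"death;followup_years": "cox;logrank"}  # hypothetical column and test names
result = extract_survival_tests(section)
print(result)
```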
marvel.utils.config_tools.create_config(path: str, extract_variants: bool = True, geno_input: dict | None = None, variant_input: dict | None = None, variant_output: dict | None = None, association_analysis: bool = True, exp_input: dict | None = None, pheno_input: dict | None = None, con_tests: dict | None = None, cat_tests: dict | None = None, bin_tests: dict | None = None, surv_tests: dict | None = None, covs: dict | None = None, stratify: dict | None = None, key_sep: str = ';', **other_options)

Create a MARVELous configuration file

Parameters:
  • path (str) – Path where the configuration file will be written

  • extract_variants (bool, default True) – Whether to extract variants. If True, at least the geno_input, variant_input, and variant_output arguments should be specified.

  • geno_input (dict or None, default None) – Dictionary mapping identifiers (keys) to paths of genetic input files (bgen/vcf) to extract variants from (values)

  • variant_input (dict or None, default None) – Dictionary mapping identifiers (keys) to paths of variant definition input files (values)

  • variant_output (dict or None, default None) – Dictionary mapping identifiers (keys) to paths of output files for extracted variants (values)

  • association_analysis (bool, default True) – Whether to perform association analysis. If True, at least the pheno_input, and one of con_tests, cat_tests, or bin_tests arguments should be specified.

  • exp_input (dict or None, default None) – Dictionary mapping identifiers (keys) to paths of extracted variant input files (values)

  • pheno_input (dict or None, default None) – Dictionary mapping identifiers (keys) to paths of phenotype input files (values). The key ‘phenotypes’ is required if association_analysis is True; the key ‘covariates’ is additionally required if models are specified using covs.

  • con_tests (dict or None, default None) – Dictionary mapping continuous outcomes (keys) to semicolon-separated test strings (values)

  • cat_tests (dict or None, default None) – Dictionary mapping categorical outcomes (keys) to semicolon-separated test strings (values)

  • bin_tests (dict or None, default None) – Dictionary mapping binary outcomes (keys) to semicolon-separated test strings (values)

  • covs (dict or None, default None) – Dictionary mapping model names (keys) to semicolon-separated covariate strings

  • stratify (dict or None, default None) – Dictionary mapping stratification names (keys) to semicolon-separated stratification options.

  • key_sep (str) – Separator between RHS strings

  • **other_options – Additional options for the [Options] section

Examples

>>> create_config(
...     'example.cnf',
...     geno_input={'chr1': '/path/to/chr1.bgen'},
...     variant_input={'gene_vars': '/path/to/variants.txt'},
...     variant_output={'variant_output': '/path/to/output.txt'},
...     extract_variants=True
... )
marvel.utils.config_tools.custom_config(path=None, text: None | dict[str, list[str]] = None) str

Returns a toy example of a data configuration file as a string or uses user-supplied text if provided in dictionary format.

Parameters:
  • path (str, default None) – An optional path to write the text to disk. Uses utf-8 file encoding.

  • text (dict [str, list [str]], default None) – Configuration content specified as a dictionary. Each key represents a header (enclosed in square brackets in the output), and the corresponding value is a list of attributes or entries. Tab characters in list items are preserved, enabling target-source assignments in the generated file.

Returns:

The formatted configuration file.

Return type:

str

Examples

>>> custom_text = {
...     "CustomSection1": ["CustomAttribute1", "CustomAttribute2"],
...     "MetaData": ["CustomData\tCustomValue"],
...     "AdditionalInfo": ["Info1", "Info2"],
... }
>>> print(custom_config(text=custom_text))
[CustomSection1]
CustomAttribute1
CustomAttribute2
[MetaData]
CustomData  CustomValue
[AdditionalInfo]
Info1
Info2
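
The dictionary-to-text layout shown in the example above can be reproduced with a few lines of plain Python. This is only a sketch of the formatting rule (headers in square brackets, one entry per line, tabs preserved); `format_config` is a hypothetical name and not part of the package.

```python
def format_config(text: dict[str, list[str]]) -> str:
    """Hypothetical sketch: render {header: [entries]} as a sectioned config string."""
    lines = []
    for header, entries in text.items():
        lines.append(f"[{header}]")  # header enclosed in square brackets
        lines.extend(entries)        # entries are written unchanged, tabs included
    return "\n".join(lines)


custom_text = {
    "CustomSection1": ["CustomAttribute1", "CustomAttribute2"],
    "MetaData": ["CustomData\tCustomValue"],
    "AdditionalInfo": ["Info1", "Info2"],
}
formatted = format_config(custom_text)
print(formatted)
```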
marvel.utils.config_tools.update_config_file(config: dict, path: str | None = None)

Update the configuration file

After extraction of variants, the extract_variants option should be set to False and the file containing the variants should be supplied. This function updates the configuration file to perform this.

Parameters:
  • config (dict) – Parsed dictionary of the current configuration file.

  • path (str, default None) – An optional path to write the text to disk

Returns:

config – Parsed dictionary of the updated configuration file.

Return type:

dict

marvel.utils.data_manager module

Memory-efficient data manager with intelligent caching

class marvel.utils.data_manager.DataManager(pheno_file: str | None = None, cov_file: str | None = None, id_column: str = 'id', valid_ids: Set | None = None, outcome_cache_size: int = 20, covariate_cache_size: int = 20, sep: str = '\t')

Bases: object

Memory-efficient data manager for phenotype, covariate, and genetic data.

Uses intelligent caching to minimize both memory usage and file I/O:
  • Current exposure is cached (reused across all outcomes)

  • LRU cache for recently accessed outcomes

  • LRU cache for recently accessed covariate sets

clear_caches()

Clear all caches

get_cache_stats() dict

Get cache performance statistics

Returns:

Statistics about cache hits/misses

Return type:

dict

get_exposure() DataFrame | None

Get current cached exposure data

Returns:

Cached exposure data

Return type:

pd.DataFrame or None

get_merged_data(exposure: str, outcome: str, covariates: List[str] | None = None, extra_outcome_cols: List[str] | None = None) Tuple[DataFrame, DataFrame | None]

Get merged data for a single test: exposure + outcome + covariates

Parameters:
  • exposure (str) – Exposure name (must be set via set_exposure first)

  • outcome (str) – Outcome name

  • covariates (list of str, optional) – Covariate names

  • extra_outcome_cols (list of str, optional) – Additional columns to load from the phenotype file alongside the outcome (e.g. survival time column)

Returns:

(data_df, cov_df) where:
  • data_df has ID + exposure + outcome (+ extra_outcome_cols)

  • cov_df has ID + covariates (or None)

Return type:

tuple

load_covariates(covariates: List[str], stratify_to_valid: bool = True, refactor: bool = True) DataFrame | None

Load covariates from covariate file

Parameters:
  • covariates (list of str) – Covariate column names

  • stratify_to_valid (bool) – Whether to stratify to valid_ids

Returns:

DataFrame with ID + covariate columns, or None if no covariates

Return type:

pd.DataFrame or None

load_outcome(outcome: str, stratify_to_valid: bool = True) DataFrame

Load a single outcome from phenotype file

Parameters:
  • outcome (str) – Outcome column name

  • stratify_to_valid (bool) – Whether to stratify to valid_ids

Returns:

DataFrame with ID + outcome columns

Return type:

pd.DataFrame

load_outcomes(outcomes: List[str], stratify_to_valid: bool = True) DataFrame

Load multiple outcomes from phenotype file

Parameters:
  • outcomes (list of str) – Outcome column names

  • stratify_to_valid (bool) – Whether to stratify to valid_ids

Returns:

DataFrame with ID + outcome columns

Return type:

pd.DataFrame

set_exposure(exposure: str, exposure_data: DataFrame)

Cache current exposure data

Parameters:
  • exposure (str) – Exposure name

  • exposure_data (pd.DataFrame) – Exposure dataframe (ID + exposure column)

class marvel.utils.data_manager.LRUCache(capacity: int = 20)

Bases: object

Simple LRU (Least Recently Used) cache

clear()

Clear all cached items

get(key: str) DataFrame | None

Get item from cache

Parameters:

key (str) – Cache key

Returns:

Cached dataframe if exists, None otherwise

Return type:

pd.DataFrame or None

put(key: str, value: DataFrame)

Put item in cache

Parameters:
  • key (str) – Cache key

  • value (pd.DataFrame) – Dataframe to cache
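
For readers unfamiliar with the eviction policy, the sketch below shows one common way to build an LRU cache with the same get/put/clear interface. It is an illustration based on `collections.OrderedDict`, not the package's implementation; `SimpleLRUCache` is a hypothetical name, and plain strings stand in for the dataframes that the documented class stores.

```python
from collections import OrderedDict


class SimpleLRUCache:
    """Illustrative LRU cache: the least recently used key is evicted first."""

    def __init__(self, capacity: int = 20):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry

    def clear(self):
        self._items.clear()


cache = SimpleLRUCache(capacity=2)
cache.put("bmi", "bmi_frame")
cache.put("ldl", "ldl_frame")
cache.get("bmi")               # touching "bmi" makes "ldl" the oldest entry
cache.put("hdl", "hdl_frame")  # capacity exceeded, so "ldl" is evicted
```

Marking an entry as recently used on every successful get is what distinguishes this policy from simple first-in, first-out eviction.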

marvel.utils.missingness module

Missingness handling for MARVELous Pipeline

class marvel.utils.missingness.MissingnessHandler(listwise: bool = True, verbose: bool = True)

Bases: object

Handle missing data using listwise or pairwise deletion strategies.

Listwise deletion removes samples that have a missing value in any of the relevant columns, before any analysis (typically applied once at the pipeline level).

Pairwise deletion removes samples with missing values only for the specific exposure-outcome-covariates combination being tested (applied per test).
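
The difference between the two strategies is easiest to see on a toy dataset. The sketch below is dependency-free and uses lists of dictionaries with None as the missing marker, whereas the handler itself operates on pd.DataFrame objects; the helper `complete_cases` and all column names are hypothetical.

```python
rows = [
    {"id": 1, "exposure": 0.2, "bmi": 24.1, "ldl": 3.1},
    {"id": 2, "exposure": 1.0, "bmi": None, "ldl": 2.7},
    {"id": 3, "exposure": 0.0, "bmi": 27.5, "ldl": None},
    {"id": 4, "exposure": None, "bmi": 22.0, "ldl": 3.9},
]


def complete_cases(rows, columns):
    """Keep rows that have no missing (None) value in any of `columns`."""
    return [r for r in rows if all(r[c] is not None for c in columns)]


# Listwise: one filter over every relevant column, applied once up front.
listwise_ids = [r["id"] for r in complete_cases(rows, ["exposure", "bmi", "ldl"])]

# Pairwise: a separate filter per test, using only that test's columns.
pairwise_bmi_ids = [r["id"] for r in complete_cases(rows, ["exposure", "bmi"])]
pairwise_ldl_ids = [r["id"] for r in complete_cases(rows, ["exposure", "ldl"])]

print(listwise_ids, pairwise_bmi_ids, pairwise_ldl_ids)
```

In this toy data, pairwise deletion retains more samples per test, at the cost of each test being run on a slightly different sample set.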

filter_by_missingness(data: DataFrame, columns: List[str], max_missingness: float, column_type: str = 'column', raise_error: bool = False) tuple[List[str], List[str]]

Filter columns by missingness threshold.

Parameters:
  • data (pd.DataFrame) – Input dataframe

  • columns (list of str) – Columns to check

  • max_missingness (float) – Maximum allowed missingness proportion (0-1)

  • column_type (str) – Type of columns for reporting (e.g., “outcome”, “exposure”, “covariate”)

  • raise_error (bool) – If True, raise ValueError when columns exceed threshold. If False, log warning and return filtered list.

Returns:

(valid_columns, removed_columns)

Return type:

tuple of (list of str, list of str)

Raises:

ValueError – If raise_error=True and any columns exceed the threshold
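
The thresholding logic can be sketched without pandas as follows. This is an assumption-laden illustration: `split_by_missingness` is a hypothetical helper, None marks a missing value, and a column is treated as valid when its missing proportion does not exceed the threshold.

```python
def split_by_missingness(rows, columns, max_missingness):
    """Return (valid_columns, removed_columns) based on the missing proportion."""
    if not rows:
        return list(columns), []  # nothing to measure, keep every column
    valid, removed = [], []
    for col in columns:
        frac_missing = sum(r[col] is None for r in rows) / len(rows)
        if frac_missing <= max_missingness:
            valid.append(col)
        else:
            removed.append(col)  # exceeds the allowed proportion
    return valid, removed


rows = [
    {"bmi": 24.1, "ldl": None},
    {"bmi": None, "ldl": None},
    {"bmi": 27.5, "ldl": 3.4},
    {"bmi": 22.0, "ldl": None},
]
valid_columns, removed_columns = split_by_missingness(rows, ["bmi", "ldl"], 0.5)
print(valid_columns, removed_columns)
```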

get_complete_columns(data: DataFrame, columns: List[str]) List[str]

Get list of columns that have no missing values.

Parameters:
  • data (pd.DataFrame) – Input dataframe

  • columns (list of str) – Columns to check

Returns:

Columns with no missing values

Return type:

list of str

get_missing_summary(data: DataFrame, columns: List[str] | None = None) DataFrame

Get summary of missing data patterns.

Parameters:
  • data (pd.DataFrame) – Input dataframe

  • columns (list of str, optional) – Columns to summarize. If None, uses all columns.

Returns:

Summary with columns: column, n_missing, pct_missing

Return type:

pd.DataFrame

get_valid_ids(data: DataFrame, id_column: str, columns: List[str]) set

Get set of valid IDs (IDs with complete data in specified columns). Memory-efficient method for large dataframes.

Parameters:
  • data (pd.DataFrame) – Input dataframe

  • id_column (str) – ID column name

  • columns (list of str) – Columns to check for missing values (excluding id_column)

Returns:

Set of valid IDs with complete data

Return type:

set

get_valid_ids_from_files(files: List[str], id_column: str, columns: List[str], sep: str = '\t', chunksize: int | None = None) set

Get valid IDs by iterating over files in chunks. Very memory-efficient for extremely large files.

Parameters:
  • files (list of str) – List of file paths to process

  • id_column (str) – ID column name

  • columns (list of str) – Columns to check for missing values

  • sep (str) – File separator (default: tab)

  • chunksize (int, optional) – Number of rows to read at a time. If None, reads entire file.

Returns:

Set of valid IDs across all files

Return type:

set

handle(data: DataFrame, columns: List[str] | None = None, exposure: str | None = None, outcome: str | None = None, covariates: List[str] | None = None, id_column: str | None = None) DataFrame

Apply the configured missingness handling strategy.

Parameters:
  • data (pd.DataFrame) – Input dataframe

  • columns (list of str, optional) – Columns to check (for listwise deletion)

  • exposure (str, optional) – Exposure column name (for pairwise deletion)

  • outcome (str, optional) – Outcome column name (for pairwise deletion)

  • covariates (list of str, optional) – Covariate column names (for pairwise deletion)

  • id_column (str, optional) – ID column name (for reporting purposes)

Returns:

Dataframe with missing data handled according to strategy

Return type:

pd.DataFrame

handle_listwise(data: DataFrame, columns: List[str]) DataFrame

Apply listwise deletion across specified columns.

This removes any sample that has missing data in ANY of the specified columns. Should be applied once at the pipeline level before iterating over exposures.

Parameters:
  • data (pd.DataFrame) – Input dataframe

  • columns (list of str) – Columns to check for missing values

Returns:

Dataframe with complete cases only

Return type:

pd.DataFrame

handle_pairwise(data: DataFrame, exposure: str, outcome: str, covariates: List[str] | None = None, extra_outcome_cols: List[str] | None = None) DataFrame

Apply pairwise deletion for a specific exposure-outcome-covariate combination.

This removes only samples with missing data in the exposure, outcome, or covariates being tested. Should be applied per-test in the testing loop.

Parameters:
  • data (pd.DataFrame) – Input dataframe

  • exposure (str) – Exposure column name

  • outcome (str) – Outcome column name

  • covariates (list of str, optional) – Covariate column names

  • extra_outcome_cols (list of str, optional) – Additional outcome-related columns to check for missing values (e.g., survival time columns that should be treated like outcome columns)

  • id_column (str, optional) – ID column name (for reporting purposes)

Returns:

Dataframe with complete cases for this test

Return type:

pd.DataFrame

marvel.utils.utils module

Collection of useful functions for the marvel package

marvel.utils.utils.check_environ(environ_variable: str = 'MARVEL_TEST_DEFS', fall_back: str | None = '/usr/local/lib/python3.12/site-packages/marvel/association/tests.py') str

Checks whether environ_variable is set and otherwise falls back to the fall_back path.

Parameters:
  • environ_variable (str) – The environment variable to check for.

  • fall_back (str or None) – A fallback value that is returned if environ_variable is not set. Supply None to raise an error instead.

Return type:

The environ_variable or fall_back content as string.

Raises:

TypeError – Raised if environ_variable is not set and fall_back is None.
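
The lookup order described above (environment variable first, then the fallback, then an error) might look roughly like this. `resolve_path`, the variable name `MARVEL_SKETCH_DEMO`, and both paths are hypothetical and serve only to demonstrate the behaviour.

```python
import os


def resolve_path(environ_variable: str = "MARVEL_TEST_DEFS", fall_back=None) -> str:
    """Hypothetical sketch: prefer the environment variable, else the fallback."""
    value = os.environ.get(environ_variable)
    if value is not None:
        return value
    if fall_back is None:
        raise TypeError(f"{environ_variable} is not set and no fall_back was supplied")
    return fall_back


# Demonstration with a made-up variable name and made-up paths.
os.environ["MARVEL_SKETCH_DEMO"] = "/path/to/custom_tests.py"
from_env = resolve_path("MARVEL_SKETCH_DEMO", fall_back="/path/to/default_tests.py")

del os.environ["MARVEL_SKETCH_DEMO"]
from_fallback = resolve_path("MARVEL_SKETCH_DEMO", fall_back="/path/to/default_tests.py")
```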

marvel.utils.utils.check_extension(file_name: str, extension: str | list[str], value: bool = False)

Check whether the extension of a file is as expected

Parameters:
  • file_name (str) – Name of the file to check

  • extension (str or list of str) – The extension(s) to be tested

  • value (bool, default False) – Return the file extension if True

Returns:

True or the file extension if the file_name has one of the given extensions. Raises InputValidationError otherwise.

marvel.utils.utils.has_rows(dfs: DataFrame | list[DataFrame])

Check if dataframe has rows

Parameters:

dfs (pd.DataFrame or list of pd.DataFrame) – Dataframes to be checked

Raises:

InputValidationError

marvel.utils.utils.infer_column_types(df)

Infer the column types of columns in a dataframe

Parameters:

df (pd.DataFrame) – DataFrame of the data to be inferred.

Returns:

Dictionary with the keys being column types and the values being column names.

marvel.utils.utils.is_gzip(file_name: str)

Check whether a file_name is referring to a gzipped file

Parameters:

file_name (str) – Name of the file to check

Return type:

True if the file_name has ‘gz’ as extension, otherwise False.

marvel.utils.utils.load_custom(environ_variable: str = 'MARVEL_TEST_DEFS', fall_back: str | None = '/usr/local/lib/python3.12/site-packages/marvel/association/tests.py')

Loads a user-defined module from an environment variable or fall_back path.

marvel.utils.utils.merge_dfs(data: list[DataFrame], id_cols: list[str | bool] | bool | None = None, **kwargs)

Merge several files to one dataframe

Parameters:
  • data (list of str or list of pd.DataFrame) – List of dataframes

  • id_cols (list of str or list of bool or bool, default None) – List of column names which represent individual IDs. Should be as long as the list of data. If the index of a dataframe contains the individual IDs, the corresponding entry should be set to True. If all dataframes have the individual IDs in the index, set to True or None.

  • **kwargs – Additional keyword arguments that will be passed on to merge

Returns:

merged_data – Merged dataframe with all the input

Return type:

pd.DataFrame

marvel.utils.utils.open_file(filepath: str, **kwargs)

Opens a gzipped or non-gzipped file

Parameters:
  • filepath (str) – Path to the file to read.

  • **kwargs – Additional keyword arguments passed on to the actual open function.

Returns:

file object

Return type:

Opened file object (either plain or gzipped)

Examples

>>> with open_file("yourfile.tsv.gz") as f:
...     for line in f:
...         print(line.strip())
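
A function of this kind is commonly built on the standard-library gzip module. The sketch below assumes the decision is made from the `.gz` file extension and that files are opened in text mode by default; `open_maybe_gzip` is a hypothetical name, and the package's own logic may differ.

```python
import gzip
import os
import tempfile


def open_maybe_gzip(filepath: str, mode: str = "rt", **kwargs):
    """Hypothetical sketch: open gzipped files with gzip.open, others with open."""
    if filepath.endswith(".gz"):
        return gzip.open(filepath, mode, **kwargs)
    return open(filepath, mode, **kwargs)


# Write the same content to a plain and a gzipped file, then read both back.
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "demo.tsv")
    gz_path = os.path.join(tmp, "demo.tsv.gz")
    with open(plain_path, "w", encoding="utf-8") as f:
        f.write("id\tvalue\n")
    with gzip.open(gz_path, "wt", encoding="utf-8") as f:
        f.write("id\tvalue\n")

    with open_maybe_gzip(plain_path, encoding="utf-8") as f:
        plain_header = f.readline()
    with open_maybe_gzip(gz_path, encoding="utf-8") as f:
        gz_header = f.readline()
```

Both calls return a file object usable as a context manager, so calling code does not need to know whether the input was compressed.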
marvel.utils.utils.qc_dict(in_dict: dict, key_names: list, required: bool = True, warning: bool = False) dict

Check if key_names are present in dictionary

Parameters:
  • in_dict (dict) – Input dictionary.

  • key_names (list) – List of keys that should be present in in_dict.

  • required (bool) – Whether the keys should be set to None if non-existent.

  • warning (bool) – Whether a warning should be raised instead of an error if in_dict contains invalid keys

Returns:

in_dict – Dictionary with all key_names or the subset of key_names already present in in_dict.

Return type:

dict

marvel.utils.utils.read_if_new(new_file_name: str, old_file_name: str | None, old_df: DataFrame, **kwargs)

Read dataframe if new

Parameters:
  • new_file_name (str) – Path to the file name that should be read if different from old_file_name.

  • old_file_name (str or None) – Path to the file name that is currently read.

  • old_df (pd.DataFrame) – The dataframe read from the path of old_file_name.

  • **kwargs – Additional keyword arguments passed on to pd.read_csv()

Returns:

df – A dataframe: old_df if the file names were equal, or the data read from new_file_name if they were not.

Return type:

pd.DataFrame

marvel.utils.utils.refactor_col(col: Series)

Convert a dataframe column to a categorical column

Parameters:

col (pd.Series) – Dataframe column

Return type:

The same column, converted to categories

marvel.utils.utils.return_header_list(file, sep='\t')