`marvel.utils` package

`marvel.utils.config_tools` module

Tools to deal with configuration files

marvel.utils.config_tools.ConfigParser(path: str | None = None, encoding: str = 'utf-8') → dict

Parses configuration files into structured data and optionally assigns this to a user-supplied mapper instance.

Parameters:

path (str | None) – Path to the configuration file to be parsed or None if example configuration file should be used.
required_headers (set | list(set) | None) – Set or list of sets with required headers in the configuration file.
encoding (str) – Encoding, default is utf-8.

Raises:

ValueError – If a parsed line does not contain a tab delimiter.
InputValidationError – If a key-value pair is duplicated within the same section.
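
The file layout implied by this entry (bracketed section headers followed by tab-delimited lines) can be illustrated with a small stand-alone parser. This is a sketch under stated assumptions, not the ConfigParser implementation: the function name `parse_sections` and the section name `GenoInput` are hypothetical, and the built-in ValueError stands in for the package's InputValidationError on duplicates.

```python
def parse_sections(text: str) -> dict:
    """Parse '[Header]' sections made of 'key<TAB>value' lines into nested dicts."""
    parsed = {}
    current = None
    for raw in text.splitlines():
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            parsed.setdefault(current, {})
            continue
        if current is None:
            raise ValueError(f"entry before any section header: {line!r}")
        if "\t" not in line:
            raise ValueError(f"no tab delimiter in line: {line!r}")
        key, value = line.split("\t", 1)
        if key in parsed[current]:
            # Stand-in for InputValidationError (assumption for this sketch).
            raise ValueError(f"duplicate key {key!r} in section [{current}]")
        parsed[current][key] = value
    return parsed


# 'GenoInput' is a made-up section name used only for illustration.
example = "[Options]\nn_jobs\t4\nlistwise\tTrue\n[GenoInput]\nchr1\t/path/to/chr1.bgen\n"
config = parse_sections(example)
print(config["Options"]["n_jobs"])
```

All values stay strings in this sketch; any type conversion would happen in a later step.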

class marvel.utils.config_tools.PipelineConfig(extract_variants: bool = True, association_analysis: bool = True, n_jobs: int = -1, checkpoint_dir: str = './tmp_check', force_rerun: bool = False, raise_on_error: bool = True, id_column: str = 'id', output_path: str = '.', min_group_size: int = 1, tmp_dir: str = './tmp', listwise: bool = True, max_missingness: float = 0.5, cov_miss_error: bool = True, chr_column: str = 'chr', pos_column: str = 'pos', chr_pos_column: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', cat_column: str | None = None, var_sep: str = ':', file_sep: str = '\t', incl_var: bool = True, reverse: bool = True, region: str | None = None, neg_geno: object = True, sum_geno: object = True, stratify_models: Dict[str, Dict[str, List[str]] | None] | None = None, stratify_overall: bool = True, prefilter_regions: bool = False, geno_files: Dict[str, str] = None, var_files: Dict[str, str] = None, exp_files: Dict[str, str] = None, pheno_file: str | None = None, cov_file: str | None = None, variant_output: Dict[str, str] | None = None, exposures: List[str] | None = None, outcomes: Dict[str, dict] = None, covariate_models: Dict[str, List[str] | None] = None)

Bases: object

Configuration for MARVELous pipeline

alt_column: str = 'a2'

association_analysis: bool = True

cat_column: str | None = None

checkpoint_dir: str = './tmp_check'

chr_column: str = 'chr'

chr_pos_column: str = 'chr_pos'

cov_file: str | None = None

cov_miss_error: bool = True

covariate_models: Dict[str, List[str] | None] = None

exp_files: Dict[str, str] = None

exposures: List[str] | None = None

extract_variants: bool = True

file_sep: str = '\t'

force_rerun: bool = False

classmethod from_config_dict(config: dict) → PipelineConfig

Create PipelineConfig from configuration dictionary

Parameters: config (dict) – Parsed configuration dictionary
Returns: Configuration object
Return type: PipelineConfig

geno_files: Dict[str, str] = None

id_column: str = 'id'

incl_var: bool = True

listwise: bool = True

max_missingness: float = 0.5

min_group_size: int = 1

n_jobs: int = -1

neg_geno: object = True

outcomes: Dict[str, dict] = None

output_path: str = '.'

pheno_file: str | None = None

pos_column: str = 'pos'

prefilter_regions: bool = False

raise_on_error: bool = True

ref_column: str = 'a1'

region: str | None = None

reverse: bool = True

stratify_models: Dict[str, Dict[str, List[str]] | None] | None = None

stratify_overall: bool = True

sum_geno: object = True

tmp_dir: str = './tmp'

validate(): Validate configuration

var_column: str = 'ID'

var_files: Dict[str, str] = None

var_sep: str = ':'

variant_output: Dict[str, str] | None = None

marvel.utils.config_tools.check_config_file(config: dict | str, key_sep: str = ';') → bool

QC of the configuration file

Running this QC first performs basic checks on whether the pipeline can be executed successfully, which avoids preventable errors that would otherwise surface only after a long run time.

Parameters:

config (dict or str) – Dictionary of the config-file or path to the file.
key_sep (str) – Separator for the RHS values

Returns:

True if configuration file passed QC

Return type:

bool

Raises:

FileNotFoundError – If configuration file is not found
TypeError – If genetic input files are not supported
ConfigHeaderMissingError – If headers are missing
InputValidationError – If neither the extract_variants nor the association_analysis option is True

marvel.utils.config_tools.cnf_extract_outcome_tests(config: dict, header_name)

marvel.utils.config_tools.cnf_extract_survival_tests(config: dict, header_name)

Extract survival test definitions from config

The LHS key is event_col;time_col (semicolon-separated pair). The RHS is the test list (same as other sections).

Returns a dict keyed by event_col with value:

{ColType: 'survival', Tests: [...], TimeCol: time_col}
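
The mapping described above can be sketched for a single configuration entry. This is an illustration only, not the function's actual code: the helper name `parse_survival_entry`, the string keys, the column names, and the test names `cox` and `logrank` are assumptions made for the example.

```python
def parse_survival_entry(lhs: str, rhs: str, key_sep: str = ";") -> dict:
    """Turn one 'event_col;time_col' -> 'test1;test2' entry into a nested dict."""
    # Unpacking fails with ValueError unless the LHS holds exactly two parts.
    event_col, time_col = (part.strip() for part in lhs.split(key_sep))
    tests = [t.strip() for t in rhs.split(key_sep) if t.strip()]
    return {event_col: {"ColType": "survival", "Tests": tests, "TimeCol": time_col}}


# Hypothetical column and test names.
entry = parse_survival_entry("death;followup_years", "cox;logrank")
print(entry)
```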

marvel.utils.config_tools.create_config(path: str, extract_variants: bool = True, geno_input: dict | None = None, variant_input: dict | None = None, variant_output: dict | None = None, association_analysis: bool = True, exp_input: dict | None = None, pheno_input: dict | None = None, con_tests: dict | None = None, cat_tests: dict | None = None, bin_tests: dict | None = None, surv_tests: dict | None = None, covs: dict | None = None, stratify: dict | None = None, key_sep: str = ';', **other_options)

Create a MARVELous configuration file

Parameters:

path (str) – Path where the configuration file will be written
extract_variants (bool, default True) – Whether to extract variants. If True, at least the geno_input, variant_input, and variant_output arguments should be specified.
geno_input (dict or None, default None) – Dictionary mapping identifiers (keys) to paths of genetic input files (bgen/vcf) to extract variants from (values)
variant_input (dict or None, default None) – Dictionary mapping identifiers (keys) to paths of variant definition input files (values)
variant_output (dict or None, default None) – Dictionary mapping identifiers (keys) to paths of output files for extracted variants (values)
association_analysis (bool, default True) – Whether to perform association analysis. If True, at least the pheno_input, and one of con_tests, cat_tests, or bin_tests arguments should be specified.
exp_input (dict or None, default None) – Dictionary mapping identifiers (keys) to paths of extracted variant input files (values)
pheno_input (dict or None, default None) – Dictionary mapping identifiers (keys) to paths of phenotype input files (values). Key ‘phenotypes’ is required if association_analysis is True and ‘covariates’ if association_analysis is True and models are specified using covs
con_tests (dict or None, default None) – Dictionary mapping continuous outcomes (keys) to semicolon-separated test strings (values)
cat_tests (dict or None, default None) – Dictionary mapping categorical outcomes (keys) to semicolon-separated test strings (values)
bin_tests (dict or None, default None) – Dictionary mapping binary outcomes (keys) to semicolon-separated test strings (values)
covs (dict or None, default None) – Dictionary mapping model names (keys) to semicolon-separated covariate strings
stratify (dict or None, default None) – Dictionary mapping stratification names (keys) to semicolon-separated stratification options.
key_sep (str) – Separator between RHS strings
**other_options – Additional options for the [Options] section

Examples

>>> create_config(
...     'example.cnf',
...     geno_input={'chr1': '/path/to/chr1.bgen'},
...     variant_input={'gene_vars': '/path/to/variants.txt'},
...     variant_output={'variant_output': '/path/to/output.txt'},
...     extract_variants=True
... )

marvel.utils.config_tools.custom_config(path=None, text: None | dict[str, list[str]] = None) → str

Returns a toy example of a data configuration file as a string or uses user-supplied text if provided in dictionary format.

Parameters:

path (str, default None) – An optional path to write the text to disk. Uses utf-8 file encoding.
text (dict [str, list [str]], default None) – Configuration content specified as a dictionary. Each key represents a header (enclosed in square brackets in the output), and the corresponding value is a list of attributes or entries. Tab characters in list items are preserved, enabling target-source assignments in the generated file.

Returns:

The formatted configuration file.

Return type:

str

Examples

>>> custom_text = {
...     "CustomSection1": ["CustomAttribute1", "CustomAttribute2"],
...     "MetaData": ["CustomData\tCustomValue"],
...     "AdditionalInfo": ["Info1", "Info2"],
... }
>>> print(custom_config(text=custom_text))
[CustomSection1]
CustomAttribute1
CustomAttribute2
[MetaData]
CustomData  CustomValue
[AdditionalInfo]
Info1
Info2
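
The output format shown in this example can be reproduced with a few lines of plain Python. The helper below, `render_sections`, is hypothetical and only sketches the dictionary-to-text layout (bracketed header, then one line per entry); it is not the function's actual implementation and does not write to disk.

```python
def render_sections(sections: dict) -> str:
    """Render {header: [entries]} as '[header]' lines followed by the entries."""
    lines = []
    for header, entries in sections.items():
        lines.append(f"[{header}]")
        lines.extend(entries)  # tabs inside an entry are kept untouched
    return "\n".join(lines)


rendered = render_sections(
    {
        "MetaData": ["CustomData\tCustomValue"],
        "AdditionalInfo": ["Info1", "Info2"],
    }
)
print(rendered)
```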

marvel.utils.config_tools.update_config_file(config: dict, path: str | None = None)

Update the configuration file

After extraction of variants, the extract_variants option should be set to False and the file containing the variants should be supplied. This function updates the configuration file to perform this.

Parameters:

config (dict) – Parsed dictionary of the current configuration file.
path (str, default None) – An optional path to write the text to disk

Returns:

config – Parsed dictionary of the updated configuration file.

Return type:

dict

`marvel.utils.data_manager` module

Memory-efficient data manager with intelligent caching

class marvel.utils.data_manager.DataManager(pheno_file: str | None = None, cov_file: str | None = None, id_column: str = 'id', valid_ids: Set | None = None, outcome_cache_size: int = 20, covariate_cache_size: int = 20, sep: str = '\t')

Bases: object

Memory-efficient data manager for phenotype, covariate, and genetic data.

Uses intelligent caching to minimize both memory usage and file I/O:

- Current exposure is cached (reused across all outcomes)
- LRU cache for recently accessed outcomes
- LRU cache for recently accessed covariate sets

clear_caches(): Clear all caches

get_cache_stats() → dict

Get cache performance statistics

Returns: Statistics about cache hits/misses
Return type: dict

get_exposure() → DataFrame | None

Get current cached exposure data

Returns: Cached exposure data
Return type: pd.DataFrame or None

get_merged_data(exposure: str, outcome: str, covariates: List[str] | None = None, extra_outcome_cols: List[str] | None = None) → Tuple[DataFrame, DataFrame | None]

Get merged data for a single test: exposure + outcome + covariates

Parameters:

exposure (str) – Exposure name (must be set via set_exposure first)
outcome (str) – Outcome name
covariates (list of str, optional) – Covariate names
extra_outcome_cols (list of str, optional) – Additional columns to load from the phenotype file alongside the outcome (e.g. survival time column)

Returns:

(data_df, cov_df) where:

- data_df has ID + exposure + outcome (+ extra_outcome_cols)
- cov_df has ID + covariates (or None)

Return type:

tuple

load_covariates(covariates: List[str], stratify_to_valid: bool = True, refactor: bool = True) → DataFrame | None

Load covariates from covariate file

Parameters:

covariates (list of str) – Covariate column names
stratify_to_valid (bool) – Whether to stratify to valid_ids

Returns:

DataFrame with ID + covariate columns, or None if no covariates

Return type:

pd.DataFrame or None

load_outcome(outcome: str, stratify_to_valid: bool = True) → DataFrame

Load a single outcome from phenotype file

Parameters:

outcome (str) – Outcome column name
stratify_to_valid (bool) – Whether to stratify to valid_ids

Returns:

DataFrame with ID + outcome columns

Return type:

pd.DataFrame

load_outcomes(outcomes: List[str], stratify_to_valid: bool = True) → DataFrame

Load multiple outcomes from phenotype file

Parameters:

outcomes (list of str) – Outcome column names
stratify_to_valid (bool) – Whether to stratify to valid_ids

Returns:

DataFrame with ID + outcome columns

Return type:

pd.DataFrame

set_exposure(exposure: str, exposure_data: DataFrame)

Cache current exposure data

Parameters:

exposure (str) – Exposure name
exposure_data (pd.DataFrame) – Exposure dataframe (ID + exposure column)

class marvel.utils.data_manager.LRUCache(capacity: int = 20)

Bases: object

Simple LRU (Least Recently Used) cache

clear(): Clear all cached items

get(key: str) → DataFrame | None

Get item from cache

Parameters: key (str) – Cache key
Returns: Cached dataframe if exists, None otherwise
Return type: pd.DataFrame or None

put(key: str, value: DataFrame)

Put item in cache

Parameters:

key (str) – Cache key
value (pd.DataFrame) – Dataframe to cache
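
For readers unfamiliar with the eviction policy, here is a minimal LRU cache built on `collections.OrderedDict`. It is a sketch of the general technique, not the LRUCache source: the class name `SimpleLRU` and the hit/miss counters are assumptions, and plain strings stand in for dataframes.

```python
from collections import OrderedDict


class SimpleLRU:
    """Least-recently-used cache: the oldest untouched key is evicted first."""

    def __init__(self, capacity: int = 20):
        self.capacity = capacity
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key not in self._items:
            self.misses += 1
            return None
        self._items.move_to_end(key)  # mark as most recently used
        self.hits += 1
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry

    def clear(self):
        self._items.clear()


cache = SimpleLRU(capacity=2)
cache.put("outcome_a", "frame A")
cache.put("outcome_b", "frame B")
cache.get("outcome_a")             # outcome_a becomes the most recently used key
cache.put("outcome_c", "frame C")  # capacity exceeded: outcome_b is evicted
```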

`marvel.utils.missingness` module

Missingness handling for MARVELous Pipeline

class marvel.utils.missingness.MissingnessHandler(listwise: bool = True, verbose: bool = True)

Bases: object

Handle missing data using listwise or pairwise deletion strategies.

Listwise deletion removes samples that have a missing value in any of the relevant columns before any analysis (typically applied once at the pipeline level).

Pairwise deletion removes samples with missing values only for the specific exposure-outcome-covariates combination being tested (applied per test).
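
The difference between the two strategies is easiest to see on a toy table. The sketch below uses plain pandas `dropna` rather than MissingnessHandler itself, and the column names are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "exposure": [0.0, 1.0, 2.0, np.nan],
        "outcome_a": [1.2, np.nan, 0.7, 0.9],
        "outcome_b": [5.0, 6.0, np.nan, 8.0],
    }
)

# Listwise: one filter over every analysis column, applied once up front.
listwise = df.dropna(subset=["exposure", "outcome_a", "outcome_b"])

# Pairwise: a separate filter per test, using only that test's columns.
pairwise_a = df.dropna(subset=["exposure", "outcome_a"])
pairwise_b = df.dropna(subset=["exposure", "outcome_b"])

print(len(listwise), len(pairwise_a), len(pairwise_b))
```

Listwise deletion keeps the sample set identical across tests at the cost of sample size; pairwise deletion keeps more samples per test, but the tests then run on different subsets.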

filter_by_missingness(data: DataFrame, columns: List[str], max_missingness: float, column_type: str = 'column', raise_error: bool = False) → tuple[List[str], List[str]]

Filter columns by missingness threshold.

Parameters:

data (pd.DataFrame) – Input dataframe
columns (list of str) – Columns to check
max_missingness (float) – Maximum allowed missingness proportion (0-1)
column_type (str) – Type of columns for reporting (e.g., “outcome”, “exposure”, “covariate”)
raise_error (bool) – If True, raise ValueError when columns exceed threshold. If False, log warning and return filtered list.

Returns:

(valid_columns, removed_columns)

Return type:

tuple of (list of str, list of str)

Raises:

ValueError – If raise_error=True and any columns exceed the threshold
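
A threshold filter of this kind can be sketched with pandas as follows. The helper name `split_by_missingness` is hypothetical, and treating the threshold as inclusive (a column with missingness exactly equal to max_missingness is kept) is an assumption of this sketch, not something the documentation states.

```python
import numpy as np
import pandas as pd


def split_by_missingness(data: pd.DataFrame, columns: list, max_missingness: float):
    """Return (valid_columns, removed_columns) based on the fraction of NaN values."""
    missing_fraction = data[columns].isna().mean()
    # Assumption: the threshold is inclusive.
    valid = [c for c in columns if missing_fraction[c] <= max_missingness]
    removed = [c for c in columns if missing_fraction[c] > max_missingness]
    return valid, removed


toy = pd.DataFrame(
    {
        "bmi": [22.5, np.nan, 30.1, 27.0],        # 25% missing
        "ldl": [np.nan, np.nan, np.nan, 3.1],     # 75% missing
    }
)
valid_cols, removed_cols = split_by_missingness(toy, ["bmi", "ldl"], max_missingness=0.5)
print(valid_cols, removed_cols)
```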

get_complete_columns(data: DataFrame, columns: List[str]) → List[str]

Get list of columns that have no missing values.

Parameters:

data (pd.DataFrame) – Input dataframe
columns (list of str) – Columns to check

Returns:

Columns with no missing values

Return type:

list of str

get_missing_summary(data: DataFrame, columns: List[str] | None = None) → DataFrame

Get summary of missing data patterns.

Parameters:

data (pd.DataFrame) – Input dataframe
columns (list of str, optional) – Columns to summarize. If None, uses all columns.

Returns:

Summary with columns: column, n_missing, pct_missing

Return type:

pd.DataFrame

get_valid_ids(data: DataFrame, id_column: str, columns: List[str]) → set

Get set of valid IDs (IDs with complete data in specified columns). Memory-efficient method for large dataframes.

Parameters:

data (pd.DataFrame) – Input dataframe
id_column (str) – ID column name
columns (list of str) – Columns to check for missing values (excluding id_column)

Returns:

Set of valid IDs with complete data

Return type:

set

get_valid_ids_from_files(files: List[str], id_column: str, columns: List[str], sep: str = '\t', chunksize: int | None = None) → set

Get valid IDs by iterating over files in chunks. Very memory-efficient for extremely large files.

Parameters:

files (list of str) – List of file paths to process
id_column (str) – ID column name
columns (list of str) – Columns to check for missing values
sep (str) – File separator (default: tab)
chunksize (int, optional) – Number of rows to read at a time. If None, reads entire file.

Returns:

Set of valid IDs across all files

Return type:

set
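
Chunked reading is what keeps memory use bounded here. The sketch below shows the idea for a single tab-separated source with `pandas.read_csv(..., chunksize=...)`; the helper name `valid_ids_chunked` is hypothetical, and how the real method combines IDs across several files is not specified above, so that part is left out.

```python
import io

import pandas as pd


def valid_ids_chunked(source, id_column: str, columns: list, sep: str = "\t", chunksize: int = 2) -> set:
    """Collect IDs whose rows are complete in `columns`, reading a few rows at a time."""
    valid = set()
    reader = pd.read_csv(source, sep=sep, usecols=[id_column, *columns], chunksize=chunksize)
    for chunk in reader:
        complete = chunk.dropna(subset=columns)  # only this chunk is held in memory
        valid.update(complete[id_column].tolist())
    return valid


# In-memory stand-in for a phenotype file; empty fields are read as missing values.
toy_file = io.StringIO(
    "id\tbmi\tage\n"
    "1\t22.5\t40\n"
    "2\t\t51\n"
    "3\t30.1\t\n"
    "4\t27.0\t63\n"
    "5\t24.3\t35\n"
)
ids = valid_ids_chunked(toy_file, "id", ["bmi", "age"])
print(sorted(ids))
```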

Apply the configured missingness handling strategy.

Parameters:

data (pd.DataFrame) – Input dataframe
columns (list of str, optional) – Columns to check (for listwise deletion)
exposure (str, optional) – Exposure column name (for pairwise deletion)
outcome (str, optional) – Outcome column name (for pairwise deletion)
covariates (list of str, optional) – Covariate column names (for pairwise deletion)
id_column (str, optional) – ID column name (for reporting purposes)

Returns:

Dataframe with missing data handled according to strategy

Return type:

pd.DataFrame

handle_listwise(data: DataFrame, columns: List[str]) → DataFrame

Apply listwise deletion across specified columns.

This removes any sample that has missing data in ANY of the specified columns. Should be applied once at the pipeline level before iterating over exposures.

Parameters:

data (pd.DataFrame) – Input dataframe
columns (list of str) – Columns to check for missing values

Returns:

Dataframe with complete cases only

Return type:

pd.DataFrame

handle_pairwise(data: DataFrame, exposure: str, outcome: str, covariates: List[str] | None = None, extra_outcome_cols: List[str] | None = None) → DataFrame

Apply pairwise deletion for a specific exposure-outcome-covariate combination.

This removes only samples with missing data in the exposure, outcome, or covariates being tested. Should be applied per-test in the testing loop.

Parameters:

data (pd.DataFrame) – Input dataframe
exposure (str) – Exposure column name
outcome (str) – Outcome column name
covariates (list of str, optional) – Covariate column names
extra_outcome_cols (list of str, optional) – Additional outcome-related columns to check for missing values (e.g., survival time columns that should be treated like outcome columns)
id_column (str, optional) – ID column name (for reporting purposes)

Returns:

Dataframe with complete cases for this test

Return type:

pd.DataFrame

`marvel.utils.utils` module

Collection of useful functions for the marvel package

marvel.utils.utils.check_environ(environ_variable: str = 'MARVEL_TEST_DEFS', fall_back: str | None = '/usr/local/lib/python3.12/site-packages/marvel/association/tests.py') → str

Checks whether environ_variable is set and otherwise falls back to the fall_back path.

Parameters:

environ_variable (str) – The environment variable to check for.
fall_back (str or None) – A fallback value that is returned if environ_variable is not set. Supply None to disable the fallback and raise an error instead.

Returns:

The content of environ_variable, or fall_back, as a string.

Raises:

TypeError – Raised if environ_variable is not set and fall_back is set to NoneType.
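
The lookup order described for this function (environment variable first, then fallback, otherwise an error) can be sketched with `os.environ`. The helper name `resolve_from_environ` and the variable name `HYPOTHETICAL_TEST_DEFS` are made up for this illustration; this is not the package's code.

```python
import os
from typing import Optional


def resolve_from_environ(environ_variable: str, fall_back: Optional[str] = None) -> str:
    """Return the variable's value if set, else the fallback; error if neither exists."""
    value = os.environ.get(environ_variable)
    if value is not None:
        return value
    if fall_back is None:
        # Mirrors the documented behaviour: no variable and no fallback is an error.
        raise TypeError(f"{environ_variable} is not set and no fallback was given")
    return fall_back


# HYPOTHETICAL_TEST_DEFS is an illustrative name, not a variable used by marvel.
os.environ["HYPOTHETICAL_TEST_DEFS"] = "/path/to/custom_tests.py"
from_env = resolve_from_environ("HYPOTHETICAL_TEST_DEFS", fall_back="/path/to/default.py")

os.environ.pop("HYPOTHETICAL_TEST_DEFS")
from_fallback = resolve_from_environ("HYPOTHETICAL_TEST_DEFS", fall_back="/path/to/default.py")
```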

marvel.utils.utils.check_extension(file_name: str, extension: str | list[str], value: bool = False)

Check whether the extension of a file is as expected

Parameters:

file_name (str) – Name of the file to check
extension (str or list of str) – The extension(s) to be tested
value (bool, default is False) – Return the file extension if True

Returns:

True or the file extension if the file_name has one of the given extensions. Raises InputValidationError otherwise.

marvel.utils.utils.has_rows(dfs: DataFrame | list[DataFrame])

Check if dataframe has rows

Parameters: dfs (pd.DataFrame or list of pd.DataFrame) – Dataframes to be checked
Raises: InputValidationError

marvel.utils.utils.infer_column_types(df)

Infer the column types of columns in a dataframe

Parameters:

df (pd.DataFrame) – DataFrame of the data to be inferred.

Returns:

Dictionary with the keys being column types and the values being column names.

marvel.utils.utils.is_gzip(file_name: str)

Check whether a file_name is referring to a gzipped file

Parameters: file_name (str) – Name of the file to check
Returns: True if the file_name has ‘gz’ as extension, otherwise False.

marvel.utils.utils.load_custom(environ_variable: str = 'MARVEL_TEST_DEFS', fall_back: str | None = '/usr/local/lib/python3.12/site-packages/marvel/association/tests.py'): Loads a user-defined module from an environmental variable or fall_back path.

marvel.utils.utils.merge_dfs(data: list[DataFrame], id_cols: list[str | bool] | bool | None = None, **kwargs)

Merge several files to one dataframe

Parameters:

data (list of str or list of pd.DataFrame) – List of dataframes
id_cols (list of str or list of bool or bool, default None) – List of column names which represent individual IDs. Should be as long as the list of data. If the index contains the individual IDs, the specific value of this data should be set to True. If all data have individual IDs in the index, set to True or None.
**kwargs – Additional keyword arguments that will be passed on to merge

Returns:

merged_data – Merged dataframe with all the input

Return type:

pd.DataFrame
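
Merging a list of dataframes on a shared ID is commonly done by folding `DataFrame.merge` over the list. The sketch below shows that pattern in a simplified form: it assumes every frame has the same ID column name, which is narrower than the id_cols handling described above, and the helper name `merge_on_id` is hypothetical.

```python
from functools import reduce

import pandas as pd


def merge_on_id(frames: list, id_col: str = "id", **kwargs) -> pd.DataFrame:
    """Merge all frames pairwise, left to right, on a shared ID column."""
    return reduce(lambda left, right: left.merge(right, on=id_col, **kwargs), frames)


pheno = pd.DataFrame({"id": [1, 2, 3], "outcome": [0.4, 1.1, 0.9]})
covs = pd.DataFrame({"id": [2, 3, 4], "age": [51, 47, 63]})
expo = pd.DataFrame({"id": [3, 2, 5], "dosage": [2, 0, 1]})

merged = merge_on_id([pheno, covs, expo])              # default inner join
merged_outer = merge_on_id([pheno, covs, expo], how="outer")
print(merged)
```

Extra keyword arguments such as `how="outer"` are passed straight through to `merge`, in the same spirit as the **kwargs parameter documented above.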

marvel.utils.utils.open_file(filepath: str, **kwargs)

Opens a gzipped or non-gzipped file

Parameters:

filepath (str) – Path to the file to read.
**kwargs – Additional keyword arguments passed on to the actual open function.

Returns:

file object

Return type:

Opened file object (either plain or gzipped)

Examples

>>> with open_file("yourfile.tsv.gz") as f:
...     for line in f:
...         print(line.strip())
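
How such a helper might dispatch between plain and gzipped files can be sketched with the standard library. The function name `open_maybe_gzip`, the suffix check, and the text-mode default are assumptions of this sketch rather than details taken from the package.

```python
import gzip
import os
import tempfile


def open_maybe_gzip(filepath: str, mode: str = "rt", **kwargs):
    """Open a file with gzip if its name ends in '.gz', otherwise with built-in open."""
    if filepath.endswith(".gz"):
        return gzip.open(filepath, mode, **kwargs)
    return open(filepath, mode, **kwargs)


content = "id\tvalue\n1\t0.5\n"
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "demo.tsv")
    gz_path = plain_path + ".gz"
    with open(plain_path, "w") as handle:
        handle.write(content)
    with gzip.open(gz_path, "wt") as handle:
        handle.write(content)

    # The caller reads both files the same way.
    with open_maybe_gzip(plain_path) as handle:
        plain_lines = [line.rstrip("\n") for line in handle]
    with open_maybe_gzip(gz_path) as handle:
        gz_lines = [line.rstrip("\n") for line in handle]
```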

marvel.utils.utils.qc_dict(in_dict: dict, key_names: list, required: bool = True, warning: bool = False) → dict

Check if key_names are present in dictionary

Parameters:

in_dict (dict) – Input dictionary.
key_names (list) – List of keys that should be present in in_dict.
required (bool) – Whether the keys should be set to None if non-existent.
warning (bool) – Whether a warning should be raised instead of an error if in_dict contains invalid keys

Returns:

in_dict – Dictionary with all key_names or the subset of key_names already present in in_dict.

Return type:

dict

marvel.utils.utils.read_if_new(new_file_name: str, old_file_name: str | None, old_df: DataFrame, **kwargs)

Read dataframe if new

Parameters:

new_file_name (str) – Path to the file name that should be read if different from old_file_name.
old_file_name (str or None) – Path to the file name that is currently read.
old_df (pd.DataFrame) – The dataframe read from the path of old_file_name.
**kwargs – Additional keyword arguments passed on to pd.read_csv()

Returns:

df – A dataframe. The old_df if the file names were equal, or the data in new_file_name if the file names were not equal.

Return type:

pd.DataFrame

marvel.utils.utils.refactor_col(col: Series)

Create category of dataframe column

Parameters: col (pd.Series) – Dataframe column
Returns: The same column, converted to categories

marvel.utils.utils.return_header_list(file, sep='\t')
