marvel.utils package
marvel.utils.config_tools module
Tools to deal with configuration files
- marvel.utils.config_tools.ConfigParser(path: str | None = None, encoding: str = 'utf-8') dict
Parses configuration files into structured data and optionally assigns the result to a user-supplied mapper instance.
- Parameters:
path (str | None) – Path to the configuration file to be parsed or None if example configuration file should be used.
required_headers (set | list(set) | None) – Set or list of sets with required headers in the configuration file.
encoding (str) – File encoding; the default is utf-8.
- Raises:
ValueError – If a parsed line does not contain a tab delimiter.
InputValidationError – If a key-value pair is duplicated within the same section.
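The parsing rules listed above (tab-delimited lines, duplicate detection per section) can be sketched roughly as follows. This is an illustration only, not the package's implementation: `parse_config_text` is a hypothetical helper, the bracketed section headers are assumed from the `custom_config` description, every non-header line is assumed to be a key/value pair, and a plain `ValueError` stands in for `InputValidationError`. The option names in the sample text are placeholders.

```python
def parse_config_text(text: str) -> dict:
    """Parse sectioned, tab-delimited text into a nested dict (hypothetical helper)."""
    parsed: dict = {}
    section = None
    for raw in text.splitlines():
        stripped = raw.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped.startswith("[") and stripped.endswith("]"):
            # A bracketed line opens a new section.
            section = stripped[1:-1]
            parsed.setdefault(section, {})
            continue
        if "\t" not in raw:
            raise ValueError(f"No tab delimiter in line: {raw!r}")
        if section is None:
            raise ValueError(f"Entry before any section header: {raw!r}")
        key, value = raw.split("\t", 1)
        if key in parsed[section]:
            # Stand-in for InputValidationError, which is not importable here.
            raise ValueError(f"Duplicate key {key!r} in section [{section}]")
        parsed[section][key] = value
    return parsed


example = "[Options]\nn_jobs\t4\nid_column\tid\n"
config = parse_config_text(example)
```

Values are kept as strings in this sketch; any type conversion would be a separate step.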
- class marvel.utils.config_tools.PipelineConfig(extract_variants: bool = True, association_analysis: bool = True, n_jobs: int = -1, checkpoint_dir: str = './tmp_check', force_rerun: bool = False, raise_on_error: bool = True, id_column: str = 'id', output_path: str = '.', min_group_size: int = 1, tmp_dir: str = './tmp', listwise: bool = True, max_missingness: float = 0.5, cov_miss_error: bool = True, chr_column: str = 'chr', pos_column: str = 'pos', chr_pos_column: str = 'chr_pos', ref_column: str = 'a1', alt_column: str = 'a2', var_column: str = 'ID', cat_column: str | None = None, var_sep: str = ':', file_sep: str = '\t', incl_var: bool = True, reverse: bool = True, region: str | None = None, neg_geno: object = True, sum_geno: object = True, stratify_models: Dict[str, Dict[str, List[str]] | None] | None = None, stratify_overall: bool = True, prefilter_regions: bool = False, geno_files: Dict[str, str] = None, var_files: Dict[str, str] = None, exp_files: Dict[str, str] = None, pheno_file: str | None = None, cov_file: str | None = None, variant_output: Dict[str, str] | None = None, exposures: List[str] | None = None, outcomes: Dict[str, dict] = None, covariate_models: Dict[str, List[str] | None] = None)
Bases: object
Configuration for MARVELous pipeline
- classmethod from_config_dict(config: dict) PipelineConfig
Create PipelineConfig from configuration dictionary
- Parameters:
config (dict) – Parsed configuration dictionary
- Returns:
Configuration object
- Return type:
PipelineConfig
- validate()
Validate configuration
- marvel.utils.config_tools.check_config_file(config: dict | str, key_sep: str = ';') bool
QC of the configuration file
Running this QC first performs basic checks on whether the pipeline can be executed successfully. This prevents avoidable errors from surfacing only after a long run time.
- Parameters:
config (dict or str) – Dictionary of the config-file or path to the file.
key_sep (str) – Separator for the RHS values
- Returns:
True if configuration file passed QC
- Return type:
bool
- Raises:
FileNotFoundError – If configuration file is not found
TypeError – If genetic input files are not supported
ConfigHeaderMissingError – If headers are missing
InputValidationError – If neither the extract_variants nor the association_analysis option is True
- marvel.utils.config_tools.cnf_extract_survival_tests(config: dict, header_name)
Extract survival test definitions from config
The LHS key is event_col;time_col (a semicolon-separated pair). The RHS is the test list (same as other sections).
Returns a dict keyed by event_col with value:
{ColType: 'survival', Tests: [...], TimeCol: time_col}
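The mapping described above can be sketched as below. This is a hedged illustration rather than the package's code: `extract_survival_tests` is a hypothetical helper that takes an already-parsed section, the dictionary keys are assumed to be the strings 'ColType', 'Tests' and 'TimeCol', and the column and test names are placeholders.

```python
def extract_survival_tests(section: dict, key_sep: str = ";") -> dict:
    """Turn {'event;time': 'test1;test2'} into a dict keyed by event column."""
    outcomes = {}
    for lhs, rhs in section.items():
        # The LHS must hold exactly two parts; unpacking raises ValueError otherwise.
        event_col, time_col = (part.strip() for part in lhs.split(key_sep))
        tests = [t.strip() for t in rhs.split(key_sep) if t.strip()]
        outcomes[event_col] = {
            "ColType": "survival",
            "Tests": tests,
            "TimeCol": time_col,
        }
    return outcomes


result = extract_survival_tests({"death;follow_up_years": "cox"})
```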
- marvel.utils.config_tools.create_config(path: str, extract_variants: bool = True, geno_input: dict | None = None, variant_input: dict | None = None, variant_output: dict | None = None, association_analysis: bool = True, exp_input: dict | None = None, pheno_input: dict | None = None, con_tests: dict | None = None, cat_tests: dict | None = None, bin_tests: dict | None = None, surv_tests: dict | None = None, covs: dict | None = None, stratify: dict | None = None, key_sep: str = ';', **other_options)
Create a MARVELous configuration file
- Parameters:
path (str) – Path where the configuration file will be written
extract_variants (bool, default True) – Whether to extract variants. If True, at least the geno_input, variant_input, and variant_output arguments should be specified.
geno_input (dict or None, default None) – Dictionary mapping identifiers (keys) to paths of genetic input files (bgen/vcf) to extract variants from (values)
variant_input (dict or None, default None) – Dictionary mapping identifiers (keys) to paths of variant definition input files (values)
variant_output (dict or None, default None) – Dictionary mapping identifiers (keys) to paths of output files for extracted variants (values)
association_analysis (bool, default True) – Whether to perform association analysis. If True, at least the pheno_input, and one of con_tests, cat_tests, or bin_tests arguments should be specified.
exp_input (dict or None, default None) – Dictionary mapping identifiers (keys) to paths of extracted variant input files (values)
pheno_input (dict or None, default None) – Dictionary mapping identifiers (keys) to paths of phenotype input files (values). Key ‘phenotypes’ is required if association_analysis is True and ‘covariates’ if association_analysis is True and models are specified using covs
con_tests (dict or None, default None) – Dictionary mapping continuous outcomes (keys) to semicolon-separated test strings (values)
cat_tests (dict or None, default None) – Dictionary mapping categorical outcomes (keys) to semicolon-separated test strings (values)
bin_tests (dict or None, default None) – Dictionary mapping binary outcomes (keys) to semicolon-separated test strings (values)
covs (dict or None, default None) – Dictionary mapping model names (keys) to semicolon-separated covariate strings
stratify (dict or None, default None) – Dictionary mapping stratification names (keys) to semicolon-separated stratification options.
key_sep (str) – Separator between RHS strings
**other_options – Additional options for the [Options] section
Examples
>>> create_config(
...     'example.cnf',
...     geno_input={'chr1': '/path/to/chr1.bgen'},
...     variant_input={'gene_vars': '/path/to/variants.txt'},
...     variant_output={'variant_output': '/path/to/output.txt'},
...     extract_variants=True
... )
- marvel.utils.config_tools.custom_config(path=None, text: None | dict[str, list[str]] = None) str
Returns a toy example of a data configuration file as a string or uses user-supplied text if provided in dictionary format.
- Parameters:
path (str, default None) – An optional path to write the text to disk. Uses utf-8 file encoding.
text (dict [str, list [str]], default None) – Configuration content specified as a dictionary. Each key represents a header (enclosed in square brackets in the output), and the corresponding value is a list of attributes or entries. Tab characters in list items are preserved, enabling target-source assignments in the generated file.
- Returns:
The formatted configuration file.
- Return type:
str
Examples
>>> custom_text = {
...     "CustomSection1": ["CustomAttribute1", "CustomAttribute2"],
...     "MetaData": ["CustomData CustomValue"],
...     "AdditionalInfo": ["Info1", "Info2"],
... }
>>> print(custom_config(text=custom_text))
[CustomSection1]
CustomAttribute1
CustomAttribute2
[MetaData]
CustomData CustomValue
[AdditionalInfo]
Info1
Info2
- marvel.utils.config_tools.update_config_file(config: dict, path: str | None = None)
Update the configuration file
After extraction of variants, the extract_variants option should be set to False and the file containing the variants should be supplied. This function updates the configuration file to perform this.
- Parameters:
config (dict) – Parsed dictionary of the current configuration file.
path (str, default None) – An optional path to write the text to disk
- Returns:
config – Parsed dictionary of the updated configuration file.
- Return type:
dict
marvel.utils.data_manager module
Memory-efficient data manager with intelligent caching
- class marvel.utils.data_manager.DataManager(pheno_file: str | None = None, cov_file: str | None = None, id_column: str = 'id', valid_ids: Set | None = None, outcome_cache_size: int = 20, covariate_cache_size: int = 20, sep: str = '\t')
Bases: object
Memory-efficient data manager for phenotype, covariate, and genetic data.
Uses intelligent caching to minimize both memory usage and file I/O:
  - Current exposure is cached (reused across all outcomes)
  - LRU cache for recently accessed outcomes
  - LRU cache for recently accessed covariate sets
- clear_caches()
Clear all caches
- get_cache_stats() dict
Get cache performance statistics
- Returns:
Statistics about cache hits/misses
- Return type:
dict
- get_exposure() DataFrame | None
Get current cached exposure data
- Returns:
Cached exposure data
- Return type:
pd.DataFrame or None
- get_merged_data(exposure: str, outcome: str, covariates: List[str] | None = None, extra_outcome_cols: List[str] | None = None) Tuple[DataFrame, DataFrame | None]
Get merged data for a single test: exposure + outcome + covariates
- Parameters:
- Returns:
(data_df, cov_df) where:
  - data_df has ID + exposure + outcome (+ extra_outcome_cols)
  - cov_df has ID + covariates (or None)
- Return type:
Tuple[DataFrame, DataFrame | None]
- load_covariates(covariates: List[str], stratify_to_valid: bool = True, refactor: bool = True) DataFrame | None
Load covariates from covariate file
- load_outcome(outcome: str, stratify_to_valid: bool = True) DataFrame
Load a single outcome from phenotype file
- class marvel.utils.data_manager.LRUCache(capacity: int = 20)
Bases: object
Simple LRU (Least Recently Used) cache
- clear()
Clear all cached items
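For readers unfamiliar with the eviction policy, a least-recently-used cache can be sketched with `collections.OrderedDict` as below. This is a generic illustration, not the `LRUCache` source: the class name `SimpleLRU` and the `get`/`put` method names are assumptions, since only `clear()` and the `capacity` argument are documented here.

```python
from collections import OrderedDict


class SimpleLRU:
    """Minimal LRU cache sketch; get/put are hypothetical method names."""

    def __init__(self, capacity: int = 20):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        # Mark the key as most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            # Drop the least recently used entry (front of the ordering).
            self._items.popitem(last=False)

    def clear(self):
        self._items.clear()


cache = SimpleLRU(capacity=2)
cache.put("bmi", 1)
cache.put("height", 2)
cache.get("bmi")        # "bmi" becomes the most recently used key
cache.put("weight", 3)  # capacity exceeded, so "height" is evicted
```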
marvel.utils.missingness module
Missingness handling for MARVELous Pipeline
- class marvel.utils.missingness.MissingnessHandler(listwise: bool = True, verbose: bool = True)
Bases: object
Handle missing data using listwise or pairwise deletion strategies.
Listwise deletion removes samples with missing values across all relevant columns before any analysis (typically applied once at the pipeline level).
Pairwise deletion removes samples with missing values only for the specific exposure-outcome-covariates combination being tested (applied per test).
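The difference between the two strategies can be illustrated on a toy dataset. The sketch below uses plain Python rows instead of a dataframe so that it runs without pandas; `complete_cases` is a hypothetical helper and the column names are placeholders, so it shows the idea rather than the handler's actual code.

```python
rows = [
    {"id": "s1", "snp": 0, "bmi": 24.1, "ldl": 3.2},
    {"id": "s2", "snp": 1, "bmi": None, "ldl": 2.9},
    {"id": "s3", "snp": 2, "bmi": 27.5, "ldl": None},
    {"id": "s4", "snp": None, "bmi": 22.0, "ldl": 3.8},
]


def complete_cases(rows, columns):
    """Keep rows with no missing value (None) in any of the given columns."""
    return [r for r in rows if all(r[c] is not None for c in columns)]


# Listwise: one filter over every column used anywhere in the analysis.
listwise_ids = [r["id"] for r in complete_cases(rows, ["snp", "bmi", "ldl"])]

# Pairwise: a separate filter for each exposure-outcome test.
pairwise_bmi_ids = [r["id"] for r in complete_cases(rows, ["snp", "bmi"])]
pairwise_ldl_ids = [r["id"] for r in complete_cases(rows, ["snp", "ldl"])]
```

In this toy data, listwise deletion keeps only s1, while each pairwise test keeps two samples, which is the usual trade-off between a consistent sample set and a larger per-test sample size.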
- filter_by_missingness(data: DataFrame, columns: List[str], max_missingness: float, column_type: str = 'column', raise_error: bool = False) tuple[List[str], List[str]]
Filter columns by missingness threshold.
- Parameters:
data (pd.DataFrame) – Input dataframe
max_missingness (float) – Maximum allowed missingness proportion (0-1)
column_type (str) – Type of columns for reporting (e.g., “outcome”, “exposure”, “covariate”)
raise_error (bool) – If True, raise ValueError when columns exceed threshold. If False, log warning and return filtered list.
- Returns:
(valid_columns, removed_columns)
- Return type:
tuple[List[str], List[str]]
- Raises:
ValueError – If raise_error=True and any columns exceed the threshold
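The thresholding step can be sketched as below. This is an assumption-laden illustration: `split_by_missingness` is a hypothetical helper, the data is a plain dict of lists with `None` marking missing values, and a column is kept when its missing proportion is at most `max_missingness` (the documentation only says that columns exceeding the threshold are removed).

```python
def split_by_missingness(data: dict, columns: list, max_missingness: float):
    """Return (valid_columns, removed_columns) based on the missing proportion."""
    valid, removed = [], []
    for col in columns:
        values = data[col]
        missing = sum(v is None for v in values) / len(values)
        if missing <= max_missingness:
            valid.append(col)
        else:
            removed.append(col)
    return valid, removed


data = {
    "bmi": [24.1, None, 27.5, 22.0],   # 25% missing
    "ldl": [None, None, None, 3.8],    # 75% missing
}
valid_columns, removed_columns = split_by_missingness(data, ["bmi", "ldl"], 0.5)
```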
- get_complete_columns(data: DataFrame, columns: List[str]) List[str]
Get list of columns that have no missing values.
- get_missing_summary(data: DataFrame, columns: List[str] | None = None) DataFrame
Get summary of missing data patterns.
- get_valid_ids(data: DataFrame, id_column: str, columns: List[str]) set
Get set of valid IDs (IDs with complete data in specified columns). Memory-efficient method for large dataframes.
- get_valid_ids_from_files(files: List[str], id_column: str, columns: List[str], sep: str = '\t', chunksize: int | None = None) set
Get valid IDs by iterating over files in chunks. Very memory-efficient for extremely large files.
- Parameters:
- Returns:
Set of valid IDs across all files
- Return type:
set
- handle(data: DataFrame, columns: List[str] | None = None, exposure: str | None = None, outcome: str | None = None, covariates: List[str] | None = None, id_column: str | None = None) DataFrame
Apply the configured missingness handling strategy.
- Parameters:
data (pd.DataFrame) – Input dataframe
columns (list of str, optional) – Columns to check (for listwise deletion)
exposure (str, optional) – Exposure column name (for pairwise deletion)
outcome (str, optional) – Outcome column name (for pairwise deletion)
covariates (list of str, optional) – Covariate column names (for pairwise deletion)
id_column (str, optional) – ID column name (for reporting purposes)
- Returns:
Dataframe with missing data handled according to strategy
- Return type:
pd.DataFrame
- handle_listwise(data: DataFrame, columns: List[str]) DataFrame
Apply listwise deletion across specified columns.
This removes any sample that has missing data in ANY of the specified columns. Should be applied once at the pipeline level before iterating over exposures.
- handle_pairwise(data: DataFrame, exposure: str, outcome: str, covariates: List[str] | None = None, extra_outcome_cols: List[str] | None = None) DataFrame
Apply pairwise deletion for a specific exposure-outcome-covariate combination.
This removes only samples with missing data in the exposure, outcome, or covariates being tested. Should be applied per-test in the testing loop.
- Parameters:
data (pd.DataFrame) – Input dataframe
exposure (str) – Exposure column name
outcome (str) – Outcome column name
extra_outcome_cols (list of str, optional) – Additional outcome-related columns to check for missing values (e.g., survival time columns that should be treated like outcome columns)
id_column (str, optional) – ID column name (for reporting purposes)
- Returns:
Dataframe with complete cases for this test
- Return type:
pd.DataFrame
marvel.utils.utils module
Collection of useful functions for the marvel package
- marvel.utils.utils.check_environ(environ_variable: str = 'MARVEL_TEST_DEFS', fall_back: str | None = '/usr/local/lib/python3.12/site-packages/marvel/association/tests.py') str
Will check if the environ_variable is set, and otherwise tries a fall_back path.
- Parameters:
- Return type:
The environ_variable or fall_back content as string.
- Raises:
TypeError – Raised if environ_variable is not set and fall_back is None.
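The lookup order described above can be sketched with `os.environ`. This is a minimal illustration, assuming the function returns the resolved path as a string; `resolve_path` and the variable name `MARVEL_DEMO_VARIABLE` are hypothetical and chosen so that the example does not touch a real setting.

```python
import os


def resolve_path(environ_variable: str = "MARVEL_TEST_DEFS", fall_back=None) -> str:
    """Return the environment variable's value, or fall_back if it is unset."""
    value = os.environ.get(environ_variable)
    if value:
        return value
    if fall_back is None:
        raise TypeError(
            f"{environ_variable} is not set and no fall_back was given"
        )
    return fall_back


# With the variable unset, the fall_back path is returned.
resolved = resolve_path("MARVEL_DEMO_VARIABLE", fall_back="my_tests.py")
```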
- marvel.utils.utils.check_extension(file_name: str, extension: str | list[str], value: bool = False)
Check whether the extension of a file is as expected
- Parameters:
file_name (str) – Name of the file to check
extension (str or list of str) – The extension(s) to be tested
value (bool, default is False) – Return the file extension if True
- Returns:
True or the file extension if the file_name has one of the given extensions. Raises InputValidationError otherwise.
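A check of this kind might look like the sketch below. `has_extension` is a hypothetical stand-in, not the package function, and it raises a plain `ValueError` where the documented function raises `InputValidationError`. The exact matching rules (for example, how compound extensions such as `.vcf.gz` are treated) are not specified here, so a simple suffix match is assumed.

```python
def has_extension(file_name: str, extension, value: bool = False):
    """Return True (or the matching extension) if file_name ends with one of them."""
    extensions = [extension] if isinstance(extension, str) else list(extension)
    for ext in extensions:
        # Accept extensions given with or without a leading dot.
        if file_name.endswith("." + ext.lstrip(".")):
            return ext if value else True
    # Stand-in for InputValidationError.
    raise ValueError(f"{file_name!r} does not end with any of {extensions}")


is_supported = has_extension("chr1.bgen", ["bgen", "vcf"])
matched = has_extension("chr1.vcf", ["bgen", "vcf"], value=True)
```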
- marvel.utils.utils.has_rows(dfs: DataFrame | list[DataFrame])
Check if dataframe has rows
- Parameters:
dfs (pd.DataFrame or list of pd.DataFrame) – Dataframes to be checked
- Raises:
InputValidationError – If a dataframe has no rows
- marvel.utils.utils.infer_column_types(df)
Infer the column types of columns in a dataframe
- Parameters:
df (pd.DataFrame) – DataFrame of the data to be inferred.
- Returns:
Dictionary with the keys being column types and the values being column names.
- marvel.utils.utils.is_gzip(file_name: str)
Check whether a file_name is referring to a gzipped file
- Parameters:
file_name (str) – Name of the file to check
- Return type:
True if the file_name has ‘gz’ as extension, otherwise False.
- marvel.utils.utils.load_custom(environ_variable: str = 'MARVEL_TEST_DEFS', fall_back: str | None = '/usr/local/lib/python3.12/site-packages/marvel/association/tests.py')
Loads a user-defined module from an environmental variable or fall_back path.
- marvel.utils.utils.merge_dfs(data: list[DataFrame], id_cols: list[str | bool] | bool | None = None, **kwargs)
Merge several files to one dataframe
- Parameters:
data (list of str or list of pd.DataFrame) – List of dataframes
id_cols (list of str or list of bool or bool, default None) – List of column names which represent individual IDs. Should be as long as the list of data. If the index contains the individual IDs, the specific value of this data should be set to True. If all data have individual IDs in the index, set to True or None.
**kwargs – Additional keyword arguments that will be passed on to merge
- Returns:
merged_data – Merged dataframe with all the input
- Return type:
pd.DataFrame
- marvel.utils.utils.open_file(filepath: str, **kwargs)
Opens a gzipped or non-gzipped file
- Parameters:
filepath (str) – Path to the file to read.
**kwargs – Additional keyword arguments passed on to the actual open function.
- Returns:
file object
- Return type:
Opened file object (either plain or gzipped)
Examples
>>> with open_file("yourfile.tsv.gz") as f:
...     for line in f:
...         print(line.strip())
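One common way to implement this kind of helper is to dispatch on the file name and use `gzip.open` for compressed files. The sketch below is an assumption about the approach, not the package's source: `open_maybe_gzip` is a hypothetical name, and text mode is assumed as the default.

```python
import gzip
import os
import tempfile


def open_maybe_gzip(filepath: str, mode: str = "rt", **kwargs):
    """Open a plain or gzipped file in text mode, based on the '.gz' suffix."""
    if filepath.endswith(".gz"):
        return gzip.open(filepath, mode, **kwargs)
    return open(filepath, mode, **kwargs)


# Round trip through a temporary gzipped file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.tsv.gz")
    with open_maybe_gzip(path, "wt") as handle:
        handle.write("id\tbmi\n")
    with open_maybe_gzip(path) as handle:
        first_line = handle.readline().strip()
```

Checking the suffix is cheap but trusts the file name; inspecting the first two bytes for the gzip magic number would be a stricter alternative.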
- marvel.utils.utils.qc_dict(in_dict: dict, key_names: list, required: bool = True, warning: bool = False) dict
Check if key_names are present in dictionary
- Parameters:
in_dict (dict) – Input dictionary.
key_names (list) – List of keys that should be present in in_dict.
required (bool) – Whether the keys should be set to None if non-existent.
warning (bool) – Whether a warning should be raised instead of an error if in_dict contains invalid keys
- Returns:
in_dict – Dictionary with all key_names or the subset of key_names already present in in_dict.
- Return type:
dict
- marvel.utils.utils.read_if_new(new_file_name: str, old_file_name: str | None, old_df: DataFrame, **kwargs)
Read dataframe if new
- Parameters:
new_file_name (str) – Path to the file name that should be read if different from old_file_name.
old_file_name (str or None) – Path to the file name that is currently read.
old_df (pd.DataFrame) – The dataframe read from the path of old_file_name.
**kwargs – Additional keyword arguments passed on to pd.read_csv()
- Returns:
df – A dataframe. The old_df if the file names were equal, or the data in new_file_name if the file names were not equal.
- Return type:
pd.DataFrame
- marvel.utils.utils.refactor_col(col: Series)
Create category of dataframe column
- Parameters:
col (pd.Series) – Dataframe column
- Return type:
The same column, but then turned into categories
- marvel.utils.utils.return_header_list(file, sep='\t')