biomedical_data_generator.utils.correlation_tools

Correlation analysis and seed search utilities (no plotting).

This module provides functions to compute correlation metrics, assess correlation quality, search for random seeds that yield desired correlation properties, and slice DataFrames by cluster.

Functions

assess_correlation_quality(X, ...[, ...])

Assess how well the empirical correlation of X matches the target.

compute_correlation_matrix(df_like, *[, method])

Compute the correlation matrix from a DataFrame-like object.

compute_correlation_matrix_for_cluster(df, ...)

Compute the correlation matrix for a specific cluster in the DataFrame.

compute_correlation_metrics(corr_matrix)

Compute summary metrics from a correlation matrix.

find_best_seed_for_correlation(max_tries, ...)

Find the seed, among max_tries candidates, whose empirical correlation is closest to the target.

find_seed_for_correlation(n_samples, ...[, ...])

Find a random seed that yields a cluster with desired correlation properties.

get_cluster_column_names(df, meta, cluster_id, *)

Get column names for a given cluster ID from DataFrame and metadata.

get_cluster_frame(df, meta, cluster_id, *[, ...])

Get DataFrame slice for a given cluster ID.

parse_cluster_id(name[, prefix_corr])

Parse cluster ID from column name like 'corr3_anchor' or 'corr2_5'.

pc1_share(X, *[, method, rowvar])

Compute the share of variance explained by the first principal component from data X.

pc1_share_from_corr(C)

Compute the share of variance explained by the first principal component from a correlation matrix C.

variance_partition_pc1(X, *[, method, rowvar])

Compute variance partitioning summary based on PC1 share.

biomedical_data_generator.utils.correlation_tools.assess_correlation_quality(X, correlation_target, *, tolerance=0.05, structure='equicorrelated')

Assess how well the empirical correlation of X matches the target.

Parameters:
  • X (ndarray[tuple[Any, ...], dtype[float64]]) – Feature matrix of shape (n_samples, n_features).

  • correlation_target (float) – Target correlation value.

  • tolerance (float) – Acceptable deviation from target.

  • structure (Literal['equicorrelated', 'toeplitz']) – Correlation structure (“equicorrelated” or “toeplitz”).

Returns:

Dictionary with keys:

  • mean_offdiag

  • std_offdiag

  • min_offdiag

  • max_offdiag

  • range_offdiag

  • n_offdiag

  • target

  • deviation_offdiag

  • within_tolerance

  • structure
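The two structure values can be pictured as target matrices. What follows is a minimal sketch, not the library's implementation. It assumes that "equicorrelated" means every off-diagonal entry equals the target, that "toeplitz" means an AR(1)-style decay rho ** |i - j|, and that the deviation is the absolute difference between the mean off-diagonal correlation and the target. The helpers target_matrix and check_mean_offdiag are hypothetical names.

```python
import numpy as np


def target_matrix(p, rho, structure="equicorrelated"):
    """Hypothetical helper: build a p x p target correlation matrix."""
    if structure == "equicorrelated":
        # Every pair of features shares the same correlation rho.
        return np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    if structure == "toeplitz":
        # Assumed AR(1)-style decay: correlation rho ** |i - j|.
        idx = np.arange(p)
        return rho ** np.abs(idx[:, None] - idx[None, :])
    raise ValueError(f"unknown structure: {structure!r}")


def check_mean_offdiag(X, rho, tolerance=0.05):
    """Hypothetical check: is the mean off-diagonal Pearson correlation near rho?"""
    C = np.corrcoef(X, rowvar=False)
    upper = C[np.triu_indices_from(C, k=1)]  # each feature pair counted once
    mean_offdiag = float(upper.mean())
    deviation = abs(mean_offdiag - rho)
    return mean_offdiag, deviation, deviation <= tolerance


rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(4), target_matrix(4, 0.7), size=500)
mean_offdiag, deviation, ok = check_mean_offdiag(X, 0.7)
print(mean_offdiag, deviation, ok)
```

Under this reading, a mean-based tolerance check is only meaningful for the equicorrelated case, because a Toeplitz target has different values on each off-diagonal band.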

biomedical_data_generator.utils.correlation_tools.compute_correlation_matrix(df_like, *, method='spearman')

Compute the correlation matrix from a DataFrame-like object.

Parameters:
  • df_like (DataFrame) – DataFrame-like object with features as columns.

  • method (Literal['pearson', 'kendall', 'spearman']) – Correlation method (“pearson”, “kendall”, or “spearman”).

Returns:

Tuple of (correlation matrix as 2D NumPy array, list of column labels).

Return type:

tuple[ndarray[tuple[Any, ...], dtype[float64]], list[str]]
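To illustrate why the choice of method matters, the sketch below compares Pearson and Spearman correlation on a monotone but nonlinear relationship. It is a simplified, NumPy-only illustration and not this function's implementation: spearman_matrix is a hypothetical helper that ranks each column without tie handling, so it is only valid for data without repeated values.

```python
import numpy as np


def spearman_matrix(X):
    """Hypothetical helper: Pearson correlation of column-wise ranks (no ties)."""
    X = np.asarray(X, dtype=float)
    # argsort of argsort yields 0-based ranks when all values are distinct.
    ranks = X.argsort(axis=0).argsort(axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)


a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([a, a ** 3])  # monotone, but far from linear

pearson = np.corrcoef(X, rowvar=False)[0, 1]
spearman = spearman_matrix(X)[0, 1]
print(f"pearson={pearson:.3f} spearman={spearman:.3f}")
```

Spearman sees a perfect monotone relationship here, while Pearson is pulled below 1 by the curvature.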

biomedical_data_generator.utils.correlation_tools.compute_correlation_matrix_for_cluster(df, meta, cluster_id, *, method='spearman', anchor_first=True, natural_sort_rest=True)

Compute the correlation matrix for a specific cluster in the DataFrame.

Parameters:
  • df (DataFrame) – DataFrame with all features.

  • meta (Any) – Metadata object with cluster information.

  • cluster_id (int) – ID of the cluster to extract.

  • method (Literal['pearson', 'kendall', 'spearman']) – Correlation method (“pearson”, “kendall”, or “spearman”).

  • anchor_first (bool) – If True, anchor feature is placed first (if available).

  • natural_sort_rest (bool) – If True, non-anchor features are sorted naturally.

Returns:

Tuple of (correlation matrix as 2D NumPy array, list of column labels for the specified cluster).

Return type:

tuple[ndarray[tuple[Any, ...], dtype[float64]], list[str]]

biomedical_data_generator.utils.correlation_tools.compute_correlation_metrics(corr_matrix)

Compute summary metrics from a correlation matrix.

Parameters:

corr_matrix (ndarray[tuple[Any, ...], dtype[floating[Any]]]) – Square correlation matrix of shape (p, p).

Returns:

Dictionary with keys:

  • mean_offdiag

  • std_offdiag

  • min_offdiag

  • max_offdiag

  • range_offdiag

  • n_offdiag
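The keys above can be read as summary statistics of the off-diagonal entries. The following is a minimal sketch of how such a dictionary might be computed. It assumes each feature pair is counted once (upper triangle only) and that the standard deviation is the population form; the library may differ on both points. The name offdiag_metrics is hypothetical.

```python
import numpy as np


def offdiag_metrics(corr_matrix):
    """Hypothetical helper: summarize the off-diagonal entries of a (p, p) matrix."""
    C = np.asarray(corr_matrix, dtype=float)
    if C.ndim != 2 or C.shape[0] != C.shape[1] or C.shape[0] < 2:
        raise ValueError("corr_matrix must be square with at least 2 features")
    vals = C[np.triu_indices_from(C, k=1)]  # assumption: each pair counted once
    return {
        "mean_offdiag": float(vals.mean()),
        "std_offdiag": float(vals.std()),  # assumption: population std (ddof=0)
        "min_offdiag": float(vals.min()),
        "max_offdiag": float(vals.max()),
        "range_offdiag": float(vals.max() - vals.min()),
        "n_offdiag": int(vals.size),
    }


C = np.array([
    [1.0, 0.8, 0.6],
    [0.8, 1.0, 0.7],
    [0.6, 0.7, 1.0],
])
metrics = offdiag_metrics(C)
print(metrics)
```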

biomedical_data_generator.utils.correlation_tools.find_best_seed_for_correlation(max_tries, n_samples, n_cluster_features, correlation, structure='equicorrelated', *, start_seed=0)

Find the seed, among max_tries candidates, whose empirical correlation is closest to the target.

Parameters:
  • max_tries (int) – Number of random seeds to try.

  • n_samples (int) – Number of samples to generate.

  • n_cluster_features (int) – Number of features in the cluster.

  • correlation (float) – Target correlation value.

  • structure (Literal['equicorrelated', 'toeplitz']) – Correlation structure (“equicorrelated” or “toeplitz”).

  • start_seed (int) – Seed to start searching from.

Returns:

Tuple of (best_seed, metrics dictionary).

Return type:

tuple[int, dict[str, float]]
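The search can be pictured as a loop over candidate seeds that keeps the one with the smallest deviation. This is a sketch under stated assumptions rather than the library's code: seeds are assumed to be tried consecutively from start_seed, closeness is assumed to be measured on the mean off-diagonal Pearson correlation, and sample_equicorrelated is a hypothetical one-factor generator standing in for the package's own data generation.

```python
import numpy as np


def sample_equicorrelated(n_samples, p, rho, seed):
    """Hypothetical generator: one shared factor plus independent noise (0 <= rho <= 1)."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n_samples, 1))
    noise = rng.standard_normal((n_samples, p))
    return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * noise


def mean_offdiag(X):
    """Mean of the upper-triangle Pearson correlations between columns of X."""
    C = np.corrcoef(X, rowvar=False)
    return float(C[np.triu_indices_from(C, k=1)].mean())


def best_seed(max_tries, n_samples, p, rho, start_seed=0):
    """Return (seed, deviation, value) for the candidate closest to rho."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    best = None
    for seed in range(start_seed, start_seed + max_tries):
        value = mean_offdiag(sample_equicorrelated(n_samples, p, rho, seed))
        deviation = abs(value - rho)
        if best is None or deviation < best[1]:
            best = (seed, deviation, value)
    return best


seed, deviation, value = best_seed(20, n_samples=200, p=4, rho=0.6)
print(seed, round(value, 4), round(deviation, 4))
```

Because every candidate is evaluated, the cost grows linearly with max_tries; the result is deterministic for a fixed start_seed.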

biomedical_data_generator.utils.correlation_tools.find_seed_for_correlation(n_samples, n_cluster_features, correlation, structure='equicorrelated', *, metric='mean_offdiag', tolerance=0.02, threshold=None, op='>=', start_seed=0, max_tries=200, return_best_on_fail=True, return_matrix=False, enforce_p_le_n_in_tolerance=True)

Find a random seed that yields a cluster with desired correlation properties.

Parameters:
  • n_samples (int) – Number of samples to generate.

  • n_cluster_features (int) – Number of features in the cluster.

  • correlation (float) – Target correlation value.

  • structure (Literal['equicorrelated', 'toeplitz']) – Correlation structure (“equicorrelated” or “toeplitz”).

  • metric (Literal['mean_offdiag', 'min_offdiag', 'max_offdiag', 'std_offdiag']) – Correlation metric to evaluate (“mean_offdiag”, “min_offdiag”, “max_offdiag”, “std_offdiag”).

  • tolerance (float | None) – Acceptable deviation from target (for “tolerance” mode).

  • threshold (float | None) – Metric threshold (for “threshold” mode).

  • op (Literal['>=', '<=']) – Operator for threshold comparison (“>=” or “<=”).

  • start_seed (int) – Seed to start searching from.

  • max_tries (int) – Maximum number of seeds to try.

  • return_best_on_fail (bool) – If True, return best found seed if none satisfy criterion.

  • return_matrix (bool) – If True, include correlation matrix in metadata.

  • enforce_p_le_n_in_tolerance (bool) – If True, enforce n_cluster_features <= n_samples (p <= n) in tolerance mode.

Returns:

Tuple of (seed, metadata dictionary).

Return type:

tuple[int, dict[str, Any]]
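The tolerance and threshold parameters describe two ways of accepting a candidate seed. The sketch below shows one plausible reading of that acceptance rule. It assumes that threshold mode is active whenever threshold is not None and that tolerance mode compares the absolute deviation from the target; the documentation above does not confirm how the mode is selected, and is_accepted is a hypothetical helper.

```python
import operator

# Map the documented op strings onto comparison functions.
_OPS = {">=": operator.ge, "<=": operator.le}


def is_accepted(value, target, *, tolerance=0.02, threshold=None, op=">="):
    """Hypothetical acceptance test for the metric value of one candidate seed."""
    if threshold is not None:
        # Assumed "threshold" mode: compare the metric directly with the threshold.
        return _OPS[op](value, threshold)
    if tolerance is None:
        raise ValueError("either tolerance or threshold must be set")
    # Assumed "tolerance" mode: the metric must lie close to the target.
    return abs(value - target) <= tolerance


near_target = is_accepted(0.61, 0.60)                       # tolerance mode
too_far = is_accepted(0.65, 0.60)                           # tolerance mode
above_floor = is_accepted(0.65, 0.60, threshold=0.60)       # threshold mode, ">="
below_cap = is_accepted(0.65, 0.60, threshold=0.60, op="<=")
print(near_target, too_far, above_floor, below_cap)
```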

biomedical_data_generator.utils.correlation_tools.get_cluster_column_names(df, meta, cluster_id, *, anchor_first=True, natural_sort_rest=True)

Get column names for a given cluster ID from DataFrame and metadata.

Parameters:
  • df (DataFrame) – DataFrame with all features.

  • meta (Any) – Metadata object with cluster information.

  • cluster_id (int) – ID of the cluster to extract.

  • anchor_first (bool) – If True, anchor feature is placed first (if available).

  • natural_sort_rest (bool) – If True, non-anchor features are sorted naturally.

Returns:

List of column names in the specified order.

Return type:

list[str]
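The effect of anchor_first and natural_sort_rest can be sketched with plain column names. This illustration works from names alone and assumes the 'corr<id>_anchor' / 'corr<id>_<k>' naming pattern; the real function also consults meta, which is not modelled here. The helpers natural_key and order_cluster_columns are hypothetical.

```python
import re


def natural_key(name):
    """Split digits from text so that 'corr1_2' sorts before 'corr1_10'."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


def order_cluster_columns(columns, cluster_id, *, prefix="corr",
                          anchor_first=True, natural_sort_rest=True):
    """Hypothetical helper: select and order the columns of one cluster."""
    stem = f"{prefix}{cluster_id}_"
    members = [c for c in columns if c.startswith(stem)]
    anchor = f"{stem}anchor"
    if anchor_first and anchor in members:
        rest = [c for c in members if c != anchor]
        if natural_sort_rest:
            rest = sorted(rest, key=natural_key)
        return [anchor] + rest
    return sorted(members, key=natural_key) if natural_sort_rest else members


columns = ["corr1_10", "noise_1", "corr1_2", "corr1_anchor", "corr2_anchor", "corr10_1"]
ordered = order_cluster_columns(columns, 1)
unsorted_rest = order_cluster_columns(columns, 1, natural_sort_rest=False)
print(ordered)
print(unsorted_rest)
```

Natural sorting matters once a cluster has ten or more features, since a plain lexicographic sort would place 'corr1_10' before 'corr1_2'.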

biomedical_data_generator.utils.correlation_tools.get_cluster_frame(df, meta, cluster_id, *, anchor_first=True, natural_sort_rest=True)

Get DataFrame slice for a given cluster ID.

Parameters:
  • df (DataFrame) – DataFrame with all features.

  • meta (Any) – Metadata object with cluster information.

  • cluster_id (int) – ID of the cluster to extract.

  • anchor_first (bool) – If True, anchor feature is placed first (if available).

  • natural_sort_rest (bool) – If True, non-anchor features are sorted naturally.

Returns:

DataFrame slice with columns for the specified cluster.

Return type:

DataFrame

biomedical_data_generator.utils.correlation_tools.parse_cluster_id(name, prefix_corr='corr')

Parse cluster ID from column name like ‘corr3_anchor’ or ‘corr2_5’.

Parameters:
  • name (str) – Column name string.

  • prefix_corr (str) – Prefix indicating correlated features.

Returns:

Cluster ID as integer, or None if not matching.

Return type:

int | None
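The parsing rule implied by the two example names can be sketched with a regular expression. This is an illustration only: it assumes the pattern is the prefix, followed by one or more digits, followed by an underscore, which fits 'corr3_anchor' and 'corr2_5' but may not match the library's exact rule. The name parse_cluster_id_sketch is hypothetical.

```python
import re


def parse_cluster_id_sketch(name, prefix_corr="corr"):
    """Hypothetical parser: return the integer after the prefix, or None."""
    match = re.match(rf"{re.escape(prefix_corr)}(\d+)_", name)
    return int(match.group(1)) if match else None


anchor_id = parse_cluster_id_sketch("corr3_anchor")
member_id = parse_cluster_id_sketch("corr2_5")
unrelated = parse_cluster_id_sketch("noise_1")
print(anchor_id, member_id, unrelated)
```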

biomedical_data_generator.utils.correlation_tools.pc1_share(X, *, method='pearson', rowvar=False)

Compute the share of variance explained by the first principal component from data X.

Parameters:
  • X (DataFrame | ndarray) – Data matrix (DataFrame or 2D array).

  • method (Literal['pearson', 'kendall', 'spearman']) – Correlation method (“pearson”, “kendall”, or “spearman”).

  • rowvar (bool) – If True, rows represent variables (features), otherwise columns do.

Returns:

Share of variance explained by the first principal component (float in [0, 1]).

Return type:

float

biomedical_data_generator.utils.correlation_tools.pc1_share_from_corr(C)

Compute the share of variance explained by the first principal component from a correlation matrix C.

Parameters:

C (ndarray) – Square correlation matrix of shape (p, p).

Returns:

Share of variance explained by the first principal component (float in [0, 1]).

Return type:

float
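For a correlation matrix, the PC1 share is the largest eigenvalue divided by the sum of all eigenvalues, which equals the trace p. The sketch below is a minimal NumPy illustration of that definition, not the library's code; pc1_share_sketch is a hypothetical name. It also checks the result against the closed form (1 + (p - 1) * rho) / p, which holds for an equicorrelated matrix with rho >= 0.

```python
import numpy as np


def pc1_share_sketch(C):
    """Hypothetical helper: largest eigenvalue of C divided by the eigenvalue sum."""
    C = np.asarray(C, dtype=float)
    eigvals = np.linalg.eigvalsh(C)  # symmetric input, eigenvalues in ascending order
    return float(eigvals[-1] / eigvals.sum())


p, rho = 5, 0.6
C = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)  # equicorrelated matrix

share = pc1_share_sketch(C)
closed_form = (1.0 + (p - 1) * rho) / p
print(share, closed_form)
```

With uncorrelated features (identity matrix) the share falls to 1 / p, and it approaches 1 as the correlation approaches 1.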

biomedical_data_generator.utils.correlation_tools.variance_partition_pc1(X, *, method='pearson', rowvar=False)

Compute variance partitioning summary based on PC1 share.

Parameters:
  • X (DataFrame | ndarray) – Data matrix (DataFrame or 2D array).

  • method (Literal['pearson', 'kendall', 'spearman']) – Correlation method (“pearson”, “kendall”, or “spearman”).

  • rowvar (bool) – If True, rows represent variables (features), otherwise columns do.

Returns:

Dictionary with keys:

  • n_features

  • pc1_evr

  • pc1_var_ratio