biomedical_data_generator.utils.correlation_tools
Correlation analysis and seed search utilities (no plotting).
This module provides functions to compute correlation metrics, assess correlation quality, search for random seeds that yield desired correlation properties, and slice DataFrames by cluster.
Functions
assess_correlation_quality – Assess how well the empirical correlation of X matches the target.
compute_correlation_matrix – Compute the correlation matrix from a DataFrame-like object.
compute_correlation_matrix_for_cluster – Compute the correlation matrix for a specific cluster in the DataFrame.
compute_correlation_metrics – Compute summary metrics from a correlation matrix.
find_best_seed_for_correlation – Find the seed that yields the closest empirical correlation to the target over max_tries seeds.
find_seed_for_correlation – Find a random seed that yields a cluster with desired correlation properties.
get_cluster_column_names – Get column names for a given cluster ID from DataFrame and metadata.
get_cluster_frame – Get DataFrame slice for a given cluster ID.
parse_cluster_id – Parse cluster ID from column name like 'corr3_anchor' or 'corr2_5'.
Compute the share of variance explained by the first principal component from data X.
Compute the share of variance explained by the first principal component from a correlation matrix C.
Compute variance partitioning summary based on PC1 share.
- biomedical_data_generator.utils.correlation_tools.assess_correlation_quality(X, correlation_target, *, tolerance=0.05, structure='equicorrelated')
Assess how well the empirical correlation of X matches the target.
- Parameters:
X (ndarray[tuple[Any, ...], dtype[float64]]) – Feature matrix of shape (n_samples, n_features).
correlation_target (float) – Target correlation value.
tolerance (float) – Acceptable deviation from target.
structure (Literal['equicorrelated', 'toeplitz']) – Correlation structure (“equicorrelated” or “toeplitz”).
- Returns:
Dictionary with keys mean_offdiag, std_offdiag, min_offdiag, max_offdiag, range_offdiag, n_offdiag, target, deviation_offdiag, within_tolerance, and structure.
- Return type:
dict
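To make the returned keys concrete, here is a minimal NumPy sketch of how such a summary could be computed for the equicorrelated case. It is not the library's implementation: the helper name offdiag_summary, the use of Pearson correlation, the population standard deviation, counting each feature pair once, and the definition of the deviation as |mean_offdiag - target| are all assumptions made for illustration.

```python
import numpy as np


def offdiag_summary(X, target, tolerance=0.05):
    """Summarise the off-diagonal Pearson correlations of X (hypothetical helper)."""
    C = np.corrcoef(X, rowvar=False)       # shape (n_features, n_features)
    iu = np.triu_indices_from(C, k=1)      # strict upper triangle: each pair once
    off = C[iu]
    mean = float(off.mean())
    deviation = abs(mean - target)         # assumed definition of the deviation
    return {
        "mean_offdiag": mean,
        "std_offdiag": float(off.std()),   # population std (ddof=0), an assumption
        "min_offdiag": float(off.min()),
        "max_offdiag": float(off.max()),
        "range_offdiag": float(off.max() - off.min()),
        "n_offdiag": int(off.size),
        "target": target,
        "deviation_offdiag": deviation,
        "within_tolerance": bool(deviation <= tolerance),
    }


# Toy data: one shared factor plus independent noise gives pairwise correlation ~0.6.
rng = np.random.default_rng(0)
shared = rng.standard_normal((2000, 1))
noise = rng.standard_normal((2000, 4))
X = np.sqrt(0.6) * shared + np.sqrt(0.4) * noise
summary = offdiag_summary(X, target=0.6)
```

With four features there are six distinct pairs, so n_offdiag is 6 under the pair-counting assumption above.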
- biomedical_data_generator.utils.correlation_tools.compute_correlation_matrix(df_like, *, method='spearman')
Compute the correlation matrix from a DataFrame-like object.
- Parameters:
df_like (DataFrame) – DataFrame-like object with features as columns.
method (Literal['pearson', 'kendall', 'spearman']) – Correlation method (“pearson”, “kendall”, or “spearman”).
- Returns:
Tuple of (correlation matrix as 2D NumPy array, list of column labels).
- Return type:
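As background for the default method='spearman': Spearman correlation is the Pearson correlation of the ranked data. The sketch below illustrates that idea with plain NumPy. The helper spearman_matrix is hypothetical, and the simple rank computation assumes there are no tied values; how the library itself computes the matrix is not shown in this reference.

```python
import numpy as np


def spearman_matrix(X):
    """Spearman correlation of the columns of X, assuming no tied values."""
    X = np.asarray(X, dtype=float)
    # argsort of argsort converts each column to 0-based ranks (ties are not averaged)
    ranks = np.argsort(np.argsort(X, axis=0), axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# exp(x) is a monotone increasing transform of x, -x is monotone decreasing
X = np.column_stack([x, np.exp(x), -x])
C = spearman_matrix(X)
```

Because ranks are unchanged by monotone transforms, the first two columns are perfectly rank-correlated even though their relationship is nonlinear.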
- biomedical_data_generator.utils.correlation_tools.compute_correlation_matrix_for_cluster(df, meta, cluster_id, *, method='spearman', anchor_first=True, natural_sort_rest=True)
Compute the correlation matrix for a specific cluster in the DataFrame.
- Parameters:
df (DataFrame) – DataFrame with all features.
meta (Any) – Metadata object with cluster information.
cluster_id (int) – ID of the cluster to extract.
method (Literal['pearson', 'kendall', 'spearman']) – Correlation method (“pearson”, “kendall”, or “spearman”).
anchor_first (bool) – If True, anchor feature is placed first (if available).
natural_sort_rest (bool) – If True, non-anchor features are sorted naturally.
- Returns:
Tuple of (correlation matrix as 2D NumPy array, list of column labels for the specified cluster).
- Return type:
- biomedical_data_generator.utils.correlation_tools.compute_correlation_metrics(corr_matrix)
Compute summary metrics from a correlation matrix.
- biomedical_data_generator.utils.correlation_tools.find_best_seed_for_correlation(max_tries, n_samples, n_cluster_features, correlation, structure='equicorrelated', *, start_seed=0)
Find the seed that yields the closest empirical correlation to the target over max_tries seeds.
- Parameters:
max_tries (int) – Number of random seeds to try.
n_samples (int) – Number of samples to generate.
n_cluster_features (int) – Number of features in the cluster.
correlation (float) – Target correlation value.
structure (Literal['equicorrelated', 'toeplitz']) – Correlation structure (“equicorrelated” or “toeplitz”).
start_seed (int) – Seed to start searching from.
- Returns:
Tuple of (best_seed, metrics dictionary).
- Return type:
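The search described above can be sketched as a simple loop over candidate seeds that keeps the one with the smallest deviation from the target. Everything here is illustrative rather than the library's code: sample_equicorrelated is a stand-in generator (a shared factor plus independent noise, valid only for 0 <= rho <= 1), the seeds are assumed to be scanned consecutively from start_seed, and closeness is assumed to be measured on the mean off-diagonal Pearson correlation.

```python
import numpy as np


def mean_offdiag(X):
    """Mean of the off-diagonal Pearson correlations of the columns of X."""
    C = np.corrcoef(X, rowvar=False)
    return float(C[np.triu_indices_from(C, k=1)].mean())


def sample_equicorrelated(n_samples, n_features, rho, seed):
    """Stand-in generator (assumption): shared factor plus independent noise."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n_samples, 1))
    noise = rng.standard_normal((n_samples, n_features))
    return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * noise


def best_seed(max_tries, n_samples, n_features, rho, start_seed=0):
    """Return (seed, deviation) for the seed whose sample is closest to rho."""
    best = None
    for seed in range(start_seed, start_seed + max_tries):
        X = sample_equicorrelated(n_samples, n_features, rho, seed)
        deviation = abs(mean_offdiag(X) - rho)
        if best is None or deviation < best[1]:
            best = (seed, deviation)
    return best


seed, deviation = best_seed(max_tries=20, n_samples=50, n_features=4, rho=0.7)
```

Small samples make the empirical correlation noisy, which is why scanning several seeds can noticeably reduce the deviation from the target.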
- biomedical_data_generator.utils.correlation_tools.find_seed_for_correlation(n_samples, n_cluster_features, correlation, structure='equicorrelated', *, metric='mean_offdiag', tolerance=0.02, threshold=None, op='>=', start_seed=0, max_tries=200, return_best_on_fail=True, return_matrix=False, enforce_p_le_n_in_tolerance=True)
Find a random seed that yields a cluster with desired correlation properties.
- Parameters:
n_samples (int) – Number of samples to generate.
n_cluster_features (int) – Number of features in the cluster.
correlation (float) – Target correlation value.
structure (Literal['equicorrelated', 'toeplitz']) – Correlation structure (“equicorrelated” or “toeplitz”).
metric (Literal['mean_offdiag', 'min_offdiag', 'max_offdiag', 'std_offdiag']) – Correlation metric to evaluate (“mean_offdiag”, “min_offdiag”, “max_offdiag”, “std_offdiag”).
tolerance (float | None) – Acceptable deviation from target (for “tolerance” mode).
threshold (float | None) – Metric threshold (for “threshold” mode).
op (Literal['>=', '<=']) – Operator for threshold comparison (“>=” or “<=”).
start_seed (int) – Seed to start searching from.
max_tries (int) – Maximum number of seeds to try.
return_best_on_fail (bool) – If True, return best found seed if none satisfy criterion.
return_matrix (bool) – If True, include correlation matrix in metadata.
enforce_p_le_n_in_tolerance (bool) – If True, enforce n_cluster_features <= n_samples in tolerance mode.
- Returns:
Tuple of (seed, metadata dictionary).
- Return type:
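The two acceptance modes mentioned in the parameter descriptions can be sketched as a single predicate. This is a guess at the logic, not the library's code: the helper name accepts is hypothetical, and the rule that a non-None threshold selects threshold mode (otherwise tolerance mode applies) is an assumption.

```python
import operator

# Map the documented operator strings to comparison functions.
_OPS = {">=": operator.ge, "<=": operator.le}


def accepts(value, *, target, tolerance=0.02, threshold=None, op=">="):
    """Return True if a metric value satisfies the search criterion (sketch)."""
    if threshold is not None:
        # Threshold mode (assumed to apply whenever a threshold is given)
        return _OPS[op](value, threshold)
    # Tolerance mode: the metric must lie within +/- tolerance of the target
    return abs(value - target) <= tolerance


in_tolerance = accepts(0.71, target=0.7)
out_of_tolerance = accepts(0.75, target=0.7)
above_threshold = accepts(0.75, target=0.7, threshold=0.72)
below_threshold = accepts(0.75, target=0.7, threshold=0.72, op="<=")
```

A seed search would evaluate the chosen metric for each candidate seed and stop at the first value for which such a predicate holds.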
- biomedical_data_generator.utils.correlation_tools.get_cluster_column_names(df, meta, cluster_id, *, anchor_first=True, natural_sort_rest=True)
Get column names for a given cluster ID from DataFrame and metadata.
- Parameters:
df (DataFrame) – DataFrame with all features.
meta (Any) – Metadata object with cluster information.
cluster_id (int) – ID of the cluster to extract.
anchor_first (bool) – If True, anchor feature is placed first (if available).
natural_sort_rest (bool) – If True, non-anchor features are sorted naturally.
- Returns:
List of column names in the specified order.
- Return type:
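The ordering controlled by anchor_first and natural_sort_rest can be illustrated with a small standard-library sketch. The helper names and the way the anchor is passed in are hypothetical; in the library the anchor is presumably looked up in the metadata object. Natural sorting here means that embedded numbers are compared numerically, so 'corr1_2' comes before 'corr1_10'.

```python
import re


def natural_key(name):
    """Split digits from text so that 'corr1_10' sorts after 'corr1_2'."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


def order_cluster_columns(columns, anchor=None, anchor_first=True, natural_sort_rest=True):
    """Order cluster columns: optional anchor first, remaining columns naturally sorted."""
    columns = list(columns)
    lead = [anchor] if (anchor_first and anchor in columns) else []
    rest = [c for c in columns if c not in lead]
    if natural_sort_rest:
        rest.sort(key=natural_key)
    return lead + rest


cols = ["corr1_10", "corr1_anchor", "corr1_2"]
ordered = order_cluster_columns(cols, anchor="corr1_anchor")
plain_sorted = sorted(cols)  # lexicographic order, for comparison
```

A plain lexicographic sort would place 'corr1_10' before 'corr1_2', which is the behaviour natural sorting avoids.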
- biomedical_data_generator.utils.correlation_tools.get_cluster_frame(df, meta, cluster_id, *, anchor_first=True, natural_sort_rest=True)
Get DataFrame slice for a given cluster ID.
- Parameters:
df (DataFrame) – DataFrame with all features.
meta (Any) – Metadata object with cluster information.
cluster_id (int) – ID of the cluster to extract.
anchor_first (bool) – If True, anchor feature is placed first (if available).
natural_sort_rest (bool) – If True, non-anchor features are sorted naturally.
- Returns:
DataFrame slice with columns for the specified cluster.
- Return type:
DataFrame
- biomedical_data_generator.utils.correlation_tools.parse_cluster_id(name, prefix_corr='corr')
Parse cluster ID from column name like ‘corr3_anchor’ or ‘corr2_5’.
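A regular expression is one straightforward way to extract the cluster ID from names of this shape. The sketch below is an assumption about the behaviour, not the library's code: it takes the integer directly after the prefix as the cluster ID, and returning None for non-matching names is a choice made for this example.

```python
import re


def parse_cluster_id_sketch(name, prefix_corr="corr"):
    """Return the integer following the prefix, or None if the name does not match."""
    match = re.match(rf"^{re.escape(prefix_corr)}(\d+)_", name)
    return int(match.group(1)) if match else None


anchor_id = parse_cluster_id_sketch("corr3_anchor")
member_id = parse_cluster_id_sketch("corr2_5")
other_id = parse_cluster_id_sketch("noise_1")
```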
Compute the share of variance explained by the first principal component from data X.
- Parameters:
- Returns:
Share of variance explained by the first principal component (float in [0, 1]).
- Return type:
Compute the share of variance explained by the first principal component from a correlation matrix C.
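For a correlation matrix, the share of variance explained by the first principal component is the largest eigenvalue divided by the sum of all eigenvalues (which equals the number of features). The following NumPy sketch shows that calculation. The helper name pc1_share is hypothetical, and the library's own function may differ in its details.

```python
import numpy as np


def pc1_share(C):
    """Largest eigenvalue of a symmetric matrix C divided by the sum of eigenvalues."""
    eigvals = np.linalg.eigvalsh(np.asarray(C, dtype=float))  # ascending order
    return float(eigvals[-1] / eigvals.sum())


# Equicorrelated matrix with p = 4 features and pairwise correlation rho = 0.5
p, rho = 4, 0.5
C = np.full((p, p), rho)
np.fill_diagonal(C, 1.0)
share = pc1_share(C)
```

For an equicorrelated matrix with nonnegative rho, the largest eigenvalue is 1 + (p - 1) * rho, so the share here is 2.5 / 4 = 0.625. For uncorrelated features (the identity matrix) the share drops to 1 / p.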