biomedical_data_generator.features.correlated

Generation of correlated feature clusters simulating pathway-like modules.

Overview

This module generates correlated Gaussian feature clusters that can be interpreted as simplified “pathway-like” modules (e.g., sets of co-expressed genes or co-regulated proteins).

Each cluster is defined by:

A correlation structure (equicorrelated or Toeplitz/AR(1)).
A correlation strength parameter, correlation.
Optionally, class-specific correlation strengths to mimic activation in specific biological conditions (e.g., tumors vs. controls).
An anchor feature with class-specific mean shifts representing diagnostic strength (e.g., biomarker concentration changes).

The resulting clusters are concatenated horizontally.

Statistical model

At the core, each cluster implements a multivariate Gaussian model:

For a given cluster with p features (n_features) and a class-specific correlation matrix \(\Sigma_c\), samples are generated according to

\[x \sim \mathcal{N}_p(\mu_c, \Sigma_c),\]

where \(\mu_c\) and \(\Sigma_c\) depend on class \(c\).
Two correlation structures are supported:
- Equicorrelated: All off-diagonal entries are equal to the correlation parameter:
  
  \[\begin{split}\Sigma_{ij} = \begin{cases} 1 & i = j, \\ \rho & i \neq j. \end{cases}\end{split}\]
  
  where \(\rho\) is the correlation parameter.
- Toeplitz / AR(1): Correlation decays with distance:
  
  \[\Sigma_{ij} = \rho^{\lvert i - j \rvert}.\]
  
  where \(\rho\) is the correlation parameter.
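
The two structures above can be written down directly with NumPy. The following is a minimal sketch, not the module's implementation; the helper names equicorrelated_matrix and toeplitz_ar1_matrix are hypothetical and exist only for illustration.

```python
import numpy as np


def equicorrelated_matrix(p: int, rho: float) -> np.ndarray:
    """Illustrative helper: 1 on the diagonal, rho everywhere else."""
    sigma = np.full((p, p), rho, dtype=float)
    np.fill_diagonal(sigma, 1.0)
    return sigma


def toeplitz_ar1_matrix(p: int, rho: float) -> np.ndarray:
    """Illustrative helper: entry (i, j) equals rho ** |i - j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])


sigma_eq = equicorrelated_matrix(4, 0.8)  # every feature pair correlates at 0.8
sigma_ar = toeplitz_ar1_matrix(4, 0.8)    # correlation decays with index distance
```

In the equicorrelated case every feature is an equally good proxy for every other one, whereas in the Toeplitz case features far apart in index order are only weakly related.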

Anchor effects (mean shifts)

When anchor_role="informative" and anchor_effect_size is specified, the anchor feature receives a class-specific mean shift:

\[\mu_{\mathrm{anchor},\, c} = \mathrm{anchor\_effect\_size} \cdot \mathbb{1}_{c = \mathrm{anchor\_class}}.\]

Proxy features inherit this shift through correlation but with attenuated magnitude proportional to their correlation with the anchor.
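
As a rough illustration of this attenuation, the sketch below shifts the target-class samples of one cluster, scaling each feature's shift by its empirical correlation with the anchor. It rests on several assumptions that are not taken from the module: the anchor is column 0, the effect size is a plain float, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, p, rho = 200, 4, 0.8
effect_size, anchor_class = 1.0, 1  # hypothetical numeric values

# Zero-mean equicorrelated cluster; column 0 plays the anchor (assumption).
sigma = np.full((p, p), rho)
np.fill_diagonal(sigma, 1.0)
x = rng.multivariate_normal(np.zeros(p), sigma, size=n_samples)
y = np.repeat([0, 1], n_samples // 2)
x_before = x.copy()

# Per-feature weight: 1 for the anchor itself, empirical correlation for proxies.
weights = np.corrcoef(x, rowvar=False)[0]

# Shift only the samples that belong to the target class.
mask = y == anchor_class
x[mask] += effect_size * weights
```

The anchor column receives the full shift because its correlation with itself is 1, while each proxy column moves by roughly rho times the effect size.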

Configuration semantics (enforced by CorrClusterConfig validation):

anchor_role="noise" → no mean shift (anchor_effect_size is ignored if present)
anchor_role="informative" → MUST have anchor_effect_size > 0

Limitations and biological realism

See module docstring for detailed discussion of simplifications. Key points:

Gaussian marginals (real data is often skewed, zero-inflated)
Linear dependence only (no thresholds, saturation)
Independent clusters (no pathway crosstalk)
Blockwise effects (partial activation not modeled)
No sample-level heterogeneity (no subtypes)

Intended use

The generator is realistic enough for teaching and benchmarking, but it is not a fully realistic generative model for complex omics data.

Functions

`apply_anchor_effects`(x, y, cluster_configs)	Apply class-specific mean shifts to anchor features.
`build_correlation_matrix`(n_features, correlation)	Build a correlation matrix with specified structure.
`sample_all_correlated_clusters`(cfg[, y, rng])	Generate and assemble all correlated feature clusters for a dataset.
`sample_correlated_data`(n_samples, ...[, ...])	Sample correlated Gaussian data with zero mean and unit variance.

biomedical_data_generator.features.correlated.apply_anchor_effects(x, y, cluster_configs)[source]

Apply class-specific mean shifts to anchor features.

This function modifies the data matrix in-place by adding mean shifts to anchor features based on their configured effect sizes and target classes.

The anchor feature (typically the first feature in each cluster) receives the full effect size, while correlated proxy features receive attenuated shifts proportional to their empirical correlation with the anchor.

Effect application logic:

anchor_role="noise" → no shift (anchor_effect_size is ignored)
anchor_role="informative" with anchor_effect_size > 0 → shift is applied
CorrClusterConfig validation guarantees that informative anchors always have anchor_effect_size set (not None).

Parameters:

x (ndarray) – Feature matrix of shape (n_samples, n_features). Modified in-place.
y (ndarray) – Class labels of shape (n_samples,).
cluster_configs (list[CorrClusterConfig]) – List of cluster configurations with anchor metadata.

Returns:

The modified feature matrix (same object as input x).

Return type:

ndarray

biomedical_data_generator.features.correlated.build_correlation_matrix(n_features, correlation, structure='equicorrelated')[source]

Build a correlation matrix with specified structure.

Parameters:

n_features (int) – Number of features (matrix dimension).
correlation (float) – Correlation parameter.
structure (str) – Either 'equicorrelated' or 'toeplitz'.

Returns:

Correlation matrix of shape (n_features, n_features).

Raises:

ValueError – If structure is unknown or correlation is out of bounds.

Return type:

ndarray
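
The exact bounds that trigger the ValueError are not stated here. As general background, a p × p equicorrelated matrix is positive definite only for -1/(p-1) < ρ < 1, and the AR(1) Toeplitz matrix only for |ρ| < 1. The sketch below checks the equicorrelated case numerically through the smallest eigenvalue; min_eigenvalue is a hypothetical helper and not part of this module.

```python
import numpy as np

p = 5
lower_bound = -1.0 / (p - 1)  # equicorrelated matrices need rho > -1/(p-1)


def min_eigenvalue(rho: float) -> float:
    """Smallest eigenvalue of the p x p equicorrelated matrix."""
    sigma = np.full((p, p), rho, dtype=float)
    np.fill_diagonal(sigma, 1.0)
    return float(np.linalg.eigvalsh(sigma).min())


valid = min_eigenvalue(0.8)      # eigenvalues are 1 - rho and 1 + (p - 1) * rho
invalid = min_eigenvalue(-0.5)   # below the lower bound, so not positive definite
```

A negative smallest eigenvalue means no Gaussian distribution has that matrix as its correlation matrix, so rejecting such inputs up front is the natural behavior.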

biomedical_data_generator.features.correlated.sample_all_correlated_clusters(cfg, y=None, rng=None)[source]

Generate and assemble all correlated feature clusters for a dataset.

This function connects the abstract configuration with the actual data matrix and cluster-level metadata. It supports both global and class-specific correlation modes, and automatically applies anchor effects based on cluster configuration.

Anchor effect application:

Anchor effects are applied automatically based on cluster configuration:

- If anchor_role="noise" → no mean shift
- If anchor_role="informative" → mean shift applied (anchor_effect_size is validated to be not None)

No separate parameter is needed because the semantics are enforced by CorrClusterConfig validation.

Parameters:

cfg (DatasetConfig) – Dataset configuration with corr_clusters field.
y (ndarray | None) – Class labels as a 1D NumPy array of length n_samples. If None, generates labels from cfg.class_configs in sequential order.
rng (Generator | None) – Optional random number generator. If None, creates a new one.

Returns:

A tuple (x_clusters, cluster_meta) where

x_clusters: Array of shape (n_samples, n_corr_features) with correlated clusters including anchor effects where configured.
cluster_meta: Dictionary with cluster-level metadata:
- "anchor_role": cluster_id -> anchor_role
- "anchor_effect_size": cluster_id -> effect_size
- "anchor_class": cluster_id -> target_class
- "label": cluster_id -> human-readable label

Return type:

tuple[ndarray, dict]

Examples

>>> import numpy as np
>>> rng = np.random.default_rng(42)  # seeded generator reused below
>>> # Pure correlation (noise anchor, no mean shift)
>>> cfg = DatasetConfig(
...     class_configs=[ClassConfig(50), ClassConfig(50)],
...     corr_clusters=[
...         CorrClusterConfig(
...             n_cluster_features=5,
...             correlation=0.8,
...             anchor_role="noise"  # No shift
...         )
...     ]
... )
>>> x, meta = sample_all_correlated_clusters(cfg, rng=rng)  # y auto-generated

>>> # Correlation + diagnostic signal (informative anchor with shift)
>>> cfg = DatasetConfig(
...     class_configs=[ClassConfig(50), ClassConfig(50)],
...     corr_clusters=[
...         CorrClusterConfig(
...             n_cluster_features=5,
...             correlation=0.8,
...             anchor_role="informative",
...             anchor_effect_size="medium",  # Required for informative
...             anchor_class=1
...         )
...     ]
... )
>>> x, meta = sample_all_correlated_clusters(cfg, rng=rng) # y auto-generated

>>> # Advanced: provide custom labels
>>> y_custom = np.array([...])
>>> x, meta = sample_all_correlated_clusters(cfg, y=y_custom, rng=rng)
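
The class-specific correlation mode mentioned above is not demonstrated in these examples. The sketch below shows one way such a mode could work in plain NumPy, by drawing each class's rows from its own equicorrelated matrix. The mapping rho_by_class and all other names are hypothetical and do not reflect the configuration fields of this package.

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 500)
p = 4
rho_by_class = {0: 0.1, 1: 0.8}  # hypothetical: module "activated" in class 1

x = np.empty((y.size, p))
for label, rho in rho_by_class.items():
    # Build the class-specific equicorrelated matrix.
    sigma = np.full((p, p), rho)
    np.fill_diagonal(sigma, 1.0)
    # Fill only the rows that belong to this class.
    rows = np.flatnonzero(y == label)
    x[rows] = rng.multivariate_normal(np.zeros(p), sigma, size=rows.size)

corr_class0 = np.corrcoef(x[y == 0], rowvar=False)[0, 1]
corr_class1 = np.corrcoef(x[y == 1], rowvar=False)[0, 1]
```

With this setup the features are nearly uncorrelated in class 0 and strongly correlated in class 1, even though every marginal mean stays at zero.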

biomedical_data_generator.features.correlated.sample_correlated_data(n_samples, n_features, correlation, *, structure='equicorrelated', rng=None)[source]

Sample correlated Gaussian data with zero mean and unit variance.

This function generates the Gaussian core for correlated feature clusters.

Parameters:

n_samples (int) – Number of samples to generate.
n_features (int) – Number of features.
correlation (float) – Correlation parameter.
structure (str) – Correlation structure ('equicorrelated' or 'toeplitz').
rng (Generator | None) – Random number generator. If None, creates a new one.

Returns:

Array of shape (n_samples, n_features) with standard normal marginals and specified correlation structure.

Raises:

ValueError – If structure is invalid or correlation is out of bounds.

Return type:

ndarray
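
One common way to produce such a Gaussian core is to multiply independent standard normal draws by a Cholesky factor of the correlation matrix. Whether this module uses Cholesky or another factorization is not stated here, so the following is only a sketch under that assumption; sample_gaussian_core is a hypothetical stand-in, not the function documented above.

```python
import numpy as np


def sample_gaussian_core(n_samples, sigma, rng):
    """Illustrative stand-in: draw rows from N(0, sigma) via Cholesky."""
    chol = np.linalg.cholesky(sigma)  # sigma = chol @ chol.T
    z = rng.standard_normal((n_samples, sigma.shape[0]))
    return z @ chol.T  # each row now has covariance sigma


rng = np.random.default_rng(42)
idx = np.arange(3)
sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])  # Toeplitz / AR(1), rho = 0.5
x = sample_gaussian_core(5000, sigma, rng)
empirical = np.corrcoef(x, rowvar=False)
```

Because the diagonal of sigma is 1, each column keeps unit variance, and the empirical correlation matrix approaches sigma as n_samples grows.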