biomedical_data_generator.utils.sklearn_compat

Sklearn-like convenience wrapper around biomedical-data-generator.

This module provides a single entry point, make_biomedical_dataset(), that mimics sklearn.datasets.make_classification() while mapping cleanly onto the new DatasetConfig / generate_dataset() API of biomedical_data_generator.

The goals are:

  • Familiar, scikit-learn-style signature for quick experimentation.

  • A thin translation layer to DatasetConfig, so that users can “graduate” to the full configuration model once they need more control.

  • NumPy / pandas outputs that plug directly into scikit-learn pipelines.
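
For orientation, this is the familiar scikit-learn call whose signature the wrapper mirrors (the sketch below uses scikit-learn itself, not this package; the same keyword names carry over to make_biomedical_dataset()):

```python
from sklearn.datasets import make_classification

# The scikit-learn signature that make_biomedical_dataset() mimics; the
# keyword names (n_samples, n_features, n_informative, n_redundant,
# n_classes, class_sep, weights, random_state) are shared with the wrapper.
X, y = make_classification(
    n_samples=30,
    n_features=200,
    n_informative=5,
    n_redundant=0,
    n_classes=2,
    class_sep=1.2,
    random_state=42,
)
print(X.shape, y.shape)  # (30, 200) (30,)
```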

Functions

make_biomedical_dataset([n_samples, ...])

Sklearn-like convenience wrapper around the biomedical-data-generator.

biomedical_data_generator.utils.sklearn_compat.make_biomedical_dataset(n_samples=30, n_features=200, n_informative=5, n_redundant=0, n_classes=2, class_sep=1.2, weights=None, random_state=42, n_noise=0, noise_distribution='normal', noise_distribution_params=None, batch_effect=False, n_batches=1, batch_effect_strength=0.5, confounding_with_class=0.0, return_meta=False, return_pandas=False, **kwargs)[source]

Sklearn-like convenience wrapper around the biomedical-data-generator.

Parameters broadly mirror sklearn.datasets.make_classification() where sensible, but are translated to the new DatasetConfig / generate_dataset() design.

Redundant features

n_redundant is implemented via a single correlated feature cluster:

  • One informative anchor (shared signal)

  • n_redundant proxy features that are equicorrelated with the anchor at a high correlation

In terms of DatasetConfig, this means:

n_features = n_informative + n_noise + proxies_from_clusters

and the proxies contributed by this wrapper are exactly n_redundant.
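
In numbers, with the defaults plus n_redundant=3, the bookkeeping works out as follows (plain arithmetic for illustration, not package code):

```python
# Feature bookkeeping for n_features=200, n_informative=5, n_redundant=3.
# The wrapper's single correlated cluster contributes exactly n_redundant
# proxy features.
n_informative = 5
n_redundant = 3
proxies_from_clusters = n_redundant
n_noise = 200 - n_informative - n_redundant  # inferred when n_noise == 0

n_features = n_informative + n_noise + proxies_from_clusters
print(n_noise, n_features)  # 192 200
```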

Notes:

  • n_features must equal n_informative + n_redundant + n_noise in this wrapper (no repeated features). If n_noise == 0, it is inferred as n_features - n_informative - n_redundant.

  • If you pass corr_clusters explicitly via **kwargs, then n_redundant must be 0; you are responsible for defining the cluster layout yourself in that advanced mode.

By default the function returns (X, y) using NumPy arrays for broad compatibility with scikit-learn. Set return_pandas=True to obtain a DataFrame and Series instead. Set return_meta=True to additionally return the DatasetMeta object.
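
The effect of return_pandas=True amounts to a conversion of this kind (sketch with stand-in arrays; the column and name choices here are assumptions, not the package's actual naming scheme):

```python
import numpy as np
import pandas as pd

# Stand-in arrays with the wrapper's default shape (n_samples=30, n_features=200).
X, y = np.zeros((30, 200)), np.zeros(30, dtype=int)

# return_pandas=False (default) yields NumPy arrays, ready for scikit-learn.
# return_pandas=True yields a DataFrame and Series instead; the feature and
# target names below are illustrative only.
X_df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
y_sr = pd.Series(y, name="target")
print(X_df.shape, y_sr.shape)  # (30, 200) (30,)
```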

Returns:

(X, y) or (X, y, meta)

Depending on return_meta. X is a NumPy array or pandas DataFrame; y is a NumPy array or pandas Series.

Parameters:
  • n_samples (int)

  • n_features (int)

  • n_informative (int)

  • n_redundant (int)

  • n_classes (int)

  • class_sep (float)

  • weights (tuple[float, ...] | None)

  • random_state (int | None)

  • n_noise (int)

  • noise_distribution (str)

  • noise_distribution_params (dict[str, Any] | None)

  • batch_effect (bool)

  • n_batches (int)

  • batch_effect_strength (float)

  • confounding_with_class (float)

  • return_meta (bool)

  • return_pandas (bool)

  • kwargs (Any)

Return type:

tuple[Any, Any] | tuple[Any, Any, object]