biomedical_data_generator.utils.sklearn_compat

Sklearn-like convenience wrapper around biomedical-data-generator.

This module provides a single entry point, make_biomedical_dataset(), that mimics sklearn.datasets.make_classification() while mapping cleanly onto the new DatasetConfig / generate_dataset() API of biomedical_data_generator.

The goals are:

  • Familiar, scikit-learn-style signature for quick experimentation.

  • A thin translation layer to DatasetConfig, so that users can “graduate” to the full configuration model once they need more control.

  • NumPy / pandas outputs that plug directly into scikit-learn pipelines.
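
For orientation, this is the familiar scikit-learn call whose signature the wrapper mirrors (the sketch below uses scikit-learn itself, not this package; the same keyword names carry over to make_biomedical_dataset()):

```python
from sklearn.datasets import make_classification

# The scikit-learn signature that make_biomedical_dataset() mimics; the
# keyword names (n_samples, n_features, n_informative, n_redundant,
# n_classes, class_sep, weights, random_state) are shared with the wrapper.
X, y = make_classification(
    n_samples=30,
    n_features=200,
    n_informative=5,
    n_redundant=0,
    n_classes=2,
    class_sep=1.2,
    random_state=42,
)
print(X.shape, y.shape)  # (30, 200) (30,)
```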

Functions

make_biomedical_dataset([n_samples, ...])

Sklearn-like convenience wrapper around the biomedical-data-generator.

biomedical_data_generator.utils.sklearn_compat.make_biomedical_dataset(n_samples=30, n_features=200, n_informative=5, n_redundant=0, n_classes=2, class_sep=1.2, weights=None, random_state=42, n_noise=0, noise_distribution='normal', noise_distribution_params=None, batch_effect=False, n_batches=1, batch_effect_strength=0.5, confounding_with_class=0.0, return_meta=False, return_pandas=False, **kwargs)[source]

Sklearn-like convenience wrapper around the biomedical-data-generator.

Parameters broadly mirror sklearn.datasets.make_classification() where sensible, but are translated to the new DatasetConfig / generate_dataset() design.

Redundant features

n_redundant is implemented via a single correlated feature cluster:

  • One informative anchor (shared signal)

  • n_redundant proxy features that are equicorrelated with the anchor at a high correlation

In terms of DatasetConfig, this means:

n_features = n_informative + n_noise + proxies_from_clusters

and the proxies contributed by this wrapper are exactly n_redundant.
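
In numbers, with the defaults plus n_redundant=3, the bookkeeping works out as follows (plain arithmetic for illustration, not package code):

```python
# Feature bookkeeping for n_features=200, n_informative=5, n_redundant=3.
# The wrapper's single correlated cluster contributes exactly n_redundant
# proxy features.
n_informative = 5
n_redundant = 3
proxies_from_clusters = n_redundant
n_noise = 200 - n_informative - n_redundant  # inferred when n_noise == 0

n_features = n_informative + n_noise + proxies_from_clusters
print(n_noise, n_features)  # 192 200
```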

Notes:

  • n_features must equal n_informative + n_redundant + n_noise in this wrapper (no repeated features). If n_noise == 0, it is inferred as n_features - n_informative - n_redundant.

  • If you pass corr_clusters explicitly via **kwargs, then n_redundant must be 0; you are responsible for defining the cluster layout yourself in that advanced mode.

By default the function returns (X, y) using NumPy arrays for broad compatibility with scikit-learn. Set return_pandas=True to obtain a DataFrame and Series instead. Set return_meta=True to additionally return the DatasetMeta object.
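
The effect of return_pandas=True amounts to a conversion of this kind (sketch with stand-in arrays; the column and name choices here are assumptions, not the package's actual naming scheme):

```python
import numpy as np
import pandas as pd

# Stand-in arrays with the wrapper's default shape (n_samples=30, n_features=200).
X, y = np.zeros((30, 200)), np.zeros(30, dtype=int)

# return_pandas=False (default) yields NumPy arrays, ready for scikit-learn.
# return_pandas=True yields a DataFrame and Series instead; the feature and
# target names below are illustrative only.
X_df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
y_sr = pd.Series(y, name="target")
print(X_df.shape, y_sr.shape)  # (30, 200) (30,)
```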

Returns:

(X, y) or (X, y, meta)

Depending on return_meta. X is a NumPy array or pandas DataFrame; y is a NumPy array or pandas Series.

Parameters:
  • n_samples (int)

  • n_features (int)

  • n_informative (int)

  • n_redundant (int)

  • n_classes (int)

  • class_sep (float)

  • weights (tuple[float, ...] | None)

  • random_state (int | None)

  • n_noise (int)

  • noise_distribution (str)

  • noise_distribution_params (dict[str, Any] | None)

  • batch_effect (bool)

  • n_batches (int)

  • batch_effect_strength (float)

  • confounding_with_class (float)

  • return_meta (bool)

  • return_pandas (bool)

  • kwargs (Any)

Return type:

tuple[Any, Any] | tuple[Any, Any, object]