biomedical_data_generator.utils.sklearn_compat
Sklearn-like convenience wrapper around biomedical-data-generator.
This module provides a single entry point make_biomedical_dataset()
that mimics sklearn.datasets.make_classification() while mapping
cleanly to the new DatasetConfig / generate_dataset()
API of biomedical_data_generator.
The goals are:
- A familiar, scikit-learn-style signature for quick experimentation.
- A thin translation layer to DatasetConfig, so that users can “graduate” to the full configuration model once they need more control.
- NumPy / pandas outputs that plug directly into scikit-learn pipelines.
Functions
make_biomedical_dataset(): Sklearn-like convenience wrapper around the biomedical-data-generator.
- biomedical_data_generator.utils.sklearn_compat.make_biomedical_dataset(n_samples=30, n_features=200, n_informative=5, n_redundant=0, n_classes=2, class_sep=1.2, weights=None, random_state=42, n_noise=0, noise_distribution='normal', noise_distribution_params=None, batch_effect=False, n_batches=1, batch_effect_strength=0.5, confounding_with_class=0.0, return_meta=False, return_pandas=False, **kwargs)
Sklearn-like convenience wrapper around the biomedical-data-generator.
Parameters broadly mirror sklearn.datasets.make_classification() where sensible, but are translated to the new DatasetConfig / generate_dataset() design.
Redundant features
n_redundant is implemented via a single correlated feature cluster:
- One informative anchor (shared signal)
- n_redundant proxy features that are strongly correlated (equicorrelated with a high correlation)
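The anchor-plus-proxies layout above can be illustrated with a small, library-independent sketch. It is not the generator's actual implementation: the function names, the shared-factor construction, and the correlation value of 0.9 are all assumptions chosen for illustration, and only the standard library is used so the snippet runs without extra dependencies.

```python
import math
import random


def equicorrelated_cluster(n_samples, n_redundant, rho=0.9, seed=42):
    """Illustrative only: return columns where index 0 is the anchor
    and the remaining n_redundant columns are its proxies.

    Every column is sqrt(rho) * shared + sqrt(1 - rho) * own_noise, so any
    two columns have population correlation rho (an equicorrelated block).
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must be in [0, 1)")
    rng = random.Random(seed)
    shared = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]  # latent shared signal
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    columns = []
    for _ in range(1 + n_redundant):  # one anchor plus n_redundant proxies
        columns.append([a * s + b * rng.gauss(0.0, 1.0) for s in shared])
    return columns


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    vx = sum((xi - mx) ** 2 for xi in x)
    vy = sum((yi - my) ** 2 for yi in y)
    return cov / math.sqrt(vx * vy)


cluster = equicorrelated_cluster(n_samples=5000, n_redundant=3, rho=0.9)
anchor, proxies = cluster[0], cluster[1:]
print(len(proxies))                # number of proxy columns
print(pearson(anchor, proxies[0]))  # sample correlation, close to rho
```

In this sketch the anchor and every proxy share one latent signal, which is what makes all pairwise correlations equal; the wrapper's real correlation strength is not stated here and may differ.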
In terms of DatasetConfig, this means:
n_features = n_informative + n_noise + proxies_from_clusters
and the proxies contributed by this wrapper are exactly n_redundant.
Notes:
- n_features must equal n_informative + n_redundant + n_noise in this wrapper (no repeated features). If n_noise == 0, it is inferred as n_features - n_informative - n_redundant.
- If you pass corr_clusters explicitly via **kwargs, then n_redundant must be 0; you are responsible for defining the cluster layout yourself in that advanced mode.
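The feature-count rule in the first note can be checked by hand before calling the wrapper. The helper below is a hypothetical sketch that mirrors the documented arithmetic; it is not part of the library, and raising ValueError on a mismatch is an assumption about how such a check might behave. It does not cover the advanced corr_clusters mode.

```python
def resolve_n_noise(n_features, n_informative, n_redundant=0, n_noise=0):
    """Hypothetical helper (not a library function).

    Mirrors the documented rule:
      n_features == n_informative + n_redundant + n_noise
    and infers n_noise when it is left at 0.
    """
    if n_noise == 0:
        # Documented inference: remaining columns are treated as noise.
        n_noise = n_features - n_informative - n_redundant
    if n_noise < 0 or n_informative + n_redundant + n_noise != n_features:
        # Assumption: a mismatch is reported as a ValueError.
        raise ValueError(
            "n_features must equal n_informative + n_redundant + n_noise"
        )
    return n_noise


# With the wrapper's default sizes (n_features=200, n_informative=5):
print(resolve_n_noise(200, 5))  # 195

# An explicit, consistent split is accepted unchanged:
print(resolve_n_noise(20, 5, n_redundant=3, n_noise=12))  # 12
```

With the defaults from the signature, all 195 columns that are neither informative nor redundant would be counted as noise features.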
By default the function returns (X, y) using NumPy arrays for broad compatibility with scikit-learn. Set return_pandas=True to obtain a DataFrame and Series instead. Set return_meta=True to additionally return the DatasetMeta object.
Returns:
- (X, y) or (X, y, meta)
Depending on return_meta. X is a NumPy array or pandas DataFrame; y is a NumPy array or pandas Series.
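Because the length of the returned tuple depends on return_meta, calling code has to unpack two or three values accordingly. The sketch below uses a hypothetical stand-in function, pack_result, with plain lists and a dict in place of the real arrays and the DatasetMeta object; it only illustrates the return shape, not the library's internals.

```python
def pack_result(X, y, meta, return_meta=False):
    """Hypothetical stand-in: return (X, y), plus meta when requested."""
    return (X, y, meta) if return_meta else (X, y)


# Stand-in values; the real wrapper returns NumPy or pandas objects.
features = [[0.1, 0.2], [0.3, 0.4]]
labels = [0, 1]
info = {"note": "placeholder for DatasetMeta"}

# Default: two values to unpack.
X, y = pack_result(features, labels, info)

# With return_meta=True: three values to unpack.
X, y, meta = pack_result(features, labels, info, return_meta=True)
print(len(pack_result(features, labels, info, return_meta=True)))  # 3
```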
- Parameters:
n_samples (int)
n_features (int)
n_informative (int)
n_redundant (int)
n_classes (int)
class_sep (float)
random_state (int | None)
n_noise (int)
noise_distribution (str)
batch_effect (bool)
n_batches (int)
batch_effect_strength (float)
confounding_with_class (float)
return_meta (bool)
return_pandas (bool)
kwargs (Any)
- Return type: