biomedical_data_generator.generate_dataset
- biomedical_data_generator.generate_dataset(cfg, /, *, return_dataframe=True, **overrides)
Generate an n-class classification dataset with optional correlated clusters.
Features are ordered as: cluster features (anchors first within each cluster), then free informative features, then free pseudo features, then noise features.
Labels y must be specified explicitly via cfg.class_counts (exact per-class sample counts). Class-wise shifts are then applied to informative features and cluster anchors (via anchor_effect_size) to create class separation.
Reproducibility is controlled by cfg.random_state and optional per-cluster seeds. Feature names follow either a “prefixed” scheme (e.g., i*, corr{cid}_k, p*, n*) or a generic feature_1..p scheme.
The returned meta includes role masks, cluster indices, empirical class proportions, and the resolved configuration.
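The canonical column order and the idea of role masks can be illustrated with a small standalone sketch. It does not call the library: `canonical_layout`, `ROLE_ORDER`, and the role keys are hypothetical names chosen for illustration, and the real DatasetMeta fields may be laid out differently. Only the generic feature_1..p naming is shown, since the exact form of the prefixed names is not spelled out here.

```python
from itertools import chain

# Role order as described above: clusters, free informative, free pseudo, noise.
# These keys are illustrative only and are not DatasetConfig field names.
ROLE_ORDER = ("cluster", "informative", "pseudo", "noise")


def canonical_layout(counts):
    """Return generic column names and one boolean mask per role."""
    # Repeat each role label as many times as it has features, in canonical order.
    roles = list(
        chain.from_iterable([role] * counts.get(role, 0) for role in ROLE_ORDER)
    )
    # Generic naming scheme: feature_1 .. feature_p.
    names = [f"feature_{i}" for i in range(1, len(roles) + 1)]
    # A mask is True exactly at the columns that play the given role.
    masks = {role: [r == role for r in roles] for role in ROLE_ORDER}
    return names, masks


names, masks = canonical_layout(
    {"cluster": 3, "informative": 2, "pseudo": 1, "noise": 2}
)
print(names)
print(masks["noise"])
```

With masks of this kind, selecting, say, only the noise columns of X reduces to indexing by the corresponding mask.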
- Parameters:
cfg (DatasetConfig) – Configuration including feature counts, cluster layout, correlation parameters, naming policy, randomness controls, n_classes, and class_counts. The class_counts parameter is required and must be a dict mapping class indices to exact sample counts (e.g., {0: 50, 1: 50}).
return_dataframe (bool, optional) – If True (default), return X as a pandas.DataFrame with column names. If False, return X as a NumPy array in the same column order.
**overrides – Optional config overrides merged into cfg (e.g., n_samples=…).
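One plausible way to picture how **overrides are merged into cfg is the sketch below. It is an assumption, not the library's implementation: `SketchConfig` is a hypothetical stand-in for DatasetConfig with only a few fields, and `resolve_config` is an illustrative helper. Whether the real function copies or mutates cfg is not stated here.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class SketchConfig:
    # Hypothetical stand-in for DatasetConfig; the real class has more fields.
    n_samples: int = 100
    n_classes: int = 2
    class_counts: Dict[int, int] = field(default_factory=lambda: {0: 50, 1: 50})
    random_state: Optional[int] = None


def resolve_config(cfg, **overrides):
    """Return a new config with the overrides applied; cfg itself is unchanged."""
    # dataclasses.replace raises TypeError for unknown field names,
    # so a misspelled override fails loudly instead of being ignored.
    return replace(cfg, **overrides)


base = SketchConfig()
resolved = resolve_config(base, n_samples=60, class_counts={0: 30, 1: 30})
print(resolved)
```

Note that overriding n_samples alone would leave class_counts summing to the old total, which is why both are overridden together in this sketch.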
- Return type:
tuple[DataFrame | ndarray[tuple[Any, …], dtype[float64]], ndarray[tuple[Any, …], dtype[int64]], DatasetMeta]
Returns:
- tuple:
X (pandas.DataFrame | np.ndarray): Shape (n_samples, n_features). By default a DataFrame with feature names in canonical order (clusters → free informative → free pseudo → noise).
y (np.ndarray): Shape (n_samples,). Integer labels in {0, 1, …, n_classes-1}.
meta (DatasetMeta): Metadata including role masks, cluster indices/labels, empirical class weights, and the resolved configuration.
Raises:
ValueError: If class_counts is not specified or if sum(class_counts.values()) != n_samples.
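The class_counts contract and the error condition above can be sketched without the library. `labels_from_class_counts` is a hypothetical helper written for illustration. It emits labels grouped by class, whereas the real function may order or shuffle samples differently.

```python
def labels_from_class_counts(class_counts, n_samples):
    """Expand exact per-class counts into a flat list of integer labels."""
    if not class_counts:
        raise ValueError("class_counts must be specified")
    # class_counts is a dict, so the counts are its values, not its keys.
    total = sum(class_counts.values())
    if total != n_samples:
        raise ValueError(
            f"sum of class_counts ({total}) != n_samples ({n_samples})"
        )
    labels = []
    for cls in sorted(class_counts):
        labels.extend([cls] * class_counts[cls])
    return labels


y = labels_from_class_counts({0: 50, 1: 30, 2: 20}, n_samples=100)

# Empirical class proportions, analogous to what the returned meta reports.
proportions = {cls: y.count(cls) / len(y) for cls in sorted(set(y))}
print(proportions)
```

Because the counts are exact, the empirical proportions are fully determined by class_counts and do not vary with the random seed.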