biomedical_data_generator.generate_dataset

biomedical_data_generator.generate_dataset(cfg, return_dataframe=True)

Generate a synthetic biomedical dataset with a specified feature structure.

Creates a classification dataset with configurable informative features, noise, correlated feature clusters (e.g., biological pathways), and optional batch effects.

Parameters:
  • cfg – Configuration object defining the dataset structure. See DatasetConfig for details.

  • return_dataframe – If True, return features as a pandas.DataFrame with named columns. If False, return as a NumPy array.

Returns:

A 3-tuple containing:

  • x (pandas.DataFrame or numpy.ndarray): Feature matrix of shape (n_samples, n_features). Each row represents one sample (e.g., a patient) and each column represents one feature (e.g., a biomarker or gene expression value). When x is returned as a DataFrame, the column names depend on cfg.feature_naming: “prefixed” (default) uses type-based prefixes (i for informative, corr for correlated clusters, n for noise), yielding names like i1, corr1_anchor, n1; “sequential” numbers the columns feature_1, feature_2, ....

  • y (numpy.ndarray): Class labels of shape (n_samples,) with integer values 0, 1, ..., n_classes-1.

  • meta (DatasetMeta): Metadata object containing feature masks (informative, correlated, noise, batch-specific), correlation block specifications, batch assignments, and complete generation configuration.
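
The two naming schemes described for x can be illustrated with a small standalone sketch. The helper sketch_column_names below is hypothetical and is not part of biomedical_data_generator. It covers only informative and noise features, and it assumes that informative columns come before noise columns, which this page does not state.

```python
def sketch_column_names(n_informative, n_noise, naming="prefixed"):
    """Hypothetical helper (not library code) mimicking the documented names.

    Assumption: informative columns precede noise columns.
    Correlated-cluster names are left out because only `corr1_anchor`
    is documented on this page.
    """
    if naming == "prefixed":
        # Type-based prefixes: i for informative, n for noise.
        informative = [f"i{k}" for k in range(1, n_informative + 1)]
        noise = [f"n{k}" for k in range(1, n_noise + 1)]
        return informative + noise
    if naming == "sequential":
        # One running counter across all features.
        total = n_informative + n_noise
        return [f"feature_{k}" for k in range(1, total + 1)]
    raise ValueError(f"unknown naming scheme: {naming!r}")


print(sketch_column_names(2, 3))
# ['i1', 'i2', 'n1', 'n2', 'n3']
print(sketch_column_names(2, 3, naming="sequential"))
# ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5']
```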

Return type:

tuple

Examples

>>> from biomedical_data_generator import generate_dataset
>>> from biomedical_data_generator.config import DatasetConfig, ClassConfig
>>> data_cfg_1 = DatasetConfig(
...     n_informative=5,
...     n_noise=10,
...     class_configs=[ClassConfig(n_samples=100, label="healthy"),
...                    ClassConfig(n_samples=100, label="diseased")],
...     random_state=42
... )
>>> x1, y1, meta_data1 = generate_dataset(data_cfg_1)
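
A quick sanity check of the documented return contract might look like the sketch below. The helper check_outputs is hypothetical and not part of the library. It relies only on len() and iteration, so it should accept a DataFrame, a NumPy array, or (as in the demo here) plain lists standing in for real output.

```python
def check_outputs(x, y, n_classes):
    """Hypothetical check (not library code) of the documented x/y contract."""
    n_samples = len(x)  # number of rows for DataFrame, ndarray, or list
    if len(y) != n_samples:
        raise ValueError(f"x has {n_samples} rows but y has {len(y)} labels")
    labels = {int(v) for v in y}
    # Labels are documented as integers 0, 1, ..., n_classes-1.
    if not labels <= set(range(n_classes)):
        raise ValueError(f"unexpected labels: {sorted(labels)}")
    return n_samples, sorted(labels)


# Stand-in data: three samples, two features, two classes.
demo_x = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
demo_y = [0, 1, 1]
print(check_outputs(demo_x, demo_y, n_classes=2))
# (3, [0, 1])
```

With the example above, the same check would be called as check_outputs(x1, y1, n_classes=2).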