biomedical_data_generator.generate_dataset
- biomedical_data_generator.generate_dataset(cfg, return_dataframe=True)
Generate synthetic biomedical dataset with specified feature structure.
Creates a classification dataset with configurable informative features, noise, correlated feature clusters (e.g., biological pathways), and optional batch effects.
- Parameters:
cfg – Configuration object defining the dataset structure. See DatasetConfig for details.
return_dataframe – If True, return features as a pandas.DataFrame with named columns. If False, return them as a NumPy array.
- Returns:
A 3-tuple containing:
x (pandas.DataFrame or numpy.ndarray): Feature matrix of shape (n_samples, n_features). Each row represents one sample (e.g., a patient) and each column represents one feature (e.g., a biomarker or gene expression value). When returned as a DataFrame, column names depend on cfg.feature_naming: "prefixed" (default) uses type-based prefixes (i for informative, corr for correlated clusters, n for noise), yielding names like i1, corr1_anchor, n1; "sequential" uses sequential numbering: feature_1, feature_2, ....
y (numpy.ndarray): Class labels of shape (n_samples,) with integer values 0, 1, ..., n_classes-1.
meta (DatasetMeta): Metadata object containing feature masks (informative, correlated, noise, batch-specific), correlation block specifications, batch assignments, and the complete generation configuration.
- Return type:
tuple of (pandas.DataFrame or numpy.ndarray, numpy.ndarray, DatasetMeta)
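The two column-naming schemes described for x can be illustrated with a small stand-alone sketch. The helper sketch_feature_names below is hypothetical and is not part of biomedical_data_generator; it assumes informative columns come before noise columns, and it leaves out correlated clusters because the text only shows the anchor name (corr1_anchor) and not how the other cluster members are named.

```python
def sketch_feature_names(n_informative, n_noise, naming="prefixed"):
    """Hypothetical illustration of the two naming schemes.

    Not the library's implementation: the column order (informative
    first, then noise) is an assumption, and correlated clusters are
    deliberately omitted.
    """
    if naming == "prefixed":
        # Type-based prefixes: i for informative, n for noise.
        informative = [f"i{k}" for k in range(1, n_informative + 1)]
        noise = [f"n{k}" for k in range(1, n_noise + 1)]
        return informative + noise
    if naming == "sequential":
        # One running counter across all features.
        total = n_informative + n_noise
        return [f"feature_{k}" for k in range(1, total + 1)]
    raise ValueError(f"unknown naming scheme: {naming!r}")


print(sketch_feature_names(2, 3))
# ['i1', 'i2', 'n1', 'n2', 'n3']
print(sketch_feature_names(2, 3, naming="sequential"))
# ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5']
```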
Examples
>>> from biomedical_data_generator import generate_dataset
>>> from biomedical_data_generator.config import DatasetConfig, ClassConfig
>>> data_cfg_1 = DatasetConfig(
...     n_informative=5,
...     n_noise=10,
...     class_configs=[ClassConfig(n_samples=100, label="healthy"),
...                    ClassConfig(n_samples=100, label="diseased")],
...     random_state=42
... )
>>> x1, y1, meta_data1 = generate_dataset(data_cfg_1)
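After generating a dataset, it can be useful to confirm that the output matches the documented contract: one label per row of x, and labels drawn from 0, 1, ..., n_classes-1. The function check_generated_output below is a hypothetical helper, not part of the library. It relies only on len() and iteration, so it should accept a DataFrame, a NumPy array, or plain lists; the stand-in lists are used here so the sketch runs without the package installed.

```python
def check_generated_output(x, y, n_classes):
    """Hypothetical sanity check for an (x, y) pair.

    Returns (n_samples, sorted list of labels that occur) or raises
    ValueError if the documented shape/label contract is violated.
    """
    n_samples = len(x)  # number of rows for DataFrame, ndarray, or list
    if len(y) != n_samples:
        raise ValueError(
            f"x has {n_samples} rows but y has {len(y)} labels"
        )
    labels = {int(label) for label in y}
    invalid = sorted(labels - set(range(n_classes)))
    if invalid:
        raise ValueError(f"labels outside 0..{n_classes - 1}: {invalid}")
    return n_samples, sorted(labels)


# Stand-in data shaped like a tiny (x, y) pair with two classes.
x_demo = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
y_demo = [0, 1, 1]
print(check_generated_output(x_demo, y_demo, n_classes=2))
# (3, [0, 1])
```

With the example above, the analogous call would be check_generated_output(x1, y1, n_classes=2), assuming the two ClassConfig entries map to labels 0 and 1.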