Biomedical Data Generator
Generate reproducible, labeled synthetic datasets for machine learning with a focus on biomedical applications.
Key Features
Correlated feature clusters with equicorrelated and Toeplitz structures
Class-specific correlation patterns (e.g., pathways active only in diseased class)
Batch effects simulation with controllable confounding
Ground-truth metadata capturing complete generative process
Scikit-learn compatible output for seamless integration
Configurable feature roles (informative, noise, proxy)
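The two correlation structures named above can be illustrated with plain NumPy. This is a minimal sketch, not the package's implementation: the helper names `equicorrelated` and `toeplitz_corr` are hypothetical, and the Toeplitz variant assumes the common rho ** |i - j| decay, which may differ from the parameterization the package uses.

```python
import numpy as np


def equicorrelated(n_features: int, rho: float) -> np.ndarray:
    """Correlation matrix with 1 on the diagonal and rho everywhere else."""
    # Positive definite only for -1 / (n_features - 1) < rho < 1.
    return (1.0 - rho) * np.eye(n_features) + rho * np.ones((n_features, n_features))


def toeplitz_corr(n_features: int, rho: float) -> np.ndarray:
    """Correlation matrix with entries rho ** |i - j| (assumed decay form)."""
    idx = np.arange(n_features)
    return rho ** np.abs(idx[:, None] - idx[None, :])


rng = np.random.default_rng(42)
corr = toeplitz_corr(4, 0.8)

# Draw one correlated cluster: independent normals times the Cholesky factor.
chol = np.linalg.cholesky(corr)
cluster = rng.standard_normal((500, 4)) @ chol.T  # rows are samples
```

In the equicorrelated case every pair of features in a cluster shares the same correlation, while in the Toeplitz case correlation fades with the distance between feature indices.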
Quick Example
from biomedical_data_generator import DatasetConfig, ClassConfig, generate_dataset

cfg = DatasetConfig(
    n_informative=5,
    n_noise=10,
    class_configs=[
        ClassConfig(n_samples=50, label="healthy"),
        ClassConfig(n_samples=50, label="diseased"),
    ],
    class_sep=1.5,
    random_state=42,
)
X, y, meta = generate_dataset(cfg)
Installation
pip install biomedical-data-generator
Requirements: Python 3.11+
Use Cases
This package is designed for:
Generating reproducible datasets
Simulating high-dimensional data with known ground truth feature roles
Simulating multi-class problems with class-specific correlation structures
Creating datasets with controlled signal-to-noise ratios
Benchmarking feature selection and classification methods
Evaluating methods for handling small sample sizes in high dimensions
Testing robustness under correlation and non-causal variation
Validating feature importance metrics against known ground truth
Studying stability of selected features across resamples
Exploring effects of batch confounding on model performance
Illustrating impact of correlated features on model interpretability
Prototyping new algorithms for biomedical data
Generating data for domain adaptation experiments with batch effects
Teaching machine learning concepts with transparent ground truth
Demonstrating cross-validation pitfalls in high-dimensional settings
Scientific Context
Many biomedical machine learning problems operate in p >> n settings: thousands of variables but only dozens of samples. In these regimes, model performance and feature selection stability are heavily influenced by:
Correlated feature clusters (e.g., pathways or co-expressed genes)
Non-causal variation (batch effects, site differences)
Noise features appearing discriminative by chance
Small changes in class balance or effect size
This generator provides a configurable, transparent way to simulate such scenarios with complete ground truth for validation.
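The point about noise features appearing discriminative by chance can be checked with a few lines of NumPy. The sketch below does not use the package; it draws pure noise with assumed sizes (40 samples, 5000 features) and counts how many features still show a large two-sample t-statistic between two arbitrary classes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 40, 5000  # assumed p >> n sizes for illustration

# Pure noise: no feature carries any class signal.
X = rng.standard_normal((n_samples, n_features))
y = np.repeat([0, 1], n_samples // 2)

# Welch t-statistic per feature between the two classes.
a, b = X[y == 0], X[y == 1]
t = (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(
    a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b)
)

n_extreme = int(np.sum(np.abs(t) > 2.0))
print(f"{n_extreme} of {n_features} noise features have |t| > 2")
```

Roughly one in twenty noise features is expected to cross this threshold, so with thousands of features a naive univariate filter will select many that carry no signal. Known ground-truth roles make such false selections directly measurable.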
Architecture
The generator follows a six-phase pipeline:
Label generation: Create class labels with exact counts
Informative features: Generate features with class separation
Correlated clusters: Create feature blocks with within-cluster correlations
Noise features: Generate independent uninformative features
Assembly: Concatenate all feature blocks in defined order
Batch effects (optional): Apply technical overlays
Each module has a single responsibility:
features/informative.py: Labels and class separation
features/correlated.py: Cluster generation with class-specific correlations
features/noise.py: Pure noise generation
effects/batch.py: Technical overlays (batch effects)
generator.py: Pipeline orchestration
config.py: Configuration models with validation
meta.py: Ground truth capture
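The six phases can be sketched as a single NumPy function. This is an illustrative approximation under stated assumptions, not the package's code: `sketch_pipeline`, its parameters, the mean-shift form of class separation, the additive batch offset, and the role label "correlated" are all hypothetical choices made for the example.

```python
import numpy as np


def sketch_pipeline(n_per_class=(50, 50), n_informative=5, n_cluster=4,
                    n_noise=10, class_sep=1.5, rho=0.7, batch_shift=0.5, seed=42):
    """Hypothetical stand-in for the six phases; not the package's API."""
    rng = np.random.default_rng(seed)

    # 1. Label generation: exact per-class counts.
    y = np.repeat(np.arange(len(n_per_class)), n_per_class)
    n = y.size

    # 2. Informative features: class means shifted by class_sep (assumed form).
    informative = rng.standard_normal((n, n_informative)) + class_sep * y[:, None]

    # 3. Correlated cluster: one equicorrelated block via its Cholesky factor.
    corr = (1 - rho) * np.eye(n_cluster) + rho * np.ones((n_cluster, n_cluster))
    cluster = rng.standard_normal((n, n_cluster)) @ np.linalg.cholesky(corr).T

    # 4. Noise features: independent standard normal columns.
    noise = rng.standard_normal((n, n_noise))

    # 5. Assembly: fixed block order, with roles recorded as ground truth.
    X = np.hstack([informative, cluster, noise])
    roles = (["informative"] * n_informative
             + ["correlated"] * n_cluster
             + ["noise"] * n_noise)

    # 6. Batch effects (optional): additive offset for samples in batch 1.
    batch = rng.integers(0, 2, size=n)
    X = X + batch_shift * batch[:, None]

    return X, y, {"roles": roles, "batch": batch}


X, y, meta = sketch_pipeline()
print(X.shape, np.bincount(y))
```

Keeping each phase as a separate step mirrors the module split listed above: every block can be generated and tested in isolation, and the metadata dictionary records which columns play which role.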
Citation
If you use this package in a scientific publication, please cite:
@software{biomedical_data_generator,
  author = {May, Sigrun},
  title = {biomedical-data-generator: Synthetic biomedical data generator for benchmarking and teaching},
  year = {2025},
  url = {https://github.com/sigrun-may/biomedical-data-generator},
  version = {1.0.0}
}