biomedical_data_generator.DatasetConfig
- class biomedical_data_generator.DatasetConfig(*, n_informative=2, n_noise=0, class_configs=[ClassConfig(n_samples=30, class_distribution='normal', class_distribution_params={'loc': 0, 'scale': 1}, label='healthy'), ClassConfig(n_samples=30, class_distribution='normal', class_distribution_params={'loc': 0, 'scale': 1}, label='diseased')], class_sep=<factory>, noise_distribution='normal', noise_distribution_params=<factory>, prefixed_feature_naming=True, prefix_informative='i', prefix_noise='n', prefix_corr='corr', corr_clusters=<factory>, corr_between=0.0, batch_effects=None, random_state=None)[source]
Bases: BaseModel

Configuration for synthetic dataset generation.
This model defines the input-level controls for building a synthetic dataset. It combines:

- Base role counts: n_informative and n_noise
- Correlated clusters: corr_clusters (each 1 anchor + (k−1) proxies)
- Class definitions: class_configs (with per-class n_samples and labels)
- Optional batch effects
- Examples (counting):

  1. One cluster k=4 with an informative anchor, plus n_informative=3, n_noise=2:
     proxies_from_clusters = (4−1) = 3
     n_features_expected = 3 + 2 + 3 = 8
     Breakdown:
     informative_anchors = 1 → free_informative = 3 − 1 = 2
     noise_anchors = 0 → free_noise = 2 − 0 = 2

  2. Two clusters: k=5 (informative anchor), k=3 ("noise" anchor), base n_informative=4, n_noise=3:
     proxies_from_clusters = (5−1) + (3−1) = 6
     n_features_expected = 4 + 3 + 6 = 13
     Breakdown:
     informative_anchors = 1 → free_informative = 4 − 1 = 3
     noise_anchors = 1 → free_noise = 3 − 1 = 2
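The counting rules above can be sketched in plain Python (standalone arithmetic, no library needed); the assumed rule is that each cluster of size k contributes its anchor plus (k − 1) proxies, while anchors come out of the base n_informative/n_noise budgets:

```python
# Sketch of the feature-counting rules (not the library's implementation).

def expected_feature_count(n_informative, n_noise, cluster_sizes):
    """Total features = base informative + base noise + proxies from clusters."""
    proxies_from_clusters = sum(k - 1 for k in cluster_sizes)
    return n_informative + n_noise + proxies_from_clusters

# Example 1: one cluster k=4, n_informative=3, n_noise=2 -> 8 features
assert expected_feature_count(3, 2, [4]) == 8

# Example 2: clusters k=5 and k=3, n_informative=4, n_noise=3 -> 13 features
assert expected_feature_count(4, 3, [5, 3]) == 13
```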
- Derived quantities:

  These attributes are derived and must not be passed by the user:

  n_samples (int): Total samples (derived from class_configs).
  n_features (int): Total number of features in the complete dataset (derived from n_informative, n_noise, and corr_clusters).
  n_classes (int): Number of classes (derived from class_configs).
  n_informative_free (int): Informative features not used as anchors.
  n_noise_free (int): Noise features not used as anchors.
- Parameters:

  n_informative (int) – Number of base informative features (not in clusters).
  n_noise (int) – Number of base noise features (not in clusters).
  class_configs (list[ClassConfig]) – List of class definitions.
  class_sep (float | Sequence[float]) – Class separation values (length n_classes - 1); a scalar is broadcast.
  corr_clusters (list[CorrClusterConfig]) – List of CorrClusterConfig defining correlated feature clusters.
  corr_between (float) – Correlation between different clusters/roles (0 = independent).
  noise_distribution (Literal['normal', 'lognormal', 'exp_normal', 'uniform', 'exponential', 'laplace']) – Distribution for noise features. Can be any supported DistributionType.
  noise_distribution_params (dict) – Parameters for the noise distribution.
  prefixed_feature_naming (bool) – If True, use role-based prefixed feature names:
    Free informative: i1, i2, …
    Free noise: n1, n2, …
    Correlated: corr{cid}_anchor, corr{cid}_2, …, corr{cid}_k
    If False, use generic feature_{i} naming. Default: True.
  prefix_informative (str) – Prefix for informative features (if prefixed_feature_naming=True). Default: "i".
  prefix_noise (str) – Prefix for noise features (if prefixed_feature_naming=True). Default: "n".
  prefix_corr (str) – Prefix for correlated cluster features (if prefixed_feature_naming=True). Default: "corr".
  batch_effects (BatchEffectsConfig | None) – Optional BatchEffectsConfig for simulating batch effects.
  random_state (int | None) – Global random seed for dataset generation.
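The prefixed naming convention can be illustrated with a small helper. The function below is hypothetical (not part of the library); it only mirrors the documented pattern i1…iN, n1…nN, corr{cid}_anchor, corr{cid}_2…corr{cid}_k:

```python
# Hypothetical helper mirroring the documented naming scheme.

def make_feature_names(n_informative_free, n_noise_free, cluster_sizes,
                       prefix_informative="i", prefix_noise="n", prefix_corr="corr"):
    names = [f"{prefix_informative}{j}" for j in range(1, n_informative_free + 1)]
    names += [f"{prefix_noise}{j}" for j in range(1, n_noise_free + 1)]
    for cid, k in enumerate(cluster_sizes, start=1):
        names.append(f"{prefix_corr}{cid}_anchor")          # anchor of cluster cid
        names += [f"{prefix_corr}{cid}_{j}" for j in range(2, k + 1)]  # proxies
    return names

print(make_feature_names(2, 1, [3]))
# → ['i1', 'i2', 'n1', 'corr1_anchor', 'corr1_2', 'corr1_3']
```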
- Validation:
- Before model construction:
Forbid manual n_samples, n_classes, n_features.
Normalize class_sep: broadcast scalar to length n_classes - 1 or validate sequence length.
- After model construction:
Ensure n_informative >= #informative_anchors and n_noise >= #noise_anchors.
Check corr_between in [-1, 1].
Ensure anchor_class indices < n_classes.
Require at least one non-zero class_sep if n_informative_free > 0.
Auto-generate missing class labels as class_{idx}.
- Raises:
ValueError – On invalid numeric ranges or inconsistent counts.
TypeError – For invalid types in class_configs or class_sep.
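The class_sep normalization described above can be sketched as follows. This is a minimal standalone sketch of the documented behavior, not the library's validator: a scalar is broadcast to length n_classes - 1, and a sequence must already have that length.

```python
from collections.abc import Sequence

def normalize_class_sep(class_sep, n_classes):
    """Broadcast a scalar class_sep or validate a sequence of length n_classes - 1."""
    if isinstance(class_sep, (int, float)):
        return [float(class_sep)] * (n_classes - 1)
    if isinstance(class_sep, Sequence):
        if len(class_sep) != n_classes - 1:
            raise ValueError(f"class_sep must have length {n_classes - 1}")
        return [float(s) for s in class_sep]
    raise TypeError("class_sep must be a float or a sequence of floats")

assert normalize_class_sep(1.5, 3) == [1.5, 1.5]      # scalar broadcast
assert normalize_class_sep([0.5, 2.0], 3) == [0.5, 2.0]  # sequence validated
```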
Examples
>>> # Basic dataset with two classes
>>> cfg = DatasetConfig(
...     n_informative=5,
...     n_noise=3,
...     class_configs=[
...         ClassConfig(n_samples=50, label="healthy"),
...         ClassConfig(n_samples=50, label="diseased"),
...     ],
...     corr_clusters=[
...         CorrClusterConfig(
...             n_cluster_features=4,
...             correlation=0.8,
...             anchor_role="informative",
...             anchor_effect_size="medium",
...             anchor_class=1,
...             label="Metabolic Pathway A",
...         ),
...         CorrClusterConfig(
...             n_cluster_features=3,
...             correlation=0.5,
...             anchor_role="noise",
...             label="Random Noise Cluster",
...         ),
...     ],
...     corr_between=0.1,
...     noise_distribution="normal",
...     noise_distribution_params={"loc": 0, "scale": 1},
...     prefixed_feature_naming=True,
...     random_state=42,
... )
- __init__(**data)
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
data (Any)
- Return type:
None
Methods

__init__(**data) – Create a new model by parsing and validating input data from keyword arguments.
breakdown() – Structured feature counts incl. cluster proxies and anchor split.
construct([_fields_set])
copy(*[, include, exclude, update, deep]) – Returns a copy of the model.
count_informative_anchors() – Count clusters whose anchor contributes as 'informative'.
count_noise_anchors() – Count clusters whose anchor is 'noise' (non-informative anchor).
dict(*[, include, exclude, by_alias, ...])
from_orm(obj)
from_yaml(path) – Load from YAML and validate via the same pipeline.
json(*[, include, exclude, by_alias, ...])
model_construct([_fields_set]) – Creates a new instance of the Model class with validated data.
model_copy(*[, update, deep])
model_dump(*[, mode, include, exclude, ...])
model_dump_json(*[, indent, ensure_ascii, ...])
model_json_schema([by_alias, ref_template, ...]) – Generates a JSON schema for a model class.
model_parametrized_name(params) – Compute the class name for parametrizations of generic classes.
model_post_init(context, /) – Override this method to perform additional initialization after __init__ and model_construct.
model_rebuild(*[, force, raise_errors, ...]) – Try to rebuild the pydantic-core schema for the model.
model_validate(obj, *[, strict, extra, ...]) – Validate a pydantic model instance.
model_validate_json(json_data, *[, strict, ...]) – Validate the given JSON data against the Pydantic model.
model_validate_strings(obj, *[, strict, ...]) – Validate the given object with string data against the Pydantic model.
parse_file(path, *[, content_type, ...])
parse_obj(obj)
parse_raw(b, *[, content_type, encoding, ...])
schema([by_alias, ref_template])
schema_json(*[, by_alias, ref_template])
update_forward_refs(**localns)
validate(value)

Attributes

model_computed_fields
model_config – Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
model_extra – Get extra fields set during validation.
model_fields
model_fields_set – Returns the set of fields that have been explicitly set on this model instance.
n_classes – Number of classes (derived from class_configs).
n_features – Total number of features (informative + noise + cluster proxies).
n_informative_free – Informative features outside clusters (excludes informative anchors).
n_noise_free – Independent noise features (excludes noise anchors).
n_samples – Total samples (derived from class_configs).
List of class labels (auto-generated or user-provided).
n_informative, n_noise, class_configs, class_sep, noise_distribution, noise_distribution_params, prefixed_feature_naming, prefix_informative, prefix_noise, prefix_corr, corr_clusters, corr_between, batch_effects, random_state

- breakdown()[source]
Structured feature counts incl. cluster proxies and anchor split.
- Returns:
  A dict with keys:
  n_informative_total
  n_informative_anchors
  n_informative_free
  n_noise_total
  n_noise_anchors
  n_noise_free
  proxies_from_clusters
  n_features
- Return type:
  dict
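The arithmetic behind these counts can be sketched as a standalone function (an illustrative sketch, not the library's breakdown() implementation; cluster specs are assumed to be (size, anchor_role) pairs):

```python
# Illustrative sketch of the breakdown() arithmetic described above.

def breakdown_counts(n_informative, n_noise, clusters):
    """clusters: list of (size, anchor_role) pairs, role in {'informative', 'noise'}."""
    informative_anchors = sum(1 for _, role in clusters if role == "informative")
    noise_anchors = sum(1 for _, role in clusters if role == "noise")
    proxies = sum(k - 1 for k, _ in clusters)
    return {
        "n_informative_total": n_informative,
        "n_informative_anchors": informative_anchors,
        "n_informative_free": n_informative - informative_anchors,
        "n_noise_total": n_noise,
        "n_noise_anchors": noise_anchors,
        "n_noise_free": n_noise - noise_anchors,
        "proxies_from_clusters": proxies,
        "n_features": n_informative + n_noise + proxies,
    }

# Matches the second counting example: k=5 informative anchor, k=3 noise anchor.
b = breakdown_counts(4, 3, [(5, "informative"), (3, "noise")])
assert b["n_features"] == 13 and b["n_informative_free"] == 3 and b["n_noise_free"] == 2
```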
- count_informative_anchors()[source]
Count clusters whose anchor contributes as ‘informative’.
Note: This is a subset of n_informative, not a separate count.
- Returns:
The number of clusters with anchor_role == “informative”.
- Return type:
  int
Note
If you want the number of additional features contributed by clusters, use self._proxies_from_clusters(self.corr_clusters).
- count_noise_anchors()[source]
Count clusters whose anchor is ‘noise’ (non-informative anchor).
- Returns:
The number of clusters with anchor_role == “noise”.
- Return type:
  int
- classmethod from_yaml(path)[source]
Load from YAML and validate via the same pipeline.
- Parameters:
path (str)
- Return type:
  DatasetConfig
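Since from_yaml validates through the same pipeline, a YAML file is expected to mirror the constructor's keyword arguments. A hypothetical example file (field names assumed to match the parameters documented above):

```yaml
# example_config.yaml — hypothetical; keys mirror DatasetConfig's keyword arguments
n_informative: 5
n_noise: 3
class_configs:
  - n_samples: 50
    label: healthy
  - n_samples: 50
    label: diseased
corr_clusters:
  - n_cluster_features: 4
    correlation: 0.8
    anchor_role: informative
corr_between: 0.1
random_state: 42
```

Loading it would then be `cfg = DatasetConfig.from_yaml("example_config.yaml")`, with the usual validation errors raised on inconsistent counts.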
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'use_enum_values': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].