biomedical_data_generator.DatasetConfig

class biomedical_data_generator.DatasetConfig(*, n_informative=2, n_noise=0, class_configs=[ClassConfig(n_samples=30, class_distribution='normal', class_distribution_params={'loc': 0, 'scale': 1}, label='healthy'), ClassConfig(n_samples=30, class_distribution='normal', class_distribution_params={'loc': 0, 'scale': 1}, label='diseased')], class_sep=<factory>, noise_distribution='normal', noise_distribution_params=<factory>, prefixed_feature_naming=True, prefix_informative='i', prefix_noise='n', prefix_corr='corr', corr_clusters=<factory>, corr_between=0.0, batch_effects=None, random_state=None)[source]

Bases: BaseModel

Configuration for synthetic dataset generation.

This model defines the input-level controls for building a synthetic dataset. It combines:

  • Base role counts: n_informative and n_noise

  • Correlated clusters: corr_clusters (each 1 anchor + (k−1) proxies)

  • Class definitions: class_configs (with per-class n_samples and labels)

  • Optional batch effects

Examples (counting):
  1. One cluster with k=4 and an informative anchor, plus n_informative=3, n_noise=2:
     proxies_from_clusters = (4−1) = 3
     n_features_expected = 3 + 2 + 3 = 8
     Breakdown:

     • informative_anchors = 1 → free_informative = 3 − 1 = 2

     • noise_anchors = 0 → free_noise = 2 − 0 = 2

  2. Two clusters, k=5 (informative anchor) and k=3 (“noise” anchor), with base n_informative=4, n_noise=3:
     proxies_from_clusters = (5−1) + (3−1) = 6
     n_features_expected = 4 + 3 + 6 = 13
     Breakdown:

     • informative_anchors = 1 → free_informative = 4 − 1 = 3

     • noise_anchors = 1 → free_noise = 3 − 1 = 2
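
The counting rules in the two examples above can be reproduced with a small standalone sketch (the helper name is hypothetical and not part of the library; it only restates the documented arithmetic: each cluster of size k contributes k−1 proxies, and anchors are drawn from the base budgets):

```python
# Hypothetical helper (not library code) restating the documented counting rules.
def expected_feature_counts(n_informative, n_noise, cluster_sizes, anchor_roles):
    # Each cluster of size k contributes (k - 1) proxy features.
    proxies = sum(k - 1 for k in cluster_sizes)
    informative_anchors = anchor_roles.count("informative")
    noise_anchors = anchor_roles.count("noise")
    return {
        "proxies_from_clusters": proxies,
        "free_informative": n_informative - informative_anchors,
        "free_noise": n_noise - noise_anchors,
        "n_features": n_informative + n_noise + proxies,
    }

# Example 2 above: clusters k=5 (informative anchor) and k=3 (noise anchor).
counts = expected_feature_counts(4, 3, [5, 3], ["informative", "noise"])
# counts["proxies_from_clusters"] == 6, counts["n_features"] == 13
```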

Derived quantities:

These attributes are derived and must not be passed by the user:

  • n_samples (int): Total samples (derived from class_configs).

  • n_features (int): Total number of features of the complete dataset (derived from n_informative, n_noise, and corr_clusters).

  • n_classes (int): Number of classes (derived from class_configs).

  • n_informative_free (int): Informative features not used as anchors.

  • n_noise_free (int): Noise features not used as anchors.

Parameters:
  • n_informative (int) – Number of base informative features (not in clusters).

  • n_noise (int) – Number of base noise features (not in clusters).

  • class_configs (list[ClassConfig]) – List of class definitions.

  • class_sep (float | Sequence[float]) – Class separation values (length n_classes - 1); scalar is broadcast.

  • corr_clusters (list[CorrClusterConfig]) – List of CorrClusterConfig defining correlated feature clusters.

  • corr_between (float) – Correlation between different clusters/roles (0 = independent).

  • noise_distribution (Literal['normal', 'lognormal', 'exp_normal', 'uniform', 'exponential', 'laplace']) – Distribution for noise features. Can be any supported DistributionType.

  • noise_distribution_params (dict) – Parameters for noise distribution.

  • prefixed_feature_naming (bool) –

    If True, use role-based prefixed feature names:
    • Free informative: i1, i2, …

    • Free noise: n1, n2, …

    • Correlated: corr{cid}_anchor, corr{cid}_2, …, corr{cid}_k

    If False, use generic feature_{i} naming. Default: True.

  • prefix_informative (str) – Prefix for informative features (if prefixed_feature_naming=True). Default: “i”.

  • prefix_noise (str) – Prefix for noise features (if prefixed_feature_naming=True). Default: “n”.

  • prefix_corr (str) – Prefix for correlated cluster features (if prefixed_feature_naming=True). Default: “corr”.

  • batch_effects (BatchEffectsConfig) – Optional BatchEffectsConfig for simulating batch effects.

  • random_state (int | None) – Global random seed for dataset generation.
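
The role-based naming scheme described for prefixed_feature_naming=True can be illustrated with a standalone sketch (the function is hypothetical and not part of the library; it only applies the documented pattern with the default prefixes):

```python
# Hypothetical sketch (not library code) of the documented naming scheme.
def feature_names(n_informative_free, n_noise_free, cluster_sizes,
                  prefix_informative="i", prefix_noise="n", prefix_corr="corr"):
    # Free informative features: i1, i2, ...
    names = [f"{prefix_informative}{i}" for i in range(1, n_informative_free + 1)]
    # Free noise features: n1, n2, ...
    names += [f"{prefix_noise}{i}" for i in range(1, n_noise_free + 1)]
    # Cluster features: corr{cid}_anchor, corr{cid}_2, ..., corr{cid}_k
    for cid, k in enumerate(cluster_sizes, start=1):
        names.append(f"{prefix_corr}{cid}_anchor")
        names += [f"{prefix_corr}{cid}_{j}" for j in range(2, k + 1)]
    return names

names = feature_names(2, 2, [3])
# ['i1', 'i2', 'n1', 'n2', 'corr1_anchor', 'corr1_2', 'corr1_3']
```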


Validation:
Before model construction:
  • Forbid manual n_samples, n_classes, n_features.

  • Normalize class_sep: broadcast scalar to length n_classes - 1 or validate sequence length.

After model construction:
  • Ensure n_informative >= #informative_anchors and n_noise >= #noise_anchors.

  • Check corr_between in [-1, 1].

  • Ensure anchor_class indices < n_classes.

  • Require at least one non-zero class_sep if n_informative_free > 0.

  • Auto-generate missing class labels as class_{idx}.
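
The class_sep normalization rule can be sketched as a standalone illustration (this is not the library's actual validator; it only restates the documented behavior: a scalar is broadcast to length n_classes − 1, a sequence must already have that length):

```python
from collections.abc import Sequence

# Hypothetical sketch (not library code) of the documented class_sep rule.
def normalize_class_sep(class_sep, n_classes):
    target = n_classes - 1
    if isinstance(class_sep, (int, float)):
        # Scalar: broadcast to length n_classes - 1.
        return [float(class_sep)] * target
    if isinstance(class_sep, Sequence):
        # Sequence: must already have length n_classes - 1.
        if len(class_sep) != target:
            raise ValueError(f"class_sep must have length {target}, got {len(class_sep)}")
        return [float(s) for s in class_sep]
    raise TypeError("class_sep must be a float or a sequence of floats")
```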

Raises:
  • ValueError – On invalid numeric ranges or inconsistent counts.

  • TypeError – For invalid types in class_configs or class_sep.

Examples

>>> # Basic dataset with two classes
>>> cfg = DatasetConfig(
...     n_informative=5,
...     n_noise=3,
...     class_configs=[
...         ClassConfig(n_samples=50, label="healthy"),
...         ClassConfig(n_samples=50, label="diseased"),
...     ],
...     corr_clusters=[
...         CorrClusterConfig(
...             n_cluster_features=4,
...             correlation=0.8,
...             anchor_role="informative",
...             anchor_effect_size="medium",
...             anchor_class=1,
...             label="Metabolic Pathway A"
...         ),
...         CorrClusterConfig(
...             n_cluster_features=3,
...             correlation=0.5,
...             anchor_role="noise",
...             label="Random Noise Cluster"
...         )
...     ],
...     corr_between=0.1,
...     noise_distribution="normal",
...     noise_distribution_params={"loc": 0, "scale": 1},
...     prefixed_feature_naming=True,
...     random_state=42
... )


__init__(**data)

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

data (Any)

Return type:

None

Methods

__init__(**data)

Create a new model by parsing and validating input data from keyword arguments.

breakdown()

Structured feature counts incl. cluster proxies and anchor split.

construct([_fields_set])

copy(*[, include, exclude, update, deep])

Returns a copy of the model.

count_informative_anchors()

Count clusters whose anchor contributes as 'informative'.

count_noise_anchors()

Count clusters whose anchor is 'noise' (non-informative anchor).

dict(*[, include, exclude, by_alias, ...])

from_orm(obj)

from_yaml(path)

Load from YAML and validate via the same pipeline.

json(*[, include, exclude, by_alias, ...])

model_construct([_fields_set])

Creates a new instance of the Model class with validated data.

model_copy(*[, update, deep])

Returns a copy of the model.

model_dump(*[, mode, include, exclude, ...])

Generate a dictionary representation of the model.

model_dump_json(*[, indent, ensure_ascii, ...])

Generates a JSON representation of the model.

model_json_schema([by_alias, ref_template, ...])

Generates a JSON schema for a model class.

model_parametrized_name(params)

Compute the class name for parametrizations of generic classes.

model_post_init(context, /)

Override this method to perform additional initialization after __init__ and model_construct.

model_rebuild(*[, force, raise_errors, ...])

Try to rebuild the pydantic-core schema for the model.

model_validate(obj, *[, strict, extra, ...])

Validate a pydantic model instance.

model_validate_json(json_data, *[, strict, ...])

Validate the given JSON data against the Pydantic model.

model_validate_strings(obj, *[, strict, ...])

Validate the given object with string data against the Pydantic model.

parse_file(path, *[, content_type, ...])

parse_obj(obj)

parse_raw(b, *[, content_type, encoding, ...])

schema([by_alias, ref_template])

schema_json(*[, by_alias, ref_template])

update_forward_refs(**localns)

validate(value)

Attributes

class_counts

Class counts as dict {class_idx: n_samples}.

class_labels

List of class labels (auto-generated or user-provided).

model_computed_fields

model_config

Configuration for the model; should be a dictionary conforming to pydantic.ConfigDict.

model_extra

Get extra fields set during validation.

model_fields

model_fields_set

Returns the set of fields that have been explicitly set on this model instance.

n_classes

Number of classes (derived from class_configs).

n_features

Total number of features (informative + noise + cluster proxies).

n_informative_free

Informative features outside clusters (excludes informative anchors).

n_noise_free

Independent noise features (excludes noise anchors).

n_samples

Total samples (derived from class_configs).

n_informative

n_noise

class_configs

class_sep

noise_distribution

noise_distribution_params

prefixed_feature_naming

prefix_informative

prefix_noise

prefix_corr

corr_clusters

corr_between

batch_effects

random_state

breakdown()[source]

Structured feature counts incl. cluster proxies and anchor split.

Returns:

A dict with the keys:

  • n_informative_total

  • n_informative_anchors

  • n_informative_free

  • n_noise_total

  • n_noise_anchors

  • n_noise_free

  • proxies_from_clusters

  • n_features

Return type:

dict[str, int]

property class_counts: dict[int, int]

Class counts as dict {class_idx: n_samples}.

property class_labels: list[str]

List of class labels (auto-generated or user-provided).

count_informative_anchors()[source]

Count clusters whose anchor contributes as ‘informative’.

Note: This is a subset of n_informative, not a separate count.

Returns:

The number of clusters with anchor_role == “informative”.

Return type:

int

Note

If you want the number of additional features contributed by clusters, use self._proxies_from_clusters(self.corr_clusters).

count_noise_anchors()[source]

Count clusters whose anchor is ‘noise’ (non-informative anchor).

Returns:

The number of clusters with anchor_role == “noise”.

Return type:

int

classmethod from_yaml(path)[source]

Load from YAML and validate via the same pipeline.

Parameters:

path (str)

Return type:

DatasetConfig
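
A YAML file consumed by from_yaml would plausibly mirror the keyword arguments shown in the constructor above; the exact schema is assumed here, not confirmed by this page:

```yaml
# Hypothetical config.yaml (field names assumed to mirror the constructor keywords)
n_informative: 5
n_noise: 3
class_configs:
  - n_samples: 50
    label: healthy
  - n_samples: 50
    label: diseased
corr_clusters:
  - n_cluster_features: 4
    correlation: 0.8
    anchor_role: informative
random_state: 42
```

This would then load via cfg = DatasetConfig.from_yaml("config.yaml") and pass through the same validation pipeline as keyword construction.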

model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'use_enum_values': True}

Configuration for the model; should be a dictionary conforming to pydantic.ConfigDict.

property n_classes: int

Number of classes (derived from class_configs).

property n_features: int

Total number of features (informative + noise + cluster proxies).

property n_informative_free: int

Informative features outside clusters (excludes informative anchors).

property n_noise_free: int

Independent noise features (excludes noise anchors).

property n_samples: int

Total samples (derived from class_configs).