biomedical_data_generator.DatasetConfig

class biomedical_data_generator.DatasetConfig(*, n_samples=100, n_features=None, n_informative=2, n_pseudo=0, n_noise=0, noise_distribution=NoiseDistribution.normal, noise_scale=1.0, noise_params=None, n_classes=2, weights=None, class_counts=None, class_sep=1.5, feature_naming='prefixed', prefix_informative='i', prefix_pseudo='p', prefix_noise='n', prefix_corr='corr', corr_clusters=<factory>, corr_between=0.0, anchor_mode='equalized', effect_size='medium', batch=None, random_state=None)[source]

Bases: BaseModel

Configuration for synthetic dataset generation.

The strict mode="before" normalizer fills and validates n_features without emitting Pydantic warnings.

Note: The 'before' validator normalizes raw inputs:

  • fills n_features if omitted,

  • enforces n_features >= the minimal requirement (strict).

Use DatasetConfig.relaxed(...) if you want a silent auto-fix instead of a validation error (see the sketch below).
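A minimal sketch of the strict vs. relaxed behaviour (the import path follows this page's module name; the exact error wording may differ):

    from pydantic import ValidationError

    from biomedical_data_generator import DatasetConfig

    # Strict (default): an n_features below the required minimum raises.
    try:
        DatasetConfig(n_informative=5, n_pseudo=2, n_noise=3, n_features=4)
    except ValidationError as exc:
        print(exc)

    # Relaxed: n_features is silently raised to the required minimum.
    cfg = DatasetConfig.relaxed(n_informative=5, n_pseudo=2, n_noise=3, n_features=4)
    print(cfg.n_features)  # >= 5 + 2 + 3 = 10 (no clusters, so no proxies)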


Parameters:
  • n_samples (int)

  • n_features (int | None)

  • n_informative (int)

  • n_pseudo (int)

  • n_noise (int)

  • noise_distribution (NoiseDistribution)

  • noise_scale (float)

  • noise_params (Mapping[str, Any] | None)

  • n_classes (int)

  • weights (list[float] | None)

  • class_counts (dict[int, int] | None)

  • class_sep (float)

  • feature_naming (Literal['prefixed', 'simple'])

  • prefix_informative (str)

  • prefix_pseudo (str)

  • prefix_noise (str)

  • prefix_corr (str)

  • corr_clusters (list[CorrCluster])

  • corr_between (float)

  • anchor_mode (Literal['equalized', 'strong'])

  • effect_size (Literal['small', 'medium', 'large'])

  • batch (BatchConfig | None)

  • random_state (int | None)

__init__(**data)

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

data (Any)

Return type:

None

Methods

__init__(**data)

Create a new model by parsing and validating input data from keyword arguments.

breakdown()

Return a structured breakdown of feature counts, incl. cluster proxies.

construct([_fields_set])

copy(*[, include, exclude, update, deep])

Returns a copy of the model.

count_informative_anchors()

Count clusters whose anchor contributes as 'informative'.

dict(*[, include, exclude, by_alias, ...])

from_orm(obj)

from_yaml(path)

Load from YAML and validate via the same 'before' pipeline.

json(*[, include, exclude, by_alias, ...])

model_construct([_fields_set])

Creates a new instance of the Model class with validated data.

model_copy(*[, update, deep])

Returns a copy of the model.

model_dump(*[, mode, include, exclude, ...])

Generate a dictionary representation of the model.

model_dump_json(*[, indent, ensure_ascii, ...])

Generates a JSON representation of the model.

model_json_schema([by_alias, ref_template, ...])

Generates a JSON schema for a model class.

model_parametrized_name(params)

Compute the class name for parametrizations of generic classes.

model_post_init(context, /)

Override this method to perform additional initialization after __init__ and model_construct.

model_rebuild(*[, force, raise_errors, ...])

Try to rebuild the pydantic-core schema for the model.

model_validate(obj, *[, strict, extra, ...])

Validate a pydantic model instance.

model_validate_json(json_data, *[, strict, ...])

Validate the given JSON data against the Pydantic model.

model_validate_strings(obj, *[, strict, ...])

Validate the given object with string data against the Pydantic model.

parse_file(path, *[, content_type, ...])

parse_obj(obj)

parse_raw(b, *[, content_type, encoding, ...])

relaxed(**kwargs)

Create a configuration, silently autofixing n_features to the required minimum.

schema([by_alias, ref_template])

schema_json(*[, by_alias, ref_template])

summary(*[, per_cluster, as_markdown])

Return a human-readable summary of the configuration.

update_forward_refs(**localns)

validate(value)

Attributes

model_computed_fields

model_config

Configuration for the model, should be a dictionary conforming to pydantic.ConfigDict.

model_extra

Get extra fields set during validation.

model_fields

model_fields_set

Returns the set of fields that have been explicitly set on this model instance.

n_samples

n_features

n_informative

n_pseudo

n_noise

noise_distribution

noise_scale

noise_params

n_classes

weights

class_counts

class_sep

feature_naming

prefix_informative

prefix_pseudo

prefix_noise

prefix_corr

corr_clusters

corr_between

anchor_mode

effect_size

batch

random_state

breakdown()[source]

Return a structured breakdown of feature counts, incl. cluster proxies.

Returns:

A dict with the keys:

  • n_informative_total

  • n_informative_anchors

  • n_informative_free

  • n_pseudo_free

  • n_noise

  • proxies_from_clusters

  • n_features_expected

  • n_features_configured

Raises:

ValueError: If self.n_features is inconsistent (should not happen if validated).

This is a safeguard against manual tampering with instance attributes; it should not occur if the instance was created via the normal validators. If you encounter it, please report a bug.

Note

n_features_expected = n_informative + n_pseudo + n_noise + proxies_from_clusters

n_features_configured = self.n_features (may be larger than expected)

Return type:

dict[str, int]
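For illustration, a small sketch using the documented keys (the values assume no correlation clusters, so proxies_from_clusters is 0):

    cfg = DatasetConfig.relaxed(n_informative=3, n_pseudo=2, n_noise=5)
    info = cfg.breakdown()
    # With no clusters, the expected count is the plain sum of the three groups.
    assert info["n_features_expected"] == 3 + 2 + 5
    print(info["n_informative_total"], info["n_features_configured"])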

count_informative_anchors()[source]

Count clusters whose anchor contributes as 'informative'.

Note: This is a subset of n_informative, not a separate count.

Returns:

The number of clusters with anchor_role == "informative".

Note

If you want the number of additional features contributed by clusters, use self._proxies_from_clusters(self.corr_clusters).

Return type:

int
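A hedged sketch; the CorrCluster field names below (size, rho, anchor_role) are inferred from this page (summary() mentions size/role/rho) and should be verified against the CorrCluster reference:

    from biomedical_data_generator import CorrCluster, DatasetConfig  # assumed import path

    clusters = [
        CorrCluster(size=3, rho=0.8, anchor_role="informative"),  # hypothetical fields
        CorrCluster(size=2, rho=0.5, anchor_role="pseudo"),
    ]
    cfg = DatasetConfig.relaxed(n_informative=2, corr_clusters=clusters)
    print(cfg.count_informative_anchors())  # 1: only the first anchor is 'informative'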

classmethod from_yaml(path)[source]

Load from YAML and validate via the same 'before' pipeline.

Parameters:

path (str) – Path to a YAML file.

Return type:

DatasetConfig

Returns:

A validated DatasetConfig instance.

Raises:

FileNotFoundError: If the file does not exist.

yaml.YAMLError: If the file cannot be parsed as YAML.

pydantic.ValidationError: If the loaded config is invalid.

Note

This requires PyYAML to be installed.
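For example (the YAML keys mirror the DatasetConfig field names; the file content shown is illustrative):

    # config.yaml
    #   n_samples: 200
    #   n_informative: 4
    #   n_noise: 6
    #   random_state: 42
    cfg = DatasetConfig.from_yaml("config.yaml")
    print(cfg.n_samples)  # 200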

model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'use_enum_values': True}

Configuration for the model, should be a dictionary conforming to pydantic.ConfigDict.
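In practice extra='forbid' means unknown fields are rejected at construction time (a minimal sketch):

    from pydantic import ValidationError

    try:
        DatasetConfig(n_samples=50, typo_field=1)  # typo_field is not a DatasetConfig field
    except ValidationError:
        print("extra='forbid' rejects unknown fields")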

classmethod relaxed(**kwargs)[source]

Create a configuration, silently autofixing n_features to the required minimum.

Convenience factory that silently 'autofixes' n_features to the required minimum. Prefer this in teaching notebooks to avoid interruptions.

Parameters:

**kwargs (Any) – Any valid DatasetConfig field.

Return type:

DatasetConfig

Returns:

A validated DatasetConfig instance with n_features >= required minimum.

Note

This does NOT modify the original kwargs dict.
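For example (a minimal sketch of the non-mutation guarantee):

    kwargs = {"n_informative": 5, "n_features": 2}
    cfg = DatasetConfig.relaxed(**kwargs)
    print(cfg.n_features)        # auto-fixed to the required minimum
    print(kwargs["n_features"])  # still 2; the input is not modified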

summary(*, per_cluster=False, as_markdown=False)[source]

Return a human-readable summary of the configuration.

Parameters:
  • per_cluster (bool) – Include one line per cluster (size/role/rho/etc.).

  • as_markdown (bool) – Render as a Markdown table-like text.

Return type:

str

Returns:

A formatted string summarizing the feature layout and counts.
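Typical usage (the output layout is illustrative):

    cfg = DatasetConfig.relaxed(n_informative=4, n_noise=6, random_state=0)
    print(cfg.summary())                                     # compact overview
    print(cfg.summary(per_cluster=True, as_markdown=True))   # Markdown-style, one row per cluster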