biomedical_data_generator.CorrCluster

class biomedical_data_generator.CorrCluster(*, n_cluster_features, structure='equicorrelated', rho=0.8, class_structure=None, class_rho=None, rho_baseline=0.0, anchor_role='informative', anchor_effect_size=None, anchor_class=None, random_state=None, label=None)

Bases: BaseModel

Correlated feature cluster simulating coordinated biomarker patterns.

A cluster represents a group of biomarkers that move together, such as markers in a metabolic pathway or proteins in a signaling cascade. One marker acts as the “anchor” (driver), while the others are “proxies” (followers).

Parameters:

n_cluster_features (int) – Number of biomarkers in the cluster (including anchor). Must be >= 1.
rho (float) – Correlation strength between biomarkers in the cluster. Default is 0.8.
    - 0.0 = independent
    - 0.5 = moderate correlation
    - 0.8+ = strong correlation (typical for pathway markers)
    - Range: [0, 1) for equicorrelated; (-1, 1) for toeplitz
structure (Literal['equicorrelated', 'toeplitz']) – Pattern of correlation within the cluster.
    - “equicorrelated”: all pairs have the same correlation (default)
    - “toeplitz”: correlation decreases with distance
class_structure (dict[int, Literal['equicorrelated', 'toeplitz']] | None) – Mapping of class index to correlation structure.
class_rho (dict[int, float] | None) – Mapping of class index to correlation strength.
rho_baseline (float) – Baseline correlation applied to classes not listed in class_rho. Only used when class_rho is set. Default is 0.0 (independent).
anchor_role (Literal['informative', 'pseudo', 'noise']) – Biological relevance of the anchor marker.
    - “informative”: true biomarker (predictive of disease)
    - “pseudo”: confounding variable (correlated but not causal)
    - “noise”: random measurement (no biological signal)
anchor_effect_size (Literal['small', 'medium', 'large'] | float | None) – Strength of the anchor’s disease association. Only relevant when anchor_role="informative". Can be specified as:
    - Preset: “small” (0.5), “medium” (1.0), “large” (1.5)
    - Custom float: any positive value
    - None: defaults to “medium” (1.0)
anchor_class (int | None) – Disease class that this anchor predicts (0, 1, 2, …). If None, the anchor contributes to all classes. Only used when anchor_role="informative".
random_state (int | None) – Random seed for reproducibility of this specific cluster. If None, uses the global dataset seed.
label (str | None) – Descriptive name for documentation (e.g., “Inflammation markers”).
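
The two structure options can be illustrated by writing out the correlation matrices they describe. The following is a minimal sketch, not the library's implementation: the helper names are hypothetical, and the toeplitz decay rho ** |i - j| is an assumption, since the documentation only states that correlation decreases with distance.

```python
def equicorrelated_matrix(n: int, rho: float) -> list[list[float]]:
    """Every off-diagonal entry equals rho; the diagonal is 1."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]


def toeplitz_matrix(n: int, rho: float) -> list[list[float]]:
    """Entry (i, j) is rho ** |i - j| (assumed decay; not confirmed by the docs)."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


# Print both matrices for a cluster of three features.
for row in equicorrelated_matrix(3, 0.5):
    print(row)
for row in toeplitz_matrix(3, 0.5):
    print(row)
```

Under this assumption, the first and last feature of a toeplitz cluster are the least correlated pair, while in an equicorrelated cluster every pair is coupled equally.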

Examples:

Strong inflammatory pathway in diseased patients:

>>> inflammation = CorrCluster(
...     n_cluster_features=5,
...     rho=0.8,
...     anchor_role="informative",
...     anchor_effect_size="large",
...     anchor_class=1,  # disease class
...     label="Inflammation pathway"
... )

Confounding variables (e.g., age-related markers):

>>> age_confounders = CorrCluster(
...     n_cluster_features=3,
...     rho=0.6,
...     anchor_role="pseudo",
...     label="Age-related markers"
... )

Weak disease signal with custom effect size:

>>> weak_signal = CorrCluster(
...     n_cluster_features=4,
...     rho=0.5,
...     anchor_role="informative",
...     anchor_effect_size=0.3,  # custom weak effect
...     label="Subtle biomarkers"
... )

Notes:

Medical interpretation:
    - Anchor: The primary biomarker (e.g., CRP in inflammation)
    - Proxies: Secondary markers that follow the anchor (e.g., IL-6, TNF-α)
    - rho=0.8: Typical for tightly regulated biological pathways
    - rho=0.5: Moderate biological coupling
    - anchor_effect_size="large": Strong disease association (easy to detect)
    - anchor_effect_size="small": Subtle signal (requires large sample size)

Technical details:
    - Cluster contributes n_cluster_features features to the dataset
    - Anchor appears first, followed by (n_cluster_features-1) proxies
    - Only the anchor has predictive power; proxies are correlated distractors
    - Proxies count as additional features beyond n_informative/n_pseudo/n_noise
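
The layout described in the technical details (anchor first, proxies after, signal only on the anchor) can be sketched with the standard library alone. This is an illustrative one-factor construction for the equicorrelated case with rho in [0, 1), not the generator's actual sampler; `sample_cluster` and its arguments are hypothetical names.

```python
import random


def sample_cluster(n_features, rho, label, anchor_class=1, effect_size=1.0, rng=None):
    """Draw one sample (row) for an equicorrelated cluster.

    Each feature is sqrt(rho) * shared + sqrt(1 - rho) * own_noise, which
    gives unit variance and pairwise correlation rho between features.
    """
    if rng is None:
        rng = random.Random()
    shared = rng.gauss(0.0, 1.0)  # latent factor common to the whole cluster
    row = [
        (rho ** 0.5) * shared + ((1.0 - rho) ** 0.5) * rng.gauss(0.0, 1.0)
        for _ in range(n_features)
    ]
    # Assumption: only the anchor (index 0) is shifted for its target class.
    if label == anchor_class:
        row[0] += effect_size
    return row


rng = random.Random(0)
print(sample_cluster(3, 0.8, label=1, effect_size=1.5, rng=rng))
```

Because the shift is applied to index 0 only, the proxies stay correlated with the anchor's noise but carry no class shift of their own, which is what makes them distractors in this sketch.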

See Also:

DatasetConfig : Overall dataset configuration
generate_dataset : Main generation function

__init__(**data)

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters: data (Any)
Return type: None

Methods

`__init__`(**data)	Create a new model by parsing and validating input data from keyword arguments.
`construct`([_fields_set])
`copy`(*[, include, exclude, update, deep])	Returns a copy of the model.
`dict`(*[, include, exclude, by_alias, ...])
`from_orm`(obj)
`get_rho_for_class`(class_idx)	Get correlation strength for a specific class.
`get_structure_for_class`(class_idx)	Get correlation structure for a specific class.
`is_class_specific`()	Check if this cluster uses class-specific correlation.
`json`(*[, include, exclude, by_alias, ...])
`model_construct`([_fields_set])	Creates a new instance of the Model class with validated data.
`model_copy`(*[, update, deep])	Returns a copy of the model.
`model_dump`(*[, mode, include, exclude, ...])	Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.
`model_dump_json`(*[, indent, ensure_ascii, ...])	Generates a JSON representation of the model using Pydantic's to_json method.
`model_json_schema`([by_alias, ref_template, ...])	Generates a JSON schema for a model class.
`model_parametrized_name`(params)	Compute the class name for parametrizations of generic classes.
`model_post_init`(context, /)	Override this method to perform additional initialization after __init__ and model_construct.
`model_rebuild`(*[, force, raise_errors, ...])	Try to rebuild the pydantic-core schema for the model.
`model_validate`(obj, *[, strict, extra, ...])	Validate a pydantic model instance.
`model_validate_json`(json_data, *[, strict, ...])	Validate the given JSON data against the Pydantic model.
`model_validate_strings`(obj, *[, strict, ...])	Validate the given object with string data against the Pydantic model.
`parse_file`(path, *[, content_type, ...])
`parse_obj`(obj)
`parse_raw`(b, *[, content_type, encoding, ...])
`resolve_anchor_effect_size`()	Convert anchor_effect_size to numeric value.
`schema`([by_alias, ref_template])
`schema_json`(*[, by_alias, ref_template])
`summary`()	Return human-readable summary in medical terms.
`update_forward_refs`(**localns)
`validate`(value)

Attributes

`model_computed_fields`
`model_config`	Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
`model_extra`	Get extra fields set during validation.
`model_fields`
`model_fields_set`	Returns the set of fields that have been explicitly set on this model instance.
`n_cluster_features`
`structure`
`rho`
`class_structure`
`class_rho`
`rho_baseline`
`anchor_role`
`anchor_effect_size`
`anchor_class`
`random_state`
`label`

get_rho_for_class(class_idx)

Get correlation strength for a specific class.

Parameters: class_idx (int) – Class label (0, 1, 2, …).
Returns: Correlation strength for this class.
Return type: float

get_structure_for_class(class_idx)

Get correlation structure for a specific class.

Parameters: class_idx (int) – Class label (0, 1, 2, …).
Returns: Structure type for this class.
Return type: Literal['equicorrelated', 'toeplitz']

is_class_specific()

Check if this cluster uses class-specific correlation.

Returns: True if class_rho is set (activates class-specific mode).
Return type: bool
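
The three class-specific accessors can be read as simple lookups with fallbacks. The sketch below uses hypothetical standalone functions rather than the real methods, and the fallback rules are assumptions inferred from the parameter descriptions: classes missing from class_rho fall back to rho_baseline, classes missing from class_structure fall back to structure, and rho applies when class_rho is unset.

```python
def is_class_specific(class_rho=None):
    """Class-specific mode is active only when class_rho is set."""
    return class_rho is not None


def rho_for_class(class_idx, rho, class_rho=None, rho_baseline=0.0):
    """Assumed lookup: per-class value, else baseline, else the global rho."""
    if not is_class_specific(class_rho):
        return rho  # one correlation strength shared by all classes
    return class_rho.get(class_idx, rho_baseline)


def structure_for_class(class_idx, structure, class_structure=None):
    """Assumed lookup: per-class structure, else the cluster-wide structure."""
    if class_structure is None:
        return structure
    return class_structure.get(class_idx, structure)


print(rho_for_class(1, rho=0.8, class_rho={1: 0.9}))  # 0.9 (listed class)
print(rho_for_class(0, rho=0.8, class_rho={1: 0.9}))  # 0.0 (falls back to rho_baseline)
print(rho_for_class(0, rho=0.8))                      # 0.8 (no class-specific mode)
```

If the library resolves missing classes differently, only the two `.get` fallbacks in this sketch would need to change.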

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict. With 'extra': 'forbid', unknown keyword arguments are rejected with a ValidationError.

resolve_anchor_effect_size()

Convert anchor_effect_size to numeric value.

Returns: Numeric effect size for calculations.
    - “small” → 0.5
    - “medium” → 1.0 (default)
    - “large” → 1.5
    - custom float → as specified
Return type: float

Examples:

>>> c = CorrCluster(n_cluster_features=3, rho=0.7, anchor_effect_size="large")
>>> c.resolve_anchor_effect_size()
1.5

>>> c = CorrCluster(n_cluster_features=3, rho=0.7, anchor_effect_size=0.8)
>>> c.resolve_anchor_effect_size()
0.8

>>> c = CorrCluster(n_cluster_features=3, rho=0.7)  # default
>>> c.resolve_anchor_effect_size()
1.0
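
The preset mapping documented for this method can be reproduced as a standalone function. This is a sketch with hypothetical names (`PRESETS`, `resolve_effect_size`), not the library's source; the check for non-positive values is an assumption based on the "any positive value" wording in the parameter description.

```python
PRESETS = {"small": 0.5, "medium": 1.0, "large": 1.5}


def resolve_effect_size(value=None):
    """Map a preset name, a custom float, or None to a numeric effect size."""
    if value is None:
        return PRESETS["medium"]  # documented default
    if isinstance(value, str):
        return PRESETS[value]  # raises KeyError for unknown preset names
    value = float(value)
    if value <= 0:
        # Assumed validation: the docs describe custom values as positive.
        raise ValueError("effect size must be positive")
    return value


print(resolve_effect_size("large"))  # 1.5
print(resolve_effect_size(0.8))      # 0.8
print(resolve_effect_size())         # 1.0
```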

summary()

Return human-readable summary in medical terms.

Returns: Formatted summary of cluster configuration.
Return type: str