biomedical_data_generator.effects.batch

Batch effect simulation for synthetic biomedical datasets.

This module provides functionality to add realistic batch effects (site differences, instrument variations, recruitment cohorts) to generated datasets.

Public API:
High-level (recommended):
  • apply_batch_effects_from_config: Orchestrate assignment + effects

  • generate_batch_assignments: Unified batch assignment (random or confounded)

  • apply_batch_effects: Apply systematic feature differences

Low-level (for experimentation):
  • random_batch_assignment: Create independent batch labels

  • confounded_batch_assignment: Create class-correlated batch labels

Typical usage (high-level):
>>> from biomedical_data_generator.effects.batch import apply_batch_effects_from_config
>>> X_batch, batches, batch_effects = apply_batch_effects_from_config(
...     x=X, y=y, batch_config=cfg.batch_effects,
...     informative_indices=inf_idx, rng=rng
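... )

Educational usage (low-level):
>>> import numpy as np
>>> from biomedical_data_generator.effects.batch import (
...     confounded_batch_assignment,
...     random_batch_assignment,
... )
>>> # Experiment with random vs. confounded assignments
>>> rng = np.random.default_rng(42)
>>> y = np.array([0] * 50 + [1] * 50)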
>>> random_batches = random_batch_assignment(100, n_batches=3, rng=rng)
>>> confounded_batches = confounded_batch_assignment(
...     y, n_batches=3, confounding_with_class=0.8, rng=rng
... )

Reproducibility:

All functions accept np.random.Generator objects to ensure proper state propagation. The top-level code (generator.py) creates ONE generator and passes it through the call chain. This ensures:

  • No duplicate random sequences

  • Perfect reproducibility from config.random_state

  • Proper state evolution across all operations
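The single-generator pattern described above can be illustrated with plain NumPy. This is a minimal sketch, not the library's code: assign_step, effect_step and run_pipeline are hypothetical stand-ins for any stages that consume randomness.

```python
import numpy as np


def assign_step(n_samples, n_batches, rng):
    # Hypothetical stage 1: draw batch labels from the shared generator.
    return rng.integers(0, n_batches, size=n_samples)


def effect_step(n_batches, rng):
    # Hypothetical stage 2: draw one shift per batch from the same generator.
    return rng.normal(0.0, 0.5, size=n_batches)


def run_pipeline(seed):
    # Create ONE generator at the top level and pass it down the call chain.
    rng = np.random.default_rng(seed)
    labels = assign_step(10, 3, rng)
    shifts = effect_step(3, rng)
    return labels, shifts


labels_a, shifts_a = run_pipeline(42)
labels_b, shifts_b = run_pipeline(42)

# Same seed and same call order give identical results.
same_labels = np.array_equal(labels_a, labels_b)
same_shifts = np.array_equal(shifts_a, shifts_b)

# Anti-pattern: re-seeding inside each stage repeats the same random stream.
dup_1 = np.random.default_rng(42).normal(size=3)
dup_2 = np.random.default_rng(42).normal(size=3)
duplicated = np.array_equal(dup_1, dup_2)
```

Because each stage advances the shared generator, later stages never replay the numbers an earlier stage already consumed, while the whole run stays reproducible from the single seed.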

Functions

apply_batch_effects(X, batch_assignments, rng)

Apply batch effects to a feature matrix.

apply_batch_effects_from_config(x, y, ...)

Apply batch effects based on a configuration object.

confounded_batch_assignment(class_labels, ...)

Create batch assignments confounded with class labels.

generate_batch_assignments(n_samples, ...[, ...])

Generate batch assignments for samples.

random_batch_assignment(n_samples, ...[, ...])

Create random batch assignments without confounding.

biomedical_data_generator.effects.batch.apply_batch_effects(X, batch_assignments, rng, effect_type='additive', effect_strength=0.5, affected_features='all', effect_granularity='per_feature')[source]

Apply batch effects to a feature matrix.

Apply systematic batch effects to specified features in the dataset, simulating site differences or instrument variations.

Parameters:
  • X (DataFrame | ndarray[tuple[Any, ...], dtype[float64]]) – Feature matrix (DataFrame or array).

  • batch_assignments (ndarray[tuple[Any, ...], dtype[int64]]) – Array of batch assignments per sample.

  • rng (Generator) – NumPy random number generator used to sample batch effects.

  • effect_type (Literal['additive', 'multiplicative']) –

    Type of batch effect to apply:

    • "additive": Adds a batch-specific shift to affected features.

    • "multiplicative": Multiplies affected features by a batch-specific scaling factor around 1.0.

  • effect_strength (float) – Standard deviation of the batch-effect distribution. Larger values generate stronger perturbations. Must be non-negative.

  • affected_features (Sequence[int] | Literal['all']) –

    Defines which features are affected:

    • "all": Apply batch effects to all features.

    • Sequence[int]: Apply effects only to the listed feature indices.

  • effect_granularity (Literal['per_feature', 'scalar']) –

    Controls whether effects vary across features:

    • "per_feature" (default): Draws effects with shape (n_batches, n_affected_features), so features within the same batch differ in their batch-specific effect.

    • "scalar": Draws a single scalar per batch and applies it uniformly across all affected features. This corresponds to a global per-batch offset (additive) or global scaling factor (multiplicative).

Returns:

A tuple (X_affected, batch_effects) where:

  • X_affected: The feature matrix with batch effects applied. The output type matches the input type (DataFrame or ndarray).

  • batch_effects: An array of length n_batches summarizing the effect applied in each batch:

    • For effect_granularity="scalar", these are the exact additive shifts (additive mode) or multiplicative factors minus 1.0 (multiplicative mode) drawn for each batch.

    • For "per_feature", these values are the mean additive effect or mean multiplicative deviation from 1.0 across all affected features in each batch. This provides a compact summary even though the full per-feature effects vary within batches.

Return type:

Tuple[pd.DataFrame | np.ndarray, np.ndarray]

Raises:
  • ValueError – If batch_assignments has the wrong shape, contains negative values, or effect_granularity is not one of the allowed strings.

  • IndexError – If affected_features contains indices outside the valid feature range.

Notes

  • The generator state rng is advanced during sampling.

  • Additive effects are sampled from Normal(0, effect_strength).

  • Multiplicative effects are sampled from 1 + Normal(0, effect_strength).

  • For DataFrame input, column names and index are preserved in the output.
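The sampling rules in these notes can be sketched in plain NumPy. The function below is a hypothetical re-implementation for illustration only: the name sketch_batch_effects, the choice to affect all features, and the way n_batches is derived from the largest label are assumptions, not the library's code.

```python
import numpy as np


def sketch_batch_effects(X, batches, rng, effect_type="additive",
                         effect_strength=0.5, granularity="per_feature"):
    # Hypothetical illustration; applies effects to all features.
    X = np.asarray(X, dtype=float)
    n_batches = int(batches.max()) + 1  # assumption: labels are 0..n_batches-1
    n_features = X.shape[1]
    if granularity == "scalar":
        # One draw per batch, repeated across every feature.
        draws = rng.normal(0.0, effect_strength, size=(n_batches, 1))
        draws = np.repeat(draws, n_features, axis=1)
    else:
        # One draw per (batch, feature) pair.
        draws = rng.normal(0.0, effect_strength, size=(n_batches, n_features))
    per_sample = draws[batches]  # shape (n_samples, n_features)
    if effect_type == "additive":
        X_out = X + per_sample          # shift ~ Normal(0, effect_strength)
    else:
        X_out = X * (1.0 + per_sample)  # factor ~ 1 + Normal(0, effect_strength)
    # Summary per batch: mean deviation across features
    # (equals the single draw when granularity="scalar").
    return X_out, draws.mean(axis=1)


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
batches = np.array([0, 0, 1, 1, 2, 2])
X_add, summary_add = sketch_batch_effects(X, batches, rng, granularity="scalar")
X_mul, summary_mul = sketch_batch_effects(
    X, batches, rng, effect_type="multiplicative"
)
```

In the scalar additive case every feature of a sample moves by the same amount, so X_add - X reproduces summary_add[batches] in each column.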

Examples

>>> import numpy as np
>>> rng = np.random.default_rng(42)
>>> x = rng.normal(size=(100, 5))
>>> batches = rng.integers(0, 3, size=100)
>>>
>>> X_batch, batch_effects = apply_batch_effects(
...     x, batches, rng,
...     effect_type="additive",
...     effect_strength=0.5,
...     affected_features=[0, 2],
...     effect_granularity="scalar",
... )
>>> X_batch.shape
(100, 5)
>>> batch_effects.shape
(3,)

biomedical_data_generator.effects.batch.apply_batch_effects_from_config(x, y, batch_config, rng)[source]

Apply batch effects based on a configuration object.

High-level orchestration function that handles batch assignment and effect application in a single call.

Parameters:
  • x (pd.DataFrame | NDArray[np.float64]) – Feature matrix (DataFrame or array).

  • y (NDArray[np.int_] | ArrayLike) – Class labels (for potential confounding). May be any array-like accepted by np.unique (e.g. integers or strings).

  • batch_config (Any) –

    Configuration object with attributes:

    • n_batches: Number of batches (>= 1)

    • proportions: Optional batch size proportions

    • confounding_with_class: Degree of class-batch correlation in [0, 1]

    • effect_type: "additive" or "multiplicative"

    • effect_strength: Magnitude of batch effects

    • affected_features: "all", "informative", or list of indices

  • rng (np.random.Generator) – NumPy random generator (for reproducibility).

Returns:

  • X_affected: Feature matrix with batch effects applied

  • batch_labels: Array of batch assignments per sample

  • batch_effects: Random effects drawn per batch

Return type:

Tuple of (X_affected, batch_labels, batch_effects)

Notes

  • Automatically handles confounding when confounding_with_class > 0.
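Since batch_config is typed as Any, any object exposing the listed attributes should work in principle. The snippet below is a sketch under that assumption: it uses types.SimpleNamespace as a stand-in for the real configuration class (which is not shown here), and check_batch_config is a hypothetical helper that only mirrors the documented constraints.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the configuration object; attribute names follow
# the parameter description above, the values are arbitrary.
batch_config = SimpleNamespace(
    n_batches=3,
    proportions=None,            # None = equal split
    confounding_with_class=0.5,  # must lie in [0, 1]
    effect_type="additive",      # or "multiplicative"
    effect_strength=0.5,
    affected_features="all",     # or "informative", or a list of indices
)


def check_batch_config(cfg):
    # Illustrative sanity checks mirroring the documented constraints.
    if cfg.n_batches < 1:
        raise ValueError("n_batches must be >= 1")
    if not 0.0 <= cfg.confounding_with_class <= 1.0:
        raise ValueError("confounding_with_class must be in [0, 1]")
    if cfg.effect_type not in ("additive", "multiplicative"):
        raise ValueError("effect_type must be 'additive' or 'multiplicative'")
    return True


config_ok = check_batch_config(batch_config)
```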

biomedical_data_generator.effects.batch.confounded_batch_assignment(class_labels, n_batches, confounding_with_class, rng, proportions=None)[source]

Create batch assignments confounded with class labels.

Direct function for creating batch labels that are correlated with class membership, simulating recruitment bias or site selection effects.

For most use cases, prefer generate_batch_assignments() which provides a unified interface for both random and confounded assignments.

Strategy:
  • Use np.unique to derive K distinct classes (works for int and str).

  • For each class index k (0..K-1), preferentially assign to a “preferred” batch (k % n_batches).

  • Higher confounding_with_class = stronger preference for the preferred batch.

  • Uses redistribution: probability mass is moved from other batches to the preferred batch.

  • At confounding_with_class=1.0: strong preference (all samples of a class tend to fall into its preferred batch, subject to randomness).
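The redistribution step can be made concrete with a small probability calculation. The rule below is an assumption chosen to be consistent with the 75/25 figure quoted in the notes for two equal-sized batches; confounded_probs is a hypothetical helper, and the library's exact formula may differ (in particular at confounding_with_class=1.0, where this rule puts all mass on the preferred batch).

```python
import numpy as np


def confounded_probs(base, preferred, confounding):
    # Assumed rule: move a fraction `confounding` of every other batch's
    # probability mass onto the preferred batch.
    base = np.asarray(base, dtype=float)
    base = base / base.sum()
    probs = base * (1.0 - confounding)
    probs[preferred] = base[preferred] + confounding * (1.0 - base[preferred])
    return probs


n_batches = 2
base = np.full(n_batches, 1.0 / n_batches)

# Class index k prefers batch k % n_batches.
p_class0 = confounded_probs(base, preferred=0 % n_batches, confounding=0.5)
p_class1 = confounded_probs(base, preferred=1 % n_batches, confounding=0.5)
# p_class0 -> [0.75, 0.25], p_class1 -> [0.25, 0.75]
```

With confounding=0.0 the rule returns the base proportions unchanged, which corresponds to the random-assignment case.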

Parameters:
  • class_labels (ArrayLike) – Class membership for each sample. May be integer or string labels; only relative grouping matters.

  • n_batches (int) – Number of distinct batches (must be >= 2).

  • confounding_with_class (float) – Degree of correlation (0.0 = random assignment, 1.0 = strong preference), must be in [0, 1].

  • rng (np.random.Generator) – NumPy random generator (for reproducibility).

  • proportions (Sequence[float] | None) – Relative batch sizes (None = equal split).

Returns:

Array of batch assignments (integers 0 to n_batches-1).

Return type:

NDArray[np.int_]

Examples

>>> rng = np.random.default_rng(42)
>>> y = np.array(["control"] * 50 + ["case"] * 50)
>>> batches = confounded_batch_assignment(
...     y, n_batches=2, confounding_with_class=0.8, rng=rng
... )
>>> # Assuming class indices follow np.unique's sorted order
>>> # ("case" = 0, "control" = 1), most "case" samples will be in
>>> # batch 0 and most "control" samples in batch 1.

Notes

  • confounding_with_class=0.0 would reduce to random assignment; use generate_batch_assignments for that branch.

  • For 2 equal-sized batches, confounding_with_class=0.5 yields approx. a 75/25 preference towards the class’s preferred batch.

biomedical_data_generator.effects.batch.generate_batch_assignments(n_samples, n_batches, rng, proportions=None, class_labels=None, confounding_with_class=0.0)[source]

Generate batch assignments for samples.

Unified entry point for creating batch labels. Automatically handles both random and confounded assignments based on the confounding_with_class parameter.

This is the recommended function for batch assignment. For direct access to the underlying implementations, see:

  • random_batch_assignment() for independent assignments

  • confounded_batch_assignment() for class-correlated assignments

Parameters:
  • n_samples (int) – Number of samples.

  • n_batches (int) – Number of batches (must be >= 1).

  • rng (np.random.Generator) – NumPy random generator (for reproducibility).

  • proportions (Sequence[float] | None) – Relative batch sizes (None = equal split).

  • class_labels (ArrayLike | None) – Optional class labels for confounding. May be any array-like of shape (n_samples,) with labels that np.unique can handle (e.g. integers, strings). Only used if confounding_with_class > 0.

  • confounding_with_class (float) – Correlation with class (0.0 = random, 1.0 = strong preference), must be in [0, 1].

Returns:

Array of batch assignments (integers 0 to n_batches-1).

Return type:

NDArray[np.int_]

Examples

>>> rng = np.random.default_rng(42)
>>>
>>> # Random assignment
>>> batches = generate_batch_assignments(100, n_batches=3, rng=rng)
>>> np.bincount(batches)
array([33, 33, 34])
>>>
>>> # Confounded with class (simulate recruitment bias)
>>> y = np.array([0]*50 + [1]*50)
>>> batches = generate_batch_assignments(
...     n_samples=100,
...     n_batches=2,
...     rng=rng,
...     class_labels=y,
...     confounding_with_class=0.8,
... )

Notes

  • confounding_with_class=0.0: Random assignment (no bias)

  • confounding_with_class=1.0: Strong preference (each class → preferred batch)

  • For 2 equal-sized batches, confounding_with_class=0.5 yields approx. a 75/25 class split towards the preferred batch.
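One way to check how strongly a set of batch labels is confounded with class is a class-by-batch contingency table. The sketch below does not call the library: it builds stand-in labels in which each class lands in its preferred batch with probability 0.75 (the split quoted above for confounding_with_class=0.5) and then tabulates them with plain NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array(["control"] * 500 + ["case"] * 500)

# Stand-in batch labels (assumption, not library output): each class goes to
# its preferred batch (class index % 2) with probability 0.75.
classes, class_idx = np.unique(y, return_inverse=True)
preferred = class_idx % 2
in_preferred = rng.random(y.size) < 0.75
batches = np.where(in_preferred, preferred, 1 - preferred)

# Contingency table: rows = classes (sorted by np.unique), columns = batches.
table = np.zeros((classes.size, 2), dtype=int)
np.add.at(table, (class_idx, batches), 1)

# Fraction of each class that ended up in each batch.
row_fractions = table / table.sum(axis=1, keepdims=True)
```

Rows whose fractions are far from the overall batch proportions indicate class-batch confounding; with random assignment every row should sit close to those proportions.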

biomedical_data_generator.effects.batch.random_batch_assignment(n_samples, n_batches, rng, proportions=None)[source]

Create random batch assignments without confounding.

Direct function for creating batch labels with specified proportions but no correlation with class labels.

For most use cases, prefer generate_batch_assignments() which provides a unified interface for both random and confounded assignments.

Parameters:
  • n_samples (int) – Number of samples to assign.

  • n_batches (int) – Number of distinct batches (must be >= 1).

  • rng (Generator) – NumPy random generator (for reproducibility).

  • proportions (Sequence[float] | None) – Relative batch sizes (None = equal split).

Returns:

Array of batch assignments (integers 0 to n_batches-1).

Return type:

ndarray[tuple[Any, ...], dtype[int64]]

Examples

>>> rng = np.random.default_rng(42)
>>> batches = random_batch_assignment(100, n_batches=3, rng=rng)
>>> np.bincount(batches)
array([33, 33, 34])
>>>
>>> # Unequal proportions
>>> batches = random_batch_assignment(
...     100, n_batches=3, rng=rng, proportions=[0.5, 0.3, 0.2]
... )

Notes

  • Generator state advances during shuffling.

  • Multiple calls with the same generator produce different results.
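The notes above suggest fixed batch sizes combined with a shuffle. A minimal sketch of that idea follows; sketch_random_assignment is a hypothetical function, and giving the rounding remainder to the last batch is an assumption chosen so that the equal-split case matches the [33, 33, 34] counts shown in the example.

```python
import numpy as np


def sketch_random_assignment(n_samples, n_batches, rng, proportions=None):
    # Hypothetical illustration: fixed batch sizes, random sample order.
    if proportions is None:
        proportions = np.full(n_batches, 1.0 / n_batches)
    p = np.asarray(proportions, dtype=float)
    p = p / p.sum()  # treat proportions as relative sizes
    counts = np.floor(p * n_samples).astype(int)
    counts[-1] += n_samples - counts.sum()  # remainder goes to the last batch
    labels = np.repeat(np.arange(n_batches), counts)
    rng.shuffle(labels)  # in-place shuffle; advances the generator state
    return labels


rng = np.random.default_rng(42)
first = sketch_random_assignment(100, 3, rng)
second = sketch_random_assignment(100, 3, rng)
sizes = np.bincount(first, minlength=3)
# sizes -> [33, 33, 34]
```

The batch sizes are deterministic, while first and second differ in order because the second call continues from the generator state left by the first.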