Code Documentation

artificial_data_generator

Generator for artificial data.

It can be used as a baseline for benchmarking and for developing new methods, for example by simulating biomarker data from high-throughput experiments.

artificial_data_generator.artificial_data_generator._build_pseudo_classes(params_dict)

Create pseudo-classes by shuffling artificial classes.

The total number of underlying classes equals the total number of artificial classes.

Parameters:

params_dict (Dict[str, Any]) – Parameter dict containing the number of pseudo-class features and the number of artificial classes (see example for parameters of generate_artificial_classification_data()).

Returns:

Randomly shuffled pseudo-class data as a Numpy array of the given shape.

Return type:

ndarray
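The idea of pseudo-classes can be illustrated with a minimal sketch. It is not the library's implementation: the function name, its arguments, the unit scale and the class numbering starting at 1 are assumptions made for illustration. Only the rule that the class modes equal two times the class number and that the samples are shuffled afterwards is taken from the parameter description of generate_artificial_classification_data().

```python
import numpy as np


def build_pseudo_classes_sketch(samples_per_class, number_of_classes,
                                number_of_features, seed=None):
    """Hypothetical helper: normally distributed classes, then row shuffling."""
    rng = np.random.default_rng(seed)
    blocks = [
        # Mode of each underlying class: two times the class number.
        # The scale of 1.0 is an assumption, not taken from the documentation.
        rng.normal(loc=2 * class_number, scale=1.0,
                   size=(samples_per_class, number_of_features))
        for class_number in range(1, number_of_classes + 1)
    ]
    pseudo_classes = np.concatenate(blocks, axis=0)
    # Shuffling the rows removes any relation between samples and class labels.
    rng.shuffle(pseudo_classes, axis=0)
    return pseudo_classes


pseudo = build_pseudo_classes_sketch(10, 3, 2, seed=42)
print(pseudo.shape)
```

The values still form separable groups, but because the rows are shuffled independently of the labels, a classifier should not be able to use them.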

artificial_data_generator.artificial_data_generator._build_single_class(class_params_dict, number_of_relevant_features, rng)

Generate a single class with the given parameters.

Parameters:
  • class_params_dict (Dict[str, Any]) – Parameter dict for the class to generate. The key equals the class label.

    “number_of_samples”: Number of samples to generate.

    “distribution”: Distribution type (“normal” or “lognormal”).

    “mode”: Mean (“centre”) of the distribution.

    “scale”: Standard deviation (spread or “width”) of the distribution. Must be non-negative.

    “correlated_features”: Dict with parameters for correlated features (see _generate_correlated_features).

  • number_of_relevant_features (int) – Number of relevant features to generate.

  • rng – Random number generator.

Returns:

Numpy array with generated class data.

The first column is the label column. The rest of the columns are the features. The number of columns equals the number of relevant features. The number of rows equals the number of samples.

Return type:

ndarray
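As a rough illustration of how a single class with a leading label column could be produced, consider the sketch below. The function name and flat argument list are hypothetical, correlated features are left out, and two points are assumptions: that “mode” and “scale” are passed straight to NumPy's normal and lognormal samplers, and that the label column is added on top of the feature columns.

```python
import numpy as np


def build_single_class_sketch(label, number_of_samples, distribution,
                              mode, scale, number_of_features, rng):
    """Hypothetical helper: one class as an array with the label in column 0."""
    if scale < 0:
        raise ValueError("scale must be non-negative")
    shape = (number_of_samples, number_of_features)
    if distribution == "normal":
        features = rng.normal(loc=mode, scale=scale, size=shape)
    elif distribution == "lognormal":
        # Assumption: mode and scale parametrize the underlying normal.
        features = rng.lognormal(mean=mode, sigma=scale, size=shape)
    else:
        raise ValueError("distribution must be 'normal' or 'lognormal'")
    labels = np.full((number_of_samples, 1), label, dtype=float)
    # Label column first, feature columns after it.
    return np.hstack((labels, features))


rng = np.random.default_rng(42)
class_data = build_single_class_sketch(1, 15, "normal", 3.0, 1.0, 4, rng)
print(class_data.shape)
```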

artificial_data_generator.artificial_data_generator._generate_correlated_cluster(number_of_features, number_of_samples, lower_bound, upper_bound, rng=None)

Generate a cluster of correlated features.

Parameters:
  • number_of_features (int) – Number of columns of generated data.

  • number_of_samples (int) – Number of rows of generated data.

  • lower_bound (float) – Lower bound of the generated correlations.

  • upper_bound (float) – Upper bound of the generated correlations.

  • rng (default_rng) – Random number generator.

Returns:

Numpy array of the given shape with correlated features in the given range.

Return type:

ndarray
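One possible way to obtain such a cluster is sketched below. This is an assumption about the technique, not the documented implementation: a single target correlation is drawn uniformly between the bounds, used for every feature pair, and the samples are drawn from a multivariate normal distribution. The function name is hypothetical.

```python
import numpy as np


def correlated_cluster_sketch(number_of_features, number_of_samples,
                              lower_bound, upper_bound, rng=None):
    """Hypothetical helper: equicorrelated, normally distributed features."""
    rng = np.random.default_rng() if rng is None else rng
    # One target correlation shared by all feature pairs (an assumption).
    target_correlation = rng.uniform(lower_bound, upper_bound)
    covariance = np.full((number_of_features, number_of_features),
                         target_correlation)
    np.fill_diagonal(covariance, 1.0)  # unit variance for every feature
    return rng.multivariate_normal(
        mean=np.zeros(number_of_features),
        cov=covariance,
        size=number_of_samples,
    )


cluster = correlated_cluster_sketch(4, 200, 0.7, 0.9,
                                    rng=np.random.default_rng(0))
print(cluster.shape)
```

With few samples, the empirical correlations can fall outside the requested range even when the target correlation lies inside it, which is one reason a regeneration step may be needed.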

artificial_data_generator.artificial_data_generator._generate_correlated_features(class_params_dict, rng)

Generate correlated features for a given class.

Parameters:
  • class_params_dict (Dict[str, Any]) – Parameter dict for the class to generate. The key equals the class label. “correlated_features”: Dict with parameters for correlated features (see example for parameters of generate_artificial_classification_data()).

  • rng – Random number generator.

Returns:

Numpy array with generated correlated features.

Return type:

ndarray

artificial_data_generator.artificial_data_generator._generate_dataframe(data_np, params_dict)

Generate semantic names for the columns of the given DataFrame.

Parameters:
  • data_np (ndarray) – Numpy array with generated data.

  • params_dict (Dict[str, Any]) – Parameter dict including the number of features per class, the number of pseudo-class features and the number of random features (see example for parameters of generate_artificial_classification_data()).

Returns:

DataFrame with meaningfully named columns:
  • label for the labels

  • bm for artificial class feature

  • pseudo for pseudo-class feature

  • random for random data

Return type:

DataFrame
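The column naming can be sketched as follows. Only the prefixes label, bm, pseudo and random come from the description above. The numbering scheme (for example bm_0), the function name and the explicit count arguments are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def name_columns_sketch(data_np, number_of_class_features,
                        number_of_pseudo_features, number_of_random_features):
    """Hypothetical helper: wrap an array in a DataFrame with named columns."""
    names = ["label"]
    names += [f"bm_{i}" for i in range(number_of_class_features)]
    names += [f"pseudo_{i}" for i in range(number_of_pseudo_features)]
    names += [f"random_{i}" for i in range(number_of_random_features)]
    if len(names) != data_np.shape[1]:
        raise ValueError("column count does not match the feature counts")
    return pd.DataFrame(data_np, columns=names)


data_df = name_columns_sketch(np.zeros((3, 6)), 2, 1, 2)
print(list(data_df.columns))
```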

artificial_data_generator.artificial_data_generator._generate_normal_distributed_class(class_params_dict, number_of_relevant_features, rng)

Generate normally distributed class data.

Parameters:
  • class_params_dict (Dict[str, Any]) – Parameter dict for the class to generate. The key equals the class label.

    “number_of_samples”: Number of samples to generate.

    “distribution”: Distribution type (“normal” or “lognormal”).

    “mode”: Mean (“centre”) of the distribution.

    “scale”: Standard deviation (spread or “width”) of the distribution. Must be non-negative.

    “correlated_features”: Dict with parameters for correlated features (see _generate_correlated_features).

  • number_of_relevant_features (int) – Number of relevant features to generate.

  • rng – Random number generator.

Returns:

Numpy array with generated class data.

Return type:

ndarray

artificial_data_generator.artificial_data_generator._repeat_correlation_cluster_generation(correlated_feature_cluster, cluster_params_dict, rng)

Repeat the random generation of correlated features until the lower bound is reached.

Parameters:
  • correlated_feature_cluster – Numpy array with generated correlated features.

  • cluster_params_dict – Parameter dict for the cluster to generate.

    “correlation_lower_bound”: Lower bound for the correlation of each cluster of correlated features within a normally distributed class.

    “correlation_upper_bound”: Upper bound for the correlation of each cluster of correlated features within a normally distributed class.

  • rng – Random number generator.

Returns:

Numpy array with generated correlated features.

The number of columns equals the number of features. The number of rows equals the number of samples. The values are normally distributed.

Return type:

ndarray
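The regeneration step can be pictured as a simple retry loop. The sketch below is an assumption about how such a loop might look: the helper names, the attempt limit and the use of the smallest pairwise sample correlation as acceptance criterion are all hypothetical and not taken from the library.

```python
import numpy as np


def min_pairwise_correlation(data):
    """Smallest off-diagonal entry of the sample correlation matrix."""
    correlation = np.corrcoef(data, rowvar=False)
    off_diagonal = ~np.eye(correlation.shape[0], dtype=bool)
    return correlation[off_diagonal].min()


def regenerate_until_bound(draw_cluster, lower_bound, max_attempts=100):
    """Redraw a cluster until every feature pair reaches the lower bound."""
    for _ in range(max_attempts):
        cluster = draw_cluster()
        if min_pairwise_correlation(cluster) >= lower_bound:
            return cluster
    raise RuntimeError("lower bound not reached within max_attempts")


rng = np.random.default_rng(42)


def draw_cluster():
    # Illustrative generator: a shared signal plus small independent noise.
    shared = rng.normal(size=(15, 1))
    noise = rng.normal(scale=0.3, size=(15, 4))
    return shared + noise


cluster = regenerate_until_bound(draw_cluster, lower_bound=0.7)
print(cluster.shape)
```

Limiting the number of attempts avoids an endless loop when the requested bound cannot realistically be reached for the given number of samples.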

artificial_data_generator.artificial_data_generator._save(data_df, params_dict)

Save the generated data and parameters to the given paths.

Parameters:
  • data_df – DataFrame with generated data.

  • params_dict – Parameter dict including the paths to save the data and parameters.

artificial_data_generator.artificial_data_generator._shuffle_features(data_df, params_dict)

Shuffle the features of the given DataFrame.

Parameters:
  • data_df (DataFrame) – DataFrame with generated data.

  • params_dict (Dict[str, Any]) – Parameter dict including the number of features per class, the number of pseudo-class features and the number of random features (see example for parameters of generate_artificial_classification_data()).

Returns:

DataFrame with shuffled features.

Return type:

DataFrame
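Feature shuffling can be sketched as a permutation of the column order. The function name and the seed argument are hypothetical, and keeping the label column in first position is an assumption rather than documented behaviour.

```python
import numpy as np
import pandas as pd


def shuffle_feature_columns_sketch(data_df, seed=None):
    """Hypothetical helper: permute feature columns, keep 'label' first."""
    rng = np.random.default_rng(seed)
    feature_columns = [name for name in data_df.columns if name != "label"]
    shuffled = rng.permutation(feature_columns).tolist()
    # Reordering columns changes their position only, not their values.
    return data_df[["label"] + shuffled]


example_df = pd.DataFrame({
    "label": [1, 2],
    "bm_0": [0.1, 0.2],
    "pseudo_0": [0.3, 0.4],
    "random_0": [0.5, 0.6],
})
print(list(shuffle_feature_columns_sketch(example_df, seed=1).columns))
```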

artificial_data_generator.artificial_data_generator.generate_artificial_classification_data(params_dict)

Generate artificial classification (e.g. biomarker) data.

Parameters:

params_dict (Dict[str, Any]) – Parameters for the data to generate (see example below).

Returns:

Generated artificial data.

Return type:

DataFrame

Example:

    params_dict = {
        "number_of_relevant_features": 12,
        "number_of_pseudo_class_features": 2,
        "random_features": {
            "number_of_features": 10,
            "distribution": "lognormal",
            "scale": 1,
            "mode": 0,
        },
        "classes": {
            1: {
                "number_of_samples": 15,
                "distribution": "lognormal",
                "mode": 3,
                "scale": 1,
                "correlated_features": {
                    1: {"number_of_features": 4, "correlation_lower_bound": 0.7,
                        "correlation_upper_bound": 1},
                    2: {"number_of_features": 4, "correlation_lower_bound": 0.7,
                        "correlation_upper_bound": 1},
                    3: {"number_of_features": 4, "correlation_lower_bound": 0.7,
                        "correlation_upper_bound": 1},
                },
            },
            2: {"number_of_samples": 15, "distribution": "normal", "mode": 1,
                "scale": 2, "correlated_features": {}},
            3: {"number_of_samples": 15, "distribution": "normal", "mode": -10,
                "scale": 2, "correlated_features": {}},
        },
        "seed": 42,
        "path_to_save_csv": "your_path_to_save.csv",
        "path_to_save_feather": "",
        "path_to_save_meta_data": "your_path_to_save_params_dict.yaml",
        "shuffle_features": False,
    }

Elements of the parameter dict:

“number_of_relevant_features”: Total number of features (columns) to generate for each artificial class.

“number_of_pseudo_class_features”: Number of pseudo-class features. The underlying classes correspond to the selected number of classes and follow a normal distribution. The shifted modes of the generated artificial classes equal two times the class number. All samples of the generated classes are randomly shuffled and therefore have no relation to any class label.

“random_features”: Parameters for the randomly generated features.

  • “number_of_features”: Number of randomly generated features.

  • “distribution”: “lognormal” or “normal”.

  • “scale”: Standard deviation (spread or “width”) of the distribution. Must be non-negative.

  • “mode”: Mean (“centre”) of the distribution.

“classes”: Parameter dicts for each class to generate. The key equals the class label.

  • “number_of_samples”: Number of samples to generate.

  • “distribution”: “lognormal” or “normal”.

  • “mode”: Mean (“centre”) of the distribution.

  • “scale”: Standard deviation (spread or “width”) of the distribution. Must be non-negative.

  • “correlated_features”: Parameter dicts for each cluster of correlated features to generate. The key equals the cluster number. To generate no clusters, insert an empty dict.

    “number_of_features”: Number of correlated features within a cluster.

    “correlation_lower_bound”: Lower bound for the correlation of each cluster of correlated features within a normally distributed class. Default is 0.7.

    “correlation_upper_bound”: Upper bound for the correlation of each cluster of correlated features within a normally distributed class. Default is 1.

“seed”: Seed for reproducibility. If None, a random seed is used.

“path_to_save_csv”: Path for saving the generated data as a CSV file, for example “your_path_to_save.csv”.

“path_to_save_feather”: Path for saving the generated data as a Feather file, for example “your_path_to_save.feather”.

“path_to_save_meta_data”: Path for saving the parameter dict, for example “your_path_to_save_params_dict.yaml”.

“shuffle_features”: Whether the generated features should be shuffled.
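The constraints stated for the parameter dict can be checked before data generation. The sketch below is a hypothetical helper, not part of the documented API. It checks the allowed distribution names and the non-negative scale, and it additionally assumes that a lower correlation bound should not exceed the upper bound. The defaults 0.7 and 1 are those given in the description.

```python
ALLOWED_DISTRIBUTIONS = {"normal", "lognormal"}


def _check_distribution(settings, where, problems):
    """Collect problems with the distribution settings of one entry."""
    if settings.get("distribution") not in ALLOWED_DISTRIBUTIONS:
        problems.append(f"{where}: distribution must be 'normal' or 'lognormal'")
    if settings.get("scale", 0) < 0:
        problems.append(f"{where}: scale must be non-negative")


def find_param_problems(params_dict):
    """Hypothetical validator: return a list of human-readable problems."""
    problems = []
    _check_distribution(params_dict.get("random_features", {}),
                        "random_features", problems)
    for label, class_params in params_dict.get("classes", {}).items():
        _check_distribution(class_params, f"class {label}", problems)
        clusters = class_params.get("correlated_features", {})
        for number, cluster in clusters.items():
            lower = cluster.get("correlation_lower_bound", 0.7)
            upper = cluster.get("correlation_upper_bound", 1)
            # Assumption: the bounds must describe a non-empty interval.
            if lower > upper:
                problems.append(
                    f"class {label}, cluster {number}: lower bound exceeds upper bound"
                )
    return problems


example = {
    "random_features": {"distribution": "lognormal", "scale": 1},
    "classes": {1: {"distribution": "normal", "scale": -2,
                    "correlated_features": {}}},
}
print(find_param_problems(example))
```

Returning a list of problems instead of raising on the first one makes it easier to correct a large parameter dict in a single pass.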