Code Documentation

select_feature_subset()

reverse_feature_selection.reverse_random_forests.select_feature_subset(data_df, train_indices, label_column_name='label', meta_data=None)

Selects a subset of features based on the mean out-of-bag (OOB) errors of random forest regressors.

Calculates, for each feature, the mean out-of-bag (OOB) errors of random forest regressors trained with different random seeds, once on training data that includes the label and once on training data without it. It then selects a subset of features using the Mann-Whitney U test, which determines whether the two error distributions differ significantly. The test is configured for the hypothesis that labeled_error_distribution is shifted to the left of unlabeled_error_distribution, that is, that the errors are smaller when the label is included. The p-value is calculated from the OOB scores of the labeled and unlabeled error distributions. It is the probability of observing a difference at least this large if the null hypothesis were true. The smaller the p-value, the stronger the statistical evidence for rejecting the null hypothesis and concluding that the two error distributions differ.
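The one-sided test described above can be sketched with SciPy. This is a minimal illustration, not the package's implementation: the synthetic error values and variable names are assumptions, and `alternative="less"` is used here to express the hypothesis that the labeled errors are shifted to the left of the unlabeled errors.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic OOB errors for one feature (illustrative values only).
rng = np.random.default_rng(0)
labeled_error_distribution = rng.normal(loc=0.8, scale=0.05, size=30)
unlabeled_error_distribution = rng.normal(loc=1.0, scale=0.05, size=30)

# One-sided test: are the labeled errors stochastically smaller
# than the unlabeled errors?
result = mannwhitneyu(
    labeled_error_distribution,
    unlabeled_error_distribution,
    alternative="less",
)
p_value = result.pvalue

# The documentation uses 0.05 as the significance level.
is_significant = p_value <= 0.05
print(f"p-value: {p_value:.3g}, significant: {is_significant}")
```

Each distribution has one value per random seed, so a list of 30 seeds yields two samples of 30 errors per feature.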

Parameters:
  • data_df (DataFrame) – The training data. The data must contain the label and features.

  • train_indices (ndarray) – Indices for the training split. The indices are used to select the training data from the data_df DataFrame.

  • label_column_name (str) – The name of the label column in the training data. Default is “label”.

  • meta_data (dict | None) –

    The metadata related to the dataset and experiment. If meta_data is None, default values for the required keys (“n_cpus”, “random_seeds”, and “train_correlation_threshold”) are used.

    1. n_cpus: The number of available CPUs is required as an integer and defaults to multiprocessing.cpu_count().

    2. random_seeds: A list of different seeds used to initialize the random forests and generate comparable error distributions. Define a list of random seeds for reproducibility. The default is a randomly generated list of 30 integer seeds.

    3. train_correlation_threshold: A float between 0 and 1 giving the absolute correlation threshold for removing features correlated to the target feature from the training data. The higher the threshold, the more features are deselected. The default value is 0.7 and should be adjusted if the results are not satisfactory.
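A meta_data dictionary with the three required keys could be built as follows. This is a sketch under stated assumptions: the fixed generator seed and the seed range are illustrative choices, not values taken from the package.

```python
import multiprocessing
import random

# A seeded generator makes the seed list itself reproducible.
# (Assumption for this sketch; the documented default draws a random list.)
seed_generator = random.Random(42)

meta_data = {
    # Number of CPUs to use, as an integer.
    "n_cpus": multiprocessing.cpu_count(),
    # 30 distinct integer seeds, one per random forest repetition.
    "random_seeds": seed_generator.sample(range(1_000_000), 30),
    # Absolute correlation threshold between 0 and 1.
    "train_correlation_threshold": 0.7,
}
```

Passing an explicit seed list keeps the resulting error distributions comparable between runs.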

Returns:

A pandas DataFrame with the selected features in the “feature_subset_selection” column. The feature_subset_selection column contains the fraction difference based on the mean of the OOB score distributions where the p_value is less than or equal to 0.05. Features with values greater than 0 in this column are selected.

The remaining columns provide additional information:

  • feature_subset_selection_median: Contains the feature subset based on the median fraction difference.

  • unlabeled_errors: Lists the OOB scores for the unlabeled training data.

  • labeled_errors: Lists the OOB scores for the labeled training data.

  • p_value: Contains the p-values from the Mann-Whitney U test.

  • fraction_mean: Shows the fraction difference based on the mean of the distributions.

  • fraction_median: Shows the fraction difference based on the median of the distributions.

  • train_features_count: Indicates the number of uncorrelated features in the training data.

The index of the DataFrame is the feature names.
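The selected feature names can be read from the returned DataFrame by filtering the feature_subset_selection column. The example below uses a hypothetical result with made-up values and only two of the documented columns; it does not call the package.

```python
import pandas as pd

# Hypothetical result shaped like the documented return value.
# All values are invented for illustration.
result_df = pd.DataFrame(
    {
        "feature_subset_selection": [0.12, 0.0, 0.03],
        "p_value": [0.001, 0.40, 0.02],
    },
    index=["feature_a", "feature_b", "feature_c"],
)

# Features with values greater than 0 in this column are selected.
selected = result_df[result_df["feature_subset_selection"] > 0]

# The index holds the feature names.
selected_feature_names = selected.index.tolist()
print(selected_feature_names)
```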

Raises:
  • AssertionError – If the meta_data dictionary does not contain the required keys or if the values are not of the expected type. Also, if the training data does not contain any features or if the label column is not found in the training data.

  • ValueError – If no features uncorrelated to the target feature are found.
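One way to handle the ValueError case is a small wrapper around the call. The helper name `select_or_none` and the stand-in function are hypothetical and exist only for this sketch. In practice, `select_feature_subset` would be passed as `select_fn`.

```python
def select_or_none(select_fn, data_df, train_indices, **kwargs):
    """Call a selection function and return None if it raises ValueError."""
    try:
        return select_fn(data_df, train_indices, **kwargs)
    except ValueError as error:
        # Documented case: no features uncorrelated to the target feature.
        print(f"No feature subset selected: {error}")
        return None


def _always_fails(data_df, train_indices, **kwargs):
    # Stand-in for select_feature_subset, used only to exercise the wrapper.
    raise ValueError("no uncorrelated features found")


outcome = select_or_none(_always_fails, data_df=None, train_indices=None)
```

AssertionError is deliberately not caught here, because it signals invalid input (for example a missing label column) that should be fixed rather than skipped.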

Return type:

DataFrame