1. Library Overview
The Feature Probability-based Estimation (FPE) library ranks the features of a labeled dataset by computing a probability score for each one. It is intended for feature selection: removing the less significant features can simplify a machine learning model and may improve its performance.
Dataset Conditions
- Data Consistency: Each feature must contain either only strings or only numeric values. Mixing types within a feature is not allowed.
- No Missing Data: Feature columns must not contain empty cells (e.g., NaN or blanks).
- Target Column: The last column must contain the target (label) for the dataset. The library works on labeled datasets only.
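The conditions above can be checked before calling the library. The following is a minimal sketch that uses only pandas; the helper name check_fpe_conditions is hypothetical and is not part of fpe-lib, and the library may perform its own checks differently.

```python
import pandas as pd


def check_fpe_conditions(df):
    """Return a list of problems found; an empty list means no issue was detected."""
    problems = []
    # Every column except the last is treated as a feature.
    for name in df.columns[:-1]:
        column = df[name]
        if column.isna().any():
            problems.append(f"{name}: contains missing values")
        values = column.dropna()
        all_numeric = pd.api.types.is_numeric_dtype(values)
        all_strings = values.map(lambda v: isinstance(v, str)).all()
        if not (all_numeric or all_strings):
            problems.append(f"{name}: mixes strings and numbers")
    return problems


good = pd.DataFrame({"f1": [1, 2, 3], "f2": ["A", "B", "A"], "target": [1, 0, 1]})
bad = pd.DataFrame({"f1": [1, "x", 3], "f2": ["A", None, "B"], "target": [1, 0, 1]})

print(check_fpe_conditions(good))
print(check_fpe_conditions(bad))
```

Returning a list of messages rather than raising on the first problem makes it easier to fix every offending column in one pass.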
2. Installation
Install the FPE library using pip:
pip install fpe-lib==0.1.2
3. Usage
3.1 Importing the Library
from fpe.fpe import fpefs
3.2 Input Dataset
- Features (X): All columns except the last one are considered features.
- Target (y): The last column is treated as the target variable.
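Because the split is positional, a dataset whose label is not in the last position needs its columns reordered first. A minimal pandas sketch, assuming a label column named "Target" (the name is only this example's choice, not something the library requires):

```python
import pandas as pd

# Example frame where the label happens to be the first column.
raw = pd.DataFrame({
    "Target": [1, 0, 1, 0, 1],
    "Feature1": [1, 2, 3, 4, 5],
    "Feature2": ["A", "B", "A", "B", "C"],
})

# Move the label to the end so it is read as the target.
feature_cols = [c for c in raw.columns if c != "Target"]
data = raw[feature_cols + ["Target"]]

# The same positional split described above, written out by hand.
X = data.iloc[:, :-1]  # all columns except the last
y = data.iloc[:, -1]   # the last column

print(list(X.columns))
print(y.name)
```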
3.3 Example Usage
import pandas as pd
from fpe.fpe import fpefs
# Sample dataset
data = pd.DataFrame({
    'Feature1': [1, 2, 3, 4, 5],
    'Feature2': ['A', 'B', 'A', 'B', 'C'],
    'Target': [1, 0, 1, 0, 1]
})
# Apply FPEFS
result = fpefs(data)
# View results
print(result)
Output:
    Feature  Probability
0  Feature1         0.75
1  Feature2         0.75
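Since the result is an ordinary DataFrame, it can be filtered with standard pandas operations to pick the features to keep. The sketch below uses a hand-built stand-in for the output shown above instead of calling fpefs. It assumes that a higher probability marks a more useful feature, and the 0.5 threshold is an arbitrary illustration, not a value recommended by the library.

```python
import pandas as pd

# Stand-in for the DataFrame returned by fpefs(data) in the example above.
result = pd.DataFrame({
    "Feature": ["Feature1", "Feature2"],
    "Probability": [0.75, 0.75],
})

threshold = 0.5  # hypothetical cut-off; tune it for your own data

# Rank features from highest to lowest score, then keep those at or above the cut-off.
ranked = result.sort_values("Probability", ascending=False)
selected = ranked.loc[ranked["Probability"] >= threshold, "Feature"].tolist()

print(selected)
```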
4. Algorithmic Working
4.1 Steps of the Algorithm
- Initialization: Loads the dataset and splits it into features (X) and target (y), ensuring proper data types and structure.
- Feature Normalization: Applies Min-Max scaling to numeric features, normalizing them to [0, 1].
- Group Rows by Unique Values: For each feature, groups row indices by unique feature value and identifies the target classes that occur in each group.
- Analyze Class Coverage: Evaluates the relationship between feature values and target classes to compute partial coverage.
- Compute Feature Probabilities: Assigns probability scores to features based on class separation ability.
- Return Probabilities: Outputs a DataFrame with feature names and their corresponding probabilities.
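The normalization and grouping steps can be illustrated in plain Python. This is a sketch of the general techniques named above, not the library's internal code: the function names are hypothetical, and mapping a constant column to zeros is an assumption made here. The coverage and probability formulas are not specified in this document, so the sketch stops before them.

```python
from collections import defaultdict


def min_max_scale(values):
    """Rescale numbers to [0, 1]; a constant column maps to all zeros (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def classes_per_value(feature, target):
    """Map each distinct feature value to the set of target classes seen with it."""
    seen = defaultdict(set)
    for value, label in zip(feature, target):
        seen[value].add(label)
    return dict(seen)


feature1 = min_max_scale([1, 2, 3, 4, 5])
feature2 = ["A", "B", "A", "B", "C"]
target = [1, 0, 1, 0, 1]

print(feature1)                             # [0.0, 0.25, 0.5, 0.75, 1.0]
print(classes_per_value(feature2, target))  # {'A': {1}, 'B': {0}, 'C': {1}}
```

In this sample, every value of Feature2 co-occurs with exactly one target class, which is the kind of class separation the scoring step is described as rewarding.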