Top 4 Reasons to Apply Feature Selection in Python:
- It improves the accuracy of a model if the right subset is chosen.
- It reduces overfitting.
- It enables the machine learning algorithm to train faster.
- It reduces the complexity of a model and makes it easier to interpret.
“I prepared a model by selecting all the features and I got an accuracy of around 65% which is not good for a predictive model and after doing some feature selection and feature engineering without doing any logical changes in my model code my accuracy jumped to 81% which is quite impressive” - Raheel Shaikh
What is Featurewiz?
Featurewiz is an open-source Python library for automated feature selection. It selects the most relevant features from a dataset in two stages: the SULOV method, followed by recursive XGBoost feature selection.
How does it work?
SULOV stands for Searching for Uncorrelated List of Variables. The algorithm works in the following steps (a rough code sketch follows the list).
- First step: find all pairs of highly correlated variables, i.e. pairs whose absolute correlation exceeds a threshold (say 0.8).
- Second step: compute each variable's Mutual Information Score with respect to the target. Mutual Information is a non-parametric scoring method, so it is suitable for all kinds of variables and targets.
- Third step: for each pair of correlated variables, knock off the one with the lower Mutual Information Score.
- Final step: keep the variables with the highest Mutual Information Scores and the least correlation with each other.
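To make the idea concrete, here is a minimal sketch of the SULOV logic, not Featurewiz's internal implementation. It assumes a pandas DataFrame X of numeric features and a classification target y, and uses scikit-learn's mutual_info_classif for the Mutual Information scores; the function name sulov_sketch is just for illustration.

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def sulov_sketch(X, y, corr_limit=0.8):
    # Mutual Information of every feature with the target (non-parametric)
    mi = pd.Series(mutual_info_classif(X, y), index=X.columns)

    # absolute pairwise correlations between the features
    corr = X.corr().abs()

    cols = list(X.columns)
    removed = set()
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > corr_limit:
                # of each highly correlated pair, knock off the feature
                # with the lower Mutual Information score
                removed.add(a if mi[a] < mi[b] else b)

    # keep the uncorrelated features with the highest Mutual Information
    return [c for c in cols if c not in removed]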
After SULOV keeps the features with low correlation and high Mutual Information, Recursive XGBoost is used to find the best features among the ones that remain. Here is how it works (a rough sketch follows the list).
- First step: select all remaining features and split the dataset into train and valid sets.
- Second step: find the top X features on the train set, using the valid set for early stopping (to prevent overfitting).
- Third step: take the next set of features and find the top X among them.
- Final step: repeat this five times, then combine all selected features and de-duplicate them.
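As an illustration only (Featurewiz's own implementation differs in its details), one way to express this recursive idea is sketched below. It assumes xgboost >= 1.6, so early_stopping_rounds can be passed to the XGBClassifier constructor, and a pandas DataFrame of features; top_x_features and recursive_xgboost_sketch are hypothetical helper names.

import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

def top_x_features(X, y, feature_names, x=2):
    # split into train and valid sets; valid is used for early stopping
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

    # early_stopping_rounds in the constructor assumes xgboost >= 1.6
    model = XGBClassifier(n_estimators=200, early_stopping_rounds=10,
                          eval_metric="mlogloss", verbosity=0)
    model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], verbose=False)

    # rank the features by importance and keep the top x
    order = np.argsort(model.feature_importances_)[::-1][:x]
    return [feature_names[i] for i in order]

def recursive_xgboost_sketch(X_df, y, x=2, rounds=5):
    # walk through the features in chunks, keep the top x from each round,
    # then combine the selections and de-duplicate them
    cols = list(X_df.columns)
    chunk = max(1, len(cols) // rounds)
    selected = []
    for i in range(0, len(cols), chunk):
        subset = cols[i:i + chunk]
        selected += top_x_features(X_df[subset].values, y, subset, x=x)
    return list(dict.fromkeys(selected))  # de-duplicate, preserving order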
Installation
pip install featurewiz
How to Use Featurewiz for Feature Selection in Python
In this example we use a mobile price dataset in which the task is to predict a phone's price range. The target column, price_range, has four classes:
- 0 (low cost)
- 1 (medium cost)
- 2 (high cost)
- 3 (very high cost)
# import packages
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from featurewiz import featurewiz

# fix the random seed for reproducibility
np.random.seed(1234)

# load the dataset and check its size
data = pd.read_csv('../data/train.csv')
data.shape

# separate the features and the target
X = data.drop(['price_range'], axis=1)
y = data.price_range.values

# standardize the features
X_scaled = StandardScaler().fit_transform(X)

# split into train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(
    X_scaled, y, test_size=0.2, stratify=y, random_state=1)

# create and train a baseline classifier on all 20 features
classifier = RandomForestClassifier()
classifier.fit(X_train, y_train)

# make predictions
preds = classifier.predict(X_valid)

# check performance
accuracy_score(y_valid, preds)
# automatic feature selection using the featurewiz package
target = 'price_range'
features, train = featurewiz(data, target, corr_limit=0.7, verbose=2, sep=",",
                             header=0, test_data="", feature_engg="", category_encoders="")
Skipping feature engineering since no feature_engg input...
Skipping category encoding since no category encoders specified in input...
Loading train data...
Shape of your Data Set loaded: (2000, 21)
Loading test data...
Filename is an empty string or file not able to be loaded
############## C L A S S I F Y I N G V A R I A B L E S ####################
Classifying variables in data set...
20 Predictors classified...
No variables removed since no ID or low-information variables found in data set
No GPU active on this device
Running XGBoost using CPU parameters
Removing 0 columns from further processing since ID or low information variables
columns removed: []
After removing redundant variables from further processing, features left = 20
#### Single_Label Multi_Classification Feature Selection Started ####
Searching for highly correlated variables from 20 variables using SULOV method
##### SULOV : Searching for Uncorrelated List Of Variables (takes time...) ############
No highly correlated variables in data set to remove. All selected...
Adding 0 categorical variables to reduced numeric variables of 20
############## F E A T U R E S E L E C T I O N ####################
Current number of predictors = 20
Finding Important Features using Boosted Trees algorithm...
using 20 variables...
using 16 variables...
using 12 variables...
using 8 variables...
using 4 variables...
Selected 16 important features from your dataset
Time taken (in seconds) = 19
Returning list of 16 important features and dataframe.
The featurewiz function returns two outputs:
- features - a list of the selected features.
- train - a dataframe that contains only the selected features and the target variable.
print(features)
'battery_power',
'px_height',
'px_width',
'touch_screen',
'mobile_wt',
'int_memory',
'three_g',
'sc_h',
'four_g',
'sc_w',
'n_cores',
'fc',
'pc',
'talk_time',
'wifi']
# split the selected-features dataframe into features and target
X_new = train.drop(['price_range'], axis=1)
y = train.price_range.values

# preprocess the features
X_scaled = StandardScaler().fit_transform(X_new)

# split into train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(
    X_scaled, y, test_size=0.2, stratify=y, random_state=1)

# create and train the classifier on the selected features
classifier = RandomForestClassifier()
classifier.fit(X_train, y_train)

# make predictions
preds = classifier.predict(X_valid)

# check performance
accuracy_score(y_valid, preds)
0.905