- Ordinal - The values have a natural order. Example: rating happiness on a scale of 1-10
- Binary - The feature has only two values. Example: Male or Female
- Nominal - The values have no inherent order. Example: Countries
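As a rough illustration of the three types above, here is a minimal sketch in pandas. The toy values and the `toy` DataFrame are made up for this example and do not come from the dataset used later.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "happiness": [3, 9, 6],                    # ordinal: the order carries meaning
    "gender": ["Male", "Female", "Female"],    # binary: exactly two values
    "country": ["Kenya", "Rwanda", "Uganda"],  # nominal: no inherent order
})

# An ordered categorical dtype records the ranking 1 < 2 < ... < 10 explicitly.
toy["happiness"] = pd.Categorical(
    toy["happiness"], categories=list(range(1, 11)), ordered=True
)

# A plain (unordered) categorical dtype is enough for a nominal feature.
toy["country"] = toy["country"].astype("category")

print(toy["happiness"].cat.ordered)  # True
print(toy["country"].cat.ordered)    # False
print(toy["gender"].nunique())       # 2
```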
Combining Categorical Features in Machine Learning Models
df["new_feature"] = (
    df.feature_1.astype(str)
    + "_"
    + df.feature_2.astype(str)
)
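To make the snippet above concrete, here is a small self-contained sketch. The DataFrame and its values are hypothetical; only the column names `feature_1` and `feature_2` follow the snippet.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
df = pd.DataFrame({
    "feature_1": ["primary", "secondary", "primary"],
    "feature_2": ["farming", "farming", "self_employed"],
})

# Join the two categorical columns into one string-valued feature.
df["new_feature"] = (
    df["feature_1"].astype(str)
    + "_"
    + df["feature_2"].astype(str)
)

print(df["new_feature"].tolist())
# ['primary_farming', 'secondary_farming', 'primary_self_employed']

# The combined feature can have more distinct values than either input.
print(df["feature_1"].nunique(), df["feature_2"].nunique(), df["new_feature"].nunique())
# 2 2 3
```

Note that the number of distinct values can grow up to the product of the two inputs' category counts, which in turn increases the number of columns after one-hot encoding.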
1. Load the dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
np.random.seed(123)
warnings.filterwarnings('ignore')
%matplotlib inline
# Import data
data = pd.read_csv('data/Train_v2.csv')
# print shape
print('data shape :', data.shape)
data shape : (23524, 13)
# inspect data
data.head()
2. Understand the dataset
# show some information about the dataset
data.info()
3. Data preparation for machine learning models
#import preprocessing module
from sklearn.preprocessing import LabelEncoder, StandardScaler
# Convert target label to numerical Data
le = LabelEncoder()
data['bank_account'] = le.fit_transform(data['bank_account'])
# Separate training features from target
X_train = data.drop(['bank_account'], axis=1)
y_train = data['bank_account']
print(y_train)
- Handle conversion of data types.
- Convert categorical features to numerical features by using One-Hot Encoding and/or Label Encoding.
- Drop uniqueid variable.
- Perform feature scaling.
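To see what the two encodings in the list above produce, here is a minimal sketch on a toy frame. The values are invented for illustration and are not taken from the dataset; `encoder` and `one_hot` are names chosen for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; only the column names echo the dataset.
toy = pd.DataFrame({
    "job_type": ["Farming", "Self employed", "Farming"],
    "cellphone_access": ["Yes", "No", "Yes"],
})

# One-hot encoding: one indicator column per category of job_type.
one_hot = pd.get_dummies(toy, prefix_sep="_", columns=["job_type"])
print(sorted(one_hot.columns))
# ['cellphone_access', 'job_type_Farming', 'job_type_Self employed']

# Label encoding: classes are sorted, then mapped to 0..n_classes-1.
encoder = LabelEncoder()
one_hot["cellphone_access"] = encoder.fit_transform(one_hot["cellphone_access"])
print(list(encoder.classes_))                # ['No', 'Yes']
print(one_hot["cellphone_access"].tolist())  # [1, 0, 1]
```

Label encoding keeps a single column, which suits two-valued features; one-hot encoding avoids implying an order among more than two categories.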
# function to preprocess our data
def preprocessing_data(data):
    # work on a copy so the caller's DataFrame is left unchanged
    data = data.copy()

    # Convert the following numerical features from integer to float
    num_cols = ["household_size", "age_of_respondent", "year"]
    data[num_cols] = data[num_cols].astype(float)

    # categorical features to be converted by One Hot Encoding
    categ = [
        "relationship_with_head",
        "marital_status",
        "education_level",
        "job_type",
        "country",
    ]

    # One Hot Encoding conversion
    data = pd.get_dummies(data, prefix_sep="_", columns=categ)

    # Label Encoder conversion
    data["location_type"] = le.fit_transform(data["location_type"])
    data["cellphone_access"] = le.fit_transform(data["cellphone_access"])
    data["gender_of_respondent"] = le.fit_transform(data["gender_of_respondent"])

    # drop uniqueid column
    data = data.drop(["uniqueid"], axis=1)

    # scale our data
    scaler = StandardScaler()
    data = scaler.fit_transform(data)

    return data
# preprocess the train data
processed_train_data = preprocessing_data(X_train)
4. Model Building and Experiments
# Split train_data
from sklearn.model_selection import train_test_split
X_Train, X_val, y_Train, y_val = train_test_split(
    processed_train_data, y_train, stratify=y_train, test_size=0.1, random_state=42
)
#import classifier algorithm here
from sklearn.linear_model import LogisticRegression
# create classifier
lg_model = LogisticRegression()
#Training the classifier
lg_model.fit(X_Train,y_Train)
# import evaluation metrics
from sklearn.metrics import confusion_matrix, accuracy_score
# evaluate the model
lg_y_pred = lg_model.predict(X_val)
# Get the accuracy
print("Accuracy Score of Logistic Regression classifier: ", "{:.4f}".format(accuracy_score(y_val, lg_y_pred)))
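The code above imports confusion_matrix but only prints the accuracy. As a hedged sketch of how the two metrics relate, the example below uses made-up labels (`y_true` and `y_hat` are hypothetical, not model output from this tutorial). A confusion matrix can be worth checking alongside accuracy when the target classes are unbalanced.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, invented for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 1, 1, 0]

# Accuracy is the share of correct predictions: 4 of 6 here.
acc = accuracy_score(y_true, y_hat)
print("{:.4f}".format(acc))  # 0.6667

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
print(cm.tolist())  # [[3, 1], [1, 1]]
```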
1st Experiment: Combine education_level and job_type features.
# function to preprocess our data
def preprocessing_data(data):
    # work on a copy so the caller's DataFrame is left unchanged
    data = data.copy()

    # Convert the following numerical features from integer to float
    num_cols = ["household_size", "age_of_respondent", "year"]
    data[num_cols] = data[num_cols].astype(float)

    # combine some categorical features
    data["features_combination"] = (
        data.education_level.astype(str) + "_" + data.job_type.astype(str)
    )

    # remove individual features that are combined together
    data = data.drop(['education_level', 'job_type'], axis=1)

    # categorical features to be converted by One Hot Encoding
    categ = [
        "relationship_with_head",
        "marital_status",
        "features_combination",
        "country",
    ]

    # One Hot Encoding conversion
    data = pd.get_dummies(data, prefix_sep="_", columns=categ)

    # Label Encoder conversion
    data["location_type"] = le.fit_transform(data["location_type"])
    data["cellphone_access"] = le.fit_transform(data["cellphone_access"])
    data["gender_of_respondent"] = le.fit_transform(data["gender_of_respondent"])

    # drop uniqueid column
    data = data.drop(["uniqueid"], axis=1)

    # scale our data
    scaler = StandardScaler()
    data = scaler.fit_transform(data)

    return data
- Combine education_level and job_type to create a new feature called “features_combination”.
- Remove individual features (education_level and job_type) from the dataset.
- Add the new feature “features_combination” to the list of categorical features to be converted by One Hot Encoding.
Keep in mind that nothing else changed: the machine learning classifier and its hyperparameters are the same as before.
2nd Experiment: Combine relationship_with_head and marital_status features.
# function to preprocess our data
def preprocessing_data(data):
    # work on a copy so the caller's DataFrame is left unchanged
    data = data.copy()

    # Convert the following numerical features from integer to float
    num_cols = ["household_size", "age_of_respondent", "year"]
    data[num_cols] = data[num_cols].astype(float)

    # combine some categorical features
    data["features_combination"] = (
        data.relationship_with_head.astype(str)
        + "_"
        + data.marital_status.astype(str)
    )

    # remove individual features that are combined together
    data = data.drop(['relationship_with_head', 'marital_status'], axis=1)

    # categorical features to be converted by One Hot Encoding
    categ = [
        "features_combination",
        "education_level",
        "job_type",
        "country",
    ]

    # One Hot Encoding conversion
    data = pd.get_dummies(data, prefix_sep="_", columns=categ)

    # Label Encoder conversion
    data["location_type"] = le.fit_transform(data["location_type"])
    data["cellphone_access"] = le.fit_transform(data["cellphone_access"])
    data["gender_of_respondent"] = le.fit_transform(data["gender_of_respondent"])

    # drop uniqueid column
    data = data.drop(["uniqueid"], axis=1)

    # scale our data
    scaler = StandardScaler()
    data = scaler.fit_transform(data)

    return data
- Combine relationship_with_head and marital_status to create a new feature called “features_combination”.
- Remove individual features (relationship_with_head and marital_status) from the dataset.
- Add the new feature “features_combination” to the list of categorical features to be converted by One Hot Encoding.
This shows that combining categorical features will not always improve your machine learning model as you expect. You will therefore need to run many experiments until you get satisfactory performance.
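One way to organize such experiments is to loop over candidate pairs and score each against a baseline. The sketch below is only an illustration under several assumptions: the data is synthetic with random labels (so the scores themselves are meaningless), the helper `score_with_combination` is hypothetical, and it uses 5-fold cross-validation instead of the single hold-out split used earlier in this article.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): random categories and random labels.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "education_level": rng.choice(["none", "primary", "secondary"], size=n),
    "job_type": rng.choice(["farming", "formal", "informal"], size=n),
    "marital_status": rng.choice(["single", "married"], size=n),
})
target = pd.Series(rng.integers(0, 2, size=n))


def score_with_combination(frame, labels, pair=None):
    """Return mean CV accuracy, optionally after combining one pair of columns."""
    frame = frame.copy()
    if pair is not None:
        a, b = pair
        # Same string-concatenation trick as in the experiments above.
        frame[f"{a}_{b}"] = frame[a].astype(str) + "_" + frame[b].astype(str)
        frame = frame.drop([a, b], axis=1)
    encoded = pd.get_dummies(
        frame, prefix_sep="_", columns=list(frame.columns)
    ).astype(float)
    # The classifier settings stay fixed so only the features differ between runs.
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, encoded, labels, cv=5, scoring="accuracy").mean()


# Baseline first, then every pair of categorical columns.
results = {"baseline": score_with_combination(toy, target)}
for pair in combinations(toy.columns, 2):
    results["+".join(pair)] = score_with_combination(toy, target, pair)

for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.4f}")
```

Keeping the classifier and its settings identical across runs means any change in the score can be attributed to the feature combination rather than to the model.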