Classification Model to Predict Customer Booking with an Airline.

Kevin Kibe
4 min read · Jan 24, 2023


Problem Statement

Train a machine learning model to predict the target outcome: whether a customer completes a booking.

After training the model, evaluate how well it performs by running cross-validation and reporting appropriate evaluation metrics. Then create a visualisation that shows how much each variable contributed to the model. The data and source code are in my GitHub repo, linked at the bottom.

1. Importing necessary libraries and Loading the Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv(r"path\customer_booking.csv", encoding="ISO-8859-1")
df.sample(10)

Output:

To provide more context, below is a more detailed data description, explaining exactly what each column means:

num_passengers = number of passengers travelling
sales_channel = sales channel booking was made on
trip_type = trip Type (Round Trip, One Way, Circle Trip)
purchase_lead = number of days between travel date and booking date
length_of_stay = number of days spent at destination
flight_hour = hour of flight departure
flight_day = day of week of flight departure
route = origin -> destination flight route
booking_origin = country from where booking was made
wants_extra_baggage = if the customer wanted extra baggage in the booking
wants_preferred_seat = if the customer wanted a preferred seat in the booking
wants_in_flight_meals = if the customer wanted in-flight meals in the booking
flight_duration = total duration of flight (in hours)
booking_complete = flag indicating if the customer completed the booking

2. Cleaning the Dataset

Checking for null values and dropping duplicate rows.

# count the null values in each column
print(df.isnull().sum())
# count the duplicated rows
print(df.duplicated().sum())
# remove the duplicated rows, keeping the first occurrence of each
df = df.drop_duplicates()
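To see what these three calls report, here is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from the booking dataset.

```python
import pandas as pd

# Toy frame standing in for the booking data (values are invented).
toy = pd.DataFrame({
    "num_passengers": [1, 2, 2, 1],
    "flight_day": ["Mon", "Tue", "Tue", None],
})

null_counts = toy.isnull().sum()        # missing values per column
n_duplicates = toy.duplicated().sum()   # rows identical to an earlier row
deduped = toy.drop_duplicates()         # keeps the first copy of each row

print(null_counts)
print("duplicate rows:", n_duplicates)
print("rows before/after:", len(toy), len(deduped))
```

Note that duplicated() compares whole rows, so two rows only count as duplicates when every column matches.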

3. Feature Engineering

Transforming the values in the ‘flight_day’ column to numerical values from 1 to 7.

# convert flight_day into a numerical value between 1 (Mon) and 7 (Sun)
mapping = {
"Mon": 1,
"Tue": 2,
"Wed": 3,
"Thu": 4,
"Fri": 5,
"Sat": 6,
"Sun": 7,
}

df["flight_day"] = df["flight_day"].map(mapping)
df["flight_day"].sample(5)

Output:
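One thing worth checking after a dictionary mapping: Series.map turns any label that is not a key in the dictionary into NaN. The sketch below uses a made-up sample with a deliberately mismatched label ("Monday") to show how to spot that. Whether the real column contains such labels is not something the article states.

```python
import pandas as pd

mapping = {"Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4, "Fri": 5, "Sat": 6, "Sun": 7}

# Invented sample; "Monday" is intentionally absent from the mapping keys.
days = pd.Series(["Mon", "Sun", "Monday", "Wed"])

encoded = days.map(mapping)               # unmapped labels become NaN
unmapped = days[encoded.isna()].unique()  # original labels that failed to map

print(encoded.tolist())
print("labels missing from mapping:", list(unmapped))
```

If the list of missing labels is empty, the mapping covered every value in the column.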

Transforming the ‘sales_channel’ and ‘trip_type’ columns to a numerical datatype using dummy variables.

df2 = pd.get_dummies(df, columns=['sales_channel']) # dummy variables for the sales channel column
df2 = pd.get_dummies(df2, columns=['trip_type']) # dummy variables for the trip_type column
df2.head()

Output:
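As a small illustration of what dummy encoding produces, here is a sketch on a toy frame. The category labels ("Internet", "Mobile", "RoundTrip", "OneWay") are assumptions for the example and may not match the exact strings in the dataset.

```python
import pandas as pd

# Toy frame; the category labels are assumed, not taken from the dataset.
toy = pd.DataFrame({
    "sales_channel": ["Internet", "Mobile", "Internet"],
    "trip_type": ["RoundTrip", "OneWay", "RoundTrip"],
    "booking_complete": [0, 1, 0],
})

# A single call can encode several columns; each category gets its own
# indicator column named <column>_<category>.
encoded = pd.get_dummies(toy, columns=["sales_channel", "trip_type"])

print(encoded.columns.tolist())
print(encoded)
```

Passing both columns in one call gives the same result as the two separate calls above.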

Dropping unnecessary columns.

df2 = df2.drop(['num_passengers', 'route', 'booking_origin'], axis=1)  # dropping unnecessary columns
df2.info()

Output:

4. Modelling

Importing the necessary libraries, splitting the dataset into training and test sets, then scaling both with the standard scaler (fitted on the training set only).

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

X = df2.drop(columns=['booking_complete'])
y = df2['booking_complete']

# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

rand_clf = RandomForestClassifier(n_estimators=100)
rand_clf.fit(X_train, y_train)
y_pred = rand_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of RandomForest Classifier: {:.2f}%".format(accuracy*100))

svm_clf = svm.SVC(kernel='linear', C=1)
svm_clf.fit(X_train, y_train)
y_pred = svm_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of Support vector machine: {:.2f}%".format(accuracy*100))

gb_clf = GradientBoostingClassifier(n_estimators=100)
gb_clf.fit(X_train, y_train)
y_pred = gb_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of Gradient Boosting: {:.2f}%".format(accuracy*100))

Output:

All three models performed well on accuracy, with SVM and Gradient Boosting scoring highest.
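Accuracy alone can hide how well the positive class (completed bookings) is predicted, especially if the classes are imbalanced. The article does not state the class balance, so the sketch below uses synthetic data with an assumed 85/15 split to show how a confusion matrix and per-class precision, recall and F1 could be reported alongside accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the booking data; the 85/15 class split is an assumption.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.85, 0.15], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_hat)
print(cm)
print(classification_report(y_te, y_hat, digits=3))
```

On the real data, the same two calls can be run with y_test and each model's y_pred.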

Tuning the Random Forest and Gradient Boosting classifiers with a grid search, scoring each parameter combination by 5-fold cross-validation.

from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
}
RFclf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(RFclf, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
print("Best parameters: ", grid_search.best_params_)
print("Best score: {:.2f}%".format(grid_search.best_score_*100))

param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
}
GBclf = GradientBoostingClassifier(random_state=42)
grid_search = GridSearchCV(GBclf, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
print("Best parameters: ", grid_search.best_params_)
print("Best score: {:.2f}%".format(grid_search.best_score_*100))

Output:
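For a plain cross-validation report without tuning, scikit-learn's cross_validate can score one model on several metrics at once. This is a sketch on synthetic data, not the article's code. It also wraps the scaler in a pipeline so that it is refitted on each training fold, which is a design choice of this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the booking features and target.
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=42)

# The scaler inside the pipeline is fitted on the training part of each fold only.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_validate(
    model, X_demo, y_demo, cv=cv, scoring=["accuracy", "f1", "roc_auc"]
)
for name in ("accuracy", "f1", "roc_auc"):
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Reporting the mean and spread across folds gives a sense of how stable each metric is, which a single train/test split cannot show.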

Plotting the features in order of importance.

# plotting the features in order of importance
features = [i.split("__")[0] for i in X.columns]
importances = rand_clf.feature_importances_
indices = np.argsort(importances)

fig, ax = plt.subplots(figsize=(10, 15))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

Output:

The most important features are purchase_lead, flight_hour, length_of_stay, flight_duration, and flight_day.
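The feature_importances_ attribute of a random forest is impurity-based, and it can favour continuous or high-cardinality features. As a possible cross-check (a different technique from the one used above), permutation importance measures how much the test score drops when a single feature is shuffled. The sketch below runs on synthetic data with placeholder feature names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data; the feature names are placeholders, not the booking columns.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 10 times on held-out data and record the score drop.
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=42)
ranking = result.importances_mean.argsort()[::-1]  # most important first

for i in ranking:
    mean = result.importances_mean[i]
    std = result.importances_std[i]
    print(f"{feature_names[i]}: {mean:.4f} +/- {std:.4f}")
```

If both methods rank the same features at the top, that adds some confidence to the reading of the plot.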

Link to the GitHub repo.
