Using Machine Learning to predict who is most likely to have a bank account
The main dataset contains demographic information and the financial services used by approximately 33,600 individuals across East Africa. The data was extracted from various Finscope surveys conducted between 2016 and 2018.
The task is to predict how likely a person is to have a bank account.
1. Data Exploration
The training dataset has the following columns: country, year, uniqueid, bank_account, location_type, cellphone_access, household_size, age_of_respondent, gender_of_respondent, relationship_with_head, marital_status, education_level, job_type.
I explored the data and plotted the distribution of bank account ownership.
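As a minimal sketch of that step, the snippet below computes the share of each class in bank_account. The small DataFrame is hypothetical stand-in data, not the survey itself; with the real data you would load the training file instead.

```python
import pandas as pd

# Hypothetical stand-in for the training data (only the target column is needed here)
train = pd.DataFrame({
    "bank_account": ["No", "No", "Yes", "No", "No", "Yes", "No", "No"],
})

# Share of each class in the target column
class_share = train["bank_account"].value_counts(normalize=True)
print(class_share)

# Uncomment to draw the same distribution as a bar chart (requires matplotlib)
# class_share.plot(kind="bar", title="bank_account distribution")
```

Looking at shares rather than raw counts makes any imbalance between ‘No’ and ‘Yes’ easy to spot before modelling.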
2. Machine Learning
I will start by separating the features from the target variable, ‘bank_account’, in the train dataset. Then I will convert the target's values from the object data type to a numerical data type.
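A minimal sketch of this step, assuming scikit-learn's LabelEncoder is used for the conversion (the encoder choice and the toy rows are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in rows using a few of the dataset's column names
train = pd.DataFrame({
    "uniqueid": ["uniqueid_1", "uniqueid_2", "uniqueid_3"],
    "age_of_respondent": [24, 70, 26],
    "bank_account": ["Yes", "No", "Yes"],
})

# Features: every column except the target
X_train = train.drop(columns=["bank_account"])

# Target: LabelEncoder sorts the classes, so 'No' becomes 0 and 'Yes' becomes 1
le = LabelEncoder()
y_train = le.fit_transform(train["bank_account"])

print(list(le.classes_), y_train)
```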
I then wrote a function to preprocess the data before training the models. Specifically, it converts numerical columns from integer to float, one-hot encodes the categorical features, drops the uniqueid column, and scales the data.
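The function could look roughly like the sketch below. The name preprocess, the use of MinMaxScaler, and the toy rows are assumptions; the text above only says that the data is scaled, not which scaler is used.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def preprocess(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical preprocessing helper following the four steps described above."""
    # Drop the identifier column, which carries no predictive information
    data = data.drop(columns=["uniqueid"])

    # Convert integer columns to float
    num_cols = data.select_dtypes(include="number").columns
    data = data.astype({col: float for col in num_cols})

    # One-hot encode the categorical (text) columns
    data = pd.get_dummies(data, dtype=float)

    # Scale every column to the [0, 1] range (scaler choice is an assumption)
    scaled = MinMaxScaler().fit_transform(data)
    return pd.DataFrame(scaled, columns=data.columns, index=data.index)

# Hypothetical stand-in rows
features = pd.DataFrame({
    "uniqueid": ["uniqueid_1", "uniqueid_2", "uniqueid_3"],
    "household_size": [3, 5, 1],
    "age_of_respondent": [24, 70, 26],
    "location_type": ["Rural", "Urban", "Rural"],
    "cellphone_access": ["Yes", "No", "Yes"],
})

processed = preprocess(features)
print(processed)
```

Note that this sketch fits the scaler inside the function. When the same function is applied to a separate test set, it is usually safer to fit the scaler on the training data only and reuse it.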
3. Splitting our dataset
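A sketch of the split, assuming scikit-learn's train_test_split with a stratified 10% validation set. The split ratio, the random seed, and the synthetic arrays are all assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 4 features, imbalanced 80/20 target
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the class ratio the same in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42
)

print(X_tr.shape, X_val.shape)
```

Stratifying matters here because the target is imbalanced; without it, a small validation set could end up with very few ‘Yes’ rows.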
4. Training
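A minimal training sketch with default hyperparameters. The data is synthetic, and the fallback import is only there so the snippet still runs where xgboost is not installed; it is not part of the original workflow.

```python
import numpy as np

try:
    from xgboost import XGBClassifier as Model
except ImportError:
    # Stand-in so the sketch still runs without xgboost installed
    from sklearn.ensemble import GradientBoostingClassifier as Model

# Synthetic stand-in data: the label depends mostly on the first feature
rng = np.random.default_rng(0)
X_tr = rng.random((200, 5))
y_tr = (X_tr[:, 0] + 0.1 * rng.standard_normal(200) > 0.7).astype(int)

# Fit the classifier with its default settings
model = Model()
model.fit(X_tr, y_tr)

# Predict classes for a few rows
preds = model.predict(X_tr[:5])
print(preds)
```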
Now we evaluate the model and compute the error rate.
The error rate was 0.11049723756906082 (about 11%). We want to lower it to improve the model's performance.
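For reference, an error rate like this is typically computed as one minus the accuracy. The sketch below assumes that definition and uses made-up labels, not the model's real predictions.

```python
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions for illustration
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

# Error rate = share of predictions that are wrong
error_rate = 1 - accuracy_score(y_true, y_pred)
print(f"error rate: {error_rate:.2f}")
```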
Let's check the confusion matrix.
The XGBoost model performs well on predicting class 0 but poorly on predicting class 1. This may be caused by the class imbalance in the data (the target variable has more ‘No’ values than ‘Yes’ values).
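To see this effect in numbers, per-class recall can be read off next to the confusion matrix. The labels below are made up for illustration and are not the model's real output.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up true labels and predictions with an imbalanced target
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Recall per class: share of each true class that was predicted correctly
recall_per_class = recall_score(y_true, y_pred, average=None)
print(recall_per_class)
```

An overall error rate can look acceptable while recall for the minority class stays low, which is why it helps to check both.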
So, we will tune the hyperparameters using grid search.
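A sketch of grid search with scikit-learn's GridSearchCV. To keep the snippet dependent on scikit-learn only, it uses GradientBoostingClassifier as a stand-in; XGBClassifier accepts the same three parameter names and could be swapped in. The grid values and the synthetic data are assumptions, not the settings used for the results reported here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
rng = np.random.default_rng(0)
X = rng.random((120, 4))
y = (X[:, 0] > 0.6).astype(int)

# Hypothetical grid: every combination is trained and cross-validated
param_grid = {
    "n_estimators": [20, 40],
    "max_depth": [2, 3],
    "learning_rate": [0.1, 0.3],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),  # stand-in for the XGBoost model
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# Best combination and its cross-validated error rate
print(search.best_params_, 1 - search.best_score_)
```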
We get an error rate of 0.10922226944326396 after tuning.
Link to my GitHub