Introduction

The goal of this project is to use machine learning algorithms to classify a transaction as fraudulent or not based on multiple inputs.

Datasets

The dataset for this project was obtained from Kaggle. It contains transactions made by credit cards in September 2013 by European cardholders. The transactions occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. Each transaction consists of the dimension-reduced features V1…V28, the amount, and a class label indicating whether the transaction is fraudulent or not.
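To make the imbalance concrete, here is a small sketch using only the two counts quoted above. The variable names are my own. It also shows why plain accuracy is a poor yardstick here: a hypothetical model that labels every transaction as non-fraudulent would still look almost perfect.

```python
# Counts quoted in the dataset description above.
n_transactions = 284_807
n_frauds = 492

# Share of transactions that belong to the positive (fraud) class.
fraud_rate = n_frauds / n_transactions

# Accuracy of a hypothetical baseline that always predicts "not fraud":
# it is right on every legitimate transaction and wrong on every fraud.
baseline_accuracy = (n_transactions - n_frauds) / n_transactions

print(f"fraud rate: {fraud_rate:.4%}")
print(f"always-'not fraud' accuracy: {baseline_accuracy:.4%}")
```

Such a baseline catches no fraud at all, which is why a metric that focuses on the positive class is more informative for this problem.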

Machine Learning Algorithm Process

The machine learning algorithm process highlights the steps taken to get the models up and running from start to end, including the data preprocessing and cleaning stage. The process includes:

Download the dataset from Kaggle.
Load the data into a Jupyter notebook and perform exploratory analysis.
Split the data into input and output columns.
Standard scale the data so that the features are on a comparable scale (zero mean, unit variance).
Pass the data into grid search with logistic regression, naive Bayes, support vector classifier, and random forest classifier models.
Calculate the performance metrics of the models. Note: because the data is imbalanced, we use a confusion matrix and F1 score to evaluate the models.
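The steps above can be sketched roughly as follows. This is not the project's actual notebook: it assumes scikit-learn, uses a small synthetic imbalanced dataset from make_classification as a stand-in for the Kaggle data, and the train/test split, the parameter grid, and the model settings are all illustrative choices of mine. The support vector classifier is left out to keep the sketch quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Kaggle data (assumption): roughly 2% positives.
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.98, 0.02], random_state=0,
)

# Hold out a test set; stratify so both splits keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0,
)

# Each model is wrapped in a pipeline so scaling is fit on training data only.
models = {
    "logistic regression": GridSearchCV(
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},  # illustrative grid
        scoring="f1",
        cv=3,
    ),
    "naive Bayes": make_pipeline(StandardScaler(), GaussianNB()),
    "random forest": make_pipeline(
        StandardScaler(),
        RandomForestClassifier(n_estimators=50, random_state=0),
    ),
}

# Fit each model and record the F1 score and confusion matrix on the test set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
    print(name, round(results[name]["f1"], 3))
    print(results[name]["confusion_matrix"])
```

Scores from this synthetic data say nothing about the real dataset; the sketch only shows how the pieces fit together.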

The code for the project can be found on my GitHub page.

Results

The results of the 4 machine learning models evaluated are as follows:

Naive Bayes performed worst, with an F1 score of 11.31%.
Logistic regression had an F1 score of 72.9%.
Random forest had an F1 score of 87.4%.
Support vector machines took so long to run that I had to stop the process.
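For readers less familiar with the metric, here is a minimal sketch of how an F1 score is derived from confusion matrix counts. The counts below are made up purely for illustration and are not taken from this project's results.

```python
# Hypothetical confusion matrix counts (NOT from this project).
true_positives = 80   # frauds correctly flagged
false_positives = 10  # legitimate transactions wrongly flagged
false_negatives = 20  # frauds that were missed

# Precision: of everything flagged as fraud, how much really was fraud.
precision = true_positives / (true_positives + false_positives)

# Recall: of all real frauds, how many were caught.
recall = true_positives / (true_positives + false_negatives)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

True negatives do not appear anywhere in the formula, so the large number of legitimate transactions cannot inflate the score.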

Conclusion

In general, the random forest classifier performed best, as it is a combination of decision trees and is less prone to overfitting due to its ensembling method. This project was an interesting one to learn from, and its results furthered my knowledge of machine learning. Please find the code used for the project on my GitHub page.