Using Machine Learning to Identify Patients Who Are No-Shows
Here is a brief introduction to the project.
Introduction
In this project, we use machine learning algorithms to perform feature selection on patient appointment data. The goal is to understand which characteristics of a patient make them more likely to miss their appointment.
Dataset
The dataset for this project was obtained from Kaggle and consists of 14 columns and 110,527 rows of data.
The data consists of the following columns:
Patient Id
Identification of a patient
Appointment ID
Identification of each appointment
Gender
Male or Female. Female patients make up the larger proportion, which may reflect women taking more care of their health than men.
AppointmentDate
The day of the actual appointment, when they have to visit the doctor.
Scheduled Date
The day someone called or registered the appointment; this comes before the appointment, of course.
Age
The patient's age.
Neighborhood
Where the appointment takes place.
Scholarship
True or False. This is a broad topic; consider reading this article: https://en.wikipedia.org/wiki/Bolsa_Fam%C3%ADlia
Hypertension
True or False
Diabetes
True or False
Alcoholism
True or False
Handicap
True or False
SMS_received
1 or more messages sent to the patient.
No-show
True or False.
Machine Learning Process
The steps taken to accomplish our results include the following:
Data preprocessing.
Create an awaiting time field (days between the scheduled and appointment dates).
Exploratory data analysis.
Pass the data through the machine learning algorithm.
Select the top 10 and bottom 10 features affecting whether a patient misses their appointment.
The code for the project can be found on my GitHub.
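The awaiting time step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the timestamp layout, the function name `awaiting_days`, and the sample values are all assumptions made for the example.

```python
from datetime import datetime

# Assumed timestamp layout; adjust it to match the actual file.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def awaiting_days(scheduled: str, appointment: str) -> int:
    """Return whole calendar days between scheduling and the appointment."""
    scheduled_day = datetime.strptime(scheduled, TIMESTAMP_FORMAT).date()
    appointment_day = datetime.strptime(appointment, TIMESTAMP_FORMAT).date()
    return (appointment_day - scheduled_day).days


# Made-up example values, not rows taken from the dataset.
same_day = awaiting_days("2016-04-29T08:15:00Z", "2016-04-29T00:00:00Z")
two_weeks = awaiting_days("2016-04-15T16:30:00Z", "2016-04-29T00:00:00Z")
print(same_day, two_weeks)
```

Comparing dates rather than full timestamps keeps a same-day booking at zero days even when the appointment column carries no time of day.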
Exploratory Data Analysis
The pie chart below shows the number of Yes (shows up to appointment) as 85,299 and No (misses appointment) as 21,677, roughly 80% and 20% of the 106,976 records. This means the dataset is imbalanced, and we need to keep that in mind as we move along.
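To make the imbalance concrete, here is a small sketch that turns the two counts above into class shares and "balanced" class weights. Reweighting is only one common way to handle imbalance and is shown here as an assumption; the write-up does not say which approach, if any, the project used.

```python
# Counts reported in the text above.
counts = {"Yes": 85_299, "No": 21_677}

total = sum(counts.values())
n_classes = len(counts)

# Share of each class in the data.
shares = {label: n / total for label, n in counts.items()}

# "Balanced" weights: total / (number of classes * class count),
# so the rarer class receives the larger weight.
weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label in counts:
    print(f"{label}: share={shares[label]:.3f}, weight={weights[label]:.3f}")
```

A model that always predicts the majority class would already be right about four times out of five here, which is why accuracy alone can be misleading on this data.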
Machine Learning Model
The machine learning model used here was a logistic regression with lasso (L1) regularization. Regularization penalizes the model's cost function so that the model does not overfit. With lasso, the coefficients of unimportant features are shrunk to exactly zero, so the features left with nonzero coefficients can be selected as the important ones.
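To show how the lasso penalty pushes a coefficient to zero, here is a minimal from-scratch sketch using proximal gradient descent on a tiny invented dataset. It is not the project's implementation (a library such as scikit-learn would normally be used); the function names, the penalty strength `lam`, the step size `lr`, and the toy rows are all assumptions chosen for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value toward zero; anything within the threshold becomes 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(rows, labels, lam=0.1, lr=0.5, steps=2000):
    """Minimize mean log-loss + lam * sum(|w_j|) by proximal gradient descent."""
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(steps):
        # Prediction error (p - y) for every row under the current model.
        errors = [
            sigmoid(sum(w * x for w, x in zip(weights, row)) + bias) - y
            for row, y in zip(rows, labels)
        ]
        for j in range(d):
            grad = sum(e * row[j] for e, row in zip(errors, rows)) / n
            # Gradient step on the log-loss, then the L1 shrinkage step.
            weights[j] = soft_threshold(weights[j] - lr * grad, lr * lam)
        bias -= lr * sum(errors) / n  # the intercept is not penalized
    return weights, bias


# Invented toy data: the first feature determines the label,
# the second is only weakly related to it.
rows = [[1, 1], [1, 1], [1, 1], [1, -1], [-1, 1], [-1, -1], [-1, -1], [-1, -1]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = fit_lasso_logistic(rows, labels)
print(weights, bias)
```

On this toy data the weight of the weakly related second feature should end at zero while the first stays clearly positive, which is the behavior used for feature selection.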
Results and Insights
The model selected the most important features that affect patients missing their appointment as seen in the figure below.
From the figure above, we can split the features into two groups: those associated with being more likely to miss an appointment and those associated with being less likely.
More Likely to Miss Appointment
Patients who had a large gap between their scheduled and appointment dates missed their appointments the most.
Interestingly, patients who received an SMS message still missed their appointments.
Patients in the Itarare and Santos Dumont neighborhoods were more likely to miss their appointments.
Patients aged 13 or 14 were more likely to miss their appointments.
Less Likely to Miss Appointment
Patients who were aged 64 or 69.
Patients who lived in Santa Martha, Jardim da Penha and Jardim Camburi.
Patients who had Hypertension were less likely to miss their appointments
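The split into these two groups can be sketched as a ranking of the model's signed coefficients. The sketch assumes the target is coded so that a positive coefficient means "more likely to miss"; the helper `rank_features`, the column name `Age_64`, and every number below are invented for illustration and are not the model's output.

```python
def rank_features(coefficients, k=10):
    """Return the k most positive and k most negative nonzero coefficients."""
    kept = [(name, c) for name, c in coefficients.items() if c != 0.0]
    ordered = sorted(kept, key=lambda item: item[1], reverse=True)
    more_likely = [item for item in ordered[:k] if item[1] > 0]
    less_likely = [item for item in ordered[::-1][:k] if item[1] < 0]
    return more_likely, less_likely


# Invented coefficients for illustration only; not the model's output.
example = {
    "AwaitingTime": 0.8,
    "SMS_received": 0.3,
    "Gender": 0.0,
    "Hypertension": -0.2,
    "Age_64": -0.5,
}
more_likely, less_likely = rank_features(example, k=2)
print(more_likely)
print(less_likely)
```

Features whose coefficients were shrunk to zero by the lasso penalty are dropped before ranking, so they appear in neither group.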