Here is a brief introduction to the project.

Please check out the blog post: https://www.dotunopasina.com/datascience/noshowappointments

Introduction

In this project, we use machine learning algorithms to perform feature selection on patient appointment data. The goal is to understand which patient characteristics make a patient more likely to miss an appointment.

Dataset

The dataset for this project was obtained from Kaggle and consists of 14 columns and 110,527 rows of data.

The data consists of the following columns:

Patient Id
- Identification of a patient
Appointment ID
- Identification of each appointment
Gender
- Male or Female. Females make up the larger proportion; women tend to take more care of their health than men.
AppointmentDate
- The day of the actual appointment, when the patient has to visit the doctor.
Scheduled Date
- The day someone called or registered the appointment, which is before the appointment, of course.
Age
- How old the patient is.
Neighborhood
- Where the appointment takes place.
Scholarship
- True or False. This is a broad topic; for background, consider reading this article: https://en.wikipedia.org/wiki/Bolsa_Fam%C3%ADlia
Hypertension
- True or False
Diabetes
- True or False
Alcoholism
- True or False
Handicap
- True or False
SMS_received
- Whether 1 or more messages were sent to the patient.
No-show
- True or False.

Machine Learning Process

The steps taken to accomplish our results include the following:

Data preprocessing.
Create an awaiting time field (days between the scheduled and appointment dates).
Exploratory data analysis.
Pass the data through the machine learning algorithm.
Select the top 10 features that most affect missed appointments and the 10 features that affect them the least.
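
The awaiting time step above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code. The timestamp layout and the function name awaiting_days are assumptions; adjust them to match the real columns.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. "2016-04-29T18:38:08Z"; adjust to the real data.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def awaiting_days(scheduled: str, appointment: str) -> int:
    """Whole calendar days between the scheduling date and the appointment date."""
    scheduled_date = datetime.strptime(scheduled, TIMESTAMP_FORMAT).date()
    appointment_date = datetime.strptime(appointment, TIMESTAMP_FORMAT).date()
    return (appointment_date - scheduled_date).days


# Hypothetical example values, not taken from the dataset.
print(awaiting_days("2016-04-27T08:36:51Z", "2016-04-29T00:00:00Z"))  # 2
print(awaiting_days("2016-04-29T18:38:08Z", "2016-04-29T00:00:00Z"))  # 0
```

Comparing dates rather than full timestamps keeps a same-day booking at 0 days, even when the appointment column carries no time of day.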

The code for the project can be found on my GitHub.

Exploratory Data Analysis

The pie chart below shows 85,299 Yes records (patient shows up to the appointment) and 21,677 No records (patient misses the appointment). This means the dataset is imbalanced, and we need to keep that in mind as we move along.
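
To make the imbalance concrete, the short sketch below turns the two counts quoted above into class shares. The variable names are illustrative. It also shows why plain accuracy can mislead here: a model that always predicts "shows up" would already be right for the majority share of records.

```python
# Counts as quoted in the text above.
counts = {"Yes": 85_299, "No": 21_677}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(shares.values())

# How many "Yes" records there are for every "No" record.
imbalance_ratio = counts["Yes"] / counts["No"]

print(f"Yes share: {shares['Yes']:.1%}")                  # 79.7%
print(f"No share: {shares['No']:.1%}")                    # 20.3%
print(f"Majority-class baseline: {majority_baseline:.1%}")
print(f"Yes per No: {imbalance_ratio:.1f}")
```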

Machine Learning Model

The machine learning model used here was logistic regression with lasso (L1) regularization. Regularization penalizes the model’s cost function so that the model does not overfit. With the lasso penalty, the coefficients of unimportant features are shrunk to exactly zero, so the features that keep a nonzero coefficient are the ones we select.
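
The sketch below illustrates how the lasso penalty drives a coefficient to exactly zero. It is a hand-rolled, dependency-free illustration using proximal gradient descent on made-up toy data, not the project's code; in practice a library implementation of L1-penalized logistic regression would normally be used. The function names, the toy rows, and the penalty strength lam are all assumptions chosen for the example, and the model has no intercept because the toy data is symmetric.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within the threshold becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(rows, labels, lam, lr=0.5, n_iter=2000):
    """Minimize mean log loss + lam * sum(|w|) by proximal gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    for _ in range(n_iter):
        grad = [0.0] * n_features
        for x, y in zip(rows, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for j in range(n_features):
                grad[j] += (p - y) * x[j] / len(rows)
        # Gradient step on the log loss, then the L1 shrinkage step.
        weights = [soft_threshold(w - lr * g, lr * lam)
                   for w, g in zip(weights, grad)]
    return weights


# Made-up toy data: column 0 is a strong signal, column 1 a weak one.
rows = ([(1, 1)] * 4 + [(1, -1)] * 2 + [(1, 1), (1, -1)]
        + [(-1, -1)] * 4 + [(-1, 1)] * 2 + [(-1, 1), (-1, -1)])
labels = [1] * 6 + [0] * 2 + [0] * 6 + [1] * 2

w_plain = fit_l1_logistic(rows, labels, lam=0.0)   # no penalty
w_lasso = fit_l1_logistic(rows, labels, lam=0.15)  # lasso penalty

print("no penalty:", w_plain)
print("lasso:     ", w_lasso)
```

Without the penalty both coefficients are nonzero. With it, the weak feature's coefficient is set to exactly zero while the strong feature keeps a smaller, nonzero coefficient, which is the behavior that makes lasso usable for feature selection.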

Results and Insights

The model selected the features that most affect whether patients miss their appointment, as seen in the figure below.

Feature selections of Appointment No Shows

From the image above, we can split the features into two groups: those associated with being more likely to miss an appointment and those associated with being less likely to miss one.
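
One way such a split can be derived from a fitted model is by the sign of each coefficient. The sketch below assumes that the positive class is "missed appointment", so a positive coefficient points toward missing and a negative one toward attending. The function name, the placeholder feature names, and the coefficient values are invented for illustration and are not the project's results.

```python
def split_by_direction(coefficients, top_n=10):
    """Return (more_likely, less_likely) lists of (feature, coefficient) pairs."""
    # Features zeroed out by the lasso penalty are dropped.
    nonzero = {name: w for name, w in coefficients.items() if w != 0.0}
    ranked = sorted(nonzero.items(), key=lambda item: item[1], reverse=True)
    # Largest positive coefficients first.
    more_likely = [(name, w) for name, w in ranked if w > 0][:top_n]
    # Most negative coefficients first.
    less_likely = [(name, w) for name, w in reversed(ranked) if w < 0][:top_n]
    return more_likely, less_likely


# Placeholder coefficients, made up for this example.
example = {
    "feature_a": 0.8,
    "feature_b": 0.0,
    "feature_c": -0.5,
    "feature_d": 0.2,
    "feature_e": -0.1,
}

more_likely, less_likely = split_by_direction(example)
print(more_likely)   # [('feature_a', 0.8), ('feature_d', 0.2)]
print(less_likely)   # [('feature_c', -0.5), ('feature_e', -0.1)]
```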

More Likely to Miss Appointment

Patients who had a large gap between their scheduled date and their appointment date missed their appointments the most
Interestingly, patients who received an SMS message still missed their appointments
Patients in the Itarare and Santos Dumont neighborhoods were more likely to miss their appointments
Patients aged 13 and 14 were more likely to miss their appointments

Less Likely to Miss Appointment

Patients who were aged 64 and 69
Patients who lived in Santa Martha, Jardim da Penha and Jardim Camburi
Patients who had hypertension