Healthcare Provider Fraud Detection And Analysis using Machine learning:-

Jun 6, 2021

A binary classification problem: detecting whether a particular provider is fraudulent or not, using a Kaggle dataset.

Healthcare is essential in people's lives, and it must be affordable. The healthcare industry is an intricate system with many moving parts, and it is expanding rapidly. At the same time, fraud in this industry is becoming a critical problem. Healthcare fraud is an organized crime in which providers, physicians, and beneficiaries act together to file fraudulent claims. Every industry wants to make a profit, and some players will pursue it by any means, so an ordinary beneficiary often does not even realize that they are being defrauded. In this blog, we come up with a solution to predict potentially fraudulent providers based on the claims they file. We also discover which variables are most helpful in detecting the behavior of potentially fraudulent providers.

1. Business Problem:-

Predict whether a particular provider is fraudulent or not, using machine learning algorithms.

Discover the features that best reveal which providers are committing fraud.

Business constraints:-

Since this is a fraud detection task, there is no strict time (latency) limit.

Errors can be very costly.

2. Performance Metric:-

For the performance metrics, we used the AUC score, the F1 score, and the confusion matrix.

Confusion matrix:

A confusion matrix is a table used to investigate the performance of a classification model where the actual test values are known. It has two rows and two columns describing the true positives, false positives, false negatives, and true negatives.

F1 score-

The F1 score is a measure that combines precision and recall. It is the harmonic mean of the two. The harmonic mean is another way to calculate an “average” of values, and it is generally considered more suitable for ratios (such as precision and recall) than the traditional arithmetic mean. The formula used for the F1 score in this case is:

F1 = 2 * (precision * recall) / (precision + recall)

AUC score-

AUC stands for Area Under the Curve, where the curve is typically the ROC curve. It is used in classification analysis to determine which of the models separates the classes best.
AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0, a model whose predictions are 100% correct has an AUC of 1, and random guessing gives about 0.5.
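To make the three metrics concrete, here is a minimal sketch in plain Python. The function names and the toy labels and scores are assumptions for illustration only; they are not taken from the project code, where a library implementation would normally be used.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall (0.0 when there are no true positives)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc_score(y_true, y_score):
    """ROC AUC as the chance that a random positive outranks a random negative.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: true labels and model scores for six providers.
y_true = [1, 0, 1, 0, 0, 1]
y_score = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # 0.5 cut-off is an assumption

print(confusion_counts(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(auc_score(y_true, y_score))
```

Note that F1 depends on the chosen cut-off while AUC does not, which is one reason a model can have a high AUC and a much lower F1 score.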

3. Application to machine learning algorithm:-

From the given dataset we create some interesting features using exploratory data analysis. Then, for a given set of features, we predict whether the corresponding provider should be flagged as potential fraud or not.

4. Source of Data:-

The whole dataset is downloaded from Kaggle. The link is below.

https://www.kaggle.com/rohitrox/healthcare-provider-fraud-detection-analysis

Data set and column analysis:-

We have a total of four CSV files: inpatient data, outpatient data, beneficiary data, and target data.

A. Inpatient and Outpatient Data:-

The inpatient and outpatient files contain almost the same columns. The inpatient file holds the claims of patients who were admitted to the hospital. A total of 30 columns are given. We will discuss them one by one.

BeneID:- a unique id given to each beneficiary.
ClaimID:- a unique id for each claim submitted.
Claim start date and end date:- when each claim was started and when it was settled.
Provider:- a unique id for each healthcare provider.
InscClaimAmtReimbursed:- how much money the beneficiary gets back after submitting medical documents to the insurance company.
AttendingPhysician:- the id of the doctor who attended the particular beneficiary.
OperatingPhysician:- the id of the physician who operated on the patient.
Other Physician:- the id of any other physician who attended the patient.
DeductibleAmtPaid:- suppose the hospital bill is 25k and the deductible amount is 10k; then the beneficiary has to pay the first 10k and the insurance company pays the remaining 15k.
Admission date and discharge date:- when a patient is admitted to and discharged from the hospital.
ClmAdmitDiagnosisCode:- the code assigned at admission for the disease suggested by the patient's initial symptoms.
DiagnosisGroupCode:- the code for the overall treatment of the patient.
Claim Diagnosis Code 1-10:- the codes of the diagnoses for which the beneficiary is claiming reimbursement.
ClmProcedureCode_1-6:- the codes of the actual treatment procedures the patient underwent.

B. Beneficiary Data:-

A total of 25 columns are present here. We will discuss them one by one.

BeneID:- a unique id given to each beneficiary registered with the company.
DOB, DOD:- the date of birth and date of death of each beneficiary.
Gender:- the gender of each beneficiary.
Race, state, country:- the race of each beneficiary and the state and country they belong to.
Paramedical condition:- the remaining columns describe the paramedical condition of the beneficiary, like ChronicCond_Heartfailure, ChronicCond_KidneyDisease, ChronicCond_Cancer, ChronicCond_Diabetes, ChronicCond_IschemicHeart, ChronicCond_Osteoporasis, ChronicCond_rheumatoidarthritis, ChronicCond_stroke.

C. Target data:-

Two columns are present here: provider and target.

Provider:- the unique id of each healthcare provider.
Target:- this column has two values, yes and no, indicating whether the corresponding provider is flagged as potentially fraudulent or not.

5. Data cleaning:

This data contains lots of NULL values, so before any experiment we have to handle them carefully, since missing values can themselves be a great source of information.
At first, we have to merge our three CSV files.
In the DOD column, lots of values are NULL, which means the patients are alive. We put today's date there just to calculate the age of the patients.
Under the paramedical condition, lots of diseases are present, with values mainly 1, 2, and Yes. In place of 2 we put zero, meaning no disease, and in place of Yes we put 1.
Claim Diagnosis Code 1-10 and ClmProcedureCode_1-6 make up 16 columns in total whose values are different codes. We replace the missing values with zero and the actual values with 1, just to calculate how many treatment procedures a patient underwent. In this way, we cleaned the dataset.
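The cleaning steps above can be sketched as follows, assuming each merged claim is a plain Python dict. The helper name clean_row, the shortened column lists, the exact column spellings, and the sample row are hypothetical. Age is only approximated as days // 365, and the "today" date is passed in as a parameter so that the result is reproducible.

```python
from datetime import date

# Shortened, illustrative column lists (the real data has more of each).
CHRONIC_COLS = ["ChronicCond_Heartfailure", "ChronicCond_Cancer"]
CODE_COLS = ["ClmDiagnosisCode_1", "ClmDiagnosisCode_2", "ClmProcedureCode_1"]


def clean_row(row, today):
    """Return a cleaned copy of one merged claim row."""
    out = dict(row)

    # A missing DOD means the patient is alive, so use `today` to get the age.
    end = out["DOD"] if out["DOD"] is not None else today
    out["Age"] = (end - out["DOB"]).days // 365  # approximate age in years

    # Chronic conditions: 2 becomes 0 (no disease); 1 and "Yes" become 1.
    for col in CHRONIC_COLS:
        out[col] = 0 if out[col] == 2 else 1

    # Diagnosis and procedure codes: missing becomes 0, any real code becomes 1.
    for col in CODE_COLS:
        out[col] = 0 if out[col] is None else 1

    # With 0/1 indicators, the sum is the number of codes on the claim.
    out["n_codes"] = sum(out[col] for col in CODE_COLS)
    return out


raw = {
    "DOB": date(1940, 1, 1),
    "DOD": None,
    "ChronicCond_Heartfailure": 2,
    "ChronicCond_Cancer": "Yes",
    "ClmDiagnosisCode_1": "4019",
    "ClmDiagnosisCode_2": None,
    "ClmProcedureCode_1": None,
}
cleaned = clean_row(raw, today=date(2021, 6, 6))
print(cleaned["Age"], cleaned["ChronicCond_Heartfailure"], cleaned["n_codes"])
```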

6. Exploratory Data analysis:-

Conclusion:

From this exploratory data analysis, we try to come up with some interesting features. From the above diagram, most of the fraud cases happened in country codes 200, 470, 400 and state codes 5, 10, 33, 39.
So we take the top 15 states and countries where the maximum fraud is happening, and we create two features, fraud country and fraud state. If a patient belongs to one of those states or countries, we put 1 in the corresponding feature, because they are most likely to be victims of fraud.
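A minimal sketch of how such a flag could be built, assuming each claim is a dict that already carries a 0/1 fraud label. The key name is_fraud, the helper names, and the toy rows are hypothetical; the example uses the top 2 states instead of 15 to keep the data small.

```python
from collections import Counter


def top_fraud_values(rows, column, n=15):
    """Values of `column` seen most often among claims labelled as fraud."""
    counts = Counter(r[column] for r in rows if r["is_fraud"] == 1)
    return {value for value, _ in counts.most_common(n)}


def add_flag(rows, column, flag_name, top_values):
    """Set flag_name to 1 when the row's value is in top_values, else 0."""
    for r in rows:
        r[flag_name] = 1 if r[column] in top_values else 0


rows = (
    [{"State": 5, "is_fraud": 1} for _ in range(3)]
    + [{"State": 10, "is_fraud": 1} for _ in range(2)]
    + [{"State": 33, "is_fraud": 1}]
    + [{"State": 39, "is_fraud": 0} for _ in range(4)]
)

top_states = top_fraud_values(rows, "State", n=2)
add_flag(rows, "State", "fraud_state", top_states)
print(sorted(top_states))
```

The same two helpers can be reused for the country column to build the fraud country feature.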

Conclusion:

In the first diagram, the x-axis represents the difference between the total number of procedure codes and the total number of diagnosis codes for each patient. From the graph we can see that for the patients who are victims of fraud, this difference is one. So from this analysis we create a feature called difference, which holds the difference between these two counts for each patient.
In the second diagram, we calculate for each patient how many diseases he/she suffers from. We can see that the patients who suffer from four different diseases are cheated by providers the most.

Conclusion:

We already know that healthcare fraud is an organized crime in which providers, physicians, and beneficiaries act together to make fraudulent claims. In the above graph, zero means the patient's admission and discharge dates are missing, and one means the dates are present.
We can see that the patients whose dates are missing are the victims of fraud.

Conclusion:

The first graph is the PDF (probability density function) of the age of the patients: the x-axis represents the age of the patient, and the y-axis represents how many patients are present at that particular age. Most patients are between 80 and 90 years old. For the fraud case, the age feature does not give much information to separate the fraud data from the non-fraud data.
The second graph represents the PDF of the number of beneficiary ids held by each provider: the x-axis represents the number of beneficiaries, and the y-axis represents the number of providers. Most of the providers who are doing fraud hold fewer than 500 beneficiary ids.

Conclusion:

The first graph concerns two codes in the data, DiagnosisGroupCode and ClmAdmitDiagnosisCode: one is given to the patient at the time of admission, and the other is the code of the treatment the patient actually goes through. We want to check whether these two codes are the same or not. One means they are the same and zero means they are not. We observed that in most of the cases, the victims of fraud have the same code.
The second graph shows whether a patient has to pay money after the treatment despite having insurance. Zero means there is no need to pay, and one means the patient has to pay. We observed that most of the patients who do not pay any money are victims of fraud.

7. Data Preparation:

In this section, we prepare the data for modelling.

We are doing a provider fraud analysis, so after creating all the features we have to express the data in terms of providers. Each provider has lots of beneficiaries, and so far we have created the features for each beneficiary. So we have to group the features by provider.
We take features like Claim Diagnosis Code 1 to 9, fraud state, etc., sum up their values grouped by provider, and build a dataset called Train_sum.
In the same way, we create Train_Data_mean, which holds the mean value of the features grouped by provider. Here we take columns like InscClaimAmtReimbursed, DeductibleAmtPaid, admitted days, ChronicCond_Alzheimer, ChronicCond_Heartfailure, ChronicCond_Cancer, etc.
We then create Train_Count, where for each provider we count how many BeneID and ClaimID values are present.
Finally, we merge the three datasets Train_Count, Train_Data_mean, and Train_sum. This is the final data, and it is ready for classification.
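The provider-level aggregation can be sketched in plain Python as below. This is only an illustration under assumptions: the column lists are shortened, the names n_diag_codes, n_claims, and n_beneficiaries are hypothetical, and the sum, mean, and count features are built in one pass rather than as three separate tables that are merged afterwards. The resulting per-provider features are the same either way.

```python
from collections import defaultdict

SUM_COLS = ["fraud_state", "n_diag_codes"]                   # summed per provider
MEAN_COLS = ["InscClaimAmtReimbursed", "DeductibleAmtPaid"]  # averaged per provider


def aggregate_by_provider(rows):
    """Turn claim-level rows into one feature dict per provider."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["Provider"]].append(r)

    table = {}
    for provider, claims in groups.items():
        feats = {}
        for col in SUM_COLS:
            feats[col + "_sum"] = sum(r[col] for r in claims)
        for col in MEAN_COLS:
            feats[col + "_mean"] = sum(r[col] for r in claims) / len(claims)
        # Counts of distinct claims and distinct beneficiaries.
        feats["n_claims"] = len({r["ClaimID"] for r in claims})
        feats["n_beneficiaries"] = len({r["BeneID"] for r in claims})
        table[provider] = feats
    return table


claims = [
    {"Provider": "P1", "ClaimID": "C1", "BeneID": "B1", "fraud_state": 1,
     "n_diag_codes": 3, "InscClaimAmtReimbursed": 100, "DeductibleAmtPaid": 0},
    {"Provider": "P1", "ClaimID": "C2", "BeneID": "B1", "fraud_state": 0,
     "n_diag_codes": 2, "InscClaimAmtReimbursed": 300, "DeductibleAmtPaid": 100},
    {"Provider": "P2", "ClaimID": "C3", "BeneID": "B2", "fraud_state": 1,
     "n_diag_codes": 1, "InscClaimAmtReimbursed": 50, "DeductibleAmtPaid": 10},
]
provider_table = aggregate_by_provider(claims)
print(provider_table["P1"])
```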

This is the heatmap of the correlation between the features. From it we can see that some features do not add much value for determining our target, so we remove them.

8. Machine Learning Model:-

Now we have reached the part where we are well equipped with the dataset to solve our problem. We will experiment with different approaches to develop our first-cut model, and then choose the best-performing model to reach our objective.

In this plot, the red bars are the AUC values of each model and the green bars are the F1 scores of each model.
After giving the data to all the models, random forest gives the best values: the AUC is 0.94 and the F1 score is 0.52.

One of the major advantages of tree-based algorithms is that they use the information gain formulation for splitting the data into branches. The column with the maximum information gain is chosen for the split and marked as an important feature. The is-deductible feature is the most important feature here.
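As a small illustration of information gain, the sketch below computes the entropy reduction obtained by splitting on a categorical feature. The function names and the toy data are assumptions, and this is not the project code; also note that library implementations may use a different impurity measure by default (for example, scikit-learn's random forest defaults to Gini impurity, with entropy as an option).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


labels = [1, 1, 0, 0]        # 1 = fraud, 0 = not fraud
informative = [1, 1, 0, 0]   # separates the classes perfectly
useless = [1, 0, 1, 0]       # tells us nothing about the class

print(information_gain(informative, labels))
print(information_gain(useless, labels))
```

A feature that separates the classes well gets a high gain and is preferred for the split, which is why it also ends up high in the feature importance ranking.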

9. Ensemble Model:-

This is the block diagram of our ensemble model. At first, we divide our data into two parts, D1 and D3, where D1 is 80 percent of the data and D3 is 20 percent. Then we divide D1 into two parts, D4 and D2. We have a total of n base classifiers, and to train each of them we draw a sample from D4 randomly with replacement. For prediction, we use D2. This gives us n sets of predictions, which we merge into one dataset whose label is the class label of D2. We use this dataset to train the meta classifier. To test the meta classifier, we pass D3 through the n base classifiers and get n sets of predictions. We merge those values and use them as the test data for our meta classifier.
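The data flow of this scheme can be sketched as follows. The sketch covers only the splits, the bootstrap sampling, and the construction of the meta-classifier inputs. MeanThresholdStump is a hypothetical stand-in for the real base classifiers, the 50/50 split of D1 into D4 and D2 is an assumption (the ratio is not stated above), and the random toy data is for illustration only.

```python
import random


def split(rows, frac, rng):
    """Shuffle a copy of rows and split it into (first frac, remainder)."""
    rows = list(rows)
    rng.shuffle(rows)
    cut = int(len(rows) * frac)
    return rows[:cut], rows[cut:]


class MeanThresholdStump:
    """Stand-in base classifier: thresholds one feature at its training mean."""

    def __init__(self, feature):
        self.feature = feature
        self.threshold = 0.0

    def fit(self, rows):
        values = [x[self.feature] for x, _ in rows]
        self.threshold = sum(values) / len(values)
        return self

    def predict(self, x):
        return 1 if x[self.feature] > self.threshold else 0


def build_stack(data, n_base, n_features, seed=0):
    rng = random.Random(seed)
    d1, d3 = split(data, 0.8, rng)  # D1 = 80 percent, D3 = 20 percent (test)
    d4, d2 = split(d1, 0.5, rng)    # D4 trains base models, D2 trains the meta model

    base_models = []
    for i in range(n_base):
        sample = [rng.choice(d4) for _ in d4]  # sample D4 with replacement
        base_models.append(MeanThresholdStump(i % n_features).fit(sample))

    def meta_features(rows):
        # One column per base classifier, one row per example.
        return [[m.predict(x) for m in base_models] for x, _ in rows]

    meta_train = (meta_features(d2), [y for _, y in d2])
    meta_test = (meta_features(d3), [y for _, y in d3])
    return base_models, meta_train, meta_test


rng = random.Random(42)
data = [([rng.random() for _ in range(3)], rng.randint(0, 1)) for _ in range(100)]
base_models, (X_meta_train, y_meta_train), (X_meta_test, y_meta_test) = build_stack(
    data, n_base=5, n_features=3
)
print(len(X_meta_train), len(X_meta_test), len(X_meta_train[0]))
```

The meta classifier would then be fitted on X_meta_train and y_meta_train and evaluated on X_meta_test and y_meta_test.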

Result:

All the models give almost the same result as the meta classifier, so we go with logistic regression.

Model test F1 score is :  0.49134948096885817
Model test AUC score is :  0.9086282032800701

10. Productionisation:

For productionisation, we used a Flask API and deployed our model on an AWS EC2 instance. When the user enters a provider id on the webpage, we fetch the data for that provider id from the dataset and pass it to the n base classifiers, which are already pre-trained. We merge the n predictions and pass them to the meta classifier, which gives the final output: whether the particular provider is a fraud or not.
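The prediction step that such an API would call can be sketched as below. The Flask route and the model loading are left out; the function name predict_provider, the feature names, the provider id, and the simple callables standing in for the pre-trained base and meta classifiers are all hypothetical.

```python
def predict_provider(provider_id, features_by_provider, base_models, meta_model):
    """Return "Fraud" or "Not fraud" for a provider id, or None if the id is unknown."""
    features = features_by_provider.get(provider_id)
    if features is None:
        return None
    # Each base classifier gives one prediction; together they feed the meta classifier.
    base_predictions = [model(features) for model in base_models]
    return "Fraud" if meta_model(base_predictions) == 1 else "Not fraud"


# Stand-ins for the pre-trained models (simple rules, for illustration only).
base_models = [
    lambda f: 1 if f["n_claims"] > 100 else 0,
    lambda f: 1 if f["reimbursed_mean"] > 5000 else 0,
]
meta_model = lambda preds: 1 if sum(preds) >= 1 else 0

features_by_provider = {
    "PRV001": {"n_claims": 250, "reimbursed_mean": 900},
    "PRV002": {"n_claims": 12, "reimbursed_mean": 300},
}

print(predict_provider("PRV001", features_by_provider, base_models, meta_model))
print(predict_provider("PRV002", features_by_provider, base_models, meta_model))
```

Returning None for an unknown provider id lets the web layer decide how to report the error to the user.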

11. Future work:

We can extend our work one step further by employing a deep multilayer perceptron model (a deep learning technique), with the ReLU activation function in the hidden layers and Sigmoid/Softmax in the final layer.

12. LinkedIn and GitHub Repository:-

https://www.linkedin.com/in/soumya-kundu-59360a165/

github.com/kundusoumya98/Provider-Fraud-Detection

Written by Soumya Kundu