Predicting the annual medical expenses (in dollars) of a person in order to set an insurance premium that is higher than the predicted expenses, so that the insurer earns a profit.
Used 5 regression algorithms (linear regression, polynomial regression, ridge regression, XGBoost regression and neural network regression) to find the best ML model for predicting a person's medical expenses from features such as age, sex, bmi, region, children and smoker. The data was taken from Kaggle. The data set contains various attributes of 1338 individuals, including the total medical expenses incurred over one year. The attributes of this data set are shown below:
As mentioned, medical charges will be our dependent variable and the rest will be our independent variables.
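A minimal sketch of that split, assuming each row is available as a Python dict. The column name `charges`, the category values (`"male"`, `"yes"`) and the helper names `split_record` and `encode_binary` are assumptions made for illustration, not taken from the project code, and the example record is made up.

```python
# Feature names follow the list given above; "charges" is an assumed
# name for the medical-expense column.
FEATURES = ["age", "sex", "bmi", "children", "smoker", "region"]
TARGET = "charges"


def split_record(record):
    """Split one row into (independent variables, dependent variable)."""
    x = {name: record[name] for name in FEATURES}
    y = record[TARGET]
    return x, y


def encode_binary(features):
    """Map the two-valued categorical fields to 0/1 (assumed category labels)."""
    encoded = dict(features)
    encoded["sex"] = 1 if features["sex"] == "male" else 0
    encoded["smoker"] = 1 if features["smoker"] == "yes" else 0
    return encoded


# Illustrative record only, not a row from the actual data set.
row = {"age": 30, "sex": "male", "bmi": 25.0, "children": 1,
       "smoker": "yes", "region": "southwest", "charges": 1234.5}
x, y = split_record(row)
x_encoded = encode_binary(x)
```

The `region` field is left as a string here; a multi-valued category like this would need one-hot or similar encoding before being passed to a regression model.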
After extensive exploratory data analysis, came to the conclusion that smokers have much higher medical expenses and that this feature is very highly correlated with the medical expenses. Also, the higher the bmi and age, the unhealthier the person tends to be and the higher the medical expenses. Using an ANOVA test, found that region is an insignificant feature for the model, so dropped it. Males have higher average medical expenses than females, and medium-sized families (2-3 children) have higher medical expenses than small (0-1 children) and large (4-5 children) families. Created the 5 regression models and optimized them to raise their r2_score and lower their root mean square error, the two metrics on which the algorithms were judged.
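The ANOVA check and the two evaluation metrics mentioned above can be sketched in plain Python. This is an illustration of the formulas only, not the project's code: the name r2_score suggests a library implementation was used, but that is an assumption, and the function names `anova_f` and `rmse` below are hypothetical. The sketch computes only the F statistic; deciding significance also requires a p-value from the F distribution, which is omitted here.

```python
import math


def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numeric values.

    Here each group would hold the medical expenses of one region.
    """
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rmse(y_true, y_pred):
    """Root mean square error, in the units of the target (dollars)."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A low F statistic means the group means differ little relative to the spread within each group, which is the kind of result that would justify dropping a feature such as region. For the models, a higher r2_score and a lower RMSE both indicate a better fit.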