This project deals with a Recipe CSV file and a Rating CSV file, both of which can be found on food.com. Our exploratory data analysis on this data set can be found here. In particular, this analysis will focus on predicting the number of ingredients in a recipe based on several factors. The number of ingredients, while a discrete variable, can take infinitely many values, so a regression model must be used. To evaluate the regression we build, we will use R² as our evaluation metric, since it tells us how well our predictions explain the data. While we could use mean squared error instead, with such a large data frame we would get extremely large RMSE values, and in general it would be hard to judge how good the predictions actually are; R² gives a more concise answer to that question. The features we will use to build the regression generally pertain to nutrition, rating, and number of steps.
In order to build the baseline model we need to figure out exactly which features we are using. In our case we are using nutrition (number of calories and percent recommended daily intake of fat, saturated fat, sugar, sodium, carbohydrates, and protein), number of steps in the recipe, time to make the dish in minutes, and average rating of the recipe. All of these features are quantitative; something like a tag cannot be used, since tags can only be added after the ingredients are already known. However, we cannot use these features raw; we must transform them first. For example, percent recommended daily intake does not give an exact value, so we must convert it to an actual amount: the recommended daily fat intake is 72.5, so we multiply the percent value by 72.5/100. We did this for all the nutritional columns except calories. Another transformation was imputing the missing values in ratings, which we did by replacing them with the mean rating. All of this was done inside the pipeline. For prediction we will use a Random Forest Regressor for now. First, though, we must split our data into a training set and a testing set so that we do not overfit and fail to predict unseen data correctly. Under this model we get an R² value of .38, which means the model is moderately good at predicting the number of ingredients in a recipe, but we can do better.
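A minimal sketch of this baseline pipeline, assuming hypothetical column names (`avg_rating`, `n_steps`, `minutes`, etc.) and synthetic stand-in data in place of the food.com frames; the daily-value reference amounts besides fat's 72.5 are also assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Assumed reference daily amounts used to turn percent-daily-value columns
# into actual quantities (e.g. fat: pct * 72.5 / 100).
DAILY_VALUES = {'fat': 72.5, 'sat_fat': 24.0, 'sugar': 50.0,
                'sodium': 2300.0, 'carbs': 295.0, 'protein': 50.0}
pct_cols = list(DAILY_VALUES)
scale = np.array([DAILY_VALUES[c] / 100 for c in pct_cols])

preprocess = ColumnTransformer([
    # percent daily value -> actual amount
    ('pdv_to_amount', FunctionTransformer(lambda X: X * scale), pct_cols),
    # fill missing average ratings with the training-set mean
    ('impute_rating', SimpleImputer(strategy='mean'), ['avg_rating']),
], remainder='passthrough')  # calories, n_steps, minutes pass through as-is

model = Pipeline([
    ('preprocess', preprocess),
    ('forest', RandomForestRegressor(n_estimators=50, random_state=42)),
])

# Synthetic stand-in data, just to show the pipeline fitting end to end.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    **{c: rng.uniform(0, 100, n) for c in pct_cols},
    'avg_rating': np.where(rng.random(n) < 0.1, np.nan,
                           rng.uniform(1, 5, n)),
    'calories': rng.uniform(50, 1500, n),
    'n_steps': rng.integers(1, 20, n),
    'minutes': rng.integers(5, 180, n),
})
y = rng.integers(2, 25, n)

X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=1)
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # R^2 on the held-out split
```

Doing the imputation inside the pipeline matters: the mean is learned from the training split only, so no information leaks from the test set.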
To start, when we look at the data there appear to be many recipes with extremely high values across all of the nutrition columns, so one thing we can do is normalize all of the non-calorie nutrition. We will use calories as our comparison unit: dividing each of the other nutritional columns by calories gives us nutrition per calorie. This helps identify which recipes actually have a high amount of a certain nutrient, rather than recipes that simply make many servings. Also, when we plot calories against number of ingredients, the relationship appears to follow a square-root curve, so we will take the square root of the calories column as well. Implementing all of this into our new pipeline improves our R² value to .40, a .02 improvement over before. Some other predictors we could use are Lasso, SGD, K-nearest neighbors, and a decision tree, but they all perform significantly worse than the random forest, likely because the forest handles large data sets well. Unfortunately, the forest regressor takes quite a while to run, so it is difficult to test many hyperparameters; we will simply tune the max depth of the forest (essentially how deep each tree is allowed to grow). After several rounds of grid search we find the best depth is 90: anything higher overfits, improving the training score while hurting the test score, and anything lower underfits, hurting both. After doing this we get an R² value of .41. This is still not a great R² value, since it means only 41% of the variability is explained by the regression, but it is a decent score. In total, we improved our model to explain 3% more of the variability than the baseline.
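The per-calorie normalization, square-root transform, and max-depth grid search can be sketched as below, again on synthetic data with assumed column names (the small tree count and depth grid are illustration choices, not the project's actual settings):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

NUTRITION = ['fat', 'sat_fat', 'sugar', 'sodium', 'carbs', 'protein']

def engineer(df):
    """Divide nutrition columns by calories, then sqrt-transform calories."""
    out = df.copy()
    cal = out['calories'].replace(0, np.nan)  # guard against division by zero
    for c in NUTRITION:
        out[c] = out[c] / cal                 # nutrition per calorie
    out['calories'] = np.sqrt(out['calories'])
    return out.fillna(0)

# Tune only max_depth, since the forest is slow to fit repeatedly.
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=25, random_state=42),
    param_grid={'max_depth': [30, 60, 90, 120]},
    scoring='r2',
    cv=3,
)

# Synthetic demo data standing in for the real recipe frame.
rng = np.random.default_rng(0)
df = pd.DataFrame({c: rng.uniform(1, 50, 120) for c in NUTRITION})
df['calories'] = rng.uniform(50, 1500, 120)
y = rng.integers(2, 25, 120)

X = engineer(df)
grid.fit(X, y)  # grid.best_params_ holds the selected max_depth
```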
While the model seems reasonably good, we do not yet know whether there are any systematic failures in its predictions, such as predicting one group of recipes notably worse than others. So we will test how well it predicts on either side of a calorie threshold. Our X group is calories, and we are predicting the number of ingredients (Y group). The null hypothesis is that the R² value for predictions when calories are below 500 equals the R² value when calories are above 500. The alternative hypothesis is that the two are not equal. Our test statistic is simply |R² when calories are above 500 − R² when calories are below 500|. Running the entire data set through the regressor, we get a test statistic of .003, which at first glance seems quite low. To check, we run 5000 permutation tests, shuffling whether each data point is labeled as below 500 calories and collecting a test statistic for each shuffle. We find p = .7296, meaning 72.96% of the shuffles had a test statistic greater than or equal to our regressor's .003. Using a .05 significance level, we cannot reject the null hypothesis that the R² value when calories are below 500 equals the R² value when calories are above 500. Hence it is unlikely the model is biased with respect to calories.