One of the daily challenges for data scientists is selecting an optimal model. Selecting one requires careful data cleaning, such as handling missing values and outliers, yet even a thoroughly cleaned dataset does not guarantee that the optimal model will be found. On top of that, every dataset is unique and calls for a different approach. Many data scientists reach for Random Forest first because it generally performs well; however, it is computationally expensive, and a well-tuned alternative may outperform it. This auto data analysis for regression suggests an optimal model, its hyper-parameters, and the proportion of the training dataset.
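The idea of searching jointly over candidate models and training proportions can be illustrated with a minimal, stdlib-only sketch. The two toy "models" here (a mean predictor and an ordinary-least-squares line) are stand-ins for the real candidates the tool evaluates; the function names and the candidate set are illustrative assumptions, not the tool's actual implementation.

```python
def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b = my - a * mx
    return lambda x: a * x + b

def grid_search(xs, ys, train_fracs=(0.6, 0.7, 0.8)):
    """Try every (model, train proportion) pair; keep the lowest validation MSE."""
    candidates = {"mean": fit_mean, "linear": fit_linear}
    best = None  # (mse, model name, train fraction)
    for frac in train_fracs:
        cut = int(len(xs) * frac)
        tx, ty, vx, vy = xs[:cut], ys[:cut], xs[cut:], ys[cut:]
        for name, fit in candidates.items():
            model = fit(tx, ty)
            mse = sum((model(x) - y) ** 2 for x, y in zip(vx, vy)) / len(vy)
            if best is None or mse < best[0]:
                best = (mse, name, frac)
    return best
```

On a perfectly linear dataset, the search correctly prefers the linear candidate over the mean baseline; the real tool applies the same compare-and-keep-the-best loop across a much larger model grid.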
- python >= 3.9
- py7zr
- Chrome Browser
pip install -r requirements.txt
python main.py -I demo.csv -T AveragePrice -R 0 Date -D type region
For more details on parameters, please run:
python main.py --help
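A sketch of how the flags in the example command might be parsed with `argparse`. The flag meanings are assumptions read off the example (`-I` input CSV, `-T` target column, `-R` columns to drop, `-D` categorical columns to dummy-encode, `-N` neural-network toggle), not the tool's verified interface; run `--help` for the authoritative list.

```python
import argparse

# Hypothetical reconstruction of the CLI; long names and help text are assumptions.
parser = argparse.ArgumentParser(description="Auto data analysis for regression (sketch)")
parser.add_argument("-I", "--input", required=True, help="input CSV file")
parser.add_argument("-T", "--target", required=True, help="target column to predict")
parser.add_argument("-R", "--remove", nargs="*", default=[], help="columns to drop before training")
parser.add_argument("-D", "--dummies", nargs="*", default=[], help="categorical columns to one-hot encode")
parser.add_argument("-N", "--neural", default="False", help="include neural networks in the grid search")

# Parse the exact example from above.
args = parser.parse_args("-I demo.csv -T AveragePrice -R 0 Date -D type region".split())
```

Note that `nargs="*"` lets `-R` and `-D` accept several column names in a row, stopping at the next flag.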
Example:
python decompress.py -F demo.7z
For more details on parameters, please run:
python decompress.py --help
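What `decompress.py` does can be sketched with the `py7zr` dependency listed above. This is a minimal extraction helper, not the script's actual code; the import is guarded so the sketch fails with a clear message when `py7zr` is missing.

```python
try:
    import py7zr
except ImportError:  # pip install py7zr
    py7zr = None

def extract_results(archive="demo.7z", dest="."):
    """Extract a results archive and return the list of restored file names."""
    if py7zr is None:
        raise RuntimeError("py7zr is not installed; run: pip install py7zr")
    with py7zr.SevenZipFile(archive, mode="r") as z:
        names = z.getnames()      # files stored in the archive
        z.extractall(path=dest)   # restore them under dest
    return names
```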
Demo video: demo.Auto.Data.Analysis.Regression.mp4
*If users want to include neural networks in the grid search, please provide "-N True". Note that finding the best model may then take considerably longer.
The best model is saved in a 7z file and named 'best_best.pkl'.
All results are saved in the local folder. To review them again, run the decompress.py file.
main.py -I demo.csv <- analyzes the dataset
decompress.py -F demo.7z <- shows the results again
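Once the archive is extracted, the saved model can be loaded back with `pickle`. The helper below is a sketch, assuming the archive was already decompressed into a local folder; the folder name is an assumption, while 'best_best.pkl' is the file name stated above.

```python
import pickle
from pathlib import Path

def load_best_model(folder=".", name="best_best.pkl"):
    """Load the pickled best model from an already-extracted results folder."""
    path = Path(folder) / name
    with path.open("rb") as f:
        return pickle.load(f)
```

The returned object is whatever estimator the grid search selected, ready for further predictions.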
This tool suggests one of the best models. However, the best model can differ depending on the data preprocessing and analysis you apply. For example, ayami-n found that SVR outperforms Random Forest when the author splits the demo dataset by region.