bus_delay_prediction

Lab for testing multiple machine learning models that use local traffic data to predict the average expected delay at a bus stop.

VERY MUCH A WORK IN PROGRESS

Raw dataset

The complete dataset is private.

Below is a small sample of it. The data used to train the models spans 8 months and contains several million rows.

timestamp	busStopID	stopName	patternText	direction	actualTime	plannedTime	status	routeId	tripId	vehicleId
2024-12-18 18:32:17	1020	Heikendorf, Am Heidberg	15	Mettenhof	18:52	18:52	PREDICTED	1610073983892324384	1610077840787696906	-7638104967705910578
2024-12-18 18:32:17	1005	Belvedere	13	Kiel Hbf	18:33	18:33	PREDICTED	1610073983892324382	1610077840787574030	-7638104967705910932
2024-12-18 18:32:17	1005	Belvedere	11	Wik Kanal	18:37	18:35	PREDICTED	1610073983892324359	1610077840787451151	-7638104967705910820
2024-12-18 18:32:17	1005	Belvedere	12	Schilksee	18:37	18:36	PREDICTED	1610073983892324353	1610077840787565838	-7638104967705910672
2024-12-18 18:32:17	1005	Belvedere	11	Dietrichsdorf	18:39	18:38	PREDICTED	1610073983892324359	1610077840787479822	-7638104967705910884
2024-12-18 18:32:17	1005	Belvedere	11	Wik Kanal	18:44	18:43	PREDICTED	1610073983892324359	1610077840787467534	-7638104967705910912
2024-12-18 18:32:17	1005	Belvedere	6	Hassee	18:46	18:46	PREDICTED	1610073983892324356	1610077840787827991	-7638104967705910810
2024-12-18 18:32:17	1005	Belvedere	12	Kiel Hbf	18:48	18:48	PREDICTED	1610073983892324353	1610077840787569934	-7638104967705910704
2024-12-18 18:32:17	1005	Belvedere	11	Dietrichsdorf	18:48	18:48	PREDICTED	1610073983892324359	1610077840787455247	-7638104967705910898
2024-12-18 18:32:17	1005	Belvedere	13	Strande	18:51	18:51	PREDICTED	1610073983892324382	1610077840788143370	-7638104967705910868
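
The delay to predict is presumably the difference between actualTime and plannedTime. A minimal sketch of that calculation, assuming both columns are HH:MM strings as in the sample and that a gap of more than 12 hours means the pair straddles midnight. The helper name delay_minutes is hypothetical and not taken from the repository.

```python
from datetime import datetime, timedelta


def delay_minutes(planned: str, actual: str) -> int:
    """Return actual minus planned in whole minutes for HH:MM strings."""
    fmt = "%H:%M"
    diff = datetime.strptime(actual, fmt) - datetime.strptime(planned, fmt)
    # Assumption: a gap of more than 12 hours means the times straddle midnight.
    if diff < timedelta(hours=-12):
        diff += timedelta(days=1)
    elif diff > timedelta(hours=12):
        diff -= timedelta(days=1)
    return int(diff.total_seconds() // 60)


# (plannedTime, actualTime) pairs taken from the sample rows above.
print(delay_minutes("18:35", "18:37"))  # 2
print(delay_minutes("18:52", "18:52"))  # 0
```

A negative result means the bus is expected earlier than planned.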

Todo / Notes

Feature Engineering: handle duplicated entries, since the data was collected by a query every 15 minutes
Feature Engineering: add more features such as holiday yes/no, Kieler Woche yes/no, maybe even weather or rush hour
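
Both notes could be approached roughly as follows. This is only a sketch using the standard library: it assumes a repeated poll can be identified by (tripId, busStopID, plannedTime), and the rush-hour windows and the holidays argument are placeholder assumptions. The function names are hypothetical, and the earlier 18:17:17 poll in the usage part is made up for illustration.

```python
from datetime import datetime


def deduplicate(rows):
    """Keep only the most recent poll for each (tripId, busStopID, plannedTime)."""
    latest = {}
    for row in rows:
        key = (row["tripId"], row["busStopID"], row["plannedTime"])
        # Timestamps in "YYYY-MM-DD HH:MM:SS" form sort correctly as strings.
        if key not in latest or row["timestamp"] > latest[key]["timestamp"]:
            latest[key] = row
    return list(latest.values())


def add_calendar_features(row, holidays=frozenset()):
    """Attach simple calendar flags; the rush-hour windows are assumed values."""
    ts = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
    out = dict(row)
    out["weekday"] = ts.weekday()  # Monday is 0
    out["is_holiday"] = ts.date().isoformat() in holidays
    out["is_rush_hour"] = ts.weekday() < 5 and (6 <= ts.hour < 9 or 16 <= ts.hour < 19)
    return out


# Two polls of the same trip; the first one is invented for illustration.
rows = [
    {"timestamp": "2024-12-18 18:17:17", "busStopID": "1005",
     "plannedTime": "18:35", "actualTime": "18:36", "tripId": "1610077840787451151"},
    {"timestamp": "2024-12-18 18:32:17", "busStopID": "1005",
     "plannedTime": "18:35", "actualTime": "18:37", "tripId": "1610077840787451151"},
]
unique = deduplicate(rows)
features = add_calendar_features(unique[0])
```

Keeping the latest poll is one possible choice; the earliest poll or an average over polls would be equally easy to swap in.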

Analysis of the used methods

Linear Regression (Ridge)

WIP

HistGradientBoostingRegressor

The HistGradientBoostingRegressor is a regression model that uses gradient-boosted decision trees with a histogram-based approach. It bins continuous features into discrete bins, which makes training faster and more memory-efficient, especially on large datasets. The model handles missing values and categorical features natively. It builds trees sequentially: each new tree is fitted to correct the errors of the previous ones, which reduces the chosen loss (for example squared or absolute error).

MAE (Mean Absolute Error): 0.97 minutes
RMSE (Root Mean Square Error): 1.94 minutes
R^2 (R-squared Score): 0.0752

Training R^2 Score: 0.2247
Test R^2 Score: 0.0752

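The large gap between the training R^2 score (0.2247) and the test R^2 score (0.0752) suggests overfitting. Both scores are also low in absolute terms, so the model explains only a small share of the variance in the delays.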
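
For reference, the three metrics follow the standard definitions. The sketch below spells them out in plain Python on made-up numbers; it is not the evaluation code of this project.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up delays in minutes, only to exercise the functions.
y_true = [0, 1, 2, 3]
y_pred = [0, 1, 2, 5]
scores = (mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```

RMSE penalizes large misses more than MAE does, so an RMSE about twice the MAE, as reported above, points to a few large errors.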

In general, the charts show that the model still underestimates delays. Feature engineering is the next thing to try, since delays longer than 5 minutes are already weighted more heavily.
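
The weighting of long delays could look roughly like this. The 5-minute threshold comes from the text above; the factor 3.0 and the function name delay_weights are assumptions, since the actual weights are not documented here.

```python
def delay_weights(delays, threshold=5.0, heavy_weight=3.0):
    """Return one weight per sample; delays above the threshold count more.

    heavy_weight is a hypothetical value, not the one used in this project.
    """
    return [heavy_weight if d > threshold else 1.0 for d in delays]


weights = delay_weights([0, 5, 6, 12])
# The weights can then be passed to the regressor, for example:
# model.fit(X_train, y_train, sample_weight=weights)
```

HistGradientBoostingRegressor.fit accepts such a list through its sample_weight argument.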
