Date of Submission


Document Type


Degree Name

Master of Science in Industrial Engineering


Mechanical and Industrial Engineering


Dr. Gökhan Eğilmez

Committee Member

Dr. Ridvan Gedik

Committee Member

Dr. Nadiye O. Erdil

Committee Member

Dr. Ceyda Mumcu


Crash severity, Game outcomes, Support Vector Machine, Neural Network (NN), Random Forest (RF)


Machine learning, Traffic accidents, Soccer, Regression analysis


Machine learning has become a widely studied field of data science in recent years across many industries and disciplines. In this thesis, two problems (crash severity prediction and soccer game outcome prediction) were investigated using a set of machine learning approaches: Ridge regression, Lasso regression, Support Vector Machine (SVM), Neural Network (NN), and Random Forest (RF).

The first study investigated the critical factors affecting crash severity using comprehensive, state-wide, time-series traffic crash data. The dataset covers crashes that occurred in the state of Connecticut between 1995 and 2014. Traffic crashes are an increasing cause of death and injury worldwide. The overall purposes of the first study were to propose, develop, and implement machine learning approaches for predicting the injury severity of the people involved in crashes and to identify the crash predictors that contribute most to injury severity. The predictor variables included road and vehicle conditions, characteristics of drivers and passengers, and environmental conditions. Results indicate that RF provided the best prediction accuracy, 73.85%, in classifying a crash by its severity: fatal, injury, or property damage only. In addition to the overall comparison of the proposed machine learning approaches in terms of accuracy, the prediction results were combined with the economic loss associated with each severity level to provide managerial insights on estimating the financial consequences of traffic crashes. RF also provided the importance of each predictor in affecting the severity levels of the people involved. The ejection status of the driver or passenger was found to be the most crucial factor leading to the most severe injuries. In addition, a time series analysis of the 20 years of crash data was conducted. The results demonstrated that the prediction accuracy of RF increased over successive time periods and that the importance of some predictors also changed. From a policy-making perspective, strict enforcement against drunk driving and drug use could lead to substantial road safety improvement. Ejection status is the key risk factor for the fatal and incapacitating severity levels. The use of seat belts significantly reduces the risk of passengers being ejected from the vehicle when a crash occurs.

In the second study, game data from the five most recent seasons of three major leagues were scraped. The leagues were two top European leagues, the Spanish La Liga and the English Premier League (EPL), and one US league, Major League Soccer (MLS). The purpose of the study was to develop statistically credible machine learning approaches to predict soccer game outcomes and to investigate the significance of the predictors (game statistics). Unlike previous closely related studies, the proposed machine learning models were applied not only to the combined dataset of the three leagues but also to each league separately, in order to compare prediction performance and important predictors. The best prediction performance was achieved by NN, with an accuracy of 85.71% (+/- 0.73%) on the combined dataset. For each individual league, RF had the best performance. RF also provided the importance of each predictor. The results showed that home-field advantage was more evident in MLS games than in the two European leagues. Whether a team played at home or away was the most critical predictor of MLS game outcomes. Although it was also an important predictor for La Liga and EPL games, the most influential predictor in those leagues was the difference in the number of shots on target between the home team and the away team. For all three leagues, the number of crosses was the most significant pass type, and the difference in the rate of cards per foul was the most crucial card statistic. The difference in the rate of cards per foul is determined primarily by the referee. In La Liga and EPL, the differences in the numbers of counter attacks and open plays were the consequential attempt types affecting the game result, while in MLS the difference in the number of set pieces was the most crucial predictor variable.
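Several of the predictors named above are home-minus-away differences of game statistics. As a rough illustration only, the following Python sketch shows how two such difference features could be derived from per-team match statistics. The class, function, and field names and the sample numbers are hypothetical and are not taken from the thesis; the zero-foul fallback is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class TeamStats:
    """Per-team statistics for one game (hypothetical field names)."""
    shots_on_target: int
    fouls: int
    cards: int


def cards_per_foul(stats: TeamStats) -> float:
    """Rate of cards per foul; returns 0.0 when no fouls were committed (assumed convention)."""
    return stats.cards / stats.fouls if stats.fouls else 0.0


def difference_features(home: TeamStats, away: TeamStats) -> dict:
    """Home-minus-away differences for two of the predictors mentioned in the abstract."""
    return {
        "shots_on_target_diff": home.shots_on_target - away.shots_on_target,
        "cards_per_foul_diff": cards_per_foul(home) - cards_per_foul(away),
    }


# Made-up numbers for a single game, for illustration only.
home = TeamStats(shots_on_target=6, fouls=10, cards=2)
away = TeamStats(shots_on_target=3, fouls=16, cards=4)
features = difference_features(home, away)
print(features)
```

A positive difference means the home team recorded the higher value. Features of this shape could then be passed to any of the five models, although the exact feature construction used in the thesis is not specified here.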

Overall, the results of the two studies indicated that the proposed machine learning approaches yielded effective prediction performance for both crash severity and soccer game outcomes. RF had slightly superior prediction performance among the five machine learning models in both studies. Even though the two problem domains come from different industries and policy-making areas, the proposed machine learning approaches dealt effectively with the complexity of the data in terms of dimensionality and time-series nature.