NBA Game Prediction System Using Composite Stacked Machine Learning Model

I-No Liao

Abstract

Predicting NBA games is not an easy task. Conventional data analysis and statistic approaches are usually complicated and the accuracy is not high. In this work, a machine learning based method is proposed and the complete design flow is thoroughly introduced and explained. 8 single-stage machine learning models are trained and compared. More complex composite models such as voting mechanism and stacking method are also designed and elaborated. The proposed model reaches 76.8% accuracy on predicting all 2018 NBA playoffs. Furthermore, for the Eastern Conference Final, Western Conference Finals, and Conference Finals, our model achieves an extraordinary prediction accuracy of 85.7%, 71.4%, and 100%, respectively. The source code and dataset are available here.

Problem Definition

The mission of this work is to precisely predict NBA games' winning and losing result. Machine learning models are trained to predict game results based on the information of two teams' recent status. The idea is expressed as in Figure 1. Beside game-winning prediction, information of the prediction's confidence level is also valuable for us to understand how intense the matchup might be.

Problem Definition

Figure 1: Problem definition.

Dataset

This work predicts NBA games based on the dataset collected from the official NBA stats website. A crawler program is designed to scrape the game boxes and save the data automatically. Details about the crawler design are available here. NBA games, including seasons and playoffs, from 1985 to 2018 are collected. The dataset contains 68,458 season matches and 4,816 playoff matches. Figure 2 shows the number of games played by all 30 NBA teams. Due to various history from each team, the number of games for 30 teams are not homogeneous. Moreover, since there are at most 16 teams entitled to enter playoff each year, the number of playoff games played by each team is different as well.

Number of games played by each team

Figure 2: Statistics of the number of games played by each NBA team.

Before processing our data, a classification regarding data types is conducted. Table 1 shows data types of the dataset. As we can see, most of the data are numeric. There is one categorical data, Team, and there are two binary data, Win/Lose and Home/Away. Our target is to precisely predict which team wins a game when two teams meet. Therefore, Win/Lose is the label and our machine learning model predicts the Win/Lose outcome and provides the confidence level of its prediction.

Table 1: Type of Attirbutes

Binary	Win/Lose, Home/Away
Categorical	Team
Numeric	Date, PTS, FG%, FGM, FGA, 3P%, 3PM, 3PA, FT%, FTM, FTA, REB, OREB, DREB, AST, STL, BLK, TOV, PF

Data Preprocessing and Feature Extraction

Typical data preprocessing is conducted as shown in Figure 3. Preprocessing includes data cleaning, one-hot encoding, numeric data normalization, game pairing, validity checking, etc. The final legitimate data volume is 61,368, including seasons and playoffs.

Data preprocessing block diagram

Figure 3: Data preprocessing flow chart.

To train machine learning models, feature extraction is carried out as shown in Figure 4 and 5. Firstly, select the attributes that are more representative to the winning or losing of games. Then, put all selected attributes in a vector. The attribute X is the average performance considered from previous games played by two teams prior to the date we target to predict. In other words, attribute X represents the teams' recent status. Label Y is Win/Lose since we would like to predict which team wins the game.

Feature extraction

Figure 4: Feature extraction.

Figure 5: How attribute X and label Y look like.

Model Training and Testing

After data preprocessing and feature extraction are completed, model training and testing can proceed. In this section, grid search with cross validation is firstly applied to find the optimal model parameters. Afterward, data size evaluation is conducted to help us understand how data volume influences the model performance. Then, voting and stacking models are introduced. At last, a comprehensive performance comparison of different machine learning models is presented.

Grid Search with Cross Validation

8 different frequently used single-stage machine learning models are analyzed in this work. Model parameters are optimized by grid search. To prevent possible overfitting issue that happens frequently in model training, cross validation is applied. Table 2 presents which parameters are considered and in what ranges are they examined. Note that since the Naïve Bayes model has no parameters to choose, it does not require grid search.

Table 2: Grid Search Parameters

Model	Parameters Sweeping Table	Model	Parameters Sweeping Table
Logistic Regression	'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'max_iter': [100, 200, 300, 400, 500]	GBDT	'loss': ['deviance', 'exponential'], 'n_estimators': [600, 800, 1000], 'learning_rate': [0.1, 0.2, 0.3], 'max_depth': [3, 5, 10], 'subsample': [0.5], 'max_features': ['auto', 'log2', 'sqrt']
SVM	'C': [0.01, 0.1, 1, 10, 100], 'kernel': ['rbf', 'linear'], 'gammas': ['auto', 0.001, 0.01], 'shrinking': [True, False]	LightGBM	'learning_rate': [0.1, 0.2, 0.3], 'n_estimators': [600, 800, 1000], 'max_depth': [-1, 5, 10], 'subsample' : [0.5]
XGBoost	'max_depth': [3, 5, 7], 'learning_rate': [0.1, 0.3], 'n_estimators': [100, 200, 300], 'min_child_weight': [1, 3], 'gamma': [x/10 for x in range(0, 5)]	AdaBoost	'learning_rate': [1, 0.1, 0.2, 0.3], 'n_estimators': [50, 100, 600, 800, 1000]
Random Forest	'n_estimators': [600, 800, 1000], 'criterion': ['gini', 'entropy'], 'bootstrap': [True, False], 'max_depth': [None, 5, 10], 'max_features': ['auto', 'log2', 'sqrt']	Naïve Bayes	N/A

Data Size Evaluation

The data size evaluation is an important step when training models. Since the play style of NBA games changes rapidly as time goes, training models using more data does not mean a better prediction accuracy. As a result, the relation between training data size and performance is evaluated and the outcome is presented in Table 3. As shown in the table, training data covering three-year previous games presents the best performance and it is chosen as the optimal dataset for all our models.

Table 3: Data Size Evaluation

Training Data (yr)	Training Data (#)	Accuracy (%)
Training Data (yr)	Training Data (#)	LogiRegr	SVM	XGBoost	Naïve Bayes	Random Forest	GBDT	LightGBM	AdaBoost
1	2460	69.6	70.9	74.7	60.8	68.4	72.2	68.4	73.4
2	5078	70.9	72.2	72.2	59.5	69.6	69.6	74.7	68.4
3	7234	70.9	74.7	74.7	60.8	70.9	73.4	68.4	76.0
4	9370	69.6	72.2	72.2	59.5	72.2	70.9	73.4	73.4
5	11702	70.9	70.9	76.0	59.5	74.7	74.7	69.6	74.7

Voting

To prevent bias from a single machine learning model, a voting mechanism, as shown in Figure 6, is applied to make the prediction decision more convincing. 5 machine learning models, including Logistic Regression, SVM, XGBoost, GBDT, and AdaBoost, are considered in the voting model owing to their better performance. The voting mechanism is simple. The decision agreed by most of the models is the final decision. Furthermore, the ratio of agreed votes to total votes is an indicator implying the confidence level of the final decision.

Voting Model

Figure 6: Voting model.

Stacking

Stacking is a more sophisticated approach that consolidates the predictions from multiple well-trained models and uses them as a new set of training attributes to train another model. It can be considered a multi-stage model or a stacked model that is helpful for preventing bias from certain models. At some level, is can be seen as a mode complicated voting mechanism. Figure 7 shows the block diagram of the stacking model and the details of how stacking works are presented in Figure 8. In this work, several combinations of different machine learning models constructing the stacked model are evaluated. In addition, both 2-stage and 3-stage stacked models are analyzed.

Stacking Model

Figure 7: Stacking model.

Stacking Model

Figure 8: Details in stacking block.

As shown in Table 4, 3-stage stacking model is slightly better than 2-stage stacking model. To thoroughly consider all models, 2-stage stacking of SVM/GBDT/XGBoost + AdaBoost and 3-stage stacking of SVM/XGBoost + RF/GBDT + AdaBoost are selected for the consideration of the final performance comparison.

Table 4: Stacking Model Performance Evaluation

Stacking	Stage 1	Stage 2	Final Stage	Total Estimators (#)	Prediction Accuracy (%)
2-Stage
	SVM/GBDT/XGBoost	None	AdaBoost	4	76.8 (1st)
	SVM/GBDT/AdaBoost	None	XGBoost	4	74.4
	SVM/XGBoost/AdaBoost	None	GBDT	4	72.0
	XGBoost/GBDT/AdaBoost	None	SVM	4	73.2
	SVM/RF/GBDT/XGBoost	None	AdaBoost	5	74.4
	SVM/RF/GBDT/AdaBoost	None	XGBoost	5	72.0
3-Stage
	SVM/XGBoost	RF/GBDT	AdaBoost	5	76.8 (1st)
	RF/GBDT	SVM/XGBoost	AdaBoost	5	75.6
	SVM/AdaBoost	RF/GBDT	XGBoost	5	75.6
	RF/GBDT	SVM/AdaBoost	XGBoost	5	75.6
	SVM/RF	XGBoost/GBDT	AdaBoost	5	75.6

Experimental Results

This work evaluates eight single-stage models, one voting model, one 2-stage stacked model, and one 3-stage stacked model. The performance comparison is summarized in Table 5. We can observe that for the single-stage estimators, all models have decent prediction accuracy except for Naïve Bayes and LightGBM. Moreover, composite models such as voting and stacking are even more accurate than single-stage estimators. AdaBoost, 2-stage stacked, and 3-stage stacked models possess the peak performance of 76.8 % prediction accuracy. In conclusion, stacked machine learning model is an appropriate approach for our task.

Table 5: 2018 NBA Playoff Game Winning Prediction

Model	Algorithms/Architectures	Prediction Accuracy (%)
Single-Stage Estimator
	Logistic Regression	72.0
	SVM	75.6
	XGBoost	75.6
	Naïve Bayes	62.2
	Random Forest	72.0
	GBDT	74.4
	LightGBM	69.5
	AdaBoost	76.8
Voting	Logistic Regreesion/SVM/XGBoost/GBDT/AdaBoost	73.2
2-Stage Stacking	SVM/GBDT/XGBoost + AdaBoost	76.8
3-Stage Stacking	SVM/XGBoost + RF/GBDT + AdaBoost	76.8

The most important games in the NBA are Eastern/Western Finals and the Conference Finals. GBDT is applied as an example to show our predictions on each game as shown in Table 6. The accuracy of the model prediction manifests the tension of the games to some extent. For example, in the 2018 NBA Conference Finals, Golden State Warriors swept Cleveland Cavaliers and our model precisely predicted the fact without incorrect predictions. As shown in the table, only one game had a confidence level lower than 60% and that game was indeed more intense than the other three games. As for Eastern and Western Conference Finals, since both matchups were more competitive, the resulting confidence level of our model was lower compared to the Conference Finals. In summary, this work designs a machine learning model that can reach prediction accuracy of 85.7%, 71.4%, and 100% for Eastern Conference Final, Western Conference Finals, and Conference Finals, respectively.

Table 6: 2018 NBA Finals/Semi-Finals Game Winning Prediction by GBDT Model

Game (#)	Home	Away	Actual Winner	Predicted Winner	Confidence (%)	Accuracy (%)
NBA Conference Finals
1	GSW	CLE	GSW	GSW	70.9	100.0
2	GSW	CLE	GSW	GSW	68.9
3	CLE	GSW	GSW	GSW	58.3
4	CLE	GSW	GSW	GSW	62.3
NBA Western Conference Finals
1	HOU	GSW	GSW	GSW	56.3	71.4
2	HOU	GSW	HOU	HOU	53.9
3	GSW	HOU	GSW	GSW	50.2
4	GSW	HOU	HOU	GSW	64.2
5	HOU	GSW	HOU	HOU	60.0
6	GSW	HOU	GSW	GSW	54.0
7	HOU	GSW	GSW	HOU	63.7
NBA Eastern Conference Finals
1	BOS	CLE	BOS	BOS	53.6	85.7
2	BOS	CLE	BOS	BOS	58.9
3	CLE	BOS	CLE	CLE	61.2
4	CLE	BOS	CLE	CLE	61.3
5	BOS	CLE	BOS	BOS	57.8
6	CLE	BOS	CLE	CLE	55.9
7	BOS	CLE	CLE	BOS	51.8