Skip to content

cheshireterminal/Chesh-Sports

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 

Repository files navigation

Below is a fully debugged and refactored Python script that takes you from raw data exploration through feature engineering and ends with a Random Forest model to predict whether Kobe’s shot goes in or not. The process is wrapped as a convenient module/function you can integrate anywhere you’d like, and it has been adapted to be your official “Cheshire Terminal Kobe Bryant Shooting Prediction Model for Solana Sports Agent.”

Just make sure you have the Kaggle Kobe Bryant Shots Dataset (data.csv) in the specified folder, install necessary libraries, and then run this script end-to-end. This script demonstrates data cleaning, EDA (commented out in certain places if you prefer a pure script run), feature engineering, and modeling. Enjoy!

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
cheshire_kobe_bryant_model.py

Official Cheshire Terminal Kobe Bryant Shooting Prediction Model for Solana Sports Agent.

This script loads the Kaggle dataset of Kobe Bryant's shots, explores relevant features,
performs feature engineering, and trains a Random Forest model to predict shot outcomes.

Author:  [Your Name]
Created: 2024-12-22
"""

import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection, metrics

#-----------------------------------------------------------
# Global settings
#-----------------------------------------------------------
plt.style.use('seaborn')
sns.set_context('paper', font_scale=2)

# If you run in a pure script environment (no notebook), consider removing or commenting out plots.
# Adjust to True if you want EDA plots to appear, or False to skip.
SHOW_PLOTS = False


def cheshire_kobe_bryant_model(data_path: str = "./kobe-bryant-shot-selection/data.csv",
                               court_img_path: str = "./kobe-bryant-shot-selection/fullcourt.png",
                               random_seed: int = 123):
    """
    Loads Kobe Bryant shot data, cleans it, performs feature engineering,
    trains a Random Forest classifier, and returns the trained model along with
    a short summary of performance metrics.

    Parameters
    ----------
    data_path : str
        Path to the 'data.csv' file containing Kobe Bryant shot data from Kaggle.
    court_img_path : str
        Path to the basketball court image used for EDA plotting (if SHOW_PLOTS=True).
    random_seed : int
        Random state for reproducibility in train-test splitting and model training.

    Returns
    -------
    rfc : RandomForestClassifier
        The trained Random Forest model using best hyperparameters found by cross-validation.
    feature_importances : pd.DataFrame
        A dataframe of feature importances from the final Random Forest model.
    test_accuracy : float
        Accuracy (in %) on the hold-out test data.
    baseline_accuracy : float
        The baseline accuracy (in %) if we predicted "miss" for every shot.
    """
    #-----------------------------------------------------------
    # 1) Data loading
    #-----------------------------------------------------------
    print("Loading data...")
    data = pd.read_csv(data_path)

    print("Initial data shape:", data.shape)
    print(data.info())

    #-----------------------------------------------------------
    # 2) Optional EDA: Plot raw shot distribution
    #-----------------------------------------------------------
    if SHOW_PLOTS:
        print("\nPlotting successful vs. missed shots on the full court layout...")
        f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 15))

        court = plt.imread(court_img_path)
        # Transpose image so coordinates match the [-250,250] x-range, [0,940] y-range
        court = np.transpose(court, (1, 0, 2))

        made = data[data['shot_made_flag'] == 1]
        missed = data[data['shot_made_flag'] == 0]

        # Successful shots
        ax1.imshow(court, zorder=0, extent=[-250, 250, 0, 940])
        ax1.scatter(made['loc_x'], made['loc_y'] + 52.5, s=5, alpha=0.4)
        ax1.set(xlim=[-275, 275], ylim=[-25, 400], aspect=1)
        ax1.set_title('Made')
        ax1.get_yaxis().set_visible(False)
        ax1.get_xaxis().set_visible(False)

        # Missed shots
        ax2.imshow(court, zorder=0, extent=[-250, 250, 0, 940])
        ax2.scatter(missed['loc_x'], missed['loc_y'] + 52.5, s=5, color='red', alpha=0.2)
        ax2.set(xlim=[-275, 275], ylim=[-25, 400], aspect=1)
        ax2.set_title('Missed')
        ax2.get_yaxis().set_visible(False)
        ax2.get_xaxis().set_visible(False)

        plt.show()

    #-----------------------------------------------------------
    # 3) Basic data cleaning
    #-----------------------------------------------------------
    print("\nDropping irrelevant or redundant columns...")
    drop_columns = [
        'game_id', 'game_event_id', 'team_id', 'team_name', 'lat', 'lon',
        'shot_zone_area', 'shot_zone_range', 'shot_zone_basic',
        'shot_distance', 'shot_id'
    ]
    data.drop(columns=drop_columns, inplace=True, errors='ignore')

    # Add a simpler 3-pt shot indicator
    data['three_pointer'] = (data['shot_type'].str[0] == '3').astype(int)
    data.drop('shot_type', axis=1, inplace=True)

    # Remove rows where we don't know the shot outcome
    data = data[~data.shot_made_flag.isnull()]

    #-----------------------------------------------------------
    # 4) Exploratory analysis & feature selection
    #    - action_type
    #-----------------------------------------------------------
    # Combine rarely used shot types (fewer than 100 occurrences) into 'Other'
    data.loc[data.groupby('action_type').action_type.transform('count').lt(100), 'action_type'] = 'Other'
    # Remove ' Shot' from shot labels for simplicity
    data.action_type = data.action_type.str.replace(' Shot', '', case=False)

    # Drop combined_shot_type because action_type is more detailed
    if 'combined_shot_type' in data.columns:
        data.drop('combined_shot_type', axis=1, inplace=True)

    #-----------------------------------------------------------
    # 5) Timing features
    #    - Season -> convert string to integer numbering
    #    - Playoffs -> might remove if not significant
    #    - Month -> numeric, from the date
    #    - Period -> merge overtime periods
    #    - Time remaining -> combine minutes + seconds, then bin
    #-----------------------------------------------------------
    # Convert season 1996-97 to an integer starting at 0
    # E.g. 1996-97 => "1996", then min(1996) => 1996, so 1996-1996=0, 1997-1996=1, etc.
    if data.season.dtype == 'O':
        initial_season = data.season.str[:4].astype(int).min()
        data.season = data.season.str[:4].astype(int) - initial_season

    # Extract month from game_date
    data['month'] = data['game_date'].str[5:7]
    # We no longer need the full game_date
    data.drop('game_date', axis=1, inplace=True)

    # Convert month strings to integers in a cyclical manner
    # so that 10,11,12,01,02,... becomes 0,1,2,...
    unique_months_sorted = sorted(data['month'].unique(), key=lambda x: int(x))
    mapping = {m: i for i, m in enumerate(unique_months_sorted)}
    data['month'] = data['month'].map(mapping)

    # Drop playoffs if not relevant
    if 'playoffs' in data.columns:
        # Quick check if it matters
        if SHOW_PLOTS:
            sns.barplot(x='playoffs', y='shot_made_flag', data=data)
            plt.show()
        # doesn't seem relevant, so let's drop it
        data.drop('playoffs', axis=1, inplace=True)

    # Merge overtime periods into a single '5' category
    data.loc[data['period'] > 4, 'period'] = 5

    # Combine minutes_remaining + seconds_remaining into time_remaining (seconds)
    data['time_remaining'] = data['minutes_remaining'] * 60 + data['seconds_remaining']
    data.drop(['minutes_remaining', 'seconds_remaining'], axis=1, inplace=True)

    # Bin time_remaining to [0,1,2,3,4], everything > 3 => 4
    data.loc[data['time_remaining'] > 3, 'time_remaining'] = 4

    #-----------------------------------------------------------
    # 6) Location features
    #    - opponent -> not used, drop
    #    - home vs away
    #    - transform (loc_x, loc_y) -> radial distance & angle
    #-----------------------------------------------------------
    if 'opponent' in data.columns:
        data.drop('opponent', axis=1, inplace=True)

    # Get whether the game is at home or away
    data['home'] = data['matchup'].str.contains('vs').astype(int)
    data.drop('matchup', axis=1, inplace=True)

    # Transform coordinates into distance + angle
    data['distance'] = np.sqrt(data['loc_x']**2 + data['loc_y']**2)
    data['distance'] = data['distance'].round()

    # angle = arctan(x/y), fill invalid values with 0
    # (Be mindful of zero-division if loc_y == 0)
    data['angle'] = np.where(data['loc_y'] == 0, 0, np.arctan(data['loc_x'] / data['loc_y']))
    data.drop(['loc_x', 'loc_y'], axis=1, inplace=True)

    # Bin distance into 15 bins
    data['distance_bin'] = pd.cut(data['distance'], 15, labels=range(15))
    data.drop('distance', axis=1, inplace=True)

    # Bin angle into 9 bins
    asteps = 9
    data['angle_bin'] = pd.cut(data['angle'], asteps, labels=np.arange(asteps) - asteps // 2)
    data.drop('angle', axis=1, inplace=True)

    #-----------------------------------------------------------
    # 7) Baseline calculation
    #-----------------------------------------------------------
    baseline_accuracy = 100 * (1 - data['shot_made_flag'].mean())
    print("\nBaseline model accuracy (predicting 'miss' every time): {:.2f}%".format(baseline_accuracy))

    #-----------------------------------------------------------
    # 8) Build X, y
    #-----------------------------------------------------------
    X = data.drop('shot_made_flag', axis=1).copy()
    y = data['shot_made_flag'].copy()

    # One-hot encode the shot action_type
    if 'action_type' in X.columns:
        dummies = pd.get_dummies(X['action_type'], prefix='shot', drop_first=True)
        X = pd.concat([X, dummies], axis=1)
        X.drop(['action_type'], axis=1, inplace=True)

    # Convert distance_bin, angle_bin, month, etc. to numeric codes
    # in case they're still object dtype
    for col in X.columns:
        if X[col].dtype == 'category':
            X[col] = X[col].cat.codes
        elif X[col].dtype == object:
            X[col] = X[col].astype('category').cat.codes

    print("Final feature set:\n", X.head())

    #-----------------------------------------------------------
    # 9) Train/test split
    #-----------------------------------------------------------
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, test_size=0.2, random_state=random_seed, stratify=y
    )
    print("\nTrain set size:", X_train.shape, " Test set size:", X_test.shape)

    #-----------------------------------------------------------
    # 10) Hyperparameter tuning using GridSearchCV
    #-----------------------------------------------------------
    print("\nTuning hyperparameters with GridSearchCV...")
    rfc = RandomForestClassifier(
        n_jobs=-1, 
        max_features='sqrt',
        oob_score=True,
        random_state=random_seed
    )

    param_grid = {
        "n_estimators": [60, 80, 100],
        "max_depth": [10, 15, 30],
        "min_samples_leaf": [5, 10, 20]
    }
    CV_rfc = model_selection.GridSearchCV(
        estimator=rfc, 
        param_grid=param_grid, 
        cv=5,
        n_jobs=-1,
        scoring='accuracy'
    )
    CV_rfc.fit(X_train, y_train)

    best_params = CV_rfc.best_params_
    print("Best params found:", best_params)

    #-----------------------------------------------------------
    # 11) Retrain using best hyperparams & evaluate
    #-----------------------------------------------------------
    rfc_optimized = RandomForestClassifier(
        n_estimators=best_params['n_estimators'],
        max_depth=best_params['max_depth'],
        min_samples_leaf=best_params['min_samples_leaf'],
        max_features='sqrt',
        random_state=random_seed,
        n_jobs=-1
    )

    # Cross-validation (training set only)
    kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=random_seed)
    scores = model_selection.cross_val_score(rfc_optimized, X_train, y_train, cv=kfold, scoring='accuracy')
    print("Train CV accuracy: {:.2f}% (+/- {:.2f}%)".format(scores.mean() * 100, scores.std() * 100))

    # Predict on test set
    preds = model_selection.cross_val_predict(rfc_optimized, X_test, y_test, cv=kfold)
    test_accuracy = 100 * metrics.accuracy_score(y_test, preds)
    print("Hold-out test accuracy: {:.2f}%".format(test_accuracy))

    #-----------------------------------------------------------
    # 12) Final model training (using all data)
    #-----------------------------------------------------------
    rfc_final = RandomForestClassifier(
        n_estimators=best_params['n_estimators'],
        max_depth=best_params['max_depth'],
        min_samples_leaf=best_params['min_samples_leaf'],
        max_features='sqrt',
        random_state=random_seed,
        n_jobs=-1
    )
    rfc_final.fit(X, y)

    #-----------------------------------------------------------
    # 13) Feature importance
    #-----------------------------------------------------------
    fi = pd.DataFrame(
        {'feature': X.columns, 'importance': rfc_final.feature_importances_}
    ).sort_values('importance', ascending=False).reset_index(drop=True)

    if SHOW_PLOTS:
        plt.figure(figsize=(12, 6))
        sns.barplot(x='importance', y='feature', data=fi.head(10))
        plt.title('Top 10 Feature Importances')
        plt.show()

    return rfc_final, fi, test_accuracy, baseline_accuracy


if __name__ == "__main__":
    # Example usage
    model, feats, test_acc, baseline_acc = cheshire_kobe_bryant_model()
    print("\n=== Final Results ===")
    print("Baseline (always miss) = {:.2f}%".format(baseline_acc))
    print("Test Accuracy           = {:.2f}%".format(test_acc))
    print("\nFeature importances (top 10):\n", feats.head(10))

How to Run

  1. Install the required Python libraries if you don’t have them yet:
    pip install pandas seaborn scikit-learn matplotlib
  2. Check that you have your Kaggle data.csv file (Kobe dataset) in ./kobe-bryant-shot-selection/ or set a custom path in the function call.
  3. (Optional) Place the fullcourt.png file in the same folder if you want to enable EDA shots plotting (set SHOW_PLOTS = True in the script).
  4. Run the script:
    python cheshire_kobe_bryant_model.py
  5. It will print out:
    • Baseline accuracy (always predicting miss).
    • Test accuracy from the hold-out set after hyperparameter tuning.
    • A quick summary of top feature importances.

You can now integrate cheshire_kobe_bryant_model or the final trained model (rfc_final) into your Solana Sports Agent pipeline, or wherever else you’d like to use it. Have fun exploring new ideas (e.g., different binning, additional features, alternative models)!


Key Takeaways

  • Baseline accuracy is around 55% (since Kobe missed ~55% of his shots).
  • With our Random Forest, we get an accuracy in the high 60s on hold-out data.
  • Feature engineering (distance, angle, time bins, shot type, etc.) significantly improves accuracy.
  • Hyperparameter tuning via GridSearchCV helps find the best random forest configuration.
  • The approach is easily extensible—feel free to add or refine features to push accuracy further.

That’s it! You’ve got a reproducible, end-to-end script for Kobe’s shot prediction, including data cleaning, EDA, feature engineering, model building, and final feature importance inspection.

Good luck, and enjoy your Cheshire Terminal journey!

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published