A small Python project for practicing scikit-learn: it extracts features from chess positions, trains a Random Forest model, and compares it against a simple baseline predictor.
- Goal: Predict the outcome of a game (white win, draw, or black win) from a given chess position.
- Workflow:
  - (Optional) Extract positions from PGN files (e.g. from Lichess.org)
  - Compute features for each position
  - Evaluate feature importance using mutual information
  - Train a Random Forest model
  - Compare performance against a baseline predictor
I chose Random Forest because it is a robust algorithm that works well out of the box; my goal was to practice the main concepts of the scikit-learn library without digging too deeply into hyperparameter tuning.
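As a rough sketch of what the training step looks like in scikit-learn (the synthetic DataFrame below is a stand-in for data/raw/dataset.csv, and the column names and label values are illustrative assumptions, not the project's actual schema):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for pd.read_csv("data/raw/dataset.csv");
# column names and the 0.0/1.0/2.0 label encoding are assumptions.
df = pd.DataFrame({
    "material_balance": [3, -2, 0, 5, -4, 1, 0, -1] * 25,
    "mobility":         [30, 12, 20, 35, 10, 25, 18, 15] * 25,
    "result":           [2.0, 0.0, 1.0, 2.0, 0.0, 2.0, 1.0, 0.0] * 25,
})

X = df.drop(columns=["result"])
y = df["result"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Default settings, no hyperparameter tuning
clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
clf.fit(X_train, y_train)
print(f"Test accuracy: {accuracy_score(y_test, clf.predict(X_test)):.4f}")
```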
- Clone the repository

```bash
git clone https://github.com/Layyser/chess-feature-predictor.git
cd chess-feature-predictor
```

- Create and activate a virtual environment

```bash
python3 -m venv venv
source venv/bin/activate
```

- Install dependencies

```bash
pip install -r requirements.txt
```

To extract features, download a PGN file (for example from Lichess.org) and run extract_features.py:
```bash
python src/feature_extraction/extract_features.py \
    --pgn path/to/your/pgn \
    --games 20000 \
    --workers 8 \
    --output data/processed/features.csv
```

- NOTE: PGN files are around 300 GB, so consider using the dataset.csv provided in data/raw/dataset.csv instead of extracting the features from a PGN file yourself.
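Under the hood, per-position feature computation with python-chess might look like the following sketch; the specific features shown (material balance, move count, check status) are illustrative assumptions, not necessarily the ones extract_features.py computes:

```python
import chess

# Standard piece values; the king is omitted since both sides always have one
PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9}

def extract_position_features(board: chess.Board) -> dict:
    """Compute a few simple numeric features for one position."""
    material = sum(
        value * (len(board.pieces(piece, chess.WHITE))
                 - len(board.pieces(piece, chess.BLACK)))
        for piece, value in PIECE_VALUES.items()
    )
    return {
        "material_balance": material,                    # positive favors white
        "white_to_move": int(board.turn == chess.WHITE),
        "legal_moves": board.legal_moves.count(),        # simple mobility proxy
        "in_check": int(board.is_check()),
    }

print(extract_position_features(chess.Board()))  # starting position
# {'material_balance': 0, 'white_to_move': 1, 'legal_moves': 20, 'in_check': 0}
```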
```bash
python src/feature_selection/compute_importance.py \
    --input data/raw/dataset.csv
```

Consider modifying the code to save the model if needed.
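The mutual-information scoring can be sketched with scikit-learn's mutual_info_classif; the toy data below stands in for data/raw/dataset.csv, and the column names are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Toy stand-in for pd.read_csv("data/raw/dataset.csv"): one feature
# strongly tied to the label, one that is pure noise.
rng = np.random.default_rng(0)
y = rng.choice([0.0, 1.0, 2.0], size=500)
X = pd.DataFrame({
    "informative": y + rng.normal(scale=0.1, size=500),
    "noise": rng.normal(size=500),
})

# Higher score = feature carries more information about the label
scores = mutual_info_classif(X, y, random_state=0)
for name, score in sorted(zip(X.columns, scores), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```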
To train the Random Forest model:

```bash
python src/modeling/train_model.py \
    --input data/raw/dataset.csv
```

To compare against the baseline predictor:

```bash
python src/baseline/baseline_predictor.py
```

Test set accuracy: 0.8949
```
              precision    recall  f1-score   support

         0.0       0.89      0.90      0.89     54075
         1.0       0.97      0.79      0.87      6847
         2.0       0.89      0.91      0.90     56968

    accuracy                           0.89    117890
   macro avg       0.92      0.86      0.89    117890
weighted avg       0.90      0.89      0.89    117890
```
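The baseline's exact logic isn't detailed here (see src/baseline/baseline_predictor.py); a common choice is a majority-class predictor, sketched below with scikit-learn's DummyClassifier on synthetic data whose class balance roughly mirrors the support counts in the report above:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))  # stand-in features; ignored by the baseline
# Class proportions roughly matching the report's 54075/6847/56968 split
y = rng.choice([0.0, 1.0, 2.0], size=1000, p=[0.46, 0.06, 0.48])

# Always predict the most frequent class seen during fitting
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
print(f"Baseline accuracy: {accuracy_score(y, baseline.predict(X)):.4f}")
```

Any trained model worth keeping should clearly beat this floor, which is roughly the frequency of the most common class.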
Dependencies:

- pandas
- numpy
- scikit-learn
- python-chess
Future improvements:

- Test additional algorithms (e.g. XGBoost, LightGBM, neural networks)
- Perform grid search or Bayesian optimization to fine‑tune parameters (e.g. tree depth, learning rate, number of estimators)
- Use cross‑validation (k‑fold, stratified) to get more robust performance estimates
- Aggregate PGN data over several months or years instead of a single month to ensure diversity
- Filter out anomalous or low‑quality games (e.g. abandonments, ultra‑short games)
- Design new board‑state features (e.g. king safety metrics, mobility scores...)
- Encode move‑history information (e.g. repetition counts, move timestamps)
- Incorporate opening classifications or engine evaluations as features
- Experiment with resampling techniques (SMOTE, ADASYN, under‑sampling)
- Adjust class weights in models or use cost‑sensitive learning
- Implement a full sklearn pipeline that handles preprocessing, feature selection, and modeling in one workflow
- Add unit tests for feature extraction and model evaluation
- Integrate the model into my Light-Chess deployment
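Among the ideas above, the full scikit-learn pipeline could be sketched as follows; the synthetic data and the particular steps chosen here are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: labels 0.0/1.0/2.0 derived from two features
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = (X[:, 0] > 0).astype(float) + (X[:, 1] > 0).astype(float)

# Preprocessing, feature selection, and modeling chained in one estimator
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(mutual_info_classif, k=4)),
    ("model", RandomForestClassifier(n_estimators=100, random_state=1)),
])
pipe.fit(X, y)
print(f"Training accuracy: {pipe.score(X, y):.3f}")
```

A pipeline like this keeps all steps inside a single estimator, so cross-validation and grid search can tune the whole workflow at once without leaking test data into preprocessing.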
This project is licensed under the MIT License.