To see a more recent example of my coding style and best practices, take a look at my Python caching library rapide 🐎
This was a personal project I worked on between June and October 2023 to further hone my skills in data processing, interacting with the Ethereum network, and machine learning.
The initial idea was to scrape per-block data on all holders of any given on-chain crypto asset. Each per-block dataset would contain accurate, live profit-and-loss statistics for every holder, along with other data points. This type of data was not offered by any third-party data brokers at the time, and it would have cost a fortune to acquire through managed nodes.
- ✅ Efficiently query and cache data from a single, self-hosted Ethereum node
- ✅ Per-block, per-address live profit and loss data, fast enough for live use cases
- ✅ Use minimal 3rd party data sources
- ✅ Prepare normalized feature set from large amounts of raw data
- ✅ Train both neural-network and gradient boosting models
- ✅ UI that displays live model inference on a given crypto asset
- ✅ Simple backtesting system for models
- ✅ Uses EdgeDB, with EdgeQL for concise queries
- ✅ Runs even on consumer hardware
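The "per-block, per-address" data above largely comes down to replaying ERC-20 Transfer logs block by block against the local node. As a rough sketch only (this is not the project's code, and it assumes web3.py v6 with a node at a placeholder URL), here is how per-address balance deltas for a single block could be tallied:

```python
# Illustrative sketch, not the project's collection code: tally net ERC-20
# balance changes per address for one block, using raw Transfer logs.
from collections import defaultdict
from web3 import Web3

NODE_URL = "http://localhost:8545"  # assumed local execution-node RPC
TOKEN = "0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2"  # WETH, as an example
TRANSFER_TOPIC = Web3.to_hex(Web3.keccak(text="Transfer(address,address,uint256)"))

w3 = Web3(Web3.HTTPProvider(NODE_URL))

def holder_deltas(block_number: int) -> dict[str, int]:
    """Net balance change per address for one block, from Transfer logs."""
    logs = w3.eth.get_logs({
        "fromBlock": block_number,
        "toBlock": block_number,
        "address": TOKEN,
        "topics": [TRANSFER_TOPIC],
    })
    deltas: dict[str, int] = defaultdict(int)
    for log in logs:
        # Sender and receiver are the last 20 bytes of the indexed topics
        sender = "0x" + log["topics"][1].hex()[-40:]
        receiver = "0x" + log["topics"][2].hex()[-40:]
        value = int.from_bytes(bytes(log["data"]), "big")
        deltas[sender] -= value
        deltas[receiver] += value
    return dict(deltas)
```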
Looking back on this project, there are a number of improvements and simplifications I would make.
- **Rework Data Features:** At the time of this project I didn't have an understanding of predictive financial models, and simply included all the data I had collected. I compensated with regularization in the training steps, but this was a band-aid. Making a profitable model was not really my focus at the time.
- **Switch to SQLite:** EdgeDB was fun to use, but it introduces too much overhead and complexity given the scope of this project. A simple SQLite implementation for the cached data would probably be faster and give fewer headaches, especially when writing many rows in a single transaction (a minimal sketch of what I mean follows this list). Some of the tables would have to be redesigned to better fit a more traditional style of SQL queries.
- **Testing for Faster Iteration:** This codebase originally grew out of a small single script. Certain components, such as the `Database` and `Hybrid` classes, are fairly complex. I would add a suite of unit and integration tests, especially for these components, before making any further changes. This would be essential if switching to SQLite (see the example test below).
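To make the SQLite point concrete, here is a minimal sketch of the kind of cache layer I have in mind. The table and column names are hypothetical, not the project's actual schema; the key idea is batching all rows for a block into one transaction with `executemany`.

```python
# Hypothetical SQLite cache sketch; schema is illustrative only.
import sqlite3

conn = sqlite3.connect("cache.db")
conn.execute("PRAGMA journal_mode=WAL")  # better concurrency for a local cache
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS holder_pnl (
        block_number INTEGER NOT NULL,
        address      TEXT    NOT NULL,
        balance      TEXT    NOT NULL,   -- stringified uint256
        pnl_usd      REAL    NOT NULL,
        PRIMARY KEY (block_number, address)
    )
    """
)

def write_block(rows: list[tuple[int, str, str, float]]) -> None:
    """Insert all rows for one block in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO holder_pnl VALUES (?, ?, ?, ?)",
            rows,
        )
```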
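And for the testing point, a small example of the kind of unit test I would start with, written against the cache sketch above so it needs no network access or fixtures (an in-memory database stands in for the real one):

```python
# Example pytest-style test for the sketched cache layer; everything here is
# self-contained and does not touch the real Database class.
import sqlite3

def make_cache(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE holder_pnl (block_number INTEGER, address TEXT, "
        "balance TEXT, pnl_usd REAL, PRIMARY KEY (block_number, address))"
    )

def write_block(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO holder_pnl VALUES (?, ?, ?, ?)", rows
        )

def test_rewriting_a_block_does_not_duplicate_rows():
    conn = sqlite3.connect(":memory:")
    make_cache(conn)
    rows = [(1, "0xabc", "100", 5.0)]
    write_block(conn, rows)
    write_block(conn, rows)  # re-collecting the same block should be idempotent
    count = conn.execute("SELECT COUNT(*) FROM holder_pnl").fetchone()[0]
    assert count == 1
```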
- Python 3.11
- poetry for package management
- edgedb-cli to set up the database
- A fully synced, archival, self-hosted Ethereum node
- I recommend reth for execution and lighthouse for consensus
These commands are used to interact with the library:
`poetry run ui`
This prompts for a token pair address, then loads a terminal UI displaying live data and charts for that pair, including PnL spread amongst addresses and model inference results.
`poetry run collect-samples`
Using the pair addresses provided in `samples_test.csv` and `samples_train.csv`, this script pulls all per-block data from the database or node as needed, generates the feature set, then packages it into a compressed JSON file for training.
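As a rough illustration of the packaging step (not the actual collect-samples code, and with made-up field names), the feature set can be written and read back as gzip-compressed JSON like this:

```python
# Illustrative only: pack per-block samples into compressed JSON for training.
import gzip
import json

def save_samples(path: str, samples: list[dict]) -> None:
    """Write one training sample per block as a gzip-compressed JSON array."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(samples, f)

def load_samples(path: str) -> list[dict]:
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)

# Hypothetical sample shape: normalized per-block features plus the target.
save_samples("samples.json.gz", [
    {"block": 17_000_000, "features": [0.12, -0.4, 0.88], "target": 0.02},
])
```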
`poetry run reset-pair`
Prompts for a token pair address and clears the associated data from the database.
`poetry run train`
Begins the model training process. Each epoch is saved in a project directory.
`poetry run train-save-epoch`
Prompts for an epoch number in the training output directory and saves the corresponding model.
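The checkpoint-per-epoch pattern these two commands rely on looks roughly like the sketch below. This is not the project's training loop; the model, data, and output path are placeholders.

```python
# Sketch of per-epoch checkpointing with PyTorch; everything here is a stand-in.
from pathlib import Path
import torch
from torch import nn

out_dir = Path("training_output")  # hypothetical output directory
out_dir.mkdir(exist_ok=True)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x, y = torch.randn(256, 8), torch.randn(256, 1)  # stand-in features / targets

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    # One checkpoint per epoch, so any epoch can later be promoted to the saved model
    torch.save(model.state_dict(), out_dir / f"epoch_{epoch:03d}.pt")
```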
`poetry run train-lgbm`
Begins the LightGBM training process.
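For reference, the gradient-boosting side of the project corresponds to something like the following minimal LightGBM fit; the hyperparameters and data here are made up, not the project's settings.

```python
# Illustrative LightGBM regression fit with random stand-in data.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 8))  # stand-in normalized features
y = rng.normal(size=1_000)       # stand-in target (e.g. a forward return)

model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
model.fit(X, y)
model.booster_.save_model("lgbm_model.txt")  # persist for later inference
```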
`poetry run backtest`
One of the last things I experimented with. Runs a basic backtesting suite and outputs the results to a JSON file.
`poetry run backtest-analyze`
Prompts for a backtest results file and outputs some useful interpretive statistics in the terminal.
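The statistics are along the lines of the sketch below; the results-file layout shown here (a flat list of per-trade returns) is hypothetical and not the project's actual format.

```python
# Sketch of the kind of summary backtest-analyze prints; file layout is made up.
import json
from statistics import mean

with open("backtest_results.json") as f:
    returns: list[float] = json.load(f)  # e.g. [0.013, -0.004, ...]

wins = [r for r in returns if r > 0]
print(f"trades:   {len(returns)}")
print(f"win rate: {len(wins) / len(returns):.1%}")
print(f"avg P&L:  {mean(returns):.4%}")
```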