Multi-venue quantitative crypto trading system — cointegration-based pair selection,
ML-enhanced signals, and walk-forward backtesting across 32 CEX/DEX venues.
Walk-forward out-of-sample results. Train: Jan 2022 – Jun 2023 | Test: Jul 2023 – Dec 2024. No leverage (1.0x). Transaction costs included.
| Metric | Detail |
|---|---|
| 32 Venues | CEX • DEX • Hybrid |
| 211 Symbols | 16 crypto sectors |
| 226K+ Lines | 184 Python files |
| 137 Dependencies | Pinned & reproducible |
| Walk-Forward | Out-of-sample validated |
| ML Enhanced | GBM + Random Forest |
| No Leverage | 1.0x only |
| 61 Compliance Checks | Automated validation |
run_arb.py Master orchestrator
│
├── phase1run.py ─► run_phase1.py Phase 1: Multi-venue data collection
│ ├── 7 CEX collectors Binance, Bybit, OKX, Kraken, Coinbase, Deribit, CME
│ ├── 12 DEX collectors Uniswap, Curve, GMX, SushiSwap, Jupiter, ...
│ ├── 3 Hybrid collectors Hyperliquid, dYdX, Drift
│ └── 10 Alternative sources On-chain, sentiment, social, analytics
│
├── phase2run.py Phase 2: Altcoin StatArb (5-step pipeline)
│ ├── Step 1 Universe construction + cointegration testing
│ ├── Step 2 Baseline z-score mean reversion strategy
│ ├── Step 3 ML enhancement (Gradient Boosting + Random Forest)
│ ├── Step 4 Walk-forward backtest + crisis analysis
│ └── Step 5 Report generation
│
├── run_phase3.py ─► phase3run.py Phase 3: BTC Futures curve trading
│ ├── Funding rate term structure
│ ├── Calendar spread signals
│ ├── Cross-venue arbitrage
│ └── Walk-forward backtest
│
├── generate_visualizations.py 34 publication-quality charts
└── Compliance validator 61 automated checks
Identifies cointegrated cryptocurrency pairs and trades mean-reverting spreads with ML-enhanced signals.
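As a rough illustration of the spread mechanics behind this, the sketch below estimates a hedge ratio by ordinary least squares on log prices and a mean-reversion half-life from an AR(1) fit on the resulting spread. It is a simplified stand-in and not the project's cointegration test, which would also need a unit-root test on the residuals. The helper names `ols`, `spread_series`, and `half_life` are hypothetical.

```python
import math

def ols(x, y):
    """Least-squares fit y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variance")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def spread_series(price_a, price_b):
    """Log-price spread a - beta*b, with beta estimated by OLS (hypothetical helper)."""
    log_a = [math.log(p) for p in price_a]
    log_b = [math.log(p) for p in price_b]
    _, beta = ols(log_b, log_a)
    return [la - beta * lb for la, lb in zip(log_a, log_b)], beta

def half_life(spread):
    """Half-life in bars from the AR(1) fit s[t] - s[t-1] = a + b*s[t-1].

    Returns None when the fit does not describe a monotone decay (phi outside (0, 1)).
    """
    lagged = spread[:-1]
    delta = [s1 - s0 for s0, s1 in zip(spread[:-1], spread[1:])]
    _, b = ols(lagged, delta)
    phi = 1.0 + b                      # implied AR(1) coefficient
    if not 0.0 < phi < 1.0:
        return None
    return -math.log(2) / math.log(phi)
```

A pair whose half-life falls outside the configured bounds would be dropped before any z-score signal is computed.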
| Parameter | CEX | DEX |
|---|---|---|
| Universe | 50 tokens (top by volume) | 25 tokens (DeFi-native) |
| Entry Z-Score | ± 2.0 | ± 2.5 |
| Exit Z-Score | 0.0 (mean) | \|z\| < 1.0 |
| Stop Z-Score | ± 3.0 | ± 3.5 |
| Max Position | $100,000 | $50,000 |
| Transaction Cost | 0.20% (4-leg round trip) | 0.50 – 1.50% all-in |
| Max Positions | 5 – 8 | 2 – 3 |
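The entry, exit, and stop thresholds in the table can be read as a small state machine. The sketch below uses the CEX column as defaults and assumes the z-score series has already been computed; the function name and the position convention (+1 long spread, -1 short spread) are illustrative, and no cooldown after a stop is modelled.

```python
def zscore_positions(z_values, entry=2.0, exit_level=0.0, stop=3.0):
    """Map a z-score series to positions: +1 long spread, -1 short spread, 0 flat."""
    position = 0
    out = []
    for z in z_values:
        if position == 0:
            # Open only inside the band between entry and stop.
            if entry <= z < stop:
                position = -1          # spread is rich: short it
            elif -stop < z <= -entry:
                position = 1           # spread is cheap: long it
        elif position == -1:
            if z <= exit_level or z >= stop:
                position = 0           # reverted to the exit level, or stopped out
        elif position == 1:
            if z >= -exit_level or z <= -stop:
                position = 0
        out.append(position)
    return out
```

For the DEX column the same function would be called with `entry=2.5, exit_level=1.0, stop=3.5`.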
ML Enhancement: Gradient Boosting + Random Forest ensemble predicts spread direction, filtering baseline z-score signals. Features include lagged spreads, volatility regimes (HMM), momentum, and cross-pair correlations.
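One way to read "filtering baseline z-score signals" is to keep a baseline trade only when the ensemble agrees with its direction. The sketch below assumes two already-fitted classifiers each output a probability that the spread rises over the next bar; the equal-weight average and the 0.55 confidence threshold are assumptions for illustration, not project settings.

```python
def filter_signals(baseline, p_up_gbm, p_up_rf, min_confidence=0.55):
    """Keep baseline signals (+1/-1) only when the averaged model probability agrees."""
    filtered = []
    for signal, p_gbm, p_rf in zip(baseline, p_up_gbm, p_up_rf):
        p_up = (p_gbm + p_rf) / 2.0            # equal-weight ensemble (assumption)
        if signal == 1 and p_up >= min_confidence:
            filtered.append(1)                 # long spread, models expect a rise
        elif signal == -1 and (1.0 - p_up) >= min_confidence:
            filtered.append(-1)                # short spread, models expect a fall
        else:
            filtered.append(0)                 # disagreement or no baseline signal
    return filtered
```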
Risk Controls:
- Venue-based tier classification (T1: Both CEX, T2: Mixed, T3: Both DEX)
- 40% sector concentration limit, 70% max cross-pair correlation
- Kelly criterion position sizing (0.25 – 0.5x)
- 1.0x leverage only (no leverage)
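Two of these controls are easy to sketch in isolation: the venue tier and fractional-Kelly sizing. The example assumes the continuous-time Kelly approximation (mean return divided by variance), treats any pair that is not purely CEX or purely DEX as T2, and uses the CEX position cap as a default. All names are hypothetical, and the sector and correlation limits are not modelled here.

```python
def venue_tier(venue_type_a, venue_type_b):
    """T1: both legs on CEX, T3: both on DEX, T2: anything else (assumption)."""
    kinds = {venue_type_a.upper(), venue_type_b.upper()}
    if kinds == {"CEX"}:
        return "T1"
    if kinds == {"DEX"}:
        return "T3"
    return "T2"

def position_size(capital, mean_return, variance,
                  kelly_fraction=0.25, max_position=100_000.0):
    """Fractional-Kelly notional, capped at the per-pair limit and at 1.0x capital."""
    if variance <= 0 or mean_return <= 0:
        return 0.0                               # no edge, no position
    full_kelly = mean_return / variance          # Kelly fraction of capital
    fraction = min(kelly_fraction * full_kelly, 1.0)   # never above 1.0x (no leverage)
    return min(capital * fraction, max_position)
```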
Exploits the term structure of BTC perpetual funding rates across venues.
| Component | Details |
|---|---|
| Venues | Binance, Hyperliquid, dYdX, OKX, Bybit, GMX, Aevo |
| Signals | Funding rate carry, calendar spreads, cross-venue basis |
| Frequency | Hourly rebalancing |
| Walk-Forward | 6-month train / 18-month test |
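Comparing funding across venues requires putting rates on a common basis, since settlement intervals differ. The sketch below uses simple (non-compounded) annualisation; the 8-hour and 1-hour intervals and the sample rates in the usage note are assumptions for illustration, not values taken from the project.

```python
HOURS_PER_YEAR = 24 * 365

def annualized_funding(rate_per_interval, interval_hours):
    """Simple annualisation of a per-interval funding rate (no compounding)."""
    return rate_per_interval * HOURS_PER_YEAR / interval_hours

def funding_spread(venue_a, venue_b):
    """Annualised funding differential a - b for (rate, interval_hours) tuples.

    A positive value means longs pay more on venue a, so a short-a / long-b
    position would collect the differential before costs.
    """
    return annualized_funding(*venue_a) - annualized_funding(*venue_b)
```

For example, 0.01% per 8 hours annualises to about 10.95%, and 0.001% per hour to about 8.76%, giving a differential of roughly 2.19 percentage points.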
| Type | Venues | Data |
|---|---|---|
| CEX (7) | Binance, Bybit, OKX, Kraken, Coinbase, Deribit, CME | OHLCV, funding rates, open interest, liquidations, options |
| Hybrid (3) | Hyperliquid, dYdX, Drift | OHLCV, hourly funding rates, open interest |
| DEX (12) | Uniswap, Curve, GMX, SushiSwap, Jupiter, 1inch, 0x, CoWSwap, GeckoTerminal, DexScreener, … | Pool data, swaps, TVL, liquidity |
| On-Chain (5) | Covalent, Bitquery, Santiment, The Graph, Nansen | Wallet flows, smart money, on-chain metrics |
| Alternative (5+) | DeFiLlama, Coinalyze, LunarCrush, Dune, CoinGecko, CryptoCompare, Messari | TVL, sentiment, social, fundamentals |
211 unique symbols across 16 sectors with full survivorship bias tracking:
| Sector | Count | Examples |
|---|---|---|
| L1 Blockchains | 18 | SOL, AVAX, ADA, DOT, ATOM, TON |
| DeFi DEX | 16 | UNI, SUSHI, CRV, DYDX, GMX, JUP |
| Major Altcoins | 13 | BNB, XRP, DOGE, LTC, BCH |
| Infrastructure | 13 | LINK, GRT, FIL, AR, ENS |
| DeFi Lending | 9 | AAVE, COMP, MKR, ENA |
| Liquid Staking | 9 | LDO, RPL, EIGEN, PENDLE |
| AI / ML | 8 | FET, TAO, WLD, RNDR |
| L2 Solutions | 8 | ARB, OP, MATIC, STRK |
| Meme Tokens | 8 | PEPE, SHIB, BONK, WIF |
| Gaming | 7 | AXS, SAND, MANA, APE |
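Survivorship bias tracking generally means the backtest universe on a given date includes tokens that were later delisted. A minimal sketch of such a point-in-time filter follows; the `Listing` structure, the placeholder symbols, and all dates are hypothetical and are not taken from config/symbols.yaml.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Listing:
    symbol: str
    sector: str
    listed: date
    delisted: Optional[date] = None    # None means still trading

def universe_on(listings, as_of):
    """Symbols tradable on `as_of`, including ones that were delisted later."""
    return sorted(
        item.symbol for item in listings
        if item.listed <= as_of and (item.delisted is None or as_of < item.delisted)
    )

# Placeholder data for illustration only.
LISTINGS = [
    Listing("AAA", "L1 Blockchains", date(2021, 1, 1)),
    Listing("BBB", "DeFi Lending", date(2021, 6, 1), delisted=date(2022, 11, 15)),
    Listing("CCC", "Meme Tokens", date(2023, 5, 1)),
]
```

Here `universe_on(LISTINGS, date(2022, 6, 1))` still contains `BBB`, which a universe built only from currently listed tokens would miss.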
The system includes both an event-driven backtester and a vectorized fast backtester:
| Feature | Details |
|---|---|
| Walk-Forward Validation | Train: Jan 2022 – Jun 2023; Test: Jul 2023 – Dec 2024 |
| Crisis Analysis | UST/Luna collapse, FTX bankruptcy, Banking crisis, SEC lawsuits |
| Capacity Analysis | Market-impact modelling per venue |
| Attribution | Per-pair, per-sector, and per-regime P&L decomposition |
| Compliance | 61 automated checks via run_arb.py --validate |
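The walk-forward windows can be sketched as a generator of month-aligned, half-open date ranges. With 18-month train and test windows starting in January 2022 it reproduces the single split quoted above; shorter windows give a rolling scheme. The function names and the `step_months` parameter are assumptions, not the project's API.

```python
from datetime import date

def add_months(d, months):
    """First-of-month date shifted by a whole number of months."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)

def walk_forward_windows(start, end, train_months, test_months, step_months):
    """Yield (train_start, train_end, test_end); train is [train_start, train_end),
    test is [train_end, test_end). Stops once a test window would run past `end`."""
    if step_months <= 0:
        raise ValueError("step_months must be positive")
    train_start = start
    while True:
        train_end = add_months(train_start, train_months)
        test_end = add_months(train_end, test_months)
        if test_end > end:
            break
        yield train_start, train_end, test_end
        train_start = add_months(train_start, step_months)
```

Calling `walk_forward_windows(date(2022, 1, 1), date(2025, 1, 1), 18, 18, 18)` yields one window: train from January 2022 up to July 2023, test from July 2023 up to January 2025.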
| Requirement | Minimum | Recommended |
|---|---|---|
| Python | 3.10 | 3.12 |
| RAM | 8 GB | 16 GB |
| Disk | 5 GB | 10 GB |
| OS | macOS / Linux / Windows (WSL) | macOS (Apple Silicon) |
# Clone the repository
git clone https://github.com/abailey81/Crypto-Statistical-Arbitrage.git
cd Crypto-Statistical-Arbitrage
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # macOS / Linux
# .venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# macOS only (required by XGBoost)
brew install libomp
# Copy the API key template
cp config/api_keys_template.env config/.env
# Edit with your API keys (many venues work without keys)
nano config/.env
# Verify credentials
python config/verify_my_credentials.py
Note: Many data sources (Binance, Bybit, OKX, Hyperliquid, dYdX, GeckoTerminal, DeFiLlama, etc.) work without API keys using public endpoints.
# Full pipeline: Data Collection + Altcoin StatArb + BTC Futures + Visualizations
python run_arb.py
# Skip data collection (use existing data)
python run_arb.py --skip-phase1
# Run specific phases
python run_arb.py --phase 2 # Altcoin StatArb only
python run_arb.py --phase 3 # BTC Futures only
python run_arb.py --phase 2 3 # Both strategies
# Cold run (clear all caches)
python run_arb.py --clean-cache
# Validate compliance (61 checks)
python run_arb.py --validate
All run modes:
| Mode | Command | Description |
|---|---|---|
| Full Pipeline | `python run_arb.py` | All phases + visualizations + compliance |
| Cold Run | `python run_arb.py --clean-cache` | Clear caches, run from scratch |
| Warm Run | `python run_arb.py --skip-phase1` | Skip data collection |
| Phase Select | `python run_arb.py --phase 2` | Run specific phase(s) |
| Validate | `python run_arb.py --validate` | 61-check compliance audit |
| Check Data | `python run_arb.py --check-only` | Data readiness audit |
| 1-Day Test | `python run_phase1.py --start 2026-02-08 --end 2026-02-09` | Smoke test |
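For readers wiring these modes into their own scripts, the flags above could be parsed roughly as follows. This is a hypothetical `argparse` sketch built only from the flags listed in the table; it is not the parser in run_arb.py, and allowing `1` as a `--phase` value and the `phases_to_run` helper are assumptions.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the documented run_arb.py flags."""
    parser = argparse.ArgumentParser(prog="run_arb.py")
    parser.add_argument("--phase", type=int, nargs="+", choices=[1, 2, 3],
                        default=None, help="run only the given phase(s)")
    parser.add_argument("--skip-phase1", action="store_true",
                        help="reuse existing data instead of collecting it")
    parser.add_argument("--clean-cache", action="store_true",
                        help="clear all caches before running")
    parser.add_argument("--validate", action="store_true",
                        help="run the compliance checks")
    parser.add_argument("--check-only", action="store_true",
                        help="audit data readiness without running strategies")
    return parser

def phases_to_run(args):
    """Resolve the phase list: all phases by default, minus phase 1 if skipped."""
    phases = args.phase or [1, 2, 3]
    if args.skip_phase1:
        phases = [p for p in phases if p != 1]
    return phases
```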
.
├── config/ # Configuration
│ ├── config.yaml # Strategy parameters, risk limits, dates
│ ├── venues.yaml # 32 venue configs (endpoints, costs, capacity)
│ ├── symbols.yaml # 211 symbols across 16 sectors
│ └── api_keys_template.env # API key template (copy to .env)
│
├── data_collection/ # Phase 1: Data acquisition layer
│ ├── cex/ # CEX collectors (Binance, Bybit, OKX, ...)
│ ├── dex/ # DEX collectors (Uniswap, Curve, GMX, ...)
│ ├── hybrid/ # Hybrid collectors (Hyperliquid, dYdX, Drift)
│ ├── onchain/ # On-chain analytics (10 providers)
│ ├── options/ # Options data (Deribit, Aevo)
│ ├── alternative/ # Alternative data (DeFiLlama, Coinalyze, ...)
│ ├── market_data/ # Market data aggregators
│ ├── indexers/ # Blockchain indexers (The Graph)
│ └── utils/ # Rate limiting, caching, validation, storage
│
├── strategies/ # Trading strategies
│ ├── pairs_trading/ # Cointegration, Kalman filter, ML signals
│ ├── futures_curve/ # Term structure, calendar spreads
│ ├── funding_rate_arb/ # Cross-venue funding rate arbitrage
│ └── vol_surface_or_dex_arb/ # Options vol surface / DEX arbitrage
│
├── backtesting/ # Backtesting engine
│ ├── backtest_engine.py # Core event-driven backtester
│ ├── optimized_backtest.py # Vectorized fast backtester
│ └── analysis/ # Walk-forward, crisis, capacity, attribution
│
├── portfolio/ # Portfolio construction
│ ├── optimizer.py # HRP, MVO, risk parity, Black-Litterman
│ └── risk_manager.py # Drawdown stops, VaR limits, stress tests
│
├── reporting/ # Report generation
│ ├── advanced_report_generator.py
│ └── strict_pdf_validator.py # 61-check compliance validator
│
├── execution/ # Execution layer
│ ├── order_manager.py # Order routing and management
│ └── slippage_model.py # Venue-specific slippage models
│
├── notebooks/ # Jupyter notebooks
│ ├── 00_data_acquisition_plan.ipynb
│ ├── 01_cex_data_exploration.ipynb
│ ├── 02_dex_data_exploration.ipynb
│ ├── 03_venue_comparison.ipynb
│ ├── 04_strategy_development.ipynb
│ ├── 05_multi_venue_backtesting.ipynb
│ └── 06_portfolio_construction.ipynb
│
├── docs/ # Documentation
│ ├── methodology.md # Statistical methodology
│ ├── data_dictionary.md # Data schema reference
│ ├── data_sources.md # Venue documentation
│ ├── api_reference.md # API reference
│ └── venue_comparison.md # Venue comparison analysis
│
├── tests/ # Test suite
│ ├── unit/ # Unit tests
│ ├── integration/ # Integration tests
│ └── performance/ # Performance benchmarks
│
├── run_arb.py # Master orchestrator
├── phase1run.py # Phase 1 entry point
├── phase2run.py # Phase 2 engine (~5,700 lines)
├── run_phase1.py # Phase 1 engine (~5,400 lines)
├── run_phase3.py # Phase 3 entry point
├── generate_visualizations.py # Chart generator (34 visualizations)
├── requirements.txt # Dependencies (137 packages)
├── setup.py # Package configuration
└── Makefile # Build automation
All strategy parameters are defined in config/config.yaml:
| Section | What it controls |
|---|---|
| `universe` | Token lists, sector mappings, venue assignments |
| `cointegration` | Half-life bounds, p-value thresholds, test window |
| `strategy` | Z-score entry/exit/stop, position sizing, Kelly fraction |
| `risk` | Drawdown limits, concentration caps, correlation thresholds |
| `backtest` | Train/test dates, transaction costs, walk-forward windows |
| `venues` | Per-venue endpoints, rate limits, fee schedules |
make format # Black + isort formatting
make lint # Flake8 linting
make type-check # mypy type checking
make quality # All of the above
make test # Run all tests
make test-unit # Unit tests only
make test-integration # Integration tests
make test-coverage # With coverage report
See CONTRIBUTING.md for full development guidelines.
| Document | Description |
|---|---|
| Methodology | Statistical methodology — cointegration, Kalman filter, HMM |
| Data Dictionary | Data schema and field reference |
| Data Sources | Venue documentation and capabilities |
| API Reference | Module-level API reference |
| Venue Comparison | Cross-venue comparison analysis |
137 pinned packages organized by function:
| Category | Key Packages |
|---|---|
| Scientific Computing | numpy, pandas, scipy |
| Data Collection | ccxt, aiohttp, requests, websockets, httpx |
| Econometrics | statsmodels, arch, hmmlearn |
| Machine Learning | scikit-learn, lightgbm, xgboost |
| Acceleration (JIT / GPU / parallel) | numba, pyopencl, joblib |
| Portfolio Optimization | cvxpy |
| Data Storage | pyarrow, fastparquet, h5py |
| Visualization | matplotlib, seaborn, plotly, kaleido |
| Configuration | pydantic, python-dotenv, PyYAML |
This project is for educational and research purposes only. It is not financial advice. Cryptocurrency trading involves substantial risk of loss. Past performance does not guarantee future results. Always do your own research before making any investment decisions.
This project is licensed under the MIT License. See LICENSE for details.
Built by Tamer Atesyakar