Data pipeline — Recorded with DroidGrid multi-phone camera rig. Label JSONs auto-generated by the recording assistant during capture.
FERN works on any standard RGB camera — no depth sensor, no wearable, no special rig. It extracts 33-joint pose skeletons via MediaPipe, normalizes them to body-relative coordinates, and feeds 60-frame sliding windows into a compact CNN (~526K params) that classifies 8 foot gestures at ~20 ms/window on an RTX 3070. CPU inference is also supported.
The pipeline is fully modular: swap the camera, retrain on new subjects, or extend the gesture set without touching the model architecture.
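The 60-frame window with stride 15 can be sketched as follows. This is a minimal, dependency-free illustration, not the code in dataset_v2.py; the function names are hypothetical, and dropping a trailing partial window is an assumption.

```python
from typing import List, Sequence


def window_starts(num_frames: int, window: int = 60, stride: int = 15) -> List[int]:
    """Return the start index of every full window that fits in a clip."""
    if num_frames < window:
        return []  # assumption: clips shorter than one window are skipped
    return list(range(0, num_frames - window + 1, stride))


def sliding_windows(frames: Sequence, window: int = 60, stride: int = 15) -> List[Sequence]:
    """Cut a per-frame feature sequence into overlapping windows."""
    return [frames[s:s + window] for s in window_starts(len(frames), window, stride)]


# A 120-frame clip with 30 features per frame (values are placeholders).
clip = [[0.0] * 30 for _ in range(120)]
windows = sliding_windows(clip)
print(len(windows), len(windows[0]), len(windows[0][0]))
```

With these defaults, consecutive windows overlap by 45 frames, so each frame (away from the clip edges) is seen in four windows.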
| | Feature | Detail |
|---|---|---|
| 🎯 | Real-time inference | ~20 ms/window on GPU, CPU-supported |
| 📷 | Any RGB camera | Webcam, phone (via DroidCam), or video file |
| 🧠 | CNN-only | BiLSTM was evaluated and dropped — CNN outperforms at this dataset scale |
| 🔀 | Multi-angle | Per-frame camera-ID flag lets one model handle multiple angles |
| 🏷️ | Auto-labeling | Recording assistant generates label JSONs during capture — no manual annotation |
| 🔄 | Data augmentation | Mirror, time warp, joint dropout, noise injection |
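Three of the augmentations listed above can be sketched on a single skeleton frame. This is an illustration under assumptions, not the project's implementation: it assumes frames are stored as 33 body-relative [x, y] pairs in MediaPipe's landmark order, and the `sigma` and `p` defaults are hypothetical. Time warp is omitted.

```python
import random
from typing import List

Frame = List[List[float]]  # one [x, y] pair per joint, body-relative

# Left/right landmark pairs in MediaPipe's 33-point pose topology
# (eyes, ears, mouth corners, then shoulders through foot index).
LR_PAIRS = [(1, 4), (2, 5), (3, 6), (7, 8), (9, 10)] + [(i, i + 1) for i in range(11, 32, 2)]


def mirror(frame: Frame) -> Frame:
    """Flip x around the body centre and swap left/right joints."""
    flipped = [[-x, y] for x, y in frame]
    for left, right in LR_PAIRS:
        flipped[left], flipped[right] = flipped[right], flipped[left]
    return flipped


def add_noise(frame: Frame, sigma: float = 0.01, rng=random) -> Frame:
    """Add zero-mean Gaussian jitter to every coordinate."""
    return [[x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)] for x, y in frame]


def drop_joints(frame: Frame, p: float = 0.1, rng=random) -> Frame:
    """Zero out each joint independently with probability p."""
    return [[0.0, 0.0] if rng.random() < p else [x, y] for x, y in frame]
```

Mirroring only makes sense after normalisation, when x = 0 sits at the body centre; on raw image coordinates the flip would need to be `1 - x` instead.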
All gestures are performed with the right foot.
| ID | Class | Description |
|---|---|---|
| 0 | foot_hold | Standing still / idle (no gesture) |
| 1 | foot_lift | Lift foot straight up |
| 2 | sideway_kick | Kick foot laterally |
| 3 | cross_front | Cross foot in front of body |
| 4 | heel_tap | Tap heel to ground |
| 5 | flamingo_bend | Single-leg balance with knee bend |
| 6 | forward_step | Step forward |
| 7 | forward_kick | Kick forward |
foot_hold serves as the idle/null class. Without it, the model is forced to assign a gesture to non-gesture frames. It needs dedicated, diverse footage, not just transition padding.
RGB video frames
│
▼
MediaPipe PoseLandmarker ──→ 33 body joints × (x, y, z, visibility)
│
▼
Normalise to Mid-Hip + Torso Length ──→ 30 features per frame
│
▼
[optional] Camera-ID one-hot flag ──→ 30 + N features
│
▼
Sliding window (60 frames, stride 15)
│
▼
CNN1D ──→ local spatial-temporal patterns
│
▼
Softmax ──→ 8-class label + confidence
~526K parameters (optimal). ~132K in baseline config. Runs real-time on CPU; ~20 ms/window on RTX 3070 Laptop.
Why CNN-only? BiLSTM was evaluated extensively and consistently underperformed at current dataset scale. CNN-only is the proven baseline for FERN v2.
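The "Normalise to Mid-Hip + Torso Length" step in the diagram above can be sketched like this. The landmark indices are MediaPipe's; measuring torso length from mid-shoulder to mid-hip is an assumption, as are the function names. How the 33 joints are reduced to 30 features is not shown here.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]

# MediaPipe pose landmark indices.
L_SHOULDER, R_SHOULDER, L_HIP, R_HIP = 11, 12, 23, 24


def midpoint(a: Point, b: Point) -> Point:
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)


def normalise(joints: List[Point]) -> List[Point]:
    """Translate so mid-hip is the origin, then scale by torso length."""
    hip = midpoint(joints[L_HIP], joints[R_HIP])
    shoulder = midpoint(joints[L_SHOULDER], joints[R_SHOULDER])
    torso = math.dist(hip, shoulder)  # assumed definition of torso length
    if torso < 1e-6:
        raise ValueError("degenerate pose: torso length is ~0")
    return [((x - hip[0]) / torso, (y - hip[1]) / torso) for x, y in joints]
```

After this step the coordinates no longer depend on where the subject stands in the frame or how far they are from the camera, which is what lets one model serve different subjects and setups.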
| Metric | Value |
|---|---|
| ONNX test accuracy (sweep optimal) | 86.29% |
| 3-fold CV (subject-independent) | 44.36% ± 6.75% |
| Model parameters (optimal) | ~526K |
| Architecture | CNN-only |
| Training device | RTX 3070 Laptop (8 GB) / Ryzen 7 5800H |
| Primary camera | c3 (front, 0°) |
| Detection filter | ≥70% detection ratio |
The gap between subject-independent CV (44.36%) and train-all ONNX accuracy (86.29%) points to data scarcity as the primary bottleneck: the model fits the subjects it has seen but does not generalize to new ones. Target: 20+ subjects.
Known weak classes:
- heel_tap: inherently weak from front view; side camera recommended
- foot_hold: weak when only recorded as transition padding
FERN supports multi-angle capture with a single camera-conditioned model.
| Camera | Angle | Position | Status |
|---|---|---|---|
| c3 | 0° (front) | Ground level | ✓ Training baseline |
| c4 | ~45° (right) | Ground level | ✓ Active |
| c2 | ~90° (left) | Ground level | ✓ Active |
| c1 | Elevated | — | ✗ Excluded — breaks normalisation |
| c5 | — | — | ✗ Excluded — insufficient subjects |
Camera-ID flag: A per-frame one-hot vector is appended to skeleton features so one model learns angle-conditioned recognition across all cameras. Geometric rotation via MediaPipe z-depth was evaluated and failed (~15% accuracy). Stereo triangulation is designed but requires physical calibration.
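Appending the camera-ID flag to each frame might look like the sketch below. The camera ordering and function names are hypothetical; only the idea (30 skeleton features plus N one-hot entries per frame) comes from the description above.

```python
from typing import List

# Assumed ordering of the active cameras; the real mapping may differ.
CAMERAS = ["c2", "c3", "c4"]


def camera_flag(camera: str) -> List[float]:
    """One-hot vector identifying the source camera."""
    if camera not in CAMERAS:
        raise ValueError(f"unknown camera: {camera}")
    return [1.0 if c == camera else 0.0 for c in CAMERAS]


def add_camera_flag(window: List[List[float]], camera: str) -> List[List[float]]:
    """Append the same one-hot flag to every frame in a window."""
    flag = camera_flag(camera)
    return [list(frame) + flag for frame in window]


window = [[0.0] * 30 for _ in range(60)]
flagged = add_camera_flag(window, "c3")
print(len(flagged[0]))  # 30 skeleton features + 3 camera entries
```

Because the flag is constant within a clip, the model can treat it as a conditioning signal rather than something it has to infer from the skeleton geometry.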
git clone https://github.com/Vision-Orchestration/FERN
cd FERN
python -m venv venv
.\venv\Scripts\Activate.ps1
pip install -r requirements_v2.txt
$env:PYTHONPATH = "$(Get-Location)\src"
Download pose_landmarker_heavy.task (~30 MB) and place it at:
C:\Users\<user>\.cache\mediapipe\models\pose_landmarker_heavy.task
# Webcam
python src\infer_v2.py --model models_sweep\fern_v2.onnx --camera_id 0
# Video file
python src\infer_v2.py --model models_sweep\fern_v2.onnx --camera_id "path\to\video.mp4"
python src\recording_assistant.py --subject p01 --cameras phone1,phone2,phone3
The recording assistant provides a fullscreen tkinter UI with countdown/GO/REST cues, stick-figure gesture illustrations, DroidGrid REST API integration for multi-camera sync, and auto-generated label JSONs. Target: 20 subjects, varied height and footwear.
The recording assistant lives in the DroidGrid repo and is symlinked or copied into src/.
python src\extract_skeleton.py --video_dir data\raw --output_dir data\skeletons
python src\train_v2.py --skeleton_dir data\skeletons\front --label_dir data\labels\front --output_dir models_sweep --epochs 200 --warmup_epochs 20 --batch_size 32 --window_size 60 --stride 15 --lr 3e-4 --weight_decay 1e-2 --dropout 0.3 --cnn_out 128 --lstm_hidden 0 --device cuda --num_workers 0 --train_all
python src\export_onnx.py --checkpoint_path models_sweep\fern_v2_latest.pth --output_path models_sweep\fern_v2.onnx
python src\test_onnx.py --onnx_path models_sweep\fern_v2.onnx --skeleton_dir data\skeletons\front --label_dir data\labels\front --window_size 60 --stride 15
| Tool | Command | Effect |
|---|---|---|
| LR Mirror | python src\mirror_dataset.py ... | Doubles dataset via X-flip |
| V1 Merge | python src\merge_v1_database.py ... | Imports FERN v1 clip database |
| Add Foot-Hold Gaps | python src\add_foot_hold_gaps.py ... | Inserts 60-frame idle gaps at gesture transitions |
Subject-aware splits required — mirrored pairs must stay in the same train/val/test fold.
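One way to keep mirrored pairs together is to split by subject rather than by clip. The sketch below assumes clip IDs start with the subject ID followed by an underscore (for example p01_..., matching the --subject p01 argument) and that mirrored clips keep that prefix; both the naming scheme and the round-robin assignment are assumptions, not the logic in kfold_cv.py.

```python
from collections import defaultdict
from typing import Dict, List


def subject_of(clip_id: str) -> str:
    """Assumed naming scheme: '<subject>_<anything>'."""
    return clip_id.split("_")[0]


def grouped_folds(clip_ids: List[str], k: int = 3) -> List[List[str]]:
    """Assign whole subjects to folds so no subject spans two folds."""
    by_subject: Dict[str, List[str]] = defaultdict(list)
    for clip in clip_ids:
        by_subject[subject_of(clip)].append(clip)
    subjects = sorted(by_subject)
    if len(subjects) < k:
        raise ValueError("need at least one subject per fold")
    folds: List[List[str]] = [[] for _ in range(k)]
    for i, subject in enumerate(subjects):
        folds[i % k].extend(by_subject[subject])
    return folds


clips = ["p01_t1", "p01_t1_mirror", "p02_t1", "p02_t1_mirror", "p03_t1", "p03_t1_mirror"]
for fold in grouped_folds(clips, k=3):
    print(fold)
```

Since a mirrored clip shares its subject prefix with the original, grouping by subject automatically keeps each pair in the same fold.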
FERN uses DroidGrid as its data-capture pipeline. Phones running DroidCam stream video via RTSP → MediaMTX broker → laptop, with FFMPEG pass-through recording per camera. The recording assistant controls DroidGrid via REST API to synchronise multi-camera capture and auto-generate label JSONs from wall-clock anchors.
DroidCam phones ──RTSP──► MediaMTX ──RTSP──► Python (OpenCV capture + preview)
└──► FFMPEG pass-through → .mp4 per camera
│
▼
Recording Assistant → label JSONs (wall-clock anchors)
│
▼
FERN training pipeline (skeleton → CNN → ONNX)
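Turning wall-clock anchors into frame ranges could work roughly as below. The label JSON schema is not documented here, so the inputs (recording start time, cue start and end times in seconds, and a constant frame rate) are all assumptions for illustration.

```python
from typing import Tuple


def to_frame_range(video_start_s: float, cue_start_s: float,
                   cue_end_s: float, fps: float = 30.0) -> Tuple[int, int]:
    """Map a wall-clock cue interval to (first_frame, end_frame_exclusive)."""
    if cue_start_s < video_start_s:
        raise ValueError("cue starts before the recording does")
    if cue_end_s <= cue_start_s:
        raise ValueError("cue must have positive duration")
    first = round((cue_start_s - video_start_s) * fps)
    end = round((cue_end_s - video_start_s) * fps)
    return first, end


# Recording began at t=1000 s; the GO cue ran from 1002 s to 1004 s.
print(to_frame_range(1000.0, 1002.0, 1004.0, fps=30.0))
```

This only holds if every camera's recording start time is known on the same clock and the stream keeps a steady frame rate; dropped frames would shift the labels.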
| Config | Mean CV | vs Baseline |
|---|---|---|
| Dropout=0.3 + cnn_out=128 | 44.36% | +3.25 pp |
| Dropout=0.3 | 43.73% | +2.62 pp |
| cnn_out=128 | 42.80% | +1.69 pp |
| Baseline (cnn_out=64, dropout=0.6) | 41.11% | — |
| cnn_out=32 | 34.05% | -7.06 pp |
| Dropout=0.7 | 33.02% | -8.09 pp |
Optimal: cnn_out=128, dropout=0.3, lr=3e-4, weight_decay=1e-2, warmup_epochs=20
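For reference, a cosine schedule with warmup using the optimal values above (lr=3e-4, warmup_epochs=20, 200 epochs) can be written as a plain function. A linear warmup and decay to zero are assumptions; the exact schedule in train_v2.py may differ.

```python
import math


def lr_at(epoch: int, base_lr: float = 3e-4,
          warmup_epochs: int = 20, total_epochs: int = 200) -> float:
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        # Ramp from base_lr / warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 19, 20, 110, 199):
    print(epoch, f"{lr_at(epoch):.2e}")
```

The rate peaks at the end of warmup and has fallen to half the base value midway through the cosine phase (epoch 110 with these settings).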
| Model | Path | Params | Front Acc |
|---|---|---|---|
| Old front-only | models_final/fern_v2.onnx | 132K | 62.58% |
| Sweep optimal | models_sweep/fern_v2.onnx | 526K | 86.29% |
| Phase 1 (camera-flag) | models_final_v2/fern_v2.onnx | 140K | 50.48% |
| Key | Action |
|---|---|
| R | Start recording (recording assistant) |
| S | Stop recording |
| Q | Quit |
| H | Toggle HUD overlay (inference) |
ONNX inference is slow
- Make sure you're using the GPU build: pip install onnxruntime-gpu
- Check that ONNX Runtime sees your CUDA device
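A quick way to check the second point is to list the execution providers ONNX Runtime reports. The helper name below is hypothetical; `get_available_providers()` is part of the onnxruntime Python API, and the script falls back to an empty list if the package is not installed.

```python
from typing import List


def choose_providers(available: List[str]) -> List[str]:
    """Prefer CUDA when present, keep CPU as a fallback."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available]


try:
    import onnxruntime as ort
    available = ort.get_available_providers()
except ImportError:
    available = []  # onnxruntime is not installed in this environment

print("available:", available)
print("would use:", choose_providers(available))
```

If CUDAExecutionProvider is missing from the list, the CPU-only onnxruntime package is probably installed, or the CUDA libraries cannot be found. The chosen list can be passed as the `providers` argument when creating an `InferenceSession`.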
Training hangs on Windows
- Use --num_workers 0: DataLoader multiprocessing hangs with SubsetRandomSampler
Low accuracy
- Start with front-camera (c3) data only before adding multi-angle
- Use the sweep optimal config: cnn_out=128, dropout=0.3, lr=3e-4
Camera shows no detection
- Check that the CSVs in data/skeletons/ exist and have valid joint coordinates
- Verify the pose model .task file is in the correct cache path
Q: What hardware do I need? Any laptop with a webcam. CUDA GPU recommended for training (RTX 3070 or better), but CPU training and inference work.
Q: How do DroidGrid and FERN connect? DroidGrid is the data-collection rig — it records multi-camera video. FERN is the recognition engine — it extracts skeletons, trains, and runs inference. The recording assistant bridges them by controlling DroidGrid via REST and auto-generating label JSONs.
Q: Why CNN and not BiLSTM? Extensive eval showed CNN-only outperforming BiLSTM by 45%+ on this dataset size. BiLSTM will be revisited with more data.
Q: How do I contribute data?
Record with recording_assistant.py (from the DroidGrid repo), run skeleton extraction and training, then submit a PR.
FERN/
├── src/
│ ├── model_v2.py # CNN-only architecture
│ ├── dataset_v2.py # Sliding-window dataset with camera-flag
│ ├── train_v2.py # Training loop (cosine LR + warmup + early stopping)
│ ├── evaluate_v2.py # Per-class metrics + confusion matrix
│ ├── kfold_cv.py # K-fold CV with grouped folds
│ ├── extract_skeleton.py # MediaPipe skeleton extraction
│ ├── infer_v2.py # Live inference (PyTorch)
│ ├── infer_onnx.py # Live inference (ONNX Runtime)
│ ├── export_onnx.py # .pth → .onnx export
│ ├── test_onnx.py # Full-dataset ONNX accuracy
│ ├── recording_assistant.py # Recording UI (from DroidGrid)
│ ├── add_foot_hold_gaps.py # Insert idle gaps at transitions
│ ├── mirror_dataset.py # LR skeleton augmentation
│ └── merge_v1_database.py # FERN v1 clip merger
├── models_final/ # Old front-only (132K, 62.58%)
├── models_final_v2/ # Phase 1 camera-flag (140K, 50.48%)
├── models_sweep/ # Sweep optimal (526K, 86.29%)
├── assets/
│ └── banner.svg
├── AGENTS.md # AI agent knowledge
├── CAMERA_FLAG_AGENT.md # Camera-flag implementation plan
├── FERN_v2_COMPLETE_REPORT.md # Full technical report
├── FERN_v2_AI_REPORT.md # AI session report
├── run_nightly.ps1 # Nightly training pipeline
├── requirements_v2.txt
├── .gitignore
└── README.md
| Decision | Outcome |
|---|---|
| BiLSTM evaluated and dropped | CNN outperforms at current dataset scale |
| MediaPipe z-depth rotation | Failed (~15% accuracy) — z too noisy for single-camera transforms |
| Auto-generated labels | Eliminates manual annotation errors |
| Camera-ID one-hot flag | Single model handles multiple angles |
| Stereo triangulation | Designed; needs physical calibration |
| Early stopping warmup guard | Required — val_loss vs val_acc mismatch caused false stops |
| num_workers=0 on Windows | DataLoader multiprocessing hangs with SubsetRandomSampler |
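The early-stopping warmup guard from the table can be illustrated with a small helper. This is a sketch under assumptions: it monitors validation accuracy, the patience value is hypothetical, and the class is not the one used in train_v2.py.

```python
class EarlyStopper:
    """Stop when val accuracy stalls, but never during warmup."""

    def __init__(self, patience: int = 15, warmup_epochs: int = 20) -> None:
        self.patience = patience
        self.warmup_epochs = warmup_epochs
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, epoch: int, val_acc: float) -> bool:
        """Record one epoch; return True if training should stop."""
        if val_acc > self.best:
            self.best = val_acc
            self.bad_epochs = 0
            return False
        if epoch < self.warmup_epochs:
            return False  # guard: stalled warmup epochs are not counted
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopper(patience=2, warmup_epochs=3)
history = [0.50, 0.40, 0.40, 0.40, 0.40]
decisions = [stopper.step(epoch, acc) for epoch, acc in enumerate(history)]
print(decisions)
```

Without the guard, the two stalled epochs during warmup would already have exhausted the patience budget in this example.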
- MediaPipe skeleton extraction
- CNN-only model (~526K params optimal)
- Auto-labeling recording assistant
- Multi-camera setup (c3, c4, c2)
- Camera-ID one-hot flag design
- Hyperparameter sweep (9 configs, optimal found)
- 20-subject dataset recording
- LR mirror augmentation
- Subject-independent 5-fold CV as primary metric
- Camera-flag model on multi-angle data
- Stereo triangulation (requires calibration)
- Confidence smoothing (temporal majority vote)
- Full augmentation suite
- Ablation study
- Paper draft
- Paper submission
- Open-source weights + demo
- Dataset release
Issues and pull requests are welcome.
- Fork the repo
- Create a branch: git checkout -b feature/my-feature
- Commit with a clear message
- Open a pull request
@misc{fern2026,
title = {FERN: Real-Time Foot Gesture Recognition via MediaPipe Skeleton and CNN},
author = {Vision-Orchestration},
year = {2026},
url = {https://github.com/Vision-Orchestration/FERN}
}
MIT
Part of the Vision-Orchestration toolkit.
MediaPipe skeletons + CNN. No depth sensor. No wearables. Just a camera.