FERN — Foot gEsture Recognition Network

Data pipeline — Recorded with DroidGrid multi-phone camera rig. Label JSONs auto-generated by the recording assistant during capture.


Overview

FERN works on any standard RGB camera: no depth sensor, no wearable, no special rig. It extracts 33-joint pose skeletons via MediaPipe, normalizes them to body-relative coordinates, and feeds 60-frame sliding windows into a compact CNN (~526K params) that classifies 8 foot gestures at ~20 ms/window on an RTX 3070 Laptop GPU. CPU inference is also supported.

The pipeline is fully modular: swap the camera, retrain on new subjects, or extend the gesture set without touching the model architecture.

| Feature | Detail |
|---|---|
| 🎯 Real-time inference | ~20 ms/window on GPU, CPU-supported |
| 📷 Any RGB camera | Webcam, phone (via DroidCam), or video file |
| 🧠 CNN-only | BiLSTM was evaluated and dropped; CNN outperforms at this dataset scale |
| 🔀 Multi-angle | Per-frame camera-ID flag lets one model handle multiple angles |
| 🏷️ Auto-labeling | Recording assistant generates label JSONs during capture; no manual annotation |
| 🔄 Data augmentation | Mirror, time warp, joint dropout, noise injection |

Gestures (8 classes)

All gestures are performed with the right foot.

| ID | Class | Description |
|---|---|---|
| 0 | foot_hold | Standing still / idle (no gesture) |
| 1 | foot_lift | Lift foot straight up |
| 2 | sideway_kick | Kick foot laterally |
| 3 | cross_front | Cross foot in front of body |
| 4 | heel_tap | Tap heel to ground |
| 5 | flamingo_bend | Single-leg balance with knee bend |
| 6 | forward_step | Step forward |
| 7 | forward_kick | Kick forward |

foot_hold serves as the idle/null class — without it the model forces predictions on non-gesture frames. It needs dedicated diverse footage, not just transition padding.
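
As an illustration of how the class table and the idle class fit together, here is a minimal sketch that maps an 8-way softmax output to a label and confidence. The `decode` helper and its `min_conf` threshold are assumptions for this example, not part of the FERN code; falling back to foot_hold below the threshold is one possible policy.

```python
# Class IDs and names copied from the gesture table above (index == class ID).
GESTURES = [
    "foot_hold", "foot_lift", "sideway_kick", "cross_front",
    "heel_tap", "flamingo_bend", "forward_step", "forward_kick",
]
IDLE = "foot_hold"


def decode(probs, min_conf=0.5):
    """Map an 8-way softmax output to (label, confidence).

    min_conf is a hypothetical threshold: predictions below it
    are reported as the idle class instead of a gesture.
    """
    if len(probs) != len(GESTURES):
        raise ValueError(f"expected {len(GESTURES)} probabilities, got {len(probs)}")
    best = max(range(len(probs)), key=probs.__getitem__)
    conf = probs[best]
    label = GESTURES[best] if conf >= min_conf else IDLE
    return label, conf
```

A confident prediction keeps its class name, while a low-confidence one is reported as foot_hold with the original confidence attached.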


Architecture

RGB video frames
      │
      ▼
MediaPipe PoseLandmarker  ──→  33 body joints × (x, y, z, visibility)
      │
      ▼
Normalise to Mid-Hip + Torso Length  ──→  30 features per frame
      │
      ▼
[optional] Camera-ID one-hot flag  ──→  30 + N features
      │
      ▼
Sliding window (60 frames, stride 15)
      │
      ▼
CNN1D  ──→  local spatial-temporal patterns
      │
      ▼
Softmax  ──→  8-class label + confidence

~526K parameters (optimal). ~132K in baseline config. Runs real-time on CPU; ~20 ms/window on RTX 3070 Laptop.
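
The normalisation step in the diagram can be sketched as follows. This is a minimal example under stated assumptions: torso length is taken as the distance between the shoulder midpoint and the hip midpoint, only (x, y) is used, and all 33 joints are kept. The README does not say which joints or channels make up the 30 features, so that selection is left out. The landmark indices are the standard MediaPipe Pose ones.

```python
import math

# MediaPipe Pose landmark indices (33-landmark layout).
L_SHOULDER, R_SHOULDER, L_HIP, R_HIP = 11, 12, 23, 24


def midpoint(a, b):
    """Midpoint of two (x, y) points."""
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)


def normalise_frame(joints, eps=1e-6):
    """Centre joints on the mid-hip and scale by torso length.

    joints: list of 33 (x, y) tuples in image coordinates.
    Returns a list of 33 (x, y) tuples in body-relative coordinates.
    Torso length = mid-shoulder to mid-hip distance (an assumption).
    """
    mid_hip = midpoint(joints[L_HIP], joints[R_HIP])
    mid_shoulder = midpoint(joints[L_SHOULDER], joints[R_SHOULDER])
    scale = max(math.dist(mid_hip, mid_shoulder), eps)  # avoid division by zero
    return [((x - mid_hip[0]) / scale, (y - mid_hip[1]) / scale) for x, y in joints]
```

After this step the mid-hip sits at the origin and the torso has unit length, so subject height and distance to the camera largely cancel out.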

Why CNN-only? BiLSTM was evaluated extensively and consistently underperformed at current dataset scale. CNN-only is the proven baseline for FERN v2.
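
The sliding-window step (60 frames, stride 15) from the diagram above can be sketched like this. The function names are hypothetical, and dropping trailing frames that do not fill a whole window is an assumption about edge handling.

```python
def window_starts(num_frames, window=60, stride=15):
    """Start indices of every full window in a clip of num_frames frames."""
    if num_frames < window:
        return []  # clip too short for even one window
    return list(range(0, num_frames - window + 1, stride))


def sliding_windows(frames, window=60, stride=15):
    """Cut a per-frame feature sequence into overlapping fixed-length windows."""
    return [frames[s:s + window] for s in window_starts(len(frames), window, stride)]
```

With these defaults a 150-frame clip yields 7 windows starting at frames 0, 15, 30, 45, 60, 75 and 90, and consecutive windows share 45 frames.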


Current Results

| Metric | Value |
|---|---|
| ONNX test accuracy (sweep optimal) | 86.29% |
| 3-fold CV (subject-independent) | 44.36% ± 6.75% |
| Model parameters (optimal) | ~526K |
| Architecture | CNN-only |
| Training device | RTX 3070 Laptop (8 GB) / Ryzen 7 5800H |
| Primary camera | c3 (front, 0°) |
| Detection filter | ≥70% detection ratio |

The 41.93-point gap between subject-independent CV (44.36%) and the train-all ONNX model (86.29%) points to data scarcity as the primary bottleneck: the model fits the subjects it has seen but does not generalize to new ones. Target: 20+ subjects.

Known weak classes:

  • heel_tap — inherently weak from front view; side camera recommended
  • foot_hold — weak when only recorded as transition padding

Camera Setup

FERN supports multi-angle capture with a single camera-conditioned model.

| Camera | Angle | Position | Status |
|---|---|---|---|
| c3 | 0° (front) | Ground level | ✓ Training baseline |
| c4 | ~45° (right) | Ground level | ✓ Active |
| c2 | ~90° (left) | Ground level | ✓ Active |
| c1 | - | Elevated | ✗ Excluded: breaks normalisation |
| c5 | - | - | ✗ Excluded: insufficient subjects |

Camera-ID flag: A per-frame one-hot vector is appended to skeleton features so one model learns angle-conditioned recognition across all cameras. Geometric rotation via MediaPipe z-depth was evaluated and failed (~15% accuracy). Stereo triangulation is designed but requires physical calibration.
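
A minimal sketch of the camera-ID flag, assuming plain Python lists for the per-frame features. The camera order in `CAMERAS` and the helper names are assumptions for illustration; the real encoding lives in dataset_v2.py and may differ.

```python
# Active cameras from the table above. The index order is an assumption.
CAMERAS = ["c3", "c4", "c2"]


def camera_one_hot(camera_id):
    """One-hot vector of length N = len(CAMERAS) for the given camera."""
    if camera_id not in CAMERAS:
        raise ValueError(f"unknown or excluded camera: {camera_id}")
    return [1.0 if cam == camera_id else 0.0 for cam in CAMERAS]


def add_camera_flag(frames, camera_id):
    """Append the same one-hot flag to every frame: 30 features become 30 + N."""
    flag = camera_one_hot(camera_id)
    return [list(frame) + flag for frame in frames]
```

With three active cameras each frame grows from 30 to 33 features, and the flag is constant across all frames of a clip recorded by one camera.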


Quick Start

1. Clone & install

git clone https://github.com/Vision-Orchestration/FERN
cd FERN
python -m venv venv
.\venv\Scripts\Activate.ps1
pip install -r requirements_v2.txt
$env:PYTHONPATH = "$(Get-Location)\src"

2. Get the pose model

Download pose_landmarker_heavy.task (~30 MB) and place it at:

C:\Users\<user>\.cache\mediapipe\models\pose_landmarker_heavy.task
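
To check the placement before running inference, a small sketch like the one below can help. It assumes the cache directory sits under the current user's home directory, as in the path above; the helper names are hypothetical.

```python
from pathlib import Path

MODEL_NAME = "pose_landmarker_heavy.task"


def pose_model_path(home=None):
    """Expected location of the pose model under the user's home directory."""
    home = Path(home) if home is not None else Path.home()
    return home / ".cache" / "mediapipe" / "models" / MODEL_NAME


def check_pose_model(home=None):
    """Return the model path, or raise if the file is missing."""
    path = pose_model_path(home)
    if not path.is_file():
        raise FileNotFoundError(f"pose model not found at {path}")
    return path
```

Calling `check_pose_model()` with no arguments checks the current user's cache and fails early with the full expected path in the error message.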

3. Run live inference

# Webcam
python src\infer_v2.py --model models_sweep\fern_v2.onnx --camera_id 0

# Video file
python src\infer_v2.py --model models_sweep\fern_v2.onnx --camera_id "path\to\video.mp4"

Training Pipeline

1. Record dataset (with DroidGrid + Recording Assistant)

python src\recording_assistant.py --subject p01 --cameras phone1,phone2,phone3

The recording assistant provides a fullscreen tkinter UI with countdown/GO/REST cues, stick-figure gesture illustrations, DroidGrid REST API integration for multi-camera sync, and auto-generated label JSONs. Target: 20 subjects, varied height and footwear.

The recording assistant lives in the DroidGrid repo and is symlinked or copied into src/.

2. Extract skeletons

python src\extract_skeleton.py --video_dir data\raw --output_dir data\skeletons

3. Train (optimal config)

python src\train_v2.py --skeleton_dir data\skeletons\front --label_dir data\labels\front --output_dir models_sweep --epochs 200 --warmup_epochs 20 --batch_size 32 --window_size 60 --stride 15 --lr 3e-4 --weight_decay 1e-2 --dropout 0.3 --cnn_out 128 --lstm_hidden 0 --device cuda --num_workers 0 --train_all
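
For intuition about the `--lr`, `--warmup_epochs` and `--epochs` values, here is a sketch of a warmup-plus-cosine learning-rate schedule. The exact shape (linear warmup, cosine decay to zero, per-epoch updates) is an assumption; train_v2.py may implement it differently.

```python
import math


def lr_at(epoch, base_lr=3e-4, warmup_epochs=20, total_epochs=200):
    """Learning rate for a 0-based epoch index.

    Linear warmup up to base_lr, then cosine decay towards zero.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate climbs from 1.5e-5 in the first epoch to 3e-4 at the end of warmup, then falls to about half of that midway through the remaining 180 epochs.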

4. Export to ONNX & evaluate

python src\export_onnx.py --checkpoint_path models_sweep\fern_v2_latest.pth --output_path models_sweep\fern_v2.onnx
python src\test_onnx.py --onnx_path models_sweep\fern_v2.onnx --skeleton_dir data\skeletons\front --label_dir data\labels\front --window_size 60 --stride 15

Augmentation Tools

| Tool | Command | Effect |
|---|---|---|
| LR Mirror | python src\mirror_dataset.py ... | Doubles dataset via X-flip |
| V1 Merge | python src\merge_v1_database.py ... | Imports FERN v1 clip database |
| Add Foot-Hold Gaps | python src\add_foot_hold_gaps.py ... | Inserts 60-frame idle gaps at gesture transitions |
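
As a rough picture of what an X-flip does to one skeleton frame, consider the sketch below. It assumes hip-centred (x, y) coordinates, so the flip is a sign change on x. Swapping left and right landmark indices is optional here because the README does not say whether mirror_dataset.py does it; the pair list follows the standard MediaPipe Pose layout.

```python
# Left/right landmark pairs in the 33-landmark MediaPipe Pose layout:
# eyes, ears, mouth corners, then shoulders through foot index (11..32).
SWAP_PAIRS = [(1, 4), (2, 5), (3, 6), (7, 8), (9, 10)] + [
    (i, i + 1) for i in range(11, 32, 2)
]


def mirror_frame(joints, swap_sides=True):
    """Mirror one frame of hip-centred (x, y) joints across the vertical axis.

    swap_sides=True also exchanges left/right landmark indices
    (an assumption, not confirmed behaviour of mirror_dataset.py).
    """
    flipped = [(-x, y) for x, y in joints]
    if swap_sides:
        for left, right in SWAP_PAIRS:
            flipped[left], flipped[right] = flipped[right], flipped[left]
    return flipped
```

Applying `mirror_frame` twice returns the original frame, which is a quick sanity check for any mirror implementation.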

Subject-aware splits are required: a clip and its mirrored copy must stay in the same train/val/test fold, otherwise near-duplicate samples leak across folds.
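
One way to keep mirrored pairs together is to assign folds per subject rather than per clip, as in this sketch. The clip naming scheme (`<subject>_<rest>`) and the function names are hypothetical; kfold_cv.py may group differently.

```python
from collections import defaultdict


def subject_of(clip_name):
    """Hypothetical naming scheme: '<subject>_<rest>', e.g. 'p01_c3_take1_mirror'."""
    return clip_name.split("_", 1)[0]


def grouped_folds(clip_names, k=3):
    """Split clips into k folds so that no subject spans two folds."""
    by_subject = defaultdict(list)
    for name in clip_names:
        by_subject[subject_of(name)].append(name)
    folds = [[] for _ in range(k)]
    # Place the largest subjects first, each into the currently smallest fold.
    for subject in sorted(by_subject, key=lambda s: (-len(by_subject[s]), s)):
        min(folds, key=len).extend(by_subject[subject])
    return folds
```

Because a mirrored clip shares its subject prefix with the original, both always land in the same fold.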


DroidGrid Integration

FERN uses DroidGrid as its data-capture pipeline. Phones running DroidCam stream video via RTSP → MediaMTX broker → laptop, with FFMPEG pass-through recording per camera. The recording assistant controls DroidGrid via REST API to synchronise multi-camera capture and auto-generate label JSONs from wall-clock anchors.

DroidCam phones  ──RTSP──►  MediaMTX  ──RTSP──►  Python (OpenCV capture + preview)
                                        └──►  FFMPEG pass-through → .mp4 per camera
                                                    │
                                                    ▼
                                        Recording Assistant → label JSONs (wall-clock anchors)
                                                    │
                                                    ▼
                                        FERN training pipeline (skeleton → CNN → ONNX)
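
To illustrate how wall-clock anchors could become frame-level labels, here is a hedged sketch. The cue tuple layout, the output keys (`label`, `start_frame`, `end_frame`) and the constant-frame-rate assumption are all hypothetical; the actual label JSON schema is defined by the recording assistant and is not documented here.

```python
def anchor_to_frame(anchor_s, recording_start_s, fps):
    """Convert a wall-clock timestamp (seconds) to a frame index, clamped at 0."""
    return max(0, round((anchor_s - recording_start_s) * fps))


def label_segments(cues, recording_start_s, fps):
    """Turn (label, go_time_s, rest_time_s) cues into frame ranges.

    All times are wall-clock seconds on the same clock as recording_start_s.
    """
    return [
        {
            "label": label,
            "start_frame": anchor_to_frame(go, recording_start_s, fps),
            "end_frame": anchor_to_frame(rest, recording_start_s, fps),
        }
        for label, go, rest in cues
    ]
```

For a 30 fps recording that started at t = 100.0 s, a foot_lift cue from 102.0 s to 104.0 s maps to frames 60 through 120.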

Hyperparameter Sweep (Optimal Config Found)

| Config | Mean CV | vs Baseline |
|---|---|---|
| Dropout=0.3 + cnn_out=128 | 44.36% | +3.25 pp |
| Dropout=0.3 | 43.73% | +2.62 pp |
| cnn_out=128 | 42.80% | +1.69 pp |
| Baseline (cnn_out=64, dropout=0.6) | 41.11% | - |
| cnn_out=32 | 34.05% | -7.06 pp |
| Dropout=0.7 | 33.02% | -8.09 pp |

Optimal: cnn_out=128, dropout=0.3, lr=3e-4, weight_decay=1e-2, warmup_epochs=20
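
The "vs Baseline" column is the difference in mean CV accuracy, in percentage points. A short sketch that recomputes it from the table values (the helper name is hypothetical):

```python
# Mean CV accuracy (%) per config, copied from the sweep table above.
SWEEP = {
    "Dropout=0.3 + cnn_out=128": 44.36,
    "Dropout=0.3": 43.73,
    "cnn_out=128": 42.80,
    "Baseline (cnn_out=64, dropout=0.6)": 41.11,
    "cnn_out=32": 34.05,
    "Dropout=0.7": 33.02,
}
BASELINE = "Baseline (cnn_out=64, dropout=0.6)"


def deltas_vs_baseline(results, baseline):
    """Percentage-point difference of each config against the baseline."""
    base = results[baseline]
    return {
        name: round(acc - base, 2)
        for name, acc in results.items()
        if name != baseline
    }


best_config = max(SWEEP, key=SWEEP.get)
```

Here `best_config` is the combined Dropout=0.3 + cnn_out=128 setting, matching the optimal config listed above.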


Production Models

| Model | Path | Params | Front Acc |
|---|---|---|---|
| Old front-only | models_final/fern_v2.onnx | 132K | 62.58% |
| Sweep optimal | models_sweep/fern_v2.onnx | 526K | 86.29% |
| Phase 1 (camera-flag) | models_final_v2/fern_v2.onnx | 140K | 50.48% |

Keyboard Controls

| Key | Action |
|---|---|
| R | Start recording (recording assistant) |
| S | Stop recording |
| Q | Quit |
| H | Toggle HUD overlay (inference) |

Troubleshooting

ONNX inference is slow

  • Make sure you're using the GPU build: pip install onnxruntime-gpu
  • Check that ONNX Runtime sees your CUDA device
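
A quick way to run that check from Python is sketched below. `get_available_providers()` and the provider name `CUDAExecutionProvider` are part of ONNX Runtime; the wrapper function is just a convenience for this example and returns False when the package is not installed.

```python
def cuda_provider_available():
    """Return True if ONNX Runtime reports the CUDA execution provider."""
    try:
        import onnxruntime as ort
    except ImportError:
        return False  # neither onnxruntime nor onnxruntime-gpu is installed
    return "CUDAExecutionProvider" in ort.get_available_providers()


if __name__ == "__main__":
    print("CUDA provider available:", cuda_provider_available())
```

If this prints False with onnxruntime-gpu installed, the usual suspects are a CPU-only onnxruntime package shadowing it or a CUDA/cuDNN version mismatch.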

Training hangs on Windows

  • Use --num_workers 0 — DataLoader multiprocessing hangs with SubsetRandomSampler

Low accuracy

  • Start with front-camera (c3) data only before adding multi-angle
  • Use the sweep optimal config: cnn_out=128, dropout=0.3, lr=3e-4

Camera shows no detection

  • Check data/skeletons/ CSVs exist and have valid joint coordinates
  • Verify the pose model .task file is in the correct cache path

FAQ

Q: What hardware do I need?
A: Any laptop with a webcam. CUDA GPU recommended for training (RTX 3070 or better), but CPU training and inference work.

Q: How do DroidGrid and FERN connect?
A: DroidGrid is the data-collection rig: it records multi-camera video. FERN is the recognition engine: it extracts skeletons, trains, and runs inference. The recording assistant bridges them by controlling DroidGrid via REST and auto-generating label JSONs.

Q: Why CNN and not BiLSTM?
A: Extensive eval showed CNN-only outperforming BiLSTM by 45%+ on this dataset size. BiLSTM will be revisited with more data.

Q: How do I contribute data?
A: Record with recording_assistant.py (from the DroidGrid repo), run skeleton extraction and training, then submit a PR.


File Structure

FERN/
├── src/
│   ├── model_v2.py              # CNN-only architecture
│   ├── dataset_v2.py            # Sliding-window dataset with camera-flag
│   ├── train_v2.py              # Training loop (cosine LR + warmup + early stopping)
│   ├── evaluate_v2.py           # Per-class metrics + confusion matrix
│   ├── kfold_cv.py              # K-fold CV with grouped folds
│   ├── extract_skeleton.py      # MediaPipe skeleton extraction
│   ├── infer_v2.py              # Live inference (PyTorch)
│   ├── infer_onnx.py            # Live inference (ONNX Runtime)
│   ├── export_onnx.py           # .pth → .onnx export
│   ├── test_onnx.py             # Full-dataset ONNX accuracy
│   ├── recording_assistant.py   # Recording UI (from DroidGrid)
│   ├── add_foot_hold_gaps.py    # Insert idle gaps at transitions
│   ├── mirror_dataset.py        # LR skeleton augmentation
│   └── merge_v1_database.py     # FERN v1 clip merger
├── models_final/                # Old front-only (132K, 62.58%)
├── models_final_v2/             # Phase 1 camera-flag (140K, 50.48%)
├── models_sweep/                # Sweep optimal (526K, 86.29%)
├── assets/
│   └── banner.svg
├── AGENTS.md                    # AI agent knowledge
├── CAMERA_FLAG_AGENT.md         # Camera-flag implementation plan
├── FERN_v2_COMPLETE_REPORT.md   # Full technical report
├── FERN_v2_AI_REPORT.md         # AI session report
├── run_nightly.ps1              # Nightly training pipeline
├── requirements_v2.txt
├── .gitignore
└── README.md

Key Design Decisions

| Decision | Outcome |
|---|---|
| BiLSTM evaluated and dropped | CNN outperforms at current dataset scale |
| MediaPipe z-depth rotation | Failed (~15% accuracy); z too noisy for single-camera transforms |
| Auto-generated labels | Eliminates manual annotation errors |
| Camera-ID one-hot flag | Single model handles multiple angles |
| Stereo triangulation | Designed; needs physical calibration |
| Early stopping warmup guard | Required: val_loss vs val_acc mismatch caused false stops |
| num_workers=0 on Windows | DataLoader multiprocessing hangs with SubsetRandomSampler |
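
The early-stopping warmup guard from the table can be pictured with a sketch like this one. It is an illustration under assumptions, not the code in train_v2.py: it monitors validation accuracy, and the `patience` default is a made-up value.

```python
class EarlyStopper:
    """Stop when val accuracy has not improved for `patience` epochs.

    Epochs before `warmup_epochs` never count towards patience, so noisy
    early metrics cannot trigger a false stop.
    """

    def __init__(self, patience=20, warmup_epochs=20):
        self.patience = patience
        self.warmup_epochs = warmup_epochs
        self.best_acc = float("-inf")
        self.bad_epochs = 0

    def step(self, epoch, val_acc):
        """Record one epoch (0-based); return True if training should stop."""
        if val_acc > self.best_acc:
            self.best_acc = val_acc
            self.bad_epochs = 0
            return False
        if epoch < self.warmup_epochs:
            return False  # warmup guard: do not count or stop yet
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Call `step(epoch, val_acc)` once per epoch after validation and break out of the training loop when it returns True.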

Roadmap

Alpha (current)

  • MediaPipe skeleton extraction
  • CNN-only model (~526K params optimal)
  • Auto-labeling recording assistant
  • Multi-camera setup (c3, c4, c2)
  • Camera-ID one-hot flag design
  • Hyperparameter sweep (9 configs, optimal found)
  • 20-subject dataset recording
  • LR mirror augmentation
  • Subject-independent 5-fold CV as primary metric

Beta

  • Camera-flag model on multi-angle data
  • Stereo triangulation (requires calibration)
  • Confidence smoothing (temporal majority vote)
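
Since temporal majority voting is still a roadmap item, the following is only one possible shape for it. The class name, the window size and the tie-breaking rule (most recent label wins) are assumptions.

```python
from collections import Counter, deque


class MajorityVote:
    """Smooth per-window predictions with a majority vote over recent labels."""

    def __init__(self, size=5):
        self.history = deque(maxlen=size)  # oldest labels fall off automatically

    def update(self, label):
        """Add the newest prediction and return the smoothed label."""
        self.history.append(label)
        counts = Counter(self.history)
        top = max(counts.values())
        # Ties go to the most recent label among the tied ones.
        for recent in reversed(self.history):
            if counts[recent] == top:
                return recent
```

A single stray prediction inside a run of foot_hold windows is then suppressed, at the cost of a short delay before a real gesture change shows up in the output.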

Gold

  • Full augmentation suite
  • Ablation study
  • Paper draft

Release

  • Paper submission
  • Open-source weights + demo
  • Dataset release

Contributing

Issues and pull requests are welcome.

  1. Fork the repo
  2. Create a branch: git checkout -b feature/my-feature
  3. Commit with a clear message
  4. Open a pull request

Citation

@misc{fern2026,
  title   = {FERN: Real-Time Foot Gesture Recognition via MediaPipe Skeleton and CNN},
  author  = {Vision-Orchestration},
  year    = {2026},
  url     = {https://github.com/Vision-Orchestration/FERN}
}

License

MIT


Part of the Vision-Orchestration toolkit.

MediaPipe skeletons + CNN. No depth sensor. No wearables. Just a camera.
