GitHub - Vision-Orchestration/FERN: tuned up

Data pipeline: recorded with the DroidGrid multi-phone camera rig. Label JSONs are auto-generated by the recording assistant during capture.

Overview

FERN works on any standard RGB camera: no depth sensor, no wearable, no special rig. It extracts 33-joint pose skeletons via MediaPipe, normalises them to body-relative coordinates, and feeds 60-frame sliding windows into a compact CNN (~526K params) that classifies 8 foot gestures at ~20 ms/window on an RTX 3070 Laptop GPU. CPU inference is also supported.

The pipeline is fully modular: swap the camera, retrain on new subjects, or extend the gesture set without touching the model architecture.

	Feature	Detail
🎯	Real-time inference	~20 ms/window on GPU, CPU-supported
📷	Any RGB camera	Webcam, phone (via DroidCam), or video file
🧠	CNN-only	BiLSTM was evaluated and dropped — CNN outperforms at this dataset scale
🔀	Multi-angle	Per-frame camera-ID flag lets one model handle multiple angles
🏷️	Auto-labeling	Recording assistant generates label JSONs during capture — no manual annotation
🔄	Data augmentation	Mirror, time warp, joint dropout, noise injection
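
The augmentations in the table can be illustrated on a single window of normalised skeleton data. This is a minimal NumPy sketch, not the repo's implementation: the `(frames, joints, 3)` array layout, the function names, and the default magnitudes are all assumptions made for illustration. The mirror here only flips the x axis; whether the real tool also swaps left/right joint indices is not stated.

```python
import numpy as np

def mirror_x(window):
    """Flip the x axis of a (frames, joints, 3) window centred on the mid-hip."""
    out = window.copy()
    out[..., 0] *= -1.0
    return out

def add_noise(window, sigma=0.01, rng=None):
    """Add zero-mean Gaussian jitter to every coordinate (sigma is an assumed default)."""
    rng = np.random.default_rng() if rng is None else rng
    return window + rng.normal(0.0, sigma, size=window.shape)

def joint_dropout(window, p=0.1, rng=None):
    """Zero out whole joints across all frames, each with probability p."""
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(window.shape[1]) >= p
    return window * keep[None, :, None]

def time_warp(window, rate=1.2):
    """Play the window back `rate` times faster, holding the last frame once the source runs out."""
    n = window.shape[0]
    t = np.arange(n)
    src = np.clip(t * rate, 0, n - 1)          # fractional source frame for each output frame
    flat = window.reshape(n, -1)
    warped = np.stack([np.interp(src, t, col) for col in flat.T], axis=1)
    return warped.reshape(window.shape)
```

Passing an explicit `rng` (for example `np.random.default_rng(0)`) keeps augmented batches reproducible between runs.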

Gestures (8 classes)

All gestures are performed with the right foot.

ID	Class	Description
0	`foot_hold`	Standing still / idle — no gesture
1	`foot_lift`	Lift foot straight up
2	`sideway_kick`	Kick foot laterally
3	`cross_front`	Cross foot in front of body
4	`heel_tap`	Tap heel to ground
5	`flamingo_bend`	Single-leg balance with knee bend
6	`forward_step`	Step forward
7	`forward_kick`	Kick forward

foot_hold serves as the idle/null class — without it the model forces predictions on non-gesture frames. It needs dedicated diverse footage, not just transition padding.
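
To make the role of the idle class concrete, here is a small sketch of turning the model's 8 raw scores into a label and a confidence. The class order comes from the table above; the `min_confidence` fallback to `foot_hold` is an assumption added for illustration and is not described in this README.

```python
import math

CLASSES = ["foot_hold", "foot_lift", "sideway_kick", "cross_front",
           "heel_tap", "flamingo_bend", "forward_step", "forward_kick"]

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits, min_confidence=0.5):
    """Return (label, confidence); low-confidence windows fall back to the idle class."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    if probs[idx] < min_confidence:   # hypothetical threshold, not from the repo
        idx = 0
    return CLASSES[idx], probs[idx]
```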

Architecture

RGB video frames
      │
      ▼
MediaPipe PoseLandmarker  ──→  33 body joints × (x, y, z, visibility)
      │
      ▼
Normalise to Mid-Hip + Torso Length  ──→  30 features per frame
      │
      ▼
[optional] Camera-ID one-hot flag  ──→  30 + N features
      │
      ▼
Sliding window (60 frames, stride 15)
      │
      ▼
CNN1D  ──→  local spatial-temporal patterns
      │
      ▼
Softmax  ──→  8-class label + confidence

~526K parameters in the sweep-optimal config; ~132K in the baseline config. Runs in real time on CPU; ~20 ms/window on an RTX 3070 Laptop GPU.
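
The normalisation and windowing stages of the diagram can be sketched in a few lines of NumPy. Treat this as an illustration under stated assumptions rather than the code in `src/`: torso length is taken here as the mid-shoulder to mid-hip distance, all 33 joints are kept (the README does not say which values make up the 30 per-frame features), and the function names are invented. The landmark indices (11/12 shoulders, 23/24 hips) are the standard MediaPipe Pose ones.

```python
import numpy as np

# MediaPipe Pose landmark indices
L_SHOULDER, R_SHOULDER, L_HIP, R_HIP = 11, 12, 23, 24

def normalise(frames, eps=1e-6):
    """frames: (T, 33, 3) x/y/z landmarks -> coordinates relative to mid-hip, in torso lengths."""
    mid_hip = frames[:, [L_HIP, R_HIP]].mean(axis=1, keepdims=True)                  # (T, 1, 3)
    mid_shoulder = frames[:, [L_SHOULDER, R_SHOULDER]].mean(axis=1, keepdims=True)   # (T, 1, 3)
    torso = np.linalg.norm(mid_shoulder - mid_hip, axis=-1, keepdims=True)           # (T, 1, 1)
    return (frames - mid_hip) / np.maximum(torso, eps)

def sliding_windows(features, size=60, stride=15):
    """features: (T, F) -> (num_windows, size, F); trailing frames that do not fill a window are dropped."""
    starts = range(0, features.shape[0] - size + 1, stride)
    if len(starts) == 0:
        return np.empty((0, size, features.shape[1]), dtype=features.dtype)
    return np.stack([features[s:s + size] for s in starts])
```

With the defaults, a 100-frame clip yields three windows starting at frames 0, 15, and 30.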

Why CNN-only? BiLSTM was evaluated extensively and consistently underperformed at current dataset scale. CNN-only is the proven baseline for FERN v2.

Current Results

Metric	Value
ONNX test accuracy (sweep optimal)	86.29%
3-fold CV (subject-independent)	44.36% ± 6.75%
Model parameters (optimal)	~526K
Architecture	CNN-only
Training device	RTX 3070 Laptop (8 GB) / Ryzen 7 5800H
Primary camera	c3 (front, 0°)
Detection filter	≥70% detection ratio

The gap between subject-independent CV (44.36%) and the train-all ONNX model (86.29%, measured on data the model has already seen) points to data scarcity as the primary bottleneck: the model memorises its training subjects well but does not generalise to new ones. Target: 20+ subjects.
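
Subject-independent CV means that every clip from a held-out subject stays out of training. The repo's `kfold_cv.py` handles this with grouped folds; the sketch below is not that script, only a minimal standard-library illustration of the idea, with an invented function name and a simple round-robin assignment of subjects to folds.

```python
from collections import defaultdict

def subject_folds(clip_subjects, k=3):
    """Assign whole subjects to k folds; return a list of (train_indices, test_indices)."""
    by_subject = defaultdict(list)
    for idx, subject in enumerate(clip_subjects):
        by_subject[subject].append(idx)
    subjects = sorted(by_subject)
    folds = []
    for f in range(k):
        held_out = set(subjects[f::k])            # every k-th subject goes to this fold's test set
        test = [i for s in subjects if s in held_out for i in by_subject[s]]
        train = [i for s in subjects if s not in held_out for i in by_subject[s]]
        folds.append((train, test))
    return folds
```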

Known weak classes:

heel_tap — inherently weak from front view; side camera recommended
foot_hold — weak when only recorded as transition padding
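
The ≥70% detection-ratio filter listed in the results table could look roughly like this. It is a sketch under assumptions: the ratio is taken to be the fraction of frames in a clip where MediaPipe returned a pose, and the filter is applied per clip. Neither detail is spelled out in this README, and the function names are invented.

```python
def detection_ratio(detected_flags):
    """Fraction of frames with a detected pose; an empty clip counts as 0.0."""
    flags = list(detected_flags)
    if not flags:
        return 0.0
    return sum(bool(f) for f in flags) / len(flags)

def keep_clip(detected_flags, min_ratio=0.70):
    """True when the clip meets the detection-ratio threshold."""
    return detection_ratio(detected_flags) >= min_ratio
```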

Camera Setup

FERN supports multi-angle capture with a single camera-conditioned model.

Camera	Angle	Position	Status
c3	0° (front)	Ground level	✓ Training baseline
c4	~45° (right)	Ground level	✓ Active
c2	~90° (left)	Ground level	✓ Active
c1	Elevated	—	✗ Excluded — breaks normalisation
c5	—	—	✗ Excluded — insufficient subjects

Camera-ID flag: A per-frame one-hot vector is appended to skeleton features so one model learns angle-conditioned recognition across all cameras. Geometric rotation via MediaPipe z-depth was evaluated and failed (~15% accuracy). Stereo triangulation is designed but requires physical calibration.
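
As an illustration of the camera-ID flag, the snippet below appends the same one-hot vector to every frame of a clip, turning 30 features into 30 + N. The camera order in `CAMERAS` and the function name are assumptions; only the idea of a per-frame one-hot flag comes from the description above.

```python
import numpy as np

CAMERAS = ["c3", "c4", "c2"]  # active cameras from the table; this ordering is assumed

def append_camera_flag(features, camera):
    """features: (T, F) -> (T, F + N), with the camera's one-hot flag repeated on every frame."""
    one_hot = np.zeros(len(CAMERAS), dtype=features.dtype)
    one_hot[CAMERAS.index(camera)] = 1
    flags = np.tile(one_hot, (features.shape[0], 1))
    return np.concatenate([features, flags], axis=1)
```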

Quick Start

1. Clone & install

git clone https://github.com/Vision-Orchestration/FERN
cd FERN
python -m venv venv
.\venv\Scripts\Activate.ps1
pip install -r requirements_v2.txt
$env:PYTHONPATH = "$(Get-Location)\src"

2. Get the pose model

Download pose_landmarker_heavy.task (~30 MB) and place it at:

C:\Users\<user>\.cache\mediapipe\models\pose_landmarker_heavy.task
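
A quick way to confirm the file landed in the right place is to build the same path from the home directory and check it. This sketch assumes `Path.home()` resolves to your `C:\Users\<user>` folder; the helper name is invented.

```python
from pathlib import Path

def pose_model_path(home=None):
    """Expected location of the pose model under the user's home directory."""
    base = Path(home) if home is not None else Path.home()
    return base / ".cache" / "mediapipe" / "models" / "pose_landmarker_heavy.task"

path = pose_model_path()
print(path, "found" if path.is_file() else "missing")
```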

3. Run live inference

# Webcam
python src\infer_v2.py --model models_sweep\fern_v2.onnx --camera_id 0

# Video file
python src\infer_v2.py --model models_sweep\fern_v2.onnx --camera_id "path\to\video.mp4"

Training Pipeline

1. Record dataset (with DroidGrid + Recording Assistant)

python src\recording_assistant.py --subject p01 --cameras phone1,phone2,phone3

The recording assistant provides a fullscreen tkinter UI with countdown/GO/REST cues, stick-figure gesture illustrations, DroidGrid REST API integration for multi-camera sync, and auto-generated label JSONs. Target: 20 subjects, varied height and footwear.

The recording assistant lives in the DroidGrid repo and is symlinked or copied into src/.

2. Extract skeletons

python src\extract_skeleton.py --video_dir data\raw --output_dir data\skeletons

3. Train (optimal config)

python src\train_v2.py `
  --skeleton_dir data\skeletons\front --label_dir data\labels\front `
  --output_dir models_sweep `
  --epochs 200 --warmup_epochs 20 --batch_size 32 `
  --window_size 60 --stride 15 `
  --lr 3e-4 --weight_decay 1e-2 --dropout 0.3 `
  --cnn_out 128 --lstm_hidden 0 `
  --device cuda --num_workers 0 --train_all

4. Export to ONNX & evaluate

python src\export_onnx.py --checkpoint_path models_sweep\fern_v2_latest.pth --output_path models_sweep\fern_v2.onnx
python src\test_onnx.py --onnx_path models_sweep\fern_v2.onnx --skeleton_dir data\skeletons\front --label_dir data\labels\front --window_size 60 --stride 15

Augmentation Tools

Tool	Command	Effect
LR Mirror	`python src\mirror_dataset.py ...`	Doubles dataset via X-flip
V1 Merge	`python src\merge_v1_database.py ...`	Imports FERN v1 clip database
Add Foot-Hold Gaps	`python src\add_foot_hold_gaps.py ...`	Inserts 60-frame idle gaps at gesture transitions

Subject-aware splits are required: both clips of a mirrored pair must stay in the same train/val/test fold.
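
A cheap guard against leakage is to check that no subject appears in more than one split before training starts. The sketch below assumes clip names begin with the subject ID followed by an underscore (for example `p01_c3_take1_mirror`); that naming scheme and both function names are hypothetical.

```python
def subject_of(clip_name):
    """Subject ID, assuming names of the form '<subject>_<rest>'."""
    return clip_name.split("_", 1)[0]

def assert_no_leak(splits):
    """splits maps split name -> list of clip names; raises if a subject spans two splits."""
    seen = {}
    for split, clips in splits.items():
        for clip in clips:
            subject = subject_of(clip)
            if seen.setdefault(subject, split) != split:
                raise ValueError(f"{subject} appears in both {seen[subject]} and {split}")
```

Because a mirrored clip shares its subject ID with the original, grouping by subject keeps each pair together automatically.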

DroidGrid Integration

FERN uses DroidGrid as its data-capture pipeline. Phones running DroidCam stream video via RTSP → MediaMTX broker → laptop, with FFmpeg pass-through recording per camera. The recording assistant controls DroidGrid via its REST API to synchronise multi-camera capture and auto-generate label JSONs from wall-clock anchors.

DroidCam phones  ──RTSP──►  MediaMTX  ──RTSP──►  Python (OpenCV capture + preview)
                                        └──►  FFmpeg pass-through → .mp4 per camera
                                                    │
                                                    ▼
                                        Recording Assistant → label JSONs (wall-clock anchors)
                                                    │
                                                    ▼
                                        FERN training pipeline (skeleton → CNN → ONNX)
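
To show what labelling from wall-clock anchors involves, the sketch below converts cue timestamps into frame ranges for one recording. Everything about the data shape is an assumption: the README does not document the label JSON schema, so the `label`/`start_frame`/`end_frame` keys, the function name, and the constant-frame-rate conversion are illustrative only.

```python
def to_frame_segments(recording_start, fps, segments):
    """segments: iterable of (label, start_s, end_s) in wall-clock seconds -> list of frame ranges."""
    out = []
    for label, start_s, end_s in segments:
        first = max(0, round((start_s - recording_start) * fps))
        last = max(first, round((end_s - recording_start) * fps))
        out.append({"label": label, "start_frame": first, "end_frame": last})
    return out
```

For a recording that starts at t = 1000.0 s and runs at 30 fps, a cue spanning 1002.0 s to 1004.0 s maps to frames 60 through 120.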

Hyperparameter Sweep (Optimal Config Found)

Config	Mean CV	vs Baseline
Dropout=0.3 + cnn_out=128	44.36%	+3.25 pp
Dropout=0.3	43.73%	+2.62 pp
cnn_out=128	42.80%	+1.69 pp
Baseline (cnn_out=64, dropout=0.6)	41.11%	—
cnn_out=32	34.05%	-7.06 pp
Dropout=0.7	33.02%	-8.09 pp

Optimal: cnn_out=128, dropout=0.3, lr=3e-4, weight_decay=1e-2, warmup_epochs=20
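
The "vs Baseline" column is the difference in mean CV accuracy, in percentage points, from the baseline row. The short check below recomputes it from the table's own numbers; the dictionary keys are just labels chosen for this example.

```python
BASELINE = 41.11  # mean CV of the baseline config (cnn_out=64, dropout=0.6)

sweep = {
    "dropout=0.3 + cnn_out=128": 44.36,
    "dropout=0.3": 43.73,
    "cnn_out=128": 42.80,
    "cnn_out=32": 34.05,
    "dropout=0.7": 33.02,
}

# Percentage-point change relative to the baseline, rounded as in the table
deltas = {name: round(acc - BASELINE, 2) for name, acc in sweep.items()}
best = max(sweep, key=sweep.get)
```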

Production Models

Model	Path	Params	Front Acc
Old front-only	`models_final/fern_v2.onnx`	132K	62.58%
Sweep optimal	`models_sweep/fern_v2.onnx`	526K	86.29%
Phase 1 (camera-flag)	`models_final_v2/fern_v2.onnx`	140K	50.48%

Keyboard Controls

Key	Action
`R`	Start recording (recording assistant)
`S`	Stop recording
`Q`	Quit
`H`	Toggle HUD overlay (inference)

Troubleshooting

ONNX inference is slow

Make sure you're using the GPU build: pip install onnxruntime-gpu
Check that ONNX Runtime sees your CUDA device
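
One way to run that check is to ask ONNX Runtime which execution providers it can use. `onnxruntime.get_available_providers()` and the `CUDAExecutionProvider` name are part of ONNX Runtime; the `has_cuda` helper is only a convenience added for this example.

```python
def has_cuda(providers):
    """True when the CUDA execution provider is in the given provider list."""
    return "CUDAExecutionProvider" in providers

try:
    import onnxruntime as ort
    providers = ort.get_available_providers()
    print("CUDA available" if has_cuda(providers) else f"CPU only: {providers}")
except ImportError:
    print("onnxruntime is not installed")
```

If only `CPUExecutionProvider` is listed, the CPU-only package is probably installed instead of `onnxruntime-gpu`.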

Training hangs on Windows

Use --num_workers 0 — DataLoader multiprocessing hangs with SubsetRandomSampler

Low accuracy

Start with front-camera (c3) data only before adding multi-angle
Use the sweep optimal config: cnn_out=128, dropout=0.3, lr=3e-4

Camera shows no detection

Check data/skeletons/ CSVs exist and have valid joint coordinates
Verify the pose model .task file is in the correct cache path

FAQ

Q: What hardware do I need? Any laptop with a webcam. CUDA GPU recommended for training (RTX 3070 or better), but CPU training and inference work.

Q: How do DroidGrid and FERN connect? DroidGrid is the data-collection rig — it records multi-camera video. FERN is the recognition engine — it extracts skeletons, trains, and runs inference. The recording assistant bridges them by controlling DroidGrid via REST and auto-generating label JSONs.

Q: Why CNN and not BiLSTM? Extensive eval showed CNN-only outperforming BiLSTM by 45%+ on this dataset size. BiLSTM will be revisited with more data.

Q: How do I contribute data? Record with recording_assistant.py (from the DroidGrid repo), run skeleton extraction and training, then submit a PR.

File Structure

FERN/
├── src/
│   ├── model_v2.py              # CNN-only architecture
│   ├── dataset_v2.py            # Sliding-window dataset with camera-flag
│   ├── train_v2.py              # Training loop (cosine LR + warmup + early stopping)
│   ├── evaluate_v2.py           # Per-class metrics + confusion matrix
│   ├── kfold_cv.py              # K-fold CV with grouped folds
│   ├── extract_skeleton.py      # MediaPipe skeleton extraction
│   ├── infer_v2.py              # Live inference (PyTorch)
│   ├── infer_onnx.py            # Live inference (ONNX Runtime)
│   ├── export_onnx.py           # .pth → .onnx export
│   ├── test_onnx.py             # Full-dataset ONNX accuracy
│   ├── recording_assistant.py   # Recording UI (from DroidGrid)
│   ├── add_foot_hold_gaps.py    # Insert idle gaps at transitions
│   ├── mirror_dataset.py        # LR skeleton augmentation
│   └── merge_v1_database.py     # FERN v1 clip merger
├── models_final/                # Old front-only (132K, 62.58%)
├── models_final_v2/             # Phase 1 camera-flag (140K, 50.48%)
├── models_sweep/                # Sweep optimal (526K, 86.29%)
├── assets/
│   └── banner.svg
├── AGENTS.md                    # AI agent knowledge
├── CAMERA_FLAG_AGENT.md         # Camera-flag implementation plan
├── FERN_v2_COMPLETE_REPORT.md   # Full technical report
├── FERN_v2_AI_REPORT.md         # AI session report
├── run_nightly.ps1              # Nightly training pipeline
├── requirements_v2.txt
├── .gitignore
└── README.md

Key Design Decisions

Decision	Outcome
BiLSTM evaluated and dropped	CNN outperforms at current dataset scale
MediaPipe z-depth rotation	Failed (~15% accuracy) — z too noisy for single-camera transforms
Auto-generated labels	Eliminates manual annotation errors
Camera-ID one-hot flag	Single model handles multiple angles
Stereo triangulation	Designed; needs physical calibration
Early stopping warmup guard	Required — val_loss vs val_acc mismatch caused false stops
num_workers=0 on Windows	DataLoader multiprocessing hangs with SubsetRandomSampler
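
The early-stopping warmup guard from the table can be pictured as follows. This is a sketch of the general technique, not the logic in `train_v2.py`: it assumes validation accuracy is the monitored metric and that patience is simply not counted during warmup. The class name and defaults are invented.

```python
class EarlyStopping:
    """Stop after `patience` epochs without improvement, but never during warmup."""

    def __init__(self, patience=20, warmup_epochs=20):
        self.patience = patience
        self.warmup_epochs = warmup_epochs
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, epoch, val_acc):
        """Record this epoch's validation accuracy; return True when training should stop."""
        if val_acc > self.best:
            self.best, self.bad_epochs = val_acc, 0
        else:
            self.bad_epochs += 1
        if epoch < self.warmup_epochs:
            self.bad_epochs = 0   # warmup guard: noisy early epochs cannot use up patience
            return False
        return self.bad_epochs >= self.patience
```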

Roadmap

Alpha (current)

MediaPipe skeleton extraction
CNN-only model (~526K params optimal)
Auto-labeling recording assistant
Multi-camera setup (c3, c4, c2)
Camera-ID one-hot flag design
Hyperparameter sweep (9 configs, optimal found)
20-subject dataset recording
LR mirror augmentation
Subject-independent 5-fold CV as primary metric

Beta

Camera-flag model on multi-angle data
Stereo triangulation (requires calibration)
Confidence smoothing (temporal majority vote)
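
The planned confidence smoothing is not implemented yet; one possible shape for a temporal majority vote over recent window predictions is sketched below. The class name and the history size are assumptions.

```python
from collections import Counter, deque

class MajorityVote:
    """Report the most frequent label among the last `size` window predictions."""

    def __init__(self, size=5):
        self.history = deque(maxlen=size)

    def update(self, label):
        """Add the newest prediction and return the current majority label."""
        self.history.append(label)
        return Counter(self.history).most_common(1)[0][0]
```

A larger history suppresses more single-window flicker at the cost of a slower response to a genuine gesture change.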

Gold

Full augmentation suite
Ablation study
Paper draft

Release

Paper submission
Open-source weights + demo
Dataset release

Contributing

Issues and pull requests are welcome.

Fork the repo
Create a branch: git checkout -b feature/my-feature
Commit with a clear message
Open a pull request

Citation

@misc{fern2026,
  title   = {FERN: Real-Time Foot Gesture Recognition via MediaPipe Skeleton and CNN},
  author  = {Vision-Orchestration},
  year    = {2026},
  url     = {https://github.com/Vision-Orchestration/FERN}
}

License

MIT

Part of the Vision-Orchestration toolkit.

MediaPipe skeletons + CNN. No depth sensor. No wearables. Just a camera.
