Evaluation toolkit for validating multi-camera tracking and 3D reconstruction.
This repo provides a consistent way to audit data quality before doing any biomechanical / behavioral analysis.
Original Work: This repository is based on the evaluation metrics framework developed by Vishal Soni. The core Phase 1 and Phase 2 evaluation pipelines are from Vishal’s original work.
Additions by Howard Wang:
Downstream analysis (3D jaw motion, gape, kinematics, etc.) only makes sense if the raw inputs are trustworthy.
Phase 1 is a battery of sanity checks run on the 2D per-camera tracking and basic multi-camera alignment.
It answers: “Can we trust this session enough to even start doing 3D / biomechanics?”
If Phase 1 fails, we stop and fix tracking / calibration / sync before wasting time.
The evaluation pipeline is organized into four sequential phases, each building on the previous one.
Phase 1 is split into 5 ordered steps.
Each step focuses on one failure mode (coverage, pixel accuracy, drift, jitter, sync).
| Step | Phase 1 Evaluation Metric (What this step is doing for us) |
|---|---|
| Step 1 – Visibility & Continuity Audit | For every camera and every node (upper lip, lower lip, jaw ref, head, etc.) we measure: (i) how often that node is detected with confidence above threshold, (ii) how long it stays continuously detected without blinking out, and (iii) how many nodes are valid per frame over time. This immediately shows which cameras or landmarks are flaky, where tracking vanishes, and where coverage collapses mid-session. In plain words: “Do we actually have reliable signal here, or are we blind half the time?” |
| Step 2 – Pixel Accuracy vs Ground Truth | We align predictions to human-labeled ground truth (auto-fixing small frame offsets), then compute per-frame pixel error for every node in every camera. We also compute PCK-style stats (% of points landing within tight pixel radii like 2–5 px). This answers: “When the tracker says ‘this is the lower lip’, is it actually on the lower lip, or is it drifting somewhere else?” This step catches nodes that looked ‘present’ in Step 1 but are not anatomically usable. |
| Step 3 – Temporal Drift Check | Using the frame-by-frame pixel error from Step 2, we fit error vs time for each (camera, node), report the slope (px per 10k frames) and R², and flag nodes whose error steadily grows. A flat slope = stable lock. A positive slope = slow slide due to skin slip, lighting, or calibration creep. This prevents the classic failure mode: “the first 10 seconds looked fine so we trusted the whole session.” |
| Step 4 – Marker Stability / Jitter Audit | We measure frame-to-frame wobble and residual jitter for each node in each camera: how much the point jumps each frame, and how much it buzzes around a short smoothed trajectory. Even if the average position is “correct”, high jitter injects fake velocity/acceleration into gape, angle rate, etc. This step separates real biomechanical motion from neural-net wiggle. In other words: “Is this trace clean enough to use for kinematics, or is it twitchy nonsense?” |
| Step 5 – Inter-Camera Sync / Lag Check | For each pair of cameras, we cross-correlate the motion traces of the same node to estimate timing offset in frames and produce a lag matrix. If cam A is effectively ~1–2 frames ahead of cam B, multi-view triangulation will give fake 3D. This step answers: “Are all cameras looking at the same instant in time?” If not, downstream 3D fusion is automatically suspect. |
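As a concrete illustration of the Step 3 drift check, here is a minimal sketch (not the notebooks' actual code) of fitting one (camera, node) error trace and reporting the slope in px per 10k frames plus R²:

```python
import numpy as np

def drift_slope(frames, pixel_error):
    """Fit pixel error vs. frame index for one (camera, node) trace.

    Returns (slope in px per 10k frames, R^2). Frames with no ground-truth
    match (NaN error) are ignored.
    """
    frames = np.asarray(frames, dtype=float)
    err = np.asarray(pixel_error, dtype=float)
    ok = np.isfinite(err)
    slope, intercept = np.polyfit(frames[ok], err[ok], 1)   # linear fit: error ~ frame
    pred = slope * frames[ok] + intercept
    ss_res = np.sum((err[ok] - pred) ** 2)
    ss_tot = np.sum((err[ok] - err[ok].mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return slope * 10_000, r2
```

A near-zero slope means a stable lock; a clearly positive slope flags the slow slide described above.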
TL;DR of Phase 1: only if these 5 checks look sane do we move forward.
Phase 2 asks one question:
“Can we trust the 3D kinematics we’re about to analyze biologically?”
Everything in Phase 2 is built around that. We don’t care only about ‘does tracking exist’ (Phase 1). Now we care about: is the fused 3D signal actually solid, continuous, and geometrically real.
This prevents us from accidentally publishing “behavioral dynamics” that are actually tracker noise.
This is our reality check. If the 3D doesn’t line up with the images, it is not scientifically defensible.
In short:
Phase 1 = Did 2D tracking behave?
Phase 2 = Is the 3D reconstruction trustworthy enough to measure biomechanics on it?
| Step | Phase 2 Evaluation Metric (What this step is doing for us) |
|---|---|
| Step 1 – 3D Track Integrity Audit (Coverage / Jitter / Dropout) | We take the raw fused 3D trajectories (no smoothing, no cleanup, just the direct triangulated output) and score each joint in physical units. For every joint we report: Coverage (%) = how often this joint has a valid 3D position (not NaN), Jitter (mm/frame) = how much that joint is jumping frame-to-frame in 3D space (median and p95 step size), and Longest Gap (ms) = the worst continuous blackout where that joint disappears. Units are auto-normalized to mm, and FPS is used to convert gaps to milliseconds. This step answers: “Is this 3D skeleton actually present, stable, and continuous — or are we missing chunks, vibrating, or blacking out?” It’s an honest health report of the raw 3D before any rescue/smoothing, so we know what is truly trustworthy. |
| Step 2 – Geometric Consistency / Reprojection Error Audit | For every frame, joint, and camera we take the 3D point [X,Y,Z], project it back into that camera using its calibration matrix, and compare that predicted 2D pixel location to what the tracker actually saw in that camera. The pixel distance is the reprojection error. We summarize median and p95 reprojection error per (camera, joint), and also per joint across cameras. This answers: “Does the 3D point really line up with all cameras, or is the ‘3D’ actually inconsistent with the views?” High reprojection error flags bad calibration, camera desync (one camera a frame ahead/behind), triangulation errors (e.g. swapped upper/lower lip), or just hallucinated 3D. Low error means the 3D geometry is self-consistent and physically believable. |
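To make the reprojection check concrete, here is a minimal sketch assuming each camera's calibration can be expressed as a 3×4 projection matrix (your calibration.json may store intrinsics/extrinsics and distortion separately; adapt accordingly):

```python
import numpy as np

def reprojection_error_px(P, xyz, uv_observed):
    """Pixel distance between a reprojected 3D point and the 2D detection.

    P           : (3, 4) camera projection matrix (assumed layout)
    xyz         : (3,)   triangulated 3D point [X, Y, Z]
    uv_observed : (2,)   pixel location the 2D tracker reported in this camera
    """
    X_h = np.append(np.asarray(xyz, dtype=float), 1.0)   # homogeneous coordinates
    u, v, w = P @ X_h
    uv_proj = np.array([u / w, v / w])                    # perspective divide
    return float(np.linalg.norm(uv_proj - np.asarray(uv_observed, dtype=float)))
```

The notebook then summarizes the median and p95 of these distances per (camera, joint) and per joint across cameras.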
Phase 3 provides visualization and behavioral analysis tools for understanding motion patterns in the 3D tracking data. These analyses help characterize head orientation stability (Step 1) and lateral chewing preference (Step 2).
In short:
Phase 1 = Did 2D tracking behave?
Phase 2 = Is the 3D reconstruction trustworthy enough to measure biomechanics on it?
Phase 3 = What behavioral patterns can we extract from the 3D data?
| Step | Phase 3 Evaluation Metric (What this step is doing for us) |
|---|---|
| Step 1 – Normal Vector Stability Analysis | For every frame, we compute the plane normal vector from 3 landmark points (typically head landmarks) and measure stability metrics. We compute: (i) frame-to-frame angle changes and angular velocities, (ii) rolling mean normal vectors over different time windows (0.5s, 1s, 5s), (iii) rolling standard deviation of angles from the rolling mean, (iv) low-pass filtered normals to reduce noise, and (v) state classification (stable/changing) based on stability thresholds. We also provide per-second block analysis with mean normal vectors and angle shifts between seconds. This answers: “How stable is the reference plane orientation over time, and when are periods of stability vs. movement?” This helps understand head orientation stability and detect periods of movement vs. stability. |
| Step 2 – Chewing Sidedness Analysis | For every frame, we compute the midpoint between two tracked nodes (typically node_8 and node_9), calculate its velocity using central differences, and project the velocity vector onto the plane normal to determine sidedness. We classify each frame as “left”, “right”, or “neutral” based on velocity magnitude and sidedness score, filtering out idle movements to focus on active chewing. We report summary statistics including total frames, active chewing frames, classification breakdown (left/right/neutral percentages), average sidedness scores, and dominant side. This answers: “What is the lateral preference in chewing, and how does sidedness vary over time?” This helps characterize lateral chewing preference and patterns. |
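The core computation in Step 1 is small; a minimal sketch (landmark choice and any thresholds are placeholders, not the notebook's exact code) of the per-frame normal and frame-to-frame angle change could be:

```python
import numpy as np

def plane_normal(p1, p2, p3):
    """Unit normal of the plane through three landmark points (one frame)."""
    n = np.cross(np.asarray(p2, dtype=float) - np.asarray(p1, dtype=float),
                 np.asarray(p3, dtype=float) - np.asarray(p1, dtype=float))
    return n / np.linalg.norm(n)

def frame_to_frame_angles_deg(normals):
    """Angle in degrees between consecutive per-frame normals."""
    normals = np.asarray(normals, dtype=float)
    dots = np.clip(np.sum(normals[1:] * normals[:-1], axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(dots))
```

Rolling means, rolling standard deviations, low-pass filtering, and the stable/changing classification are then applied on top of these per-frame angle series.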
Phase 4 focuses on evaluating the quality of gape onset timing predictions and feeding pattern detection. It answers two key questions: how accurate are the predicted gape onset times, and how well are feeding bouts detected?
This phase does not do any preprocessing or model prediction – it only evaluates predictions that are already computed. It uses statistically robust metrics (medians, F1, bout recall) and bootstrap 95% confidence intervals (CIs) to summarize performance across sessions.
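For reference, a percentile-bootstrap mean and 95% CI across sessions can be sketched as follows (the resampling count and seed are illustrative, not necessarily what the notebook uses):

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Mean of per-session values with a percentile bootstrap (1 - alpha) CI."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    boot_means = rng.choice(values, size=(n_boot, values.size), replace=True).mean(axis=1)
    ci_low, ci_high = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return values.mean(), ci_low, ci_high
```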
Phase 4 provides objective performance metrics for both of these questions:
- median_latency: Median of (predicted onset - true onset), indicating if the model is early/late/on-time
- median_abs_latency: Median absolute timing error, showing typical prediction accuracy
- perfect_onset_rate: Fraction of gape cycles predicted within a tight tolerance (e.g., ±1 frame)
- f1_feeding: Frame-wise F1-score for the 0/1 feeding classification
- bout_recall: Fraction of true feeding bouts that are detected by the model
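As a sketch of how the timing metrics could be computed for one session (the nearest-predicted-onset matching used here is an assumption, not necessarily the notebook's exact rule):

```python
import numpy as np

def onset_timing_metrics(true_onsets, pred_onsets, perfect_tol=1.0):
    """Timing metrics for one session, given arrays of onset frame indices.

    Each true onset is matched to the nearest predicted onset (illustrative
    matching rule). Latency = predicted - true, so positive means 'late'.
    """
    true_onsets = np.asarray(true_onsets, dtype=float)
    pred_onsets = np.asarray(pred_onsets, dtype=float)
    latencies = np.array([
        pred_onsets[np.argmin(np.abs(pred_onsets - t))] - t for t in true_onsets
    ])
    return {
        "n_cycles": int(true_onsets.size),
        "median_latency": float(np.median(latencies)),
        "median_abs_latency": float(np.median(np.abs(latencies))),
        "perfect_onset_rate": float(np.mean(np.abs(latencies) <= perfect_tol)),
    }
```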
Phase 4 expects two CSV files:

Ground truth (gt.csv)

Required columns:

- session_id – string label for each recording (e.g., monkey01_day1)
- frame – frame index or time (must match prediction CSV)
- feeding_gt – ground-truth feeding label (0 = not feeding, 1 = feeding)
- onset_gt – 1 only at true gape onset frames, 0 otherwise

Predictions (pred.csv)

Required columns:

- session_id – must match GT
- frame – must match GT (same values, same order per session)
- feeding_pred – model-predicted feeding label (0/1)
- onset_pred – 1 at predicted gape onset frames, 0 otherwise

The notebook produces two key outputs:
- per_session_df – One row per session with metrics:
  - n_cycles – number of gape onsets evaluated
  - median_latency – median timing error (frames)
  - median_abs_latency – median absolute timing error
  - perfect_onset_rate – fraction within tolerance
  - n_frames – total number of frames
  - f1_feeding – F1 score for feeding detection
  - bout_recall – fraction of feeding bouts detected
- summary_df – Mean ± 95% CI across all sessions:
  - mean, ci_low, ci_high

In short:
Phase 1 = Did 2D tracking behave?
Phase 2 = Is the 3D reconstruction trustworthy enough to measure biomechanics on it?
Phase 3 = What behavioral patterns can we extract from the 3D data?
Phase 4 = How accurate are our gape timing and feeding predictions?
This repo expects you to organize each recording session in its own folder.
That session folder is both your input and, after running, the home for that session's QC results.
You (the user) prepare a folder for one session, for example:
session_data/
└─ <session_name>/
├─ cam_1.analysis.h5
├─ cam_2.analysis.h5
├─ cam_3.analysis.h5
├─ cam_4.analysis.h5
├─ calibration.json
└─ points3d.h5 / final_3d_tracks.npz
Meaning:
- cam_*.analysis.h5 – the per-camera 2D tracking outputs after SLEAP (or equivalent); one file per camera view.
- calibration.json – camera calibration + geometry info used to relate the different views.
- points3d.h5 / final_3d_tracks.npz – a fused multi-view 3D track dump for that same session (whatever your upstream pipeline exports as the "current best 3D").

This single <session_name>/ directory is what you point the evaluation code at.
That is the ONLY path you need to give the Phase 1 and Phase 2 notebooks.
In the repo, you have notebooks in the folder:
Phase 1- Sleap Videos & h5 Files/
├─ Evaluation_Metric_Step_1.ipynb
├─ Evaluation_Metric_Step_2.ipynb
├─ Evaluation_Metric_Step_3.ipynb
├─ Evaluation_Metric_Step_4.ipynb
└─ Evaluation_Metric_Step_5.ipynb
You open these in order (Step 1 → Step 2 → Step 3 → Step 4 → Step 5).
Inside each notebook you provide the path to the session folder:
session_path = "session_data/<session_name>/"
The code will read:
- cam_*.analysis.h5,
- calibration.json,
- the 3D dump (points3d.h5 or final_3d_tracks.npz).

The notebook then runs that step’s QC logic for that same session.
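For orientation, locating those inputs from the session path might look like this sketch (the notebooks' own loading code may differ):

```python
from pathlib import Path

session_path = Path("session_data/<session_name>/")

cam_files = sorted(session_path.glob("cam_*.analysis.h5"))    # per-camera 2D tracks
calib_file = session_path / "calibration.json"                # shared calibration
tracks_3d = next((p for p in (session_path / "points3d.h5",
                              session_path / "final_3d_tracks.npz")
                  if p.exists()), None)                       # whichever 3D dump exists
```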
In the repo, you have notebooks in the folder:
Phase 2-3D File/
├─ Phase_2_Evaluation_Metric_Step_1.ipynb
└─ Phase_2_Evaluation_Metric_Step_2.ipynb
You open these in order (Step 1 → Step 2).
Inside each notebook you provide the path to the session folder:
session_path = "session_data/<session_name>/"
The code will read:
- The current 3D dump (points3d.h5 or final_3d_tracks.npz): the main input for Phase 2, used in both Step 1 and Step 2.
- All cam_*.analysis.h5 files (the per-camera 2D tracking outputs): used in Phase 2 Step 2.
- The shared calibration.json (camera intrinsics/extrinsics): used in Phase 2 Step 2 to reproject the 3D back into each camera view.
Optional: CT Pedestal Integration
Inside the notebooks, you can configure the CT pedestal integration:
# ========== CT PEDESTAL CONFIGURATION (OPTIONAL) ==========
nose_landmark_name = "F" # update to your tracked nose joint name
include_ct_pedestals = True # set False to disable
# =========================================================
Each notebook runs that step’s QC logic for that same session and writes out CSV summaries and plots.
In the repo, you have notebooks in the folder:
Phase 3-Visualization/
├─ Phase_3_Evaluation_Metric_Step_1.ipynb
└─ Phase_3_Evaluation_Metric_Step_2.ipynb
You open these in order (Step 1 → Step 2).
Inside each notebook you provide the path to a CSV file with 3D node data:
csv_path = r"data/processed/all_nodes_3d_long.csv"
The CSV file should have columns: frame, node, x, y, z, time_s.
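A minimal sketch of reading that long-format table and pivoting it to one row per frame (pandas usage is illustrative; the notebooks' internal code may differ):

```python
import pandas as pd

csv_path = r"data/processed/all_nodes_3d_long.csv"
df = pd.read_csv(csv_path)                 # columns: frame, node, x, y, z, time_s

# wide layout: one row per frame, one (coordinate, node) column per tracked node
wide = df.pivot(index="frame", columns="node", values=["x", "y", "z"])
print(wide.head())
```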
For Step 1 (Normal Vector Stability):
- normal_vector_stability_analysis.csv and normal_vector_stability_sec_blocks.csv

For Step 2 (Chewing Sidedness):

- chewing_sidedness_analysis.csv

Each notebook runs that step’s analysis and writes out CSV summaries and statistics.
In the repo, you have the notebook in the folder:
Phase 4-Gape Analysis/
└─ Phase_4_Evaluation.ipynb
Inside the notebook you provide paths to two CSV files:
gt_csv_path = "data/gt.csv"
pred_csv_path = "data/pred.csv"
Configuration:
perfect_tol = 1.0 # e.g. ±1 frame at 60 Hz
The notebook will:
- merge the ground-truth and prediction tables on session_id and frame
- compute the per-session metrics and display per_session_df and summary_df

When you run a notebook for a given session, the code writes results back inside that same session folder under a nested output tree.
Concretely, after running the evaluation notebooks, your session folder will now look like:
session_data/
└─ <session_name>/
├─ cam_1.analysis.h5
├─ cam_2.analysis.h5
├─ cam_3.analysis.h5
├─ cam_4.analysis.h5
├─ calibration.json
├─ points3d.h5 / final_3d_tracks.npz
│
├─ out_metrics/
│ └─ Evaluation_Metrics/
│ ├─ metrics_step1/
│ ├─ metrics_step2/
│ ├─ metrics_step3/
│ ├─ metrics_step4/
│ └─ metrics_step5/
│
└─ out_3d_eval/
└─ ... Phase 2 evaluation results (Step 1 and Step 2) ...
Important details:
- out_metrics/ is created automatically inside that same <session_name>/.
- Inside out_metrics/, the code creates Evaluation_Metrics/.
- Inside Evaluation_Metrics/, each Phase 1 notebook creates its own subfolder:
  - metrics_step1/
  - metrics_step2/
  - metrics_step3/
  - metrics_step4/
  - metrics_step5/
- out_3d_eval/ is also created automatically inside that same <session_name>/.
- Inside out_3d_eval/, you get all Phase 2 evaluation results (Step 1 and Step 2).

This means every QC artifact for that session stays local to that session.
You don’t have to manually create these folders — the notebooks do it when they run.
Phase 3 and Phase 4 outputs:
Phase 3 and Phase 4 notebooks create output files in a configured directory (typically results/ or another specified output directory):
- Phase 3: normal_vector_stability_analysis.csv, normal_vector_stability_sec_blocks.csv, chewing_sidedness_analysis.csv
- Phase 4: results live in DataFrames (per_session_df, summary_df) and can be exported as needed

Prepare session data
Make a new folder for one session under session_data/<session_name>/.
Put in (for that session):
- cam_*.analysis.h5 (per-camera 2D tracks),
- calibration.json,
- the 3D dump (points3d.h5 or final_3d_tracks.npz).

Run Phase 1 — Per-Camera Tracking QC
Open and run notebooks in order:
Evaluation_Metric_Step_1.ipynb → … → Evaluation_Metric_Step_5.ipynb
Point them at the session folder.
Phase 1 code will create out_metrics/Evaluation_Metrics/ inside that session folder and populate:
- metrics_step1/
- metrics_step2/
- metrics_step3/
- metrics_step4/
- metrics_step5/

Run Phase 2 — 3D Reconstruction QC
Open and run notebooks in order:
Phase_2_Evaluation_Metric_Step_1.ipynb → Phase_2_Evaluation_Metric_Step_2.ipynb
Use the same session folder.
Optional: Configure CT pedestal integration if needed.
Phase 2 code will create out_3d_eval/ inside that session folder.
- out_3d_eval/ holds all Phase 2 evaluation results (Step 1 and Step 2).

Run Phase 3 — Visualization and Behavioral Analysis
Open and run notebooks in order:
Phase_3_Evaluation_Metric_Step_1.ipynb → Phase_3_Evaluation_Metric_Step_2.ipynb
Provide a CSV file with 3D node data (frame, node, x, y, z, time_s).
Configure landmark nodes for plane computation and analysis.
Phase 3 code will create output CSVs in the configured directory (typically results/).
Run Phase 4 — Gape & Feeding Evaluation
Open and run:
Phase_4_Evaluation.ipynb
Provide two CSV files: gt.csv (ground truth) and pred.csv (predictions).
Configure the perfect latency tolerance.
Phase 4 displays per_session_df and summary_df with performance metrics.
At that point, all four phases are complete.
This repository is maintained at: https://github.com/HowardWHSrun/ruten-work
Full documentation is available at: https://howardwhsrun.github.io/ruten-work/