First-person video · action recognition · ablation

How does a model read an egocentric action?

Start with a coffee-making episode. Slice it into short windows. Turn video frames and hand joints into features. Train three baselines and compare what each signal contributes.


Sample Run

120 windows · 2 action classes · 30 evaluation windows

The numbers below check the pipeline on one episode. They are not a multi-episode benchmark.

Learning Path

1. Window

Group nearby frames so the model can see motion and task progress.

2. Feature

Convert RGB frames and hand joints into compact numeric vectors.

3. Classifier

Train a softmax model that maps each window to an action label.

4. Evaluate

Compare accuracy and F1 to understand strengths and failure modes.
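Step 3 of the path above can be sketched in a few lines of NumPy. This is a minimal illustration of a softmax (multinomial logistic regression) classifier trained by full-batch gradient descent, not the repo's implementation: the function names `train_softmax` and `predict`, the hyperparameters, and the toy data are all assumptions made for the example.

```python
import numpy as np

def train_softmax(X, y, n_classes, epochs=300, lr=0.5, l2=1e-3, seed=0):
    """Fit a softmax classifier with full-batch gradient descent (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=0.01, size=(d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets, shape (n, n_classes)
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)            # class probabilities
        grad = (P - Y) / n                           # d(cross-entropy)/d(logits)
        W -= lr * (X.T @ grad + l2 * W)
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(X, W, b):
    """Return the highest-scoring class index for each window."""
    return np.argmax(X @ W + b, axis=1)

# Toy stand-in for window features: two well-separated clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(40, 4)),
               rng.normal(2.0, 0.5, size=(40, 4))])
y = np.array([0] * 40 + [1] * 40)

W, b = train_softmax(X, y, n_classes=2)
train_acc = float((predict(X, W, b) == y).mean())
print(f"training accuracy on toy data: {train_acc:.3f}")
```

In a real run, `X` would hold one feature vector per window and `y` the action label index for that window.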

1. RGB baseline

Uses appearance: color, texture, coarse layout, and changes across sampled frames.

2. Hand baseline

Uses 3D hand-joint motion, which is often a strong cue for intent.

3. Fusion baseline

Concatenates visual context with hand motion before classification.
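The concatenation step described for the fusion baseline can be illustrated as follows. This is a sketch under assumptions: the feature sizes (16 RGB dimensions, 63 hand dimensions for 21 joints × 3 coordinates), the per-feature standardization, and the helper names `standardize` and `fuse` are placeholders, and the random arrays only stand in for real features.

```python
import numpy as np

def standardize(train, other):
    """Scale both arrays using statistics from the training split only."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0) + 1e-8  # avoid division by zero
    return (train - mu) / sigma, (other - mu) / sigma

def fuse(rgb, hand):
    """Concatenate per-window RGB and hand features along the feature axis."""
    return np.concatenate([rgb, hand], axis=1)

# Placeholder features: 90 training windows and 30 evaluation windows.
rng = np.random.default_rng(0)
rgb_train, rgb_test = rng.normal(size=(90, 16)), rng.normal(size=(30, 16))
hand_train, hand_test = rng.normal(size=(90, 63)), rng.normal(size=(30, 63))

rgb_train_z, rgb_test_z = standardize(rgb_train, rgb_test)
hand_train_z, hand_test_z = standardize(hand_train, hand_test)

fused_train = fuse(rgb_train_z, hand_train_z)
fused_test = fuse(rgb_test_z, hand_test_z)
print(fused_train.shape, fused_test.shape)
```

Standardizing each modality before concatenation is a common precaution so that neither signal dominates purely because of its scale; whether the repo does this is not stated on this page.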

RGB baseline

RGB asks whether the scene itself reveals the action. For pour-over coffee, kettle position and dripper context can be highly informative.

Ablation Result

[Figure: macro F1 bar chart for the RGB, hand, and fusion baselines]

[Figure: confusion matrix for the sample action baseline run]

Key Terms In One Minute

Accuracy: the fraction of windows classified correctly.
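To make the definition concrete, here is a small sketch that computes accuracy alongside macro F1, the second metric this page reports. The function names and the toy labels are assumptions for illustration; macro F1 here is the unweighted mean of per-class F1 scores.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of windows whose predicted label matches the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1; a class with no support or predictions scores 0."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Toy example: class 0 is common, class 1 is rare and half-missed.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, n_classes=2)
print(f"accuracy={acc:.3f} macro_f1={f1:.3f}")  # accuracy=0.833 macro_f1=0.778
```

Macro F1 comes out lower than accuracy in this toy case because the rare class counts as much as the common one, which is why the two metrics are worth reading together.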

What The Sample Teaches

0.471

Hand-joint accuracy on the blocked-instance split. Hand motion is the strongest single cue across all 18 actions of the full episode.

0.143

Majority-class accuracy. Every learned baseline must beat this floor before any other comparison means anything.
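A majority-class floor of this kind can be computed as sketched below. The helper name and the toy labels are hypothetical, and the numbers are not the sample run's data: the most frequent training label is predicted for every test window.

```python
from collections import Counter

def majority_class_accuracy(train_labels, test_labels):
    """Predict the most frequent training label everywhere and score it on the test set."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return majority_label, hits / len(test_labels)

# Toy labels only; the action names are placeholders.
train_labels = ["pour"] * 5 + ["grind"] * 3 + ["stir"] * 2
test_labels = ["pour", "pour", "grind", "stir"]

label, floor = majority_class_accuracy(train_labels, test_labels)
print(label, floor)  # pour 0.5
```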

The split strategy decides what the numbers mean. A leaky stratified split inflates every model's score above 0.93, while the honest blocked-instance split keeps training support for every test class without sharing frames between train and test. Because the sample is one episode, treat the result as a within-episode demonstration. For a benchmark, add more episodes and split by held-out episode.
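One way to picture a blocked-instance split is sketched below. It rests on assumptions the page does not confirm: each window carries an instance id (one contiguous occurrence of an action), windows never straddle two instances, and the last instance of each class is held out only when that class has at least two instances. The function name is hypothetical.

```python
from collections import defaultdict

def blocked_instance_split(instance_ids, labels):
    """Hold out whole action instances so train and test never share an instance."""
    instances_by_class = defaultdict(list)
    for inst, lab in zip(instance_ids, labels):
        if inst not in instances_by_class[lab]:
            instances_by_class[lab].append(inst)

    test_instances = set()
    for insts in instances_by_class.values():
        if len(insts) >= 2:              # keep at least one instance for training
            test_instances.add(insts[-1])

    train_idx = [i for i, inst in enumerate(instance_ids) if inst not in test_instances]
    test_idx = [i for i, inst in enumerate(instance_ids) if inst in test_instances]
    return train_idx, test_idx

# Toy episode: 12 windows, 5 action instances, placeholder class names.
instance_ids = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4]
labels = ["pour"] * 3 + ["stir"] * 2 + ["pour"] * 3 + ["stir"] * 2 + ["grind"] * 2

train_idx, test_idx = blocked_instance_split(instance_ids, labels)
print(train_idx)  # [0, 1, 2, 3, 4, 10, 11]
print(test_idx)   # [5, 6, 7, 8, 9]
```

A stratified split over individual windows would instead scatter windows from the same instance across both sides, which is the leak the paragraph above warns about.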

Concepts, Explained Visually

Each explanation connects an ML idea to the coffee-making episode used in this repo.

Temporal window

A temporal window is like watching a few seconds of motion instead of judging from a single photo. The model can see reaching, grasping, lifting, and pouring as short sequences.
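As a rough illustration, windowing can be expressed as computing frame ranges and then sampling a few frames from each range. The window length, stride, sample count, and function names below are assumptions chosen for the example, not values taken from the repo.

```python
import numpy as np

def sliding_windows(n_frames, window, stride):
    """Return (start, end) frame ranges, end exclusive, that fit inside the clip."""
    return [(s, s + window) for s in range(0, n_frames - window + 1, stride)]

def sample_frames(start, end, n_samples):
    """Pick n_samples frame indices spread evenly across one window."""
    return np.linspace(start, end - 1, n_samples).round().astype(int)

windows = sliding_windows(n_frames=100, window=30, stride=10)
first_samples = sample_frames(*windows[0], n_samples=4)

print(len(windows), windows[0], windows[-1])  # 8 (0, 30) (70, 100)
print(first_samples.tolist())                 # [0, 10, 19, 29]
```

Overlapping windows (stride smaller than window length) give the model more examples per episode, at the cost of neighboring windows sharing frames.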

Run This Locally

1. Clone the repository.

2. Set DATA_ROOT.

3. Install the package in editable mode: pip install -e ".[dev]".

4. Run the ablation CLI.

Softmax baseline

ego-action-ablation --data-root "$DATA_ROOT" --output-dir outputs/sample_ablation

Softmax + MLP comparison

ego-action-ablation --data-root "$DATA_ROOT" --model both --epochs 300

Artifact Preview

{
  "split_strategy": "blocked-instance",
  "experiments": {
    "hand_joints_only": {"accuracy": 0.471, "macro_f1": 0.259, "ece": 0.357},
    "rgb_hand_late_fusion": {"model": "late_fusion_softmax"}
  }
}
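A preview like this can be inspected with the standard `json` module. The sketch below parses the preview text from a string, because the artifact's filename inside the output directory is not stated here; only the keys shown above are used.

```python
import json

# The preview above, embedded as a string for illustration.
artifact_text = """
{
  "split_strategy": "blocked-instance",
  "experiments": {
    "hand_joints_only": {"accuracy": 0.471, "macro_f1": 0.259, "ece": 0.357},
    "rgb_hand_late_fusion": {"model": "late_fusion_softmax"}
  }
}
"""

artifact = json.loads(artifact_text)
print("split:", artifact["split_strategy"])

for name, result in artifact["experiments"].items():
    # Keep only numeric metric fields; the preview omits them for some entries.
    metrics = {k: v for k, v in result.items() if isinstance(v, float)}
    print(name, metrics)
```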

Run the experiment with scripts/run_ablation.py or ego-action-ablation. The page above covers the core concepts directly.