First-person video · action recognition · ablation

How does a model read an egocentric action?

Start with a coffee-making episode. Slice it into short windows. Turn video frames and hand joints into features. Train three baselines and compare what each signal contributes.


Sample Run

120 windows · 2 action classes · 30 evaluation windows

The numbers below check the pipeline on one episode. They are not a multi-episode benchmark.

Learning Path

1. Window

Group nearby frames so the model can see motion and task progress.

2. Feature

Convert RGB frames and hand joints into compact numeric vectors.

3. Classifier

Train a softmax model that maps each window to an action label.

4. Evaluate

Compare accuracy and F1 to understand strengths and failure modes.
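Step 3 can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's implementation: the `softmax` and `predict` helpers, the weights, and the 3-dimensional window feature are all made up for the example.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(features, weights, biases):
    """Score one window with one weight row and one bias per action class."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    return probs.index(max(probs)), probs


# Hypothetical values: a 3-dimensional window feature and 2 action classes.
window_feature = [0.2, 0.9, 0.1]
weights = [[1.0, -1.0, 0.5], [-0.5, 2.0, 0.0]]
biases = [0.0, 0.1]

label, probs = predict(window_feature, weights, biases)
print(label, [round(p, 3) for p in probs])
```

Training adjusts `weights` and `biases` so that the highest probability lands on the correct action label for as many windows as possible.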

1. RGB baseline

Uses appearance: color, texture, coarse layout, and changes across sampled frames.

2. Hand baseline

Uses 3D hand-joint motion, which is often a strong cue for intent.

3. Fusion baseline

Concatenates visual context with hand motion before classification.
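The three baselines differ only in which feature vector reaches the classifier. The sketch below shows that idea under loud assumptions: mean pooling over frames is one simple way to get a per-window vector and may not be what this project does, and `mean_pool`, `fuse`, and all numbers are hypothetical.

```python
def mean_pool(vectors):
    """Average per-frame vectors into one vector per window (assumed pooling)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def fuse(rgb_feature, hand_feature):
    """Fusion by concatenation: RGB dimensions first, then hand dimensions."""
    return list(rgb_feature) + list(hand_feature)


# Made-up per-frame descriptors for a two-frame window.
rgb_frames = [[0.2, 0.4], [0.4, 0.6]]
hand_frames = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]

rgb_only = mean_pool(rgb_frames)     # input to the RGB baseline
hand_only = mean_pool(hand_frames)   # input to the hand baseline
fused = fuse(rgb_only, hand_only)    # input to the fusion baseline

print(len(rgb_only), len(hand_only), len(fused))
```

Concatenation leaves it to the classifier to weigh the two signals, which is why comparing all three baselines shows what each one contributes.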

RGB baseline

The RGB baseline asks whether the scene itself reveals the action. For pour-over coffee, kettle position and dripper context can be highly informative.

Ablation Result

Figure: Macro F1 bar chart for the RGB, hand, and fusion baselines.

Figure: Confusion matrix for the sample action baseline run.

Key Terms In One Minute

Accuracy: the fraction of windows classified correctly.
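Accuracy and macro F1 can be computed by hand, as in the sketch below. The labels and predictions are invented for illustration, and the convention of scoring F1 as 0 for a class with no true or predicted windows is an assumption; the project's evaluation code may handle that case differently.

```python
def accuracy(y_true, y_pred):
    """Fraction of windows classified correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """Per-class F1 = 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed convention for empty classes


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Invented example: five windows, two action classes (0 and 1).
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred, labels=[0, 1])
print(round(acc, 3), round(mf1, 3))
```

Macro F1 penalizes a model that ignores one class, which accuracy alone can hide when classes are imbalanced.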

What The Sample Teaches

0.700

RGB accuracy under the chronological split. Appearance carries the strongest signal in this short run.

0.286

Hand-only macro F1. On windows held out by time, hand motion alone generalizes poorly when the model is trained on a single episode.

Because the sample is one episode, treat the result as a pipeline demonstration, not a benchmark. For a benchmark, add more episodes and hold out whole episodes for evaluation.
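The two split strategies mentioned here can be sketched as follows. The window records, the `episode` and `start` field names, and both helper functions are hypothetical; only the 120 windows and 30 evaluation windows come from the sample run.

```python
def chronological_split(windows, n_eval):
    """Train on the earliest windows, evaluate on the latest n_eval windows."""
    if not 0 < n_eval < len(windows):
        raise ValueError("n_eval must be between 1 and len(windows) - 1")
    ordered = sorted(windows, key=lambda w: w["start"])
    return ordered[:-n_eval], ordered[-n_eval:]


def held_out_episode_split(windows, held_out_episode):
    """Evaluate on one whole episode that never appears in training."""
    train = [w for w in windows if w["episode"] != held_out_episode]
    evaluation = [w for w in windows if w["episode"] == held_out_episode]
    return train, evaluation


# One episode with 120 windows, as in the sample run.
sample = [{"episode": "ep0", "start": i} for i in range(120)]
train, evaluation = chronological_split(sample, n_eval=30)
print(len(train), len(evaluation))

# A second, invented episode makes an episode-level split possible.
two_episodes = sample + [{"episode": "ep1", "start": i} for i in range(120)]
train_ep, eval_ep = held_out_episode_split(two_episodes, "ep1")
print(len(train_ep), len(eval_ep))
```

A chronological split keeps later windows away from training, but both halves still share one scene and one person. Holding out a whole episode tests whether the model transfers beyond that recording.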

Concepts, Explained Visually

Each explanation connects the ML idea to the coffee-making episode.

Temporal window

A temporal window is like watching a few seconds of motion instead of judging from a single photo. The model can see reaching, grasping, lifting, and pouring as short sequences.
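Slicing an episode into windows can be sketched like this. The function name, window size, and stride are assumptions chosen for the example, not the values used in this project.

```python
def sliding_windows(num_frames, window_size, stride):
    """Return (start, end) frame index pairs; end is exclusive."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    last_start = num_frames - window_size
    # range() is empty when the episode is shorter than one window.
    return [(s, s + window_size) for s in range(0, last_start + 1, stride)]


# Hypothetical 10-frame clip, 4-frame windows, step of 2 frames.
spans = sliding_windows(num_frames=10, window_size=4, stride=2)
print(spans)
```

A stride smaller than the window size makes neighbouring windows overlap, which yields more training examples but also makes adjacent windows look very similar.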

Run This Locally

1. Clone the repository.

2. Set DATA_ROOT to your data root directory.

3. Install the package with pip install -e ".[dev]".

4. Run the ablation CLI.

Softmax baseline

ego-action-ablation --data-root "$DATA_ROOT" --output-dir outputs/sample_ablation

Softmax + MLP comparison

ego-action-ablation --data-root "$DATA_ROOT" --model both --epochs 300

Artifact Preview

{
  "split_strategy": "chronological",
  "experiments": {
    "rgb_only": {"accuracy": 0.70, "macro_f1": 0.824},
    "rgb_hand_fusion_mlp": {"model": "mlp"}
  }
}
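A results artifact shaped like this preview can be read with the standard `json` module. The sketch below parses the preview text inline, because the artifact's filename is not given here; the `summarize` helper is hypothetical, and it uses `.get` so entries that lack metrics in the preview do not raise errors.

```python
import json

# The preview above, pasted as a string for a self-contained example.
artifact_text = """
{
  "split_strategy": "chronological",
  "experiments": {
    "rgb_only": {"accuracy": 0.70, "macro_f1": 0.824},
    "rgb_hand_fusion_mlp": {"model": "mlp"}
  }
}
"""


def summarize(artifact):
    """Map each experiment name to (accuracy, macro_f1); None if missing."""
    return {
        name: (result.get("accuracy"), result.get("macro_f1"))
        for name, result in artifact["experiments"].items()
    }


artifact = json.loads(artifact_text)
summary = summarize(artifact)
for name, (acc, mf1) in summary.items():
    print(name, acc, mf1)
```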

Run the experiment with scripts/run_ablation.py or ego-action-ablation. The page above covers the core concepts directly.