Unified Glossary

Every concept from the three project glossaries, cross-linked in one place. Generated from each repo's docs/concepts.md by scripts/build_glossary.py.

Action Recognition

from egocentric-action-baselines

Egocentric Video

Egocentric video is recorded from the actor's point of view. The camera moves with the person, so the video contains strong cues about hands, objects, and task progress. It also has challenges: motion blur, occlusion by hands, changing lighting, and quick camera rotations.

Action Recognition

Action recognition asks: "What is the person doing now?" In this repo the answer is a discrete label such as Pick up kettle or Position kettle to pour.

The model sees a short temporal window instead of a single frame. A window lets the classifier observe motion: reaching, grasping, lifting, placing, and pouring.

Feature

A feature is a numeric description of the input. The classifier does not train directly on raw videos here. It receives compact features:

• RGB features summarize color, texture, coarse spatial layout, and frame changes.

• Hand-joint features summarize 3D hand pose, velocity, and range of motion.

• Fusion features concatenate RGB and hand-joint features.

Baseline

A baseline is a simple method that sets a reference point. Good baselines answer important questions before using larger neural networks:

• Is visual context already enough?

• Do hand trajectories carry action intent?

• Does combining modalities improve the decision?

Ablation Study

An ablation study removes or changes one component at a time. The three experiments here form a small ablation:

1. RGB only.

2. Hand joints only.

3. RGB plus hand joints.

If fusion improves over both single-modality baselines, the two signals are complementary. If it does not, the dataset or model may need more samples, better temporal modeling, or stronger regularization.

Accuracy

Accuracy is the fraction of predictions that are correct:

accuracy = correct predictions / all predictions

It is intuitive, but it can hide class imbalance. A model can score high if it mostly predicts a frequent action.

Macro F1

F1 combines precision and recall for each class. Macro F1 averages class scores equally, so rare classes matter as much as common classes.

Use accuracy to understand overall correctness. Use macro F1 to check whether the model is treating each action class fairly.
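A minimal sketch of how the two metrics can disagree, in plain Python. The labels are invented for illustration, and `macro_f1` averages over the classes present in `y_true`, which is an assumption about how this repo computes the score.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class that is never hit scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (invented): nine windows of a frequent action, one of a rare action.
y_true = ["pour"] * 9 + ["grasp"]
y_pred = ["pour"] * 10  # a model that always predicts the frequent class

print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

The always-predict-the-frequent-class model reaches 0.9 accuracy, while its macro F1 stays below 0.5 because the rare class scores zero.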

Temporal Leakage

Overlapping windows share many frames. If nearby windows are randomly split between train and test, the test set can look too similar to the training set. On this episode a stratified split inflates every baseline above 0.93 accuracy. For serious evaluation, split by time blocks or by held-out episodes.

Label Shift

A split can be leak-free and still useless. The chronological tail split holds out the end of the episode — but every action here happens exactly once, so the tail contains classes the model never saw in training. Even the majority baseline scores 0.0. When the train and test label distributions disagree this badly, no model can look good, and the number measures the split rather than the model.

Blocked-Instance Split

The within-episode compromise: hold out the chronological tail of each action instance, then purge any train window that shares frames with a test window. Every test class keeps train support (no label shift) and no frame appears on both sides (no leakage). It measures within-instance generalization — a weaker claim than cross-episode generalization, but an honest one.
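The split described above can be sketched in a few lines. This is an illustration under assumptions, not the repo's implementation: the function names, the frame ranges, the window size and stride, and the 25% tail fraction are all hypothetical.

```python
def make_windows(start, end, size, stride):
    """Sliding windows [s, s + size) that fit inside the frame range [start, end)."""
    return [(s, s + size) for s in range(start, end - size + 1, stride)]


def overlaps(a, b):
    """True if two half-open frame ranges share at least one frame."""
    return a[0] < b[1] and b[0] < a[1]


def blocked_instance_split(instances, size, stride, test_fraction=0.25):
    """Hold out the tail of each action instance, then purge overlapping train windows."""
    train, test = [], []
    for label, start, end in instances:
        windows = make_windows(start, end, size, stride)
        n_test = max(1, int(len(windows) * test_fraction))
        train += [(label, w) for w in windows[:-n_test]]
        test += [(label, w) for w in windows[-n_test:]]
    # Purge: drop any train window that shares a frame with any test window.
    purged = [(label, w) for label, w in train
              if not any(overlaps(w, tw) for _, tw in test)]
    return purged, test


# Two action instances with invented frame ranges.
instances = [("Pick up kettle", 0, 60), ("Position kettle to pour", 60, 120)]
train, test = blocked_instance_split(instances, size=16, stride=4)
print(len(train), len(test))
```

The purge step costs training data: windows adjacent to the held-out tail are discarded rather than assigned to either side.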

Early vs Late Fusion

Early fusion concatenates features before training one classifier. Late fusion trains one classifier per modality and averages their predicted probabilities. Early fusion can model cross-modal interactions but lets a weak modality drag down a strong one inside a single head; late fusion keeps modalities independent but cannot model their interactions. Neither automatically beats the best single modality — making fusion help is a research problem, not a default.
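A small sketch of the two fusion styles, leaving the classifiers themselves out. The feature values and probabilities are invented, and the helper names are hypothetical.

```python
def early_fusion(rgb_features, hand_features):
    """Concatenate per-window features so one classifier sees both modalities."""
    return list(rgb_features) + list(hand_features)


def late_fusion(prob_rgb, prob_hand):
    """Average the class probabilities predicted by two independent classifiers."""
    return [(a + b) / 2 for a, b in zip(prob_rgb, prob_hand)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Invented case: the true class is 0. RGB is right but unsure,
# the hand-joint classifier is confidently wrong.
prob_rgb = [0.6, 0.4]
prob_hand = [0.1, 0.9]
fused = late_fusion(prob_rgb, prob_hand)

print(early_fusion([0.2, 0.7], [0.1, 0.0, 0.3]))  # one 5-dimensional vector
print(argmax(prob_rgb), argmax(fused))            # RGB alone picks 0, the average picks 1
```

In this case the averaged prediction is worse than the RGB classifier alone, which is one way fusion can fail to beat the best single modality.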

Calibration and ECE

A model is calibrated when predictions made with confidence p are correct about p of the time. Expected calibration error (ECE) bins predictions by confidence and averages the gap between confidence and accuracy, weighting each bin by the number of predictions it holds. Small models on small data are usually overconfident: here ECE reaches 0.47, meaning stated confidence overstates accuracy by roughly 47 percentage points on average. Check calibration.json before trusting any probability downstream.
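A minimal ECE sketch in plain Python. It assumes equal-width confidence bins and top-class confidences; the binning the repo uses to produce calibration.json is not stated here, so treat the details as assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted average of |accuracy - mean confidence| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        acc = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(acc - mean_conf)
    return ece


# Invented predictions: the model claims 0.9 confidence but is right half the time.
confidences = [0.9, 0.9, 0.9, 0.9]
correct = [1, 1, 0, 0]
print(expected_calibration_error(confidences, correct))
```

For these four predictions the gap is 0.9 minus 0.5, so the function returns approximately 0.4.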

3D Reconstruction

from egocentric-3d-reconstruction-demo

Frame Extraction

Most reconstruction tools operate on images. Frame extraction samples still images from the video so feature matchers and neural renderers can process them.

Dense sampling gives more overlap but costs more time. Sparse sampling is faster but can fail if adjacent images no longer share enough visual content.
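The dense-versus-sparse trade-off comes down to a sampling step. The sketch below only picks frame indices; the function name and the frame rates are hypothetical, and decoding the video is left to whatever tool the pipeline uses.

```python
def sample_frame_indices(n_frames, video_fps, target_fps):
    """Indices of frames to keep when downsampling from video_fps to about target_fps."""
    step = max(1, round(video_fps / target_fps))
    return list(range(0, n_frames, step))


# A hypothetical 10-second clip at 30 fps.
dense = sample_frame_indices(300, video_fps=30, target_fps=10)
sparse = sample_frame_indices(300, video_fps=30, target_fps=2)
print(len(dense), len(sparse))
```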

Camera Intrinsics

Intrinsics describe how a camera maps 3D rays to pixels. Common values include:

• focal length,

• principal point,

• distortion parameters.

Fisheye cameras are useful for egocentric capture because they see a wide field of view. They also require careful distortion handling.

Camera Extrinsics

Extrinsics describe where a camera is in 3D space and how it is oriented. In a multi-camera rig, extrinsics explain how cameras are positioned relative to each other.
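Intrinsics and extrinsics meet when a 3D point is projected into an image. The sketch below assumes a world-to-camera pose (R, t) and an undistorted pinhole model; a fisheye camera needs its own distortion model on top of this, and all numbers are invented.

```python
def world_to_camera(R, t, X):
    """Apply extrinsics: X_cam = R @ X_world + t, with R a 3x3 nested list."""
    return [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]


def project_pinhole(fx, fy, cx, cy, X_cam):
    """Apply intrinsics: perspective division, then focal length and principal point."""
    x, y, z = X_cam
    if z <= 0:
        return None  # the point is behind the camera
    return (fx * x / z + cx, fy * y / z + cy)


R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # no rotation
t = [0.0, 0.0, 1.0]                                       # shifts points 1 m along z

X_cam = world_to_camera(R, t, [0.5, -0.25, 1.0])
pixel = project_pinhole(500.0, 500.0, 320.0, 240.0, X_cam)
print(X_cam, pixel)
```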

SLAM

SLAM means simultaneous localization and mapping. It estimates the camera path while building a sparse map of the environment. The sample annotation already contains SLAM camera poses and a point cloud preview.

COLMAP

COLMAP is a structure-from-motion and multi-view stereo system. It usually runs:

1. feature extraction,

2. feature matching,

3. mapping and bundle adjustment,

4. optional image undistortion and dense reconstruction.

Bundle adjustment optimizes camera poses and 3D points so projected points align with matched image features.

NeRF

NeRF represents a scene as a neural function. Given a 3D point and viewing direction, it predicts density and color. Rendering a new view means sampling many camera rays through this learned field.
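Rendering one ray can be sketched as alpha compositing over samples along it. The densities and colors below stand in for the outputs of the learned field, so the neural network itself is not part of this sketch and the sample values are invented.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples front to back along one camera ray.

    densities: predicted density at each sample
    colors:    predicted RGB at each sample
    deltas:    distance between consecutive samples
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel


# Empty space (density 0) followed by an effectively opaque red surface.
pixel = render_ray(
    densities=[0.0, 1e6],
    colors=[[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]],
    deltas=[0.1, 0.1],
)
print(pixel)
```

The green sample contributes nothing because its density is zero, so the pixel takes the color of the first opaque sample behind it.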

3D Gaussian Splatting

3D Gaussian Splatting represents a scene with many colored 3D Gaussians. It is often faster to render than classic NeRF and can produce high-quality novel views when camera poses are accurate.

Why Egocentric Reconstruction Is Hard

First-person footage violates many assumptions of clean multi-view reconstruction:

• the camera moves quickly,

• hands and objects move independently,

• motion blur reduces feature matches,

• fisheye distortion needs the right model,

• dynamic objects can confuse bundle adjustment,

• the camera may not revisit the same area from enough viewpoints.

Good reconstruction starts with clean frame sampling, calibration awareness, and explicit failure analysis.

Frame Quality Diagnostics

Reconstruction fails quietly when inputs are weak. Three cheap numbers predict most failures before COLMAP runs: Laplacian variance (blur — low variance means few sharp edges), exposure clipping fractions, and ORB feature matches between consecutive frames (overlap — the matches COLMAP will actually rely on). Measuring them on a central crop keeps the black fisheye border from skewing the statistics.
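Two of the three diagnostics can be sketched without any imaging library. The version below works on a grayscale image stored as a nested list and uses a 4-neighbour Laplacian; the clipping thresholds and crop fraction are assumptions, and ORB matching is omitted because it needs a feature library such as OpenCV.

```python
def central_crop(gray, keep=0.6):
    """Keep the central `keep` fraction of rows and columns."""
    h, w = len(gray), len(gray[0])
    dy, dx = int(h * (1 - keep) / 2), int(w * (1 - keep) / 2)
    return [row[dx:w - dx] for row in gray[dy:h - dy]]


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels (low means blurry)."""
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1] - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def clipping_fractions(gray, low=5, high=250):
    """Fractions of pixels that are nearly black and nearly white (thresholds assumed)."""
    pixels = [p for row in gray for p in row]
    n = len(pixels)
    return sum(p <= low for p in pixels) / n, sum(p >= high for p in pixels) / n


# Synthetic inputs: a checkerboard has many sharp edges, a flat image has none.
sharp = [[255 if (x + y) % 2 else 0 for x in range(8)] for y in range(8)]
flat = [[128] * 8 for _ in range(8)]

print(laplacian_variance(sharp), laplacian_variance(flat))
print(clipping_fractions(sharp))
print(len(central_crop([[0] * 10 for _ in range(10)])))
```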

Extrinsic Convention Verification

A transform stored as T_c_b might map body→camera or camera→body, and the SLAM pose might be world→body or body→world. Names do not disambiguate; geometry does. Projecting known 3D points (here, mocap hand joints) under each candidate composition and scoring the in-bounds fraction plus depth plausibility selects the right chain empirically. The wide fisheye field of view keeps many wrongly transformed points inside the image, so the in-bounds test alone is insufficient, which is why the depth prior matters.
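A reduced sketch of the idea with a single ambiguous transform and two candidates. It uses a pinhole projection for brevity where the real camera is fisheye, and the depth range, points, and intrinsics are invented assumptions.

```python
def invert(T):
    """Invert a rigid transform (R, t): the inverse is (R^T, -R^T t)."""
    R, t = T
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    return Rt, [-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3)]


def apply(T, X):
    """Map a 3D point through a rigid transform (R, t)."""
    R, t = T
    return [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]


def score(T_world_to_cam, points, fx, fy, cx, cy, width, height, depth_range=(0.1, 1.5)):
    """Fraction of points that land inside the image at a plausible depth."""
    good = 0
    for X in points:
        x, y, z = apply(T_world_to_cam, X)
        if not depth_range[0] <= z <= depth_range[1]:
            continue  # depth prior: hand joints should be within arm's reach
        u, v = fx * x / z + cx, fy * y / z + cy
        good += 0 <= u < width and 0 <= v < height
    return good / len(points)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
stored = (identity, [0.0, 0.0, 2.0])  # direction unknown: world->camera or camera->world?
hand_joints = [[0.0, 0.0, 2.4], [0.05, 0.02, 2.5], [-0.05, -0.02, 2.3]]

candidates = {"as stored": stored, "inverted": invert(stored)}
scores = {name: score(T, hand_joints, 500, 500, 320, 240, 640, 480)
          for name, T in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

In this toy case both candidates project the joints inside the image; only the depth check separates them, because the wrong direction places the hand more than four metres from the camera.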

Dynamic-Object Masking

COLMAP assumes a static scene; hands and manipulated objects violate it. A mask (white = use, black = ignore) removes those pixels from feature extraction. Masks from projected mocap are approximate — registration offsets of tens of pixels are normal — so dilation absorbs error and a visual preview is mandatory before trusting them.
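The dilation step can be sketched on a mask stored as a nested list, following the white = use, black = ignore convention above. The square neighbourhood and the radius are assumptions; an image library would normally do this with a morphological operation.

```python
def grow_ignore_region(mask, radius):
    """Expand every black (ignore) pixel into a (2 * radius + 1) square of black pixels."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for y in range(h):
        for x in range(w):
            if mask[y][x] == 0:
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 0
    return out


# A 7x7 mask that is all "use" except one projected hand pixel in the centre.
mask = [[255] * 7 for _ in range(7)]
mask[3][3] = 0

grown = grow_ignore_region(mask, radius=1)
print(sum(row.count(0) for row in grown))  # number of ignored pixels after dilation
```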

ATE, RPE, and Sim(3) Alignment

COLMAP trajectories live in an arbitrary frame and scale, so comparing raw translations to SLAM is meaningless. First convert COLMAP world-to-camera poses to camera centers (C = -R^T t), then align with a similarity transform (Umeyama). ATE RMSE measures global consistency after alignment; RPE RMSE measures local drift between consecutive poses; the recovered scale is itself diagnostic — a scale far from any physical expectation signals a degenerate reconstruction.
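A sketch of the camera-center conversion and the two error metrics. It assumes the estimated centers have already been Sim(3)-aligned to the reference (the Umeyama step is not shown), and it uses a simplified translation-only RPE; the trajectories are invented.

```python
import math


def camera_center(R, t):
    """C = -R^T t for a world-to-camera pose (R, t)."""
    return [-sum(R[j][i] * t[j] for j in range(3)) for i in range(3)]


def rmse(errors):
    """Root mean square of a list of scalar errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def ate_rmse(est, ref):
    """Absolute trajectory error: distance between matching, already aligned centers."""
    return rmse([math.dist(a, b) for a, b in zip(est, ref)])


def rpe_rmse(est, ref):
    """Relative pose error (translation only): mismatch of consecutive displacements."""
    errors = []
    for i in range(len(est) - 1):
        d_est = [b - a for a, b in zip(est[i], est[i + 1])]
        d_ref = [b - a for a, b in zip(ref[i], ref[i + 1])]
        errors.append(math.dist(d_est, d_ref))
    return rmse(errors)


# A pose rotated 90 degrees about z with translation (1, 2, 3).
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
center = camera_center(R, [1.0, 2.0, 3.0])

# The estimate is the reference shifted by a constant 0.1 m offset along x.
ref = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
est = [[0.1, 0.0, 0.0], [1.1, 0.0, 0.0], [2.1, 0.0, 0.0]]
print(center, ate_rmse(est, ref), rpe_rmse(est, ref))
```

A constant offset shows up in ATE but not in RPE, which matches the split between global consistency and local drift.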

Scene Graph Memory

from scene-graph-from-egocentric-video

Scene Graph

A scene graph represents the world as objects and relations. Instead of storing only pixels, it stores structured facts:

object: kettle
relation: hand_grasps(kettle)
time: 72863804786785

This structure is useful because it can be queried, inspected, and passed to an agent or planner.

Object-Centric Memory

Object-centric memory keeps a record for each object: when it was first seen, when it was last seen, how many times it appeared, and which aliases refer to it.

In egocentric video, objects may disappear behind hands or leave the field of view. Tracking object memory across time helps preserve task context.
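One possible shape for such a record, as a sketch. The class and field names are hypothetical and are not taken from the repo's schema; timestamps and the alias are invented.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectRecord:
    name: str
    first_seen: int
    last_seen: int
    sightings: int = 1
    aliases: set = field(default_factory=set)


class ObjectMemory:
    """Keeps one record per object and resolves aliases to a canonical name."""

    def __init__(self):
        self.records = {}
        self.alias_to_name = {}

    def add_alias(self, alias, name):
        self.alias_to_name[alias] = name

    def observe(self, label, timestamp):
        """Register one sighting and update the object's record."""
        name = self.alias_to_name.get(label, label)
        record = self.records.get(name)
        if record is None:
            record = self.records[name] = ObjectRecord(name, timestamp, timestamp)
        else:
            record.first_seen = min(record.first_seen, timestamp)
            record.last_seen = max(record.last_seen, timestamp)
            record.sightings += 1
        if label != name:
            record.aliases.add(label)
        return record


memory = ObjectMemory()
memory.add_alias("gooseneck kettle", "kettle")
memory.observe("kettle", 100)
memory.observe("gooseneck kettle", 250)  # the kettle reappears under another name
print(memory.records["kettle"])
```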

Temporal Relation

A temporal relation is a fact with a timestamp. For example:

visible_in(kettle, frame_12)
hand_grasps(kettle) at timestamp T
action_active(person, "Move kettle") at timestamp T

Temporal relations turn a static graph into a timeline.
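A sketch of timestamped relations and two timeline queries. The `Relation` type, the helper names, and the timestamps are hypothetical; only the predicate names come from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    predicate: str
    args: tuple
    timestamp: int


def timeline(relations, entity):
    """All relations that mention `entity`, ordered by time."""
    return sorted((r for r in relations if entity in r.args), key=lambda r: r.timestamp)


def state_at(relations, predicate, timestamp):
    """Most recent relation with `predicate` at or before `timestamp`, or None."""
    past = [r for r in relations if r.predicate == predicate and r.timestamp <= timestamp]
    return max(past, key=lambda r: r.timestamp) if past else None


relations = [
    Relation("hand_grasps", ("kettle",), 20),
    Relation("visible_in", ("kettle", "frame_12"), 12),
    Relation("action_active", ("person", "Move kettle"), 25),
]

print([r.predicate for r in timeline(relations, "kettle")])
print(state_at(relations, "action_active", 30))
print(state_at(relations, "action_active", 10))
```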

Hand-Object Interaction

Hand-object interaction describes how the actor's hands relate to objects: moving toward, grasping, placing, pouring, or touching. These relations are important for understanding intention, not just appearance.

Spatial Reasoning

Spatial reasoning asks where things are and how they relate in space. This repo stores SLAM camera poses with frames, which is a first step toward space-aware queries.

Examples:

• What object was visible when the camera was near the dripper?

• Which object did the hand interact with before the kettle moved?

• What was the current task state at the last observed timestamp?

Provenance

Provenance records where each fact came from. A relation may come from caption text, object annotations, or future detector outputs. Provenance makes the graph auditable and easier to improve.

Detector Hook

The current graph uses caption annotations. A future detector or segmenter can replace or merge those object sources while keeping the same graph schema:

video frame -> detector/segmenter -> objects/masks -> scene graph -> queries

Spatial Memory (Egocentric Proxy)

Attaching the wearer's SLAM camera position to each object sighting turns the scene graph into a spatial memory: where:<object> answers with the camera position at the last sighting. This is a proxy — where the *wearer* was, not where the object is. For tabletop manipulation the two nearly coincide; for far-field observations they do not. Provenance labels the proxy explicitly so downstream users cannot mistake it for triangulated geometry.
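A sketch of how a where:<object> query could be answered from stored sightings. The record layout, the function name, and the provenance label are hypothetical, and the positions are invented.

```python
def where(sightings, obj):
    """Return the wearer's camera position at the last sighting of `obj`, or None."""
    hits = [s for s in sightings if s["object"] == obj]
    if not hits:
        return None
    last = max(hits, key=lambda s: s["timestamp"])
    return {
        "object": obj,
        "position": last["camera_position"],
        "timestamp": last["timestamp"],
        # Hypothetical label: marks the answer as a proxy, not triangulated geometry.
        "provenance": "camera_position_at_last_sighting",
    }


sightings = [
    {"object": "kettle", "timestamp": 10, "camera_position": (0.10, 0.00, 1.20)},
    {"object": "dripper", "timestamp": 15, "camera_position": (0.30, 0.05, 1.10)},
    {"object": "kettle", "timestamp": 40, "camera_position": (0.45, 0.02, 1.15)},
]

print(where(sightings, "kettle"))
print(where(sightings, "mug"))
```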

Graph QA Evaluation

A scene graph is only as useful as the questions it answers correctly. eval/qa_pairs.json holds human-labeled questions with gold answers — object memory, alias resolution, visibility at a timestamp, interactions, task state, temporal order, and spatial proximity. Negative questions (absent objects, wrong relations) prevent a yes-saying graph from scoring well. With caption-grounded construction the score is a regression check; with detector sources it becomes a perception benchmark against the same gold answers.
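A sketch of the scoring loop and of why negative questions matter. The question strings and the in-memory list are invented for illustration; the actual layout of eval/qa_pairs.json is not shown here and may differ.

```python
def score_qa(answer_fn, qa_pairs):
    """Fraction of questions where the answer matches the gold label."""
    correct = sum(answer_fn(pair["question"]) == pair["gold"] for pair in qa_pairs)
    return correct / len(qa_pairs)


# Hypothetical yes/no questions, half of them negative.
qa_pairs = [
    {"question": "visible:kettle", "gold": True},
    {"question": "visible:dripper", "gold": True},
    {"question": "visible:toaster", "gold": False},  # absent object
    {"question": "grasps:dripper", "gold": False},   # wrong relation
]

known_facts = {"visible:kettle", "visible:dripper"}


def yes_sayer(question):
    """A degenerate graph that confirms everything."""
    return True


def graph_lookup(question):
    """A graph that only confirms facts it actually stores."""
    return question in known_facts


print(score_qa(yes_sayer, qa_pairs), score_qa(graph_lookup, qa_pairs))
```

With half of the questions negative, the yes-saying graph scores 0.5 while the fact-backed lookup scores 1.0.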