Project Narrative

From first-person video to world memory

Egocentric vision is powerful because it observes the world from the same viewpoint as a person or embodied agent. A single video episode can answer three questions: what the wearer is doing, how the camera moves in 3D, and which objects should be remembered.

Action Understanding

The action repo starts with short temporal windows. It compares a majority baseline, softmax classifiers, and an optional MLP head over RGB, hand-joint, and fusion features. The key lesson is that the numbers are only meaningful under chronological, multi-episode evaluation: adjacent windows from one episode tend to look nearly identical, so a random split can leak test content into training.
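A chronological split and a majority baseline can be sketched in a few lines. This is a minimal illustration, not the repo's code: the window dictionaries, the `start` and `label` keys, and the function names are all assumptions made for the example.

```python
from collections import Counter


def chronological_split(windows, train_fraction=0.8):
    """Sort windows by start time and cut once, so every test window
    comes later in time than every training window."""
    ordered = sorted(windows, key=lambda w: w["start"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def majority_baseline(train, test):
    """Predict the most frequent training label for every test window."""
    if not train or not test:
        raise ValueError("both splits must be non-empty")
    majority_label = Counter(w["label"] for w in train).most_common(1)[0][0]
    correct = sum(1 for w in test if w["label"] == majority_label)
    return majority_label, correct / len(test)


# Hypothetical windows, deliberately listed out of order.
labels = ["cut", "cut", "cut", "pour", "cut", "pour", "cut", "cut", "pour", "pour"]
windows = [{"start": float(i), "label": lab} for i, lab in enumerate(labels)]
windows.reverse()

train, test = chronological_split(windows)
majority_label, accuracy = majority_baseline(train, test)
print(majority_label, accuracy)
```

In this toy episode the later windows are dominated by a label that is rare early on, so the baseline scores poorly on the held-out tail. A random split would hide that shift. For multi-episode evaluation, the same idea applies with whole episodes held out instead of a time cut.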

3D Reconstruction

The reconstruction repo prepares frames, calibration, SLAM poses, COLMAP command templates, and optional COLMAP summaries. The key lesson is that reconstruction quality depends on the inputs, so run diagnostics on them before training any neural renderer.
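As one example of such a diagnostic, the sketch below scans a pose track for timestamp gaps and implausible position jumps. It is an assumption-laden illustration, not the repo's tooling: the `(timestamp, (x, y, z))` pose format, the function name `pose_diagnostics`, and both thresholds are hypothetical.

```python
import math


def pose_diagnostics(poses, max_gap_s=0.5, max_jump_m=0.3):
    """Flag suspicious spots in a time-sorted pose track.

    poses: list of (timestamp_seconds, (x, y, z)) tuples.
    The thresholds are illustrative defaults, not tuned values.
    """
    time_gaps, position_jumps = [], []
    for (t0, p0), (t1, p1) in zip(poses, poses[1:]):
        if t1 - t0 > max_gap_s:
            time_gaps.append((t0, t1))       # dropped frames or tracking loss
        if math.dist(p0, p1) > max_jump_m:
            position_jumps.append((t0, t1))  # likely relocalization or drift
    return {
        "num_poses": len(poses),
        "time_gaps": time_gaps,
        "position_jumps": position_jumps,
    }


# Hypothetical track with one time gap and one position jump.
poses = [
    (0.0, (0.00, 0.0, 0.0)),
    (0.1, (0.01, 0.0, 0.0)),
    (1.0, (0.02, 0.0, 0.0)),
    (1.1, (1.00, 0.0, 0.0)),
]
report = pose_diagnostics(poses)
print(report)
```

A report like this is cheap to produce and makes it clear whether a poor reconstruction is more likely an input problem than a model problem.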

Scene Graph Memory

The scene graph repo turns observations into objects, relations, timestamps, provenance, and queries. Objects derived from captions are marked as such, so their origin stays transparent, and detector/tracker JSON can be merged when available.
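The pieces named above (objects, relations, timestamps, provenance, queries) fit in a small data structure. The following is a minimal sketch under assumed names: `ObjectNode`, `Relation`, `SceneGraphMemory`, and the `"caption"` / `"detector"` source tags are hypothetical and not taken from the repo.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    label: str
    first_seen: float
    last_seen: float
    sources: set = field(default_factory=set)  # provenance, e.g. {"caption"}


@dataclass
class Relation:
    subject: str
    predicate: str
    target: str
    timestamp: float
    source: str


class SceneGraphMemory:
    def __init__(self):
        self.objects = {}
        self.relations = []

    def observe(self, object_id, label, timestamp, source):
        """Create or update an object, widening its time span and
        recording which source reported it."""
        node = self.objects.get(object_id)
        if node is None:
            node = ObjectNode(label, timestamp, timestamp)
            self.objects[object_id] = node
        node.first_seen = min(node.first_seen, timestamp)
        node.last_seen = max(node.last_seen, timestamp)
        node.sources.add(source)

    def relate(self, subject, predicate, target, timestamp, source):
        self.relations.append(Relation(subject, predicate, target, timestamp, source))

    def latest_relation(self, subject):
        """Query: the most recent relation for an object, or None."""
        candidates = [r for r in self.relations if r.subject == subject]
        return max(candidates, key=lambda r: r.timestamp, default=None)


memory = SceneGraphMemory()
memory.observe("mug_1", "mug", 2.0, "caption")
memory.observe("mug_1", "mug", 9.5, "detector")  # merged into the same node
memory.relate("mug_1", "on", "table_1", 2.0, "caption")
memory.relate("mug_1", "in", "sink_1", 9.5, "detector")
where = memory.latest_relation("mug_1")
print(where.predicate, where.target, memory.objects["mug_1"].sources)
```

Keeping the source on every node and relation means a query result can always be traced back to whether a caption or a detector produced it.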

Why They Belong Together

Action recognition explains intention. Reconstruction explains geometry. Scene graphs explain persistent state. Together they form a practical learning path for egocentric systems that need to understand tasks, space, and objects.

Current Evidence