Research Notes
All views expressed here are my own.
Embodied AI / Data
Ego Data: Today's Feasible Path, Not the Final Substrate
Ego data is valuable because it is large, natural, semantically rich, and close to everyday human physical experience. That may make it the most feasible, viable, and scalable path to pursue today. At the same time, embodied intelligence may require more than reproducing human behavior. The deeper data problem is how to represent tasks as physical state transformations.
Ego data may be the most practical path we have today: feasible to collect, viable for current training pipelines, and scalable enough to move the field. The point is not to step away from ego data, but to understand what additional layers it may eventually need.
The recent enthusiasm around ego data is easy to understand. Embodied AI has a data bottleneck. Language models can absorb web text, and vision models can absorb web images and video, but robots do not have an equivalent public ocean of interaction data. Real robot trajectories are expensive: they require hardware, space, teleoperation, maintenance, safety procedures, and repeated collection under physical constraints.
Ego video offers a practical answer to that scarcity. It is cheaper to collect than robot interaction data, grounded in the real world, and full of everyday human activities. Ego4D, for example, reports more than 3,670 hours of daily-life first-person video from 923 participants across 74 locations and 9 countries. Ego-Exo4D goes further by synchronizing egocentric and exocentric video and adding modalities such as audio, gaze, point clouds, localization, and IMU signals. These projects matter because they make human physical experience trainable at a scale that robot data has struggled to reach.
In this sense, ego data is beginning to play for embodied AI the role that internet text played for language modeling. It is not perfect, but it is large, messy, diverse, natural, and semantically dense. It captures kitchens, tools, clothing, furniture, social activity, repair, cooking, cleaning, attention, intention, and object use. It also fits naturally into the training pipelines of vision-language models, video models, and emerging vision-language-action models. That makes it an obvious and worthwhile frontier for scaling.
This is why the ego-data path is worth taking seriously. It is not merely convenient; it is the rare data source that is simultaneously real-world, human-scale, semantically rich, comparatively cheap to collect, and compatible with current foundation-model infrastructure. For a field constrained by robot hardware and slow physical data collection, that combination is unusually powerful.
The sharper question is not whether ego data matters; it clearly does. The question is what kind of information it provides. First-person video is a record of situated human experience: what a person saw, attended to, and did in a physical scene. It is much less often a record of controlled intervention: what forces were applied, which alternatives were available, what would have happened under a different grasp, and why one action succeeded where a nearby one would have failed. That distinction matters because embodied intelligence is not only about recognizing activity; it is about choosing interventions that change the world.
The Embodiment Gap
The caution is that ego data mostly teaches models how humans act. Embodied intelligence often asks a different question: how should an agent change the physical world to reach a goal, given its own body, sensors, tools, and constraints?
Those two questions can look similar, but they are not the same. Consider folding clothes. Human folding strategies are shaped by the human body: two hands, flexible fingers, limited force, visual blind spots, social habits, and a preference for motions that look natural to us. A robot may have very different affordances. It might pull a garment flat with two grippers, anchor a corner with suction, use a table edge as a tool, roll fabric through a fixture, or execute a strategy that does not look human but produces a more stable final fold.
A human demonstration is therefore one feasible solution under human embodiment. It is not necessarily the best robot solution, and it is not necessarily the best task solution. It is useful to separate three notions of optimality:
- Human optimality: what is natural, efficient, and socially familiar for a human body.
- Robot optimality: what is stable, fast, low-cost, and robust for a particular robot morphology.
- Task optimality: what physical transformation reaches the goal state most effectively, whether or not it resembles human action.
Ego data mostly approximates the first. Embodied AI often needs the second, and in the long run may need the third.
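To make the three notions concrete, here is a minimal sketch in Python. The strategy names, attributes, and every number are hypothetical values invented for illustration, not measurements. The point is only that one candidate set can yield three different winners depending on which objective is applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    human_effort: float        # hypothetical: awkwardness for a human body (lower is better)
    robot_failure_rate: float  # hypothetical: failure probability on one specific robot
    final_quality: float       # hypothetical: quality of the final fold, 0..1 (higher is better)


# Invented candidates for a cloth-folding task.
CANDIDATES = [
    Strategy("two-hand human fold", 0.1, 0.40, 0.80),
    Strategy("gripper pull-flat then fold", 0.6, 0.10, 0.85),
    Strategy("roll through fixture", 0.9, 0.15, 0.95),
]


def human_optimal(candidates):
    """Pick what is most natural for a human body."""
    return min(candidates, key=lambda s: s.human_effort)


def robot_optimal(candidates):
    """Pick what is most reliable for this particular robot."""
    return min(candidates, key=lambda s: s.robot_failure_rate)


def task_optimal(candidates):
    """Pick what produces the best final state, regardless of who executes it."""
    return max(candidates, key=lambda s: s.final_quality)


for chooser in (human_optimal, robot_optimal, task_optimal):
    print(chooser.__name__, "->", chooser(CANDIDATES).name)
```

Under these made-up numbers, a dataset of human demonstrations would contain only the first strategy, which is the sense in which ego data approximates human optimality.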
This is why some simple household tasks are deeper than they look. Opening a drawer is not only a hand motion; it is a constraint problem involving a joint, a handle, friction, collision, and sometimes a cabinet that moves unless it is braced. A human may stabilize the drawer with the other hand without thinking about it. A mobile robot might use its base, a second arm, a hook-like tool, or a pushing contact on the cabinet frame. The task objective is the same, but the best policy depends on embodiment and environment. Ego data can suggest the semantic plan; it may not identify the best physical mechanism.
What Scaling Laws May Not Settle
Scaling laws are important, but they should not be over-interpreted. They tell us that within a given data distribution and training objective, larger models, more data, and more compute can produce predictable gains. A 2024 study on neural scaling laws in robotics reports a meta-analysis of 327 papers and finds power-law improvements as resources increase for robot foundation models and robotics-oriented LLMs.
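For readers who want the shape of such a claim, a power law says loss falls as a fixed power of the resource. Here is a minimal sketch of how one might fit it, assuming the simple form loss = c * resource^(-alpha). The function name and the data points are invented for illustration and are not taken from the cited study.

```python
import math


def fit_power_law(resources, losses):
    """Fit loss = c * resource ** (-alpha) by least squares in log-log space.

    Returns (c, alpha).
    """
    xs = [math.log(r) for r in resources]
    ys = [math.log(loss) for loss in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, noise-free points generated from c = 5.0, alpha = 0.3 (illustrative only).
resources = [10.0, 100.0, 1_000.0, 10_000.0]
losses = [5.0 * r ** -0.3 for r in resources]

c, alpha = fit_power_law(resources, losses)
print(f"c = {c:.3f}, alpha = {alpha:.3f}")
```

Under this form, every tenfold increase in the resource multiplies the loss by the same factor, 10^(-alpha). The fit describes returns along one fixed distribution and objective; it says nothing about whether a different distribution would do better.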
More recent ego-data work makes the picture more interesting. EgoMimic treats egocentric human videos with 3D hand tracking as embodied demonstration data, not just high-level intent data, and reports a favorable scaling trend where additional hand data can be especially valuable for imitation learning. EMMA extends a similar idea to mobile manipulation by co-training human full-body motion data with static robot data, and reports positive performance scaling as the hours of human data increase.
EgoScale is the most direct recent evidence in this direction. It trains a vision-language-action model on more than 20,854 hours of action-labeled egocentric human video, reports a log-linear scaling law between human data scale and validation loss, and finds that this validation loss strongly correlates with downstream real-robot performance. Its final policy improves average success rate by 54% over a no-pretraining baseline on a 22-DoF dexterous robotic hand, with transfer to lower-DoF hands.
These results are important because they weaken the simple objection that ego data is only semantic or observational. Under the right representation, ego data can encode reusable motor priors. But they also sharpen the qualification: the useful signal is not raw first-person video alone. It comes from action labels, hand or body tracking, cross-domain alignment, human-robot co-training or mid-training, and real robot evaluation. Ego data scales best when it is transformed into a bridge between human motion and robot control.
These results should not, however, be read as proving that any particular data distribution is the right one for every task. More ego video may make a model better at predicting human actions, understanding human intent, and recognizing object interactions from a human viewpoint. With the right action representation, it may also improve robot manipulation. It still does not automatically teach the best control strategy for a robot with different sensors, actuators, constraints, and failure modes.
A useful way to read scaling laws is as a conditional statement: if the data distribution and objective remain fixed, what happens when resources increase? That is not the same as asking whether the distribution contains the variables needed for control. A model can scale beautifully on video prediction while still being blind to contact pressure, torque limits, hidden compliance, or the low-level cost of a recovery action. Scale can amplify the signal present in a dataset; it cannot fully recover information that was never observed.
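The last point can be illustrated with a toy experiment. This is a sketch under invented assumptions: the outcome depends on one visible variable (which side the grasp comes from) and one hidden variable (a torque that the video never records). A predictor that sees only the visible variable improves with more data, but its error flattens at the variance contributed by the hidden variable.

```python
import random


def make_samples(n, rng):
    """Each sample: visible grasp side v in {0, 1}, hidden torque t, outcome y = v + t."""
    samples = []
    for _ in range(n):
        v = rng.randrange(2)
        t = rng.uniform(-1.0, 1.0)  # hidden: never available to the predictor
        samples.append((v, t, v + t))
    return samples


def fit_visible_only(train):
    """Best predictor from the visible variable alone: mean outcome per value of v."""
    means = {}
    for v in (0, 1):
        ys = [y for (vv, _, y) in train if vv == v]
        means[v] = sum(ys) / len(ys)
    return means


def mse(means, test):
    """Mean squared error of the visible-only predictor on held-out samples."""
    return sum((y - means[v]) ** 2 for (v, _, y) in test) / len(test)


rng = random.Random(0)
test_set = make_samples(5_000, rng)
errors = {
    n: mse(fit_visible_only(make_samples(n, rng)), test_set)
    for n in (100, 1_000, 10_000)
}
FLOOR = 1.0 / 3.0  # variance of Uniform(-1, 1), the part of y driven by the hidden torque

for n, err in errors.items():
    print(f"train size {n:>6}: mse = {err:.3f} (floor is about {FLOOR:.3f})")
```

More training data moves the error toward the floor, not through it. Lowering the floor requires observing the torque, which is a change to the dataset rather than to its size.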
Board games offer an analogy. Training on human games can give a model a strong human prior, but superhuman play has required search, self-play, and the discovery of strategies that did not simply imitate human style. In robotics, ego data can supply a powerful human-experience prior. It is not the same thing as machine-optimal strategy discovery.
Put more carefully: scaling laws are optimization curves, not a complete data philosophy. They can tell us that the current path still has returns. They do not, by themselves, tell us whether the path is the shortest route to embodied intelligence, or whether it is optimizing the right target.
The Missing Pieces: Causality and Counterfactuals
Ego video is primarily observational. It shows that something happened, but often not why it happened, what forces were applied, what alternative actions would have caused, or where the boundary between success and failure lies.
A first-person video may show a person picking up a cup, pouring water, and putting the cup down. But the video usually does not tell us the contact forces, friction coefficients, exact contact patches, tactile feedback, or recovery options if the cup slips. It also does not answer counterfactual questions: what if the grasp were from another side, the force were larger, the object were wet, the table were more slippery, or the agent used a different tool?
The same issue appears in cooking, cleaning, and repair. Watching a person wipe a table tells us the semantic goal and the rough motion pattern. It may not reveal the pressure needed to remove a stain, the surface material, the cloth saturation, or the tradeoff between speed and damage. Watching a person tighten a screw shows a trajectory, but the important variable may be torque. Watching a person carry a bowl shows balance, but the hidden variable may be the liquid's slosh dynamics. These are not small details for a robot; they are often the task.
Control depends on these counterfactuals. A robot does not only need to know what action was observed; it needs to know what action would change the world in the direction of the goal. That requires intervention data, failure data, recovery data, and physical state transition data.
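A small sketch shows what counterfactual and failure data add for the cup example. It assumes a deliberately crude rule: a two-finger pinch holds the cup if friction carries its weight and the force stays under a crush limit. The function names, the rule, and all numbers are hypothetical; this is not a real contact model.

```python
def grasp_outcome(force_n, friction, weight_n=3.0, crush_limit_n=20.0):
    """Toy rule for a two-finger pinch on a cup. Returns 'slip', 'hold', or 'crush'."""
    if force_n > crush_limit_n:
        return "crush"
    if 2 * friction * force_n < weight_n:  # two contact patches must carry the weight
        return "slip"
    return "hold"


def sweep(friction, forces):
    """Try every candidate force and record the outcome, failures included."""
    return [(f, grasp_outcome(f, friction)) for f in forces]


def success_band(results):
    """Smallest and largest force that held the cup, or None if none did."""
    ok = [f for f, outcome in results if outcome == "hold"]
    return (min(ok), max(ok)) if ok else None


forces = [float(f) for f in range(1, 26)]  # 1 N to 25 N in 1 N steps
dry = success_band(sweep(friction=0.5, forces=forces))
wet = success_band(sweep(friction=0.125, forces=forces))

print("dry cup holds for forces in", dry)
print("wet cup holds for forces in", wet)
```

A single successful demonstration gives one point inside the band. The sweep gives the band itself, including where it narrows when the cup is wet, and that boundary is the information a controller needs.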
From Human-Centric to Task-Centric Data
This points to a more fundamental shift. Much of today's data is human-centric: it records human actions from a human viewpoint. A better substrate for embodied intelligence may be task-centric: organized around current states, goal states, physical interventions, and the resulting world changes.
Folding clothes should not only be represented as "how humans fold clothes." It can be represented as: how to transform a deformable object from initial state S0 to goal state Sg while minimizing time, failure rate, energy, hardware complexity, and environmental dependence.
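One way to write this down, offered as a sketch rather than a settled formulation, is as a constrained optimization. The symbols below are assumptions introduced here: a_t is the intervention at step t, f is the (unknown) physical transition function, c_k are the individual costs (time, failure rate, energy, hardware complexity, environmental dependence), w_k are their weights, and d measures distance to the goal state within a tolerance epsilon.

```latex
\min_{a_0, \dots, a_{T-1}} \; \sum_{k} w_k \, c_k\!\left(S_{0:T},\, a_{0:T-1}\right)
\quad \text{subject to} \quad
S_{t+1} = f(S_t, a_t), \qquad d(S_T, S_g) \le \epsilon
```

Nothing in this objective refers to how a human would perform the task; human demonstrations enter only as one source of candidate action sequences.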
More generally, the useful unit of data may need to shift from the clip to the transition. A clip is organized around what was visible to an observer. A transition is organized around what changed in the world. That means the representation should include not just pixels and language, but also object state, contact, constraints, action parameters, success metrics, and the local neighborhood of failed or near-miss actions.
In that formulation, the central object is not a video clip. It is a transition:
S0 -> action / intervention -> S1 -> action / intervention -> S2 -> ... -> Sg
Each step can include object geometry, contact points, force direction, visual observations, tactile feedback, constraints, intermediate failures, recovery attempts, and a task score. This structure asks a different question from imitation learning. Ego data asks, "What would a person do next?" Task-centric transition data asks, "What intervention would move the world closer to the goal?"
Ego data often treats action as imitation. Task-centric data treats action as state transformation.
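A minimal sketch of what such a record might look like as a data structure follows. The class and field names are hypothetical, chosen to mirror the elements listed above; they are not a proposed standard or an existing dataset schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Transition:
    state: dict[str, Any]       # e.g. object geometry, pose, contact points
    action: dict[str, Any]      # intervention parameters, e.g. force direction
    next_state: dict[str, Any]  # the resulting world state
    observations: dict[str, Any] = field(default_factory=dict)  # visual, tactile, ...
    failed: bool = False        # an intermediate failure occurred at this step
    recovery: bool = False      # this step is a recovery attempt
    cost: float = 0.0


@dataclass
class Episode:
    goal: dict[str, Any]
    steps: list[Transition]

    def is_chained(self) -> bool:
        """True if each step starts where the previous one ended (S0 -> S1 -> ...)."""
        return all(a.next_state == b.state for a, b in zip(self.steps, self.steps[1:]))

    def reached_goal(self) -> bool:
        """True if the final state matches every key in the goal."""
        if not self.steps:
            return False
        final = self.steps[-1].next_state
        return all(final.get(k) == v for k, v in self.goal.items())

    def total_cost(self) -> float:
        return sum(step.cost for step in self.steps)


# Toy example: folding a towel twice.
episode = Episode(
    goal={"folds": 2},
    steps=[
        Transition({"folds": 0}, {"type": "fold", "axis": "long"}, {"folds": 1}, cost=1.0),
        Transition({"folds": 1}, {"type": "fold", "axis": "short"}, {"folds": 2}, cost=1.5),
    ],
)
print(episode.is_chained(), episode.reached_goal(), episode.total_cost())
```

The design choice worth noting is that the goal and the cost live in the record itself, so two episodes with different actions can be compared by what they achieved and not by how closely they resemble a demonstration.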
What the Next Data Layer Might Look Like
The point is not to discard ego data. It may be best understood as one layer in a broader data ecology. Several complementary data forms seem especially important.
State-transition data records the state, action, next state, goal, success signal, and cost. This is closer to physical control than raw video because it makes the consequence of intervention explicit.
Object-centric affordance data shifts attention from the human to the object. It describes where an object can be grasped, pushed, rotated, opened, stretched, supported, or damaged. For many tasks, the key question is not "what did the person do?" but "what does the object allow?"
Failure-rich data records slipping, collision, over-pulling, occlusion, unstable intermediate states, false perception, and recovery. Human everyday video is biased toward fluent success. Robots need to know the boundary of failure and how to recover when execution drifts.
Counterfactual interaction data records not only the action that happened, but what would happen under nearby alternatives. This can come from robot exploration, simulation, planning, reinforcement learning, or active data collection. Its value is that the model learns a local physical model instead of a single demonstrated trajectory.
Human-to-robot alignment data maps egocentric hand trajectories, full-body motion, or action labels into robot action spaces. This is the layer where EgoMimic, EMMA, and EgoScale are especially informative: they show that ego data can become more than semantic pretraining when it is paired with the right action representation and alignment recipe.
Robot-native self-generated data is already becoming more prominent. Open X-Embodiment pools more than 1 million real robot trajectories across 22 robot embodiments. DROID reports 76,000 demonstration trajectories, or 350 hours of interaction data, collected across 564 scenes and 86 tasks. Physical Intelligence's π0 (pi-zero) is another sign of the trend, combining internet-scale vision-language pretraining with open-source and internally collected robot data to produce low-level motor commands.
These projects show that the field already understands a key point: human video is unlikely to be enough by itself. That is not an argument against ego data; it is an argument for converting ego data into richer action and interaction structure. Even robot trajectory data may be an intermediate stage. The next step is not only more trajectories. It is more structured interaction data with states, goals, interventions, failures, counterfactuals, costs, and physical constraints.
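As an illustration of the human-to-robot alignment layer described above, here is a deliberately simple retargeting sketch. It is not the method used by EgoMimic, EMMA, or EgoScale. It assumes a known rigid transform from the camera frame to the robot base frame and a parallel gripper, and all function names, frame keys, and calibration values are hypothetical.

```python
def transform_point(rotation, translation, point):
    """Apply p_robot = R @ p_camera + t using plain lists (3x3 matrix, 3-vectors)."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def pinch_to_gripper(pinch_m, max_pinch_m=0.10, max_opening_m=0.08):
    """Linearly map thumb-index distance to a gripper opening, clipped to its limits."""
    fraction = min(max(pinch_m / max_pinch_m, 0.0), 1.0)
    return fraction * max_opening_m


def retarget(hand_frames, rotation, translation):
    """Convert tracked hand frames into (end-effector position, gripper opening) targets."""
    return [
        (
            transform_point(rotation, translation, frame["wrist_xyz"]),
            pinch_to_gripper(frame["pinch_m"]),
        )
        for frame in hand_frames
    ]


# Hypothetical calibration: camera frame rotated 90 degrees about z, offset 0.5 m along x.
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.5, 0.0, 0.0]

frames = [
    {"wrist_xyz": (0.2, 0.0, 0.3), "pinch_m": 0.05},
    {"wrist_xyz": (0.2, 0.1, 0.3), "pinch_m": 0.20},  # wider than the gripper can open
]
targets = retarget(frames, R, t)
for position, opening in targets:
    print(position, opening)
```

Even this toy version shows where information is lost: the second frame's pinch is clipped to the gripper's limit, and nothing about contact force survives the mapping. That is the kind of gap the richer interaction layers are meant to fill.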
Where Ego Data Fits
The balanced view is not "ego data versus robot data." Ego data is likely the most feasible scalable route today, but it should occupy a specific layer in the stack. It is strong for learning human intent, open-world semantics, daily activity structure, tool use, and behavioral priors. It is weaker for precise control, contact dynamics, force feedback, morphology adaptation, non-human strategy discovery, failure recovery, and counterfactual action selection.
A more complete embodied data stack might look like this:
- Ego data provides semantics and intent.
- Exocentric and 3D data provide spatial structure.
- Robot interaction data grounds action.
- Failure data maps the boundary.
- Simulation and self-play provide counterfactuals.
- Task-centric state-transition data defines the optimization target.
The final goal is not for robots to act like people. The goal is for robots to understand the current world state, the desired goal state, the physical interventions available, and the path that best fits their own embodiment.
Ego data may be embodied AI's ImageNet moment: large, natural, generic, and powerful enough to trigger the first wave of breakthroughs. That makes action along the ego-data path not only reasonable, but necessary. Higher-level embodied intelligence may still require more intervention-centered data in the future, or in parallel. The right strategy is therefore not to wait for a perfect task-centric substrate, but to scale ego data now while building the state-transition and robot-interaction layers that can make it more physically grounded.
Cite This Post
@misc{he2026egoDataProxy,
author = {He, Chaoyue},
title = {Ego Data: Today's Feasible Path, Not the Final Substrate},
year = {2026},
month = jun,
howpublished = {\url{https://chaoyue0307.github.io/thoughts/ego-data-scalable-proxy.html}},
note = {Research note}
}
References
- Grauman, K., Westbury, A., Byrne, E., et al. (2022). Ego4D: Around the World in 3,000 Hours of Egocentric Video. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Grauman, K., Westbury, A., Torresani, L., Kitani, K., Malik, J., et al. (2024). Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Sartor, S., & Thompson, N. (2024). Neural Scaling Laws in Robotics. arXiv:2405.14005.
- Kareer, S., Patel, D., Punamiya, R., et al. (2024). EgoMimic: Scaling Imitation Learning via Egocentric Video. arXiv:2410.24221.
- Zhu, L. Y., Kuppili, P., Punamiya, R., et al. (2025). EMMA: Scaling Mobile Manipulation via Egocentric Human Data. arXiv:2509.04443.
- Zheng, R., Niu, D., Xie, Y., et al. (2026). EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data. arXiv:2602.16710.
- Open X-Embodiment Collaboration, O'Neill, A., Rehman, A., Gupta, A., et al. (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. arXiv:2310.08864.
- Khazatsky, A., Pertsch, K., Nair, S., et al. (2024). DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. arXiv:2403.12945.
- Black, K., Brown, N., Driess, D., et al. (2025). π0: A Vision-Language-Action Flow Model for General Robot Control. Robotics: Science and Systems (RSS).