A research-development lab for understanding synchronized egocentric multimodal data, defining embodied-AI tasks, and testing small baselines before omni-model fine-tuning.
Project overview and contributions.
The page is organized like a compact research project: motivation and scope, dataset sample, task suite, method, baselines, research directions, interactive walkthroughs, and resources for continuing the work.
From one public episode to an extensible embodied-AI task lab.
Xperience-10M is much larger than the public sample. This project focuses on the sample available now, turns it into clear task contracts and baseline artifacts, and keeps the same data contract ready for held-out multi-episode training when more episodes are staged.
- 1,161 aligned windows from one public sample episode
- 12 task contracts with minimal and neural heads
- Four research-direction maps and extension probes
The next model-quality stage is a held-out episode pilot over the selected multi-episode relay, with no train/test episode leakage and a completed omni-model evaluation report.
Interactive research roadmap
Use this as the front door for the project: it links the 12 tasks, four research tracks, current sample evidence, and the multi-episode Qwen3-Omni scale-up path.
Multimodal episode pipeline
One Xperience-10M public sample episode is converted into aligned windows and a documented feature contract.
Task suite and baseline heads
Every core task has a minimal baseline and a compact PyTorch MLP head over the same windows, splits, and labels.
Dataset source alignment
The public description is aligned to the official gated Xperience-10M dataset card, including modalities, scale, access, and current project coverage.
Public research artifacts
Metrics, figures, walkthrough data, baseline weights, and project metadata are packaged across GitHub, GitHub Pages, and Hugging Face.
Omni-model scale-up path
Full-dataset access is granted, and a 128-episode relay is in progress with chunked parallel transfer and overlapping batch prefetch. Full training results will follow once staging, held-out splits, training, and evaluation are complete.
Data governance
Raw MP4/HDF5/RRD files, private gated Xperience-10M data, and full Qwen weights are excluded from the public repo and HF mirrors.
Research roadmap.
The project path is staged from the current public-sample task lab to multi-episode data staging, held-out Qwen3-Omni fine-tuning, robustness runs, and larger foundation/world-model extensions.
Public-Sample Task Lab
One public episode is converted into aligned windows, task contracts, minimal baselines, neural heads, walkthroughs, and figures.
Multi-Episode Data Staging
Stage official gated episodes while preserving episode-level separation and recording missing-view coverage.
Qwen3-Omni LoRA Pilot
Train lightweight adapters on staged selected episodes and evaluate on held-out episodes with committed predictions, metrics, and run reports.
Foundation-Model Selection Matrix
Keep Qwen3-Omni as the first trainable held-out pilot, add Cosmos 3 for world modeling, and stage policy candidates after action targets are explicit.
64-128 Episode Robustness Run
Test whether pilot conclusions survive broader sessions, missing modalities, and stronger ablations.
Cosmos 3 and Policy-Model Extensions
Extend toward future-window prediction, action-conditioned world modeling, synthetic-data tests, policy-style next action, and affordance reasoning.
Evaluation protocol is explicit.
The protocol is generated from committed metric artifacts so readers can see the exact data unit, split, task targets, leakage controls, and current limitations before comparing scores.
Data unit
One 20-frame aligned window from the public sample episode, stride 5 frames, 1,161 windows total, represented by 8,546 synchronized multimodal dimensions.
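The window count follows from the frame count, window length, and stride. A minimal sketch of that arithmetic, assuming windows start at frame 0 and a trailing partial window is dropped (the function name is hypothetical, not the project's builder):

```python
def window_starts(num_frames: int, window: int, stride: int) -> list[int]:
    """Start indices of all full windows; a trailing partial window is dropped (assumption)."""
    if num_frames < window:
        return []
    return list(range(0, num_frames - window + 1, stride))


# Numbers taken from the text: 5,821 frames, 20-frame windows, stride 5.
starts = window_starts(num_frames=5821, window=20, stride=5)
print(len(starts))   # 1161
print(starts[-1])    # 5800, so the last window covers frames 5800..5819
```

Under these assumptions the stated 1,161 windows are reproduced exactly, and only the final frame of the episode is left outside any window.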
Split policy
Single-episode chronological 70/30 train/test split. This avoids random future-window mixing; cross-episode generalization is measured in the later multi-episode pilot.
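A chronological split can be sketched as a single cut point over the time-ordered window indices. This is an illustration only: the function name and the rounding rule (truncation) are assumptions, not the project's documented behavior.

```python
def chronological_split(num_windows: int, train_fraction: float = 0.7):
    """Earlier windows go to train, later windows to test; no shuffling."""
    cut = int(num_windows * train_fraction)  # truncation is an assumption
    train_idx = list(range(0, cut))
    test_idx = list(range(cut, num_windows))
    return train_idx, test_idx


train_idx, test_idx = chronological_split(1161)
print(len(train_idx), len(test_idx))  # 812 349

# Every training window precedes every test window in time.
assert max(train_idx) < min(test_idx)
```

Because the stride (5 frames) is shorter than the window (20 frames), the windows on either side of the cut still share frames. Whether the project drops those boundary windows is not stated here.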
protocol doc
Metric contract
All 12 tasks list input, target, primary metric, minimal baseline score, and neural MLP score from committed result files.
summary metrics
Leakage controls
Scalers fit on train windows only; future labels, target-side signals, caption/object labels, and contact labels stay on the target side unless explicitly queried.
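Fitting scalers on train windows only means the test windows never influence the normalization statistics. A minimal standard-library sketch of that idea, with hypothetical helper names and toy values (the project's own scaler may differ):

```python
from statistics import fmean, pstdev


def fit_scaler(train_rows):
    """Per-feature mean and standard deviation, computed from train rows only."""
    columns = list(zip(*train_rows))
    means = [fmean(col) for col in columns]
    # A constant feature has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Apply the train-derived statistics to any set of rows."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy data (illustrative only): three train windows, one test window, two features.
train = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
test = [[4.0, 12.0]]

means, stds = fit_scaler(train)              # the test rows are never seen here
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)
```

The test rows are transformed with the train statistics, so their scaled values can fall outside the train range. That is expected rather than a sign of leakage.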
builder script
Audio ablation
Audio and no-audio variants are evaluated across all 12 task contracts under the same chronological split.
audio summary
Foundation branch selection
Qwen3-Omni is the first trainable baseline, Cosmos 3 becomes the world-model branch, and policy models wait for explicit action targets.
backbone plan
Next evaluation stage
This public-sample run covers single-episode task development. Cross-episode generalization, audio-visual learning, world modeling, policy targets, and held-out Qwen3-Omni training move to the multi-episode stage after selected data is staged.
pilot status
Scale-up requirement
The Omni pilot requires selected staged episodes, held-out episode splits, no train/test episode leakage, training metadata, predictions, metrics, and a run report.
data status
Current experiments and next milestones.
The project shows the completed public-sample task suite, then lays out the data requirements for the Qwen3-Omni scale-up path.
Aligned Xperience-10M sample windows
5,821 frames become 1,161 synchronized 20-frame windows with an 8,546-dimensional representation.
12 minimal heads + 12 neural MLP heads
Every task has a minimal interpretable head and a matching neural MLP run over the same windows, splits, and task contract.
Audio contribution is measured task by task
Audio variants improve the primary metric on 6 of 12 task contracts in this single-episode setting.
Four research directions are mapped by evidence type
The Ropedia directions are labeled as direct, proxy, or diagnostic coverage, plus one coded extension probe per direction.
Foundation backbones are separated by role
Qwen3-Omni stays first for held-out LoRA; Cosmos 3 is the world-model branch; OpenVLA/openpi/GR00T are policy candidates after action-space conversion.
Qwen3-Omni pilot setup
The current Qwen3-Omni artifacts use one episode and 128 train windows. A 128-episode selected relay is in accelerated staging for held-out evaluation.
Multi-episode pilot status is explicit
The pilot status report separates setup artifacts, selected relay state, and completed held-out-episode metrics.
Prepared mirrors stay synchronized
The parity report compares critical JSON, figure, and validator files across the repo, HF Space bundle, artifact dataset bundle, and model bundle before upload.
Figures are indexed
The figure index records public visual assets, dimensions, SHA-256 hashes, source scripts, and the role each figure plays in the project narrative.
Brand assets are packaged consistently
The generated logo system is packaged into the website header, favicon, README/HF cards, Open Graph preview, and brand-asset manifest.
Public bundles stay lightweight
The bundle report covers required assets, raw-data exclusion, Python cache exclusion, heavy archive exclusion, accidental HF token strings, and public-card figure freshness across GitHub and the HF bundles.
Website references are validated
The site validator checks local links, anchors, JSON bundles, and referenced image dimensions before publishing.
Release checks are collected
The release-check manifest collects the automated validators and the live post-publish checks used to keep the release current.
Official dataset card is aligned
The source-alignment note mirrors the public Hugging Face dataset-card facts, sample-card facts, and API metadata: gated access, sample license/tooling, modality coverage, episode layout, intended uses, and current project coverage.
Research reading path.
A newcomer should be able to move from the dataset sample to the task design, model baselines, current limitations, and scale-up plan without reading every file first.
Understand the current scope
Start with the project status, evidence contract, artifact index, scale-up readiness note, release package report, and website reference report. They separate implemented single-episode work from the prepared Qwen3-Omni scale-up.
Inspect one model input
Use the window table and feature manifest to see the aligned sample unit, modality sources, and leakage controls.
Compare minimal vs neural heads
Every task has a small interpretable baseline and a matching neural MLP head over the same feature contract and chronological split.
Check the scale-up gate
The multi-episode Qwen3-Omni path is prepared. The selected 128-episode result will be added after staging, preprocessing, training, and held-out evaluation pass.
Aligned with the official dataset card.
The official Xperience-10M card describes a gated, large-scale 4D egocentric multimodal dataset. This project records that full upstream scope while focusing the implemented artifacts on one public sample episode.
Official scale
About 10M experience units and 10,000 hours, with RGB video, audio, depth, camera pose/SLAM, hand/body mocap, IMU, captions, metadata, and calibration.
HF file-size display
The live Hugging Face page/API currently shows 31.9 TB hosted. This is recorded separately from the card's statement of about 1 PB of full-scale storage.
source JSON
HF access path
The source dataset is manually gated for approved non-commercial use, with an external agreement step noted by the public HF metadata.
official HF dataset
API listing snapshot
HF API metadata observed 803 session folders and 12,103 episode folders with annotation.hdf5. This snapshot supports planning; it is not a local data inventory.
Public sample card
The sample repo lists a cc-by-nc-4.0 license, the HOMIE Toolkit for videos/annotations, and Rerun 0.29.0 for .rrd visualization.
Source notes
The source notes summarize full-dataset facts, public sample-card facts, API-listing notes, and project coverage across the repo, website, and HF cards.
alignment report
Episode layout
Expected folders contain six MP4 streams and annotation.hdf5; visualization.rrd is treated as a viewer artifact and excluded from training downloads.
Current project subset
One public sample episode, 5,821 frames, 1,161 aligned windows, 8,546-dimensional task inputs, and no raw-data redistribution.
modality atlas
Covered now
Action/subtask labels, next-action prediction, temporal diagnostics, hand trajectory, contact, object relevance, caption grounding, retrieval, reconstruction, and misalignment.
summary metrics
Responsible use
The official card notes limited diversity and showcase/production quality. This project excludes identity, surveillance, biometric, sensitive-attribute, and safety-critical uses.
use notes
Later milestones
Full audio-visual learning, caption generation, depth-pixel prediction, SLAM estimation, neural rendering, policy learning, cross-episode generalization, and held-out Qwen3-Omni evaluation.
data status
Ropedia Xperience-10M 12-task suite.
The task map connects synchronized multimodal windows to 12 research task heads, then the modality atlas shows the sample streams used to build those contracts.
Readable modality atlas.
Each Xperience-10M stream gets a large thumbnail, a plain sample-content line, and the exact current-baseline use. These are small derived images only; no raw MP4, HDF5, or RRD data is redistributed.
Video
6 synchronized camera MP4 streams
RGB/fisheye/stereo frame statistics
Audio
Audio stream embedded in MP4
Acoustic signal
Depth
Depth map + confidence channel
Spatial geometry signal
Pose / SLAM
Trajectory + sparse SLAM map
Position + orientation features
Motion Capture
Body + hand joint tracks
3D mocap feature statistics
Inertial
Accelerometer + gyroscope
Wearable motion statistics
Language
Object tags + action captions
Task labels + semantic targets
The atlas redistributes only small derived thumbnails and metadata. Raw MP4, HDF5, and RRD files remain excluded from this repo and the Hugging Face mirrors.
From raw episode to research artifacts.
Every script works from one data contract: aligned multimodal windows, explicit labels, cached feature extraction, and a manifest that makes omitted modalities visible.
What this project enables
It demonstrates the full development loop: reading Xperience-10M sample data, aligning modalities, converting them into model-ready windows, defining meaningful tasks, producing metrics, and packaging artifacts for continued research.
What still needs more data
General embodied-intelligence model quality requires many episodes and held-out episode splits; the public sample is the development harness for that next stage.
What the current results actually say.
A generated takeaways layer reads the committed metrics, summarizes useful research signals, and identifies what still needs held-out episodes.
One episode becomes a benchmark contract
The public sample is converted into 5,821 frames, 1,161 aligned 20-frame windows, and an 8,546-dimensional representation for repeatable task evaluation.
Chronological split exposes class shift
All-feature action reaches 0.9829 macro-F1 on its local split, while the 12-task chronological action head is 0.0500 macro-F1 with four unseen later action labels.
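Macro-F1 averages per-class F1 with equal weight, so action labels that first appear after the chronological cut each contribute a zero. A minimal sketch with toy labels (the label names, predictions, and the choice to average over the union of true and predicted labels are all assumptions):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy case: the head was trained on "pour" and "stir" only, but the later
# test windows also contain "grind" and "press", which it can never predict.
y_true = ["pour", "pour", "grind", "press"]
y_pred = ["pour", "pour", "pour", "stir"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                  # 0.5
print(macro_f1(y_true, y_pred))  # about 0.2, since three of four classes score zero
```

In this toy case half the windows are classified correctly, yet macro-F1 is far lower. The same mechanism is one plausible reading of why the chronological action score falls well below the local-split score.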
takeaways
Neural heads help dynamics
Hand MPJPE improves from 0.8647 to 0.1079; temporal-order F1 rises from 0.5400 to 0.8520; misalignment F1 rises from 0.5052 to 0.7153.
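MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and target joint positions, so lower is better. A minimal sketch with toy coordinates follows. The units and normalization behind the reported values are not stated here, so the numbers below are illustrative only.

```python
import math


def mpjpe(pred, target):
    """Mean Euclidean distance over all frames and joints.

    pred and target are nested as frames -> joints -> (x, y, z).
    """
    distances = [
        math.dist(p_joint, t_joint)
        for p_frame, t_frame in zip(pred, target)
        for p_joint, t_joint in zip(p_frame, t_frame)
    ]
    return sum(distances) / len(distances)


# One frame with two joints (toy values).
target = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 0.1), (1.0, 0.3, 0.4)]]

print(round(mpjpe(pred, target), 3))  # 0.3, the mean of joint errors 0.1 and 0.5
```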
metrics
Retrieval and reconstruction remain open
Ridge/cosine retrieval remains stronger than the neural projection here, and cross-modal feature reconstruction still has a negative R², meaning it does worse than predicting the target mean.
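A negative R² is possible because the score compares the model's squared error with that of a constant mean predictor. A minimal sketch for a single target dimension (how the project aggregates across feature dimensions is not stated here):

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot for one target dimension."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]
print(r2_score(y_true, [1.0, 2.0, 3.0]))  # 1.0  (perfect reconstruction)
print(r2_score(y_true, [2.0, 2.0, 2.0]))  # 0.0  (always predicts the mean)
print(r2_score(y_true, [3.0, 2.0, 1.0]))  # -3.0 (worse than the mean)
```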
retrieval metrics
Scale means held-out episodes
The next credible model-quality unit is a held-out multi-episode pilot across different sessions, not more adjacent windows from one sample.
scale-up status
Small baselines, no hidden machinery.
Motion-only and current all-feature classifiers use lightweight heads so the comparison stays readable on a laptop and easy to inspect. The neural run keeps the same features and splits, then swaps in PyTorch MLP heads.
Motion-only action
0.9688
Current all-feature action
0.9829
Motion-only subtask
0.9528
Current all-feature subtask
0.9173
Neural MLP heads, same task contracts.
The neural baseline uses small PyTorch MLP classifiers/regressors on the same 8,546-dimensional windows, chronological splits, and leakage filters. This isolates the value of a nonlinear head before moving to heavier Qwen/Omni experiments.
Neural hand forecast
0.1079
Neural temporal order
0.8520
Neural misalignment
0.7153
Neural cross-modal retrieval
0.1300
The 12 tasks organized into four research directions.
Each task is mapped as direct, proxy, or diagnostic evidence for the Ropedia research tracks. The mapping uses two current baselines: minimal interpretable heads and neural MLP heads over the same feature contract.
A. Human Modeling & Motion Understanding
Direct evidence comes from hand trajectory forecasting and contact prediction; action and object relevance are supporting proxies.
B. 3D/4D Reconstruction & Neural Rendering
Cross-modal retrieval, modality reconstruction, and misalignment detection check reconstruction prerequisites, not full geometry.
C. Egocentric Vision & Interaction
Action, subtask, transition, next-action, object, caption, order, and alignment tasks directly stress egocentric understanding.
D. Scene Reconstruction & World Modeling
Current probes cover task state, object relevance, retrieval, reconstruction, temporal order, and alignment, but not yet a persistent map.
Baseline 1: minimal heads
Softmax, logistic, ridge, and retrieval heads keep every input/output contract readable. They are the first sanity check for whether a task is well-posed.
Baseline 2: neural MLP heads
Small PyTorch MLP classifiers/regressors reuse the same features and splits. They test nonlinear gains before heavier Omni fine-tuning.
Four extra probes make the directions actionable.
These are new data-backed extension tasks computed from the same single-episode feature tensor. They add one concrete input, process, output, and metric for each research direction, while keeping the single-episode limitation explicit.
Body and Hand Motion Intensity
Case: classify fast reach/pour windows as high motion and steady holding windows as low motion.
Input: non-mocap video, depth, pose, IMU, SLAM, calibration, and language features.
Output: high_motion or low_motion.
Multi-View Consistency Retrieval
Case: retrieve the synchronized stereo-left window from a fisheye-camera query.
Input: fisheye_cam0 video features against stereo_left candidate features.
Output: ranked synchronized view candidates.
Action Phase Progress Estimation
Case: estimate whether a Pour coffee window is near the start, middle, or end of its action segment.
Input: non-caption multimodal features.
Output: 0-to-1 progress inside the current action.
Short-Horizon Ego-Motion Forecasting
Case: predict how the camera translation changes over the next 20 frames.
Input: current sensors excluding camera translation and captions.
Output: future camera-translation delta vector.
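The Multi-View Consistency Retrieval probe above returns ranked view candidates. One way to sketch that ranking is cosine similarity between a query feature and each candidate, scored with recall@k. The vectors below are toy values, and recall@k is an assumed metric, not necessarily the one in the committed results.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(queries, candidates, k=1):
    """Fraction of queries whose synchronized candidate (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(queries):
        ranked = sorted(
            range(len(candidates)),
            key=lambda j: cosine(query, candidates[j]),
            reverse=True,
        )
        hits += int(i in ranked[:k])
    return hits / len(queries)


# Toy features: query i (e.g. a fisheye window) should retrieve candidate i
# (e.g. the stereo-left window captured at the same time).
queries = [[1.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
candidates = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.7]]

print(recall_at_k(queries, candidates, k=1))  # 2 of 3 queries hit at rank 1
print(recall_at_k(queries, candidates, k=2))  # 1.0
```

Within one episode, temporally adjacent windows look alike, so near-duplicates can crowd the top ranks. That is one reason to read single-episode retrieval scores as diagnostics.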
What changed
The four research directions now have coded extension probes, prediction/rank CSVs, JSON metrics, a Markdown summary, and a website chart generated from real sample-window features.
What still needs scale
A full research result still needs many Xperience-10M episodes, held-out episode splits, stronger encoders, and direction-specific models such as body priors, renderers, or persistent scene graphs.
The 12 tasks share four head families.
The diagram separates the shared episode-window representation from the task-specific heads, so the task contracts stay readable before scaling to larger models.
Interactive task walkthrough.
Each task uses a common research name and a concrete case study, then opens into the input, middle modules, output, modality evidence, metric, and current limitation.
Input: inspect the 20-frame multimodal window before choosing the target.
Action Recognition
In the coffee-making sample, a pouring window maps to the current action label.
Metric: macro-F1. Minimal 0.0500; neural MLP 0.0148.
Current limitation: single-episode chronological split.
Task cards and metrics.
The 12 task cards use readable research names, representative modality thumbnails, explicit input-process-output contracts, and verified minimal versus neural scores from the committed result files.
Every model input has a source.
The goal is transparency, not hidden complexity: every input group maps back to a source modality and a manifest entry.
Diagnostics separate memorization from signal.
The charts make the main lesson visible: within-episode supervised labels are easy under some splits, while retrieval, grounding, forecasting, and alignment remain the useful probes.
Open the single-episode explorer to inspect window-level labels, predictions, modality statistics, object labels, and diagnostic scores. The audio ablation report is available at audio_ablation_summary.json.
Research artifacts for the next experiments.
Metrics, predictions, manifests, lightweight model weights, and derived window artifacts are organized so the project can be inspected, extended, and scaled before rerunning the full pipeline. Raw Xperience-10M data and Qwen weights are not redistributed.
From one episode to task heads
Start with the files that define the sample windows, modality inputs, task contracts, metrics, walkthroughs, and research-direction mapping.
Task-suite report
One JSON file with every task definition, split detail, feature dimension, and minimal/neural metric.
Windows table
Window start/end frames and aligned action/subtask labels for the public sample episode.
windows.csv
Feature manifest
Technical source map for the current modality inputs used by the task suite.
feature_manifest.json
Neural MLP task results
Per-task PyTorch MLP metrics, predictions, histories, and checkpoints for the same 12 task contracts.
neural_mlp/
Four-direction taxonomy
Generated JSON, CSV, Markdown, and website data mapping all 12 tasks to the four research tracks.
research_directions/
Direction extension probes
Four coded probes, one per research direction, with minimal and neural metrics plus prediction/rank CSVs.
research_direction_extensions/
Task walkthroughs
Case studies for all 12 tasks, including input, middle process modules, output, metric, limitation, and task-player data.
task_walkthroughs/
Audio ablation and raw upgrade
All 72 task/variant rows comparing current audio, no audio, raw audio, replacement, and combined-input settings.
audio_ablation/
Single-episode explorer
Interactive window-level view of labels, predictions, modality statistics, object labels, and diagnostics.
single_episode_explorer.html
Cross-modal retrieval
The strongest self-supervised signal from the single episode.
metrics.json
Project map, mirrors, and runnable code
Use these files to navigate the whole project, open the published mirrors, or reproduce the public-sample pipeline.
Artifact guide
Human-readable map from project scope to data contract, task evidence, platform mirrors, and scale-up status.
Reproduction scripts
Training, visualization, taxonomy, walkthrough, validator, and omni-readiness scripts.
scripts/
Hugging Face Space
The dashboard packaged as a public static Space.
HF Space
Derived HF artifacts
Metrics, predictions, docs, and lightweight derived files without raw data redistribution.
dataset repo
HF baseline models
Minimal NumPy softmax, ridge baselines, and neural task-head model files.
model repo
HF collection
Space, artifacts, and model baselines grouped into one public project collection.
collection
Current all-feature action model
Classifier metrics, predictions, confusion matrix, and model weights.
metrics.json
Project packet
Machine-readable 90-second project path and scope summary.
project_packet.json
Prepared for multi-episode training
The multi-episode Qwen3-Omni path is documented and scripted. Full-pilot metrics come after selected data is staged and held-out evaluation passes.
Project scope
Connects implemented single-episode artifacts, setup-stage Omni work, current 128-episode relay, and later multi-episode milestones.
Foundation-model plan
Backbone selection matrix covering Qwen3-Omni, Cosmos 3, GR00T, OpenVLA/openpi, Gemini Robotics, Octo, and SmolVLA-style policy candidates.
foundation_model_plan.json
Multi-episode access status
Public data-access path, selected 128-episode relay plan, and data requirements.
MULTI_EPISODE_ACCESS_STATUS.md
Qwen3-Omni setup artifacts
Manifests, metadata, metrics, and progress logs from the current setup run.
episode_manifest.json
Multi-episode data requirement
The data status file defines what must be available before full pilot training and held-out metrics.
DATA_ACCESS_STATUS.md
Project files behind the research site
These validator outputs support the public research artifacts by keeping links, mirrors, figures, source wording, and package boundaries consistent.
Artifact index
Selective source-of-truth catalog with existence checks, sizes, and stable-file hashes.
artifact_index.json
Task-surface integrity
Checks that public task cards use readable research names, modality thumbnails, and the interactive walkthrough/player contract.
task_surface_integrity.json
Website integrity
Checks local links, anchors, JSON files, and referenced website image dimensions.
website_integrity.json
Rendered website check
Records the latest browser-level load, tab, walkthrough deep-link, control-click, and console-health check.
rendered_site_check.json
Release checks
One release map for automated validators and live post-publish checks.
quality_gates.json
Mirror parity
Prepared repo, HF Space, artifact dataset, and model bundle parity for critical data, figures, website HTML, and validator files.
mirror_parity.json
Live publication
Last public GitHub/HF URL verification after upload.
live_publication_status.json
Multi-episode pilot status
Separates setup artifacts, selected relay state, and completed held-out-episode results.
scope_claims_audit.json
Public project surface
Presents repo, website, and Hugging Face cards with consistent naming, links, tab semantics, and reader-facing copy.
public_surface_qa.json
Public bundle contents
Summarizes raw-data exclusion, cache exclusion, archive exclusion, token-string checks, and public figure references.
publication_audit.json
Qwen3-Omni pilot is in accelerated data staging.
Full Xperience-10M access is granted. The current plan selects 128 metadata-balanced episodes across 128 different session UUIDs, with chunked parallel transfer and overlapping batch prefetch in progress and no held-out metrics reported yet.
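The leakage rule for this stage is that a session never contributes episodes to more than one split. A minimal sketch of a session-level 96/16/16 assignment follows. The identifiers, seed, and function name are hypothetical, and the size-band balancing described in this section is not modeled.

```python
import random


def split_by_session(episodes, sizes=(96, 16, 16), seed=0):
    """Assign whole sessions to train/val/test so no session spans two splits.

    episodes is a list of (episode_id, session_id) pairs.
    """
    sessions = sorted({session for _, session in episodes})
    if sum(sizes) != len(sessions):
        raise ValueError("split sizes must cover every session exactly once")
    random.Random(seed).shuffle(sessions)

    names = ("train", "val", "test")
    bounds = [0, sizes[0], sizes[0] + sizes[1], sum(sizes)]
    owner = {
        session: names[i]
        for i in range(3)
        for session in sessions[bounds[i]:bounds[i + 1]]
    }
    splits = {name: [] for name in names}
    for episode_id, session in episodes:
        splits[owner[session]].append(episode_id)
    return splits


# Hypothetical identifiers: 128 episodes, each from its own session.
episodes = [(f"episode-{i:03d}", f"session-{i:03d}") for i in range(128)]
splits = split_by_session(episodes)
print({name: len(ids) for name, ids in splits.items()})
# {'train': 96, 'val': 16, 'test': 16}

# Leakage check: train and test must not share a session.
session_of = dict(episodes)
train_sessions = {session_of[e] for e in splits["train"]}
test_sessions = {session_of[e] for e in splits["test"]}
assert not train_sessions & test_sessions
```

Splitting at the session level rather than the window level is what makes the held-out score a cross-episode measurement.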
Selection
128 complete episodes selected from 128 unique top-level sessions, balanced across episode-size bands and split 96/16/16 for train/val/test.
Transfer
Download raw episodes only from official gated sources, exclude visualization.rrd, validate files, then stage them for training.
Current LoRA artifact
The current LoRA artifact uses the locally available sample data. The multi-episode result begins after selected data is staged, preprocessed, trained, and evaluated on held-out sessions.
Backbone branches
Qwen3-Omni is the immediate LoRA path; Cosmos 3 is the first world-model branch; GR00T/OpenVLA/openpi become policy branches after action targets are well-defined.
backbone plan
Reproduce the suite.
Raw Xperience-10M data is not redistributed here. The reproduction guide states the commands, expected outputs, exact-match reproduction record, and multi-episode requirements.
Reproducibility guide
Human-readable commands, expected artifacts, and current scope for the public single-episode pipeline.
REPRODUCIBILITY.md
Reproducibility matrix
Machine-readable command matrix covering sample download, baselines, 12 tasks, figures, and validation.
reproducibility_matrix.json
Exact-match reproduction record
The last metric rebuild reproduced the public-sample outputs from a fresh cache and matched the committed metrics.
reproducibility_audit.md
Website reference report
Local HTML references, anchors, JSON bundles, and image dimensions are validated before publishing.
website_integrity.json
Multi-episode pilot status
The Qwen3-Omni pilot is prepared at the code and selection-plan level; final metrics follow completed staging, preprocessing, training, and held-out evaluation.
DATA_ACCESS_STATUS.md
Minimal path: install the toolkit dependencies, download the official sample, run the 12-task suite with neural heads, regenerate visualizations, then run the artifact index and publication validator.
git clone https://github.com/Ropedia/HOMIE-toolkit.git
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r HOMIE-toolkit/requirements.txt huggingface_hub hf_xet
git clone https://github.com/ChaoYue0307/ropedia-xperience-10m-task-suite.git
pip install -r ropedia-xperience-10m-task-suite/requirements.txt
pip install torch
hf download ropedia-ai/xperience-10m-sample \
  --repo-type dataset \
  --local-dir data/sample/xperience-10m-sample
cd ropedia-xperience-10m-task-suite
export WORKSPACE=/path/to/workspace
python scripts/episode_task_suite.py --workspace "$WORKSPACE" --include-neural
python scripts/research_direction_extension_tasks.py
python scripts/task_walkthroughs.py
python scripts/generate_visualizations.py
python scripts/render_overview_figures.py
python scripts/render_task_suite_infographic.py
python scripts/export_modality_atlas_assets.py
python scripts/validate_website_integrity.py
python scripts/validate_scope_claims.py
python scripts/build_artifact_index.py
python scripts/validate_mirror_parity.py
python scripts/validate_publication_package.py