Ropedia Xperience-10M Task Suite

Project overview and evidence map.

The project presents the two evidence lines, the 20-task suite, the result table, and the public artifact mirrors separately, ahead of the deeper archive-level files.

Project identity

Shared package across GitHub, web, and Hugging Face

The mark identifies the Ropedia Xperience-10M Task Suite package across the GitHub repository, GitHub Pages dashboard, Hugging Face Space, artifact dataset, model mirrors, and social preview. Ropedia is credited here as the Xperience-10M data provider and releaser; this site is the task-suite, evaluation, and public-safe artifact layer built on top of that released dataset.

Logo mark Social card GitHub repo HF Space

new reader path

Five steps give the cleanest first read.

Start with the short project summary, then move through the evidence map, task suite, result matrix, and scale-up plan. Roadmap, glossary, research progress, and foundation paths stay available as secondary cards instead of competing in the first navigation row.

01 summary: Read 30-second summary. Understand what this project is, what is implemented, and what comes next.
02 evidence: Open evidence map. See how GitHub, GitHub Pages, HF Space, artifacts, and model repos connect.
03 tasks: Inspect 20-task suite. Check the task vocabulary, source links, radars, and task cards.
04 results: Check result matrix. Read direct/proxy status, method rows, raw metrics, and source artifacts.
05 scale-up: Read scale-up plan. Move to Qwen3-Omni, Cosmos3, roadmap, and foundation-model paths.

Roadmap

Shipped stages, active work, and scale-up targets after the result table.

Open roadmap

Glossary

Definitions for project terms, metrics, model families, public mirrors, and proxy scores.

Open glossary

Research progress

Current experiments, verified model evidence, public mirrors, and milestone boundaries.

Open progress

Foundation paths

Spatial-intelligence, world-model, and VLA paths placed after current results and scale-up status.

Open foundation paths

Two evidence-line map showing 1 sample episode, 128 selected episodes, and the combined 180 scored method-task records

Line	Data unit	Score statement	Best use	Read separately from	Start here
1 sample episode	One public Xperience-10M sample episode; 5,821 frames; 1,161 aligned 20-frame windows; 8,546-dimensional feature contract.	40/40 direct scores from Minimal and Neural MLP heads.	Raw sample inspection, file organization, task definitions, local reproduction, and controlled Minimal-vs-Neural baseline behavior.	The selected-128 comparison rows and broader held-out model behavior.	Raw browser 1-episode radar data result summary data evidence-line data
128 selected episodes	Selected held-out 96/16/16 split; 34,269 exported windows; public-safe metadata/raw-feature artifacts linked to official gated episode paths.	140/140 selected-128 scores: 134 direct + 6 compact-proxy.	Same-split comparison across metadata/raw baselines, Qwen3-Omni v6 LoRA, Cosmos3-Super, Cosmos3-Nano, and scale-up decisions.	Direct raw-target interpretation for the proxy-marked cells.	128-episode radar data source and feature index HF selected-128 windows result summary data evidence-line note

Evidence line	Method block	Methods	Score statement	Read as
1 sample episode	Task-head baselines	Minimal; Neural MLP	40/40 direct scores.	Task-lab reproducibility and simple-vs-neural behavior.
128 selected episodes	Aligned baseline heads	Metadata simple/NN; raw-feature simple/NN	80/80 scores: 74 direct + 6 compact-proxy.	Same-split metadata/raw-feature baseline comparison.
128 selected episodes	Qwen3-Omni series	Qwen3-Omni v6 LoRA	20/20 direct scores from verified selected-128 Qwen3-Omni LoRA and task-specific probes.	Trainable Qwen3-Omni diagnostic baseline on the selected-128 surface.
128 selected episodes	Cosmos3 series	Cosmos3-Super Reasoner; Cosmos3-Nano Future Window	40/40 direct scores from verified public-safe reasoner and future-window artifacts.	Cosmos3 reasoner and future-window diagnostics on the selected-128 surface.

Cosmos3-Super Forward-Dynamics LoRA is published as a separate fine-tuned adapter with weights/results; it is not counted as a 20-task matrix method row.

Qwen run	Purpose	Main change	Eval signal	Use now
v1	Prove the selected-128 LoRA/eval/package loop.	First verified 96/16/16 selected-episode Qwen3-Omni LoRA run.	448 eval; JSON 0.8750; contact 0.6451.	Lineage only.
v2	Make answers schema-checked.	Structured-JSON contract with full-8-GPU LoRA on the same split.	448 eval; JSON 0.9978; contact 0.7188.	Structured-output ablation.
v3	Separate prompt/eval effects from training.	Strict-label prompt/eval over the v2 adapter; no new adapter training.	448 eval; JSON 1.0000; contact 0.7210.	Prompt/eval ablation.
v4	Test longer structured-JSON LoRA training.	New four-epoch full-8-GPU adapter on the same selected split.	448 eval; JSON 1.0000; contact 0.7299.	Overfit/metric-tradeoff evidence.
v5	Move to denser multiscale evaluation.	Multiscale cap96 export with 4,032 held-out predictions.	4,032 eval; JSON 1.0000; contact 0.7865.	Pinned prior release; stronger on several non-contact metrics.
v6	Publish the current Qwen 20-task row.	Rank64/lr5e-5 multiscale LoRA plus verified task-specific probes.	4,032 eval; JSON 0.9990; contact 0.8177.	Current public 20-task Qwen3-Omni row.

Qwen v1-v6 are run-lineage labels inside the selected-128 evidence line, not project evidence lines. Use v6 for the public 20-task Qwen3-Omni row; keep v5 as the pinned prior multiscale comparator; read v1-v4 as pipeline-hardening and ablation evidence. Full details are available in the Qwen lineage data and Qwen lineage note.

Evidence map

Project links stay tied to the evidence trail.

Source code, visual explanations, derived artifacts, model outputs, and release checks are organized across GitHub, GitHub Pages, and Hugging Face.

This panel maps each published mirror to its evidence type. The project path gives the sequence, the glossary defines terminology, and the artifact library keeps file-level links.

GitHub repo: Source of truth for docs, scripts, generated data, validators, and commit history.
GitHub Pages: Visual dashboard for sample, tasks, results, directions, and resources.
HF Space: Hub-hosted dashboard and static app assets.
HF artifacts: Public-safe derived reports, metrics, website data, and result packages.
HF baselines: Compact baseline weights, figures, metrics, and mirrored task artifacts.
HF weights + results: Consolidated baseline weights, adapters, result summaries, analysis, and manifest.
HF collection: Grouped project surfaces, baseline repos, Qwen3-Omni v6, Cosmos3-Super, and Cosmos3-Nano repos.

Open evidence-map guide Open map data Open artifact index

Project brief

Two public evidence lines: inspectable task lab and comparison surface.

This project has two public evidence lines: Line 1 uses one public sample episode as an inspectable task lab; Line 2 uses selected-128 public-safe derived artifacts as a same-split comparison surface for baselines and model diagnostics. The public sample remains the raw-inspectable development harness; selected-128 is the model-comparison surface.

What this is

A research-development lab for understanding synchronized egocentric multimodal data, defining embodied-AI tasks, and testing small baselines before omni-model fine-tuning.

What is implemented

Line 1: 1,161 aligned windows from one public sample episode
Line 2: selected-128 same-split comparison surface
20 unified task contracts with minimal and neural evidence
180 scored method-task records with direct/proxy labels
Research-direction maps, extension probes, and scale-up paths

What comes next

The next model-quality stage is stronger action/subtask modeling on the same held-out split, using dense/multiscale windows before requiring more raw episodes.

Data understanding

Maps one public episode into raw-inspectable synchronized windows, then connects selected-128 public-safe artifacts to same-split comparison rows.

Task design

Defines embodied-AI inputs, process modules, outputs, metrics, and case-study walkthroughs instead of treating the sample as a generic classification file.

Evaluation discipline

Keeps chronological splits, predictions, confusion matrices, leakage notes, evidence-line boundaries, and direct/proxy score labels visible before broader model-quality reads.

Scale-up readiness

Connects the same data contract to selected-128 baselines, Qwen3-Omni/Cosmos3 diagnostics, a no-new-episode enhancement pack, policy/VLA candidates, and the later Xperience-native pretraining goal.

Open roadmap summary Read the brief Project summary Roadmap details Project path Current takeaways

Result snapshot

One comparison story, detailed charts in the task-suite section.

The overview keeps only the headline comparison so the same radar visuals are not repeated. Open the task-suite section for the enlarged unified radar, the 1-episode split radar, the 128-episode split radar, and the source-linked 180-record table.

1-episode line

Minimal and Neural MLP baselines cover 40/40 method-task records on the original public-sample episode.

128-episode line

Metadata, raw-feature, Qwen3-Omni, and Cosmos3 methods cover 140/140 rows with direct/proxy evidence notes preserved.

Best detailed view

The task-suite section holds the large readable radars, normalized-score notes, and the downloadable chart data.

Open detailed radars Inspect 180-result table Inspect radar source

featured

Research roadmap details

This section links the unified 20 tasks, four research tracks, current sample evidence, and the multi-episode Qwen3-Omni scale-up path.

tracks: 4; tasks: 20; setup: unified; roadmap phases: 5

Open roadmap summary Roadmap details

verified

Multimodal episode pipeline

One Xperience-10M public sample episode is converted into aligned windows and a documented feature contract.

frames: 5,821; windows: 1,161; features: 8,546

verified

Task suite and baseline heads

The unified task suite has minimal and neural baseline evidence across one 20-axis task surface with shared windows, splits, and label discipline.

tasks: 20; minimal heads: 20; neural heads: 20

verified

Dataset source alignment

The public description is aligned to the official gated Xperience-10M Hugging Face dataset card, including modalities, scale, access, and current project coverage. The source snapshot records 31.9 TB on the HF dataset surface, a full-scale storage statement of about 1 PB, 12,103 episode folders (upstream metadata, not a local data inventory), the public sample license cc-by-nc-4.0, HOMIE Toolkit and Rerun 0.29.0 as source tooling, and the official limited-diversity note. See the source-alignment record for exact provenance.

full dataset: HF gated; sample scope: 1 episode; raw data mirrored: no

verified

Public research artifacts

Metrics, figures, walkthroughs, baseline weights, Qwen3-Omni results, and Cosmos3 public-safe packages are staged across GitHub, GitHub Pages, and Hugging Face.

tasks: 20; baselines: minimal + neural; project route: tabs

verified diagnostic

Qwen3-Omni held-out pilot

The first selected-episode LoRA pilot is packaged with real held-out predictions and metrics. It proves the pipeline, while the weak scores make it a baseline for error analysis.

split: 96 / 16 / 16; test windows: 4,032; JSON validity: 99.90%

current plan

No-new-episode stress plan

Shows how the current selected split can be stressed without more episodes: dense windows, hierarchical labels, raw-feature shards, and `multiscale_20s10_40s20_80s40` as the next export target.

current windows: 3,808; multiscale estimate: 106,095; data reference: selected-128 enhancement plan

not redistributed

Data governance

Raw MP4/HDF5/RRD files, private gated Xperience-10M data, and full Qwen weights are excluded from the public repo and HF mirrors.

raw Xperience-10M: excluded; full Qwen weights: excluded; derived artifacts: included

Project brief Summary mirror Project status Status mirror Roadmap mirror Project packet Project materials

Aligned with the official dataset card.

The official Xperience-10M card describes a gated, large-scale 4D egocentric multimodal dataset. This project records that full upstream scope while focusing the implemented artifacts on one public sample episode. The source-alignment record keeps the following visible on the public site: 31.9 TB, about 1 PB, 12,103 episode folders (upstream metadata, not a local data inventory), cc-by-nc-4.0, HOMIE Toolkit, Rerun 0.29.0, and the limited-diversity note.

Official dataset

Xperience-10M is a gated large-scale egocentric multimodal dataset for embodied AI, robotics, spatial intelligence, and world modeling.

official HF dataset

Line 1 public sample

The one-episode line builds the inspectable 20-task lab. Use Line 2 for selected-128 held-out comparison.

sample dataset

Public sample inspection

One public sample episode is exposed through the raw browser and modality atlas: synchronized video, embedded audio, HDF5 annotation groups, depth, pose/SLAM, mocap, IMU, calibration, language-derived signals, and source links in one place.

open raw browser raw manifest stream metadata

Data explorer analysis

Compare the one-sample line, selected-128 feature exports, and Hugging Face gated full-dataset metadata with scope stats, split counts, modality breakdowns, and readable charts.

open analysis analysis report

Multi-episode pilot

The selected 128-episode Qwen3-Omni LoRA v6 diagnostic run is verified with 4,032 held-out test predictions and 99.90% JSON validity. Action/subtask metrics are still weak, so this remains a baseline for error analysis.

LoRA adapter v5/v6 comparison

Data boundary

Raw MP4, HDF5, and RRD files are streamed from the official public sample source when opened here; private gated data and full Qwen weights are not redistributed in this project.

data notice

Implemented subset

The public task lab uses one sample episode with 5,821 frames, 1,161 aligned windows, 8,546-dimensional task inputs, and 20 integrated task contracts.

task suite task contract data

Covered now

Action/subtask labels, next-action prediction, temporal diagnostics, hand trajectory, contact, object relevance, caption grounding, retrieval, reconstruction, misalignment, long-horizon forecasting, interaction text, action-object relation, sensor bridging, camera sync, and transition timing.

summary metrics

Responsible use

This project is for research exploration and excludes identity recognition, surveillance, biometric profiling, sensitive-attribute inference, and safety-critical deployment.

use notes

Later milestones

Full audio-visual learning, caption generation, depth-pixel prediction, SLAM estimation, neural rendering, policy learning, cross-episode generalization, held-out Qwen3-Omni evaluation, and future Xperience-native pretraining.

native pretraining

Data explorer analysis.

Read the data at three separate scales: the official public sample episode, the selected 128-episode feature surface, and authenticated file metadata for the Hugging Face-hosted gated full dataset. The charts are generated by a reproducible script and keep raw data boundaries explicit.

selected scope / public sample

Public sample episode: inspectable raw evidence.

The public sample is the only raw episode mirrored into the browser. Use it to understand file roles, synchronized streams, HDF5 contents, modality features, and how one episode becomes 20-frame task windows.

Raw contents: 6 MP4 streams, audio in MP4, annotation.hdf5, and visualization.rrd.
Training view: 1,161 aligned windows with 8,546 feature dimensions.
Best use: File inspection, task construction, and local reproducibility.
Boundary: One official episode; no diversity or scale claim.

1. Inspect one episode: Open media, HDF5, RRD, labels, and synchronized windows directly.

2. Score selected 128: Use public-safe derived rows for same-split baselines and model probes.

3. Read HF full scale: Use authenticated Hugging Face file metadata to understand upstream corpus size.

evidence map

What each data scale can answer.

The goal is to make the data understandable without mixing raw public sample inspection, selected-128 derived artifacts, and Hugging Face gated full-dataset metadata into one ambiguous inventory.

Data scale	Use case	What is counted	Boundary
Public sample	Raw file roles, synchronized streams, action/object timelines, and how one episode becomes task windows.	Files, frames, 20-frame windows, modality feature dimensions, action labels, object labels, HDF5 groups.	One official sample episode only; it is not a diversity or generalization claim.
Selected 128	Same-split baselines, model scoring surfaces, feature export density, and 128-episode task coverage.	Selected episodes, splits, size bands, feature rows, multiscale exports, raw20/proxy status.	Public-safe derived artifacts; raw selected-episode data remains governed by the official gated dataset.
Full HF dataset	Scale of the upstream dataset: sessions, episode folders, file counts, file types, and storage footprint.	Authenticated Hub file metadata: HDF5 count, MP4 count, RRD count, complete episode folders, total bytes.	Metadata-only. Dataset Viewer row/stat endpoints are gated and raw contents are not redistributed.

scale comparison / all scopes

One sample, selected 128, and the HF full dataset are different reading scopes.

Log-scale bars keep the large jumps visible without hiding the small public-sample line.

Scope ladder chart comparing public sample, selected 128 episodes, and the Hugging Face gated full dataset by episode count, window rows, and training bytes.

sample feature anatomy / public sample

Video dominates the 8,546-dimensional public-sample input.

Motion capture, depth, language, pose/SLAM, audio, and inertial signals remain explicit feature groups.

Bar chart of public sample feature dimensions by modality group.

sample timeline labels / public sample

Window labels show which actions occupy the sample.

This separates what the one-episode line can and cannot represent before the task cards.

Bar chart of the most frequent action labels by 20-frame window count.

selected-128 exports / selected 128

Feature rows grow by export density and window scale.

The selected-128 split is stable; sparse, Qwen v6 multiscale, and dense compact exports expose different row counts.

Stacked bar chart of selected-128 processed rows by train, validation, and test split.

HF full-dataset metadata / HF metadata

The Hugging Face full dataset is mostly repeated episode file structure.

The counts come from the gated ropedia-ai/xperience-10m repository on Hugging Face. Each complete episode usually has one annotation HDF5 and six MP4 streams; RRD viewer recordings are optional.

Log-scale file composition chart for the gated Hugging Face full-dataset metadata.

Source boundary: the full-corpus counts describe the Hugging Face-hosted dataset version and come from authenticated Hugging Face Hub file metadata for ropedia-ai/xperience-10m. They describe repository scale and file organization; they do not inspect private row contents or redistribute raw data.

Open raw sample browser Open 20-task suite Open structured analysis Open analysis report

Raw public sample browser.

Open each official Xperience-10M sample file from the project page. Video and audio use compact browser previews derived from the official MP4 files, with direct links beside them for the full raw Hugging Face sources. HDF5 and RRD files are shown with their role, size, organization, and direct source links.

fisheye_cam0.mp4

Fisheye camera 0 stream and the public sample audio source. This file can be played as video and as the embedded audio track.

video + audio

Playing a 12-second fast-start preview derived from the official raw MP4. Use the source link for the complete file.

Embedded audio preview from fisheye_cam0.mp4

Video features feed visual tasks; the embedded audio stream feeds audio ablation and acoustic feature blocks.

open playable preview open full raw source official sample dataset machine-readable manifest

Sample folder organization

The official public sample is one episode folder. The task suite reads the HDF5 annotations and six synchronized MP4 streams, then writes 20-frame windows with a 5-frame stride.

xperience-10m-sample/
  annotation.hdf5
  fisheye_cam0.mp4
  fisheye_cam1.mp4
  fisheye_cam2.mp4
  fisheye_cam3.mp4
  stereo_left.mp4
  stereo_right.mp4
  visualization.rrd
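The folder layout above can be checked mechanically. The sketch below counts files by extension and flags whether a folder looks complete, assuming the rule stated elsewhere on this page (one annotation HDF5 and six MP4 streams, with the RRD recording optional). The function name `summarize_episode` and the returned keys are illustrative, not part of the project's scripts.

```python
from collections import Counter
from pathlib import PurePosixPath


def summarize_episode(filenames):
    """Count files by extension and flag whether the folder looks complete."""
    counts = Counter(PurePosixPath(name).suffix.lower() for name in filenames)
    return {
        "hdf5": counts[".hdf5"],
        "mp4": counts[".mp4"],
        "rrd": counts[".rrd"],
        # Assumed completeness rule: exactly one HDF5 and six MP4 streams.
        # The RRD viewer recording is treated as optional.
        "complete": counts[".hdf5"] == 1 and counts[".mp4"] == 6,
    }


# File names copied from the sample folder listing.
sample_files = [
    "annotation.hdf5",
    "fisheye_cam0.mp4",
    "fisheye_cam1.mp4",
    "fisheye_cam2.mp4",
    "fisheye_cam3.mp4",
    "stereo_left.mp4",
    "stereo_right.mp4",
    "visualization.rrd",
]

summary = summarize_episode(sample_files)
print(summary)
```

The same check works on any list of relative file names, so it could be applied to metadata listings without opening raw file contents.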

annotation.hdf5 group map

The raw HDF5 is a binary container, so the browser shows its organization rather than loading the whole file into memory.

calibration: Camera intrinsics/extrinsics and static alignment values.
caption: JSON text with actions, objects, interactions, segments, and global summary.
depth: Depth maps and confidence channels aligned to the episode timeline.
full_body_mocap: Full-body joint and contact signals for human motion modeling.
hand_mocap: Left and right hand joint trajectories used by forecast tasks.
imu: Accelerometer and gyroscope streams sampled above video rate.
metadata: Episode metadata, frame indexing, and source bookkeeping.
slam: Camera trajectory, pose, and sparse SLAM point-cloud information.
video: Video metadata and per-frame alignment information.

Stream-to-feature use

The source streams are summarized once here, next to the playable files and HDF5 map.

Video: Six synchronized MP4 streams feed RGB, fisheye, stereo, and frame-statistic task inputs.
Audio: The embedded fisheye_cam0 audio feeds acoustic feature blocks and audio ablations.
Depth: Depth maps and confidence channels provide geometry signals for spatial and reconstruction probes.
Pose / SLAM: Trajectory, camera pose, and sparse map values become position and orientation features.
Motion capture: Body and hand joint tracks support motion, contact, hand forecast, and policy-style targets.
Inertial: Accelerometer and gyroscope streams become wearable-motion statistics.
Language: Object tags and caption-derived labels become semantic targets; raw caption text remains governed by the official sample.

Small derived modality thumbnails remain in the modality atlas data; raw MP4, HDF5, and RRD files are not redistributed.

Xperience-10M 20-task suite.

Task map, radar comparisons, task cards, and the 180-result table are kept in one section. The map introduces the task axes, the score surfaces compare methods, and each task card lists input, process, output, metric, and all nine public method scores.

unified task index

All 20 tasks are one suite, not a separate tier.

Each axis below has a task card with nine raw method scores, normalized radar values, source artifacts, and matching method-row entries in the 180-result matrix.

01 Action Recognition: current action label
02 Procedure Step Recognition: current subtask label
03 Action Boundary Detection: temporal transition
04 Next-Action Prediction: near future action
05 Hand Trajectory Forecasting: future hand path
06 Contact State Prediction: hand/object contact
07 Object Relevance Prediction: task-relevant object
08 Language Grounding: caption target
09 Cross-Modal Retrieval: matched modality window
10 Cross-Modal Reconstruction: missing stream features
11 Temporal Order Verification: before/after order
12 Multimodal Sync Detection: aligned vs shifted streams
13 Long-Horizon Next-Action Forecasting: future action label
14 Long-Horizon Next-Subtask Forecasting: future subtask label
15 Interaction Text Prediction: natural interaction class
16 Action-Object Relation Prediction: action/object pair
17 Future Object-Set Forecasting: future object set
18 IMU-to-Hand Pose Reconstruction: inertial to hand pose
19 Camera-View Sync Retrieval: matching camera view
20 Time-to-Next-Transition Regression: seconds to boundary

Open full generated task-map figure

Large overview kept here for visual scanning after the compact index.

Infographic showing Xperience-10M task families with compact modality cards and visible thumbnails

Unified plus split radars

The unified radar keeps all nine methods in one comparison board, but groups them into small-multiple panels so each method family can be read directly. The split radars separate the 1-episode Minimal/NN baseline comparison from the 128-episode metadata/raw, Qwen3-Omni v6 LoRA, and Cosmos3-Super/Nano comparison.

Metric normalization

Higher-is-better metrics are normalized to 0-1; lower-is-better metrics are converted to best/value within the task. The SVG uses sqrt(normalized score) only for visual radius, while raw values, linear normalized scores, status reasons, sources, and compact proxy notes remain in the data mirrors.
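The normalization rule can be illustrated with a small sketch. The lower-is-better branch follows the stated best/value rule, and the radius uses the stated square root. The higher-is-better branch is an assumption: the page only says those metrics are "normalized to 0-1", so dividing by the best value within the task is one plausible reading, not a confirmed formula. Function names are illustrative, and values are assumed to be non-negative.

```python
import math


def normalize_scores(values, higher_is_better):
    """Return linear normalized scores in [0, 1] for one task's method values."""
    if higher_is_better:
        best = max(values)
        # Assumption: divide by the task's best value (not confirmed by the source).
        return [v / best if best > 0 else 0.0 for v in values]
    best = min(values)
    # Stated rule: best / value, so the lowest error maps to 1.0.
    return [1.0 if v == best else best / v for v in values]


def radar_radius(normalized_score):
    """Visual radius only; the linear score stays the reported value."""
    return math.sqrt(normalized_score)


# Hypothetical error values for three methods on one lower-is-better task.
errors = [2.0, 4.0, 8.0]
linear = normalize_scores(errors, higher_is_better=False)
radii = [radar_radius(s) for s in linear]
print(linear)  # [1.0, 0.5, 0.25]
print(radii)
```

The square root stretches small scores outward on the chart, which is why raw values and linear normalized scores should be read from the data mirrors rather than from the plotted radius.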

Score/proxy audit

The matrix has 180/180 scored method-task records: 174 direct scores and 6 compact-proxy scores. The audit records the source artifact, metric key, and proxy reason for each marked cell.
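The record counts quoted on this page can be cross-checked with simple arithmetic. The sketch below restates the method blocks from the evidence-line table (method counts and direct/proxy counts are copied from that table) and verifies that they add up to the 180-record matrix. The tuple layout is illustrative only.

```python
TASKS = 20

# (evidence line, method block, number of methods, direct scores, compact-proxy scores)
blocks = [
    ("1 sample episode", "Task-head baselines", 2, 40, 0),
    ("128 selected episodes", "Aligned baseline heads", 4, 74, 6),
    ("128 selected episodes", "Qwen3-Omni series", 1, 20, 0),
    ("128 selected episodes", "Cosmos3 series", 2, 40, 0),
]

# Every method should contribute one record per task.
for line, block, methods, direct, proxy in blocks:
    assert methods * TASKS == direct + proxy, block

total_methods = sum(b[2] for b in blocks)
total_direct = sum(b[3] for b in blocks)
total_proxy = sum(b[4] for b in blocks)
total_records = total_direct + total_proxy
line_128_records = sum(b[3] + b[4] for b in blocks if b[0] == "128 selected episodes")

print(total_methods, total_records, total_direct, total_proxy, line_128_records)
# 9 180 174 6 140
```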

1-Episode 20-Task Radar

Minimal and Neural MLP are both scored on all 20 public-sample task contracts in one enlarged panel without 128-episode methods competing for attention.

Single-episode 20-task radar comparing Minimal and Neural MLP across all 20 scored task axes

View chart source Inspect chart source

128-Episode 20-Task Radar

Seven aligned 128-episode methods cover all 20 axes across metadata/text, raw-feature, and foundation-model panels. Proxy axes stay labeled in the chart and source data.

128-episode grouped 20-task radar comparing raw-feature baselines, metadata baselines, Qwen3-Omni, and Cosmos3 series with explicit score counts

View chart source Inspect chart source Review score/proxy audit

Open full 9-method radar board

Use after the split radars when you need all methods in one grouped visual analysis board.

Unified grouped 20-task radar comparing Minimal, Neural MLP, 128-episode metadata/raw baselines, Qwen3-Omni, and Cosmos3 with task names, method details, 20-record counts, score counts, and proxy notes

View chart source Inspect chart source Read raw-score table

All 20 task cards in the same suite.

Each card uses its assigned icon and shows the task name, input sources, process, output target, metric, and compact nine-method score panel. Use the filters for scanning; the cards stay tied to the task map, radar axes, and 180-record matrix.

Assigned visual language for the 20 tasks.

The overall generated atlas keeps the icon family visible, while each task card below uses its own crisp assigned SVG for reliable loading and public mirrors.

Interactive task walkthrough.

Each task uses a common research name and a concrete case study, then opens into the input, middle modules, output, modality evidence, metric, and current limitation.

Representative sample modality for the selected task

Step 1 / 4 · Input

Action Recognition Egocentric Action Recognition

Input: inspect the 20-frame multimodal window before choosing the target.

supervised multiclass classifier

Action Recognition

In the coffee-making sample, a pouring window maps to the current action label.

Metric: macro-F1. Minimal 0.0500; neural MLP 0.0148.

Current limitation: single-episode chronological split.
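For readers unfamiliar with the metric, macro-F1 is the unweighted mean of per-class F1 scores, so rare action labels count as much as frequent ones. The sketch below is a generic implementation, not the project's scoring code. It assumes the class set is the union of true and predicted labels, and the example labels are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the union of true and predicted labels."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no hits scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical window labels, for illustration only.
y_true = ["pour", "pour", "stir", "grasp"]
y_pred = ["pour", "stir", "stir", "pour"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))  # 0.3889
```

Because every class is weighted equally, a head that misses most rare labels can score very low on macro-F1 even when its plain accuracy looks acceptable.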

Evaluation protocol is explicit.

The protocol is generated from committed metric artifacts and records the exact data unit, split, task targets, leakage controls, and current limitations before score comparison.

Data unit

One 20-frame aligned window from the public sample episode, stride 5 frames, 1,161 windows total, represented by 8,546 synchronized multimodal dimensions.

evaluation protocol

Split policy

Single-episode chronological 70/30 train/test split. This avoids random future-window mixing; cross-episode generalization is measured in the later multi-episode pilot.
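A chronological split can be sketched as a single cut point over time-ordered window indices. The example below uses the stated 1,161 windows and the 70/30 ratio. Truncating the cut with `int()` is an assumption, so the project's exact boundary may differ by a window.

```python
def chronological_split(n_windows, train_fraction=0.7):
    """Split time-ordered window indices at one cut point, without shuffling."""
    cut = int(n_windows * train_fraction)  # assumed rounding: truncate
    train = list(range(cut))
    test = list(range(cut, n_windows))
    return train, test


train_idx, test_idx = chronological_split(1161)
print(len(train_idx), len(test_idx))  # 812 349
```

Every test window starts after every train window, which is what prevents the random future-window mixing mentioned above. Windows near the cut can still share frames because the 20-frame windows overlap at a 5-frame stride.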

protocol document

Metric contract

All 20 tasks list input, target, primary metric, baseline score, and source artifact path in the unified suite file.

task contract data

Leakage controls

Scalers fit on train windows only; future labels, target-side signals, caption/object labels, and contact labels stay on the target side unless explicitly queried.
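The train-only scaler rule can be shown with a minimal standardization sketch: statistics are computed from train rows and then reused unchanged on test rows. This is a generic illustration in plain Python with hypothetical function names and toy values, not the project's builder script.

```python
from statistics import mean, pstdev


def fit_scaler(train_rows):
    """Compute per-column mean and standard deviation from train rows only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # Guard constant columns so scaling never divides by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_scaler(rows, means, stds):
    """Standardize any rows using statistics fitted elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy feature rows, for illustration only.
train_rows = [[1.0, 10.0], [3.0, 10.0]]
test_rows = [[5.0, 12.0]]

means, stds = fit_scaler(train_rows)            # test rows are never seen here
scaled_test = apply_scaler(test_rows, means, stds)
print(scaled_test)  # [[3.0, 2.0]]
```

Fitting on train and test together would let test-window statistics influence the training inputs, which is the leakage this control is meant to rule out.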

builder script

Audio ablation

Audio and no-audio variants are evaluated across the walkthrough-backed task contracts under the same chronological split.

audio summary

Foundation track selection

Qwen3-Omni is the first trainable baseline. Cosmos3 is the world-model track, with a camera-pose proxy forward-dynamics contract ready for trainer work. Policy models wait for robot-compatible action targets, and Xperience-native pretraining remains a later full-corpus goal.

backbone plan

Next evaluation stage

This public-sample run covers single-episode task development. The selected multi-episode Qwen3-Omni final diagnostic result is verified and meets the JSON-validity target; Cosmos3-Nano has a verified future-window compatibility package; and Cosmos3-Super has a verified base-weight JSON-task evaluation plus a fine-tuned forward-dynamics LoRA branch. The next stage is action/subtask error analysis, stronger held-out metrics, and policy-target conversion.

result comparison

Selected-128 next stressor

Before adding episodes, the suite should try `multiscale_20s10_40s20_80s40`, hierarchical action/subtask targets, label-normalized scoring, and compact raw-feature shards for unsupported tasks.

enhancement data

Public-safe scale-up gate

Future Omni, Cosmos, and policy tracks use the same episode split discipline, training metadata, held-out predictions, metrics, run report, and public-safe package gate.

scale-up status

From raw episode to research artifacts.

Every script works from one data contract: aligned multimodal windows, explicit labels, cached feature extraction, and a manifest that makes omitted modalities visible.

from data to score

One pipeline, five verifiable checkpoints.

This is the shortest way to understand the method section: start from official synchronized files, build aligned windows, create task targets, run method heads or model probes, then publish source-linked metrics.

Official streams

RGB/fisheye/stereo video, audio, HDF5 annotations, pose, depth, calibration, IMU, mocap, and Rerun visualization sources.

Aligned windows

20-frame windows with a 5-frame stride keep the input unit consistent across task heads and selected-128 exports.

Task targets

Each window maps to one of the 20 contracts: classification, forecasting, retrieval, reconstruction, synchronization, or regression.

Method rows

Minimal, Neural MLP, selected-128 metadata/raw baselines, Qwen3-Omni v6, Cosmos3-Super, and Cosmos3-Nano produce the public method rows.

Verified outputs

Metrics, predictions, matrices, radar data, validators, and mirror parity checks are published with direct/proxy status preserved.

Verified Xperience-10M multimodal pipeline diagram

Qwen3-Omni LoRA training flow

Raw valid episodes move through split validation, parallel export, video/audio/text formatting, sensor-bridge features, LoRA training, and sealed held-out evaluation.

What the figure represents

It documents the selected 128-episode final diagnostic result and the action/subtask improvement path for stronger held-out metrics.

Detailed Qwen3-Omni LoRA training pipeline from raw Xperience-10M episodes to adapter outputs, predictions, metrics, and reports

What this project enables

It demonstrates the full development loop: reading Xperience-10M sample data, aligning modalities, converting them into model-ready windows, defining meaningful tasks, producing metrics, and packaging artifacts for continued research.

What still needs more data

General embodied-intelligence model quality requires many episodes and held-out episode splits; the public sample is the development harness for that next stage.

Model inputs are traceable.

Each task reads an aligned window, not a black-box feature blob. The cards below show the path from public-sample streams to window rows and feature groups, then the chart shows the size of each input block.

1 / raw source

Start from official streams

Six synchronized MP4 camera streams, embedded audio, HDF5 annotations, depth, pose/SLAM, mocap, IMU, calibration, and language-derived fields remain tied to the public sample episode.

Open sample browser

2 / aligned window

Slice into the same unit

The task suite builds 20-frame windows with a 5-frame stride. Each row records episode id, start frame, end frame, center frame, timestamps, and split assignment.
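The windowing rule above can be sketched in a few lines. This is a minimal illustration, not the project's builder: the `WindowRow` fields, the inclusive `end_frame`, the `center_frame` convention, and the episode id `"sample_episode"` are assumptions, and timestamps and split assignment are left out. Only the 20-frame length, the 5-frame stride, and the 5,821-frame sample size come from this page.

```python
from dataclasses import dataclass

WINDOW_LEN = 20  # frames per window (stated on this page)
STRIDE = 5       # frames between consecutive window starts (stated on this page)


@dataclass(frozen=True)
class WindowRow:
    """Hypothetical row layout; the real manifest also stores timestamps and split."""
    episode_id: str
    start_frame: int
    end_frame: int     # assumption: inclusive index of the last frame
    center_frame: int  # assumption: start_frame + WINDOW_LEN // 2


def build_windows(episode_id: str, n_frames: int,
                  window_len: int = WINDOW_LEN, stride: int = STRIDE) -> list:
    """Slice one episode into fixed-length windows; a trailing partial window is dropped."""
    rows = []
    for start in range(0, n_frames - window_len + 1, stride):
        rows.append(WindowRow(episode_id, start,
                              start + window_len - 1,
                              start + window_len // 2))
    return rows


windows = build_windows("sample_episode", 5821)
print(len(windows))  # 1161
```

With 5,821 frames this gives (5821 - 20) // 5 + 1 = 1,161 windows, which matches the window count reported for the public sample.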

Open window manifest

3 / feature contract

Name every input block

The feature contract groups the 8,546 dimensions by modality and derived signal. Omitted or target-side fields stay visible so leakage checks can be audited.

Open feature contract

source modality

Where the signal comes from: video, audio, depth, pose/SLAM, mocap, IMU, calibration, or language-derived annotation fields.

feature block

The named column group used by Minimal baselines, Neural MLP heads, and public-sample task cards.

chart width

Bar length means number of input dimensions. It is not a model score and it is not a task ranking.

Feature block chart showing input dimension counts for hand joints, body joints, pose, IMU, depth, six video streams, audio, language text, SLAM point cloud, and calibration

Read this as provenance. A task card may say “20-frame multimodal window”; this section shows which source streams and derived feature groups make up that window before any task-specific label is attached.

Back to task cards

The baseline task heads share four head families.

The diagram separates the shared episode-window representation from the task-specific heads, so the task contracts stay readable before scaling to larger models.

Verified minimal and neural architecture diagram for Xperience-10M task heads

Results by evidence line.

Read results in this order: choose the line, open the matching radar, inspect the matrix row, then check proxy flags before interpreting totals.

Result interpretation and limitations: these results are diagnostic, not a final model-quality leaderboard.

This project has two public evidence lines. Line 1 uses one public sample episode as an inspectable task lab. Line 2 uses a selected-128 public-safe comparison surface for same-split baselines and model diagnostics. The current public results do not estimate full Xperience-10M corpus performance.

The 180-record matrix should be read task-by-task. Raw metric values are the citeable values. Radar-normalized values are visual summaries only and should not be used as leaderboard scores.

Direct and compact-proxy scores remain separate. Proxy-marked cells are bounded public-safe substitutes where a direct raw target is not publicly available; they should not be treated as direct raw-target evidence.

The selected-128 surface is a diagnostic comparison subset, not a distributionally representative sample of the full Xperience-10M corpus. Broader model-quality claims require larger episode coverage, locked held-out splits, seed variance, and episode-level confidence intervals.

Because 20-frame windows with stride 5 can create local temporal overlap, split construction matters. Window counts should be read as evaluation units, not statistically independent human-experience episodes.
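The overlap can be made concrete with a little arithmetic. The sketch below only assumes equal-length windows on a shared frame index; the function names are illustrative and it does not describe how the project actually builds its splits.

```python
def shared_frames(start_a: int, start_b: int, window_len: int = 20) -> int:
    """Number of frames two equal-length windows have in common."""
    return max(0, window_len - abs(start_a - start_b))


def disjoint_index_gap(window_len: int = 20, stride: int = 5) -> int:
    """Smallest window-index distance at which two windows share no frames."""
    return -(-window_len // stride)  # ceiling division


# Overlap between window 0 and windows 0..5 at stride 5.
overlaps = [shared_frames(0, 5 * k) for k in range(6)]
print(overlaps)              # [20, 15, 10, 5, 0, 0]
print(disjoint_index_gap())  # 4
```

Adjacent windows share 15 of 20 frames, and two windows become frame-disjoint only when their indices differ by at least 4. A split that cuts between neighboring windows therefore places near-duplicate frames on both sides of the boundary.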

High local baseline scores are within-surface sanity checks, not proof of general embodied-intelligence performance. The Qwen3-Omni and Cosmos rows are model-family diagnostics; weak action/subtask results are kept visible because they define the next error-analysis and model-improvement stage.

Interpretation note: read these scores as task-definition and diagnostic evidence, not final model-quality claims.

Some task metrics are intentionally hard and can be low, especially for action/subtask labels in the current selected-128 diagnostic runs. The current value of this work is the public-safe evaluation pipeline, traceable targets, split discipline, direct/proxy score labeling, and failure-analysis surface for future model improvement.

cite this Raw metric value

The score printed in each table cell is the value emitted by the runner or verified package. Use this value in text, tables, and comparisons.

visual only Normalized radar value

The radar converts mixed metrics to a 0-1 plotting scale. It helps pattern recognition, but it is not a replacement for the raw metric.
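One common way to put mixed metrics on a 0-1 plotting scale is a clipped min-max transform with a direction flag for lower-is-better metrics. The sketch below is an assumption for illustration: the page does not state the project's actual transform, and the `lo`/`hi` ranges used here are placeholders.

```python
def normalize_for_radar(value: float, lo: float, hi: float,
                        higher_is_better: bool = True) -> float:
    """Map a raw metric onto [0, 1] for plotting only.

    lo and hi are plotting bounds chosen per metric (placeholders here).
    Error-style metrics are flipped so that larger always plots as better.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    x = (value - lo) / (hi - lo)
    x = min(1.0, max(0.0, x))  # clip values outside the plotting bounds
    return x if higher_is_better else 1.0 - x


# Raw values taken from this page; the [0, 1] bounds are assumed.
f1_axis = normalize_for_radar(0.8520, 0.0, 1.0)
mpjpe_axis = normalize_for_radar(0.1079, 0.0, 1.0, higher_is_better=False)
```

Because the result depends on the chosen bounds and direction, the normalized number is a plotting aid only, which is why the raw metric remains the value to cite.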

status first Direct vs compact-proxy

Direct scores use the task target. Compact-proxy scores are bounded substitutes where a raw public target is unavailable, and stay marked before comparison.

Result citation ledger
Use case	Use this value	Why	Where it appears
Write a paper or README number	Raw metric value	This is emitted by the runner or verified result package and keeps the original metric scale.	180-result table cells, result JSON, method summaries.
Compare shapes across many tasks	Normalized radar value	This is a 0-1 plotting transform for visual comparison only; cite the raw metric beside it.	Unified radar and split radar charts.
Interpret a low-availability target	Compact-proxy score plus its audit note	The task target is represented by a bounded substitute, so the proxy label and reason travel with the value.	Proxy audit, matrix badges, selected-128 baseline rows.

01 Choose the line

Use 1 episode for the task lab. Use 128 episodes for the selected comparison surface.

02 Open the radar

Single-episode radar shows Minimal vs Neural MLP. The 128-episode radar shows metadata/raw baselines, Qwen3-Omni v6, Cosmos3-Super, and Cosmos3-Nano.

03 Inspect the matrix

Each score keeps method, task, metric key, source artifact, and status.

04 Check proxy cells

Six selected-128 scores are compact proxies and stay marked in the audit.
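The four steps above amount to filtering a table of score records by evidence line and status. A hypothetical record layout is sketched below; the field names, the artifact file names, and the second example record are illustrative assumptions, not the published schema.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class ScoreRecord:
    """Hypothetical shape of one method-task cell."""
    line: Literal["1-episode", "128-episode"]
    method: str
    task_id: int
    metric_key: str
    raw_value: float
    source_artifact: str
    status: Literal["direct", "compact-proxy"]


def split_by_status(records):
    """Keep direct and compact-proxy cells apart before any comparison."""
    direct = [r for r in records if r.status == "direct"]
    proxy = [r for r in records if r.status == "compact-proxy"]
    return direct, proxy


records = [
    # Value from this page: Neural MLP temporal-order F1 on the single episode.
    ScoreRecord("1-episode", "Neural MLP", 11, "f1", 0.8520,
                "example_single_episode.json", "direct"),
    # Purely illustrative placeholder, not a published result.
    ScoreRecord("128-episode", "example-baseline", 10, "r2", 0.0,
                "example_proxy_audit.json", "compact-proxy"),
]
direct, proxy = split_by_status(records)
```

Keeping `status` on every record means a proxy cell cannot be averaged into a direct total by accident.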

1 episode results

Task-lab evidence

Minimal and Neural MLP heads are both scored on all 20 public-sample task contracts. All 40 scores are direct task-target metrics.

best read as

A reproducible public task suite and baseline behavior check.

2 methods · 20 task axes · 40/40 scores

Open 1-episode radar

128 episode results

Scale-up evidence

Metadata/raw baselines, Qwen3-Omni v6 LoRA, Cosmos3-Super Reasoner, and Cosmos3-Nano Future Window use the aligned 128-episode surface. It has 134 direct scores plus 6 compact-proxy scores.

best read as

A same-split comparison table with explicit source and proxy status.

7 methods · 20 task axes · 140/140 scores

Open 128-episode radar

Line	Methods	Tasks	Scored records	Direct scores	Proxy scores	Machine-readable source
1 sample episode	2	20	40/40	40	0	single-episode radar data
128 selected episodes	7	20	140/140	134	6 (compact-proxy; each is source-linked with a stated reason)	128-episode radar data
Total public matrix	9	20	180/180	174	6	two-line result summary data

Line	Block	Methods	Records	Evidence type	Primary artifact
1 sample episode	Task-head baselines	Minimal; Neural MLP	40 direct	Direct target metrics on the public sample windows.	single-episode radar data
128 selected episodes	Aligned baseline heads	Metadata simple/NN; raw-feature simple/NN	74 direct + 6 compact-proxy	Processed-target metrics where available; proxy cells remain source-linked.	score/proxy audit
128 selected episodes	Qwen3-Omni series	Qwen3-Omni v6 LoRA	20 direct	Verified selected-128 LoRA and task-specific probe artifacts.	model comparison data
128 selected episodes	Cosmos3 series	Cosmos3-Super Reasoner; Cosmos3-Nano Future Window	40 direct	Verified reasoner and future-window public-safe artifacts; forward-dynamics LoRA is a separate adapter artifact outside the 20-task method rows.	model comparison data
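The counts in the two tables above can be cross-checked with simple arithmetic. The sketch below copies the numbers from the tables; reading "Metadata simple/NN; raw-feature simple/NN" as four method rows is an interpretation, and the variable names are illustrative.

```python
TASKS = 20

# (methods, direct scores, proxy scores) per evidence line, from the summary table.
lines = {
    "1 sample episode": (2, 40, 0),
    "128 selected episodes": (7, 134, 6),
}

# (methods, direct scores, proxy scores) per block on the 128-episode line.
blocks_128 = {
    "Aligned baseline heads": (4, 74, 6),  # assumes 4 method rows in this block
    "Qwen3-Omni series": (1, 20, 0),
    "Cosmos3 series": (2, 40, 0),
}


def rows_consistent(rows) -> bool:
    """True when methods * tasks equals direct + proxy for every row."""
    return all(m * TASKS == d + p for m, d, p in rows.values())


lines_ok = rows_consistent(lines)
blocks_ok = rows_consistent(blocks_128)

# Column sums of the blocks should reproduce the 128-episode line.
blocks_sum = tuple(sum(col) for col in zip(*blocks_128.values()))
blocks_match_line = blocks_sum == lines["128 selected episodes"]

total_methods = sum(m for m, _, _ in lines.values())
total_direct = sum(d for _, d, _ in lines.values())
total_proxy = sum(p for _, _, p in lines.values())
total_records = total_direct + total_proxy
print(total_methods, total_records, total_direct, total_proxy)  # 9 180 174 6
```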

180-result table

All methods x all 20 tasks, in one source-linked table.

Each cell shows the raw metric value to cite, the normalized radar value, the metric key, and a direct/proxy badge. The table is generated from the same structured matrix data used by the radar, so values stay aligned across GitHub, the website, and Hugging Face mirrors.

Matrix data Proxy audit Full table note

180 method-task records

9 method rows

20 task columns

174 direct scores

6 compact-proxy scores

direct task score · compact proxy score · raw value is citeable · normalized value is radar-only

Full 180-score audit table: use this after the citation ledger. It contains the method summary, every method-task cell, direct/proxy badges, raw metric values, normalized radar values, and source-aligned evidence notes.


Raw and normalized scores for 9 methods across 20 Xperience-10M tasks.

Best-practice reading rule: compare methods within the same evidence line first, then use the proxy badges before interpreting cross-method totals. Six compact-proxy cells are intentionally visible rather than blended into direct raw-target scores.

Small baselines, no hidden machinery.

Motion-only and all-feature classifiers use lightweight heads so the comparison stays readable on a laptop and easy to inspect. They now sit beside the result matrix instead of opening a separate results page.

Motion-only action

0.9688 macro-F1, 18 classes

Current all-feature action

0.9829 macro-F1, 8,546 dimensions

Motion-only subtask

0.9528 macro-F1, 14 classes

Current all-feature subtask

0.9173 macro-F1, chronological caveats

Neural MLP heads, same task contracts.

The neural baseline keeps the same windows, splits, and leakage filters, then swaps in small PyTorch MLP heads. Read it as a nonlinear-head check before heavier Qwen3/Cosmos model branches.

Neural hand forecast

0.1079 MPJPE, down from 0.8647 minimal

Neural temporal order

0.8520 F1, adjacent-window diagnostic

Neural misalignment

0.7153 F1, shifted motion/visual/audio pairs

Neural cross-modal retrieval

0.1300 MRR; ridge remains stronger here
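For readers unfamiliar with the retrieval score, mean reciprocal rank (MRR) averages 1/rank of the correct candidate over all queries. The sketch below uses the standard definition with made-up window ids. Scoring a missing candidate as 0 is an assumption, and the project's runner may differ in details such as tie handling.

```python
def mean_reciprocal_rank(rankings, correct_ids) -> float:
    """Average of 1/rank of the correct candidate; a miss counts as 0."""
    reciprocal_ranks = []
    for ranking, correct in zip(rankings, correct_ids):
        if correct in ranking:
            reciprocal_ranks.append(1.0 / (ranking.index(correct) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


# Three toy queries; the window ids are illustrative only.
rankings = [
    ["w3", "w1", "w2"],  # correct "w1" is at rank 2
    ["w2", "w1", "w3"],  # correct "w2" is at rank 1
    ["w1", "w3", "w2"],  # correct "w2" is at rank 3
]
mrr = mean_reciprocal_rank(rankings, ["w1", "w2", "w2"])  # (1/2 + 1 + 1/3) / 3
```

An MRR of 1.0 means the correct candidate is always ranked first, so higher is better.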

Minimal versus neural MLP episode task score chart

Diagnostics separate memorization from signal.

These charts keep the main lesson visible: within-episode labels can be easy under some splits, while retrieval, grounding, forecasting, and alignment remain the useful probes.

Measured audio delta chart across walkthrough-backed task contracts

Open the single-episode explorer to inspect window-level labels, predictions, modality statistics, object labels, and diagnostic scores. The audio ablation summary records the task-by-task audio contribution.

One episode becomes a benchmark contract

The public sample is converted into 5,821 frames, 1,161 aligned 20-frame windows, and an 8,546-dimensional representation for repeatable task evaluation.

research takeaways

Chronological split exposes class shift

All-feature action reaches 0.9829 macro-F1 on its local split, while the chronological action head in the core task suite is 0.0500 macro-F1 with four unseen later action labels.

takeaways
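A toy example shows why unseen later labels pull macro-F1 down so sharply. The sketch below is an unweighted macro-F1 over the union of true and predicted labels. The labels are made up, and the project's metric code may define the label set differently.

```python
def macro_f1(y_true, y_pred) -> float:
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A head trained only on label "A" meets later windows labeled "B" and "C".
y_true = ["A", "A", "B", "C"]
y_pred = ["A", "A", "A", "A"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)  # (4/6 + 0 + 0) / 3
```

Accuracy is 0.5 here, but each unseen class contributes an F1 of 0, so macro-F1 falls to about 0.22. The same mechanism applies when a chronological split places new action labels only in the later windows.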

Neural heads help dynamics

Hand MPJPE improves from 0.8647 to 0.1079; temporal-order F1 rises from 0.5400 to 0.8520; misalignment F1 rises from 0.5052 to 0.7153.

metrics
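MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and target joint positions, so lower is better. The sketch below follows that standard definition. The nested-list input layout and the toy coordinates are assumptions, and the page does not state the coordinate units or any normalization the project applies.

```python
import math


def mpjpe(pred, target) -> float:
    """Mean Euclidean distance per joint.

    pred and target are lists of frames; each frame is a list of (x, y, z) joints.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of frames")
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        if len(pred_frame) != len(target_frame):
            raise ValueError("frames must have the same number of joints")
        for p, t in zip(pred_frame, target_frame):
            total += math.dist(p, t)
            count += 1
    return total / count


# One frame, two joints, with errors of 0.3 and 0.4.
target = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 0.3), (1.0, 0.4, 0.0)]]
error = mpjpe(pred, target)  # (0.3 + 0.4) / 2
```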

Retrieval and reconstruction remain open

Ridge/cosine retrieval remains stronger than the neural projection here, and cross-modal feature reconstruction still has negative R2.

retrieval metrics

Scale means held-out episodes

The next credible model-quality unit is a held-out multi-episode pilot across different sessions, not more adjacent windows from one sample.

scale-up status

Four research directions over the same 20 tasks.

This is the 4-direction layer of the public structure. Each direction groups task evidence by research question; it does not create a separate task set. Each task is mapped as direct, proxy, or diagnostic evidence using the same minimal and Neural MLP baseline contracts.

relationship map 20 tasks / 4 directions / 3 pipelines / 1 unified model target.

The structure is fixed: the 20 tasks are scored contracts, the four directions group what those scores study, the three pipelines define training recipes, and the unified embodied model is the long-term integration target.

scored task layer One unified task suite.

Every metric, radar axis, and method row uses these same 20 task contracts.

4-direction layer Four research views over the same tasks.

Human motion, 3D/4D reconstruction, egocentric interaction, and world modeling are groupings over the same tasks.

3-pipeline layer Three ways to train larger models.

Spatial intelligence, human-video world models, and vision-language-action models reuse the same files with different input-output recipes.

unified target One embodied foundation model goal.

The pipeline outputs converge toward perception, 3D memory, language, action, and planning in one model family.

score question If it has a metric, it belongs to the 20-task layer.

Task cards, radars, and the 180-record table all use the same numbered task IDs.

research question If it explains what the evidence studies, it belongs to the 4-direction layer.

Directions A-D group the same 20 tasks. They are not extra tasks or a second tier.

training question If it describes model inputs and targets, it belongs to the 3-pipeline layer.

Spatial, world-model, and VLA pipelines are recipes for future scale-up and task-grounded model training.

synthesis question If it combines pipeline abilities, it belongs to the unified-model target.

This is the expandable integration goal, not an extra scored task axis in the 180-result matrix.

Labeled Ropedia Xperience-10M Task Suite relationship map showing 20 task contracts flowing into four named research directions, three foundation-model training pipelines, and one unified embodied foundation model target, with each layer open to expand.

20 tasks: Score axes used by every method row.
4 directions: Research questions for reading those scores.
3 pipelines: Training recipes for larger foundation models.
1 unified model: Future integration target, not a new score axis.

20 task contracts Every score axis is one of these tasks.

The generated image uses one node per task; this list gives the exact public task names.

01 Action Recognition
02 Procedure Step Recognition
03 Action Boundary Detection
04 Next-Action Prediction
05 Hand Trajectory Forecasting
06 Contact State Prediction
07 Object Relevance Prediction
08 Language Grounding
09 Cross-Modal Retrieval
10 Cross-Modal Reconstruction
11 Temporal Order Verification
12 Multimodal Synchronization Detection
13 Long-Horizon Next-Action Forecasting
14 Long-Horizon Next-Subtask Forecasting
15 Interaction Text Prediction
16 Action-Object Relation Prediction
17 Future Object-Set Forecasting
18 IMU-to-Hand Pose Reconstruction
19 Camera-View Synchronization Retrieval
20 Time-to-Next-Transition Regression

4 research directions The same tasks are grouped by research question.

Tasks can support more than one direction. This layer answers what the scores study; it is not a separate benchmark or task tier.

A Human Modeling & Motion Understanding
B 3D/4D Reconstruction & Neural Rendering
C Egocentric Vision & Interaction
D Scene Reconstruction & World Modeling

3 foundation pipelines The training tracks reuse the task and modality contracts.

These are scale-up recipes for model training. They answer how the same files can be turned into larger model inputs and targets.

1 Spatial intelligence models
2 Human-video world models
3 Vision-language-action models

1 unified target The pipelines converge into one embodied model goal.

This target integrates perception, 3D memory, language, action, and planning. It is a research direction for scale-up, not a tenth method or twenty-first task.

P Perception and multimodal input
M 3D memory and scene state
L Language-grounded reasoning
A Action and planning

scored contracts 20 tasks

Action, subtask, object, language, motion, synchronization, retrieval, and forecasting targets evaluated by each public method row.

research grouping 4 directions

Human Modeling & Motion Understanding; 3D/4D Reconstruction & Neural Rendering; Egocentric Vision & Interaction; Scene Reconstruction & World Modeling.

training tracks 3 foundation pipelines

Spatial intelligence models; Human-video world models; Vision-language-action models.

synthesis target 1 unified embodied model

A long-term model target where perception, spatial memory, language, action, and planning are trained against the same evidence structure.

20 tasks are the score axes. -> 4 directions organize what the scores study. -> 3 pipelines describe how to train larger models. -> 1 unified model is the integration target.

partially implemented

A. Human Modeling & Motion Understanding

Direct evidence comes from hand trajectory forecasting and contact prediction; action and object relevance are supporting proxies.

2 direct · 2 proxy · 0 diagnostic

prerequisite evidence

B. 3D/4D Reconstruction & Neural Rendering

Cross-modal retrieval, modality reconstruction, and misalignment detection check reconstruction prerequisites, not full geometry.

0 direct · 2 proxy · 1 diagnostic

strongest implemented

C. Egocentric Vision & Interaction

Action, subtask, transition, next-action, object, caption, order, and alignment tasks directly stress egocentric understanding.

6 direct · 2 proxy · 3 diagnostic

world-model prerequisites

D. Scene Reconstruction & World Modeling

Current probes cover task state, object relevance, retrieval, reconstruction, temporal order, and alignment but no persistent map yet.

0 direct · 6 proxy · 3 diagnostic

Coverage of the original Xperience-10M tasks across four research directions

Baseline 1: minimal heads

Softmax, logistic, ridge, and retrieval heads keep every input/output contract readable. They are the first sanity check for whether a task is well-posed.

Baseline 2: neural MLP heads

Small PyTorch MLP classifiers/regressors reuse the same features and splits. They test nonlinear gains before heavier Omni fine-tuning.

Unified 20-task evidence and provenance.

All 20 tasks live in the same task table, task-card grid, radar, and 180-record result matrix. Historical result paths are retained only for exact provenance links.

Unified task artifact package

The public task package has one structured 20-task contract, per-task metrics, prediction/rank files, summaries, radar charts, and the 180-record method-task matrix.

Inspect task-contract source · Read 180-result table · View unified radar

One setup, one task surface

Every task uses the same 20-frame window unit, 5-frame stride, 8,546-dimensional feature manifest, chronological split discipline, and minimal/neural comparison pattern unless a task-specific leakage rule removes target-side features.

Historical provenance data and historical provenance chart remain available for exact source tracing.

Four Xperience-10M research-direction extension probes with minimal and neural metrics

A / motion

Body and Hand Motion Intensity

Case: classify fast reach/pour windows as high motion and steady holding windows as low motion.

Input: non-mocap video, depth, pose, IMU, SLAM, calibration, and language features.

Output: high_motion or low_motion.

0.7827 minimal macro-F1 · 0.7986 neural macro-F1

B / views

Multi-View Consistency Retrieval

Case: retrieve the synchronized stereo-left window from a fisheye-camera query.

Input: fisheye_cam0 video features against stereo_left candidate features.

Output: ranked synchronized view candidates.

0.5534 minimal MRR · 0.3469 neural MRR

C / phase

Action Phase Progress Estimation

Case: estimate whether a Pour coffee window is near the start, middle, or end of its action segment.

Input: non-caption multimodal features.

Output: 0-to-1 progress inside the current action.

0.3416 minimal MAE · 0.3038 neural MAE

D / world

Short-Horizon Ego-Motion Forecasting

Case: predict how the camera translation changes over the next 20 frames.

Input: current sensors excluding camera translation and captions.

Output: future camera-translation delta vector.

0.1989 minimal MAE · 0.0989 neural MAE

What changed

The four research directions now have coded extension probes, prediction/rank CSVs, structured metrics, project summaries, and charts generated from real sample-window features.

What still needs scale

A full research result still needs many Xperience-10M episodes, held-out episode splits, stronger encoders, and direction-specific models such as body priors, renderers, or persistent scene graphs.

Research roadmap.

The project path moves from the current public-sample task lab to the latest verified Qwen3-Omni diagnostic run, same-split 128-episode baseline alignment, a no-new-episode enhancement pack, action/subtask error analysis, robustness runs, world/policy tracks, and the future Xperience Embodied Foundation Model pretraining goal.

Roadmap stages

Current results, next gates, and scale-up targets.

The roadmap connects shipped public-sample work, the selected-128 pilot, active quality work, and later model-training goals.

Review stage cards Roadmap details Pipeline note 180-result table Roadmap note

1. Public-sample task lab: Task contracts, windows, baselines, neural heads, walkthroughs, and figures.
2. Selected-128 pilot: Same-split baselines, Qwen3-Omni diagnostics, and public-safe result packages.
3. Scale-up path: Quality analysis, robustness runs, world/policy tracks, and Xperience-native pretraining.

implemented

Public-Sample Task Lab

One public episode is converted into aligned windows, task contracts, minimal baselines, neural heads, walkthroughs, and figures.

Entry

Public Xperience-10M sample episode available.

Evidence

Status, protocol, takeaways, summary metrics, and episode-task outputs.

implemented

Multi-Episode Data Preparation

Prepare official gated episodes while preserving episode-level separation and recording missing-view coverage. The first selected split is available for Qwen3-Omni diagnostics.

Entry

Gated data access and enough storage for selected episodes.

Evidence

Selected-episode plan, data boundary, preparation notes, and verified package summary.

verified latest run

Qwen3-Omni LoRA Latest Diagnostic Branch

Train lightweight adapters on selected prepared episodes and evaluate on held-out episodes with committed predictions, metrics, and run reports.

Entry

Selected episodes prepared with no train/test episode leakage.

Evidence

Verified result summary, v5/v6 comparison, dataset manifest, training metadata, progress logs, metrics, and predictions.

verified companion result

128-Episode Same-Split Simple/NN Baselines

Align simple metadata/text baselines, raw-feature proxies, and neural MLP baselines to the same selected 96/16/16 split and the unified 20-task axes used by the public result matrix.

Entry

Derived Qwen JSONL export for the selected 96/16/16 split.

Evidence

Baseline alignment report, summary metrics, task metrics, and the 128-episode baseline runner.
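The point of a same-split comparison is that every method sees the same episode-level partition, so no episode contributes windows to more than one side. A minimal sketch of such a partition follows. Reading 96/16/16 as train/validation/test is an assumption, and the episode ids, seed, and shuffling rule are illustrative rather than the project's actual selection procedure.

```python
import random


def split_episodes(episode_ids, n_train=96, n_val=16, n_test=16, seed=0):
    """Partition episode ids into three disjoint groups of fixed size."""
    ids = sorted(episode_ids)  # sort first so the result is reproducible
    if len(ids) != n_train + n_val + n_test:
        raise ValueError("episode count does not match the requested split sizes")
    random.Random(seed).shuffle(ids)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


def split_of_window(window_episode_id, split):
    """A window inherits the split of its episode, never its own random draw."""
    for name, ids in split.items():
        if window_episode_id in ids:
            return name
    raise KeyError(window_episode_id)


episodes = [f"ep_{i:03d}" for i in range(128)]  # hypothetical ids
split = split_episodes(episodes)
sizes = {name: len(ids) for name, ids in split.items()}
```

Assigning windows through their episode id is what keeps overlapping windows from the same recording out of both the training and held-out sides.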

current no-new-episode plan

Selected-128 enhancement stage

Use the same selected split, estimate dense/multiscale window exports, define hierarchical action/subtask targets, and prioritize raw-feature shards for tasks that metadata baselines cannot cover.

Entry

Current 3,808-window selected 96/16/16 export and verified Qwen v4 metrics.

Evidence

Enhancement note, structured enhancement data, dense-window CSV, and the enhancement builder script.

active next step

Action/Subtask Error-Analysis Pass

Keep the 96/16/16 split, tighten JSON decoding or target formatting, and analyze action/subtask failures before presenting stronger model-quality numbers.

Entry

The final diagnostic package is verified, meets strict JSON validity, and exposes weak action/subtask quality.

Evidence

Updated quality-target report, error-analysis tables, held-out metrics, and public-safe package.

current

Foundation-Model Selection Matrix

Keep Qwen3-Omni as the first trainable held-out pilot, use Cosmos 3 for world modeling and forward-dynamics trainer development, and stage policy candidates after robot-compatible action targets are explicit.

Entry

Completed 128-episode preparation or a smaller 3-8 episode preprocessing dry run.

Evidence

Foundation model plan, source links, model-specific entry conditions, and evaluation additions.

partially implemented

64-128 Episode Robustness Run

Test whether pilot conclusions survive broader sessions, missing modalities, and stronger ablations.

Entry

Selected multi-episode pilot trains and evaluates cleanly.

Evidence

Metrics by session, task, modality, ablation, and failure type.

planned

Cosmos 3 and Policy-Model Extensions

Extend toward future-window prediction, action-conditioned world modeling, synthetic-data tests, policy-style next action, and affordance reasoning.

Entry

Enough multi-episode data, compute budget, and model-specific action or world-state targets.

Evidence

Task-specific held-out evaluations, qualitative inspection, and updated model cards.

future

Xperience Embodied Foundation Model Pretraining

Pretrain an Xperience-native domain model over synchronized video, audio, depth, pose, mocap, IMU, and language after smaller scaling stages prove value.

Entry

Full-corpus access, PB-scale storage path, multi-node compute, and positive scaling evidence.

Evidence

Pretraining manifests, scaling curves, held-out evaluations, checkpoint inventory, model card, and data-boundary report.

Roadmap resources

Roadmap details and source records.

The roadmap links the concise stage view, the full track-and-task detail page, and structured data records used for the public release.

detail page

Focused roadmap detail page

The detail page keeps the four research directions, linked task heads, stage gates, current evidence, and next steps in one sequence without nested controls.

Open details Read roadmap note

training plan

Foundation model paths

Spatial intelligence, human-video world modeling, and VLA training use different inputs and outputs while sharing the Xperience-10M sample structure.

Pipeline data Pipeline note

scale-up state

Selected-128 and data-boundary evidence

Use these notes to understand which selected-episode artifacts are public-safe, which raw files remain gated, and how the 128-episode comparison is staged.

Project status Data status

future model

Xperience-native pretraining plan

The long-range goal is an Xperience-native embodied foundation model after smaller selected-episode stages prove value and infrastructure is ready.

Pretraining plan Four directions

Structured roadmap data

These records keep roadmap stages, foundation-model plans, and selected-128 enhancement data aligned across the public release.

Roadmap stage mirror Roadmap detail data Foundation-plan mirror Three-pipeline mirror Development-direction mirror Selected-128 enhancement mirror

Current experiments and next milestones.

The project shows the completed public-sample task suite and the first verified multi-episode Qwen3-Omni diagnostic pilot, then lays out the next quality-improvement and model-extension steps.

verified

Sample windows and 20-task results

5,821 frames become 1,161 synchronized 20-frame windows with an 8,546-dimensional representation, 20 task contracts, and the current 180-record public result matrix.

task results feature inputs result matrix 20-task suite

verified

Audio and modality evidence

Audio variants improve the primary metric on 6 walkthrough-backed task contracts, while the feature manifest and raw browser keep each modality source inspectable.

audio summary delta chart audio findings raw browser

verified

Research directions and foundation pipelines

The Ropedia directions are labeled as direct, proxy, or diagnostic coverage and connected to spatial intelligence, human-video world-model, and vision-language-action training paths.

direction map extension probes pipeline data directions

current plan

Foundation backbones are separated by role

Current implemented/diagnostic tracks are Qwen3-Omni v6 and Cosmos3-Super/Nano. Candidate future policy tracks are OpenVLA/openpi/GR00T-style models after robot-compatible action conversion. Long-term, this points toward an Xperience-native embodied foundation model.

foundation model plan plan doc pretraining plan

verified diagnostic

Qwen3-Omni and Cosmos3 model evidence

The selected 96/16/16 split has a verified Qwen3-Omni v6 package with 4,032 held-out predictions and 99.90% JSON validity. Cosmos3-Nano, Cosmos3-Super Reasoner, and Cosmos3-Super Forward-Dynamics LoRA remain separate public-safe diagnostic branches.

result comparison pilot result verified package LoRA adapter
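A JSON validity rate of this kind can be measured by attempting to parse each model output. The sketch below is one plausible definition, not the project's validator: it counts an output as valid only if it parses and is a JSON object, and the example strings are made up.

```python
import json


def json_validity_rate(outputs) -> float:
    """Fraction of outputs that parse as a JSON object."""
    if not outputs:
        raise ValueError("outputs must not be empty")
    valid = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue  # truncated or free-text output
        if isinstance(parsed, dict):  # assumption: predictions must be objects
            valid += 1
    return valid / len(outputs)


examples = [
    '{"action": "pour"}',  # valid object
    '{"action": ',         # truncated generation
    "not json",            # free text
    '{"subtask": "x"}',    # valid object
]
rate = json_validity_rate(examples)
```

A validity rate only says the output can be decoded. Whether the decoded action or subtask label is correct is a separate question, which is why high JSON validity can coexist with weak action/subtask metrics.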

current plan

Selected-128 export and enhancement path

The current selected-128 line has a source/feature index and can be pushed through dense/multiscale windows without changing the held-out episode split; the recommended scenario is `multiscale_20s10_40s20_80s40`.

source/feature index enhancement data enhancement report dense scenarios

verified

Published mirrors and source anchors

The website, GitHub repo, HF Space, artifact dataset, baseline model repo, consolidated weights/results repo, collection, official dataset, and HOMIE toolkit point to the same research package.

HF Space artifact dataset collection official dataset HOMIE toolkit

verified

Visual system and figure index

The logo, raw-sample stream thumbnails, task-suite figure, unified radar, model-architecture figures, and LoRA training-flow figure are indexed and reused across published mirrors.

figure guide 20-task radar LoRA figure logo card

verified

Governance and reproduction

The project shares derived task artifacts, figures, reports, lightweight baselines, scripts, validators, and audit records. Raw videos, raw annotations, RRD visualizations, gated data, and full Qwen weights stay outside the repo.

data notice reproducibility latest rebuild publication audit

Research reading path.

A newcomer should be able to move from the dataset sample to the task design, model baselines, current limitations, and scale-up plan without reading every file first.

Understand the current scope

Start with the project brief, status, dataset context, task results, roadmap, and Qwen3-Omni scale-up notes. They separate implemented single-episode work from the prepared multi-episode stage.

brief current status dataset notes task metrics roadmap scale-up

Inspect one model input

Use the window table and feature manifest to see the aligned sample unit, modality sources, and leakage controls.

windows features

Compare minimal vs neural heads

Every task has a small interpretable baseline and a matching neural MLP head over the same feature contract and chronological split.

summary metrics neural heads

Check the scale-up gate

The multi-episode Qwen3-Omni path now has a final verified diagnostic package and public LoRA adapter. The native-pretraining plan shows how this can grow into a full-corpus research direction after action/subtask improvements and stronger task metrics.

scale-up status data access native pretraining project packet

Push the current 128 episodes harder

Use the no-new-episode enhancement pack before requesting more storage: it records dense-window estimates, `multiscale_20s10_40s20_80s40`, hierarchical labels, and raw-feature shard priorities.

enhancement data enhancement report builder script

Verified now: One public episode, 5,821 frames, 1,161 aligned windows, 8,546 dimensions, 20 unified task contracts, 20 Minimal and Neural MLP task-head rows, and 4 direction-extension probes.

Next, no-new-episode scale: The selected 128-episode suite should next use dense/multiscale windows, hierarchical labels, and raw-feature shards before adding more episodes.

Next, error analysis: The selected 128-episode Qwen3-Omni LoRA result has a final verified diagnostic package; JSON validity meets target, and the next pass should improve action/subtask quality.

Not redistributed: Raw videos, raw annotations, full Qwen weights, and private gated Xperience-10M data are not included in the public repo or HF bundles.

Glossary for project and field terms.

This glossary covers the overloaded project terms plus adjacent technical terms from embodied AI, egocentric multimodal data, spatial geometry, world models, VLA/policy learning, training, evaluation, and public artifact reading.

data Read multimodal sample terms

Episode, window, modality, fisheye, depth, IMU, calibration, and synchronization terms are grouped with the project data boundary terms.

space + time Decode geometry and world-model language

Camera pose, SLAM, point clouds, rollouts, forward dynamics, long-horizon forecasting, and temporal leakage are defined next to the relevant tasks.

robotics Connect VLA and policy terms

Action chunks, policies, imitation learning, behavior cloning, end effectors, dexterity, contact, and language grounding are included for extension work.

evidence Keep scores and published mirrors distinct

Direct/proxy scores, raw metrics, radar values, held-out evaluation, Qwen/Cosmos branches, adapters, and HF mirrors remain explicitly separated.

Term	Meaning here	Use it for	Do not confuse with
Egocentric video (multimodal sensing)	Video captured from a first-person or body-mounted viewpoint.	The sample streams are egocentric views of human interaction and are the visual basis for many tasks.	Third-person robot-camera footage.
Camera pose (spatial geometry)	The camera position and orientation at a time step.	Supports spatial-intelligence tasks, view synchronization, and geometry diagnostics.	The human body pose.
Forward dynamics (temporal and world models)	Predicting the next state from the current state and action/context.	The Cosmos3-Super LoRA branch uses a forward-dynamics-style diagnostic contract.	Reverse inference from result back to cause.
Vision-language-action model (robotics and VLA)	A model that maps visual context and language into actions.	The VLA direction is a future path after action targets are converted into robot-compatible chunks.	A vision-language model that only answers text.
Direct score (tasks and metrics)	A metric computed against the task target directly.	The preferred score type in the 20-task matrix.	Compact-proxy score.
Held-out evaluation (training and evaluation)	Testing on examples not used for training.	Required before promoting Qwen/Cosmos results to public evidence.	Training-set loss.

Open complete glossary note Inspect glossary source Evidence map Review score/proxy audit

Choose the artifact by the question you have.

For public links, score checks, reproducible runs, or the next model-training package, start with the route cards below. The full artifact library remains available after that first choice. Raw Xperience-10M data and Qwen weights are not redistributed.

if you want public links Open the right published mirror

Open GitHub, HF Space, artifact dataset, baseline models, or consolidated weights/results without guessing.

Open published mirrors

if you want to trust a score Check result parity

Check source alignment, mirror parity, live URL hashes, and result audits before citing a number.

Open checks
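
A mirror-parity check of the kind listed above can be sketched with content hashes. This is a minimal sketch, not the project's audit script: the mirror names, the byte strings, and the helpers `sha256_hex` and `parity_report` are placeholders introduced here, and fetching the files from each mirror is left out.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a file's bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def parity_report(copies):
    """copies: dict mapping a mirror name to the bytes of the same file.

    Mirrors are in parity when every copy hashes to the same digest.
    """
    digests = {name: sha256_hex(blob) for name, blob in copies.items()}
    return {"digests": digests, "in_parity": len(set(digests.values())) == 1}


# Placeholder contents standing in for one result file on two mirrors.
report = parity_report({
    "github": b"placeholder result file contents",
    "hf_dataset": b"placeholder result file contents",
})
print(report["in_parity"])
```

Comparing digests rather than file names or sizes means a number is cited only when the mirrored file is byte-identical to its source.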

if you want technical files Open GitHub and Hugging Face

Scripts, manifests, task contracts, model cards, and derived artifacts stay in the source and mirror repositories.

Open repositories

if you want next models Continue model work

Use Qwen3-Omni v6, Cosmos3-Super/Nano packages, the 128-episode feature index, and foundation-model plans for the next runs.

Open scale-up

Research artifacts

From one episode to task heads

Start with the files that define the sample windows, modality inputs, task contracts, metrics, walkthroughs, and research-direction mapping.

Task results

Every task definition, split detail, feature dimension, and minimal/neural metric in one project output.

task results

Windows table

Window start/end frames and aligned action/subtask labels for the public sample episode.

window table
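
The window start/end frames can be illustrated with a short sliding-window sketch. The 20-frame window length and the 5,821-frame, 1,161-window totals come from this page; the stride of 5 frames is an assumption chosen because it is consistent with those totals, and `window_bounds` is a hypothetical helper, not a project function.

```python
def window_bounds(n_frames, window=20, stride=5):
    """Return (start, end) frame pairs, end exclusive, for full windows only.

    stride=5 is an assumed value; the page states only the window length.
    """
    return [(s, s + window) for s in range(0, n_frames - window + 1, stride)]


bounds = window_bounds(5821)
print(len(bounds), bounds[0], bounds[-1])
```

Under these assumptions the sketch yields the same number of windows as the public sample; the window table remains the authoritative source for the actual boundaries.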

Feature inputs

Source map for the current modality inputs used by the task suite.

feature inputs

Neural MLP task results

Per-task PyTorch MLP metrics, predictions, histories, and checkpoints for the unified task contracts, with historical result-bundle paths retained for provenance.

neural MLP outputs

Four-direction taxonomy

Maps the walkthrough-backed task contracts to the four research tracks: human modeling, 3D/4D reconstruction, egocentric interaction, and world modeling.

research direction outputs

Direction extension probes

Four coded probes, one per research direction, with minimal and neural metrics plus prediction/rank CSVs.

extension probe outputs

Task walkthroughs

Case studies for the walkthrough-backed task contracts, covering input, intermediate processing modules, output, metric, limitation, and task-player data.

walkthrough outputs

Audio ablation and raw upgrade

All 72 task/variant rows comparing current audio, no audio, raw audio, replacement, and combined-input settings.

audio ablation outputs

Single-episode explorer

Interactive window-level view of labels, predictions, modality statistics, object labels, and diagnostics.

open explorer

Cross-modal retrieval

Cross-modal retrieval gives the strongest self-supervised signal from the single public episode.

retrieval metrics

Published mirrors

Project map, mirrors, and runnable code

Project files, published mirrors, and scripts for the public-sample pipeline.

complete generated inventory

Open scripts, result artifacts, website data, and HF bundle files directly.

The map is generated from the project folders instead of hand-curated cards. It keeps every script and result file indexed in place, marks local-only files when they are not tracked, and links to GitHub Pages, GitHub, or Hugging Face raw files when a public surface exists.

Generated sources: structured resource map and human-readable resource map. Rerun `python scripts/build_project_resource_map.py` after new result runs or script additions.

Evidence map

Single navigation view for GitHub, GitHub Pages, HF Space, artifact dataset, baseline model repo, Qwen3-Omni/Cosmos3 repos, and result lanes.

evidence map

Glossary

Definitions for project-specific terms plus broader embodied-AI, egocentric multimodal data, spatial geometry, world-model, VLA, training, evaluation, adapter, and public mirror terms.

glossary data full glossary note

Project resource map

Complete generated inventory for scripts, result artifacts, website JSON, visual assets, configs, notes, and prepared Hugging Face bundles.

resource-map data resource-map note

Artifact guide

Human-readable map from project scope to data contract, task evidence, platform mirrors, and scale-up status.

artifact guide

Reproduction scripts

Training, visualization, taxonomy, walkthrough, validator, and omni-readiness scripts.

scripts/

Hugging Face Space

The dashboard packaged as a public static Space.

HF Space

GitHub Package

Static dashboard container published to GitHub Container Registry for local browsing with Docker, without raw data or model weights.

GHCR package

Derived HF artifacts

Metrics, predictions, docs, and lightweight derived files without raw data redistribution.

artifact collection

HF baseline models

Minimal NumPy softmax, ridge baselines, and neural task-head model files.

model repo

HF weights + results

Consolidated public-safe baseline weights, Qwen3-Omni and Cosmos3 adapters/packages, verified results, analysis files, and manifest.

weights/results repo

HF collection

Space, artifacts, baseline models, Qwen3-Omni v6 LoRA, Cosmos3-Super, and Cosmos3-Nano repos grouped into one public project collection.

collection

Current all-feature action model

Classifier metrics, predictions, confusion matrix, and model weights.

model metrics

Project packet

Compact route from project scope to results after choosing a mirror.

project packet

Scale-up path

Verified diagnostic pilot

The multi-episode Qwen3-Omni path is documented, scripted, and verified as a validation-monitored held-out pilot. Stronger metrics are expected to come from the next pass of structured-output and error-analysis improvements.

Two-line model comparison

Groups Line 1 task-head baselines and Line 2 selected-128 methods: metadata/raw baselines, Qwen3-Omni v6 LoRA, Cosmos3-Nano Future Window, and Cosmos3-Super Reasoner.

result comparison

128-episode source + features

Maps every selected official Xperience-10M episode id to its gated source tree and the public-safe processed features: Qwen v6 multiscale windows, dense multiscale rows, and metadata matrices.

source/feature index

128-Episode Task Suite Enhancement Pack

Canonical no-new-episode plan for denser supervision: `multiscale_20s10_40s20_80s40`, hierarchical action/subtask labels, stronger scoring slices, and raw-feature shard priorities.

enhancement data
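
One way to read the identifier `multiscale_20s10_40s20_80s40` is as three window/stride pairs in seconds (20 s windows every 10 s, 40 s every 20 s, 80 s every 40 s). That reading is an assumption made here, not something this page states, and the helpers and the 120-second duration below are hypothetical. Under that reading, a dense-window estimate can be sketched as:

```python
import re


def parse_multiscale(name):
    """Read '<window>s<stride>' pairs out of the identifier.

    Treating the pairs as (window seconds, stride seconds) is an assumption.
    """
    return [(int(w), int(s)) for w, s in re.findall(r"(\d+)s(\d+)", name)]


def count_windows(duration_s, window_s, stride_s):
    """Number of full windows that fit in an episode of duration_s seconds."""
    if duration_s < window_s:
        return 0
    return (duration_s - window_s) // stride_s + 1


scales = parse_multiscale("multiscale_20s10_40s20_80s40")
# 120 seconds is a made-up episode length used only for illustration.
per_scale = [count_windows(120, w, s) for w, s in scales]
print(scales, per_scale, sum(per_scale))
```

The point of the estimate is that denser and multiscale windows multiply the number of supervised rows without adding any new episodes.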

Foundation-model plan

Separates current implemented/diagnostic tracks (Qwen3-Omni v6, Cosmos3-Super/Nano), candidate future policy tracks (OpenVLA/openpi/GR00T-style models after action conversion), and the long-term Xperience-native embodied foundation model goal.

foundation model plan

Multi-episode data access

Public data-access path, selected 128-episode pilot plan, and preparation requirements.

data access

Qwen3-Omni LoRA group

Separates the 1-episode sensor-adapter smoke test from Qwen runs v1-v6. v6 is the current 20-task matrix row, while v5 remains the pinned prior release.

Qwen v1-v6 lineage Qwen group

Cosmos3 groups

Shows the verified Nano future-window compatibility package, the Super base-weight Reasoner JSON-task evaluation, and the Super fine-tuned forward-dynamics LoRA artifact with separate loss metrics.

Cosmos groups

Scale-up requirement

Future runs need validation tracking, held-out predictions, quality-target reporting, and the same public-safe package gate.

training requirements

Xperience-native pretraining

Future plan for a domain-specific embodied foundation model trained from scratch over full-corpus video, audio, geometry, motion, inertial, and language streams.

pretraining plan

Qwen3-Omni diagnostic run is verified.

The selected pilot uses 128 source-balanced episodes across 128 different session UUIDs. The latest v6 held-out package is verified, and its weak metrics define the next structured-output and error-analysis pass.

Selection

128 complete episodes selected from 128 unique top-level sessions, balanced across episode-size bands and split 96/16/16 for train/val/test.

source/feature index
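
The 96/16/16 selection above can be sketched as a split that is balanced across episode-size bands. This is a minimal sketch under stated assumptions, not the project's selection script: the four equal-sized bands, the seed, the synthetic episode ids and sizes, and the name `split_by_band` are all introduced here for illustration.

```python
import random


def split_by_band(episodes, n_bands=4, counts=(96, 16, 16), seed=0):
    """episodes: list of (episode_id, size_bytes) tuples.

    Sorts by size, cuts into n_bands equal bands (an assumed banding rule),
    and gives each band the same train/val/test proportions.
    Episodes left over when the total is not divisible by n_bands are dropped.
    """
    rng = random.Random(seed)
    ordered = sorted(episodes, key=lambda e: e[1])
    band_size = len(ordered) // n_bands
    total = sum(counts)
    n_train = band_size * counts[0] // total
    n_val = band_size * counts[1] // total
    splits = {"train": [], "val": [], "test": []}
    for b in range(n_bands):
        band = ordered[b * band_size:(b + 1) * band_size]
        rng.shuffle(band)
        splits["train"] += [e[0] for e in band[:n_train]]
        splits["val"] += [e[0] for e in band[n_train:n_train + n_val]]
        splits["test"] += [e[0] for e in band[n_train + n_val:]]
    return splits


# Synthetic ids and sizes; real ids come from the source/feature index.
demo = [(f"episode_{i:03d}", 1000 + i) for i in range(128)]
splits = split_by_band(demo)
print({name: len(ids) for name, ids in splits.items()})
```

Since each selected episode comes from a different session, splitting by episode also keeps sessions disjoint across train, validation, and test.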

Transfer

Download raw episodes only from official gated sources, exclude visualization.rrd, validate files, then stage them for training.
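
The exclude-and-validate step can be sketched over a plain file listing. This is an illustration, not the project's transfer script: the file names `annotation.hdf5`, `fisheye_cam0.mp4`, and `visualization.rrd` appear on this page, but the required-file set, the example paths, and both helper names are assumptions.

```python
from pathlib import PurePosixPath

EXCLUDED_NAMES = {"visualization.rrd"}


def files_to_stage(listing):
    """Keep every path whose file name is not on the exclusion list."""
    return [p for p in listing if PurePosixPath(p).name not in EXCLUDED_NAMES]


def missing_required(listing, required=("annotation.hdf5",)):
    """Return required file names absent from the listing.

    The default required set is a hypothetical minimum, not a project rule.
    """
    names = {PurePosixPath(p).name for p in listing}
    return [r for r in required if r not in names]


# Hypothetical episode folder listing.
listing = [
    "episode_x/annotation.hdf5",
    "episode_x/fisheye_cam0.mp4",
    "episode_x/visualization.rrd",
]
print(files_to_stage(listing), missing_required(listing))
```

Running the validation before staging means an incomplete episode is caught before it reaches a training run.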

Current LoRA artifact

The current Qwen3-Omni LoRA artifact is the verified v6 selected 128-episode diagnostic adapter. The v5 row remains pinned as the prior release, and the 1-episode Qwen entry is only a sensor-adapter smoke test.

model groups

No-new-episode suite push

The next suite push does not need more episodes first: use `multiscale_20s10_40s20_80s40`, hierarchical action/subtask targets, and raw-feature shards while keeping the held-out split fixed.

enhancement data

Backbone tracks

Qwen3-Omni uses a separate LoRA model repo; Cosmos3-Nano remains a compatibility package; Cosmos3-Super now has a verified forward-dynamics LoRA artifact with weights in a dedicated model repo.

Cosmos3-Super weights

Native foundation model

The long-term goal is a full-corpus Xperience Embodied Foundation Model trained on synchronized perception, geometry, motion, inertial, audio, and language streams after smaller scaling stages validate the approach.

pretraining plan

Foundation-model scale-up paths.

These three paths translate Xperience-10M files into larger spatial, world-model, and vision-language-action training directions after the current task suite and selected-128 diagnostic results.

High-resolution direction slide

Spatial intelligence models

Train spatial-memory models from multiview RGB, egocentric video, depth, pose, calibration, object/contact cues, and language prompts; evaluate spatial QA, object permanence, counting, retrieval, and pose-aware consistency.

Sample input

Use windows.csv and shared_windows.npz to slice each 20-frame window, then join six MP4 RGB streams with annotation.hdf5 depth, camera pose, SLAM/calibration, object cues, contacts, and optional language questions.

Training output

Build targets such as camera-view match, object relevance, object-set memory, depth/pose reconstruction proxy, caption-grounded retrieval, and spatial QA answers derived from the same public annotation timeline.

Existing hooks

Tasks 8/10/12/14/16 20-task contract data window manifest

Structured pipeline mirror Open image

High-resolution direction slide

Human-video world models

Train future-prediction models from observed interaction windows to score next action, next subtask, future object set, contact transition, camera-motion delta, and latent future state, with Qwen-style probes and Cosmos-style dynamics kept separate.

Sample input

Take the current 20-frame observed window at time t from shared_windows.npz: RGB/audio/sensor summaries, hand/body motion, camera pose, current object/contact state, and current action/subtask context only.

Training output

Shift the same episode timeline forward to produce next-action, next-subtask, future object-set, contact-transition, time-to-transition, camera-motion delta, or latent/future-feature targets. Future labels stay out of the input.
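
The timeline shift can be sketched on a list of per-window labels. This is a minimal sketch with hypothetical helper names and made-up labels, not the project's target builder. The input for window t is only its own index; the target is read from a later window, so future labels never enter the input.

```python
def next_label_pairs(window_labels, horizon=1):
    """Pair each window index t with the label `horizon` windows later.

    The last `horizon` windows have no future label and are left out.
    """
    return [(t, window_labels[t + horizon])
            for t in range(len(window_labels) - horizon)]


def time_to_transition(window_labels):
    """For each window, count windows until the label next changes.

    Returns None where no later change exists in the episode.
    """
    out = [None] * len(window_labels)
    for t in range(len(window_labels) - 2, -1, -1):
        if window_labels[t + 1] != window_labels[t]:
            out[t] = 1
        elif out[t + 1] is not None:
            out[t] = out[t + 1] + 1
    return out


# Made-up action labels for five consecutive windows.
labels = ["pour", "pour", "stir", "stir", "stir"]
print(next_label_pairs(labels))
print(time_to_transition(labels))
```

Dropping windows without a future label, rather than padding them, keeps the target set free of invented values.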

Existing hooks

Tasks 4/13/14/17/20 model scores future probes

Track note Open image

High-resolution direction slide

Vision-language-action models

Train VLA or policy-compatible heads only after converting egocentric video, captions, hand/body motion, contacts, objects, and procedures into traceable action tokens, chunks, and object-conditioned action targets.

Sample input

Use egocentric/fisheye video windows, caption/object context from annotation.hdf5, hand/body mocap, contact state, and current subtask text as the observation-language side of each training pair.

Training output

For the one-sample suite, output action-token proxies: current/next action, object-conditioned action relation, contact state, interaction-text class, subtask transition, or hand-trajectory/action-chunk proxy. Robot action chunks need a later retargeting converter.

Existing hooks

Tasks 1/4/5/6/15/18 task walkthroughs track contract

Structured pipeline mirror Open image

Episode taxonomy and data engine

Build an episode atlas, category tags, balance report, and split builder across activities, objects, scenes, sessions, people, and missing modalities.

Open development plan

Standardized benchmark protocol

Version train/val/test manifests, task cards, leakage checks, metric scripts, and reference baselines so future model scores are comparable.

Open benchmark protocol note

Multimodal representation learning

Train contrastive and masked-prediction encoders over synchronized video, audio, depth, pose, mocap, IMU, and language windows.

Open representation-learning plan

Skill and procedure graphs

Mine action steps, transitions, preconditions, effects, and temporal graphs that connect egocentric perception to planning.

current task map

Human-object affordances

Add contact, reachable-object, tool-use, and next-affordance tasks using hands, mocap, objects, contacts, video, and language.

task walkthroughs

3D/4D scene and object memory

Fuse depth, pose/SLAM, multiview video, and object cues into persistent scene/object maps for spatial reasoning and object permanence.

model tracks

Quality and sync diagnostics

Track timestamp drift, missing streams, calibration consistency, corrupted files, and degraded-mode manifests before large training runs.

evidence contract

Policy and simulation transfer

Convert mocap, hand trajectories, contacts, and object states into action tokens, robot-compatible targets, and imitation-learning examples.

foundation plan

Project overview and evidence map.

Project links stay tied to the evidence trail.

Two public evidence lines: inspectable task lab and comparison surface.

One comparison story, detailed charts in the task-suite section.

Research roadmap details

Multimodal episode pipeline

Task suite and baseline heads

Dataset source alignment

Public research artifacts

Qwen3-Omni held-out pilot

No-new-episode stress plan

Data governance

Aligned with the official dataset card.

Official dataset

Line 1 public sample

Public sample inspection

Data explorer analysis

Multi-episode pilot

Data boundary

Implemented subset

Covered now

Responsible use

Later milestones

Data explorer analysis.

One inspectable episode.

Held-out comparison surface.

Full-corpus metadata.

Public sample episode: inspectable raw evidence.

What each data scale can answer.

One sample, selected 128, and the HF full dataset are different reading scopes.

Video dominates the 8,546-dimensional public-sample input.

Window labels show which actions occupy the sample.

Feature rows grow by export density and window scale.

The Hugging Face full dataset is mostly repeated episode file structure.

Raw public sample browser.

fisheye_cam0.mp4

Sample folder organization

annotation.hdf5 group map

Stream-to-feature use

Xperience-10M 20-task suite.

All 20 tasks are one suite, not a separate tier.

Unified plus split radars

Metric normalization

Score/proxy audit

1-Episode 20-Task Radar

128-Episode 20-Task Radar

All 20 task cards in the same suite.

Assigned visual language for the 20 tasks.

Interactive task walkthrough.

Action Recognition

Evaluation protocol is explicit.

Data unit

Split policy

Metric contract

Leakage controls

Audio ablation

Foundation track selection

Next evaluation stage

Selected-128 next stressor

Public-safe scale-up gate

From raw episode to research artifacts.

Qwen3-Omni LoRA training flow

What the figure represents

What this project enables

What still needs more data

Model inputs are traceable.

Start from official streams

Slice into the same unit

Name every input block

The baseline task heads share four head families.

Results by evidence line.

Task-lab evidence

Scale-up evidence

All methods x all 20 tasks, in one source-linked table.

Small baselines, no hidden machinery.

Motion-only action

Current all-feature action

Motion-only subtask

Current all-feature subtask

Neural MLP heads, same task contracts.