public sample episode / multimodal task lab

Ropedia Xperience-10M Research Task Lab.

This project uses the public Xperience-10M sample from Ropedia to explore embodied-AI task design, multimodal feature construction, lightweight baselines, and future Omni-model fine-tuning. It starts from the sample episode available now, then keeps the same data contracts ready for held-out multi-episode training when more Xperience-10M data is staged.

5,821 frames in sample episode
1,161 20-frame windows
8,378 current feature dimensions
12 + 12 + 4 core, neural, and extension probes

Current feature allocation (window vector)
mocap: 2,121
camera+imu: 126
depth: 980
video: 4,116
language: 896
static: 139
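
The block sizes listed above add up to the 8,378-d window vector. A minimal sketch of that check, assuming a plain dictionary stands in for the project's feature manifest; the names `FEATURE_BLOCKS` and `block_offsets` and the block order are illustrative assumptions, not taken from the repo:

```python
# Hypothetical stand-in for the feature manifest; sizes are the ones listed on this page.
FEATURE_BLOCKS = {
    "mocap": 2121,
    "camera+imu": 126,
    "depth": 980,
    "video": 4116,
    "language": 896,
    "static": 139,
}

def block_offsets(blocks):
    """Return {name: (start, end)} column slices for a concatenated window vector."""
    offsets, start = {}, 0
    for name, size in blocks.items():
        offsets[name] = (start, start + size)
        start += size
    return offsets

total_dim = sum(FEATURE_BLOCKS.values())
offsets = block_offsets(FEATURE_BLOCKS)
print(total_dim)         # 8378
print(offsets["video"])  # (3227, 7343) under the assumed block order
```

Keeping explicit offsets like this makes it easy to drop a target block from the input side when a task predicts that block.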

Project overview and contributions.

The page is organized like a compact research project: motivation and scope, dataset sample, task suite, method, baselines, research directions, interactive walkthroughs, and resources for continuing the work.

verified

Multimodal episode pipeline

One Xperience-10M public sample episode is converted into aligned windows and a documented feature contract.

frames: 5,821 · windows: 1,161 · features: 8,378
verified

Task suite and baseline heads

Every core task has a minimal baseline and a compact PyTorch MLP head over the same windows, splits, and labels.

core tasks: 12 · neural heads: 12 · extension probes: 4
verified

Dataset scope alignment

The public description is aligned to the official gated Xperience-10M dataset card, including modalities, scale, access, and unsupported claims.

full dataset: gated · sample scope: 1 episode · raw data mirrored: no
verified

Public research artifacts

Metrics, figures, walkthrough data, baseline weights, and project metadata are packaged across GitHub, GitHub Pages, and Hugging Face.

site integrity: pass · mirror parity: pass · live status: pass
data-gated

Omni-model scale-up path

The 32-episode LoRA path is prepared, but no model-quality claim is made until gated data access, held-out splits, training, and evaluation pass.

current claim: readiness only · target gate: 32 episodes · held-out eval: pending
not redistributed

Data governance

Raw MP4/HDF5/RRD files, private gated Xperience-10M data, and full Qwen weights are excluded from the public repo and HF mirrors.

raw Xperience-10M: excluded · full Qwen weights: excluded · derived artifacts: included

Evaluation protocol is explicit.

The protocol is generated from committed metric artifacts so readers can see the exact data unit, split, task targets, leakage controls, and unsupported interpretations before comparing scores.

Data unit

One 20-frame aligned window from the public sample episode, stride 5 frames, 1,161 windows total, represented by the current 8,378-d feature vector.

protocol JSON
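
The data unit above can be sketched as a sliding window over frame indices. This is a minimal illustration, assuming zero-based frame indices and that trailing frames which do not fill a whole window are dropped; that convention reproduces the 1,161 count, but the project's builder script may handle edges differently. The function name `window_starts` is hypothetical.

```python
def window_starts(num_frames, window=20, stride=5):
    """Start indices of every full window that fits inside the episode."""
    if num_frames < window:
        return []
    return list(range(0, num_frames - window + 1, stride))

starts = window_starts(5821)
print(len(starts))            # 1161
print(starts[0], starts[-1])  # 0 5800
```

Because the stride (5) is smaller than the window (20), neighbouring windows share frames, which matters for how the train/test boundary is read.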

Split policy

Single-episode chronological 70/30 train/test split. This avoids random future-window mixing, but it cannot prove cross-episode generalization.

protocol doc
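
A chronological split keeps every test window later in time than every train window. The sketch below is an assumption-laden illustration, not the project's code: the rounding rule (`int` truncation) and the optional `gap` parameter are hypothetical.

```python
def chronological_split(num_windows, train_fraction=0.7, gap=0):
    """Split window indices by time: early windows train, late windows test.

    `gap` optionally drops the last few train windows before the cut
    (an illustrative knob, not something the protocol is known to use).
    """
    cut = int(num_windows * train_fraction)
    train_idx = list(range(0, max(cut - gap, 0)))
    test_idx = list(range(cut, num_windows))
    return train_idx, test_idx

train_idx, test_idx = chronological_split(1161)
print(len(train_idx), len(test_idx))   # 812 349
print(max(train_idx) < min(test_idx))  # True
```

With 20-frame windows and a 5-frame stride, the last three train windows share frames with the first test window; passing `gap=3` in this sketch removes that overlap.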

Metric contract

All 12 tasks list input, target, primary metric, minimal baseline score, and neural MLP score from committed result files.

summary metrics

Leakage controls

Scalers are fit on train windows only. Future labels, target feature blocks, caption/object labels, and contact labels stay on the target side unless explicitly queried.

builder script
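
Fitting the scaler on train windows only means test statistics never influence the normalization. A minimal pure-Python sketch of that idea, with toy data; the helper names `fit_scaler` and `transform` are hypothetical, and the project's own implementation may differ:

```python
import math

def fit_scaler(train_rows):
    """Per-column mean and standard deviation computed from training rows only."""
    n = len(train_rows)
    dims = len(train_rows[0])
    means = [sum(r[j] for r in train_rows) / n for j in range(dims)]
    stds = []
    for j in range(dims):
        var = sum((r[j] - means[j]) ** 2 for r in train_rows) / n
        stds.append(math.sqrt(var) or 1.0)  # constant columns fall back to 1.0
    return means, stds

def transform(rows, means, stds):
    """Apply train-derived statistics to any split."""
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]

train = [[1.0, 10.0], [3.0, 10.0]]
test = [[5.0, 10.0]]
means, stds = fit_scaler(train)
print(transform(test, means, stds))  # [[3.0, 0.0]]
```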

Unsupported readings

No cross-episode generalization, no audio-visual learning claim, no pixel-depth or neural-rendering claim, and no real 32-episode Qwen3-Omni quality claim.

scope guard

Scale-up gate

A real Omni claim requires at least 32 valid episodes, held-out episode splits, no train/test episode leakage, training metadata, predictions, metrics, and a run report.

data gate
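
Two of the gate conditions above, the episode count and the absence of train/test episode leakage, are easy to express as a check. The sketch below covers only those two conditions; the function name and the episode-ID format are hypothetical, and the remaining requirements (training metadata, predictions, metrics, run report) are not modelled.

```python
def check_scale_up_gate(train_episodes, test_episodes, min_episodes=32):
    """Return a list of human-readable problems; an empty list means this part of the gate passes."""
    train, test = set(train_episodes), set(test_episodes)
    problems = []
    total = len(train | test)
    if total < min_episodes:
        problems.append(f"only {total} episodes, need {min_episodes}")
    if not test:
        problems.append("no held-out episodes")
    leaked = sorted(train & test)
    if leaked:
        problems.append(f"episodes in both splits: {leaked}")
    return problems

# A single episode reused for train and test fails on count and leakage.
print(check_scale_up_gate(["ep0"], ["ep0"]))
```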

Research progress without overclaiming.

The project separates implemented single-episode task development from the prepared but still data-gated Qwen3-Omni scale-up path.

verified

12 minimal heads + 12 neural MLP heads

Every task has a minimal interpretable head and a matching neural MLP run over the same windows, splits, and task contract.

verified

Four research directions are mapped honestly

The Ropedia directions are labeled as direct, proxy, or diagnostic evidence, plus one coded extension probe per direction.

data-gated

Qwen3-Omni remains readiness-only

The current Qwen3-Omni artifacts use one episode and 128 train windows. No 32-episode metric is claimed.

verified

Scope claims are machine-checked

The scope guard confirms historical 32ep run/path strings stay confined to readiness-artifact provenance and are not presented as real 32-episode results.

verified

Prepared mirrors are byte-checked

The parity report compares critical JSON, figure, and validator files across the repo, HF Space bundle, artifact dataset bundle, and model bundle before upload.
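
A byte-level parity check can be sketched by hashing the same relative path in each bundle and comparing digests. This is an illustration under assumptions: the bundle directory layout and the function names are hypothetical, and the real parity report may record more detail.

```python
import hashlib
from pathlib import Path

def sha256_of(path):
    """Stream a file through SHA-256 so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def parity_report(bundle_dirs, relative_paths):
    """Map each relative path to 'pass', 'mismatch', or 'missing' across all bundles."""
    report = {}
    for rel in relative_paths:
        files = [Path(d) / rel for d in bundle_dirs]
        if not all(f.is_file() for f in files):
            report[rel] = "missing"
            continue
        hashes = {sha256_of(f) for f in files}
        report[rel] = "pass" if len(hashes) == 1 else "mismatch"
    return report
```

A call such as `parity_report(["repo", "hf_space"], ["data/summary.json"])` (hypothetical paths) would return one status per file.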

verified

Figures are indexed as evidence

The figure index records public visual assets, dimensions, SHA-256 hashes, source scripts, and the role each figure plays in the project narrative.

verified

Brand assets are packaged consistently

The generated logo system is packaged into the website header, favicon, README/HF cards, Open Graph preview, and brand-asset manifest.

verified

Publication bundles pass hygiene checks

The validator checks required assets, raw-data exclusion, Python cache exclusion, heavy archive exclusion, accidental HF token strings, and public-card figure freshness across GitHub and the HF bundles.
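
Three of the hygiene checks named above (raw-data exclusion, Python cache exclusion, and token-string scanning) can be sketched as a scan over a bundle's file list. The suffix set, the token regular expression, and the function name are assumptions for illustration; the actual validator's rules are not reproduced here.

```python
import re
from pathlib import PurePosixPath

RAW_SUFFIXES = {".mp4", ".hdf5", ".rrd"}            # raw formats named on this page
TOKEN_PATTERN = re.compile(r"hf_[A-Za-z0-9]{20,}")  # assumed shape of an HF token string

def audit_bundle(files):
    """files: {relative_path: text_content}. Returns a list of (path, finding) pairs."""
    findings = []
    for rel, text in files.items():
        path = PurePosixPath(rel)
        if path.suffix.lower() in RAW_SUFFIXES:
            findings.append((rel, "raw data file"))
        if "__pycache__" in path.parts or path.suffix == ".pyc":
            findings.append((rel, "python cache"))
        if TOKEN_PATTERN.search(text):
            findings.append((rel, "possible HF token"))
    return findings

bundle = {
    "docs/summary.json": "{}",
    "data/episode/annotation.hdf5": "",
    "scripts/__pycache__/train.cpython-312.pyc": "",
}
for finding in audit_bundle(bundle):
    print(finding)
```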

verified

Website integrity is checked

The site validator checks local links, anchors, JSON bundles, and referenced image dimensions before publishing.
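
The anchor part of such a check can be sketched with the standard-library HTML parser: collect every `id`, collect every in-page `href="#..."`, and report fragments with no matching target. The class and function names are hypothetical, and link, JSON, and image-dimension checks are left out.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects element ids and in-page fragment links from one HTML document."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.fragment_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id"):
            self.ids.add(attrs["id"])
        href = attrs.get("href") or ""
        if tag == "a" and href.startswith("#") and len(href) > 1:
            self.fragment_links.append(href[1:])

def broken_anchors(html):
    """Return fragment names that are linked to but never defined as an id."""
    collector = LinkCollector()
    collector.feed(html)
    return sorted(set(collector.fragment_links) - collector.ids)

page = '<a href="#tasks">Tasks</a><a href="#missing">x</a><h2 id="tasks">Tasks</h2>'
print(broken_anchors(page))  # ['missing']
```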

verified

Official dataset card is aligned

The source-alignment note mirrors the public Hugging Face dataset-card facts, sample-card facts, and API metadata: gated access, sample license/tooling, modality coverage, episode layout, intended uses, and unsupported claims.

Research reading path.

A newcomer should be able to move from the dataset sample to the task design, model baselines, current limitations, and scale-up plan without reading every file first.

02

Inspect one model input

Use the window table and feature manifest to see the exact aligned sample unit, feature blocks, dimensions, and omitted audio feature status.

03

Compare minimal vs neural heads

Every task has a small interpretable baseline and a matching neural MLP head over the same feature contract and chronological split.

04

Check the scale-up gate

The multi-episode Qwen3-Omni path is prepared, but no real 32-episode result is claimed until the data gate and held-out evaluation pass.

Verified now: One public episode, 5,821 frames, 1,161 windows, 8,378 current features, 12 minimal heads, 12 neural heads, and 4 direction-extension probes.
Pending scale: A 32-episode held-out Qwen3-Omni LoRA pilot is gated on Xperience-10M access and must pass manifest, training, and evaluation checks.
Not redistributed: Raw videos, raw annotations, full Qwen weights, and private gated Xperience-10M data are not included in the public repo or HF bundles.

Aligned with the official dataset card.

The official Xperience-10M card describes a gated, large-scale 4D egocentric multimodal dataset. This project records that full upstream scope while keeping its own claims limited to one public sample episode.

Official scale

About 10M experience units and 10,000 hours, with RGB video, audio, depth, camera pose/SLAM, hand/body mocap, IMU, captions, metadata, and calibration.

alignment JSON

HF file-size display

The live Hugging Face page/API currently shows 31.9 TB hosted. This is recorded separately from the card's statement that the full-scale dataset is about 1 PB.

source JSON

HF access boundary

The source dataset is manually gated for approved non-commercial use, with an external agreement step noted by the public HF metadata.

official HF dataset

API listing snapshot

A snapshot of the HF API listing showed 803 session folders and 12,103 episode folders containing annotation.hdf5. This is planning metadata, not local data possession or a result claim.

metadata JSON

Public sample card

The sample repo lists cc-by-nc-4.0, HOMIE Toolkit for videos/annotations, and Rerun 0.29.0 for .rrd visualization.

sample dataset

Source alignment

The source-alignment validator checks full-dataset facts, public sample-card facts, API-listing caveats, and boundary markers across the repo, website, and HF cards.

alignment report

Episode layout

Expected folders contain six MP4 streams and annotation.hdf5; visualization.rrd is treated as a viewer artifact and excluded from training downloads.

alignment note
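
The expected layout can be turned into a small pre-training check. The sketch assumes the six MP4 streams and `annotation.hdf5` sit directly inside the episode folder, which the text above does not spell out; the function name is hypothetical.

```python
from pathlib import Path

def check_episode_dir(episode_dir, expected_mp4=6):
    """Return (problems, training_files) for one episode folder.

    visualization.rrd is treated as a viewer artifact and left out of
    the training file list, mirroring the layout note above.
    """
    root = Path(episode_dir)
    mp4s = sorted(p.name for p in root.glob("*.mp4"))
    problems = []
    if len(mp4s) != expected_mp4:
        problems.append(f"expected {expected_mp4} MP4 streams, found {len(mp4s)}")
    if not (root / "annotation.hdf5").is_file():
        problems.append("annotation.hdf5 is missing")
    training_files = sorted(
        p.name for p in root.iterdir()
        if p.is_file() and p.name != "visualization.rrd"
    )
    return problems, training_files
```

An episode passes this sketch when `problems` is empty; `training_files` is what a download step would keep.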

Current project subset

One public sample episode, 5,821 frames, 1,161 windows, 8,378 current features, audio documented but not yet featurized, and no raw-data redistribution.

modality atlas

Covered now

Action/subtask labels, next-action prediction, temporal diagnostics, hand trajectory, contact, object relevance, caption grounding, retrieval, reconstruction, and misalignment.

summary metrics

Responsible use

The official card notes limited diversity and showcase/production quality. This project excludes identity, surveillance, biometric, sensitive-attribute, and safety-critical uses.

use boundary

Not claimed yet

Full audio-visual learning, caption generation, depth-pixel prediction, SLAM estimation, neural rendering, policy learning, cross-episode generalization, and real 32-episode Qwen3-Omni quality.

data gate

Ropedia Xperience-10M 12-task suite, first.

Start with the full 12-task map, then inspect the large native modality atlas below it. Audio is present in the sample MP4 stream, but the current 8,378-d baseline manifest does not featurize it.

Infographic showing all 12 Ropedia Xperience-10M tasks with enlarged full-width modality cards

Readable modality atlas.

Each Xperience-10M stream gets a large thumbnail, a plain sample-content line, and the exact current-baseline use. These are small derived images only; no raw MP4, HDF5, or RRD data is redistributed.

modality_atlas.json

01 Video (visual stream)
Thumbnail: Public sample fisheye and stereo camera thumbnails
Sample contains: 6 synchronized camera MP4 streams
Current baseline use: RGB/fisheye/stereo frame statistics

02 Audio (acoustic stream)
Thumbnail: AAC waveform thumbnail from the public sample MP4 stream
Sample contains: AAC stream embedded in MP4
Current baseline use: Documented, not featurized in the 8,378-d vector

03 Depth (geometry map)
Thumbnail: Public sample depth and confidence thumbnails
Sample contains: Depth map + confidence channel
Current baseline use: Spatial geometry feature block

04 Pose / SLAM (camera pose)
Thumbnail: Public sample camera trajectory and sparse SLAM map thumbnail
Sample contains: Trajectory + sparse SLAM map
Current baseline use: Position + orientation features

05 Motion Capture (human motion)
Thumbnail: Public sample body and hand motion capture thumbnail
Sample contains: Body + hand joint tracks
Current baseline use: 3D mocap feature statistics

06 Inertial (wearable sensor)
Thumbnail: Public sample accelerometer and gyroscope time-series thumbnail
Sample contains: Accelerometer + gyroscope
Current baseline use: Wearable motion statistics

07 Language (semantic annotation)
Thumbnail: Public sample object tags and action caption thumbnail
Sample contains: Object tags + action captions
Current baseline use: Task labels + semantic targets

The atlas redistributes only small derived thumbnails and metadata. Raw MP4, HDF5, and RRD files remain excluded from this repo and the Hugging Face mirrors.

From raw episode to checked artifacts.

Every script works from one data contract: aligned multimodal windows, explicit labels, cached feature extraction, and a manifest that makes omitted modalities visible.

Verified Xperience-10M multimodal pipeline diagram

What this project proves

It demonstrates the full development loop: reading Xperience-10M sample data, aligning modalities, converting them into model-ready windows, defining meaningful tasks, producing metrics, and packaging artifacts for continued research.

What it does not prove

It does not claim general embodied intelligence. A single episode cannot support cross-environment generalization; that requires many episodes and held-out episode splits.

Small baselines, no hidden machinery.

Motion-only and current all-feature classifiers use lightweight heads so the comparison stays readable on a laptop and easy to inspect. The neural run keeps the same features and splits, then swaps in PyTorch MLP heads.

Motion-only action

0.9688 macro-F1, 18 classes

Current all-feature action

0.9791 macro-F1, 8,378 features

Motion-only subtask

0.9528 macro-F1, 14 classes

Current all-feature subtask

0.9308 macro-F1, chronological caveats

Macro-F1 comparison chart
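
Macro-F1 averages the per-class F1 scores with equal weight, so rare classes count as much as frequent ones. A minimal pure-Python sketch of the metric follows; the toy labels are made up for illustration and are not the project's label set.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in truth or prediction."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = ["pour", "pour", "stir", "grasp"]
y_pred = ["pour", "stir", "stir", "grasp"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.7778
```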

Neural MLP heads, same task contracts.

The neural baseline uses small PyTorch MLP classifiers/regressors on the same 8,378-d window features, chronological splits, and leakage filters. This isolates the value of a nonlinear head before moving to heavier Qwen/Omni experiments.

Neural hand forecast

0.1116 MPJPE, down from 0.8223 minimal

Neural temporal order

0.8718 F1, adjacent-window diagnostic

Neural misalignment

0.7335 F1, shifted motion/visual pairs

Neural cross-modal retrieval

0.1530 MRR; ridge remains stronger here

Neural MLP episode task score chart
Minimal versus neural MLP episode task score chart
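
MPJPE (mean per-joint position error), the metric used for the hand forecast above, is the average Euclidean distance between predicted and target joint positions. A minimal sketch with toy coordinates; the joint layout and units of the project's hand trajectories are not specified here, so the example values are purely illustrative.

```python
import math

def mpjpe(pred, target):
    """Mean Euclidean distance over all joints in all samples.

    pred and target are lists of samples; each sample is a list of (x, y, z) joints.
    """
    distances = [
        math.dist(p_joint, t_joint)
        for p_sample, t_sample in zip(pred, target)
        for p_joint, t_joint in zip(p_sample, t_sample)
    ]
    return sum(distances) / len(distances)

target = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
pred = [[(0.0, 3.0, 4.0), (1.0, 0.0, 0.0)]]
print(mpjpe(pred, target))  # 2.5
```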

The 12 tasks organized into four research directions.

Each task is mapped as direct, proxy, or diagnostic evidence for the Ropedia research tracks. The mapping uses two current baselines: minimal interpretable heads and neural MLP heads over the same feature contract.

partially implemented

A. Human Modeling & Motion Understanding

Direct evidence comes from hand trajectory forecasting and contact prediction; action and object relevance are supporting proxies.

2 direct · 2 proxy · 0 diagnostic

proxy tasks only

B. 3D/4D Reconstruction & Neural Rendering

Cross-modal retrieval, modality reconstruction, and misalignment detection check reconstruction prerequisites, not full geometry.

0 direct · 2 proxy · 1 diagnostic

strongest implemented

C. Egocentric Vision & Interaction

Action, subtask, transition, next-action, object, caption, order, and alignment tasks directly stress egocentric understanding.

6 direct · 2 proxy · 3 diagnostic

early proxy tasks

D. Scene Reconstruction & World Modeling

Current probes cover task state, object relevance, retrieval, reconstruction, temporal order, and alignment but no persistent map yet.

0 direct · 6 proxy · 3 diagnostic

Coverage of the 12 Xperience-10M tasks across four research directions

Baseline 1: minimal heads

Softmax, logistic, ridge, and retrieval heads keep every input/output contract readable. They are the first sanity check for whether a task is well-posed.

Baseline 2: neural MLP heads

Small PyTorch MLP classifiers/regressors reuse the same features and splits. They test nonlinear gains before heavier Omni fine-tuning.

Four extra probes make the directions actionable.

These are new data-backed extension tasks computed from the same single-episode feature tensor. They add one concrete input, process, output, and metric for each research direction, while keeping the single-episode limitation explicit.

Four Xperience-10M research-direction extension probes with minimal and neural metrics
A / motion

Body and Hand Motion Intensity

Case: classify fast reach/pour windows as high motion and steady holding windows as low motion.

Input: non-mocap video, depth, pose, IMU, SLAM, calibration, and language features.

Output: high_motion or low_motion.

0.7827 minimal macro-F1 · 0.7986 neural macro-F1
B / views

Multi-View Consistency Retrieval

Case: retrieve the synchronized stereo-left window from a fisheye-camera query.

Input: fisheye_cam0 video features against stereo_left candidate features.

Output: ranked synchronized view candidates.

0.5534 minimal MRR · 0.3469 neural MRR
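
Mean reciprocal rank (MRR) for a retrieval probe like this one can be sketched as follows, assuming query i should retrieve candidate i (the synchronized window) and that candidates are ranked by cosine similarity. The similarity choice, the tie handling, and the toy vectors are assumptions for illustration, not the project's retrieval head.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean_reciprocal_rank(queries, candidates):
    """Average of 1/rank of the matching candidate; ties are ranked optimistically."""
    total = 0.0
    for i, query in enumerate(queries):
        scores = [cosine(query, c) for c in candidates]
        rank = 1 + sum(s > scores[i] for s in scores)
        total += 1.0 / rank
    return total / len(queries)

queries = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
candidates = [(0.9, 0.1), (0.2, 0.8), (1.0, 0.0)]
print(round(mean_reciprocal_rank(queries, candidates), 4))  # 0.6111
```

In the toy example the three matches are ranked 2nd, 1st, and 3rd, giving (1/2 + 1 + 1/3) / 3.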
C / phase

Action Phase Progress Estimation

Case: estimate whether a Pour coffee window is near the start, middle, or end of its action segment.

Input: non-caption multimodal features.

Output: 0-to-1 progress inside the current action.

0.3416 minimal MAE · 0.3038 neural MAE
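
One way to build a 0-to-1 progress target is to normalize each window's position inside its run of identical action labels. This is a sketch under that assumption; the project's exact target definition (for example, whether it uses frames or windows) is not stated here, and the function name and toy labels are hypothetical.

```python
def phase_progress(action_labels):
    """Per-window progress in [0, 1] inside each run of identical consecutive labels."""
    progress, start = [], 0
    n = len(action_labels)
    while start < n:
        end = start
        while end + 1 < n and action_labels[end + 1] == action_labels[start]:
            end += 1
        length = end - start
        for i in range(start, end + 1):
            # A run of one window has no extent, so its progress is set to 0.0.
            progress.append((i - start) / length if length else 0.0)
        start = end + 1
    return progress

labels = ["pour", "pour", "pour", "stir", "stir"]
print(phase_progress(labels))  # [0.0, 0.5, 1.0, 0.0, 1.0]
```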
D / world

Short-Horizon Ego-Motion Forecasting

Case: predict how the camera translation changes over the next 20 frames.

Input: current sensors excluding camera translation and captions.

Output: future camera-translation delta vector.

0.1989 minimal MAE · 0.0989 neural MAE

What changed

The four research directions now have coded extension probes, prediction/rank CSVs, JSON metrics, a Markdown summary, and a website chart generated from real sample-window features.

What still needs scale

A full research result still needs many Xperience-10M episodes, held-out episode splits, stronger encoders, and direction-specific models such as body priors, renderers, or persistent scene graphs.

The 12 tasks share four head families.

The diagram separates the shared episode-window feature pipeline from the task-specific heads, and notes that audio remains dataset context rather than a current baseline feature block.

Verified minimal and neural architecture diagram for all 12 Ropedia Xperience-10M tasks

Interactive task walkthrough.

Each task uses a common research name and a concrete case study, then opens into the input, middle modules, output, modality evidence, metric, and current limitation.

Representative sample modality for the selected task
Step 1 / 4 · Input
Action Recognition: Egocentric Action Recognition

Input: inspect the 20-frame multimodal window before choosing the target.

01 / 12
supervised multiclass classifier

Action Recognition

In the coffee-making sample, a pouring window maps to the current action label.

    Metric: macro-F1. Minimal 0.0500; neural MLP 0.0263.

    Current limitation: single-episode chronological split.

    Task cards and metrics.

    The 12 task cards use readable research names, representative modality thumbnails, explicit input-process-output contracts, and verified minimal versus neural scores from the committed result files.

    Every feature block has a source.

    The point is not hidden complexity. Every block has a source modality, a dimensional footprint, and a manifest entry.

    All modality feature block chart

    Diagnostics separate memorization from signal.

    The charts make the main lesson visible: within-episode supervised labels are easy under some splits, while retrieval, grounding, forecasting, and alignment remain the useful probes.

    Episode task suite score chart Cross modal retrieval chart Neural MLP task score chart Minimal versus neural score chart

    Where the evidence lives.

    Metrics, predictions, confusion matrices, manifests, lightweight model weights, and derived window artifacts are committed so the repo is easy to explore before rerunning anything. Raw Xperience-10M data and Qwen weights are not redistributed.

    Start here

    Project map and scope boundary

    These files tell a reader what has been implemented, which artifacts support it, and where the current scale-up boundary stops.

    Artifact guide

    Human-readable map from project scope to data contract, task evidence, platform mirrors, and scale-up status.

    ARTIFACT_GUIDE.md

    Evidence contract

    Defines verified, readiness-only, blocked, and out-of-scope claims.

    EVIDENCE_CONTRACT.md

    Quality gates

    One release checklist for automated validators and live post-publish checks.

    quality_gates.json

    Artifact index

    Selective source-of-truth catalog with existence checks, sizes, and stable-file hashes.

    artifact_index.json

    Mirror parity

    Prepared repo, HF Space, artifact dataset, and model bundle parity for critical data, figures, website HTML, and validator files.

    mirror_parity.json

    Scope claim guard

    Machine check that historical 32ep identifiers are provenance, not real 32-episode result claims.

    scope_claims_audit.json

    Publication hygiene

    Checks raw-data exclusion, cache exclusion, heavy-archive exclusion, token-string hygiene, and public-card figure freshness.

    publication_audit.json

    Website integrity

    Checks local links, anchors, JSON files, and referenced website image dimensions.

    website_integrity.json

    Task-surface integrity

    Checks that public task cards use readable research names, modality thumbnails, and the interactive walkthrough/player contract.

    task_surface_integrity.json

    Data and task evidence

    From one episode to checked task heads

    These artifacts show the exact sample windows, feature blocks, task contracts, metrics, walkthroughs, and research-direction mapping.

    Task-suite report

    One JSON file with every task definition, split detail, and metric.

    summary_report.json

    Windows table

    Window start/end frames and aligned action/subtask labels.

    windows.csv

    Neural MLP task results

    Per-task PyTorch MLP metrics, predictions, histories, and checkpoints for the same 12 task contracts.

    neural_mlp/

    Four-direction taxonomy

    Generated JSON, CSV, Markdown, and website data mapping all 12 tasks to the four research tracks.

    research_directions/

    Direction extension probes

    Four coded probes, one per research direction, with minimal and neural metrics plus prediction/rank CSVs.

    research_direction_extensions/

    Task walkthroughs

    Case studies for all 12 tasks, including input, middle process modules, output, metric, limitation, and task-player data.

    task_walkthroughs/

    Cross-modal retrieval

    The strongest self-supervised signal from the single episode.

    metrics.json

    Mirrors and reproducibility

    Public surfaces and runnable code

    The same evidence is mirrored across GitHub Pages, Hugging Face, scripts, and lightweight baseline model repositories.

    Reproduction scripts

    Training, visualization, taxonomy, walkthrough, validator, and omni-readiness scripts.

    scripts/

    Current all-feature action model

    Classifier metrics, predictions, confusion matrix, and model weights.

    metrics.json

    Hugging Face Space

    The dashboard packaged as a public static Space.

    HF Space

    Derived HF artifacts

    Metrics, predictions, docs, and lightweight derived files without raw data redistribution.

    dataset repo

    HF baseline models

    Minimal NumPy softmax, ridge baselines, and neural task-head model files.

    model repo

    HF collection

    Space, artifacts, and model baselines grouped into one public project collection.

    collection

    Scale-up boundary

    Prepared, but not overclaimed

    The multi-episode Qwen3-Omni path is documented and scripted, but no full-pilot metric is claimed until the data gate and held-out evaluation pass.

    Multi-episode access status

    Public data-access boundary and selected 32-episode pilot plan, without private infrastructure details.

    MULTI_EPISODE_ACCESS_STATUS.md

    Qwen3-Omni readiness artifacts

    Manifests, metadata, metrics, and progress logs from the current readiness run.

    episode_manifest.json

    32-episode data gate

    The readiness gate remains the source of truth before any full pilot training claim.

    DATA_BLOCKER_REPORT.md

    Qwen3-Omni pilot is approval-ready.

    The full Xperience-10M Hugging Face dataset is gated. While access is pending, the public plan has selected a 32-episode pilot across 32 different session UUIDs.

    Selection

    Stratified round-robin over 64 top-level sessions; 680 complete candidates scanned; 32 sessions selected.
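
A round-robin selection that takes at most one episode per session can be sketched as below. The stratum definition is not given in the text above, so the `stratum` field, the tuple layout, and the function name are all hypothetical; this shows the general technique, not the project's selection script.

```python
from collections import defaultdict

def round_robin_select(candidates, target=32):
    """candidates: list of (stratum, session_id, episode_id) tuples.

    Cycles over strata in sorted order and takes one episode from an unused
    session each turn, until `target` episodes are chosen or candidates run out.
    """
    queues = defaultdict(list)
    for stratum, session, episode in sorted(candidates):
        queues[stratum].append((session, episode))
    strata = sorted(queues)
    selected, used_sessions = [], set()
    while len(selected) < target and any(queues[s] for s in strata):
        for stratum in strata:
            while queues[stratum]:
                session, episode = queues[stratum].pop(0)
                if session not in used_sessions:
                    used_sessions.add(session)
                    selected.append((stratum, session, episode))
                    break
            if len(selected) == target:
                break
    return selected

# Toy pool: 4 strata x 10 sessions x 2 episodes each.
pool = [(f"s{k}", f"sess{k}-{j}", f"ep{e}")
        for k in range(4) for j in range(10) for e in range(2)]
picked = round_robin_select(pool, target=8)
print(len(picked), len({session for _, session, _ in picked}))  # 8 8
```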

    Transfer

    Download raw episodes only from official gated sources, exclude visualization.rrd, validate files, then stage them for training.

    Boundary

    The current LoRA artifact is a readiness checkpoint. A real 32-episode result requires local gated data and held-out evaluation.

    Reproduce the suite.

    Raw Xperience-10M data is not redistributed here. The public reproduction contract states the commands, expected outputs, exact-match reproduction evidence, and current non-reproducible scale-up boundary.

    Reproducibility contract

    Human-readable commands, expected artifacts, and boundaries for the public single-episode pipeline.

    REPRODUCIBILITY.md

    Reproducibility matrix

    Machine-readable command matrix covering sample download, baselines, 12 tasks, figures, and validation.

    reproducibility_matrix.json

    Exact-match reproduction check

    The last metric check rebuilt the public-sample outputs from fresh cache and matched the committed metrics.

    reproducibility_audit.md

    Website integrity

    Local HTML references, anchors, JSON bundles, and image dimensions checked before publishing.

    website_integrity.json

    Scale-up boundary

    The 32-episode Qwen3-Omni pilot is prepared but not publicly reproducible until gated data access and held-out evaluation pass.

    DATA_BLOCKER_REPORT.md

    Minimal path: install the toolkit dependencies, download the official sample, run the 12-task suite with neural heads, regenerate visualizations, then run the artifact index and publication validator.

    # Toolkit and a fresh virtual environment
    git clone https://github.com/Ropedia/HOMIE-toolkit.git
    python3.12 -m venv .venv
    source .venv/bin/activate
    pip install -r HOMIE-toolkit/requirements.txt huggingface_hub hf_xet

    # Task-suite repo and its dependencies (PyTorch is used by the neural heads)
    git clone https://github.com/ChaoYue0307/ropedia-xperience-10m-task-suite.git
    pip install -r ropedia-xperience-10m-task-suite/requirements.txt
    pip install torch

    # Official public sample episode
    hf download ropedia-ai/xperience-10m-sample \
      --repo-type dataset \
      --local-dir data/sample/xperience-10m-sample

    # 12-task suite with neural heads, extension probes, and walkthrough data
    cd ropedia-xperience-10m-task-suite
    export WORKSPACE=/path/to/workspace
    python scripts/episode_task_suite.py --workspace "$WORKSPACE" --include-neural
    python scripts/research_direction_extension_tasks.py
    python scripts/task_walkthroughs.py

    # Figures and modality-atlas assets
    python scripts/generate_visualizations.py
    python scripts/render_overview_figures.py
    python scripts/render_task_suite_infographic.py
    python scripts/export_modality_atlas_assets.py

    # Validators and artifact index
    python scripts/validate_website_integrity.py
    python scripts/validate_scope_claims.py
    python scripts/build_artifact_index.py
    python scripts/validate_mirror_parity.py
    python scripts/validate_publication_package.py