Multimodal episode pipeline
One Xperience-10M public sample episode is converted into aligned windows and a documented feature contract.
This project uses the public Xperience-10M sample from Ropedia to explore embodied-AI task design, multimodal feature construction, lightweight baselines, and future Omni-model fine-tuning. It starts from the sample episode available now, then keeps the same data contracts ready for held-out multi-episode training when more Xperience-10M data is staged.
mocap, camera+imu, depth, video, language, static
The page is organized like a compact research project: motivation and scope, dataset sample, task suite, method, baselines, research directions, interactive walkthroughs, and resources for continuing the work.
Every core task has a minimal baseline and a compact PyTorch MLP head over the same windows, splits, and labels.
The public description is aligned to the official gated Xperience-10M dataset card, including modalities, scale, access, and unsupported claims.
Metrics, figures, walkthrough data, baseline weights, and project metadata are packaged across GitHub, GitHub Pages, and Hugging Face.
The 32-episode LoRA path is prepared, but no model-quality claim is made until gated data access, held-out splits, training, and evaluation pass.
Raw MP4/HDF5/RRD files, private gated Xperience-10M data, and full Qwen weights are excluded from the public repo and HF mirrors.
The protocol is generated from committed metric artifacts so readers can see the exact data unit, split, task targets, leakage controls, and unsupported interpretations before comparing scores.
One 20-frame aligned window from the public sample episode, stride 5 frames, 1,161 windows total, represented by the current 8,378-d feature vector.
Single-episode chronological 70/30 train/test split. This avoids random future-window mixing, but it cannot prove cross-episode generalization.
protocol doc
All 12 tasks list input, target, primary metric, minimal baseline score, and neural MLP score from committed result files.
summary metrics
Scalers fit on train windows only; future labels, target feature blocks, caption/object labels, and contact labels stay on the target side unless explicitly queried.
builder script
No cross-episode generalization, no audio-visual learning claim, no pixel-depth or neural-rendering claim, and no real 32-episode Qwen3-Omni quality claim.
scope guard
A real Omni claim requires at least 32 valid episodes, held-out episode splits, no train/test episode leakage, training metadata, predictions, metrics, and run report.
data gate
The project separates implemented single-episode task development from the prepared but still data-gated Qwen3-Omni scale-up path.
5,821 frames become 1,161 synchronized 20-frame windows with an explicit 8,378-d feature contract.
Every task has a minimal interpretable head and a matching neural MLP run over the same windows, splits, and task contract.
Each task is labeled as direct, proxy, or diagnostic evidence for the Ropedia research directions, and each direction has one coded extension probe.
The current Qwen3-Omni artifacts use one episode and 128 train windows. No 32-episode metric is claimed.
The scope guard confirms historical 32ep run/path strings stay confined to readiness-artifact provenance and are not presented as real 32-episode results.
The parity report compares critical JSON, figure, and validator files across the repo, HF Space bundle, artifact dataset bundle, and model bundle before upload.
The figure index records public visual assets, dimensions, SHA-256 hashes, source scripts, and the role each figure plays in the project narrative.
The generated logo system is packaged into the website header, favicon, README/HF cards, Open Graph preview, and brand-asset manifest.
The validator checks required assets, raw-data exclusion, Python cache exclusion, heavy archive exclusion, accidental HF token strings, and public-card figure freshness across GitHub and the HF bundles.
The site validator checks local links, anchors, JSON bundles, and referenced image dimensions before publishing.
The quality-gate manifest collects the automated validators and the live post-publish checks required before the release is presented as current.
The source-alignment note mirrors the public Hugging Face dataset-card facts, sample-card facts, and API metadata: gated access, sample license/tooling, modality coverage, episode layout, intended uses, and unsupported claims.
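The publication validator described earlier in this list checks raw-data exclusion, Python cache exclusion, and accidental token strings. A minimal sketch of that kind of per-file check follows. It is not the project's validator: `audit_file`, the suffix set, and the token pattern (`hf_` followed by 30 or more alphanumerics) are assumptions made for illustration.

```python
import re
from pathlib import PurePosixPath

# Assumed rule set; the real validator's rules are not shown on this page.
RAW_SUFFIXES = {".mp4", ".hdf5", ".rrd"}
CACHE_DIRS = {"__pycache__"}
TOKEN_PATTERN = re.compile(r"hf_[A-Za-z0-9]{30,}")  # assumed token shape


def audit_file(path: str, text: str = "") -> list[str]:
    """Return the hygiene problems found for one repo-relative path."""
    p = PurePosixPath(path)
    problems = []
    if p.suffix.lower() in RAW_SUFFIXES:
        problems.append("raw data file")
    if (CACHE_DIRS & set(p.parts)) or p.suffix == ".pyc":
        problems.append("python cache")
    if TOKEN_PATTERN.search(text):
        problems.append("possible HF token")
    return problems


print(audit_file("data/sample/annotation.hdf5"))  # ['raw data file']
print(audit_file("docs/index.html", "<html></html>"))  # []
```

A release script could run a check like this over every file staged for GitHub and the HF bundles and refuse to publish if any list is non-empty.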
A newcomer should be able to move from the dataset sample to the task design, model baselines, current limitations, and scale-up plan without reading every file first.
Start with the project status, evidence contract, artifact index, scope guard, publication hygiene report, and website integrity report. They separate implemented single-episode work from the prepared Qwen3-Omni scale-up.
Use the window table and feature manifest to see the exact aligned sample unit, feature blocks, dimensions, and omitted audio feature status.
Every task has a small interpretable baseline and a matching neural MLP head over the same feature contract and chronological split.
The multi-episode Qwen3-Omni path is prepared, but no real 32-episode result is claimed until the data gate and held-out evaluation pass.
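The data gate mentioned above can be pictured as a small check on the episode lists. The sketch below covers only three of the stated requirements (at least 32 episodes, a held-out split, no train/test episode overlap); training metadata, predictions, metrics, and the run report are not modeled. The function names and report keys are hypothetical.

```python
def gate_report(train_episodes, test_episodes, min_episodes=32):
    """Summarize whether an episode-level split could support a real claim."""
    train, test = set(train_episodes), set(test_episodes)
    leaked = sorted(train & test)
    return {
        "enough_episodes": len(train | test) >= min_episodes,
        "has_held_out_split": bool(train) and bool(test),
        "no_episode_leakage": not leaked,
        "leaked_episodes": leaked,
    }


def gate_passes(report):
    """True only when every modeled requirement holds."""
    keys = ("enough_episodes", "has_held_out_split", "no_episode_leakage")
    return all(report[k] for k in keys)


# The current single-episode setup fails the gate by construction.
print(gate_passes(gate_report(["ep0"], ["ep0"])))  # False
```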
The official Xperience-10M card describes a gated, large-scale 4D egocentric multimodal dataset. This project records that full upstream scope while keeping its own claims limited to one public sample episode.
About 10M experience units and 10,000 hours, with RGB video, audio, depth, camera pose/SLAM, hand/body mocap, IMU, captions, metadata, and calibration.
The live Hugging Face page/API currently shows 31.9 TB hosted. This is recorded separately from the card's statement of about 1 PB of full-scale storage.
source JSON
The source dataset is manually gated for approved non-commercial use, with an external agreement step noted by the public HF metadata.
official HF dataset
HF API metadata observed 803 session folders and 12,103 episode folders with annotation.hdf5. This is planning metadata, not local data possession or a result claim.
The sample repo lists cc-by-nc-4.0, HOMIE Toolkit for videos/annotations, and Rerun 0.29.0 for .rrd visualization.
The source-alignment validator checks full-dataset facts, public sample-card facts, API-listing caveats, and boundary markers across the repo, website, and HF cards.
alignment report
Expected folders contain six MP4 streams and annotation.hdf5; visualization.rrd is treated as a viewer artifact and excluded from training downloads.
One public sample episode, 5,821 frames, 1,161 windows, 8,378 current features, audio documented but not yet featurized, and no raw-data redistribution.
modality atlas
Action/subtask labels, next-action prediction, temporal diagnostics, hand trajectory, contact, object relevance, caption grounding, retrieval, reconstruction, and misalignment.
summary metrics
The official card notes limited diversity and showcase/production quality. This project excludes identity, surveillance, biometric, sensitive-attribute, and safety-critical uses.
use boundary
Not claimed: full audio-visual learning, caption generation, depth-pixel prediction, SLAM estimation, neural rendering, policy learning, cross-episode generalization, and real 32-episode Qwen3-Omni quality.
data gate
Start with the full 12-task map, then inspect the large native modality atlas below it. Audio is present in the sample MP4 stream, but the current 8,378-d baseline manifest does not featurize it.
Each Xperience-10M stream gets a large thumbnail, a plain sample-content line, and the exact current-baseline use. These are small derived images only; no raw MP4, HDF5, or RRD data is redistributed.
6 synchronized camera MP4 streams
RGB/fisheye/stereo frame statistics
AAC stream embedded in MP4
Documented, not featurized in the 8,378-d vector
Depth map + confidence channel
Spatial geometry feature block
Trajectory + sparse SLAM map
Position + orientation features
Body + hand joint tracks
3D mocap feature statistics
Accelerometer + gyroscope
Wearable motion statistics
Object tags + action captions
Task labels + semantic targets
The atlas redistributes only small derived thumbnails and metadata. Raw MP4, HDF5, and RRD files remain excluded from this repo and the Hugging Face mirrors.
Every script works from one data contract: aligned multimodal windows, explicit labels, cached feature extraction, and a manifest that makes omitted modalities visible.
It demonstrates the full development loop: reading Xperience-10M sample data, aligning modalities, converting them into model-ready windows, defining meaningful tasks, producing metrics, and packaging artifacts for continued research.
It does not claim general embodied intelligence. A single episode cannot support cross-environment generalization; that requires many episodes and held-out episode splits.
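The window and split numbers quoted on this page can be reproduced with a few lines of arithmetic. The sketch below assumes windows start at frame 0, only full windows are kept, and the 70/30 cut is taken by truncating `0.7 * num_windows`; the function names are illustrative and the project's own rounding rule is not stated here.

```python
def window_starts(num_frames: int, window: int = 20, stride: int = 5) -> list[int]:
    """Start frame of every full window that fits inside the episode."""
    if num_frames < window:
        return []
    return list(range(0, num_frames - window + 1, stride))


def chronological_split(num_windows: int, train_fraction: float = 0.7):
    """Earlier windows go to train, later windows to test; nothing is shuffled."""
    cut = int(num_windows * train_fraction)  # assumed rounding rule
    return range(0, cut), range(cut, num_windows)


starts = window_starts(5821)  # frame count quoted on this page
train_idx, test_idx = chronological_split(len(starts))
print(len(starts))  # 1161, matching the window count quoted on this page
```

Because the stride (5) is shorter than the window (20), windows on either side of the cut share frames. Whether the pipeline drops a gap at the boundary is not stated on this page, so the sketch does not model one.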
Motion-only and current all-feature classifiers use lightweight heads so the comparison stays readable on a laptop and easy to inspect. The neural run keeps the same features and splits, then swaps in PyTorch MLP heads.
The neural baseline uses small PyTorch MLP classifiers/regressors on the same 8,378-d window features, chronological splits, and leakage filters. This isolates the value of a nonlinear head before moving to heavier Qwen/Omni experiments.
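One of the leakage controls listed earlier is that scalers are fit on train windows only. A minimal pure-Python sketch of that idea follows; the project presumably uses array libraries for this, and `fit_scaler` and `transform` are hypothetical names.

```python
from statistics import fmean, pstdev


def fit_scaler(train_rows):
    """Per-column mean and std computed from train rows only."""
    columns = list(zip(*train_rows))
    means = [fmean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # constant columns keep scale 1
    return means, stds


def transform(rows, means, stds):
    """Apply train statistics to any split without refitting."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


train = [[1.0, 10.0], [3.0, 10.0]]  # toy values, not project data
test = [[5.0, 12.0]]
means, stds = fit_scaler(train)
print(transform(test, means, stds))  # [[3.0, 2.0]]
```

The test rows never influence `means` or `stds`, which is the property the leakage control is meant to guarantee.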
Each task is mapped as direct, proxy, or diagnostic evidence for the Ropedia research tracks. The mapping uses two current baselines: minimal interpretable heads and neural MLP heads over the same feature contract.
Direct evidence comes from hand trajectory forecasting and contact prediction; action and object relevance are supporting proxies.
Cross-modal retrieval, modality reconstruction, and misalignment detection check reconstruction prerequisites, not full geometry.
Action, subtask, transition, next-action, object, caption, order, and alignment tasks directly stress egocentric understanding.
Current probes cover task state, object relevance, retrieval, reconstruction, temporal order, and alignment, but there is no persistent map yet.
Softmax, logistic, ridge, and retrieval heads keep every input/output contract readable. They are the first sanity check for whether a task is well-posed.
Small PyTorch MLP classifiers/regressors reuse the same features and splits. They test nonlinear gains before heavier Omni fine-tuning.
These are new data-backed extension tasks computed from the same single-episode feature tensor. They add one concrete input, process, output, and metric for each research direction, while keeping the single-episode limitation explicit.
Case: classify fast reach/pour windows as high motion and steady holding windows as low motion.
Input: non-mocap video, depth, pose, IMU, SLAM, calibration, and language features.
Output: high_motion or low_motion.
Case: retrieve the synchronized stereo-left window from a fisheye-camera query.
Input: fisheye_cam0 video features against stereo_left candidate features.
Output: ranked synchronized view candidates.
Case: estimate whether a Pour coffee window is near the start, middle, or end of its action segment.
Input: non-caption multimodal features.
Output: 0-to-1 progress inside the current action.
Case: predict how the camera translation changes over the next 20 frames.
Input: current sensors excluding camera translation and captions.
Output: future camera-translation delta vector.
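Two of the probe targets above are simple functions of the annotations. The sketch below shows one plausible way to compute them, assuming a window is represented by a single reference frame and that camera translation is available per frame as an (x, y, z) tuple. Both function names and the choice of reference frame are assumptions, not the project's implementation.

```python
def action_progress(frame: int, segment_start: int, segment_end: int) -> float:
    """Position of a frame inside its action segment, clipped to [0, 1]."""
    length = segment_end - segment_start
    if length <= 0:
        return 0.0
    return min(1.0, max(0.0, (frame - segment_start) / length))


def future_translation_delta(translations, t: int, horizon: int = 20):
    """Translation at t + horizon minus translation at t, or None near the end."""
    if t + horizon >= len(translations):
        return None
    now, later = translations[t], translations[t + horizon]
    return tuple(b - a for a, b in zip(now, later))


# Toy trajectory: the camera moves one unit along x per frame.
trajectory = [(float(i), 0.0, 0.0) for i in range(30)]
print(action_progress(150, 100, 200))  # 0.5
print(future_translation_delta(trajectory, 5))  # (20.0, 0.0, 0.0)
```

Windows whose horizon runs past the end of the episode return `None` here and would be dropped from the forecasting task under this sketch.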
The four research directions now have coded extension probes, prediction/rank CSVs, JSON metrics, a Markdown summary, and a website chart generated from real sample-window features.
A full research result still needs many Xperience-10M episodes, held-out episode splits, stronger encoders, and direction-specific models such as body priors, renderers, or persistent scene graphs.
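The cross-view retrieval probe produces ranked candidates, so its metrics are rank-based. A minimal sketch of ranking and recall@k follows. Cosine similarity is an assumption here, since the page does not say which similarity the probe uses, and the vectors are toy values rather than real window features.

```python
import math


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_of_match(query, candidates, true_index):
    """1-based rank of the synchronized candidate, sorted by similarity."""
    scores = [cosine(query, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order.index(true_index) + 1


def recall_at_k(ranks, k):
    """Fraction of queries whose true match appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


# Query i is synchronized with candidate i in this toy setup.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2]]
candidates = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.6]]
ranks = [rank_of_match(q, candidates, i) for i, q in enumerate(queries)]
print(ranks)  # [1, 1, 2]
```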
The diagram separates the shared episode-window feature pipeline from the task-specific heads, and notes that audio remains dataset context rather than a current baseline feature block.
Each task uses a common research name and a concrete case study, then opens into the input, middle modules, output, modality evidence, metric, and current limitation.
Input: inspect the 20-frame multimodal window before choosing the target.
In the coffee-making sample, a pouring window maps to the current action label.
Metric: macro-F1. Minimal 0.0500; neural MLP 0.0263.
Current limitation: single-episode chronological split.
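Macro-F1 averages the per-class F1 score with equal weight per class, so a model that predicts only the most common action scores poorly even when its accuracy looks acceptable. The sketch below is a plain-Python version for illustration; the labels are toy values and the project's own metric code is not shown on this page.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["pour", "pour", "stir", "grind"]  # toy labels
y_pred = ["pour", "pour", "pour", "pour"]   # always predicts the majority class
print(round(macro_f1(y_true, y_pred), 4))  # 0.2222, while accuracy is 0.5
```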
The 12 task cards use readable research names, representative modality thumbnails, explicit input-process-output contracts, and verified minimal versus neural scores from the committed result files.
The point is not hidden complexity. Every block has a source modality, a dimensional footprint, and a manifest entry.
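A feature manifest of this kind can be checked mechanically. The sketch below assumes each block records `name`, `start`, `end` (exclusive), and `dim`, and verifies that the blocks tile the vector with no gaps or overlaps. The field names and the toy block sizes are illustrative, not the project's actual 8,378-d layout.

```python
def validate_manifest(blocks, total_dim):
    """Check that blocks tile [0, total_dim) with no gaps or overlaps."""
    cursor = 0
    for block in sorted(blocks, key=lambda b: b["start"]):
        if block["start"] != cursor:
            raise ValueError(f"gap or overlap before {block['name']}")
        if block["end"] - block["start"] != block["dim"]:
            raise ValueError(f"dimension mismatch in {block['name']}")
        cursor = block["end"]
    if cursor != total_dim:
        raise ValueError(f"blocks cover {cursor} dims, expected {total_dim}")
    return True


# Toy manifest with made-up sizes; audio is listed as omitted, not as a block.
toy_blocks = [
    {"name": "imu", "start": 0, "end": 4, "dim": 4},
    {"name": "depth", "start": 4, "end": 10, "dim": 6},
]
toy_omitted = [{"name": "audio", "status": "documented, not featurized"}]
print(validate_manifest(toy_blocks, 10))  # True
```

Keeping omitted modalities in a separate list, as in `toy_omitted`, is one way to make a missing block visible instead of silently absent.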
The charts make the main lesson visible: within-episode supervised labels are easy under some splits, while retrieval, grounding, forecasting, and alignment remain the useful probes.
Metrics, predictions, confusion matrices, manifests, lightweight model weights, and derived window artifacts are committed so the repo is easy to explore before rerunning anything. Raw Xperience-10M data and Qwen weights are not redistributed.
These files tell a reader what has been implemented, which artifacts support it, and where the current scale-up boundary stops.
Human-readable map from project scope to data contract, task evidence, platform mirrors, and scale-up status.
Defines verified, readiness-only, blocked, and out-of-scope claims.
EVIDENCE_CONTRACT.md
One release checklist for automated validators and live post-publish checks.
quality_gates.json
Last public GitHub/HF URL verification after upload.
live_publication_status.json
Selective source-of-truth catalog with existence checks, sizes, and stable-file hashes.
artifact_index.json
Prepared repo, HF Space, artifact dataset, and model bundle parity for critical data, figures, website HTML, and validator files.
mirror_parity.json
Machine check that historical 32ep identifiers are provenance, not real 32-episode result claims.
Checks raw-data exclusion, cache exclusion, heavy-archive exclusion, token-string hygiene, and public-card figure freshness.
publication_audit.json
Checks local links, anchors, JSON files, and referenced website image dimensions.
website_integrity.json
Checks that public task cards use readable research names, modality thumbnails, and the interactive walkthrough/player contract.
task_surface_integrity.json
Machine-readable 90-second project path and scope summary.
project_packet.json
These artifacts show the exact sample windows, feature blocks, task contracts, metrics, walkthroughs, and research-direction mapping.
One JSON file with every task definition, split detail, and metric.
summary_report.json
Window start/end frames and aligned action/subtask labels.
windows.csv
Start/end index and dimension for every current feature block.
feature_manifest.json
Per-task PyTorch MLP metrics, predictions, histories, and checkpoints for the same 12 task contracts.
neural_mlp/
Generated JSON, CSV, Markdown, and website data mapping all 12 tasks to the four research tracks.
research_directions/
Four coded probes, one per research direction, with minimal and neural metrics plus prediction/rank CSVs.
research_direction_extensions/
Case studies for all 12 tasks, including input, middle process modules, output, metric, limitation, and task-player data.
task_walkthroughs/
The strongest self-supervised signal from the single episode.
metrics.json
The same evidence is mirrored across GitHub Pages, Hugging Face, scripts, and lightweight baseline model repositories.
Training, visualization, taxonomy, walkthrough, validator, and omni-readiness scripts.
scripts/
Classifier metrics, predictions, confusion matrix, and model weights.
metrics.json
The dashboard packaged as a public static Space.
HF Space
Metrics, predictions, docs, and lightweight derived files without raw data redistribution.
dataset repo
Minimal NumPy softmax, ridge baselines, and neural task-head model files.
model repo
Space, artifacts, and model baselines grouped into one public project collection.
collection
The multi-episode Qwen3-Omni path is documented and scripted, but no full-pilot metric is claimed until the data gate and held-out evaluation pass.
Public data-access boundary and selected 32-episode pilot plan, without private infrastructure details.
MULTI_EPISODE_ACCESS_STATUS.md
Manifests, metadata, metrics, and progress logs from the current readiness run.
episode_manifest.json
The readiness gate remains the source of truth before any full pilot training claim.
DATA_BLOCKER_REPORT.md
The full Xperience-10M Hugging Face dataset is gated. While access is pending, the public plan has selected a 32-episode pilot across 32 different session UUIDs.
Stratified round-robin over 64 top-level sessions; 680 complete candidates scanned; 32 sessions selected.
Download raw episodes only from official gated sources, exclude visualization.rrd, validate files, then stage them for training.
The current LoRA artifact is a readiness checkpoint. A real 32-episode result requires local gated data and held-out evaluation.
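The selection step above is described as a stratified round-robin. The sketch below shows one reading of that: treat each top-level session as a stratum, cycle through sessions in sorted order, and take one candidate episode per visit until the target count is reached. The stratum definition, the ordering, and the function name are assumptions; the project's selection script is not shown on this page.

```python
def round_robin_select(candidates_by_session, target=32):
    """Pick (session, episode) pairs by cycling through sessions in sorted order."""
    queues = {s: list(eps) for s, eps in sorted(candidates_by_session.items()) if eps}
    picked = []
    while queues and len(picked) < target:
        for session in list(queues):
            picked.append((session, queues[session].pop(0)))
            if not queues[session]:
                del queues[session]  # session has no candidates left
            if len(picked) == target:
                break
    return picked


# Toy candidate listing with hypothetical identifiers.
toy = {"s1": ["e1", "e2"], "s2": ["e3"], "s3": ["e4", "e5"]}
print(round_robin_select(toy, target=3))
# [('s1', 'e1'), ('s2', 'e3'), ('s3', 'e4')]
```

With at least as many non-empty sessions as the target, the first pass already yields one episode from each of `target` distinct sessions, which is consistent with a 32-episode pilot spread across 32 session UUIDs.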
Raw Xperience-10M data is not redistributed here. The public reproduction contract states the commands, expected outputs, exact-match reproduction evidence, and current non-reproducible scale-up boundary.
Human-readable commands, expected artifacts, and boundaries for the public single-episode pipeline.
REPRODUCIBILITY.md
Machine-readable command matrix covering sample download, baselines, 12 tasks, figures, and validation.
reproducibility_matrix.json
The last metric check rebuilt the public-sample outputs from fresh cache and matched the committed metrics.
reproducibility_audit.md
Local HTML references, anchors, JSON bundles, and image dimensions checked before publishing.
website_integrity.json
The 32-episode Qwen3-Omni pilot is prepared but not publicly reproducible until gated data access and held-out evaluation pass.
DATA_BLOCKER_REPORT.md
Minimal path: install the toolkit dependencies, download the official sample, run the 12-task suite with neural heads, regenerate visualizations, then run the artifact index and publication validator.
git clone https://github.com/Ropedia/HOMIE-toolkit.git
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r HOMIE-toolkit/requirements.txt huggingface_hub hf_xet
git clone https://github.com/ChaoYue0307/ropedia-xperience-10m-task-suite.git
pip install -r ropedia-xperience-10m-task-suite/requirements.txt
pip install torch
hf download ropedia-ai/xperience-10m-sample \
  --repo-type dataset \
  --local-dir data/sample/xperience-10m-sample
cd ropedia-xperience-10m-task-suite
export WORKSPACE=/path/to/workspace
python scripts/episode_task_suite.py --workspace "$WORKSPACE" --include-neural
python scripts/research_direction_extension_tasks.py
python scripts/task_walkthroughs.py
python scripts/generate_visualizations.py
python scripts/render_overview_figures.py
python scripts/render_task_suite_infographic.py
python scripts/export_modality_atlas_assets.py
python scripts/validate_website_integrity.py
python scripts/validate_scope_claims.py
python scripts/build_artifact_index.py
python scripts/validate_mirror_parity.py
python scripts/validate_publication_package.py
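The artifact index built by the commands above is described as a catalog with existence checks, sizes, and stable-file hashes. A minimal sketch of what one entry might contain follows. It does not reproduce `scripts/build_artifact_index.py`; the `index_entry` function and its field names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def index_entry(path: Path) -> dict:
    """Record existence, byte size, and SHA-256 for one artifact file."""
    if not path.is_file():
        return {"path": path.name, "exists": False}
    data = path.read_bytes()
    return {
        "path": path.name,
        "exists": True,
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# Demonstrate on a throwaway file rather than a real project artifact.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "metrics.json"
    sample.write_bytes(b"abc")
    entry = index_entry(sample)
    missing = index_entry(Path(tmp) / "not_there.json")

print(entry["size_bytes"], entry["sha256"][:8])  # 3 ba7816bf
```

Comparing entries like these across the repo and the HF bundles is one way a mirror-parity check could detect a stale or missing file.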