Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

Zhi Hou, Tianyi Zhang, Yuwen Xiong, Haonan Duan, Hengjun Pu, Ronglei Tong, Chengyang Zhao, Xizhou Zhu, Yu Qiao, Jifeng Dai, Yuntao Chen

2025-03-25 · Denoising · Robot Manipulation · Vision-Language-Action

Abstract

While recent vision-language-action models trained on diverse robot datasets exhibit promising generalization capabilities with limited in-domain data, their reliance on compact action heads to predict discretized or continuous actions constrains adaptability to heterogeneous action spaces. We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffusion process. Departing from prior methods that condition denoising on fused embeddings via shallow networks, Dita employs in-context conditioning -- enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. This design explicitly models action deltas and environmental nuances. By scaling the diffusion action denoiser alongside the Transformer's scalability, Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces. Such synergy enhances robustness to a variety of variations and facilitates the successful execution of long-horizon tasks. Evaluations across extensive benchmarks demonstrate state-of-the-art or competitive performance in simulation. Notably, Dita achieves robust real-world adaptation to environmental variances and complex long-horizon tasks through 10-shot finetuning, using only third-person camera inputs. The architecture establishes a versatile, lightweight, and open-source baseline for generalist robot policy learning. Project Page: https://robodita.github.io.
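
The mechanism the abstract describes (a Transformer that denoises a continuous action chunk while attending directly to raw visual and language tokens, rather than to a single fused condition vector) can be sketched in a few lines of Python/PyTorch. The snippet below is an illustrative assumption of how such an in-context-conditioned denoiser might look; the module names, dimensions, and the DDPM-style noise-prediction objective are not taken from the Dita codebase.

import torch
import torch.nn as nn

# Sketch of a diffusion-Transformer action denoiser with in-context conditioning.
# All sizes and names are illustrative assumptions, not Dita's released code.
class InContextActionDenoiser(nn.Module):
    def __init__(self, d_model=512, n_layers=8, n_heads=8, action_dim=7, horizon=16):
        super().__init__()
        self.action_in = nn.Linear(action_dim, d_model)   # embed the noisy action chunk
        self.time_emb = nn.Embedding(1000, d_model)       # diffusion-timestep embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.action_out = nn.Linear(d_model, action_dim)  # per-step noise prediction
        self.horizon = horizon

    def forward(self, obs_tokens, lang_tokens, noisy_actions, t):
        # obs_tokens:  (B, N_obs, d)  raw visual tokens from historical observations
        # lang_tokens: (B, N_lang, d) instruction tokens
        # noisy_actions: (B, horizon, action_dim); t: (B,) diffusion step indices
        a = self.action_in(noisy_actions) + self.time_emb(t)[:, None, :]
        # In-context conditioning: action tokens share one sequence with the raw
        # condition tokens, so attention can align each denoised action with
        # individual visual tokens instead of a fused embedding.
        x = torch.cat([obs_tokens, lang_tokens, a], dim=1)
        x = self.blocks(x)
        return self.action_out(x[:, -self.horizon:])

# One DDPM-style training step (epsilon-prediction objective, assumed here;
# alphas_cumprod is a 1-D tensor with at most 1000 steps to match the embedding).
def training_step(model, obs_tokens, lang_tokens, actions, alphas_cumprod):
    B = actions.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,))
    noise = torch.randn_like(actions)
    a_bar = alphas_cumprod[t][:, None, None]
    noisy = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * noise
    pred = model(obs_tokens, lang_tokens, noisy, t)
    return nn.functional.mse_loss(pred, noise)

The design point emphasized above is the concatenation step: because the action tokens sit in the same sequence as the observation tokens, every attention layer can relate individual denoised actions to individual visual tokens, rather than conditioning the denoiser on a single pooled observation embedding.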

Results

Task                 Dataset                  Metric                                    Value   Model
Robot Manipulation   SimplerEnv-Google Robot  Variant Aggregation                       0.652   Dita-300M
Robot Manipulation   SimplerEnv-Google Robot  Variant Aggregation - Move Near           0.73    Dita-300M
Robot Manipulation   SimplerEnv-Google Robot  Variant Aggregation - Open/Close Drawer   0.37    Dita-300M
Robot Manipulation   SimplerEnv-Google Robot  Variant Aggregation - Pick Coke Can       0.855   Dita-300M
Robot Manipulation   SimplerEnv-Google Robot  Visual Matching                           0.687   Dita-300M
Robot Manipulation   SimplerEnv-Google Robot  Visual Matching - Move Near               0.76    Dita-300M
Robot Manipulation   SimplerEnv-Google Robot  Visual Matching - Open/Close Drawer       0.463   Dita-300M
Robot Manipulation   SimplerEnv-Google Robot  Visual Matching - Pick Coke Can           0.837   Dita-300M

Related Papers

fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models (2025-07-17)
LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
HUG-VAS: A Hierarchical NURBS-Based Generative Model for Aortic Geometry Synthesis and Controllable Editing (2025-07-15)
AirLLM: Diffusion Policy-based Adaptive LoRA for Remote Fine-Tuning of LLM over the Air (2025-07-15)
Vision Language Action Models in Robotic Manipulation: A Systematic Review (2025-07-14)