- [2025/10/24] 🔥 Code for fine-tuning and data processing has been released! Everything is fully open-source.
- [2025/08/30] 🔥 Code for replicating MolmoAct's training pipeline has been released
- [2025/08/15] 🔥 Code for MolmoAct Evaluation on SimplerEnv has been released at allenai/SimplerEnv
- [2025/08/12] 🔥 Datasets used for our pre-training and mid-training have been released
- [2025/08/12] 🔥 Models have been released
- Overview
- Release Notes
  - 2.1 Datasets
  - 2.2 Models
- Installation
- Training
  - 4.1 Train Your Own MolmoAct
    - 4.1.1 Data Processing
    - 4.1.2 Fine-tuning (Post-training)
    - 4.1.3 Merge LoRA
    - 4.1.4 Inference
    - 4.1.5 Visualization
  - 4.2 Training Replication
    - 4.2.1 Pre-training
    - 4.2.2 Mid-training
    - 4.2.3 Post-training (LIBERO)
- Evaluation
  - 5.1 SimplerEnv
  - 5.2 LIBERO
  - 5.3 Real-world
- License and Use
- Model and Hardware Safety
- Citation
- Contacts
MolmoAct is a repository for training and using Ai2's open-source Action Reasoning Model, which can reason in space.
| Data | Description | Dataset Path |
|---|---|---|
| MolmoAct Dataset | MolmoAct dataset in LeRobot format. All contents were collected in-house by Ai2. | https://huggingface.co/datasets/allenai/MolmoAct-Dataset |
| MolmoAct Pre-training Mixture | Data mixture for MolmoAct pre-training. Contains a subset of OXE formulated as Action Reasoning data, auxiliary robot data, and web data. | https://huggingface.co/datasets/allenai/MolmoAct-Pretraining-Mixture |
| MolmoAct Mid-training Mixture | Data mixture for MolmoAct mid-training. Contains MolmoAct Dataset formulated as Action Reasoning data. | https://huggingface.co/datasets/allenai/MolmoAct-Midtraining-Mixture |
| Model | Use Case | Description | Checkpoint Path |
|---|---|---|---|
| MolmoAct-7B-D | Fine-tuning | Best/demo MolmoAct; adapt to real robots by fine-tuning on your datasets. | https://huggingface.co/allenai/MolmoAct-7B-D-0812 |
| MolmoAct-7B-O | Fine-tuning | Most open MolmoAct; adapt to real robots by fine-tuning on your datasets. | https://huggingface.co/allenai/MolmoAct-7B-O-0812 |
| MolmoAct-7B-D-Pretrain | Inference | Checkpoint to replicate zero-shot results on SimplerEnv (Google Robot). | https://huggingface.co/allenai/MolmoAct-7B-D-Pretrain-0812 |
| MolmoAct-7B-D-Pretrain-RT-1 | Inference | Checkpoint to replicate RT-1 fine-tuned results on SimplerEnv (Google Robot). | https://huggingface.co/allenai/MolmoAct-7B-D-Pretrain-RT-1-0812 |
We provide the Dockerfile used to build the Docker image in which we ran all of our training experiments. We strongly recommend building the same image yourself and running training inside it.
If you prefer to set up the environment yourself, first install Python 3.11, then install PyTorch following the instructions for your operating system.
In either case, next go to your working molmoact folder and run:
git clone https://github.com/allenai/molmoact.git
cd molmoact
pip install -e .[all]
We provide instructions both on how to train your own MolmoAct and on how to replicate all of our training stages:
Installation for Data Processing
Command
git clone https://github.com/DepthAnything/Depth-Anything-V2.git
cd Depth-Anything-V2 &&
pip install -r requirements.txt &&
pip uninstall -y opencv-python opencv-python-headless opencv-contrib-python &&
pip install opencv-python-headless --no-cache-dir &&
pip install lerobot==0.3.3
Download Depth Anything V2 Checkpoint
Command
wget https://huggingface.co/allenai/MolmoAct-7B-D-0812/resolve/main/depth_anything_v2_vitb.pth
mv <path/to/depth_anything_v2_vitb.pth> <path/to/Depth-Anything-V2/checkpoints>
Download MolmoAct VQVAE Checkpoint
Command
wget https://huggingface.co/allenai/MolmoAct-7B-D-0812/resolve/main/vae-final.pt
To preprocess a dataset in the conventional LeRobot format into Action Reasoning Data, first run the preprocessing command:
Command
export DEPTH_CHECKPOINT_DIR="<path/to/Depth-Anything-V2/checkpoints>"
export VQVAE_MODEL_PATH="<path/to/vqvae.pt>"
python preprocess/action_reasoning_data.py \
--dataset-path <lerobot/repo_id> \
--output-path <path/to/processed_dataset> \
--depth-encoder vitb \
--line-length 5 \
--process-actions \
--action-bins 256 \
--action-chunk-size 8
After data processing finishes, you should have a folder /path/to/processed_dataset containing all the processed data and a dataset_statistics.json. Then, replace finetune:/path/to/processed_dataset with the actual path in launch_scripts/train_multitask_model.py. To run the training, the following script is provided, which should work well on 8 A100/H100 GPUs. Adjust the global batch size to your GPU setup to avoid OOM.
WANDB_API_KEY=<your_wandb_api_key> torchrun \
--nnodes=1 --nproc-per-node=8 \
--node_rank="${RANK}" --master_addr="${ADDR}" --master_port="${PORT}" \
launch_scripts/train_multitask_model.py \
robot-finetune allenai/MolmoAct-7B-D-0812 \
--wandb.name=<name> --wandb.entity=<entity> --wandb.project=<project> \
--norm_stats_path /path/to/dataset_statistics.json \
--save_folder=checkpoints/<exp_name> \
--save_overwrite \
--duration 10000 \
--ft_embedding all \
--depth_tokens \
--global_batch_size 16 \
--lr_connector 5e-4 \
--lr_vit 5e-4 \
--lr_llm 5e-4 \
--save_interval 2000 \
--save_num_checkpoints_to_keep 5 \
--max_images 2 \
--lora_enable --lora_rank 32 --lora_alpha 16 --lora_dropout 0.0 \
--img_aug
Note that during fine-tuning, we disable high-resolution crops by default, downsizing all training images to smaller than 378x378, since none of our training stages enables this feature. For more details on these flags, please refer to section 4.2 Training Replication.
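Before launching the fine-tuning command above, it can help to sanity-check the processed dataset folder and the dataset_statistics.json you pass via --norm_stats_path. The snippet below is a minimal sketch using only the standard library; the exact nesting inside dataset_statistics.json may differ from what is assumed here, but the top-level keys are typically the dataset names you would later use as the unnormalization key at inference time.

```python
import json
from pathlib import Path

# Hypothetical path: replace with your actual processed dataset folder.
processed_dir = Path("/path/to/processed_dataset")
stats_path = processed_dir / "dataset_statistics.json"

assert processed_dir.is_dir(), f"missing processed dataset folder: {processed_dir}"
assert stats_path.is_file(), f"missing statistics file: {stats_path}"

with stats_path.open() as f:
    stats = json.load(f)

# Print each top-level entry and the fields it contains.
for key, value in stats.items():
    fields = list(value) if isinstance(value, dict) else type(value).__name__
    print(f"{key}: {fields}")
```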
If you perform LoRA fine-tuning instead of full-parameter fine-tuning, which is what we did for most of our post-training experiments, you need to merge the adapters into the original model weights. When training with LoRA, our checkpointer saves sharded checkpoints and LoRA adapters (named stepXXX and stepXXX-lora, respectively). The sharded checkpoints contain both the base model parameters and the LoRA adapters, and are only used to resume training. The LoRA adapters are what you merge into the base model for inference. To merge the base model with the LoRA adapters, run the following script:
python3 -m scripts.merge_lora \
--base_dir /path/to/base_model \
--lora_dir /path/to/checkpoints/exp_name/stepXXX-lora \
--output_dir /path/to/checkpoints/exp_name/stepXXX-merge
We perform most of MolmoAct's inference using Hugging Face Transformers and vLLM. To enable this, we first need to wrap our model in the Hugging Face Transformers format. We provide a script for this conversion:
python3 -m olmo.hf_model.molmoact.convert_molmoact_to_hf \
--checkpoint_dir /path/to/checkpoints/exp_name/stepXXX-merge \
--output_dir /path/to/checkpoints/exp_name/stepXXX-hf \
--style demo \
--norm_stats_path /path/to/dataset_statistics.json
Here, --style is just a flag selecting the system prompt and should be left at demo by default; there are other options, but we won't use them. Optionally, you can pass the path to dataset_statistics.json through --norm_stats_path to overwrite the existing dataset statistics, or to add them if they don't exist.
Note that --checkpoint_dir must point to an unsharded checkpoint. In the LoRA fine-tuning case, this is usually the merged checkpoint, which is also unsharded. In the full fine-tuning case (as in pre-training and mid-training), there is no merged checkpoint, so point --checkpoint_dir at something like /path/to/checkpoints/exp_name/stepXXX-unsharded instead. If you only have sharded checkpoints, we also provide a script to convert them to unsharded ones:
python3 -m scripts.convert_to_unsharded \
--checkpoint_dir /path/to/checkpoints/exp_name/stepXXX \
--output_dir /path/to/checkpoints/exp_name/stepXXX-unsharded
Once we have the converted checkpoint, we can follow the example script olmo/hf_model/molmoact/test_molmoact.py to run inference:
python3 olmo/hf_model/molmoact/test_molmoact.py \
--checkpoint_dir /path/to/checkpoints/exp_name/stepXXX-hf \
--images /path/to/img1 /path/to/img2 \
--instruction "task instruction" \
--unnorm_key unnorm_key
For vLLM inference, follow the example olmo/vllm/molmoact/test_molmoact.py:
python3 -m olmo.vllm.molmoact.test_molmoact \
--checkpoint_dir /path/to/checkpoints/exp_name/stepXXX-hf \
--images /path/to/img1 /path/to/img2 \
--instruction "task instruction" \
--unnorm_key unnorm_key
Inference can be run in the provided Docker image, though it only requires the following dependencies:
pip install einops torchvision accelerate vllm==0.8.5 transformers==4.52
You can also refer to MolmoAct Inference Setup.
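If you prefer to call a converted checkpoint directly from Python rather than through the provided test scripts, the sketch below shows one possible way using the generic Transformers auto classes with trust_remote_code. Treat it as an illustrative sketch only: the exact processor arguments, prompt template, and the helpers for decoding actions, depth tokens, and the visual trace are defined by the checkpoint's remote code, so olmo/hf_model/molmoact/test_molmoact.py remains the authoritative reference.

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

# Hypothetical checkpoint path; a converted stepXXX-hf folder or an HF repo id.
ckpt = "/path/to/checkpoints/exp_name/stepXXX-hf"

processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    ckpt, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("/path/to/img1").convert("RGB")
# The actual prompt template (system prompt, task phrasing) comes from the
# checkpoint's remote code; this plain instruction is only a placeholder.
inputs = processor(images=[image], text="task instruction", return_tensors="pt").to(model.device)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# Decoding of actions/depth/trace may require the checkpoint's own helpers;
# here we just print the raw generated text.
print(processor.batch_decode(output_ids, skip_special_tokens=False)[0])
```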
Besides robot actions, MolmoAct's inference also produces depth tokens and a visual trace. The visual trace consists of 2D coordinates whose values are integers in the range [0, 256); for visualization, you should scale these coordinates to the actual image size. To visualize depth from the predicted depth tokens, we need the decoder of the VQVAE we trained. We provide the following script to run visualization:
python3 scripts/reconstruct_from_tokens.py \
--ckpt_path /path/to/vae-final.pt \
--depth_tokens "<DEPTH_START><DEPTH_1>...<DEPTH_END>" \
--output_path /path/to/depth.png
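For the visual trace there is no dedicated script: scaling the predicted coordinates and drawing them takes only a few lines. The sketch below assumes you have already parsed the trace into a list of (x, y) integer pairs in [0, 256) (parsing depends on the model's output format) and overlays it on the input image with Pillow.

```python
from PIL import Image, ImageDraw

def draw_visual_trace(image_path, trace, output_path, grid=256):
    """Overlay a predicted visual trace on the original image.

    `trace` is a list of (x, y) integer pairs in [0, grid); they are scaled
    to the actual image resolution before drawing.
    """
    image = Image.open(image_path).convert("RGB")
    w, h = image.size
    # Scale grid coordinates to pixel coordinates.
    points = [(x / grid * w, y / grid * h) for x, y in trace]

    draw = ImageDraw.Draw(image)
    draw.line(points, fill=(255, 0, 0), width=3)  # trace polyline
    for px, py in points:  # waypoint markers
        draw.ellipse([px - 4, py - 4, px + 4, py + 4], fill=(0, 255, 0))
    image.save(output_path)

# Example with a hypothetical parsed trace:
draw_visual_trace("/path/to/img1", [(120, 200), (128, 180), (140, 150)], "trace_vis.png")
```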
If you want to train your own VQVAE for depth estimation, please follow Aurora-perception. We use a batch size of 64 and a learning rate of 1e-3, while all other hyperparameters stay the same. The other difference is that we use Depth-Anything-V2 instead of its prior version. Note that we train our VQVAE on depth maps generated for the BC-Z, BridgeData V2, and RT-1 subsets published in our pre-training data mixture.
MolmoAct pulls most datasets via Hugging Face Datasets; those files go into the Hugging Face cache. A few extra assets are stored under a separate root defined by MOLMOACT_DATA_DIR.
Set both paths (example: store everything under /data/molmoact):
export MOLMOACT_DATA_DIR=/data/molmoact
export HF_HOME=/data/molmoact/huggingface
HF_HOME controls the Hugging Face cache location. See the official docs on managing the cache here.
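As a quick sanity check that both locations are picked up, you can print the resolved paths from Python; huggingface_hub exposes the hub cache root it will use, and MOLMOACT_DATA_DIR is read straight from the environment:

```python
import os
from huggingface_hub.constants import HF_HUB_CACHE  # resolved from HF_HOME

print("MOLMOACT_DATA_DIR:", os.environ.get("MOLMOACT_DATA_DIR", "<not set>"))
print("Hugging Face hub cache:", HF_HUB_CACHE)
```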
You can download our robot datasets in several ways, as shown below:
All robot datasets:
python3 scripts/download_robot_data.py all --n_proc 16
Specific training stage:
python3 scripts/download_robot_data.py <stage> --n_proc 16
Use one of: pretrain, midtrain, libero.
Single robot dataset class by name:
python3 scripts/download_robot_data.py MolmoActDatasetHomePrimary --n_proc 16
All robot dataset class names are listed at the end of olmo/data/robot_datasets.py.
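If you prefer to list the available class names programmatically instead of reading the file, a quick introspection sketch like the one below can work, under the assumption that the dataset classes are defined at the top level of olmo/data/robot_datasets.py (adjust if the module is organized differently).

```python
import inspect
from olmo.data import robot_datasets

# Print every class defined in the robot_datasets module itself
# (skipping classes that are merely imported into it).
for name, obj in inspect.getmembers(robot_datasets, inspect.isclass):
    if obj.__module__ == robot_datasets.__name__:
        print(name)
```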
These are the Multimodal Web Data used during MolmoAct pre-training.
All web datasets (after setting MOLMOACT_DATA_DIR and HF_HOME):
python3 scripts/download_data.py all --n_proc 16
Single web dataset (example):
python3 scripts/download_data.py ChartQa --n_proc 16
- Pixmo datasets fetch images from URLs. The script does this automatically but may take a long time; a full fresh download can take up to a day.
- --n_proc controls parallelism. More processes can speed things up but also increase the chance of rate limiting.
- Downloads are resumable if you cancel or hit an error.
- Some datasets (InfoQa, Scene-Text) require manual downloads. The scripts will raise an error if those files are missing.
- The Android control dataset needs extra dependencies because it parses original TFRecords.
- We recommend ensuring the data is downloaded and then setting the environment variable HF_DATASETS_OFFLINE=1 during training, so the nodes don't flood HF with requests as they all initialize and potentially get rate limited.
Command
WANDB_API_KEY=<your_wandb_api_key> torchrun \
--nnodes=32 --nproc-per-node=8 \
--node_rank="${RANK}" --master_addr="${ADDR}" --master_port="${PORT}" \
launch_scripts/train_multitask_model.py \
molmoact-pretrain allenai/MolmoAct-7B-D-Captioner-0812 \
--wandb.name=<name> --wandb.entity=<entity> --wandb.project=<project> \
--save_folder=checkpoints/<exp_name> \
--save_overwrite \
--duration 100000 \
--ft_embedding all \
--depth_tokens \
--global_batch_size 512 \
--lr_connector 1e-5 \
--lr_vit 1e-5 \
--lr_llm 2e-5 \
--save_interval 20000 \
--save_num_checkpoints_to_keep 5 \
--save_final_unsharded_checkpoint
Fill these placeholders
- WANDB_API_KEY=<your_wandb_api_key> → your Weights & Biases (W&B) API key.
- --wandb.name=<name> --wandb.entity=<entity> --wandb.project=<project> → your W&B run info.
- --save_folder=checkpoints/<exp_name> → folder name for checkpoints (use a unique experiment name).
W&B logging
- Offline logging: WANDB_MODE=offline.
- Turn off wandb: replace --wandb.name=<name> --wandb.entity=<entity> --wandb.project=<project> with --wandb=null.
Checkpoints & formats
- By default all intermediate checkpoints are sharded; only the final checkpoint is also saved unsharded (--save_final_unsharded_checkpoint).
- To save unsharded copies for every checkpoint, add --save_intermediate_unsharded_checkpoint.
Cluster launch variables
- Set these per your cluster/launcher: --node_rank="${RANK}" --master_addr="${ADDR}" --master_port="${PORT}".
Notes
- Avoid --pin_memory for large datasets; it can cause OOM during loading.
Inference
- Please refer to section 4.1.4 Inference.
Command
WANDB_API_KEY=<your_wandb_api_key> torchrun --nnodes=16 --nproc-per-node=8 \
--node_rank="${RANK}" --master_addr="${ADDR}" --master_port="${PORT}" \
launch_scripts/train_multitask_model.py \
molmoact-midtrain allenai/MolmoAct-7B-D-Pretrain-0812 \
--wandb.name=<name> --wandb.entity=<entity> --wandb.project=<project> \
--save_folder=checkpoints/<exp_name> \
--save_overwrite \
--duration 50000 \
--ft_embedding all \
--depth_tokens \
--global_batch_size 256 \
--lr_connector 5e-6 \
--lr_vit 5e-6 \
--lr_llm 1e-5 \
--save_interval 10000 \
--save_num_checkpoints_to_keep 5 \
--save_final_unsharded_checkpoint \
--max_images 2
What’s different from pre-training
- Base checkpoint: allenai/MolmoAct-7B-D-Pretrain-0812.
- Hyperparameters change (shorter --duration, smaller --global_batch_size, lower LRs).
- --max_images 2 indicates each training example uses two images.
- All other setup (W&B, saving, cluster vars) follows the pre-training instructions.
Inference
- Please refer to section 4.1.4 Inference.
Command
WANDB_API_KEY=<your_wandb_api_key> torchrun --nnodes=8 --nproc-per-node=8 \
--node_rank="${RANK}" --master_addr="${ADDR}" --master_port="${PORT}" \
launch_scripts/train_multitask_model.py \
libero-<task_suite> allenai/MolmoAct-7B-D-0812 \
--wandb.name=<name> --wandb.entity=<entity> --wandb.project=<project> \
--save_folder=checkpoints/<exp_name> \
--save_overwrite \
--duration <steps> \
--ft_embedding all \
--depth_tokens \
--global_batch_size 128 \
--lr_connector 5e-4 \
--lr_vit 5e-4 \
--lr_llm 5e-4 \
--save_interval 10000 \
--save_num_checkpoints_to_keep 5 \
--max_images 2 \
--lora_enable --lora_rank 32 --lora_alpha 16 --lora_dropout 0.0 \
--img_aug
What’s different here
- Base checkpoint: allenai/MolmoAct-7B-D-0812.
- Uses LoRA fine-tuning (--lora_enable ...) and image augmentation (--img_aug).
- --max_images 2 again indicates two images per input.
- Choose --duration <steps> based on the LIBERO task suite.
Choose <task_suite> and <steps>
| <task_suite> | <steps> |
|---|---|
| spatial | 50000 |
| object | 50000 |
| goal | 40000 |
| long | 80000 |
Reminder
- Follow the pre-training notes for W&B setup, checkpointing behavior, and cluster launch variables; those apply here as well.
Merge LoRA & Running Inference
- Please refer to sections 4.1.3 Merge LoRA and 4.1.4 Inference.
We release the SimplerEnv evaluation code for MolmoAct at allenai/SimplerEnv. Please first install the dependencies for the SimplerEnv evaluation environment by following allenai/SimplerEnv, along with the dependencies for the MolmoAct Inference Setup. After installing all the dependencies, run the evaluation scripts:
# under the project dir of SimplerEnv/
bash scripts/molmoact_pick_coke_can_visual_matching.sh
bash scripts/molmoact_pick_coke_can_variant_agg.sh
bash scripts/molmoact_move_near_visual_matching.sh
bash scripts/molmoact_move_near_variant_agg.sh
bash scripts/molmoact_drawer_visual_matching.sh
bash scripts/molmoact_drawer_variant_agg.sh
For LIBERO evaluation, run the following under the molmoact project directory:
# under the project dir of molmoact/
cd experiments/LIBERO
pip install -e .
pip install einops torchvision accelerate
pip install transformers==4.52.1
pip install vllm==0.8.5
export VLLM_WORKER_MULTIPROC_METHOD=spawn
cd ../libero
# to replicate molmoact results with vllm
python run_libero_eval_vllm.py --task spatial --checkpoint allenai/MolmoAct-7B-D-LIBERO-Spatial-0812
python run_libero_eval_vllm.py --task object --checkpoint allenai/MolmoAct-7B-D-LIBERO-Object-0812
python run_libero_eval_vllm.py --task goal --checkpoint allenai/MolmoAct-7B-D-LIBERO-Goal-0812
python run_libero_eval_vllm.py --task 10 --checkpoint allenai/MolmoAct-7B-D-LIBERO-Long-0812
# we also provide the code to run libero with only huggingface
python run_libero_eval.py --task spatial --checkpoint allenai/MolmoAct-7B-D-LIBERO-Spatial-0812
python run_libero_eval.py --task object --checkpoint allenai/MolmoAct-7B-D-LIBERO-Object-0812
python run_libero_eval.py --task goal --checkpoint allenai/MolmoAct-7B-D-LIBERO-Goal-0812
python run_libero_eval.py --task 10 --checkpoint allenai/MolmoAct-7B-D-LIBERO-Long-0812
Content for the real-world evaluation is coming soon.
MolmoAct is licensed under Apache 2.0 and intended for research and educational use.
For more information, please see our Responsible Use Guidelines.
MolmoAct can display a visual reasoning trace of its intended actions before execution, enabling proactive auditing and adjustment of behavior. The model’s action space is bounded within the data provided, and compliance is built in to limit excessive force when resistance is detected. Always follow hardware manufacturer guidelines and operate in a safely configured environment.
@misc{molmoact2025,
title={MolmoAct: Action Reasoning Models that can Reason in Space},
author={Jason Lee and Jiafei Duan and Haoquan Fang and Yuquan Deng and Shuo Liu and Boyang Li and Bohan Fang and Jieyu Zhang and Yi Ru Wang and Sangho Lee and Winson Han and Wilbert Pumacay and Angelica Wu and Rose Hendrix and Karen Farley and Eli VanderBilt and Ali Farhadi and Dieter Fox and Ranjay Krishna},
year={2025},
eprint={2508.07917},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2508.07917}
}
For questions, collaborations, or support, please contact:
{haoquanf,jasonl,jiafeid}@allenai.org
Found a bug or have a feature request? Please open a GitHub issue.