FiVE-Bench (ICCV 2025)
Minghan Li1*, Chenxi Xie2*, Yichen Wu13, Lei Zhang2, Mengyu Wang1†
1Harvard University 2The Hong Kong Polytechnic University 3City University of Hong Kong
*Equal contribution †Corresponding Author
💜 Leaderboard (coming soon) | 💻 GitHub | 🤗 Hugging Face
📝 Project Page | 📰 Paper | 🎥 Video Demo
- [🔜] Add leaderboard support
- [🔜] Add `Wan-Edit` demo page on HF
- [✅ Aug-26-2025] Fix two issues: `mp4_to_frames_ffmpeg` and `skip_timestep=17`
- [✅ Aug-05-2025] Release `Wan-Edit` implementation
- [✅ Aug-05-2025] Release `Pyramid-Edit` implementation
- [✅ Aug-02-2025] Add Wan-Edit results to HF for eval demo
- [✅ Aug-02-2025] Evaluation code released
- [✅ Mar-31-2025] Dataset uploaded to Hugging Face
- FiVE-Bench Overview
- Running Your Model on FiVE-Bench
- Evaluate Editing Results
- Citation
- Acknowledgement
The FiVE-Bench dataset offers a rich, structured benchmark for fine-grained video editing. It includes 420 high-quality source-target prompt pairs spanning six fine-grained video editing tasks:
- Object Replacement (Rigid)
- Object Replacement (Non-Rigid)
- Color Alteration
- Material Modification
- Object Addition
- Object Removal
- Download the dataset from Hugging Face: 🔗 FiVE-Bench on Hugging Face
- Follow the instructions in the Installation Guide to download the dataset and install the evaluation code (`FiVE_Bench`).
- Place the downloaded dataset in the directory `./FiVE_Bench/data`. The data structure should look like:

📁 /path/to/code/FiVE_Bench/data
├── 📁 assets/
├── 📁 edit_prompt/
│   ├── 📄 edit1_FiVE.json
│   ├── 📄 edit2_FiVE.json
│   ├── 📄 edit3_FiVE.json
│   ├── 📄 edit4_FiVE.json
│   ├── 📄 edit5_FiVE.json
│   └── 📄 edit6_FiVE.json
├── 📄 README.md
├── 📦 bmasks.zip
├── 📁 bmasks/
│   ├── 📁 0001_bus/
│   │   ├── 🖼️ 00001.jpg
│   │   ├── 🖼️ 00002.jpg
│   │   └── 🖼️ ...
│   └── 📁 ...
├── 📦 images.zip
├── 📁 images/
│   ├── 📁 0001_bus/
│   │   ├── 🖼️ 00001.jpg
│   │   ├── 🖼️ 00002.jpg
│   │   └── 🖼️ ...
│   └── 📁 ...
├── 📦 videos.zip
└── 📁 videos/
    ├── 🎞️ 0001_bus.mp4
    ├── 🎞️ 0002_girl-dog.mp4
    └── 🎞️ ...
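After extracting the archives, it can help to sanity-check the layout before running any model. The sketch below is a minimal check under the assumptions that the zips have already been unpacked as shown above and that each top-level entry in an `edit*_FiVE.json` file corresponds to one source-target prompt pair (the exact JSON schema may differ).

```python
# Minimal sketch: verify the FiVE-Bench data layout and count prompt entries.
# Assumes the archives were extracted as in the tree above; the per-file entry
# count is only an approximation of the number of prompt pairs.
import json
from pathlib import Path

DATA_ROOT = Path("./FiVE_Bench/data")  # adjust to your checkout

def check_layout(root: Path) -> None:
    """Report missing sub-directories and the entry count of each prompt file."""
    for sub in ["edit_prompt", "bmasks", "images", "videos"]:
        status = "ok" if (root / sub).is_dir() else "MISSING"
        print(f"{sub:12s} {status}")

    for i in range(1, 7):
        prompt_file = root / "edit_prompt" / f"edit{i}_FiVE.json"
        if not prompt_file.is_file():
            print(f"edit{i}_FiVE.json MISSING")
            continue
        data = json.loads(prompt_file.read_text())
        # Count top-level entries whether the file is a list or a dict.
        n = len(data) if isinstance(data, (list, dict)) else 0
        print(f"edit{i}_FiVE.json: {n} entries")

if __name__ == "__main__":
    check_layout(DATA_ROOT)
```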
Use your video editing method to edit the FiVE-Bench videos based on the provided text prompts and generate the corresponding edited results.
Example implementations of our proposed rectified flow (RF)-based video editing methods are provided in the `models/` directory:
- **[Pyramid-Edit](models/README.md#pyramid-edit)**: Diffusion-based video editing using Pyramid-Flow architecture
- **[Wan-Edit](models/README.md#wan-edit)**: Rectified flow-based video editing with Wan2.1-T2V-1.3B model
Run Pyramid-Edit:
# Setup model
cd models/pyramid-edit && mkdir -p hf/pyramid-flow-miniflux
# Download model checkpoint to hf/ directory
bash scripts/run_FiVE.sh
Run Wan-Edit:
# Setup model
cd models/wan-edit && mkdir -p hf/Wan2.1-T2V-1.3B
# Download model checkpoint to hf/ directory
bash scripts/run_FiVE.sh
For detailed setup instructions and configuration options, see the Models Documentation.
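If you are benchmarking your own method rather than the provided ones, a generic driver loop over the six prompt files is usually enough. The sketch below is only an illustration: `edit_video`, the output layout under `./results/my_method`, and the prompt fields `source_prompt`/`target_prompt` are hypothetical placeholders, not the benchmark's required interface; adapt them to your model and to the layout expected by the evaluation script.

```python
# Sketch: driver loop for running a custom editing method on FiVE-Bench.
# The JSON schema and output directory layout below are assumptions.
import json
from pathlib import Path

DATA_ROOT = Path("./FiVE_Bench/data")
OUTPUT_ROOT = Path("./results/my_method")  # hypothetical output location

def edit_video(video_path: Path, source_prompt: str, target_prompt: str, out_dir: Path) -> None:
    """Placeholder: run your editing model and write the edited result to out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # ... your model inference goes here ...

def run_all_tasks() -> None:
    for task_id in range(1, 7):
        prompt_file = DATA_ROOT / "edit_prompt" / f"edit{task_id}_FiVE.json"
        prompts = json.loads(prompt_file.read_text())
        # Assumed schema: mapping from video name to a dict of prompt fields.
        items = prompts.items() if isinstance(prompts, dict) else enumerate(prompts)
        for name, entry in items:
            video = DATA_ROOT / "videos" / f"{name}.mp4"
            edit_video(
                video,
                entry.get("source_prompt", ""),
                entry.get("target_prompt", ""),
                OUTPUT_ROOT / f"edit{task_id}" / str(name),
            )

if __name__ == "__main__":
    run_all_tasks()
```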
Follow the Installation Guide to set up the evaluation code, then run the script below to get the evaluation results:
sh scripts/eval_FiVE.sh
Evaluation Support Elements:
- Editing Masks: Generated using SAM2 to assist in localized metric evaluation.
- Editing Instructions: Structured directives for each source-target pair to guide model behavior.
FiVE-Bench provides comprehensive evaluation through two major components: automatic metrics and the VLM-based FiVE-Acc.
The automatic metrics quantitatively measure various dimensions of video editing quality:
- Structure Preservation
- Background Preservation (PSNR, LPIPS, MSE, SSIM outside the editing mask)
- Edit Prompt–Image Consistency (CLIP similarity on full and masked images)
- Image Quality Assessment (NIQE)
- Temporal Consistency (MFS: Motion Fidelity Score)
- Runtime Efficiency
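To make the role of the SAM2 editing masks concrete, the sketch below shows one way a background-preservation metric can be restricted to pixels outside the editing mask. It is a minimal PSNR example under assumed file paths, not the released evaluation code, which also covers LPIPS, SSIM, CLIP similarity, NIQE, and MFS.

```python
# Sketch: background PSNR computed only outside the editing mask.
# Paths and the mask threshold are illustrative assumptions.
import numpy as np
from PIL import Image

def masked_background_psnr(src_frame: str, edit_frame: str, mask_path: str) -> float:
    """PSNR over pixels *outside* the editing mask (the preserved background)."""
    src = np.asarray(Image.open(src_frame).convert("RGB"), dtype=np.float64)
    edt = np.asarray(Image.open(edit_frame).convert("RGB"), dtype=np.float64)
    mask = np.asarray(Image.open(mask_path).convert("L")) > 127  # True = edited region
    background = ~mask
    mse = np.mean((src[background] - edt[background]) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(255.0 ** 2 / mse)

# Example usage (paths are illustrative):
# psnr = masked_background_psnr("data/images/0001_bus/00001.jpg",
#                               "results/0001_bus/00001.jpg",
#                               "data/bmasks/0001_bus/00001.jpg")
```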
We use a vision-language model (VLM) to automatically assess whether the intended edits are reflected in the video outputs by asking it questions about the content. For example, if the source video contains a swan and the target prompt requests a flamingo, we ask the following about the edited video:
- Yes/No Questions:
  - Is there a swan in the video?
  - Is there a flamingo in the video?

  ✅ The edit is considered successful only if the answers are "No" to the first question and "Yes" to the second.

- Multiple-choice Questions:
  - What is in the video? a) A swan b) A flamingo

  ✅ The edit is considered successful if the model selects the correct target object (e.g., b) A flamingo) and avoids selecting the original source object.
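The sketch below illustrates how such questions could be built and judged for a single edit. The exact question templates are taken from the swan/flamingo example above, but the judging logic and the idea of an `ask_vlm` call are assumptions; the released evaluation code defines the authoritative procedure.

```python
# Sketch: building the yes/no and multiple-choice questions for one edit and
# judging the VLM's raw answers. Templates follow the swan/flamingo example;
# the answer-parsing rules are simplifying assumptions.
def build_questions(source_obj: str, target_obj: str) -> dict:
    return {
        "yes_no": [
            f"Is there a {source_obj} in the video?",   # expected answer: "No"
            f"Is there a {target_obj} in the video?",   # expected answer: "Yes"
        ],
        "multiple_choice": (
            f"What is in the video? a) A {source_obj} b) A {target_obj}"  # expected: "b"
        ),
    }

def judge_edit(answers_yes_no: list[str], answer_mc: str) -> dict:
    """Decide success per question type from the VLM's raw answers."""
    yn_success = (answers_yes_no[0].strip().lower().startswith("no")
                  and answers_yes_no[1].strip().lower().startswith("yes"))
    mc_success = answer_mc.strip().lower().startswith("b")
    return {"yn": yn_success, "mc": mc_success}

# Example:
# qs = build_questions("swan", "flamingo")
# result = judge_edit(["No", "Yes"], "b) A flamingo")  # {"yn": True, "mc": True}
```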
FiVE-Acc evaluates editing success using a vision-language model (VLM) by asking content-related questions:
- YN-Acc: Yes/No question accuracy
- MC-Acc: Multiple-choice question accuracy
- U-Acc: Union accuracy – success if any question is correct
- ∩-Acc: Intersection accuracy – success only if all questions are correct
- FiVE-Acc ↑: Final score = average of all above metrics (higher is better)
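Given per-video yes/no and multiple-choice outcomes, the aggregation into the metrics above is straightforward. The sketch below is a minimal reading of those definitions (union = any question correct, intersection = all correct, FiVE-Acc = average of the four); the official scores come from the released evaluation code.

```python
# Sketch: aggregating per-video outcomes into the FiVE-Acc metrics.
# Each item in `results` is {"yn": bool, "mc": bool} for one edited video.
def five_acc(results: list[dict]) -> dict:
    n = len(results)
    yn_acc = sum(r["yn"] for r in results) / n
    mc_acc = sum(r["mc"] for r in results) / n
    u_acc = sum(r["yn"] or r["mc"] for r in results) / n    # any question correct
    i_acc = sum(r["yn"] and r["mc"] for r in results) / n   # all questions correct
    return {
        "YN-Acc": yn_acc,
        "MC-Acc": mc_acc,
        "U-Acc": u_acc,
        "∩-Acc": i_acc,
        "FiVE-Acc": (yn_acc + mc_acc + u_acc + i_acc) / 4,
    }

# Example:
# print(five_acc([{"yn": True, "mc": True}, {"yn": False, "mc": True}]))
```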
If you use FiVE-Bench in your research, please cite us:
@article{li2025five,
  title={{FiVE}: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models},
author={Li, Minghan and Xie, Chenxi and Wu, Yichen and Zhang, Lei and Wang, Mengyu},
journal={arXiv preprint arXiv:2503.13684},
year={2025}
}
Part of the code is adapted from PIE-Bench.
We thank the authors for their excellent work and for making their code publicly available.