TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos (ACL 2025 Main)
Videos are unique in their integration of temporal elements, including camera, scene, action, and attribute, along with their dynamic relationships over time. However, existing benchmarks for video understanding often treat these properties separately or narrowly focus on specific aspects, overlooking the holistic nature of video content. To address this, we introduce TUNA, a temporal-oriented benchmark for fine-grained understanding on dense dynamic videos, with two complementary tasks: captioning and QA. Our TUNA features diverse video scenarios and dynamics, assisted by interpretable and robust evaluation criteria. We evaluate several leading models on our benchmark, providing fine-grained performance assessments across various dimensions. This evaluation reveals key challenges in video temporal understanding, such as limited action description, inadequate multi-subject understanding, and insensitivity to camera motion, offering valuable insights for improving video understanding models.
cd VLMEvalKit
pip install -e .
You can run inference for TUNA-CAP using VLMEvalKit:
bash scripts/infer_tuna_cap.sh
## Equivalent to:
# cd VLMEvalKit
# bash scripts_tuna/infer_tuna_cap.sh
This will generate an output file such as VLMEvalKit/outputs/MODEL/T2025xxx/MODEL_TUNA_CAP_1fps.xlsx.
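If you run inference several times, it can be handy to pick up the most recent result file automatically. The following is a minimal sketch, not part of the repository: the helper name `latest_tuna_cap_output` is hypothetical, and it assumes that run directories start with `T` and that their names sort chronologically when compared as strings.

```python
from pathlib import Path
from typing import Optional


def latest_tuna_cap_output(outputs_root: str, model: str) -> Optional[Path]:
    """Return the newest <model>_TUNA_CAP_1fps.xlsx under outputs_root/<model>/, or None.

    Hypothetical helper. Assumes the layout outputs/MODEL/T.../MODEL_TUNA_CAP_1fps.xlsx
    shown above, and that run directory names sort chronologically as strings.
    """
    pattern = f"{model}/T*/{model}_TUNA_CAP_1fps.xlsx"
    # Sort by run directory name so the last candidate is the latest run.
    candidates = sorted(Path(outputs_root).glob(pattern), key=lambda p: p.parent.name)
    return candidates[-1] if candidates else None
```

For example, `latest_tuna_cap_output("VLMEvalKit/outputs", "MODEL")` would return the path of the latest matching file, or `None` if no run has produced one yet.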
To convert the above inference result into the standard JSON format required for submission and evaluation:
python evaluation/infer_result_to_submission.py
You can refer to the example submission format here: evaluation/example/MyModel_TUNA_CAP_1fps_submission.json.
Finally, evaluate TUNA-CAP and obtain the score:
bash scripts/eval_tuna_cap.sh
Note: Some evaluation steps are stochastic, primarily because of the variability of API-based models and randomness in text generation, so the scores for individual instances may differ slightly between evaluation runs. However, our evaluation framework ensures that the average performance across the entire TUNA-CAP dataset remains robust and statistically significant.
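If you want to gauge this run-to-run variation for your own model, one option is to repeat the evaluation a few times and summarize the dataset-level scores. The snippet below is only a sketch under that assumption: `summarize_runs` is a hypothetical helper, not part of the evaluation scripts, and you would need to collect the per-run scores yourself.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of scores from repeated evaluation runs.

    Hypothetical helper: `scores` holds one dataset-level score per run.
    With a single run the spread is reported as 0.0.
    """
    if not scores:
        raise ValueError("need at least one score")
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread
```

A small standard deviation relative to the gap between two models suggests that the comparison is not an artifact of evaluation randomness.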
To evaluate TUNA-MCQ, simply run:
cd VLMEvalKit
bash scripts_tuna/eval_tuna_mcq.sh
The code is largely based on VLMEvalKit. We thank the authors for their great work.
If you find our work helpful, please consider citing it:
@article{kong2025tuna,
  title={TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos},
  author={Kong, Fanheng and Zhang, Jingyuan and Zhang, Hongzhi and Feng, Shi and Wang, Daling and Yu, Linhao and Ji, Xingguang and Tian, Yu and W., Victoria and Zhang, Fuzheng},
  journal={arXiv preprint arXiv:2505.20124},
  year={2025}
}