TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos (ACL 2025 Main)

👀 Overall

Videos are unique in integrating temporal elements (camera, scene, action, and attribute) along with their dynamic relationships over time. However, existing video understanding benchmarks often treat these properties separately or focus narrowly on specific aspects, overlooking the holistic nature of video content. To address this, we introduce TUNA, a temporal-oriented benchmark for fine-grained understanding of dense dynamic videos, with two complementary tasks: captioning and QA. TUNA features diverse video scenarios and dynamics, supported by interpretable and robust evaluation criteria. We evaluate several leading models on the benchmark and provide fine-grained performance assessments across various dimensions. The evaluation reveals key challenges in video temporal understanding, such as limited action description, inadequate multi-subject understanding, and insensitivity to camera motion, offering insights for improving video understanding models.

🔍 Dataset

🏆 Leaderboard

here

⚖️ Evaluation

Installation

cd VLMEvalKit
pip install -e .
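
After the editable install, a quick import check can confirm that the package is visible to the current Python environment. The snippet below is a minimal sketch, assuming the install exposes the import name `vlmeval` (as upstream VLMEvalKit does); the helper name `is_installed` is hypothetical and not part of this repository.

```python
import importlib.util


def is_installed(package: str = "vlmeval") -> bool:
    """Return True if `package` can be located on the current Python path.

    The default import name `vlmeval` is an assumption based on upstream
    VLMEvalKit; adjust it if the bundled copy uses a different name.
    """
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    print("vlmeval importable:", is_installed())
```

If this prints `False`, re-running `pip install -e .` from inside the `VLMEvalKit` directory is the first thing to try.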

TUNA-CAP (Video Captioning)

You can run inference for TUNA-CAP using VLMEvalKit:

bash scripts/infer_tuna_cap.sh
# Equivalent to:
# cd VLMEvalKit
# bash scripts_tuna/infer_tuna_cap.sh

This will generate an output file, such as: VLMEvalKit/outputs/MODEL/T2025xxx/MODEL_TUNA_CAP_1fps.xlsx.
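
Because each inference run writes into its own run directory, it can be handy to locate the most recent result programmatically before converting it. The following is a minimal sketch, assuming the layout shown above (`outputs/MODEL/<run directory>/MODEL_TUNA_CAP_1fps.xlsx`); the function name `latest_cap_output` is hypothetical, and "most recent" is decided by file modification time rather than by parsing the run directory name.

```python
from pathlib import Path
from typing import Optional


def latest_cap_output(outputs_root: str, model: str) -> Optional[Path]:
    """Return the newest TUNA-CAP result file for `model`, or None if absent.

    Assumes the layout <outputs_root>/<model>/<run dir>/<model>_TUNA_CAP_1fps.xlsx,
    matching the example path in this README.
    """
    model_dir = Path(outputs_root) / model
    if not model_dir.is_dir():
        return None
    # One wildcard level for the per-run directory.
    candidates = list(model_dir.glob(f"*/{model}_TUNA_CAP_1fps.xlsx"))
    if not candidates:
        return None
    # Pick the file that was modified last.
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    # "MyModel" is a placeholder model name.
    print(latest_cap_output("VLMEvalKit/outputs", "MyModel"))
```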

To convert the above inference result into the standard JSON format required for submission and evaluation:

python evaluation/infer_result_to_submission.py

You can refer to the example submission format here: evaluation/example/MyModel_TUNA_CAP_1fps_submission.json.

Finally, evaluate TUNA-CAP and obtain the score:

bash scripts/eval_tuna_cap.sh

Note: Some evaluation steps are stochastic, mainly because of the variability of API-based models and randomness in text generation, so the scores for individual instances may differ slightly between evaluation runs. However, our evaluation framework ensures that the average performance across the entire TUNA-CAP dataset remains robust and statistically significant.
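
To see how much the dataset-level score moves between runs, one option is to evaluate several times and compare the per-run averages. The snippet below is a minimal sketch of that check, not part of the TUNA evaluation code: the function name `summarize_runs` and the input format (one list of per-instance scores per run) are assumptions, and the numbers in the usage example are made up for illustration.

```python
from statistics import mean, stdev
from typing import Dict, List, Union


def summarize_runs(runs: List[List[float]]) -> Dict[str, Union[float, List[float]]]:
    """Summarize dataset-level scores over repeated evaluation runs.

    `runs` holds one list of per-instance scores per evaluation run.
    Returns each run's average, the mean of those averages, and their
    sample standard deviation (0.0 when only one run is given).
    """
    run_means = [mean(scores) for scores in runs]
    spread = stdev(run_means) if len(run_means) > 1 else 0.0
    return {"run_means": run_means, "mean": mean(run_means), "std": spread}


if __name__ == "__main__":
    # Illustrative scores only; these are not TUNA results.
    example_runs = [[1.0, 3.0], [2.0, 4.0]]
    print(summarize_runs(example_runs))
```

A small standard deviation across runs indicates that instance-level fluctuations largely cancel out in the dataset average.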

TUNA-MCQ (Video Multi-Choice QA)

To evaluate TUNA-MCQ, simply run:

cd VLMEvalKit
bash scripts_tuna/eval_tuna_mcq.sh

Acknowledgments

The code is largely based on VLMEvalKit. We thank the authors for their great work.

📋 Citation

If you find our work helpful, please consider citing it:

@article{kong2025tuna,
  title={TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos},
  author={Kong, Fanheng and Zhang, Jingyuan and Zhang, Hongzhi and Feng, Shi and Wang, Daling and Yu, Linhao and Ji, Xingguang and Tian, Yu and W., Victoria and Zhang, Fuzheng},
  journal={arXiv preprint arXiv:2505.20124},
  year={2025}
}
