EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation

Published in AAAI 2026 (CCF-A), 2024

EvalMuse-40K is a large-scale, fine-grained benchmark with comprehensive human annotations designed to evaluate text-to-image generation models.

Key Contributions:

  • A 40K-scale dataset with fine-grained human annotations covering multiple quality dimensions.
  • A fine-tuned BLIP model for human-aligned, fine-grained scoring.
  • A fair ranking of 20+ state-of-the-art text-to-image (T2I) models.

My Contribution: I was responsible for the structural image quality evaluation task and contributed to the design of the BLIP fine-tuning strategy.

Venue: AAAI 2026 (CCF-A)

[arXiv]

Recommended citation: Han, S., Fan, H., Fu, J., Li, L., Li, T., et al. (2024). EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation. AAAI 2026.