A-Bench: Are LMMs Masters at Evaluating AI-generated Images?

1Shanghai Jiao Tong University 2Nanyang Technological University
*Equal Contribution. #Corresponding Author.

Abstract

How to accurately and efficiently assess AI-generated images (AIGIs) remains a critical challenge for generative models. Given the high costs and extensive time commitments required for user studies, many researchers have turned to large multi-modal models (LMMs) as AIGI evaluators, yet the precision and validity of these evaluations remain questionable. Furthermore, traditional benchmarks mostly use natural-captured content rather than AIGIs to test the abilities of LMMs, leaving a noticeable gap for AIGI evaluation. Therefore, we introduce A-Bench in this paper, a benchmark designed to diagnose whether LMMs are masters at evaluating AIGIs. Specifically, A-Bench is organized under two key principles: 1) emphasizing both high-level semantic understanding and low-level visual quality perception to address the intricate demands of AIGIs, and 2) employing various generative models for AIGI creation and various LMMs for evaluation, which ensures a comprehensive validation scope. Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with question-answer sets annotated by human experts, and 18 leading LMMs are tested on the benchmark. We hope that A-Bench will significantly enhance the evaluation process and promote the generation quality of AIGIs. The benchmark is available at https://github.com/Q-Future/A-Bench.

Technical Report

Overview

Two key diagnostic subsets are defined: A-Bench-P1, which targets high-level semantic understanding, and A-Bench-P2, which targets low-level quality perception.

A-Bench Construction

For high-level semantic understanding, A-Bench-P1 targets three critical areas: Basic Recognition, Bag-of-Words Pitfalls Discrimination, and Outside Knowledge Realization, which progressively test an LMM's capability in AIGI semantic understanding, moving from simple to complex prompt-related content. For low-level quality perception, A-Bench-P2 concentrates on Technical Quality Perception, Aesthetic Quality Evaluation, and Generative Distortion Assessment, which together cover both common quality issues and AIGI-specific quality problems. Specifically, a comprehensive dataset of 2,864 AIGIs sourced from various T2I models is compiled: 1,408 AIGIs for A-Bench-P1 and 1,456 for A-Bench-P2. Each AIGI is paired with a question-answer set annotated by human experts. We also welcome submission-based evaluation for A-Bench; a sketch of how such a submission might be scored follows.
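To make the protocol concrete, here is a minimal Python sketch of an accuracy-based evaluation loop over expert-annotated question-answer entries. The JSON field names ("image", "question", "options", "answer"), the file name abench_p1.json, and the ask_lmm callable are hypothetical assumptions for illustration only; consult the repository for the actual data format and evaluation scripts.

import json

# Minimal sketch of a submission-style evaluation loop.
# NOTE: field names and file layout below are illustrative assumptions,
# not the official A-Bench release format.

def load_abench(json_path):
    """Load the expert-annotated question-answer entries for one subset."""
    with open(json_path, "r", encoding="utf-8") as f:
        return json.load(f)

def evaluate(entries, ask_lmm):
    """Score an LMM by exact match against the expert-annotated answer.

    `ask_lmm(image_path, question, options) -> str` is a user-supplied
    callable wrapping whichever LMM is under test; it should return one
    of the option strings.
    """
    correct = 0
    for entry in entries:
        prediction = ask_lmm(entry["image"], entry["question"], entry["options"])
        correct += int(prediction.strip() == entry["answer"].strip())
    return correct / len(entries)

if __name__ == "__main__":
    # Assumed layout: one JSON file per subset (P1 / P2).
    entries = load_abench("abench_p1.json")
    # `my_lmm` stands in for any of the LMMs under test:
    # accuracy = evaluate(entries, my_lmm)
    # print(f"A-Bench-P1 accuracy: {accuracy:.3f}")

Exact-match accuracy over the expert answers is the natural headline metric for multiple-choice question-answer sets like these; per-category accuracy (e.g., over the three A-Bench-P1 areas) can be computed the same way by filtering entries before calling evaluate.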

BibTeX

@inproceedings{zhang2024abench,
    author = {Zhang, Zicheng and Wu, Haoning and Li, Chunyi and Zhou, Yingjie and Sun, Wei and Min, Xiongkuo and Chen, Zijian and Liu, Xiaohong and Lin, Weisi and Zhai, Guangtao},
    title = {A-Bench: Are LMMs Masters at Evaluating AI-generated Images?},
    booktitle = {arXiv preprint},
    year = {2024}
}