Hybrid architectures that combine a convolutional neural network (CNN) with a Transformer are an emerging trend in vision tasks, pushing the limits of representation learning. Given how the two mechanisms work, a functional combination of them is well suited to image quality assessment (IQA), which requires both local distortion perception and global quality aggregation. However, few studies have employed such an approach. This paper presents an end-to-end CNN-Transformer hybrid model for full-reference IQA named the convolved quality transformer (CQT). The CQT is inspired by human perceptual characteristics and is designed to unify the advantages of CNNs and Transformers for estimating a quality score. In CQT, convolutional layers specialize in extracting local distortion features, whereas the Transformer aggregates those features to estimate holistic quality through long-range interactions among them. This sequence of operations is repeated on multi-scale feature maps to capture the quality representation sensitively. To verify that the submodules in CQT perform their intended roles, we analyze in depth, using attention visualization, how local distortions interact to infer global quality. Finally, perceptually pooled information from the stage-wise feature embeddings yields the final quality level. Experimental results demonstrate that the proposed model outperforms previous data-driven approaches and generalizes well across standard datasets.
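The local-then-global flow described above can be illustrated with a toy sketch. This is not the CQT implementation: the abstract gives no layer details, so everything here is an assumption made for illustration. Hand-crafted patch statistics stand in for the learned convolutional features, the attention is a single head without learned projections, a plain average replaces the perceptual pooling, and all function names (`local_tokens`, `self_attention`, `toy_distortion_score`) are hypothetical. The output is a distortion score (higher means more distorted), not a calibrated quality score.

```python
import math

def avg_pool2(img):
    """Halve the resolution by averaging non-overlapping 2x2 blocks."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2*i][2*j] + img[2*i][2*j+1]
              + img[2*i+1][2*j] + img[2*i+1][2*j+1]) / 4.0
             for j in range(w)] for i in range(h)]

def local_tokens(ref, dist, patch=2):
    """One token per non-overlapping patch: [mean |diff|, mean diff^2].
    Assumption: a stand-in for learned convolutional distortion features."""
    tokens = []
    for i in range(0, len(ref) - patch + 1, patch):
        for j in range(0, len(ref[0]) - patch + 1, patch):
            diffs = [ref[y][x] - dist[y][x]
                     for y in range(i, i + patch)
                     for x in range(j, j + patch)]
            n = len(diffs)
            tokens.append([sum(abs(d) for d in diffs) / n,
                           sum(d * d for d in diffs) / n])
    return tokens

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product attention with Q = K = V = tokens.
    Assumption: no learned projections, unlike a real Transformer layer."""
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wi * v[c] for wi, v in zip(w, tokens))
                        for c in range(d)])
    return outputs, weights

def toy_distortion_score(ref, dist, num_scales=2):
    """Repeat local extraction + global aggregation per scale, then average."""
    per_scale = []
    for _ in range(num_scales):
        attended, _ = self_attention(local_tokens(ref, dist))
        per_scale.append(sum(t[0] for t in attended) / len(attended))
        ref, dist = avg_pool2(ref), avg_pool2(dist)
    return sum(per_scale) / len(per_scale)

# Usage: an 8x8 reference image and a copy with one corrupted pixel.
ref = [[float((x + y) % 4) for x in range(8)] for y in range(8)]
dist = [row[:] for row in ref]
dist[0][0] += 2.0
score_same = toy_distortion_score(ref, ref)
score_dist = toy_distortion_score(ref, dist)
```

The attention weights returned by `self_attention` are the kind of quantity an attention visualization would inspect: each row shows how strongly one patch attends to every other patch when the global estimate is formed.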
Bibliographical note
Publisher Copyright:
© 2013 IEEE.
All Science Journal Classification (ASJC) codes
- General Engineering
- General Materials Science
- General Computer Science