A transformer-based deep learning model for evaluation of accessibility of image descriptions
Chapter, Peer reviewed, Conference object
Accepted version
Date
2022
Original version
https://doi.org/10.1145/3529836.3529856
Abstract
Images have become an integral part of digital and online media, where they are used for creative expression and the dissemination of knowledge. To make images accessible to the visually impaired community, adequate textual image descriptions or captions are provided, which can be read by screen readers. These descriptions may be human-authored or software-generated. However, most of the image descriptions provided are found to be generic, inadequate, and often unreliable, which makes them inaccessible. Tools, methods, and metrics exist for evaluating the quality of generated text, but almost all of them are generic and based on word similarity. Standard guidelines, such as the NCAM image accessibility guidelines, exist to help authors write accessible image descriptions. However, web content developers and authors do not seem to use them much, possibly because of a lack of knowledge and an underestimation of the importance of accessibility, coupled with the complexity of the guidelines and the difficulty of understanding them. To our knowledge, none of the existing quality evaluation techniques takes accessibility aspects into account. To address this gap, a deep learning model is proposed that measures the compliance of a given image description with ten NCAM guidelines. The model is based on the transformer, a recent and highly effective architecture in natural language processing. The experimental results confirm the effectiveness of the proposed model. This work could contribute to the growing body of research on accessible images, not only on the web but also on all digital devices.
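To illustrate what a word-similarity-based quality metric looks like, the following is a minimal sketch of a unigram-overlap F1 score between a candidate description and a reference. It is not the metric or the model from the paper: the function name `unigram_f1` and both example sentences are hypothetical, chosen only for illustration. The score depends solely on shared words, so it carries no information about compliance with any accessibility guideline.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Word-overlap F1 between a candidate description and a reference.

    Hypothetical helper for illustration; not taken from the paper.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Made-up example sentences (assumptions, not data from the paper).
reference = "a bar chart of monthly sales"
generic = "a chart of sales"

# All four candidate words appear in the six-word reference:
# precision = 4/4, recall = 4/6, so F1 = 0.8.
score = unigram_f1(generic, reference)
print(round(score, 2))
```

In this sketch the generic description scores 0.8 even though it omits the chart type and the time granularity, which suggests why overlap alone may be a weak proxy for how useful a description is to a screen reader user.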