Shapley visual transformers for image-to-text generation
Peer reviewed, Journal article
Published version
Date
2024
Abstract
In the contemporary landscape of the web, image-to-text generation stands out as a crucial information service. Recently, deep learning has emerged as the cutting-edge methodology for advancing image-to-text generation systems. However, these models are typically built with domain knowledge specific to the application at hand and tuned to a very particular data distribution; consequently, data scientists must be well-versed in the relevant subject. In this work, we target a new foundation for image-to-text generation systems by introducing a consensus method that supports self-adaptation and flexibility across different learning tasks and diverse data distributions. This paper presents I2T-SP (Image-to-Text Generation for Shapley Pruning), a consensus method for general-purpose intelligence that does not require a domain expert. The model is trained with a general deep-learning approach that quantifies the contribution of each model to the training process: multiple deep learning models are trained on each set of historical data, the Shapley Value of each subset of models is computed to measure its contribution to training, and the models are then pruned according to their contribution to the learning process. We evaluate the generality of I2T-SP on datasets of varying shapes and complexities. The results demonstrate the effectiveness of I2T-SP compared with baseline image-to-text generation solutions. This research marks a significant step towards a more adaptable and broadly applicable foundation for image-to-text generation systems.
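
To make the pruning step described above concrete, the sketch below shows one way Shapley-based model selection over an ensemble could be implemented. It assumes a user-supplied value_fn that returns a validation score (e.g. caption quality) for the ensemble built from any subset of models, and a keep_ratio cut-off; these names, the exact enumeration of subsets, and the pruning threshold are illustrative assumptions, not the paper's exact formulation.

from itertools import combinations
from math import factorial

def shapley_values(models, value_fn):
    """Exact Shapley value of each model's contribution to ensemble performance.

    value_fn(subset) is assumed to score the ensemble built from `subset`
    (an empty subset should return a baseline score, e.g. 0.0).
    """
    n = len(models)
    phi = {m: 0.0 for m in models}
    for m in models:
        others = [x for x in models if x != m]
        # Enumerate every coalition S of the other models and weight the
        # marginal gain of adding m by |S|! (n - |S| - 1)! / n!.
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                marginal = value_fn(subset + (m,)) - value_fn(subset)
                phi[m] += weight * marginal
    return phi

def prune(models, value_fn, keep_ratio=0.5):
    """Keep the models whose Shapley contribution is highest."""
    phi = shapley_values(models, value_fn)
    ranked = sorted(models, key=lambda m: phi[m], reverse=True)
    return ranked[: max(1, int(len(models) * keep_ratio))]

Exact enumeration costs O(2^n) evaluations of value_fn, so for more than a handful of models a sampled (Monte Carlo) approximation of the Shapley values would typically replace the inner loops.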