Can XAI-guided Point-of-Interest Proposal and SAM Perform Crossmodal Food Image Segmentation?
Abstract
Food image segmentation is a challenging task: food image segmentation datasets are scarce, the kinds of food encountered vary considerably across geographical regions and ethnicities, and adapting models to new modalities and domains can be resource-intensive.
In this thesis, we explore methods from explainable AI (XAI) as a way to extend general-purpose segmentation models to new modalities. Our approach uses CLIP, a multimodal model with a shared embedding space for text and image data. We propose a system of modular components that together form a promptable text-to-image segmentation model. Through an extensive set of hyperparameter impact studies, we evaluate how the system performs under different configurations and determine which components have the greatest impact on its performance.
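The sketch below illustrates the kind of pipeline described above; it is not the thesis's exact implementation. The specific choices here (OpenAI's clip package with a ViT-B/32 checkpoint, plain gradient saliency as the XAI signal, top-k point sampling, and a vit_b SAM checkpoint) are illustrative assumptions.

```python
# Illustrative sketch (assumed components and hyperparameters): a text prompt
# and an image are embedded with CLIP, a gradient saliency map over their
# similarity serves as the XAI relevance signal, the top-k salient pixels
# become point prompts, and SAM produces the segmentation mask.
import numpy as np
import torch
import torch.nn.functional as F
import clip
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

device = "cpu"  # on CPU, clip.load returns a float32 model, which keeps gradients simple
model, preprocess = clip.load("ViT-B/32", device=device)

def poi_from_saliency(image: Image.Image, prompt: str, k: int = 5) -> np.ndarray:
    """Return k (x, y) points of interest in original image coordinates."""
    x = preprocess(image).unsqueeze(0).to(device).requires_grad_(True)
    text = clip.tokenize([prompt]).to(device)
    sim = F.cosine_similarity(model.encode_image(x), model.encode_text(text)).sum()
    sim.backward()
    # Aggregate |gradient| over channels as a crude relevance map (224x224).
    sal = x.grad.abs().sum(dim=1).squeeze(0)
    top = torch.topk(sal.flatten(), k).indices.cpu().numpy()
    ys, xs = np.unravel_index(top, sal.shape)
    # Rescale from CLIP's preprocessed resolution back to the original image.
    w, h = image.size
    return np.stack([xs * w / sal.shape[1], ys * h / sal.shape[0]], axis=1)

def segment(image_path: str, prompt: str, sam_checkpoint: str) -> np.ndarray:
    image = Image.open(image_path).convert("RGB")
    points = poi_from_saliency(image, prompt)
    sam = sam_model_registry["vit_b"](checkpoint=sam_checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(np.array(image))
    masks, scores, _ = predictor.predict(
        point_coords=points.astype(np.float32),
        point_labels=np.ones(len(points)),  # treat all sampled points as foreground hints
        multimask_output=True,
    )
    return masks[scores.argmax()]  # keep the highest-scoring mask
```

In the full system, the gradient saliency step would be replaced by a stronger XAI attribution method, and the raw relevance map would pass through the filtering and point-of-interest sampling stages studied in the thesis.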
We found that combining techniques from explainable AI with filtering and the sampling of points of interest is highly effective in guiding a general-purpose segmentation model. While we did not beat state-of-the-art models, our results were reasonably competitive. However, we encountered difficulties developing a classifier for dishes and ingredients based on high-dimensional embeddings and k-nearest-neighbors search.
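For reference, a minimal sketch of the kind of k-nearest-neighbors classifier mentioned above is shown here. The choice of scikit-learn, cosine distance, k = 5, and majority voting are assumptions for illustration, not the thesis's exact configuration.

```python
# Illustrative k-NN classifier over high-dimensional CLIP embeddings.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_index(embeddings: np.ndarray) -> NearestNeighbors:
    """embeddings: (N, D) CLIP image embeddings of labelled reference images."""
    # Cosine distance is a common choice for CLIP's embedding space (assumption).
    index = NearestNeighbors(n_neighbors=5, metric="cosine")
    index.fit(embeddings)
    return index

def classify(index: NearestNeighbors, labels: list[str], query: np.ndarray) -> str:
    """Predict a dish or ingredient label by majority vote among the k nearest neighbors."""
    _, idx = index.kneighbors(query.reshape(1, -1))
    votes = [labels[i] for i in idx[0]]
    return max(set(votes), key=votes.count)
```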
In conclusion, the approach proved promising and merits further investigation. We were able to segment food images under supervision, but our classification results were not fruitful in this thesis. We conjecture that a stronger multimodal model such as SigLiT could improve our results. Additionally, both the classification and segmentation tasks could be improved by fine-tuning the respective components on a large food-specific dataset such as Recipe1M.