Optimal AI through Minimal Data: Enhancing Sentiment Analysis with Data Diversity for Norwegian
Abstract
This thesis challenges the prevailing assumption in machine learning that more data always yields better performance by showing that undersampling can improve a model rather than degrade it. Focusing on sentiment analysis for Norwegian, a relatively low-resource language, we explore several undersampling methods that reduce data volume while preserving content diversity. The raw dataset was crawled from a rich online resource and passed through a thorough preprocessing and ETL process; it covers 22 thematic categories, carries labels on a scale from 1 to 5, and spans February 2012 to June 2023. To ensure a fair comparison, the undersampled models and the model trained on the full dataset share a common LSTM architecture. We then apply multi-level fine-tuning and an in-depth analysis of the results.
Our main finding is that the KMeans-undersampled model, trained on only 18\% of the data, consistently outperforms the full-dataset model on multiple metrics, including F1-score, validation loss, and training loss. It also matches the accuracy of the full-dataset model when evaluated on a binary scale. These results indicate stronger learning and generalization, and they suggest that strategic data reduction can be more beneficial than traditional data-heavy approaches in certain contexts.
The thesis also examines the practical benefits of undersampling, and of KMeans clustering in particular, for Norwegian sentiment analysis. A reduced dataset not only improves model performance and generalization, which is well suited to resource-limited devices such as mobile phones, but also supports sustainable AI development by reducing environmental impact and enhancing data privacy. In addition, models trained on representative data subsets are clearer and more transparent, which supports the development of more ethical and comprehensible AI systems and encourages policies that emphasize data quality and responsible AI governance.
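To make the idea of KMeans undersampling concrete, the following is a minimal, dependency-free sketch in Python. It is an illustration under stated assumptions, not the procedure used in the thesis: the abstract does not say how texts are represented as vectors, how many clusters are used, or how samples are drawn from each cluster. Here the inputs are assumed to be numeric feature vectors, the number of clusters `k` is chosen by hand, and each cluster contributes the same fraction of its members, so every region of the feature space stays represented. The function names `kmeans` and `kmeans_undersample` are hypothetical. In practice, a library implementation such as scikit-learn's `KMeans` would normally replace the hand-written clustering loop.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k randomly chosen data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def kmeans_undersample(points, fraction, k, seed=0):
    """Return sorted indices of a subset holding roughly `fraction` of the
    points, sampled from every non-empty cluster (assumed selection rule)."""
    labels = kmeans(points, k, seed=seed)
    rng = random.Random(seed)
    keep = []
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        if not idx:
            continue
        # Keep the same share of every cluster, but never fewer than one item.
        n_keep = max(1, round(fraction * len(idx)))
        keep.extend(rng.sample(idx, n_keep))
    return sorted(keep)


# Toy stand-in for document embeddings: three separated 2-D blobs.
gen = random.Random(42)
data = [
    (cx + gen.gauss(0, 0.3), cy + gen.gauss(0, 0.3))
    for cx, cy in [(0, 0), (5, 5), (0, 5)]
    for _ in range(100)
]

# Keep about 18% of the points; the indices can then select training rows.
subset = kmeans_undersample(data, fraction=0.18, k=3)
print(len(subset), "of", len(data), "points kept")
```

Sampling a fixed share from each cluster is only one possible selection rule. Keeping the points closest to each centroid, or sampling within each sentiment label separately, would be equally plausible variants, and the abstract does not indicate which, if any, was used.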