Exploring multilingual and contextual properties in word representations from BERT
Master thesis
Published version
Permanent link: https://hdl.handle.net/11250/3017423
Publication date: 2022
Abstract
Nowadays, contextual language models can solve a wide range of language tasks such as text classification, question answering and machine translation. These tasks often require general language understanding, such as knowledge of how words relate to each other. This understanding is acquired through a pre-training stage where the model learns features from raw text data. However, we do not fully understand all the features the model learns during this pre-training stage. Does there exist information yet to be utilized? Can we make predictions more explainable? This thesis aims to extend our knowledge of which features a language model has acquired. We have chosen the model architecture BERT and have analyzed its word representations from two feature perspectives. The first perspective investigated similarities and dissimilarities between English and Norwegian word representations by evaluating their performance on a word retrieval task and a language detection task. The second perspective analyzed how a word representation changes if the word stands in the wrong context or if it is inferred through the model without any context.
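The abstract only summarizes the second perspective at a high level. As illustration, below is a minimal sketch of how a word's contextual BERT representation could be compared with its context-free representation. The Hugging Face transformers library, the bert-base-multilingual-cased checkpoint, the averaging of sub-token states and the example sentence are assumptions made for illustration, not the procedure used in the thesis.

    import torch
    import torch.nn.functional as F
    from transformers import AutoModel, AutoTokenizer

    # Assumed checkpoint; the thesis's exact BERT variant is not stated in the abstract.
    MODEL_NAME = "bert-base-multilingual-cased"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME)
    model.eval()

    def word_representation(text: str, word: str) -> torch.Tensor:
        """Average the last-layer hidden states of the sub-tokens of `word` in `text`."""
        enc = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]   # (seq_len, hidden_dim)
        # Position of the word in the whitespace-split sentence (no punctuation assumed).
        word_index = text.split().index(word)
        token_positions = [i for i, w in enumerate(enc.word_ids()) if w == word_index]
        return hidden[token_positions].mean(dim=0)

    # Representation of "bank" inside a sentence (contextual) ...
    in_context = word_representation("She sat on the river bank", "bank")
    # ... versus the same word fed through the model on its own (no context).
    no_context = word_representation("bank", "bank")

    # Cosine similarity indicates how much the representation shifts with context.
    print(F.cosine_similarity(in_context, no_context, dim=0).item())

The same helper could also be applied to a sentence where the word appears in an incongruent context, so that the resulting representation can be compared against both the in-context and the context-free vectors.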