Improving Classification of Tweets Using Linguistic Information from a Large External Corpus

Hammer, Hugo Lewi; Yazidi, Anis; Bai, Aleksander; Engelstad, Paal E.

Journal article, Peer reviewed

Accepted version

Permanent link

https://hdl.handle.net/10642/4326

Publication date

2016

Collections

TKD - Institutt for informasjonsteknologi [940]

Original version

Hammer HL, Yazidi A, Bai A, Engelstad PE. Improving Classification of Tweets Using Linguistic Information from a Large External Corpus. In: Maglaras. Industrial Networks and Intelligent Systems, 2016. Springer, p. 122-134

Abstract

The bag-of-words representation of documents is often unsatisfactory because it ignores relationships between important terms that do not co-occur literally. Improvements might be achieved by expanding the vocabulary with other relevant words, such as synonyms. In this paper we use word-word co-occurrence information from a large corpus to expand the vocabulary of another corpus consisting of tweets. Several methods for including the co-occurrence information are constructed and tested on the classification of real Twitter data. Our results show that we are able to reduce the number of erroneous classifications by 14% using co-occurrence information.
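
The abstract does not specify how the co-occurrence information is included, so the following is only a minimal sketch of one possible variant, not the authors' method. It counts how often two words appear in the same document of an external corpus, then adds each tweet word's most frequent co-occurring word to the tweet's bag of words with a reduced weight. The function names, the `top_k` and `weight` parameters, and the toy corpus are all hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_counts(corpus):
    """Count, for each word pair, the number of documents containing both."""
    counts = defaultdict(Counter)
    for doc in corpus:
        words = sorted(set(doc.lower().split()))
        for a, b in combinations(words, 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def expand_bag_of_words(tweet, counts, top_k=1, weight=0.5):
    """Return the tweet's word counts plus down-weighted co-occurring words.

    top_k and weight are illustrative assumptions, not values from the paper.
    """
    bag = Counter(tweet.lower().split())
    expanded = {word: float(n) for word, n in bag.items()}
    for word in bag:
        # Words unseen in the external corpus contribute no expansion terms.
        for related, _ in counts.get(word, Counter()).most_common(top_k):
            if related not in bag:
                expanded[related] = expanded.get(related, 0.0) + weight
    return expanded


# Toy external corpus (hypothetical data, for illustration only).
external_corpus = ["good great", "good great", "good fine", "bad awful"]
counts = cooccurrence_counts(external_corpus)
expanded = expand_bag_of_words("good day", counts)
```

In this toy run the tweet "good day" gains the term "great", so it would share a feature with a tweet containing "great" even though the two tweets have no word in common. The expanded dictionaries could then be passed to any classifier that accepts weighted term features.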