Improving Classification of Tweets Using Linguistic Information from a Large External Corpus
Journal article, Peer reviewed
Accepted version

Date
2016
Original version
Hammer HL, Yazidi A, Bai A, Engelstad P.E.: Improving Classification of Tweets Using Linguistic Information from a Large External Corpus. In: Maglaras. Industrial Networks and Intelligent Systems, 2016. Springer p. 122-134

Abstract
The bag-of-words representation of documents is often unsatisfactory
because it ignores relationships between important terms that do not
co-occur literally. Improvements might be achieved by expanding the
vocabulary with other relevant words, such as synonyms.
In this paper we use word-word co-occurrence information from a large
corpus to expand the vocabulary of another corpus consisting of tweets.
We construct several methods for including the co-occurrence information
and test them on the classification of real Twitter data.
Our results show that using co-occurrence information reduces the number
of erroneous classifications by 14%.
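
The general idea of expanding a bag-of-words vector with co-occurrence information can be illustrated with a minimal sketch. This is not one of the methods constructed in the paper, which the abstract does not specify. The function names, the window size, the `top_k` cutoff, the `weight` given to added words, and the toy corpus are all assumptions made for illustration only.

```python
from collections import Counter


def cooccurrence_counts(corpus, window=2):
    """Count word pairs that appear within `window` tokens of each other.

    Hypothetical helper: tokenization is plain lowercased whitespace
    splitting, chosen only to keep the sketch short.
    """
    counts = {}
    for doc in corpus:
        tokens = doc.lower().split()
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if v == w:
                    continue
                # Store the pair symmetrically.
                counts.setdefault(w, Counter())[v] += 1
                counts.setdefault(v, Counter())[w] += 1
    return counts


def expand_bag_of_words(text, counts, top_k=2, weight=0.5):
    """Return a bag of words for `text`, expanded with co-occurring words.

    Each original word keeps its raw count. For each original word, its
    `top_k` most frequent co-occurring words from the external corpus are
    added with a reduced weight (an assumed, illustrative scheme).
    """
    bag = Counter(text.lower().split())
    expanded = dict(bag)
    for w, n in bag.items():
        for v, _ in counts.get(w, Counter()).most_common(top_k):
            if v in bag:
                continue  # leave counts of words already in the text unchanged
            expanded[v] = expanded.get(v, 0) + weight * n
    return expanded


# Toy stand-in for the large external corpus (assumed data).
external_corpus = [
    "the film was great",
    "the movie was great",
    "great movie tonight",
]
counts = cooccurrence_counts(external_corpus)

# A short "tweet" that never mentions "movie" still gets a nonzero
# entry for it after expansion.
expanded = expand_bag_of_words("great film", counts)
print(expanded)
```

The expanded dictionary can then be fed to any standard classifier in place of the raw counts. In practice one would likely filter stop words and normalize the co-occurrence counts before expansion; both steps are omitted here for brevity.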