Crawling JavaScript websites using WebKit - with application to analysis of hate speech in online discussions
Journal article, Peer reviewed
Permanent link
https://hdl.handle.net/10642/1834
Publication date
2013-11
Original version
Hammer, H., Bratterud, A. & Fagernes, S. (2013). Crawling JavaScript websites using WebKit - with application to analysis of hate speech in online discussions. NIK: Norsk Informatikkonferanse. Trondheim: Tapir Akademisk
Abstract
JavaScript client-side hidden web pages (CSHW) contain dynamic material created as a result of specific user activities. The number of CSHW websites is increasing. Crawling the so-called Hidden Web is challenging, particularly when JavaScript CSHW from an external website is seamlessly included as part of the web pages. We have developed a prototype web crawler that efficiently extracts content from CSHW. The crawler uses WebKit to render web pages and to emulate human web page activities to reveal dynamic content. The WebKit crawler was used to collect text from 39 Norwegian online newspaper debate articles, where the online user discussions were included as JavaScript CSHW from other websites. The average speeds for extracting the main content and the JavaScript-generated discussions were 36.3 kB/sec and 8.8 kB/sec, respectively. Opinion mining of the collected text shows that the debate articles are more positive toward Islam and Muslims than the discussions that follow them. The results demonstrate the importance of being able to collect such JavaScript CSHW discussion content to get an overview of existing hate speech on the Internet.
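The problem the abstract describes can be illustrated with a minimal sketch. It is not the authors' WebKit crawler. It only shows, under assumed inputs, why a crawler that reads static HTML misses a discussion that is included by JavaScript from another website. The page markup, the `comments.example.com` URL, and the names `TextAndScripts` and `static_extract` are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical debate page: the article text is in the static HTML, while the
# discussion container is empty until an external script fills it at runtime.
STATIC_HTML = """
<html><body>
  <div id="article"><p>Main debate article text.</p></div>
  <div id="discussion"></div>
  <script src="https://comments.example.com/embed.js"></script>
</body></html>
"""


class TextAndScripts(HTMLParser):
    """Collect visible text and the URLs of externally loaded scripts."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.script_urls = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            src = dict(attrs).get("src")
            if src:
                self.script_urls.append(src)

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Ignore script bodies and whitespace-only text nodes.
        if not self._in_script and data.strip():
            self.text.append(data.strip())


def static_extract(html):
    """Return (visible text fragments, external script URLs) without running any JavaScript."""
    parser = TextAndScripts()
    parser.feed(html)
    parser.close()
    return parser.text, parser.script_urls


text, scripts = static_extract(STATIC_HTML)
print(text)     # only the article text is present in the static markup
print(scripts)  # the discussion would have to be produced by running this script
```

In this sketch the static pass recovers the article text but nothing from the discussion, because that content only exists after the external script has run. A rendering engine such as WebKit executes the script and exposes the resulting page, which is the gap the crawler described above is meant to close.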