Extracting Informative Textual Parts from Web Pages Containing User-Generated Content

Type of publication:	Conference paper
Citation:	Pappas_I-KNOW_2012
Publication status:	Published
Booktitle:	12th International Conference on Knowledge Management and Knowledge Technologies
Series:	i-KNOW '12
Number:	8
Year:	2012
Month:	June
Pages:	4:1--4:8
Publisher:	ACM
Location:	Graz, Austria
Organization:	ACM ICPS
Address:	New York, NY, USA
ISBN:	978-1-4503-1242-4
URL:	http://doi.acm.org/10.1145/236...
Abstract:	The vast amount of user-generated content on the Web has increased the need for handling the problem of automatically processing content in web pages. The segmentation of web pages and noise (non-informative segment) removal are important pre-processing steps in a variety of applications such as sentiment analysis, text summarization and information retrieval. Currently, these two tasks tend to be handled separately or are handled together without emphasizing the diversity of the web corpora and the web page type detection. We present a unified approach that is able to provide robust identification of informative textual parts in web pages along with accurate type detection. The proposed algorithm takes into account visual and non-visual characteristics of a web page and is able to remove noisy parts from three major categories of pages which contain user-generated content (News, Blogs, Discussions). Based on a human-annotated corpus consisting of diverse topics, domains and templates, we demonstrate the learning abilities of our algorithm and examine its effectiveness in extracting the informative textual parts and its use as a rule-based classifier for web page type detection in a realistic web setting.
Keywords:
Projects:	Idiap
Authors:	Pappas, Nikolaos; Katsimpras, Georgios; Stamatatos, Efstathios
Attachments
Pappas_I-KNOW_2012.pdf