CONF
Pappas_I-KNOW_2012/IDIAP
Extracting Informative Textual Parts from Web Pages Containing User-Generated Content
Pappas, Nikolaos
Katsimpras, Georgios
Stamatatos, Efstathios
EXTERNAL http://publications.idiap.ch/attachments/papers/2012/Pappas_I-KNOW_2012.pdf
PUBLIC
ACM ICPS - 12th International Conference on Knowledge Management and Knowledge Technologies
Graz, Austria
i-KNOW '12
8
4:1--4:8
978-1-4503-1242-4
2012
ACM
New York, NY, USA
http://doi.acm.org/10.1145/2362456.2362462 URL
The vast amount of user-generated content on the Web has increased the need for handling the problem of automatically processing content in web pages. The segmentation of web pages and noise (non-informative segment) removal are important pre-processing steps in a variety of applications such as sentiment analysis, text summarization and information retrieval. Currently, these two tasks tend to be handled separately or are handled together without emphasizing the diversity of the web corpora and the web page type detection. We present a unified approach that is able to provide robust identification of informative textual parts in web pages along with accurate type detection. The proposed algorithm takes into account visual and non-visual characteristics of a web page and is able to remove noisy parts from three major categories of pages which contain user-generated content (News, Blogs, Discussions). Based on a human-annotated corpus consisting of diverse topics, domains and templates, we demonstrate the learning abilities of our algorithm and examine its effectiveness in extracting the informative textual parts and its usage as a rule-based classifier for web page type detection in a realistic web setting.