Automating Content Extraction of HTML Documents - Abstract

Web pages often contain clutter (such as unnecessary images and extraneous links) around the
body of an article that distracts a user from actual content. Extraction of "useful and
relevant" content from web pages has many applications, including cell phone and PDA browsing,
speech rendering for the visually impaired, and text summarization. Most approaches to making
content more readable involve changing font size or removing HTML and data components such as
images, which takes away from a webpage's inherent look and feel. Unlike "Content Reformatting",
which aims to reproduce the entire webpage in a more convenient form, our solution directly
addresses "Content Extraction". We have developed a framework that employs an easily extensible
set of techniques and incorporates the advantages of previous work on content extraction. Our key
insight is to work with DOM trees, a W3C-specified interface that allows programs to dynamically
access document structure, rather than with raw HTML markup. We have implemented our approach in
a publicly available Web proxy that extracts content from HTML web pages. This proxy can be used
both centrally, administered for groups of users, and by individuals for their personal
browsers. After receiving feedback from users of the proxy, we have also created a revised
version with improved performance and accessibility in mind.
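
To make the idea of operating on a DOM tree rather than on raw markup concrete, the following is a minimal sketch and not the proxy's actual implementation. It assumes well-formed XHTML input so that Python's standard `xml.dom.minidom` can parse it, and it applies two illustrative filters: dropping images, and dropping containers whose text is mostly link text. The function names, the list of container tags, and the `max_link_ratio` threshold are all hypothetical choices made for this example.

```python
from xml.dom import minidom


def text_length(node):
    """Count text characters under node, ignoring surrounding whitespace."""
    if node.nodeType == node.TEXT_NODE:
        return len(node.data.strip())
    return sum(text_length(child) for child in node.childNodes)


def link_text_length(node):
    """Count text characters under node that sit inside <a> elements."""
    if node.nodeType != node.ELEMENT_NODE:
        return 0
    if node.tagName == "a":
        return text_length(node)
    return sum(link_text_length(child) for child in node.childNodes)


def collect_text(node, parts):
    """Append every non-empty text fragment under node to parts, in order."""
    if node.nodeType == node.TEXT_NODE:
        fragment = node.data.strip()
        if fragment:
            parts.append(fragment)
    for child in node.childNodes:
        collect_text(child, parts)


def extract_content(xhtml, max_link_ratio=0.5):
    """Return the text left after two hypothetical DOM-level filters.

    max_link_ratio is an assumed threshold, not a value from the paper.
    """
    doc = minidom.parseString(xhtml)

    # Filter 1: remove every image element from the tree.
    for img in list(doc.getElementsByTagName("img")):
        img.parentNode.removeChild(img)

    # Filter 2: remove containers whose text is dominated by link text,
    # a rough stand-in for navigation bars and link lists.
    for tag in ("div", "ul", "table"):
        for element in list(doc.getElementsByTagName(tag)):
            total = text_length(element)
            if total > 0 and link_text_length(element) / total > max_link_ratio:
                element.parentNode.removeChild(element)

    parts = []
    collect_text(doc.documentElement, parts)
    return " ".join(parts)


if __name__ == "__main__":
    page = (
        "<html><body>"
        "<div><a href='/a'>Home</a> <a href='/b'>About</a></div>"
        "<div><p>Article body text goes here. <a href='/c'>ref</a></p>"
        "<img src='x.png'/></div>"
        "</body></html>"
    )
    print(extract_content(page))
```

Because each filter is a separate pass over the tree, further filters can be added without touching the existing ones, which is one way to read "easily extensible" in this setting. Real-world HTML is rarely well-formed, so a practical system would need a tolerant HTML parser in front of this step.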

Copyright (c) 2005: The Trustees of Columbia University in the City of New York. All Rights Reserved.