Web pages often contain clutter (such as unnecessary images and extraneous
links) around the
body of an article that distracts a user from actual content. Extraction
of "useful and
relevant" content from web pages has many applications, including
cell phone and PDA browsing,
speech rendering for the visually impaired, and text summarization. Most
approaches to making
content more readable involve changing font size or removing HTML and
data components such as
images, which takes away from a webpage's inherent look and feel. Unlike
"Content Reformatting",
which aims to reproduce the entire webpage in a more convenient form,
our solution directly
addresses "Content Extraction". We have developed a framework
that employs an easily extensible
set of techniques. It incorporates advantages of previous work on content
extraction. Our key
insight is to work with DOM trees, a W3C specified interface that allows
programs to dynamically
access document structure, rather than with raw HTML markup. We have implemented
our approach in
a publicly available Web proxy to extract content from HTML web pages.
This proxy can be used
both centrally, administered for groups of users, as well as by individuals
for personal
browsers. We have also, after receiving feedback from users about the
proxy, created a revised
version with improved performance and accessibility in mind.
Home | About | People | Publications | Software | Register