About Crunch

 

Web pages often contain “clutter” (which we define as unnecessary images, navigational menus and extraneous links) around the body of an article that may distract a user from the actual content. Extracting “useful and relevant” content from web pages has many applications, including speech rendering for the visually impaired, cell phone and PDA browsing, and text summarization. Most existing approaches to making content more directly accessible involve changing the font size or removing HTML and data components such as images, which can detract from a webpage’s inherent look and feel. Unlike “Content Reformatting”, which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses “Content Extraction”.


We introduce Crunch, a framework that employs an easily extensible set of techniques for enabling and integrating heuristics concerned with “content extraction” from HTML web pages. Crunch is implemented as a transparent web proxy and is practically usable by end users.

 

We use DOM tree-based content extraction rather than processing HTML directly as flat files. Crunch is a versatile solution that allows programmers and administrators to add heuristics to the framework. These heuristics act as filters that can be parameterized and toggled to perform the content extraction. Crunch reduces the human involvement needed to set thresholds for the heuristics by automatically detecting and using the content genre (context) of a given website. Genre detection relies on frequency distributions of the words associated with the webpages. These distributions are compared against previously known results that work well for certain genres of sites, and the matching settings are then applied to improve the extraction process.
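
The genre-driven selection of settings described above can be illustrated with a small sketch. This is not Crunch's actual implementation: the genre names, the reference word lists, the setting keys (`link_text_ratio`, `remove_images`) and the choice of cosine similarity as the comparison measure are all assumptions made for illustration. The sketch builds a word frequency distribution for a page, compares it to stored per-genre distributions, and returns the filter settings associated with the closest genre.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Return the relative frequency of each lowercase word in text."""
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency distributions."""
    dot = sum(value * q[word] for word, value in p.items() if word in q)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


# Hypothetical reference distributions, standing in for the
# "previously known results" for each genre of site.
GENRE_PROFILES = {
    "news": word_frequencies(
        "reporter said government election police minister said report news"
    ),
    "shopping": word_frequencies(
        "price cart buy shipping sale price discount order checkout"
    ),
}

# Hypothetical filter settings that are assumed to work well per genre.
GENRE_SETTINGS = {
    "news": {"link_text_ratio": 0.25, "remove_images": True},
    "shopping": {"link_text_ratio": 0.60, "remove_images": False},
}


def detect_genre(page_text):
    """Return the genre whose reference distribution is closest to the page."""
    page = word_frequencies(page_text)
    return max(
        GENRE_PROFILES,
        key=lambda genre: cosine_similarity(page, GENRE_PROFILES[genre]),
    )


def settings_for(page_text):
    """Pick the filter settings associated with the detected genre."""
    return GENRE_SETTINGS[detect_genre(page_text)]
```

For example, `settings_for("The minister said the election report was delayed.")` would select the hypothetical "news" settings, which a filter pipeline could then use as its thresholds without an administrator tuning them by hand.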




 

 

 

 

 


Copyright (c) 2005: The Trustees of Columbia University in the City of New York. All Rights Reserved.