Naive Bayes Classification of Web Pages

John Guidi, Chief Scientist
Terra Lycos, S.A.

December 15, 2000
11 a.m.
Fuller Labs 320

Abstract

This talk discusses the application of Bayesian learning methods to the classification of web pages at Terra Lycos. Results are compared with similar efforts involving a canonical set of Usenet news articles. Various feature subset selection methods are explored, including information gain, cross entropy, and odds ratio. The paucity of labeled web pages led to an attempt to augment the existing training data with articles from appropriate Usenet newsgroups. Although typical web pages and news articles are substantially different, this augmented approach, coupled with a simple threshold-based feature subset selection method, yielded a classification accuracy of 60% for placing authored web pages into 15 categories.
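For readers unfamiliar with the technique, the following is a minimal sketch of a multinomial naive Bayes text classifier combined with a threshold-based feature filter. It is not the system described in the talk: the abstract does not say which quantity is thresholded, so the document-frequency cutoff (min_df), the Laplace smoothing, the function names, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def select_features(docs, min_df=2):
    """Keep words that occur in at least min_df documents.
    Assumption: the threshold is applied to document frequency."""
    df = Counter()
    for words, _ in docs:
        df.update(set(words))
    return {w for w, n in df.items() if n >= min_df}

def train(docs, vocab):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for words, label in docs:
        word_counts[label].update(w for w in words if w in vocab)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        log_like = {
            w: math.log((word_counts[label][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_like)
    return model

def classify(model, words):
    """Return the class with the highest posterior log score.
    Words outside the selected vocabulary are ignored."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(log_like[w] for w in words if w in log_like)
    return max(model, key=score)

# Hypothetical toy corpus: (tokens, category) pairs.
docs = [
    ("goal match team score".split(), "sports"),
    ("team coach match win".split(), "sports"),
    ("stock market price share".split(), "finance"),
    ("market price bank share".split(), "finance"),
]
vocab = select_features(docs, min_df=2)
model = train(docs, vocab)
print(classify(model, "team match tonight".split()))
```

Information gain, cross entropy, or odds ratio could replace the frequency cutoff by scoring each word against the class labels and keeping only the top-scoring words.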

Host

Professor Elke Rundensteiner

Coordinator

Professor Dave Brown
