Naive Bayes Classification of Web Pages
John Guidi, Chief Scientist
Terra Lycos, S.A.
December 15, 2000
11 a.m.
Fuller Labs 320
Abstract
This talk discusses the application of Bayesian learning methods to the classification of web pages at Terra Lycos. Results are compared and contrasted with similar efforts involving a canonical set of Usenet news articles. Various feature subset selection methods are explored, including information gain, cross entropy, and odds ratio. The paucity of labeled web pages led to an attempt to augment the existing training data with articles from appropriate Usenet newsgroups. Although typical web pages and news articles are substantially different, this augmented approach, coupled with a simple threshold-based feature subset selection method, yielded a classification accuracy of 60% for placing authored web pages into 15 categories.
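For readers unfamiliar with the technique, the following is a minimal sketch of a naive Bayes text classifier combined with a threshold-based feature filter. It is not the speaker's implementation: the abstract does not say what quantity the threshold was applied to, so the document-frequency cutoff used here is an assumption, and the function names (select_features, train, classify), the min_doc_freq parameter, and the toy training data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def select_features(docs, min_doc_freq=2):
    """Keep words that occur in at least min_doc_freq training documents.

    Assumption: a document-frequency cutoff stands in for the unspecified
    "threshold-based" selection method mentioned in the abstract.
    """
    doc_freq = Counter()
    for words, _label in docs:
        doc_freq.update(set(words))
    return {w for w, n in doc_freq.items() if n >= min_doc_freq}


def train(docs, vocab):
    """Collect per-class document counts and per-class word counts."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    for words, label in docs:
        class_docs[label] += 1
        word_counts[label].update(w for w in words if w in vocab)
    return class_docs, word_counts


def classify(words, class_docs, word_counts, vocab):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            if w in vocab:  # words outside the selected subset are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, invented for illustration only.
training = [
    ("goal match team score".split(), "sports"),
    ("team coach match season".split(), "sports"),
    ("stock market shares price".split(), "finance"),
    ("market price bank shares".split(), "finance"),
]
vocab = select_features(training, min_doc_freq=2)
class_docs, word_counts = train(training, vocab)
predicted = classify("the team won the match".split(), class_docs, word_counts, vocab)
```

In this sketch, words that fail the cutoff are dropped before training, so they contribute nothing to any class score. Criteria such as information gain or odds ratio would replace select_features with a per-word score computed from the class labels.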
Host
Professor Elke Rundensteiner
Coordinator
Professor Dave Brown
