Improving a Page Classifier with Anchor Extraction and Link Analysis
Neural Information Processing Systems
Most text categorization systems use simple models of documents and document collections. In this paper we describe a technique that improves a simple web page classifier's performance on pages from a new, unseen web site, by exploiting link structure within a site as well as page structure within hub pages. On real-world test cases, this technique significantly and substantially improves the accuracy of a bag-of-words classifier, reducing error rate by about half, on average. The system uses a variant of co-training to exploit unlabeled data from a new site. Pages are labeled using the base classifier; the results are used by a restricted wrapper-learner to propose potential "main-category anchor wrappers"; and finally, these wrappers are used as features by a third learner to find a categorization of the site that implies a simple hub structure, but which also largely agrees with the original bag-of-words classifier.
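The three stages named in the abstract (base labeling, wrapper proposal, relabeling) can be illustrated with a minimal sketch. This is not the authors' implementation. The keyword-based `base_classify`, the `(hub, path, target)` link triples, the path strings, and the `min_precision` threshold are all hypothetical stand-ins. In place of the paper's third learner, the sketch simply picks the single proposed wrapper whose extracted pages agree most with the base labels.

```python
from collections import defaultdict

def base_classify(text, keywords):
    """Toy bag-of-words stand-in: positive if any keyword occurs in the text."""
    return len(set(text.lower().split()) & keywords) > 0

def propose_wrappers(hub_links, base_labels, min_precision=0.5):
    """Group anchors by (hub page, structural path). Each group is a candidate
    wrapper; keep those whose targets are mostly base-positive."""
    groups = defaultdict(set)
    for hub, path, target in hub_links:
        groups[(hub, path)].add(target)
    wrappers = {}
    for key, targets in groups.items():
        hits = sum(base_labels.get(t, False) for t in targets)
        if len(targets) >= 2 and hits / len(targets) > min_precision:
            wrappers[key] = targets
    return wrappers

def relabel(pages, base_labels, wrappers):
    """Simplified final step (an assumption, not the paper's learner): choose
    the wrapper that agrees most with the base labels and label by membership."""
    if not wrappers:
        return dict(base_labels)
    def agreement(targets):
        return sum((url in targets) == base_labels[url] for url in pages)
    best = max(wrappers.values(), key=agreement)
    return {url: url in best for url in pages}

# Hypothetical site: three executive biographies linked from one hub list.
pages = {
    "/exec/ann": "ann chief executive officer biography",
    "/exec/bob": "bob chief financial officer biography",
    "/exec/cy": "cy joined the firm in 1999",            # base classifier misses this
    "/news/1": "new officer appointed press release",   # base false positive
    "/about": "about our company",
}
hub_links = [
    ("/team", "ul.people/li/a", "/exec/ann"),
    ("/team", "ul.people/li/a", "/exec/bob"),
    ("/team", "ul.people/li/a", "/exec/cy"),
    ("/team", "div.nav/a", "/about"),
    ("/team", "div.nav/a", "/news/1"),
]
keywords = {"officer", "executive"}

base_labels = {url: base_classify(text, keywords) for url, text in pages.items()}
wrappers = propose_wrappers(hub_links, base_labels)
final = relabel(pages, base_labels, wrappers)
```

In this toy site the base classifier makes two errors, and the list of anchors under `ul.people/li/a` corrects both: the missed biography is pulled in because it sits in the same list as the confidently labeled ones, and the press release is dropped because no accepted wrapper extracts it.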
Dec-31-2003
- Country:
- North America
- United States
- Wisconsin > Dane County
- Madison (0.04)
- Washington > King County
- Seattle (0.04)
- Pennsylvania > Allegheny County
- Pittsburgh (0.14)
- Hawaii > Honolulu County
- Honolulu (0.04)
- California > Santa Clara County
- Palo Alto (0.04)
- Canada
- Ontario > Toronto (0.04)
- Alberta > Census Division No. 11
- Edmonton Metropolitan Region > Edmonton (0.04)
- Genre:
- Research Report (0.47)
- Technology: