Document Classification for Focused Topics

Power, Russell (New York University) | Chen, Jay (New York University) | Karthik, Trishank (New York University) | Subramanian, Lakshminarayanan (New York University)

Mar-22-2010, AAAI Conferences

Feature extraction is one of the fundamental challenges in improving the accuracy of document classification. Although there is a large body of research on document classification, most existing approaches either fail to achieve high classification accuracy or require massive training sets. In this paper, we propose a simple feature extraction algorithm that can achieve high document classification accuracy for development-centric topics. The algorithm exploits two distinct properties of these topics: (i) most of them tend to be very focused (unlike semantically hard classification topics such as chemistry or banks); and (ii) because of their local language and cultural underpinnings, authentic pages tend to use several region-specific features. The algorithm uses popularity and rarity as two separate metrics to extract features that describe a topic. Given a topic, the output feature set comprises (i) a list of popular keywords closely related to the topic and (ii) a list of rare keywords closely related to the topic. We show that a simple joint classifier based on these two feature sets can achieve high classification accuracy, whereas each feature subset on its own is insufficient. We have tested our algorithm across a wide range of development-centric topics.
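
The abstract does not say how popularity and rarity are computed or how the joint classifier combines them, so the following Python sketch is only an illustration under stated assumptions, not the authors' method. It assumes that "popular" means high document frequency in a small topic corpus (relative to a background corpus), that "rare" means present in the topic corpus but absent from the background corpus, and that the joint classifier requires at least one hit from each list. The function names, thresholds, and toy documents are all hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; the abstract names none).
    return [w.strip(".,;:()").lower() for w in text.split() if w.strip(".,;:()")]


def doc_freq(docs):
    # Count, for each term, the number of documents that contain it.
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df


def extract_features(topic_docs, background_docs, n_popular=3, n_rare=4):
    topic_df = doc_freq(topic_docs)
    bg_df = doc_freq(background_docs)
    n_topic, n_bg = len(topic_docs), len(background_docs)

    # Assumed popularity metric: terms found in a strictly larger share of
    # topic documents than background documents, ranked by topic frequency.
    candidates = [t for t in topic_df if topic_df[t] * n_bg > bg_df[t] * n_topic]
    popular = sorted(candidates, key=lambda t: (-topic_df[t], t))[:n_popular]

    # Assumed rarity metric: topic terms never seen in the background corpus,
    # excluding those already chosen as popular, least frequent first.
    unseen = [t for t in topic_df if bg_df[t] == 0 and t not in popular]
    rare = sorted(unseen, key=lambda t: (topic_df[t], t))[:n_rare]
    return popular, rare


def is_on_topic(text, popular, rare, min_popular=1, min_rare=1):
    # Assumed joint rule: require hits from BOTH feature lists.
    tokens = set(tokenize(text))
    return (len(tokens & set(popular)) >= min_popular
            and len(tokens & set(rare)) >= min_rare)


# Hypothetical toy corpora, for illustration only.
topic_docs = [
    "malaria treatment with artemisinin in kenya",
    "malaria prevention and bed nets in kenya",
    "malaria treatment with chloroquine",
]
background_docs = [
    "treatment of the common cold with rest",
    "travel in kenya and tanzania",
    "prevention of fires in homes with alarms",
]

popular, rare = extract_features(topic_docs, background_docs)
print(popular, rare)
print(is_on_topic("artemisinin resistance and malaria in asia", popular, rare))
print(is_on_topic("kenya travel guide", popular, rare))
print(is_on_topic("mosquito nets for camping", popular, rare))
```

In this toy setting a page that matches only a popular keyword ("kenya travel guide") or only a rare keyword ("mosquito nets for camping") is rejected, while a page matching both lists is accepted. This mirrors the abstract's claim that each feature subset alone is insufficient, though the actual metrics and classifier in the paper may differ.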

classification, health & medicine, immunology

Country:
- North America > United States (0.28)

Genre:
- Research Report (0.46)

Industry:
- Health & Medicine > Therapeutic Area
  - Immunology (0.72)
  - Infections and Infectious Diseases (0.51)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning
    - Learning Graphical Models > Directed Networks
      - Bayesian Learning (0.72)
    - Performance Analysis > Accuracy (1.00)
  - Natural Language > Text Classification (1.00)
