Exploiting Crowd-Based Labels for Domain Focused Information Retrieval

Miniter, John Cory (University of Massachusetts Lowell) | Mehta, Vineet (University of Massachusetts Lowell) | Chandra, Kavitha (University of Massachusetts Lowell)

AAAI Conferences 

Information search and retrieval from online sources or social forums is often performed with term based boolean queries. Such queries can produce low relevance documents in situations where the user is interested in retrieving information related to a concept, or belonging to a specific domain. In this work an approach for concept-based information retrieval is presented, which exploits word and document distributions derived from topic modeling performed on data from online sources. Documents acquired from the Reddit and Stack Exchange online social forums are used for extracting concepts, and subsequently training and testing a detector that aids in identifying and retrieving documents associated with the concept of interest. The selection of training sets for our concept based detector is aided by pre-partitioning of documents by online users (or crowd) into concept focused sub-forums, such as sub-reddits. Topics derived from a sample of the overall document set are taken to represent concepts. These topics then form the basis for identifying sub-forums that have a strong correspondence with the concept of interest, and documents within are assigned (noisy) binary labels. The applicability of our approach is demonstrated by creating a domain focused detector for Cyber Security content from Reddit data. The cross utility of this detector is demonstrated by successfully retrieving relevant Cyber Security documents from an alternate test online source: Stack Exchange. Document classification results of the proposed approach are compared favorably with classifications performed by human analysts.
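The labeling idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it makes several assumptions: a hand-picked word set (`concept_words`) stands in for a topic learned by topic modeling, the sub-forum score and the `threshold` value are hypothetical, and a small naive Bayes classifier stands in for the detector, whose type the abstract does not specify. The forum names and documents are invented toy data.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]

def forum_scores(docs, concept_words):
    """Per sub-forum, the mean fraction of a document's tokens in the concept word set."""
    totals, counts = Counter(), Counter()
    for forum, text in docs:
        tokens = tokenize(text)
        if not tokens:
            continue
        totals[forum] += sum(t in concept_words for t in tokens) / len(tokens)
        counts[forum] += 1
    return {forum: totals[forum] / counts[forum] for forum in counts}

def noisy_labels(docs, concept_words, threshold=0.2):
    """Label every document 1 if its sub-forum scores at or above the (assumed) threshold."""
    scores = forum_scores(docs, concept_words)
    positive = {forum for forum, s in scores.items() if s >= threshold}
    return [(text, 1 if forum in positive else 0) for forum, text in docs]

def train_naive_bayes(labeled):
    """Collect per-class word counts for a multinomial naive Bayes detector."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in labeled:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab

def predict(model, text):
    """Return the class with the higher log-probability (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label in (0, 1):
        if class_counts[label] == 0:
            continue
        logp = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # ignore words never seen in training
                logp += math.log((word_counts[label][token] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Toy data: (sub-forum, document text). Names and contents are illustrative only.
docs = [
    ("netsec", "malware exploit found in firewall"),
    ("netsec", "phishing attack steals password"),
    ("cooking", "bake bread with fresh yeast"),
    ("cooking", "simmer the soup with garlic"),
]
# Stand-in for the top words of a learned Cyber Security topic.
concept_words = {"malware", "exploit", "phishing", "password", "firewall", "attack"}

labeled = noisy_labels(docs, concept_words)
model = train_naive_bayes(labeled)
```

Calling `predict(model, text)` on a document from another source then mirrors, in miniature, the cross-source test described in the abstract. The labels are noisy by construction: every document in a selected sub-forum is marked positive, whether or not it is actually on topic.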
