Bao, Shenghua
Optimizing Hierarchical Classification with Adaptive Node Collapses
Perera, Sujan (IBM Watson Health) | Raz, Orna (IBM Research) | Routray, Ramani (IBM Watson Health) | Bao, Shenghua (IBM Watson Health) | Zalmanovici, Marcel (IBM Research)
Data intensive solutions, such as solutions that include machine learning components, are becoming more and more prevalent. The standard way of developing such solutions is to train machine learning models with manually annotated or labeled data for a given task. This methodology assumes the existence of ample human annotated data. Unfortunately, this is often not the case, due to imbalanced distribution of classes and lack of human annotation resources. This challenge is exasperated when thousands of hierarchical classes are introduced. Therefore, it is critical to quantify the sufficiency of the data for a given task before applying standard machine learning algorithms. Moreover, it may be the case that there is ample labeled training data to only solve a sub-problem. In particular, in the hierarchical classification problem, the sufficiency level of training data could vary significantly depending on the granularity level of hierarchy we use for classification. We identify a need to decompose the given problem to sub-problems for which there is ample training data. In this paper we propose a methodology to decompose a hierarchical classification problem considering the characteristics of a given dataset. We define an optimization problem of adaptive node collapse that identifies an appropriate hierarchy decomposition based on a trade-off between multiple goals. In our experiments, we consider the trade-off between the learning accuracy and the hierarchy abstraction level.
Analyzing and Predicting Not-Answered Questions in Community-based Question Answering Services
Yang, Lichun (Shanghai Jiao Tong University) | Bao, Shenghua (IBM Research China) | Lin, Qingliang (Shanghai Jiao Tong University) | Wu, Xian (IBM Research China) | Han, Dingyi (Shanghai Jiao Tong University) | Su, Zhong (IBM Research China) | Yu, Yong (Shanghai Jiao Tong University)
This paper focuses on analyzing and predicting not-answered questions in Community based Question Answering (CQA) services, such as Yahoo! Answers. In CQA services, users express their information needs by submitting natural language questions and await answers from other human users. Comparing to receiving results from web search engines using keyword queries, CQA users are likely to get more specific answers, because human answerers may catch the main point of the question. However, one of the key problems of this pattern is that sometimes no one helps to give answers, while web search engines hardly fail to response. In this paper, we analyze the not-answered questions and give a first try of predicting whether questions will receive answers. More specifically, we first analyze the questions of Yahoo Answers based on the features selected from different perspectives. Then, we formalize the prediction problem as supervised learning – binary classification problem and leverage the proposed features to make predictions. Extensive experiments are made on 76,251 questions collected from Yahoo! Answers. We analyze the specific characteristics of not-answered questions and try to suggest possible reasons for why a question is not likely to be answered. As for prediction, the experimental results show that classification based on the proposed features outperforms the simple word-based approach significantly.