Performance Analysis
Accelerated Training for Massive Classification via Dynamic Class Selection
Zhang, Xingcheng (The Chinese University of Hong Kong) | Yang, Lei (The Chinese University of Hong Kong) | Yan, Junjie (SenseTime Group Limited) | Lin, Dahua (The Chinese University of Hong Kong)
Massive classification, a classification task defined over a vast number of classes (hundreds of thousands or even millions), has become an essential part of many real-world systems, such as face recognition. Existing methods, including the deep networks that achieved remarkable success in recent years, were mostly devised for problems with a moderate number of classes. They would meet with substantial difficulties, e.g., excessive memory demand and computational cost, when applied to massive problems. We present a new method to tackle this problem. This method can efficiently and accurately identify a small number of "active classes" for each mini-batch, based on a set of dynamic class hierarchies constructed on the fly. We also develop an adaptive allocation scheme thereon, which leads to a better tradeoff between performance and cost. On several large-scale benchmarks, our method significantly reduces the training cost and memory demand, while maintaining competitive performance.
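As a rough illustration of the "active classes" idea in the abstract above, the following sketch computes a softmax cross-entropy over only a small subset of classes. It is a minimal sketch under stated assumptions, not the authors' method: the function name `active_class_loss` is hypothetical, and a brute-force top-k ranking stands in for the dynamic class hierarchies the abstract describes.

```python
import math


def active_class_loss(scores, label, k):
    """Cross-entropy restricted to the k highest-scoring classes plus the true label.

    Assumption: a brute-force top-k ranking replaces the dynamic class
    hierarchies from the abstract; it only illustrates the restricted loss.
    """
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    active = set(ranked[:k])
    active.add(label)  # the ground-truth class must always be active

    # Numerically stable log-sum-exp over the active classes only.
    top = max(scores[c] for c in active)
    log_z = top + math.log(sum(math.exp(scores[c] - top) for c in active))
    return log_z - scores[label], sorted(active)
```

Because the normaliser sums over fewer classes, the restricted loss never exceeds the full softmax loss, and the two are close when the omitted classes have low scores.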
Scene-Centric Joint Parsing of Cross-View Videos
Qi, Hang (University of California, Los Angeles) | Xu, Yuanlu (University of California, Los Angeles) | Yuan, Tao (University of California, Los Angeles) | Wu, Tianfu (NC State University) | Zhu, Song-Chun (University of California, Los Angeles)
Cross-view video understanding is an important yet under-explored area in computer vision. In this paper, we introduce a joint parsing framework that integrates view-centric proposals into scene-centric parse graphs that represent a coherent scene-centric understanding of cross-view scenes. Our key observations are that overlapping fields of views embed rich appearance and geometry correlations and that knowledge fragments corresponding to individual vision tasks are governed by consistency constraints available in commonsense knowledge. The proposed joint parsing framework represents such correlations and constraints explicitly and generates semantic scene-centric parse graphs. Quantitative experiments show that scene-centric predictions in the parse graph outperform view-centric predictions.
PoseHD: Boosting Human Detectors Using Human Pose Information
Liu, Zhijian (Shanghai Jiao Tong University) | Pan, Bowen (Shanghai Jiao Tong University) | Xiu, Yuliang (Shanghai Jiao Tong University) | Lu, Cewu (Shanghai Jiao Tong University)
As most recently proposed methods for human detection have achieved a sufficiently high recall rate within a reasonable number of proposals, in this paper, we mainly focus on how to improve the precision rate of human detectors. In order to address the two main challenges in precision improvement, i.e., i) hard background instances and ii) redundant partial proposals, we propose the novel PoseHD framework, a top-down pose-based approach on the basis of an arbitrary state-of-the-art human detector. In our proposed PoseHD framework, we first make use of human pose estimation (in a batch manner) and present pose heatmap classification (by a convolutional neural network) to eliminate hard negatives by extracting the more detailed structural information; then, we utilize pose-based proposal clustering and reranking modules, filtering redundant partial proposals by comprehensively considering both holistic and part information. The experimental results on multiple pedestrian benchmark datasets validate that our proposed PoseHD framework can generally improve the overall performance of recent state-of-the-art human detectors (by 2-4% in both mAP and MR metrics). Moreover, our PoseHD framework can be easily extended to object detection with large-scale object part annotations. Finally, in this paper, we present extensive ablative analysis to compare our approach with these traditional bottom-up pose-based models and highlight the importance of our framework design decisions.
Unsupervised Selection of Negative Examples for Grounded Language Learning
Pillai, Nisha (University of Maryland, Baltimore County) | Matuszek, Cynthia (University of Maryland, Baltimore County)
There has been substantial work in recent years on grounded language acquisition, in which language and sensor data are used to create a model relating linguistic constructs to the perceivable world. While powerful, this approach is frequently hindered by ambiguities, redundancies, and omissions found in natural language. We describe an unsupervised system that learns language by training visual classifiers, first selecting important terms from object descriptions, then automatically choosing negative examples from a paired corpus of perceptual and linguistic data. We evaluate the effectiveness of each stage as well as the system's performance on the overall learning task.
Efficient Large-Scale Multi-Modal Classification
Kiela, Douwe (Facebook AI Research) | Grave, Edouard (Facebook AI Research) | Joulin, Armand (Facebook AI Research) | Mikolov, Tomas (Facebook AI Research)
While the incipient internet was largely text-based, the modern digital world is becoming increasingly multi-modal. Here, we examine multi-modal classification where one modality is discrete, e.g. text, and the other is continuous, e.g. visual representations transferred from a convolutional neural network. In particular, we focus on scenarios where we have to be able to classify large quantities of data quickly. We investigate various methods for performing multi-modal fusion and analyze their trade-offs in terms of classification accuracy and computational efficiency. Our findings indicate that the inclusion of continuous information improves performance over text-only on a range of multi-modal classification tasks, even with simple fusion methods. In addition, we experiment with discretizing the continuous features in order to speed up and simplify the fusion process even further. Our results show that fusion with discretized features outperforms text-only classification, at a fraction of the computational cost of full multi-modal fusion, with the additional benefit of improved interpretability.
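The discretization step mentioned in the abstract above can be pictured as turning each continuous feature into a token that sits alongside the text tokens. The sketch below assumes simple uniform binning per dimension; the abstract does not say which discretization is used, and the names `discretize`, `fuse`, and the `__img{i}_b{b}` token format are hypothetical.

```python
def discretize(features, n_bins=4, lo=0.0, hi=1.0):
    """Map each continuous feature to a discrete token via uniform binning.

    Assumption: uniform bins over [lo, hi]; values outside the range are clipped.
    """
    tokens = []
    for i, x in enumerate(features):
        clipped = min(max(x, lo), hi)
        # The top edge (x == hi) falls into the last bin.
        b = min(int((clipped - lo) / (hi - lo) * n_bins), n_bins - 1)
        tokens.append(f"__img{i}_b{b}")
    return tokens


def fuse(text_tokens, features, **binning):
    """Early fusion: append the discretized feature tokens to the text tokens."""
    return list(text_tokens) + discretize(features, **binning)
```

The fused token list can then be fed to any text-only classifier, which is what makes this style of fusion cheap.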
Jointly Parse and Fragment Ungrammatical Sentences
Hashemi, Homa B. (University of Pittsburgh) | Hwa, Rebecca (University of Pittsburgh)
... experiments, we find that both joint methods produce tree fragment sets that are more similar to those produced by the oracle method than the previous pipeline method; moreover, the seq2seq method's pruning decision has a significantly higher accuracy. In terms of downstream applications, we show that dependency arc pruning is helpful for two applications: sentential grammaticality judgment and semantic role labeling.
However, the sentences under analysis may not always be grammatically correct. When a dependency parser nonetheless produces fully connected, syntactically well-formed trees for these sentences, the trees may be inappropriate and lead to errors. In fact, researchers have raised valid questions about the merit of annotating dependency trees for ungrammatical sentences (Ragheb and Dickinson 2012; Cahill 2015). On the other hand, previous work has ...
Linguistic Properties Matter for Implicit Discourse Relation Recognition: Combining Semantic Interaction, Topic Continuity and Attribution
Lei, Wenqiang (National University of Singapore) | Xiang, Yuanxin (National University of Singapore) | Wang, Yuwei (University of Utah) | Zhong, Qian (City University of Hong Kong) | Liu, Meichun (City University of Hong Kong) | Kan, Min-Yen (National University of Singapore)
Modern solutions for implicit discourse relation recognition largely build universal models to classify all of the different types of discourse relations. In contrast to such learning models, we build our model from first principles, analyzing the linguistic properties of the individual top-level Penn Discourse Treebank (PDTB) styled implicit discourse relations: Comparison, Contingency and Expansion. We find that the semantic characteristics of each relation type and two cohesion devices, topic continuity and attribution, work together to contribute such linguistic properties. We encode those properties as complex features and feed them into a Naive Bayes classifier, bettering baselines (including deep neural network ones) to achieve a new state-of-the-art performance level. Over a strong, feature-based baseline, our system outperforms one-versus-other binary classification by 4.83% for Comparison relation, 3.94% for Contingency and 2.22% for four-way classification.
Learning Graph-Structured Sum-Product Networks for Probabilistic Semantic Maps
Zheng, Kaiyu (University of Washington, Seattle) | Pronobis, Andrzej (University of Washington, Seattle) | Rao, Rajesh P. N. (University of Washington, Seattle)
We introduce Graph-Structured Sum-Product Networks (GraphSPNs), a probabilistic approach to structured prediction for problems where dependencies between latent variables are expressed in terms of arbitrary, dynamic graphs. While many approaches to structured prediction place strict constraints on the interactions between inferred variables, many real-world problems can only be characterized using complex graph structures of varying size, often contaminated with noise when obtained from real data. Here, we focus on one such problem in the domain of robotics. We demonstrate how GraphSPNs can be used to bolster inference about semantic, conceptual place descriptions using noisy topological relations discovered by a robot exploring large-scale office spaces. Through experiments, we show that GraphSPNs consistently outperform the traditional approach based on undirected graphical models, successfully disambiguating information in global semantic maps built from uncertain, noisy local evidence. We further exploit the probabilistic nature of the model to infer marginal distributions over semantic descriptions of as yet unexplored places and detect spatial environment configurations that are novel and incongruent with the known evidence.
Training Set Debugging Using Trusted Items
Zhang, Xuezhou (University of Wisconsin-Madison) | Zhu, Xiaojin (University of Wisconsin-Madison) | Wright, Stephen (University of Wisconsin-Madison)
Training set bugs are flaws in the data that adversely affect machine learning. The training set is usually too large for manual inspection, but one may have the resources to verify a few trusted items. The set of trusted items may not by itself be adequate for learning, so we propose an algorithm that uses these items to identify bugs in the training set and thus improves learning. Specifically, our approach seeks the smallest set of changes to the training set labels such that the model learned from this corrected training set predicts labels of the trusted items correctly. We flag the items whose labels are changed as potential bugs, whose labels can be checked for veracity by human experts. Finding the bugs in this way is a challenging combinatorial bilevel optimization problem, but it can be relaxed into a continuous optimization problem. Experiments on toy and real data demonstrate that our approach can identify training set bugs effectively and suggest appropriate changes to the labels. Our algorithm is a step toward trustworthy machine learning.
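The combinatorial formulation in the abstract above (find the smallest set of label changes such that the retrained model gets every trusted item right) can be illustrated by exhaustive search on a toy problem. This is only a sketch under loud assumptions: it uses brute-force enumeration rather than the continuous relaxation the abstract mentions, a 1-nearest-neighbour learner on 1-D features with binary labels, and the hypothetical names `predict_1nn` and `debug_labels`.

```python
from itertools import combinations


def predict_1nn(train, x):
    """Predict the label of x from its nearest training point (1-D features)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


def debug_labels(train, trusted):
    """Return indices of the smallest set of binary label flips that makes
    a 1-NN model trained on the corrected data agree with all trusted items.

    Assumption: exhaustive search, feasible only for tiny training sets.
    """
    n = len(train)
    for size in range(n + 1):  # try smaller flip sets first
        for idx in combinations(range(n), size):
            fixed = [(x, 1 - y) if i in idx else (x, y)
                     for i, (x, y) in enumerate(train)]
            if all(predict_1nn(fixed, tx) == ty for tx, ty in trusted):
                return list(idx)
    return None  # no relabeling satisfies the trusted items
```

The returned indices play the role of the flagged "potential bugs" that a human expert would then check.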
Beyond Link Prediction: Predicting Hyperlinks in Adjacency Space
Zhang, Muhan (Washington University in St. Louis) | Cui, Zhicheng (Washington University in St. Louis) | Jiang, Shali (Washington University in St. Louis) | Chen, Yixin (Washington University in St. Louis)
This paper addresses the hyperlink prediction problem in hypernetworks. Different from the traditional link prediction problem where only pairwise relations are considered as links, our task here is to predict the linkage of multiple nodes, i.e., a hyperlink. Each hyperlink is a set of an arbitrary number of nodes which together form a multiway relationship. Hyperlink prediction is challenging: since the cardinality of a hyperlink is variable, existing classifiers based on a fixed number of input features become infeasible. Heuristic methods, such as the common neighbors and Katz index, do not work for hyperlink prediction, since they are restricted to pairwise similarities. In this paper, we formally define the hyperlink prediction problem, and propose a new algorithm called Coordinated Matrix Minimization (CMM), which alternately performs nonnegative matrix factorization and least-squares matching in the vertex adjacency space of the hypernetwork, in order to infer a subset of candidate hyperlinks that are most suitable to fill the training hypernetwork. We evaluate CMM on two novel tasks: predicting recipes of Chinese food, and finding missing reactions of metabolic networks. Experimental results demonstrate the superior performance of our method over many seemingly promising baselines.
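To make the "vertex adjacency space" in the abstract above concrete, the sketch below projects hyperlinks of varying cardinality into a fixed-size node-by-node matrix by counting how often two nodes co-occur in a hyperlink. It is an illustrative sketch, not the CMM algorithm: the function name `adjacency_from_hyperlinks` is hypothetical, and leaving the diagonal at zero is an assumption.

```python
def adjacency_from_hyperlinks(n_nodes, hyperlinks):
    """Count, for every node pair, the number of hyperlinks containing both.

    Assumption: the diagonal is left at zero (no self-adjacency).
    """
    adjacency = [[0] * n_nodes for _ in range(n_nodes)]
    for link in hyperlinks:
        for u in link:
            for v in link:
                if u != v:
                    adjacency[u][v] += 1
    return adjacency
```

Whatever the size of each hyperlink, the result is always an `n_nodes` by `n_nodes` matrix, which sidesteps the variable-cardinality problem the abstract raises for fixed-input classifiers.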