Chen, Wei
Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework
Xu, Ran (State University of New York at Buffalo) | Xiong, Caiming (University of California, Los Angeles) | Chen, Wei (State University of New York at Buffalo) | Corso, Jason J (University of Michigan)
Recently, joint video-language modeling has been attracting more and more attention. However, most existing approaches focus on building the language model on top of a fixed visual model. In this paper, we propose a unified framework that jointly models videos and the corresponding text sentences. The framework consists of three parts: a compositional semantics language model, a deep video model and a joint embedding model. In our language model, we propose a dependency-tree structure model that embeds sentences into a continuous vector space, which preserves visually grounded meanings and word order. In the visual model, we leverage deep neural networks to capture essential semantic information from videos. In the joint embedding model, we minimize the distance between the outputs of the deep video model and the compositional language model in the joint space, and update the two models jointly. Based on these three parts, our system is able to accomplish three tasks: 1) natural language generation, 2) video retrieval, and 3) language retrieval. In the experiments, the results show that our approach outperforms SVM, CRF and CCA baselines in predicting Subject-Verb-Object triplets and in natural sentence generation, and is better than CCA in the video retrieval and language retrieval tasks.
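The joint embedding step can be sketched with a toy linear version. This is only an illustration of the idea of minimizing the distance between the two models' outputs in a shared space while updating both sides jointly; the dimensions, linear projections, and learning rate below are assumptions, not the paper's deep models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy stand-ins: a "deep video" feature and a "compositional
# sentence" feature are each mapped into a shared joint space by a
# linear projection (the paper uses deep models on both sides).
d_video, d_text, d_joint = 8, 6, 4
W_v = rng.normal(scale=0.1, size=(d_joint, d_video))   # video-side projection
W_t = rng.normal(scale=0.1, size=(d_joint, d_text))    # text-side projection

video_feat = rng.normal(size=d_video)   # output of the (toy) video model
text_feat = rng.normal(size=d_text)     # output of the (toy) language model

lr = 0.01
for _ in range(1000):
    v = W_v @ video_feat          # video embedding in the joint space
    t = W_t @ text_feat           # sentence embedding in the joint space
    diff = v - t                  # loss = 0.5 * ||v - t||^2
    # Gradient step on BOTH projections: the two sides are updated jointly.
    W_v -= lr * np.outer(diff, video_feat)
    W_t -= lr * np.outer(-diff, text_feat)

print(np.linalg.norm(W_v @ video_feat - W_t @ text_feat))  # distance shrinks toward 0
```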
Combinatorial Pure Exploration of Multi-Armed Bandits
Chen, Shouyuan, Lin, Tian, King, Irwin, Lyu, Michael R., Chen, Wei
We study the {\em combinatorial pure exploration (CPE)} problem in the stochastic multi-armed bandit setting, where a learner explores a set of arms with the objective of identifying the optimal member of a \emph{decision class}, which is a collection of subsets of arms with certain combinatorial structures such as size-$K$ subsets, matchings, spanning trees, or paths. The CPE problem represents a rich class of pure exploration tasks which covers not only many existing models but also novel cases where the object of interest has a non-trivial combinatorial structure. In this paper, we provide a series of results for the general CPE problem. We present general learning algorithms which work for all decision classes that admit offline maximization oracles in both fixed confidence and fixed budget settings. We prove problem-dependent upper bounds of our algorithms. Our analysis exploits the combinatorial structures of the decision classes and introduces a new analytic tool. We also establish a general problem-dependent lower bound for the CPE problem. Our results show that the proposed algorithms achieve the optimal sample complexity (within logarithmic factors) for many decision classes. In addition, applying our results back to the problems of top-$K$ arms identification and multiple bandit best arms identification, we recover the best available upper bounds up to constant factors and partially resolve a conjecture on the lower bounds.
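What "admits an offline maximization oracle" means can be illustrated for the simplest decision class, size-$K$ subsets. This is a minimal sketch, not the paper's learning algorithm: given a weight per arm, the oracle returns the best member of the decision class; matchings and spanning trees admit analogous max-weight oracles.

```python
import heapq

def topk_oracle(weights, K):
    """Offline maximization oracle for the size-K decision class:
    given a weight for each arm, return the size-K subset of arms with
    the largest total weight (here simply the K largest weights).
    For matchings or spanning trees, a max-weight matching or maximum
    spanning tree routine would play the same role."""
    return set(heapq.nlargest(K, range(len(weights)), key=lambda i: weights[i]))

# Usage: the learner feeds empirical mean rewards to the oracle.
means = [0.2, 0.9, 0.5, 0.7, 0.1]
print(topk_oracle(means, 2))  # {1, 3}
```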
Agent Behavior Prediction and Its Generalization Analysis
Tian, Fei (University of Science and Technology of China) | Li, Haifang (Chinese Academy of Sciences) | Chen, Wei (Microsoft Research) | Qin, Tao (Microsoft Research) | Chen, Enhong (University of Science and Technology of China) | Liu, Tie-Yan (Microsoft Research)
Machine learning algorithms have been applied to predict agent behaviors in real-world dynamic systems, such as advertiser behaviors in sponsored search and worker behaviors in crowdsourcing. Behavior data in these systems are generated by live agents: once a system changes due to the adoption of prediction models learnt from behavior data, agents will observe and respond to these changes by changing their own behaviors accordingly. Therefore, the evolving behavior data will not be independently and identically distributed, posing great challenges to theoretical analysis. To tackle this challenge, in this paper, we propose to use Markov Chains in Random Environments (MCRE) to describe the behavior data, and perform generalization analysis of machine learning algorithms on this basis. We propose a novel technique that transforms the original time-variant MCRE into a higher-dimensional time-homogeneous Markov chain, which is easier to deal with. We prove the convergence of the new Markov chain as time approaches infinity. Then we obtain a generalization bound for machine learning algorithms on the behavior data generated by the new Markov chain. To the best of our knowledge, this is the first work that performs generalization analysis on data generated by complex processes in real-world dynamic systems.
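The lifting idea, making a chain whose transition kernel depends on a changing environment time-homogeneous by enlarging the state, can be sketched as follows. The two-state dynamics below are assumptions for illustration only, not the paper's construction or its dimensions.

```python
import random

random.seed(0)

def env_step(e):
    """The environment evolves on its own (a two-state Markov chain)."""
    return 1 - e if random.random() < 0.3 else e

def agent_step(s, e):
    """The agent's transition kernel DEPENDS on the current environment,
    so the agent chain alone is time-variant."""
    flip = 0.1 if e == 0 else 0.6
    return 1 - s if random.random() < flip else s

def lifted_step(state):
    """One transition of the lifted chain on pairs (s, e).  Tracking the
    pair makes the kernel the same at every step: time-homogeneous."""
    s, e = state
    return (agent_step(s, e), env_step(e))

state = (0, 0)
for _ in range(5):
    state = lifted_step(state)   # a trajectory of the lifted chain
```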
A Game-Theoretic Machine Learning Approach for Revenue Maximization in Sponsored Search
He, Di (Peking University) | Chen, Wei (Microsoft Research Asia) | Wang, Liwei (Peking University) | Liu, Tie-Yan (Microsoft Research Asia)
Sponsored search is an important monetization channel for search engines, in which an auction mechanism is used to select the ads shown to users and determine the prices charged to advertisers. Several pieces of work in the literature investigate how to design an auction mechanism in order to optimize the revenue of the search engine. However, due to the unrealistic assumptions they rely on, the practical value of these studies is not very clear. In this paper, we propose a novel \emph{game-theoretic machine learning} approach, which naturally combines machine learning and game theory, and learns the auction mechanism using a bilevel optimization framework. In particular, we first learn a Markov model from historical data to describe how advertisers change their bids in response to an auction mechanism, and then, for any given auction mechanism, we use the learnt model to predict its corresponding future bid sequences. Next, we learn the auction mechanism through empirical revenue maximization on the predicted bid sequences. We show that the empirical revenue converges as the prediction period approaches infinity, and that a Genetic Programming algorithm can effectively optimize this empirical revenue. Our experiments indicate that the proposed approach produces a much more effective auction mechanism than several baselines.
A Concave Conjugate Approach for Nonconvex Penalized Regression with the MCP Penalty
Zhang, Shubao (Zhejiang University) | Qian, Hui (Zhejiang University) | Chen, Wei (Zhejiang University) | Zhang, Zhihua (Zhejiang University)
The minimax concave penalty (MCP) has been demonstrated to be effective in nonconvex penalization for feature selection. In this paper we propose a novel construction approach for MCP. In particular, we show that MCP can be derived from a concave conjugate of the Euclidean distance function. This construction approach in turn leads us to an augmented Lagrange multiplier method for solving the penalized regression problem with MCP. In our method each tuning parameter corresponds to a feature, and these tuning parameters can be updated automatically. We also develop a d.c. (difference of convex functions) programming approach for the penalized regression problem. We find that the augmented Lagrange multiplier method degenerates into the d.c. programming method under specific conditions. Experimental analysis is conducted on a set of simulated data, and the results are encouraging.
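The abstract does not restate the penalty itself; for reference, the standard MCP definition can be written down directly (the paper's concave-conjugate derivation and augmented Lagrange solver are not reproduced here):

```python
def mcp(t, lam, gamma):
    """Minimax concave penalty (standard definition):
        mcp(t) = lam*|t| - t^2 / (2*gamma)   if |t| <= gamma*lam
               = gamma*lam^2 / 2             otherwise.
    Near zero it behaves like the L1 penalty lam*|t|; for large |t|
    it flattens out at gamma*lam^2/2, which reduces the bias that L1
    imposes on large coefficients."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return gamma * lam * lam / 2.0

# The two branches meet continuously at |t| = gamma*lam:
print(mcp(2.0, lam=1.0, gamma=2.0))   # boundary value gamma*lam^2/2 = 1.0
```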
A Theoretical Analysis of NDCG Type Ranking Measures
Wang, Yining, Wang, Liwei, Li, Yuanzhi, He, Di, Liu, Tie-Yan, Chen, Wei
A central problem in ranking is to design a ranking measure for the evaluation of ranking functions. In this paper we study, from a theoretical perspective, the widely used Normalized Discounted Cumulative Gain (NDCG)-type ranking measures. Although there are extensive empirical studies of NDCG, little is known about its theoretical properties. We first show that, whatever the ranking function is, the standard NDCG, which adopts a logarithmic discount, converges to 1 as the number of items to rank goes to infinity. At first sight, this result is very surprising. It seems to imply that NDCG cannot differentiate between good and bad ranking functions, contradicting the empirical success of NDCG in many applications. In order to have a deeper understanding of ranking measures in general, we propose a notion referred to as consistent distinguishability. This notion captures the intuition that a ranking measure should have the following property: for every pair of substantially different ranking functions, the ranking measure can decide which one is better in a consistent manner on almost all datasets. We show that NDCG with logarithmic discount has consistent distinguishability although it converges to the same limit for all ranking functions. We next characterize the set of all feasible discount functions for NDCG according to the concept of consistent distinguishability. Specifically, we show that whether NDCG has consistent distinguishability depends on how fast the discount decays, and that 1/r is a critical point. We then turn to the cut-off version of NDCG, i.e., NDCG@k. We analyze the distinguishability of NDCG@k for various choices of k and of the discount functions. Experimental results on real Web search datasets agree well with the theory.
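The convergence claim can be checked numerically. The setup below is an illustrative assumption (binary relevance, half the items relevant, and the worst possible ranking that puts every relevant item last); even then, NDCG with the logarithmic discount creeps toward 1 as the number of items grows.

```python
import math

def ndcg_log_discount(ranked_gains, ideal_gains):
    """NDCG with the standard logarithmic discount 1/log2(1+r)."""
    dcg = sum(g / math.log2(1 + r) for r, g in enumerate(ranked_gains, start=1))
    idcg = sum(g / math.log2(1 + r) for r, g in enumerate(ideal_gains, start=1))
    return dcg / idcg

def worst_case_ndcg(n):
    """n items, half relevant (gain 1); the ranking puts every relevant
    item LAST -- the worst possible ranking for this data."""
    m = n // 2
    ranked = [0] * m + [1] * m      # relevant items at the bottom
    ideal = [1] * m + [0] * m       # relevant items at the top
    return ndcg_log_discount(ranked, ideal)

# Even for the worst ranking, NDCG rises toward 1 as n grows:
for n in (100, 10_000, 1_000_000):
    print(n, round(worst_case_ndcg(n), 3))
```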
Using Automatic Question Generation to Evaluate Questions Generated by Children
Chen, Wei (Carnegie Mellon University) | Mostow, Jack (Carnegie Mellon University) | Aist, Gregory (Iowa State University)
This paper shows that automatically generated questions can help classify children’s spoken responses to a reading tutor that teaches them to generate their own questions. We use automatic question generation to model and classify children’s prompted spoken questions about stories. In distinguishing complete and incomplete questions from irrelevant speech and silence, a language model built from automatically generated questions outperforms a trigram language model that does not exploit the structure of questions.
Participation Maximization Based on Social Influence in Online Discussion Forums
Sun, Tao (Peking University and Microsoft Research Asia) | Chen, Wei (Microsoft Research Asia) | Liu, Zhenming (Harvard School of Engineering and Applied Sciences and Microsoft Research Asia) | Wang, Yajun (Microsoft Research Asia) | Sun, Xiaorui (Shanghai Jiaotong University and Microsoft Research Asia) | Zhang, Ming (Peking University) | Lin, Chin-Yew (Microsoft Research Asia)
In online discussion forums, users are more motivated to take part in discussions when they observe other users’ participation, an effect of social influence among forum users. In this paper, we study how to utilize social influence to increase overall forum participation. To this end, we propose a mechanism that maximizes user influence and boosts participation by displaying forum threads to users. We formally define the participation maximization problem, show that it is a special instance of the social welfare maximization problem with submodular utility functions, and prove that it is NP-hard. However, generic approximation algorithms are impractical for real-world forums due to their time complexity. We therefore design a heuristic algorithm, named Thread Allocation Based on Influence (TABI), to tackle the problem. Through extensive experiments on a dataset from a real-world online forum, we demonstrate that TABI consistently outperforms all other algorithms in maximizing participation. The results of this work demonstrate that current recommender systems can be made more effective by considering future influence propagations. The problem of participation maximization based on influence also opens a new direction in the study of social influence.
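For intuition, the generic approximation approach the abstract refers to is greedy maximization of a submodular welfare objective. The sketch below uses an assumed coverage-style utility and toy names; TABI itself is an influence-based heuristic and is not reproduced here.

```python
def greedy_allocate(user_topics, thread_topics, slots_per_user):
    """Generic greedy for submodular welfare: repeatedly make the single
    (user, thread) assignment with the largest marginal utility.  The
    per-user utility here is topic coverage, which is submodular."""
    shown = {u: set() for u in user_topics}      # threads shown to each user
    covered = {u: set() for u in user_topics}    # topics already covered per user

    def gain(u, th):                             # marginal coverage utility
        return len(thread_topics[th] & (user_topics[u] - covered[u]))

    while True:
        best = max(((gain(u, th), u, th)
                    for u in user_topics if len(shown[u]) < slots_per_user
                    for th in thread_topics if th not in shown[u]),
                   default=(0, None, None))
        g, u, th = best
        if g <= 0:                               # no assignment helps any more
            return shown
        shown[u].add(th)
        covered[u] |= thread_topics[th] & user_topics[u]

# Usage on a toy instance (all names are illustrative):
users = {"u1": {"a", "b"}, "u2": {"b", "c"}}
threads = {"t1": {"a"}, "t2": {"b", "c"}, "t3": {"c"}}
alloc = greedy_allocate(users, threads, slots_per_user=1)
```

The inner `max` scans every (user, thread) pair on every step, which is exactly the kind of cost that becomes impractical at real forum scale.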
Two-Layer Generalization Analysis for Ranking Using Rademacher Average
Chen, Wei, Liu, Tie-Yan, Ma, Zhi-Ming
This paper is concerned with generalization analysis of learning to rank for information retrieval (IR). In IR, data are hierarchically organized, i.e., consisting of queries and documents per query. Previous generalization analysis for ranking, however, has not fully considered this structure, and cannot explain how the simultaneous change of query number and document number in the training data will affect the performance of algorithms. In this paper, we propose performing generalization analysis under the assumption of two-layer sampling, i.e., the i.i.d. sampling of queries and the conditional i.i.d. sampling of documents per query. Such a sampling scheme better describes the generation mechanism of real data, and the corresponding generalization analysis can better explain the real behaviors of learning to rank algorithms. However, it is challenging to perform such analysis, because the documents associated with different queries are not identically distributed, and the documents associated with the same query are no longer independent once they are represented by features extracted from the matching between document and query. To tackle the challenge, we decompose the generalization error according to the two layers, and make use of the new concept of a two-layer Rademacher average. The generalization bounds we obtain are quite intuitive and are in accordance with previous empirical studies on the performance of ranking algorithms.
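The two-layer sampling assumption can be sketched directly: queries are drawn i.i.d. from a query distribution, then documents are drawn conditionally i.i.d. given the query. The toy query and document distributions below are assumptions for illustration.

```python
import random

random.seed(0)

def sample_query():
    """Layer 1: queries are i.i.d. draws from a query distribution."""
    return random.choice(["q1", "q2", "q3"])

def sample_document(query):
    """Layer 2: documents are conditionally i.i.d. GIVEN the query.
    The document distribution depends on the query, which is why
    documents from different queries are not identically distributed."""
    mean = {"q1": 0.0, "q2": 1.0, "q3": 2.0}[query]
    return (query, random.gauss(mean, 1.0))

def two_layer_sample(n_queries, docs_per_query):
    data = []
    for _ in range(n_queries):
        q = sample_query()
        data.append((q, [sample_document(q) for _ in range(docs_per_query)]))
    return data

dataset = two_layer_sample(n_queries=3, docs_per_query=4)
```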
Ranking Measures and Loss Functions in Learning to Rank
Chen, Wei, Liu, Tie-Yan, Lan, Yanyan, Ma, Zhi-Ming, Li, Hang
Learning to rank has become an important research topic in machine learning. While most learning-to-rank methods learn the ranking function by minimizing a loss function, it is the ranking measures (such as NDCG and MAP) that are used to evaluate the performance of the learned ranking function. In this work, we reveal the relationship between ranking measures and loss functions in learning-to-rank methods, such as Ranking SVM, RankBoost, RankNet, and ListMLE. We show that these loss functions are upper bounds of the measure-based ranking errors. As a result, the minimization of these loss functions will lead to the maximization of the ranking measures. The key to obtaining this result is to model ranking as a sequence of classification tasks, and to define a so-called essential loss as the weighted sum of the classification errors of the individual tasks in the sequence. We prove that the essential loss is both an upper bound of the measure-based ranking errors, and a lower bound of the loss functions in the aforementioned methods. Our proof technique also suggests a way to modify existing loss functions to make them tighter bounds of the measure-based ranking errors. Experimental results on benchmark datasets show that the modifications can lead to better ranking performance, demonstrating the correctness of our analysis.
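The flavor of the upper-bound result can be illustrated at the pairwise level with the Ranking SVM-style hinge surrogate. This is a generic illustration, not the paper's essential-loss construction: the hinge loss on a score difference always dominates the 0-1 misranking indicator, so the summed surrogate upper-bounds the summed error.

```python
def pairwise_error(scores, better, worse):
    """0-1 error: 1 if the pair is misranked (ties count as errors)."""
    return 1.0 if scores[better] <= scores[worse] else 0.0

def pairwise_hinge(scores, better, worse):
    """Ranking SVM-style hinge surrogate on the score difference.
    If the difference d = s(better) - s(worse) is <= 0, then
    max(0, 1 - d) >= 1 >= the 0-1 error; if d > 0 the error is 0
    and the hinge is >= 0.  So hinge >= error for every pair."""
    return max(0.0, 1.0 - (scores[better] - scores[worse]))

# Toy example: d1 should rank above d2, but the scores disagree.
scores = {"d1": 0.3, "d2": 0.8}
pairs = [("d1", "d2")]
err = sum(pairwise_error(scores, b, w) for b, w in pairs)
sur = sum(pairwise_hinge(scores, b, w) for b, w in pairs)
assert err <= sur    # the surrogate upper-bounds the ranking error
```

Driving the surrogate to zero forces every pair to be ranked correctly with margin, which is the mechanism behind "minimizing the loss maximizes the measure".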