Alibaba Group
Adversarial Learning for Chinese NER From Crowd Annotations
Yang, YaoSheng (Soochow University) | Zhang, Meishan (Heilongjiang University) | Chen, Wenliang (Soochow University) | Zhang, Wei (Alibaba Group) | Wang, Haofen (Shenzhen Gowild Robotics Co. Ltd) | Zhang, Min (Soochow University)
To obtain new labeled data quickly, we can turn to crowdsourcing as a low-cost alternative. In exchange, however, crowd annotations from non-experts may be of lower quality than those from experts. In this paper, we propose an approach to crowd-annotation learning for Chinese Named Entity Recognition (NER) that makes full use of the noisy sequence labels from multiple annotators. Inspired by adversarial learning, our approach uses a common Bi-LSTM and a private Bi-LSTM to represent annotator-generic and annotator-specific information, respectively. The annotator-generic information is the common knowledge about entities that is easily mastered by the crowd. Finally, we build our Chinese NE tagger on top of the LSTM-CRF model. In our experiments, we create two datasets for Chinese NER from two domains. The experimental results show that our system achieves better scores than strong baseline systems.
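A minimal PyTorch sketch of the common/private Bi-LSTM idea, assuming a softmax tagging head in place of the CRF layer the paper uses; the class name CrowdNERSketch, the layer sizes, and the gradient-reversal discriminator are illustrative stand-ins for the adversarial component, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, grad):
        return -grad  # flipped gradient: the common encoder learns to fool the discriminator

class CrowdNERSketch(nn.Module):
    def __init__(self, vocab_size, emb=100, hidden=200, n_tags=9, n_workers=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.common = nn.LSTM(emb, hidden // 2, bidirectional=True, batch_first=True)
        self.private = nn.LSTM(emb, hidden // 2, bidirectional=True, batch_first=True)
        self.tagger = nn.Linear(2 * hidden, n_tags)        # stand-in for the CRF layer
        self.discriminator = nn.Linear(hidden, n_workers)  # guesses which annotator labeled the sentence

    def forward(self, tokens):
        x = self.embed(tokens)   # (batch, seq, emb)
        c, _ = self.common(x)    # annotator-generic features
        p, _ = self.private(x)   # annotator-specific features
        tag_scores = self.tagger(torch.cat([c, p], dim=-1))
        worker_logits = self.discriminator(GradReverse.apply(c).mean(dim=1))
        return tag_scores, worker_logits
```

Training the discriminator to identify the annotator while the reversed gradient pushes the common encoder to hide that identity is one standard way to keep annotator-specific noise out of the shared representation.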
Extremely Low Bit Neural Network: Squeeze the Last Bit Out With ADMM
Leng, Cong (Alibaba Group) | Dou, Zesheng (Alibaba Group) | Li, Hao (Alibaba Group) | Zhu, Shenghuo (Alibaba Group) | Jin, Rong (Alibaba Group)
Although deep learning models are highly effective for various learning tasks, their high computational cost prohibits their deployment in scenarios where memory or computational resources are limited. In this paper, we focus on compressing and accelerating deep models whose network weights are represented by a very small number of bits, referred to as extremely low bit neural networks. We model this problem as a discretely constrained optimization problem. Borrowing the idea of the Alternating Direction Method of Multipliers (ADMM), we decouple the continuous parameters from the discrete constraints of the network and cast the original hard problem into several subproblems. We propose to solve these subproblems using extragradient and iterative quantization algorithms, which lead to considerably faster convergence than conventional optimization methods. Extensive experiments on image recognition and object detection verify that the proposed algorithm is more effective than state-of-the-art approaches for extremely low bit neural networks.
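A minimal numpy sketch of the ADMM decoupling described above, specialized to ternary weights; the projection routine, thresholds, step sizes, and variable names are illustrative assumptions, and the extragradient refinement and per-layer handling are omitted.

```python
import numpy as np

def project_ternary(w):
    """Project weights onto {-a, 0, +a}: alternately fix the discrete codes and the scale."""
    a = np.abs(w).mean()
    q = np.zeros_like(w)
    for _ in range(10):                          # iterative quantization
        q = np.sign(w) * (np.abs(w) > a / 2)     # codes in {-1, 0, +1}
        nz = q != 0
        if nz.any():
            a = np.abs(w[nz]).mean()             # closed-form optimal scale for fixed codes
    return a * q

def admm_step(w, q, lam, grad, lr=1e-3, rho=1e-4):
    """One ADMM iteration: gradient step on the augmented Lagrangian, projection, dual update."""
    w = w - lr * (grad + rho * (w - q + lam))    # update continuous weights
    q = project_ternary(w + lam)                 # update discrete weights
    lam = lam + w - q                            # dual update
    return w, q, lam
```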
ATRank: An Attention-Based User Behavior Modeling Framework for Recommendation
Zhou, Chang (Alibaba Group) | Bai, Jinze (Peking University) | Song, Junshuai (Peking University) | Liu, Xiaofei (Alibaba Group) | Zhao, Zhengchao (Alibaba Group) | Chen, Xiusi (Peking University) | Gao, Jun (Peking University)
A user can be represented by what he or she does over time. A common way to approach the user modeling problem is to manually extract all kinds of aggregated features over the heterogeneous behaviors, which may fail to fully represent the data itself given the limits of human intuition. Recent works typically use RNN-based methods to produce an overall embedding of a behavior sequence, which can then be exploited by downstream applications. However, this preserves only very limited information, or aggregated memories, of a person. When a downstream application needs to use the modeled user features, it may lose the specific, highly correlated behaviors of the user and pick up noise from unrelated behaviors. This paper proposes an attention-based user behavior modeling framework called ATRank, which we mainly use for recommendation tasks. Our model handles heterogeneous user behaviors by projecting all types of behaviors into multiple latent semantic spaces, where behaviors influence one another via self-attention. Downstream applications can then read out user behavior vectors via vanilla attention. Experiments show that ATRank achieves better performance and a faster training process. We further extend ATRank into one unified model that predicts different types of user behaviors at the same time, achieving performance comparable to highly optimized individual models.
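A minimal PyTorch sketch of the two attention stages, assuming a single shared projection where the paper projects each behavior type into its own latent semantic space; ATRankSketch and all dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ATRankSketch(nn.Module):
    def __init__(self, d_in, d_latent=64):
        super().__init__()
        self.project = nn.Linear(d_in, d_latent)  # the paper uses one projection per behavior type
        self.self_attn = nn.MultiheadAttention(d_latent, num_heads=4, batch_first=True)
        self.query = nn.Linear(d_latent, d_latent)

    def forward(self, behaviors, target):
        # behaviors: (batch, seq, d_in) heterogeneous history; target: (batch, d_in) candidate item
        h = self.project(behaviors)
        h, _ = self.self_attn(h, h, h)                         # influence among behaviors
        q = self.query(self.project(target)).unsqueeze(1)      # (batch, 1, d_latent)
        scores = torch.matmul(q, h.transpose(1, 2)) / h.size(-1) ** 0.5
        user_vec = (F.softmax(scores, dim=-1) @ h).squeeze(1)  # vanilla-attention readout
        return user_vec
```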
Tau-FPL: Tolerance-Constrained Learning in Linear Time
Zhang, Ao (East China Normal University) | Li, Nan (Alibaba Group) | Pu, Jian (East China Normal University) | Wang, Jun (East China Normal University) | Yan, Junchi (IBM Research – China) | Zha, Hongyuan (East China Normal University)
In many real-world applications, it is appealing to learn a classifier whose false-positive rate stays under a specified tolerance. Existing approaches either introduce label costs that depend on prior knowledge or tune parameters on top of traditional classifiers; both are methodologically limited because they do not directly incorporate the false-positive rate tolerance. In this paper, we propose a novel scoring-thresholding approach, tau-False Positive Learning (tau-FPL), to address this problem. We show that the scoring problem, which takes the false-positive rate tolerance into account, can be solved efficiently in linear time, and that an out-of-bootstrap thresholding method can transform the learned ranking function into a low false-positive classifier. Both theoretical analysis and experimental results show the superior performance of the proposed tau-FPL over existing approaches.
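A minimal numpy sketch of the out-of-bootstrap thresholding idea, assuming scores for negative examples are already available from a learned ranking function; this quantile-based routine is illustrative and is not the paper's linear-time scoring algorithm.

```python
import numpy as np

def fpl_threshold(neg_scores, tau):
    """Threshold at the (1 - tau) quantile of negative scores, so the empirical
    false-positive rate stays under the tolerance tau."""
    return np.quantile(neg_scores, 1.0 - tau)

def out_of_bootstrap_threshold(neg_scores, tau, n_boot=100, seed=None):
    """Average thresholds computed on out-of-bootstrap negatives, reducing the
    optimistic bias of thresholding on the training scores themselves."""
    rng = np.random.default_rng(seed)
    n = len(neg_scores)
    thresholds = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)               # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)     # indices left out of the bootstrap
        if oob.size:
            thresholds.append(fpl_threshold(neg_scores[oob], tau))
    return float(np.mean(thresholds))
```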
SEE: Syntax-Aware Entity Embedding for Neural Relation Extraction
He, Zhengqiu (Soochow University) | Chen, Wenliang (Soochow University) | Li, Zhenghua (Soochow University) | Zhang, Meishan (Heilongjiang University) | Zhang, Wei (Alibaba Group) | Zhang, Min (Soochow University)
Distantly supervised relation extraction is an efficient approach to scaling relation extraction to very large corpora, and has been widely used to find novel relational facts in plain text. Recent studies on neural relation extraction have made great progress on this task by modeling sentences in low-dimensional spaces, but have seldom used syntactic information to model the entities. In this paper, we propose to learn syntax-aware entity embeddings for neural relation extraction. First, we encode the context of an entity on a dependency tree as a sentence-level entity embedding based on a tree-GRU. Then, we use both intra-sentence and inter-sentence attention to obtain a sentence-set-level entity embedding over all sentences containing the focus entity pair. Finally, we combine the sentence embedding and entity embedding for relation classification. We conduct experiments on a widely used real-world dataset, and the results show that our model can make full use of all informative instances and achieves state-of-the-art relation extraction performance.
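A minimal PyTorch sketch of a bottom-up tree encoder in the spirit of the tree-GRU, assuming a child-sum simplification; TreeGRUSketch and its recursion over a children table are illustrative, not the paper's exact formulation, and the attention layers are omitted.

```python
import torch
import torch.nn as nn

class TreeGRUSketch(nn.Module):
    """Encode an entity's dependency-tree context bottom-up: each node's state is a
    GRU update of its word vector given the sum of its children's states."""
    def __init__(self, d=100):
        super().__init__()
        self.cell = nn.GRUCell(d, d)

    def encode(self, node, word_vecs, children):
        # word_vecs: (n_nodes, d); children: dict mapping node index -> list of child indices
        h = sum((self.encode(c, word_vecs, children) for c in children[node]),
                torch.zeros(word_vecs.size(1)))
        return self.cell(word_vecs[node].unsqueeze(0), h.unsqueeze(0)).squeeze(0)
```

The embedding of the subtree rooted at an entity's head word then serves as that entity's sentence-level embedding, over which intra- and inter-sentence attention can be applied.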
Probabilistic Ensemble of Collaborative Filters
Min, Zhiyu (Alibaba Group) | Lin, Dahua (The Chinese University of Hong Kong)
Collaborative filtering is an important technique for recommendation. While it has been repeatedly shown to be effective in previous work, its performance remains unsatisfactory in many real-world applications, especially those where the items or users are highly diverse. In this paper, we explore an ensemble-based framework to enhance the capability of a recommender in handling diverse data. Specifically, we formulate a probabilistic model which integrates the items, the users, and the associations between them into a generative process. On top of this formulation, we derive a progressive algorithm to construct an ensemble of collaborative filters. In each iteration, a new filter is derived from re-weighted entries and incorporated into the ensemble. It is noteworthy that while the procedure of our algorithm is superficially similar to boosting, it is derived from an essentially different formulation and thus differs in several key technical aspects. We tested the proposed method on three large datasets and observed substantial improvement over the state of the art, including L2Boost, an effective method based on boosting.
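A minimal numpy sketch of the progressive, re-weighted construction, assuming a weighted matrix-factorization base filter; this boosting-like loop and its re-weighting rule are illustrative and omit the paper's generative probabilistic formulation.

```python
import numpy as np

def weighted_mf(R, W, k=10, iters=50, lr=0.01, reg=0.1, seed=None):
    """Weighted matrix factorization: entry weights W focus the new filter on
    the entries the current ensemble explains poorly."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((R.shape[0], k))
    V = 0.1 * rng.standard_normal((R.shape[1], k))
    for _ in range(iters):
        E = W * (R - U @ V.T)             # weighted residual
        U += lr * (E @ V - reg * U)
        V += lr * (E.T @ U - reg * V)
    return U, V

def progressive_ensemble(R, mask, rounds=5):
    """Each round fits a new filter on re-weighted entries and adds it to the ensemble."""
    W, ensemble = mask.astype(float), []  # start with uniform weight on observed entries
    for _ in range(rounds):
        ensemble.append(weighted_mf(R, W))
        preds = np.mean([u @ v.T for u, v in ensemble], axis=0)
        err = np.abs(R - preds) * mask
        W = mask * (err / (err.sum() + 1e-12))  # shift weight toward poorly-fit entries
    return ensemble
```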
Improved English to Russian Translation by Neural Suffix Prediction
Song, Kai (Soochow University, Alibaba Group) | Zhang, Yue (Singapore University of Technology and Design) | Zhang, Min (Soochow University) | Luo, Weihua (Alibaba Group)
Neural machine translation (NMT) suffers a performance deficiency when a limited vocabulary fails to cover the source or target side adequately, which happens frequently with morphologically rich languages. To address this problem, previous work focused on adjusting the translation granularity or expanding the vocabulary size. However, morphological information remains under-exploited in NMT architectures, even though it could further improve translation quality. We propose a novel method that not only reduces data sparsity but also models morphology through a simple yet effective mechanism. By predicting the stem and suffix separately during decoding, our system achieves an improvement of up to 1.98 BLEU over previous work on English to Russian translation. Our method is orthogonal to different NMT architectures and yields stable improvements across various domains.
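A minimal PyTorch sketch of one decoding step with separate stem and suffix predictions; feeding the chosen stem into the suffix head, the greedy argmax, and all names are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class StemSuffixDecoderStep(nn.Module):
    """Factor each target word into a stem and a suffix, predicted by two softmax
    heads; the suffix head also conditions on the chosen stem."""
    def __init__(self, d_hidden, n_stems, n_suffixes, d_stem=64):
        super().__init__()
        self.stem_out = nn.Linear(d_hidden, n_stems)
        self.stem_emb = nn.Embedding(n_stems, d_stem)
        self.suffix_out = nn.Linear(d_hidden + d_stem, n_suffixes)

    def forward(self, state):
        # state: (batch, d_hidden) decoder hidden state at this step
        stem_logits = self.stem_out(state)
        stem = stem_logits.argmax(dim=-1)   # greedy choice, for the sketch only
        suffix_logits = self.suffix_out(torch.cat([state, self.stem_emb(stem)], dim=-1))
        return stem_logits, suffix_logits
```

Splitting each word into a frequent stem and a small closed set of suffixes shrinks both output vocabularies, which is the source of the data-sparsity reduction the abstract describes.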
Scalable Graph Embedding for Asymmetric Proximity
Zhou, Chang (Peking University) | Liu, Yuqiong (Peking University) | Liu, Xiaofei (Alibaba Group) | Liu, Zhongyi (Alibaba Group) | Gao, Jun (Peking University)
Graph embedding methods aim to map each vertex into a low-dimensional vector space that preserves certain structural relationships among the vertices of the original graph. Recently, several works have proposed learning embeddings from sampled paths in the graph, e.g., DeepWalk, LINE, and node2vec. However, these methods preserve only symmetric proximities, which can be insufficient in many applications, even when the underlying graph is undirected. Moreover, they lack theoretical analysis of exactly what relationships they preserve in their embedding spaces. In this paper, we propose an asymmetric proximity preserving (APP) graph embedding method based on random walk with restart, which captures both asymmetric and high-order similarities between node pairs. We give a theoretical analysis showing that our method implicitly preserves the Rooted PageRank score for any two vertices. We conduct extensive experiments on link prediction and node recommendation on open-source datasets, as well as on online recommendation services in Alibaba Group, where the training graph has over 290 million vertices and 18 billion edges, showing our method to be highly scalable and effective.
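A minimal numpy sketch of the sampling and update, assuming skip-gram-style training with negative sampling; the restart probability and function names are illustrative. Keeping separate source and target embedding matrices is what makes the learned proximity asymmetric.

```python
import numpy as np

def sample_pair(adj, start, restart=0.15, rng=None):
    """Random walk with restart from `start`; the vertex where the walk stops
    becomes the target of the directed (source, target) training pair."""
    rng = rng if rng is not None else np.random.default_rng()
    v = start
    while True:
        if rng.random() < restart:
            return start, v
        nbrs = adj[v]                       # adjacency list: vertex -> list of neighbors
        if not nbrs:
            return start, v
        v = nbrs[rng.integers(len(nbrs))]

def sgd_step(S, T, s, t, negatives, lr=0.025):
    """Skip-gram update with negative sampling; S holds source vectors, T target
    vectors, so score(s, t) = S[s] . T[t] need not equal score(t, s)."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    for u, label in [(t, 1.0)] + [(n, 0.0) for n in negatives]:
        g = label - sigmoid(S[s] @ T[u])
        S[s], T[u] = S[s] + lr * g * T[u], T[u] + lr * g * S[s]
```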
Robust Manifold Matrix Factorization for Joint Clustering and Feature Extraction
Zhang, Lefei (Wuhan University) | Zhang, Qian (Alibaba Group) | Du, Bo (Wuhan University) | Tao, Dacheng (University of Technology Sydney) | You, Jane (The Hong Kong Polytechnic University)
Low-rank matrix approximation has been widely used for data subspace clustering and feature representation in many computer vision and pattern recognition applications. However, to enhance discriminability, most matrix-approximation-based feature extraction algorithms first generate cluster labels with some clustering algorithm (e.g., k-means) and then perform the matrix approximation guided by that label information. In addition, under the conventional ℓ2-norm-based squared-residue minimization, noises and outliers in the dataset with large reconstruction errors easily dominate the objective function. In this paper, we propose a novel clustering and feature extraction algorithm based on a unified low-rank matrix factorization framework, which suggests that the observed data matrix can be approximated by the product of a projection matrix and a low-dimensional representation, where the low-dimensional representation itself can be approximated by the cluster indicator and a latent feature matrix simultaneously. Furthermore, we propose using the ℓ2,1-norm and integrating manifold regularization to further improve the model. A novel Augmented Lagrangian Method (ALM) based procedure is designed to seek the optimal solution effectively and efficiently. Experimental results from both the clustering and feature extraction perspectives demonstrate the superior performance of the proposed method.
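A short numpy illustration of why the ℓ2,1-norm is more robust than a squared Frobenius loss, using the IRLS-style column weights that commonly appear inside ALM-type solvers; the helper names are illustrative, not the paper's procedure.

```python
import numpy as np

def l21_norm(E):
    """ℓ2,1 norm: sum of column-wise ℓ2 norms of the residual E = X - P @ Z, so a
    single outlier column grows the loss linearly, not quadratically."""
    return np.linalg.norm(E, axis=0).sum()

def l21_column_weights(E, eps=1e-8):
    """IRLS-style weights 1 / (2 ||e_i||) for minimizing the ℓ2,1 loss: columns
    (samples) with large reconstruction error are down-weighted as outliers."""
    return 1.0 / (2.0 * np.maximum(np.linalg.norm(E, axis=0), eps))
```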
A Two-Stage Approach for Learning a Sparse Model with Sharp Excess Risk Analysis
Li, Zhe (The University of Iowa) | Yang, Tianbao (The University of Iowa) | Zhang, Lijun (Nanjing University) | Jin, Rong (Alibaba Group)
This paper aims to provide a sharp excess risk guarantee for learning a sparse linear model without any assumptions on the strong convexity of the expected loss or the sparsity of the optimal solution in hindsight. Given a target level ε for the excess risk, an interesting question is how many examples, and how large a support set of the solution, suffice for learning a good model with the target excess risk. To answer this question, we present a two-stage algorithm: (i) in the first stage, an epoch-based stochastic optimization algorithm is used, with an established O(1/ε) bound on the sample complexity; and (ii) in the second stage, a distribution-dependent randomized sparsification is applied, with an O(1/ε) bound on the sparsity (referred to as support complexity) of the resulting model. Compared to previous work, our contributions are that (i) we reduce the order of the sample complexity from O(1/ε²) to O(1/ε) without the strong convexity assumption; and (ii) we reduce the constant in the O(1/ε) sparsity bound by exploiting distribution-dependent sampling.
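A minimal numpy sketch of stage (ii), randomized sparsification, assuming sampling probabilities proportional to |w_i|; the exact distribution-dependent sampling in the paper may differ, and the function name is illustrative.

```python
import numpy as np

def randomized_sparsify(w, s, seed=None):
    """Draw s coordinates with probability proportional to |w_i| and rescale by
    importance weights, so the sparse output is an unbiased estimate of w with
    at most s nonzero entries."""
    rng = np.random.default_rng(seed)
    p = np.abs(w) / np.abs(w).sum()
    out = np.zeros_like(w)
    for _ in range(s):
        i = rng.choice(len(w), p=p)
        out[i] += w[i] / (s * p[i])   # E[out] = w, since each draw contributes p_i * w_i / (s p_i)
    return out
```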