Goto

Collaborating Authors

 Clustering


Meaningful Answer Generation of E-Commerce Question-Answering

arXiv.org Artificial Intelligence

In e-commerce portals, generating answers for product-related questions has become a crucial task. In this paper, we focus on the task of product-aware answer generation, which learns to generate an accurate and complete answer from large-scale unlabeled e-commerce reviews and product attributes. However, safe answer problems pose significant challenges to text generation tasks, and e-commerce question-answering task is no exception. To generate more meaningful answers, in this paper, we propose a novel generative neural model, called the Meaningful Product Answer Generator (MPAG), which alleviates the safe answer problem by taking product reviews, product attributes, and a prototype answer into consideration. Product reviews and product attributes are used to provide meaningful content, while the prototype answer can yield a more diverse answer pattern. To this end, we propose a novel answer generator with a review reasoning module and a prototype answer reader. Our key idea is to obtain the correct question-aware information from a large scale collection of reviews and learn how to write a coherent and meaningful answer from an existing prototype answer. To be more specific, we propose a read-and-write memory consisting of selective writing units to conduct reasoning among these reviews. We then employ a prototype reader consisting of comprehensive matching to extract the answer skeleton from the prototype answer. Finally, we propose an answer editor to generate the final answer by taking the question and the above parts as input. Conducted on a real-world dataset collected from an e-commerce platform, extensive experimental results show that our model achieves state-of-the-art performance in terms of both automatic metrics and human evaluations. Human evaluation also demonstrates that our model can consistently generate specific and proper answers.


Functorial Manifold Learning and Overlapping Clustering

arXiv.org Machine Learning

We adapt previous research on topological unsupervised learning to develop a unified functorial perspective on manifold learning and clustering. We first introduce overlapping hierachical clustering algorithms as functors and demonstrate that the maximal and single linkage clustering algorithms factor through an adaptation of the singular set functor. Next, we characterize manifold learning algorithms as functors that map uber-metric spaces to optimization objectives and factor through hierachical clustering functors. We use this characterization to prove refinement bounds on manifold learning loss functions and construct a hierarchy of manifold learning algorithms based on their invariants. We express several state of the art manifold learning algorithms as functors at different levels of this hierarchy, including Laplacian Eigenmaps, Metric Multidimensional Scaling, and UMAP. Finally, we experimentally demonstrate that this perspective enables us to derive and analyze novel manifold learning algorithms.


Data-driven Algorithm Design

arXiv.org Artificial Intelligence

Data driven algorithm design is an important aspect of modern data science and algorithm design. Rather than using off the shelf algorithms that only have worst case performance guarantees, practitioners often optimize over large families of parametrized algorithms and tune the parameters of these algorithms using a training set of problem instances from their domain to determine a configuration with high expected performance over future instances. However, most of this work comes with no performance guarantees. The challenge is that for many combinatorial problems of significant importance including partitioning, subset selection, and alignment problems, a small tweak to the parameters can cause a cascade of changes in the algorithm's behavior, so the algorithm's performance is a discontinuous function of its parameters. In this chapter, we survey recent work that helps put data-driven combinatorial algorithm design on firm foundations. We provide strong computational and statistical performance guarantees, both for the batch and online scenarios where a collection of typical problem instances from the given application are presented either all at once or in an online fashion, respectively.


Kernel k-Means, By All Means: Algorithms and Strong Consistency

arXiv.org Machine Learning

Kernel $k$-means clustering is a powerful tool for unsupervised learning of non-linearly separable data. Since the earliest attempts, researchers have noted that such algorithms often become trapped by local minima arising from non-convexity of the underlying objective function. In this paper, we generalize recent results leveraging a general family of means to combat sub-optimal local solutions to the kernel and multi-kernel settings. Called Kernel Power $k$-Means, our algorithm makes use of majorization-minimization (MM) to better solve this non-convex problem. We show the method implicitly performs annealing in kernel feature space while retaining efficient, closed-form updates, and we rigorously characterize its convergence properties both from computational and statistical points of view. In particular, we characterize the large sample behavior of the proposed method by establishing strong consistency guarantees. Its merits are thoroughly validated on a suite of simulated datasets and real data benchmarks that feature non-linear and multi-view separation.


An improved spectral clustering method for community detection under the degree-corrected stochastic blockmodel

arXiv.org Machine Learning

To solve the community detection problem, substantial approaches, such as Snijders and Nowicki (1997); Nowicki and Snijders (2001); Daudin et al. (2008); Bickel and Chen (2009); Rohe et al. (2011); Amini et al. (2013), are designed based on the standard framework, the stochastic block model (SBM) (Holland et al. (1983)), since it is mathematically simple and relatively easy to analyze (Bickel and Chen (2009)). However, the assumptions of SBM are too restrictive to implement in real networks. It is assumed that the distribution of degrees within the community is Poisson, that is, the nodes within each community have the same expected degrees. Unfortunately, in many natural networks, the degrees follow approximately a power-law distribution (Kolaczyk (2009); Goldenberg et al. (2010); Jin(2015)). The corrected-degree stochastic block model (DCSBM) (Karrer and Newman (2011)) is developed based on the power-law distribution which allows the degree of nodes varies among different communities.


Clustering of Big Data with Mixed Features

arXiv.org Machine Learning

Clustering large, mixed data is a central problem in data mining. Many approaches adopt the idea of k-means, and hence are sensitive to initialisation, detect only spherical clusters, and require a priori the unknown number of clusters. We here develop a new clustering algorithm for large data of mixed type, aiming at improving the applicability and efficiency of the peak-finding technique. The improvements are threefold: (1) the new algorithm is applicable to mixed data; (2) the algorithm is capable of detecting outliers and clusters of relatively lower density values; (3) the algorithm is competent at deciding the correct number of clusters. The computational complexity of the algorithm is greatly reduced by applying a fast k-nearest neighbors method and by scaling down to component sets. We present experimental results to verify that our algorithm works well in practice. Keywords: Clustering; Big Data; Mixed Attribute; Density Peaks; Nearest-Neighbor Graph; Conductance.


PQk-means: Billion-scale Clustering for Product-quantized Codes

#artificialintelligence

PQk-means is a Python library for efficient clustering of large-scale data. While k-means clustering is slow and not much efficient to handle large scale data, PQk-means is an efficient clustering method for billion-scale feature vectors. In terms of PQk-means, it achieves its speed and efficiency by first compressing input vectors into short product-quantized (PQ) codes. If openmp is installed, it will be automatically used to parallelize the algorithm for faster calculation.


Graph Kernels: State-of-the-Art and Future Challenges

arXiv.org Machine Learning

Among the data structures commonly used in machine learning, graphs are arguably one of the most general. Graphs allow modelling complex objects as a collection of entities (nodes) and of relationships between such entities (edges), each of which can be annotated by metadata such as categorical or vectorial node and edge features. Many ubiquitous data types can be understood as particular cases of graphs, including unstructured vectorial data as well as structured data types such as time series, images, volumetric data, point clouds or bags of entities, to name a few. Most importantly, numerous applications benefit from the extra flexibility that graph-based representations provide. In chemoinformatics, graphs have been used extensively to represent molecular compounds (Trinajstic, 2018), with nodes corresponding to atoms, edges to chemical bonds, and node and edge features encoding known chemical properties of each atom and bond in the molecule. Machine learning approaches operating on such graph-based representations of molecules are becoming increasingly successful in learning to predict complex molecular properties from large annotated data sets (Duvenaud et al., 2015; Gilmer et al., 2017; Wu et al., 2018), offering a promising set of tools for drug discovery (Vamathevan et al., 2019). In computational biology, graphs have likewise risen to prominence due to their ability to describe multifaceted interactions between (biological) entities.


A contribution to Optimal Transport on incomparable spaces

arXiv.org Machine Learning

Optimal Transport is a theory that allows to define geometrical notions of distance between probability distributions and to find correspondences, relationships, between sets of points. Many machine learning applications are derived from this theory, at the frontier between mathematics and optimization. This thesis proposes to study the complex scenario in which the different data belong to incomparable spaces. In particular we address the following questions: how to define and apply Optimal Transport between graphs, between structured data? How can it be adapted when the data are varied and not embedded in the same metric space? This thesis proposes a set of Optimal Transport tools for these different cases. An important part is notably devoted to the study of the Gromov-Wasserstein distance whose properties allow to define interesting transport problems on incomparable spaces. More broadly, we analyze the mathematical properties of the various proposed tools, we establish algorithmic solutions to compute them and we study their applicability in numerous machine learning scenarii which cover, in particular, classification, simplification, partitioning of structured data, as well as heterogeneous domain adaptation.


Spectral clustering on spherical coordinates under the degree-corrected stochastic blockmodel

arXiv.org Machine Learning

Spectral clustering is a popular method for community detection in networks under the assumption of the standard stochastic blockmodel. Taking a matrix representation of the graph such as the adjacency matrix, the nodes are clustered on a low dimensional projection obtained from a truncated spectral decomposition of the matrix. Estimating the number of communities and the dimension of the reduced latent space well is crucial for good performance of spectral clustering algorithms. Real-world networks, such as computer networks studied in cyber-security applications, often present heterogeneous within-community degree distributions which are better addressed by the degree-corrected stochastic blockmodel. A novel, model-based method is proposed in this article for simultaneous and automated selection of the number of communities and latent dimension for spectral clustering under the degree-corrected stochastic blockmodel. The method is based on a transformation to spherical coordinates of the spectral embedding, and on a novel modelling assumption in the transformed space, which is then embedded into an existing model selection framework for estimating the number of communities and the latent dimension. Results show improved performance over competing methods on simulated and real-world computer network data.