Clustering
Machine Learning at the Edge: A Data-Driven Architecture with Applications to 5G Cellular Networks
Polese, Michele, Jana, Rittwik, Kounev, Velin, Zhang, Ke, Deb, Supratim, Zorzi, Michele
The fifth generation of cellular networks (5G) will rely on edge cloud deployments to satisfy the ultra-low latency demand of future applications. In this paper, we argue that an edge-based deployment can also be used as an enabler of advanced Machine Learning (ML) applications in cellular networks, thanks to the balance it strikes between a completely distributed and a centralized approach. First, we will present an edge-controller-based architecture for cellular networks. Second, by using real data from hundreds of base stations of a major U.S. national operator, we will provide insights on how to dynamically cluster the base stations under the domain of each controller. Third, we will describe how these controllers can be used to run ML algorithms to predict the number of users, and a use case in which these predictions are used by a higher-layer application to route vehicular traffic according to network Key Performance Indicators (KPIs). We show that prediction accuracy improves when based on machine learning algorithms that exploit the controllers' view with respect to when it is based only on the local data of each single base station. The next generation of cellular networks (5G) is being designed to satisfy the massive growth in capacity demand, number of connections and the evolving use cases of a connected society for 2020 and beyond [1]. In order to meet these requirements, a new approach in the design of the network is required, and new paradigms have recently emerged [3]. First, the densification of the network will increase the spatial reuse and, combined with the usage of mmWave frequencies, the available throughput. On the other hand, this will introduce new challenges related to mobility management [4]. Michele Polese and Michele Zorzi are with the Department of Information Engineering (DEI), University of Padova, Italy.
[Infographic] Using Machine Learning to Realign the NCAA FBS College Football Conferences - Max Kaplan
Does this make sense to you? Before you utilize the power of online anonymous commenting to tell me how much of an idiot I am, please reconsider. This is a league where the Big Ten has fourteen teams and the Big 12 has ten teams. And you think the conferences are perfect the way they are? There is a better way.
Clustering and Labelling Auction Fraud Data
Alzahrani, Ahmad, Sadaoui, Samira
Although shill bidding is a common auction fraud, it is very difficult to detect. Because labeled training data are unavailable, in this study we build a high-quality labeled shill bidding dataset based on recently collected auctions from eBay. Labeling shill bidding instances with multidimensional features is a critical phase for the fraud classification task. For this purpose, we introduce a new approach to systematically label the fraud data with the help of the hierarchical clustering algorithm CURE, which returns remarkable results as illustrated in the experiments.
k-meansNet: When k-means Meets Differentiable Programming
Peng, Xi, Zhou, Joey Tianyi, Zhu, Hongyuan
In this paper, we study how to make clustering benefit from differentiable programming, whose basic idea is to treat the neural network as a language instead of a machine learning method. To this end, we recast the vanilla k-means as a novel feedforward neural network in an elegant way. Our contribution is two-fold. On the one hand, the proposed k-meansNet is a neural network implementation of the vanilla k-means, which enjoys four highly desired advantages, i.e., robustness to initialization, fast inference speed, the capability of handling newly arriving data, and provable convergence. On the other hand, this work may provide novel insights into differentiable programming. More specifically, most existing differentiable programming works unroll an optimizer as a recurrent neural network, namely, the neural network is employed to solve an existing optimization problem. In contrast, we reformulate the objective function of k-means as a feedforward neural network, namely, we employ the neural network to describe a problem. In this way, we advance the boundary of differentiable programming by moving the neural network from an alternative optimization approach to the problem formulation. Extensive experimental studies show that our method achieves promising performance compared with 12 clustering methods on some challenging datasets.
Curse of Heterogeneity: Computational Barriers in Sparse Mixture Models and Phase Retrieval
Fan, Jianqing, Liu, Han, Wang, Zhaoran, Yang, Zhuoran
We study the fundamental tradeoffs between statistical accuracy and computational tractability in the analysis of high dimensional heterogeneous data. As examples, we study the sparse Gaussian mixture model, the mixture of sparse linear regressions, and the sparse phase retrieval model. For these models, we exploit an oracle-based computational model to establish conjecture-free computationally feasible minimax lower bounds, which quantify the minimum signal strength required for the existence of any algorithm that is both computationally tractable and statistically accurate. Our analysis shows that there exist significant gaps between computationally feasible minimax risks and classical ones. These gaps quantify the statistical price we must pay to achieve computational tractability in the presence of data heterogeneity. Our results cover the problems of detection, estimation, support recovery, and clustering, and moreover, resolve several conjectures of Azizyan et al. (2013, 2015); Verzelen and Arias-Castro (2017); Cai et al. (2016). Interestingly, our results reveal a new but counter-intuitive phenomenon in heterogeneous data analysis: more data might lead to lower computational complexity.
Multi-View Graph Embedding Using Randomized Shortest Paths
Gamage, Anuththari, Rappaport, Brian, Aeron, Shuchin, Hu, Xiaozhe
Real-world data sets often provide multiple types of information about the same set of entities. This data is well represented by multi-view graphs, which consist of several distinct sets of edges over the same nodes. These can be used to analyze how entities interact from different viewpoints. Combining multiple views improves the quality of inferences drawn from the underlying data, which has increased interest in developing efficient multi-view graph embedding methods. We propose an algorithm, C-RSP, that generates a common (C) embedding of a multi-view graph using Randomized Shortest Paths (RSP). This algorithm generates a dissimilarity measure between nodes by minimizing the expected cost of a random walk between any two nodes across all views of a multi-view graph, in doing so encoding both the local and global structure of the graph. We test C-RSP on both real and synthetic data and show that it outperforms benchmark algorithms at embedding and clustering tasks while remaining computationally efficient.
The Variable Quality of Metadata About Biological Samples Used in Biomedical Experiments
Gonçalves, Rafael S., Musen, Mark A.
We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample, a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples, a repository managed by the European Bioinformatics Institute (EBI). We tested whether 11.4M sample metadata records in the two repositories are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the metadata. Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered there are often many distinct ways to represent the same aspect of a sample. Overall, the metadata we analyzed reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements. The significant aberrancies that we found in the metadata are likely to impede search and secondary use of the associated datasets.
Shedding Light on Black Box Machine Learning Algorithms: Development of an Axiomatic Framework to Assess the Quality of Methods that Explain Individual Predictions
From self-driving vehicles and back-flipping robots to virtual assistants that book our next appointment at the hair salon or at that restaurant for dinner, machine learning systems are becoming increasingly ubiquitous. The main reason for this is that these methods boast remarkable predictive capabilities. However, most of these models remain black boxes, meaning that it is very challenging for humans to follow and understand their intricate inner workings. Consequently, interpretability has suffered under this ever-increasing complexity of machine learning models. Especially with regard to new regulations, such as the General Data Protection Regulation (GDPR), plausibility and verifiability of the predictions made by these black boxes are indispensable. Driven by the needs of industry and practice, the research community has recognised this interpretability problem and focussed on developing a growing number of so-called explanation methods over the past few years. These methods explain individual predictions made by black box machine learning models and help to recover some of the lost interpretability. With the proliferation of these explanation methods, however, it is often unclear which explanation method offers a higher explanation quality or is generally better suited for the situation at hand. In this thesis, we thus propose an axiomatic framework that allows the quality of different explanation methods to be compared. Through experimental validation, we find that the developed framework is useful for assessing the explanation quality of different explanation methods and reaches conclusions that are consistent with independent research.
K-Means Clustering Algorithm - StepUp Analytics Machine learning
"What gets measured, gets managed" - Peter Drucker. The main aim of all clustering techniques is to group similar data points together. The k-means clustering algorithm is an unsupervised machine learning algorithm. It is a method of vector quantization that groups similar data by minimizing the squared error function. You can apply k-means to any clustering problem, provided you have a proper feature vector (Vector-Space Model) for each data point and a similarity/distance measure that can compare the feature vectors. The k-means clustering algorithm is used when you have unlabeled data (i.e., data without defined categories or groups).
Effective Unsupervised Author Disambiguation with Relative Frequencies
This work addresses the problem of author name homonymy in the Web of Science. Aiming for an efficient, simple and straightforward solution, we introduce a novel probabilistic similarity measure for author name disambiguation based on feature overlap. Using the researcher-ID available for a subset of the Web of Science, we evaluate the application of this measure in the context of agglomeratively clustering author mentions. We focus on a concise evaluation that shows clearly for which problem setups and at which time during the clustering process our approach works best. In contrast to most other works in this field, we are skeptical towards the performance of author name disambiguation methods in general and compare our approach to the trivial single-cluster baseline. Our results are presented separately for each correct clustering size because, as we explain, when all cases are treated together, the trivial baseline and more sophisticated approaches are hardly distinguishable in terms of evaluation results. Our model shows state-of-the-art performance for all correct clustering sizes without any discriminative training and with tuning only one convergence parameter.