Clustering
Balanced Order Batching with Task-Oriented Graph Clustering
Duan, Lu, Hu, Haoyuan, Wu, Zili, Li, Guozheng, Zhang, Xinhang, Gong, Yu, Xu, Yinghui
Balanced order batching problem (BOBP) arises from the process of warehouse picking in Cainiao, the largest logistics platform in China. Batching orders together in the picking process to form a single picking route, reduces travel distance. The reason for its importance is that order picking is a labor intensive process and, by using good batching methods, substantial savings can be obtained. The BOBP is a NP-hard combinational optimization problem and designing a good problem-specific heuristic under the quasi-real-time system response requirement is non-trivial. In this paper, rather than designing heuristics, we propose an end-to-end learning and optimization framework named Balanced Task-orientated Graph Clustering Network (BTOGCN) to solve the BOBP by reducing it to balanced graph clustering optimization problem. In BTOGCN, a task-oriented estimator network is introduced to guide the type-aware heterogeneous graph clustering networks to find a better clustering result related to the BOBP objective. Through comprehensive experiments on single-graph and multi-graphs, we show: 1) our balanced task-oriented graph clustering network can directly utilize the guidance of target signal and outperforms the two-stage deep embedding and deep clustering method; 2) our method obtains an average 4.57m and 0.13m picking distance ("m" is the abbreviation of the meter (the SI base unit of length)) reduction than the expert-designed algorithm on single and multi-graph set and has a good generalization ability to apply in practical scenario.
Segmenting Bank Customers via RFM Model and Unsupervised Machine Learning
Aliyev, Musadig, Ahmadov, Elvin, Gadirli, Habil, Mammadova, Arzu, Alasgarov, Emin
In recent years, one of the major challenges for financial institutions is the retention of their customers using new methodologies of reliable and profitable segmentation. In the field of banking, the approach of offering all of the services to all the existing customers at the same time does not always work. However, being aware of what to sell, when to sell and whom to sell makes a huge difference in the conversion rate of the customers responding to new services and buying new products. In this paper, we used RFM technique and various clustering algorithms applied to the real customer data of one of the largest private banks of Azerbaijan.
Machine Learning for Reliability Engineering and Safety Applications: Review of Current Status and Future Opportunities
Xu, Zhaoyi, Saleh, Joseph Homer
Machine learning (ML) pervades an increasing number of academic disciplines and industries. Its impact is profound, and several fields have been fundamentally altered by it, autonomy and computer vision for example; reliability engineering and safety will undoubtedly follow suit. There is already a large but fragmented literature on ML for reliability and safety applications, and it can be overwhelming to navigate and integrate into a coherent whole. In this work, we facilitate this task by providing a synthesis of, and a roadmap to this ever-expanding analytical landscape and highlighting its major landmarks and pathways. We first provide an overview of the different ML categories and sub-categories or tasks, and we note several of the corresponding models and algorithms. We then look back and review the use of ML in reliability and safety applications. We examine several publications in each category/sub-category, and we include a short discussion on the use of Deep Learning to highlight its growing popularity and distinctive advantages. Finally, we look ahead and outline several promising future opportunities for leveraging ML in service of advancing reliability and safety considerations. Overall, we argue that ML is capable of providing novel insights and opportunities to solve important challenges in reliability and safety applications. It is also capable of teasing out more accurate insights from accident datasets than with traditional analysis tools, and this in turn can lead to better informed decision-making and more effective accident prevention.
Differentially Private Clustering: Tight Approximation Ratios
Ghazi, Badih, Kumar, Ravi, Manurangsi, Pasin
We study the task of differentially private clustering. For several basic clustering problems, including Euclidean DensestBall, 1-Cluster, k-means, and k-median, we give efficient differentially private algorithms that achieve essentially the same approximation ratios as those that can be obtained by any non-private algorithm, while incurring only small additive errors. This improves upon existing efficient algorithms that only achieve some large constant approximation factors. Our results also imply an improved algorithm for the Sample and Aggregate privacy framework. Furthermore, we show that one of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
Gradient-based Learning Methods Extended to Smooth Manifolds Applied to Automated Clustering
Koudounas, Alkis, Fiori, Simone
Grassmann manifold based sparse spectral clustering is a classification technique thatย consists in learning a latent representation of data, formed by a subspace basis, whichย is sparse. In order to learn a latent representation, spectral clustering is formulated inย terms of a loss minimization problem over a smooth manifold known as Grassmannian.ย Such minimization problem cannot be tackled by one of traditional gradient-based learningย algorithms, which are only suitable to perform optimization in absence of constraints amongย parameters. It is, therefore, necessary to develop specific optimization/learning algorithmsย that are able to look for a local minimum of a loss function under smooth constraints inย an efficient way. Such need calls for manifold optimization methods. In this paper, weย extend classical gradient-based learning algorithms on ย ย at parameter spaces (from classicalย gradient descent to adaptive momentum) to curved spaces (smooth manifolds) by meansย of tools from manifold calculus. We compare clustering performances of these methodsย and known methods from the scientific literature. The obtained results confirm that theย proposed learning algorithms prove lighter in computational complexity than existing onesย without detriment in clustering efficacy.
Exploring the weather impact on bike sharing usage through a clustering analysis
Quach, Jessica, Malekian, Reza
Bike sharing systems (BSS) have been a popular traveling service for years and are used worldwide. It is attractive for cities and users who wants to promote healthier lifestyles; to reduce air pollution and greenhouse gas emission as well as improve traffic. One major challenge to docked bike sharing system is redistributing bikes and balancing dock stations. Some studies propose models that can help forecasting bike usage; strategies for rebalancing bike distribution; establish patterns or how to identify patterns. Other studies propose to extend the approach by including weather data. This study aims to extend upon these proposals and opportunities to explore how and in what magnitude weather impacts bike usage. Bike usage data and weather data are gathered for the city of Washington D.C. and are analyzed using k-means clustering algorithm. K-means managed to identify three clusters that correspond to bike usage depending on weather conditions. The results show that the weather impact on bike usage was noticeable between clusters. It showed that temperature followed by precipitation weighted the most, out of five weather variables.
DBSCAN Clustering Algorithm in Machine Learning - KDnuggets
In 2014, the DBSCAN algorithm was awarded the test of time award (an award given to algorithms which have received substantial attention in theory and practice) at the leading data mining conference, ACM SIGKDD. Clustering analysis is an unsupervised learning method that separates the data points into several specific bunches or groups, such that the data points in the same groups have similar properties and data points in different groups have different properties in some sense. It comprises of many different methods based on different distance measures. Centrally, all clustering methods use the same approach i.e. first we calculate similarities and then we use it to cluster the data points into groups or batches. Here we will focus on the Density-based spatial clustering of applications with noise (DBSCAN) clustering method. If you are unfamiliar with the clustering algorithms, I advise you to read the Introduction to Image Segmentation with K-Means clustering.
Segmentation analysis and the recovery of queuing parameters via the Wasserstein distance: a study of administrative data for patients with chronic obstructive pulmonary disease
Wilde, Henry, Knight, Vincent, Gillard, Jonathan, Smith, Kendal
However, many such methods rely heavily on detailed data about both the healthcare system and its population which may limit research where sophisticated data pipelines are not yet in place. This work demonstrates a method of overcoming this, using routinely gathered, administrative hospital data to build a clustering that feeds into a multi-class queuing model, allowing for better understanding of the healthcare population and the system with which they interact. Specifically, this work examines records of patient spells from the National Health Service (NHS) Wales Cwm Taf Morgannwg University Health Board (UHB) presenting chronic obstructive pulmonary disease (COPD). COPD is a condition of particular interest to population health research, and to Cwm Taf Morgannwg UHB, as it is known to often present as a comorbidity in patients [15], increasing the complexity of treatments among those with the condition. Moreover, an internal report by NHS Wales found the Cwm Taf Morgannwg UHB had the highest prevalence of the condition across all the Welsh health boards. This work draws upon several overlapping sources within mathematical research, and this work contributes to the literature in three ways: to theoretical queuing research by the estimation of missing queuing parameters with the Wasserstein distance; to operational healthcare research through the weaving together of the combination of methods used in this work despite data constraints; and to public health research by adding to the growing body of mathematical and operational work around a condition that is vital to understand operationally, socially and medically. The remainder of the paper is structured as follows: Section 1 provides a literature review, and an overview of the dataset and its clustering; Section 2 describes the queuing model used and the estimation of its parameters; Section 3 presents several what-if scenarios with insight provided by the model parameterisation and the clustering; Section 4 concludes the paper. Although the data is confidential and may not be published, a synthetic analogue has been archived [43] along with all the source code used in this paper [40].
Fast and Eager k-Medoids Clustering: O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms
Schubert, Erich, Rousseeuw, Peter J.
Clustering non-Euclidean data is difficult, and one of the most used algorithms besides hierarchical clustering is the popular algorithm Partitioning Around Medoids (PAM), also simply referred to as k-medoids clustering. In Euclidean geometry the mean-as used in k-means-is a good estimator for the cluster center, but this does not exist for arbitrary dissimilarities. PAM uses the medoid instead, the object with the smallest dissimilarity to all others in the cluster. This notion of centrality can be used with any (dis-)similarity, and thus is of high relevance to many domains and applications. A key issue with PAM is its high run time cost. We propose modifications to the PAM algorithm that achieve an O(k)-fold speedup in the second ("SWAP") phase of the algorithm, but will still find the same results as the original PAM algorithm. If we relax the choice of swaps performed (while retaining comparable quality), we can further accelerate the algorithm by eagerly performing additional swaps in each iteration. With the substantially faster SWAP, we can now explore faster initialization strategies, because (i) the classic ("BUILD") initialization now becomes the bottleneck, and (ii) our swap is fast enough to compensate for worse starting conditions. We also show how the CLARA and CLARANS algorithms benefit from the proposed modifications. While we do not study the parallelization of our approach in this work, it can easily be combined with earlier approaches to use PAM and CLARA on big data (some of which use PAM as a subroutine, hence can immediately benefit from these improvements), where the performance with high k becomes increasingly important. In experiments on real data with k=100,200, we observed a 458x respectively 1191x speedup compared to the original PAM SWAP algorithm, making PAM applicable to larger data sets, and in particular to higher k.
An Entropy Equation for Energy
This paper describes an entropy equation, but one that should be used for measuring energy and not information. In relation to the human brain therefore, both of these quantities can be used to represent the stored information. The human brain makes use of energy efficiency to form its structures, which is likely to be linked to the neuron wiring. This energy efficiency can also be used as the basis for a clustering algorithm, which is described in a different paper. This paper is more of a discussion about global properties, where the rules used for the clustering algorithm can also create the entropy equation E = (mean * variance). This states that work is done through the energy released by the 'change' in entropy. The equation is so simplistic and generic that it can offer arguments for completely different domains, where the journey ends with a discussion about global energy properties in physics and beyond. A comparison with Einstein's relativity equation is made and also the audacious suggestion that a black hole has zero-energy inside.