Clustering
A Case-Based Reasoning and Clustering Framework for the Development of Intelligent Agents in Simulation Systems
Lucca, Marcos R. B. (Federal University of Santa Maria) | Junior, Alcides G. Lopes ( Federal University of Rio Grande do Sul ) | Freitas, Edison P. ( Federal University of Rio Grande do Sul ) | Silva, Luis A. L. ( Federal University of Santa Maria )
Artificial Intelligence (AI) techniques are essential to the modeling of realistic behaviors for agents in simulation systems. Although Case-Based Reasoning (CBR) and Clustering techniques are being explored in the implementation of such agents in computer games, these techniques are still under-used in the implementation of simulation systems. This work approaches this gap by proposing a new CBR and clustering framework in which clustering algorithms and clustering evaluation techniques are explored in both the construction of adjusted similarity functions and the organization of sub-case bases, which are indexing components to the efficient retrieval of relevant cases from case bases so as to support the solution of new simulation problems. To evaluate this framework, a case-based algorithm was implemented to simulate the choice of military supplies to be used in artillery battery missions in virtual tactical simulations.
WSCAN-TFP: Weighted SCAN Clustering Algorithm for Team Formation Problem in Social Network
Selvarajah, Kalyani (University of Windsor) | Bhullar, Amangel (University of Windsor) | Kobti, Ziad (University of Windsor) | Kargar, Mehdi (University of Windsor)
In this paper, we provide a novel approach for the Team Formation Problem (TFP) in social networks. With a given social network of experts and communication cost between them, we address the problem of finding a team with a set of required skills necessary to complete a project. An expert of this network is treated as a node and possesses a given set of skills. The basic idea of the Structural Clustering Algorithm for Networks (SCAN) is to detect clusters, hubs, and outliers in networks. To employ SCAN on TFP, we first find the pool of experts with required skills. Then we search for highly connected (core) expert among all experts network. We expand the cluster from core to neighborhood nodes, and it goes from densely connected to loosely connected nodes within a threshold range of communication cost. We solve this TFP by identifying experts while minimizing communication cost for the project with specific skills. We then measure the communication cost with the sum of a distance function. An enhanced variant of SCAN is the weighted structural clustering algorithm (WSCAN) which is implemented in this paper to solve the TFP with minimum communication cost. Our result with WSCAN performed approximately equal to Greedy algorithms while slightly worse than other Genetic, Cultural and Exact Algorithms. The run-time of WSCAN however, was better compared to the others.
Convex Programming Based Spectral Clustering
Clustering is a fundamental task in data analysis, and spectral clustering has been recognized as a promising approach to it. Given a graph describing the relationship between data, spectral clustering explores the underlying cluster structure in two stages. The first stage embeds the nodes of the graph into real space, and the second stage groups the embedded nodes into several clusters. The use of the $k$-means method in the grouping stage is currently standard practice. We present a spectral clustering algorithm that uses convex programming in the grouping stage, and study how well it works. The concept behind the algorithm design lies in the following observation. The nodes with the largest degree in each cluster may be found by computing an enclosing ellipsoid for embedded nodes in real space, and the clusters may be identified by using those nodes. We show that the observations are valid, and the algorithm returns clusters to provide the conductance of graph, if the gap assumption, introduced by Peng el al. at COLT 2015, is satisfied. We also give an experimental assessment of the algorithm's performance.
Survey and cross-benchmark comparison of remaining time prediction methods in business process monitoring
Verenich, Ilya, Dumas, Marlon, La Rosa, Marcello, Maggi, Fabrizio, Teinemaa, Irene
Predictive business process monitoring methods exploit historical process execution logs to generate predictions about running instances (called cases) of a business process, such as the prediction of the outcome, next activity or remaining cycle time of a given process case. These insights could be used to support operational managers in taking remedial actions as business processes unfold, e.g. shifting resources from one case onto another to ensure this latter is completed on time. A number of methods to tackle the remaining cycle time prediction problem have been proposed in the literature. However, due to differences in their experimental setup, choice of datasets, evaluation measures and baselines, the relative merits of each method remain unclear. This article presents a systematic literature review and taxonomy of methods for remaining time prediction in the context of business processes, as well as a cross-benchmark comparison of 16 such methods based on 16 real-life datasets originating from different industry domains.
An Unsupervised Clustering-Based Short-Term Solar Forecasting Methodology Using Multi-Model Machine Learning Blending
Feng, Cong, Cui, Mingjian, Hodge, Bri-Mathias, Lu, Siyuan, Hamann, Hendrik F., Zhang, Jie
Abstract--Solar forecasting accuracy is affected by weather conditions, and weather awareness forecasting models are expected to improve the performance. However, it may not be available and reliable to classify different forecasting tasks by using only meteorological weather categorization. In this paper, an unsupervised clustering-based (UC-based) solar forecasting methodology is developed for short-term (1-hour-ahead) global horizontal irradiance (GHI) forecasting. This methodology consists of three parts: GHI time series unsupervised clustering, pattern recognition, and UC-based forecasting. The daily GHI time series is first clustered by an Optimized Cross-validated ClUsteRing (OCCUR) method, which determines the optimal number of clusters and best clustering results. Then, support vector machine pattern recognition (SVM-PR) is adopted to recognize the category of a certain day using the first few hours data in the forecasting stage. GHI forecasts are generated by the most suitable models in different clusters, which are built by a two-layer Machine learning based Multi-Model (M3) forecasting framework. The developed UC-based methodology is validated by using 1-year of data with six solar features. Numerical results show that (i) UC-based models outperform non-UC (all-in-one) models with the same M3 architecture by approximately 20%; (ii) M3-based models also outperform the single-algorithm machine learning (SAML) models by approximately 20%.
The 10 Mining Techniques Data Scientists Need For Their Own Toolbox
At their core, data scientists have a math and statistics background. Out of this math background, they're creating advanced analytics. Just like their software engineering counterparts, data scientists will have to interact with the business side. This includes understanding the domain enough to make insights. Data scientists are often tasked with analyzing data to help the business, and this requires a level of business acumen. Finally, their results need to be given to the business in an understandable fashion. This requires the ability to verbally and visually communicate complex results and observations in a way that the business can understand and act on them. Thus, it'll be extremely valuable for any aspiring data scientists to learn data mining -- the process where one structures the raw data and formulate or recognize the various patterns in the data through the mathematical and computational algorithms. This helps generate new information and unlock various insights. Here is a simple list of reasons on why you should study Data Mining?
Finding Frequent Entities in Continuous Data
Alet, Ferran, Chitnis, Rohan, Kaelbling, Leslie P., Lozano-Perez, Tomas
In many applications that involve processing high-dimensional data, it is important to identify a small set of entities that account for a significant fraction of detections. Rather than formalize this as a clustering problem, in which all detections must be grouped into hard or soft categories, we formalize it as an instance of the frequent items or heavy hitters problem, which finds groups of tightly clustered objects that have a high density in the feature space. We show that the heavy hitters formulation generates solutions that are more accurate and effective than the clustering formulation. In addition, we present a novel online algorithm for heavy hitters, called HAC, which addresses problems in continuous space, and demonstrate its effectiveness on real video and household domains.
Semi-Orthogonal Non-Negative Matrix Factorization
Li, Jack Yutong, Zhu, Ruoqing, Qu, Annie, Ye, Han, Sun, Zhankun
Non-negative Matrix Factorization (NMF) is a popular clustering and dimension reduction method by decomposing a non-negative matrix into the product of two lower dimension matrices composed of basis vectors. In this paper, we propose a semi-orthogonal NMF method that enforces one of the matrices to be orthogonal with mixed signs, thereby guarantees the rank of the factorization. Our method preserves strict orthogonality by implementing the Cayley transformation to force the solution path to be exactly on the Stiefel manifold, as opposed to the approximated orthogonality solutions in existing literature. We apply a line search update scheme along with an SVD-based initialization which produces a rapid convergence of the algorithm compared to other existing approaches. In addition, we present formulations of our method to incorporate both continuous and binary design matrices. Through various simulation studies, we show that our model has an advantage over other NMF variations regarding the accuracy of the factorization, rate of convergence, and the degree of orthogonality while being computationally competitive. We also apply our method to a text-mining data on classifying triage notes, and show the effectiveness of our model in reducing classification error compared to the conventional bag-of-words model and other alternative matrix factorization approaches.
Branching embedding: A heuristic dimensionality reduction algorithm based on hierarchical clustering
This paper proposes a new dimensionality reduction algorithm named branching embedding (BE). It converts a dendrogram to a two-dimensional scatter plot, and visualizes the inherent structures of the original high-dimensional data. Since the conversion part is not computationally demanding, the BE algorithm would be beneficial for the case where hierarchical clustering is already performed. Numerical experiments revealed that the outputs of the algorithm moderately preserve the original hierarchical structures.
Clustering With Pairwise Relationships: A Generative Approach
Yu, Yen-Yun, Elhabian, Shireen Y., Whitaker, Ross T.
Semi-supervised learning (SSL) has become important in current data analysis applications, where the amount of unlabeled data is growing exponentially and user input remains limited by logistics and expense. Constrained clustering, as a subclass of SSL, makes use of user input in the form of relationships between data points (e.g., pairs of data points belonging to the same class or different classes) and can remarkably improve the performance of unsupervised clustering in order to reflect user-defined knowledge of the relationships between particular data points. Existing algorithms incorporate such user input, heuristically, as either hard constraints or soft penalties, which are separate from any generative or statistical aspect of the clustering model; this results in formulations that are suboptimal and not sufficiently general. In this paper, we propose a principled, generative approach to probabilistically model, without ad hoc penalties, the joint distribution given by user-defined pairwise relations. The proposed model accounts for general underlying distributions without assuming a specific form and relies on expectation-maximization for model fitting. For distributions in a standard form, the proposed approach results in a closed-form solution for updated parameters.