Clustering
Alternative Blockmodelling
Correa, Oscar, Chan, Jeffrey, Nguyen, Vinh
Many approaches have been proposed to discover clusters within networks. Community finding field encompasses approaches which try to discover clusters where nodes are tightly related within them but loosely related with nodes of other clusters. However, a community network configuration is not the only possible latent structure in a graph. Core-periphery and hierarchical network configurations are valid structures to discover in a relational dataset. On the other hand, a network is not completely explained by only knowing the membership of each node. A high level view of the inter-cluster relationships is needed. Blockmodelling techniques deal with these two issues. Firstly, blockmodelling allows finding any network configuration besides to the well-known community structure. Secondly, blockmodelling is a summary representation of a network which regards not only membership of nodes but also relations between clusters. Finally, a unique summary representation of a network is unlikely. Networks might hide more than one blockmodel. Therefore, our proposed problem aims to discover a secondary blockmodel representation of a network that is of good quality and dissimilar with respect to a given blockmodel. Our methodology is presented through two approaches, (a) inclusion of cannot-link constraints and (b) dissimilarity between image matrices. Both approaches are based on non-negative matrix factorisation NMF which fits the blockmodelling representation. The evaluation of these two approaches regards quality and dissimilarity of the discovered alternative blockmodel as these are the requirements of the problem.
A close-up comparison of the misclassification error distance and the adjusted Rand index for external clustering evaluation
Indeed, it was the recommended choice in the seminal paper of Milligan and Cooper (1986), where five criteria were examined regarding the task of comparison of hierarchical clustering algorithms across different hierarchy levels. Their recommendation is based on the fact that, for the null case data (i.e., for a synthetic sample with randomly assigned class labels, showing no significant cluster structure), the ARI was the only index that produced a flat response curve across hierarchy levels, with mean values close to zero, hence indicating that the agreement between the randomly assigned labels and the algorithm solution was due to chance. Another popular measure for clustering validation, not included in Milligan and Cooper's study, is the misclassification error distance (MED). Its first appearance in the literature dates back at least to R egnier (1965), where it was introduced as a distance between partitions of a finite set, and it was called transfer distance. It is also referred to as partition distance (Gusfield, 2002) or maximum matching distance (Rossi, 2015).
Heavy Hitters via Cluster-Preserving Clustering
We develop a new algorithm for the turnstile heavy hitters problem in general turnstile streams, the EXPANDERSKETCH, which finds the approximate top-k items in a universe of size n using the same asymptotic O(k log n) words of memory and O(log n) update time as the COUNTMIN and COUNTSKETCH, but requiring only O(k poly(log n)) time to answer queries instead of the O(n log n) time of the other two. The notion of "approximation" is the same l2 sense as the COUNTSKETCH, which given known lower bounds is the strongest guarantee one can achieve in sublinear memory. Our main innovation is an efficient reduction from the heavy hitters problem to a clustering problem in which each heavy hitter is encoded as some form of noisy spectral cluster in a graph, and the goal is to identify every cluster. Since every heavy hitter must be found, correctness requires that every cluster be found. We thus need a "cluster-preserving clustering" algorithm that partitions the graph into pieces while finding every cluster. To do this we first apply standard spectral graph partitioning, and then we use some novel local search techniques to modify the cuts obtained so as to make sure that the original clusters are sufficiently preserved. Our clustering algorithm may be of broader interest beyond heavy hitters and streaming algorithms. Finding "frequent" or "top-k" items in a dataset is a common task in data mining. In the data streaming literature, this problem is typically referred to as the heavy hitters problem, which is as follows: a frequency vector x Rn is initialized to the zero vector, and we process a stream of updates update(i, ฮ) for ฮ R, with each such update causing the change xi xi ฮ . The goal is to identify coordinates in x with large weight (in absolute value) while using limited memory.
Prediction of Highway Lane Changes Based on Prototype Trajectories
Augustin, David, Hofmann, Marius, Konigorski, Ulrich
The vision of automated driving is to increase both road safety and efficiency, while offering passengers a convenient travel experience. This requires that autonomous systems correctly estimate the current traffic scene and its likely evolution. In highway scenarios early recognition of cutin maneuvers is essential for risk-aware maneuver planning. In this paper, a statistical approach is proposed, which advantageously utilizes a set of prototypical lane change trajectories to realize both early maneuver detection and uncertainty-aware trajectory prediction for traffic participants. Generation of prototype trajectories from real traffic data is accomplished by Agglomerative Hierarchical Clustering. During clustering, the alignment of the cluster prototypes to each other is optimized and the cohesion of the resulting prototype is limited when two clusters merge. In the prediction stage, the similarity of observed vehicle motion and typical lane change patterns in the data base is evaluated to construct a set of significant features for maneuver classification via Boosted Decision T rees. The future trajectory is predicted combining typical lane change realizations in a mixture model. B-splines based trajectory adaptations guarantee continuity during transition from actually observed to predicted vehicle states. Quantitative evaluation results demonstrate the proposed concept's improved performance for both maneuver and trajectory prediction compared to a previously implemented reference approach. The development of automated driving functions is a central activity of industry and science.
Learning about spatial inequalities: Capturing the heterogeneity in the urban environment
Siqueira-Gay, J., Giannotti, M. A., Sester, M.
Transportation systems can be conceptualized as an instrument of spreading people and resources over the territory, playing an important role in developing sustainable cities. The current rationale of transport provision is based on population demand, disregarding land use and socioeconomic information. To meet the challenge to promote a more equitable resource distribution, this work aims at identifying and describing patterns of urban services supply, their accessibility, and household income. By using a multidimensional approach, the spatial inequalities of a large city of the global south reveal that the low-income population has low access mainly to hospitals and cultural centers. A low-income group presents an intermediate level of accessibility to public schools and sports centers, evidencing the diverse condition of citizens in the peripheries. These complex outcomes generated by the interaction of land use and public transportation emphasize the importance of comprehensive methodological approaches to support decisions of urban projects, plans and programs. Reducing spatial inequalities, especially providing services for deprived groups, is fundamental to promote the sustainable use of resources and optimize the daily commuting.
A graphical heuristic for reduction and partitioning of large datasets for scalable supervised training
Y adav and BodeMETHODOLOGY A graphical heuristic for reduction and partitioning of large datasets for scalable supervised training Sumedh Y adav 1* and Mathis Bode 2 * Correspondence: sumedhyadav.iitkgp@gmail.com 1 Gstech T echnology Pvt. Ltd., 415, 2nd Floor, 16th Cross Road, 17th Main Road, HSR Layout Sector 4, 560102, Bengaluru, India Full list of author information is available at the end of the article Abstract A scalable graphical method is presented for selecting, and partitioning datasets for the training phase of a classification task. For the heuristic, a clustering algorithm is required to get its computation cost in a reasonable proportion to the task itself. This step is proceeded by construction of an information graph of the underlying classification patterns using approximate nearest neighbor methods. The presented method constitutes of two approaches, one for reducing a given training set, and another for partitioning the selected/reduced set. The heuristic targets large datasets, since the primary goal is significant reduction in training computation run-time without compromising prediction accuracy . T est results show that both approaches significantly speedup the training task when compared against that of state-of-the-art shrinking heuristic available in LIBSVM. Furthermore, the approaches closely follow or even outperform in prediction accuracy . A network design is also presented for the partitioning based distributed training formulation. Added speedup in training run-time is observed when compared to that of serial implementation of the approaches. Keywords: training set selection; machine learning; large datasets; distributed machine learning; classification; graph coarsening objective; network architecture design Introduction Two decades earlier, some of the most seminal works in machine learning were done on training set selection [1, 2] under the banner of relevance reasoning. However, the better part of recent works have been exclusively towards feature selection [3, 4]. With increased processing power, run time of training is feasible even for datasets erstwhile considered large. Additionally, dimensionality ( d) dominates dataset size ( n) in the algorithmic complexities of learning algorithms. In the training phase, less data points mean fewer generalization guarantees, however, as we are moving in the era of big data, even the fastest classification algorithms are taking unfeasible time to train models. When data sources are abundant, it is befitting to separate data based on relevance to the learning task. This has led to a renewed interest in the once famous problem statement of relevance reasoning [5, 6]. Reasoning on relevance to get improved scalability of classification algorithms is currently explored on graphical/network data [7], and learned models [8]. One research area where training set selection has been given attention to is support vector machines (SVM).
Constrained K-means with General Pairwise and Cardinality Constraints
Bibi, Adel, Wu, Baoyuan, Ghanem, Bernard
In this work, we study constrained clustering, where constraints are utilized to guide the clustering process. In existing works, two categories of constraints have been widely explored, namely pairwise and cardinality constraints. Pairwise constraints enforce the cluster labels of two instances to be the same (must-link constraints) or different (cannot-link constraints). Cardinality constraints encourage cluster sizes to satisfy a user-specified distribution. However, most existing constrained clustering models can only utilize one category of constraints at a time. In this paper, we enforce the above two categories into a unified clustering model starting with the integer program formulation of the standard K-means. As these two categories provide useful information at different levels, utilizing both of them is expected to allow for better clustering performance. However, the optimization is difficult due to the binary and quadratic constraints in the proposed unified formulation. To alleviate this difficulty, we utilize two techniques: equivalently replacing the binary constraints by the intersection of two continuous constraints; the other is transforming the quadratic constraints into bi-linear constraints by introducing extra variables. Then we derive an equivalent continuous reformulation with simple constraints, which can be efficiently solved by Alternating Direction Method of Multipliers (ADMM) algorithm. Extensive experiments on both synthetic and real data demonstrate: (1) when utilizing a single category of constraint, the proposed model is superior to or competitive with state-of-the-art constrained clustering models, and (2) when utilizing both categories of constraints jointly, the proposed model shows better performance than the case of the single category.
Collaborative Filtering and Multi-Label Classification with Matrix Factorization
Machine learning techniques for Recommendation System (RS) and Classification has become a prime focus of research to tackle the problem of information overload. RS are software tools that aim at making informed decisions about the services that a user may like. On the other hand, classification technique deals with the categorization of a data object into one of the several predefined classes. In the multi-label classification problem, unlike the traditional multi-class classification setting, each instance can be simultaneously associated with a subset of labels. The focus of thesis is on the development of novel techniques for collaborative filtering and multi-label classification. We propose a novel method of constructing a hierarchical bi-level maximum margin matrix factorization to handle matrix completion of ordinal rating matrix. Taking the cue from the alternative formulation of support vector machines, a novel loss function is derived by considering proximity as an alternative criterion instead of margin maximization criterion for matrix factorization framework. We extended the concept of matrix factorization for yet another important problem of machine learning namely multi-label classification which deals with the classification of data with multiple labels. We propose a novel piecewise-linear embedding method with a low-rank constraint on parametrization to capture nonlinear intrinsic relationships that exist in the original feature and label space. We also study the embedding of labels together with the group information with an objective to build an efficient multi-label classifier. We assume the existence of a low-dimensional space onto which the feature vectors and label vectors can be embedded. We ensure that labels belonging to the same group share the same sparsity pattern in their low-rank representations.
Online optimization of piecewise Lipschitz functions in changing environments
Balcan, Maria-Florina, Dick, Travis, Sharma, Dravyansh
In an online optimization problem we are required to choose a sequence of points from a fixed feasible subset of $\mathbb{R}^d$. After each point is chosen, a function over the set is revealed and we accumulate payoff equal to the function value at the point. We consider the class of piecewise Lipschitz functions, which is the most general setting considered in the literature for the problem, and arises naturally in various combinatorial algorithm selection problems where utility functions can have sharp discontinuities. The usual performance metric of `static' regret minimizes the gap between the payoff accumulated and that of the best fixed point for the entire duration, and thus fails to capture changing environments. Shifting regret is a useful alternative, which allows for up to $s$ environment shifts. In this work we provide an $O(\sqrt{sdT\log T}+sT^{1-\beta})$ regret bound for $\beta$-dispersed functions, where $\beta$ roughly quantifies the rate at which discontinuities appear in the utility functions in expectation (typically $\beta\ge1/2$ in problems of practical interest). We show this bound is optimal up to sub-logarithmic factors. We further show how to improve the bounds when selecting from a small pool of experts. We empirically demonstrate a key application of our algorithms to online clustering problems, with 15-40% relative gain over static regret based algorithms on popular benchmarks.
The Dangers of Post-hoc Interpretability: Unjustified Counterfactual Explanations
Laugel, Thibault, Lesot, Marie-Jeanne, Marsala, Christophe, Renard, Xavier, Detyniecki, Marcin
Post-hoc interpretability approaches have been proven to be powerful tools to generate explanations for the predictions made by a trained black-box model. However, they create the risk of having explanations that are a result of some artifacts learned by the model instead of actual knowledge from the data. This paper focuses on the case of counterfactual explanations and asks whether the generated instances can be justified, i.e. continuously connected to some ground-truth data. We evaluate the risk of generating unjustified counterfactual examples by investigating the local neighborhoods of instances whose predictions are to be explained and show that this risk is quite high for several datasets. Furthermore, we show that most state of the art approaches do not differentiate justified from unjustified counterfactual examples, leading to less useful explanations.