Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)
This survey provides an overview of the evolution of visually grounded models of spoken language over the last 20 years. Such models are inspired by the observation that when children pick up a language, they rely on a wide range of indirect and noisy clues, crucially including signals from the visual modality co-occurring with spoken utterances. Several fields have made important contributions to this approach to modeling or mimicking the process of learning language: Machine Learning, Natural Language and Speech Processing, Computer Vision and Cognitive Science. The current paper brings together these contributions in order to provide a useful introduction and overview for practitioners in all these areas. We discuss the central research questions addressed, the timeline of developments, and the datasets which enabled much of this work. We then summarize the main modeling architectures and offer an exhaustive overview of the evaluation metrics and analysis techniques.
This paper considers the problem of online clustering with bandit feedback. A set of arms (or items) can be partitioned into various groups that are unknown. Within each group, the observations associated to each of the arms follow the same distribution with the same mean vector. At each time step, the agent queries or pulls an arm and obtains an independent observation from the distribution it is associated to. Subsequent pulls depend on previous ones as well as the previously obtained samples. The agent's task is to uncover the underlying partition of the arms with the least number of arm pulls and with a probability of error not exceeding a prescribed constant $\delta$. The problem proposed finds numerous applications from clustering of variants of viruses to online market segmentation. We present an instance-dependent information-theoretic lower bound on the expected sample complexity for this task, and design a computationally efficient and asymptotically optimal algorithm, namely Bandit Online Clustering (BOC). The algorithm includes a novel stopping rule for adaptive sequential testing that circumvents the need to exactly solve any NP-hard weighted clustering problem as its subroutines. We show through extensive simulations on synthetic and real-world datasets that BOC's performance matches the lower bound asymptotically, and significantly outperforms a non-adaptive baseline algorithm.
Clustering points in a vector space or nodes in a graph is a ubiquitous primitive in statistical data analysis, and it is commonly used for exploratory data analysis. In practice, it is often of interest to "refine" or "improve" a given cluster that has been obtained by some other method. In this survey, we focus on principled algorithms for this cluster improvement problem. Many such cluster improvement algorithms are flow-based methods, by which we mean that operationally they require the solution of a sequence of maximum flow problems on a (typically implicitly) modified data graph. These cluster improvement algorithms are powerful, both in theory and in practice, but they have not been widely adopted for problems such as community detection, local graph clustering, semi-supervised learning, etc. Possible reasons for this are: the steep learning curve for these algorithms; the lack of efficient and easy to use software; and the lack of detailed numerical experiments on real-world data that demonstrate their usefulness. Our objective here is to address these issues. To do so, we guide the reader through the whole process of understanding how to implement and apply these powerful algorithms. We present a unifying fractional programming optimization framework that permits us to distill, in a simple way, the crucial components of all these algorithms. It also makes apparent similarities and differences between related methods. Viewing these cluster improvement algorithms via a fractional programming framework suggests directions for future algorithm development. Finally, we develop efficient implementations of these algorithms in our LocalGraphClustering Python package, and we perform extensive numerical experiments to demonstrate the performance of these methods on social networks and image-based data graphs.
Learning is a distinctive feature of intelligent behaviour. High-throughput experimental data and Big Data promise to open new windows on complex systems such as cells, the brain or our societies. Yet, the puzzling success of Artificial Intelligence and Machine Learning shows that we still have a poor conceptual understanding of learning. These applications push statistical inference into uncharted territories where data is high-dimensional and scarce, and prior information on "true" models is scant if not totally absent. Here we review recent progress on understanding learning, based on the notion of "relevance". The relevance, as we define it here, quantifies the amount of information that a dataset or the internal representation of a learning machine contains on the generative model of the data. This allows us to define maximally informative samples, on one hand, and optimal learning machines on the other. These are ideal limits of samples and of machines, that contain the maximal amount of information about the unknown generative process, at a given resolution (or level of compression). Both ideal limits exhibit critical features in the statistical sense: Maximally informative samples are characterised by a power-law frequency distribution (statistical criticality) and optimal learning machines by an anomalously large susceptibility. The trade-off between resolution (i.e. compression) and relevance distinguishes the regime of noisy representations from that of lossy compression. These are separated by a special point characterised by Zipf's law statistics. This identifies samples obeying Zipf's law as the most compressed loss-less representations that are optimal in the sense of maximal relevance. Criticality in optimal learning machines manifests in an exponential degeneracy of energy levels, that leads to unusual thermodynamic properties.
Data available in scholarly knowledge graphs (SKGs) - i.e., "a graph of data intended to accumulate and convey knowledge of the real world, whose nodes represent entities of interest and whose edges represent potentially different relations between these entities"  - is growing continuously every day, leading to a plethora of challenges concerning, for instance, article exploration and visualization , article recommendation , citation recommendation , and Author Name Disambiguation (AND) , which is relevant for the purposes of the present article. In particular, AND refers to a specific task of entity resolution which aims at resolving author mentions in bibliographic references to real-world people. Author persistent identifiers, such as ORCIDs and VIAFs, simplify the AND activity since such identifiers can be used for reconciling entities defined as different objects and representing the same real-world person. However, the availability of such persistent identifiers in SKGs - such as OpenCitations (OC) , AMiner  and Microsoft Academic Knowledge Graph (MAKG)  - is characterized by very low coverage and, as such, additional and computationally-oriented techniques must be adopted to identify different authors as the same person. In the past, many automatic approaches have been developed to automatically address AND by using publications metadata (e.g., title, abstract, keywords, venue, affiliation, etc.) to extract some features which can be used in the disambiguation task. These methods vary widely from supervised learning methods to unsupervised learning including recently developed deep neural network-based architectures . However, the existing SKGs do not provide all the relevant contextual information necessary to reuse effectively and efficiently such approaches, that often rely on pure textual data. In contrast with the approaches mentioned above, this study focuses on performing AND for scholarly data represented as linked data or included in SKGs by considering the multi-modal information available in such collections, i.e., the structural information consisting of entities and relations between them as well as text or numeric values associated with the authors and publications defined in the form of literals (family name, given name, publication title, venue title, year of publication, etc.). The proposed framework to address this task is named Literally Author Name Disambiguation (LAND), which focuses on tackling the following research questions: - Can Knowledge Graph Embeddings (KGEs) - i.e. a technique that enables the creation of a "dense representation of the graph in a continuous, low-dimensional vector space that can then be used for machine learning tasks" - be used effectively for the downstream task of clustering, more specifically for author name disambiguation?
Assessing the effects of the energy transition and liberalization of energy markets on resource adequacy is an increasingly important and demanding task. The rising complexity in energy systems requires adequate methods for energy system modeling leading to increased computational requirements. Furthermore, with complexity, uncertainty increases likewise calling for probabilistic assessments and scenario analyses. To adequately and efficiently address these various requirements, new methods from the field of data science are needed to accelerate current methods. With our systematic literature review, we want to close the gap between the three disciplines (1) assessment of security of electricity supply, (2) artificial intelligence, and (3) design of experiments. For this, we conduct a large-scale quantitative review on selected fields of application and methods and make a synthesis that relates the different disciplines to each other. Among other findings, we identify metamodeling of complex security of electricity supply models using AI methods and applications of AI-based methods for forecasts of storage dispatch and (non-)availabilities as promising fields of application that have not sufficiently been covered, yet. We end with deriving a new methodological pipeline for adequately and efficiently addressing the present and upcoming challenges in the assessment of security of electricity supply.
Advancements in Intelligent Traffic Systems (ITS) have made huge amounts of traffic data available through automatic data collection. A big part of this data is stored as trajectories of moving vehicles and road users. Automatic analysis of this data with minimal human supervision would both lower the costs and eliminate subjectivity of the analysis. Trajectory clustering is an unsupervised task. In this paper, we perform a comprehensive comparison of similarity measures, clustering algorithms and evaluation measures using trajectory data from seven intersections. We also propose a method to automatically generate trajectory reference clusters based on their origin and destination points to be used for label-based evaluation measures. Therefore, the entire procedure remains unsupervised both in clustering and evaluation levels. Finally, we use a combination of evaluation measures to find the top performing similarity measures and clustering algorithms for each intersection. The results show that there is no single combination of distance and clustering algorithm that is always among the top ten clustering setups.
An outlier is an event or observation that is defined as an unusual activity, intrusion, or a suspicious data point that lies at an irregular distance from a population. The definition of an outlier event, however, is subjective and depends on the application and the domain (Energy, Health, Wireless Network, etc.). It is important to detect outlier events as carefully as possible to avoid infrastructure failures because anomalous events can cause minor to severe damage to infrastructure. For instance, an attack on a cyber-physical system such as a microgrid may initiate voltage or frequency instability, thereby damaging a smart inverter which involves very expensive repairing. Unusual activities in microgrids can be mechanical faults, behavior changes in the system, human or instrument errors or a malicious attack. Accordingly, and due to its variability, Outlier Detection (OD) is an ever-growing research field. In this chapter, we discuss the progress of OD methods using AI techniques. For that, the fundamental concepts of each OD model are introduced via multiple categories. Broad range of OD methods are categorized into six major categories: Statistical-based, Distance-based, Density-based, Clustering-based, Learning-based, and Ensemble methods. For every category, we discuss recent state-of-the-art approaches, their application areas, and performances. After that, a brief discussion regarding the advantages, disadvantages, and challenges of each technique is provided with recommendations on future research directions. This survey aims to guide the reader to better understand recent progress of OD methods for the assurance of AI.
We review the scholarly contributions that utilise Natural Language Processing (NLP) methods to support the design process. Using a heuristic approach, we collected 223 articles published in 32 journals and within the period 1991-present. We present state-of-the-art NLP in-and-for design research by reviewing these articles according to the type of natural language text sources: internal reports, design concepts, discourse transcripts, technical publications, consumer opinions, and others. Upon summarizing and identifying the gaps in these contributions, we utilise an existing design innovation framework to identify the applications that are currently being supported by NLP. We then propose a few methodological and theoretical directions for future NLP in-and-for design research.
We propose a spectral clustering algorithm for analyzing the dependence structure of multivariate extremes. More specifically, we focus on the asymptotic dependence of multivariate extremes characterized by the angular or spectral measure in extreme value theory. Our work studies the theoretical performance of spectral clustering based on a random $k$-nearest neighbor graph constructed from an extremal sample, i.e., the angular part of random vectors for which the radius exceeds a large threshold. In particular, we derive the asymptotic distribution of extremes arising from a linear factor model and prove that, under certain conditions, spectral clustering can consistently identify the clusters of extremes arising in this model. Leveraging this result we propose a simple consistent estimation strategy for learning the angular measure. Our theoretical findings are complemented with numerical experiments illustrating the finite sample performance of our methods.