Friedrich, Sarah, Antes, Gerd, Behr, Sigrid, Binder, Harald, Brannath, Werner, Dumpert, Florian, Ickstadt, Katja, Kestler, Hans, Lederer, Johannes, Leitgöb, Heinz, Pauly, Markus, Steland, Ansgar, Wilhelm, Adalbert, Friede, Tim
The research on and application of artificial intelligence (AI) has triggered a comprehensive scientific, economic, social and political discussion. Here we argue that statistics, as an interdisciplinary scientific field, plays a substantial role both for the theoretical and practical understanding of AI and for its future development. Statistics might even be considered a core element of AI. With its specialist knowledge of data evaluation, starting with the precise formulation of the research question and passing through a study design stage on to analysis and interpretation of the results, statistics is a natural partner for other disciplines in teaching, research and practice. This paper aims at contributing to the current discussion by highlighting the relevance of statistical methodology in the context of AI development. In particular, we discuss contributions of statistics to the field of artificial intelligence concerning methodological development, planning and design of studies, assessment of data quality and data collection, differentiation of causality and associations and assessment of uncertainty in results. Moreover, the paper also deals with the equally necessary and meaningful extension of curricula in schools and universities.
Anomalies are occurrences in a dataset that are in some way unusual and do not fit the general patterns. The concept of the anomaly is generally ill-defined and perceived as vague and domain-dependent. Moreover, no comprehensive and concrete overviews of the different types of anomalies have hitherto been published. By means of an extensive literature review this study therefore offers the first theoretically principled and domain-independent typology of data anomalies, and presents a full overview of anomaly types and subtypes. To concretely define the concept of the anomaly and its different manifestations the typology employs four dimensions: data type, cardinality of relationship, data structure and data distribution. These fundamental and data-centric dimensions naturally yield 3 broad groups, 9 basic types and 61 subtypes of anomalies. The typology facilitates the evaluation of the functional capabilities of anomaly detection algorithms, contributes to explainable data science, and provides insights into relevant topics such as local versus global anomalies.
Dirichlet-multinomial (DMN) distribution is commonly used to model over-dispersion in count data. Precise and fast numerical computation of the DMN log-likelihood function is important for performing statistical inference using this distribution, and remains a challenge. To address this, we use mathematical properties of the gamma function to derive a closed form expression for the DMN log-likelihood function. Compared to existing methods, calculation of the closed form has a lower computational complexity, hence is much faster without comprimising computational accuracy.
Self-organization of complex morphological patterns from local interactions is a fascinating phenomenon in many natural and artificial systems. In the artificial world, typical examples of such morphogenetic systems are cellular automata. Yet, their mechanisms are often very hard to grasp and so far scientific discoveries of novel patterns have primarily been relying on manual tuning and ad hoc exploratory search. The problem of automated diversity-driven discovery in these systems was recently introduced [26, 61], highlighting that two key ingredients are autonomous exploration and unsupervised representation learning to describe "relevant" degrees of variations in the patterns. In this paper, we motivate the need for what we call Meta-diversity search, arguing that there is not a unique ground truth interesting diversity as it strongly depends on the final observer and its motives. Using a continuous game-of-life system for experiments, we provide empirical evidences that relying on monolithic architectures for the behavioral embedding design tends to bias the final discoveries (both for hand-defined and unsupervisedly-learned features) which are unlikely to be aligned with the interest of a final end-user. To address these issues, we introduce a novel dynamic and modular architecture that enables unsupervised learning of a hierarchy of diverse representations. Combined with intrinsically motivated goal exploration algorithms, we show that this system forms a discovery assistant that can efficiently adapt its diversity search towards preferences of a user using only a very small amount of user feedback.
Are you looking for the Best Online Statistics Courses? Get everything you'd want to know about descriptive and inferential statistics with these statistics training's. Learning statistics is a must for a data scientist. If you want to learn computer science, you will need to know the statistics as well. Do you know, What is the importance of statistics?
Clustering is a ubiquitous task in data science. Compared to the commonly used $k$-means clustering algorithm, $k$-medoids clustering algorithms require the cluster centers to be actual data points and support arbitrary distance metrics, allowing for greater interpretability and the clustering of structured objects. Current state-of-the-art $k$-medoids clustering algorithms, such as Partitioning Around Medoids (PAM), are iterative and are quadratic in the dataset size $n$ for each iteration, being prohibitively expensive for large datasets. We propose Bandit-PAM, a randomized algorithm inspired by techniques from multi-armed bandits, that significantly improves the computational efficiency of PAM. We theoretically prove that Bandit-PAM reduces the complexity of each PAM iteration from $O(n^2)$ to $O(n \log n)$ and returns the same results with high probability, under assumptions on the data that often hold in practice. We empirically validate our results on several large-scale real-world datasets, including a coding exercise submissions dataset from Code.org, the 10x Genomics 68k PBMC single-cell RNA sequencing dataset, and the MNIST handwritten digits dataset. We observe that Bandit-PAM returns the same results as PAM while performing up to 200x fewer distance computations. The improvements demonstrated by Bandit-PAM enable $k$-medoids clustering on a wide range of applications, including identifying cell types in large-scale single-cell data and providing scalable feedback for students learning computer science online. We also release Python and C++ implementations of our algorithm.
This thesis contributes to the mathematical foundation of domain adaptation as emerging field in machine learning. In contrast to classical statistical learning, the framework of domain adaptation takes into account deviations between probability distributions in the training and application setting. Domain adaptation applies for a wider range of applications as future samples often follow a distribution that differs from the ones of the training samples. A decisive point is the generality of the assumptions about the similarity of the distributions. Therefore, in this thesis we study domain adaptation problems under as weak similarity assumptions as can be modelled by finitely many moments.
This two-part review examines how automation has contributed to different aspects of discovery in the chemical sciences. In this first part, we describe a classification for discoveries of physical matter (molecules, materials, devices), processes, and models and how they are unified as search problems. We then introduce a set of questions and considerations relevant to assessing the extent of autonomy. Finally, we describe many case studies of discoveries accelerated by or resulting from computer assistance and automation from the domains of synthetic chemistry, drug discovery, inorganic chemistry, and materials science. These illustrate how rapid advancements in hardware automation and machine learning continue to transform the nature of experimentation and modelling. Part two reflects on these case studies and identifies a set of open challenges for the field.
Bayesian optimization is a framework for global search via maximum a posteriori updates rather than simulated annealing, and has gained prominence for decision-making under uncertainty. In this work, we cast Bayesian optimization as a multi-armed bandit problem, where the payoff function is sampled from a Gaussian process (GP). Further, we focus on action selections via upper confidence bound (UCB) or expected improvement (EI) due to their prevalent use in practice. Prior works using GPs for bandits cannot allow the iteration horizon $T$ to be large, as the complexity of computing the posterior parameters scales cubically with the number of past observations. To circumvent this computational burden, we propose a simple statistical test: only incorporate an action into the GP posterior when its conditional entropy exceeds an $\epsilon$ threshold. Doing so permits us to derive sublinear regret bounds of GP bandit algorithms up to factors depending on the compression parameter $\epsilon$ for both discrete and continuous action sets. Moreover, the complexity of the GP posterior remains provably finite. Experimentally, we observe state of the art accuracy and complexity tradeoffs for GP bandit algorithms applied to global optimization, suggesting the merits of compressed GPs in bandit settings.
The growing availability of user-specific data has welcomed the exciting era of personalized recommendation, a paradigm that uncovers the heterogeneity across individuals and provides tailored service decisions that lead to improved outcomes. Such heterogeneity is ubiquitous across a variety of application domains (including online advertising, medical treatment assignment, product/news recommendation (, ,,,)) and manifests itself as different individuals responding differently to the recommended items. Rising to this opportunity, contextual bandits ([8, 39, 22, 1, 3]) have emerged to be the predominant mathematical formalism that provides an elegant and powerful formulation: its three core components, the features (representing individual characteristics), the actions (representing the recommendation), and the rewards (representing the observed feedback), capture the salient aspects of the problem and provide fertile ground for developing algorithms that balance exploring and exploiting users' heterogeneity. As such, the last decade has witnessed extensive research efforts in developing effective and efficient contextual bandits algorithms. In particular, two types of algorithms-upper confidence bounds (UCB) based algorithms ([29, 20, 15, 26, 30]) and Thompson sampling (TS) based algorithms ([4, 5, 40, 41, 2])-stand out from this flourishing and fruitful line of work: their theoretical guarantees have been analyzed in many settings, often yielding (near-)optimal regret bounds; their empirical performance have been thoroughly validated, often providing insights into their practical efficacy (including the consensus that TS based algorithms, although sometimes suffering from intensive computation for posterior updates, are generally more effective than their UCB counterparts, whose performance can be sensitive to hyper-parameter tuning). To a large extent, these two family of algorithms have been widely deployed in many modern recommendation engines.