Collaborating Authors

 Böhm, Klemens


SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

arXiv.org Artificial Intelligence

With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks which can evaluate the ability of LLMs on different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases or giving mathematical proofs. Inspired by the way university students are evaluated on such tasks, in this paper, we propose SciEx - a benchmark consisting of university computer science exam questions - to evaluate LLMs' ability to solve scientific tasks. SciEx is (1) multilingual, containing both English and German exams, (2) multi-modal, containing questions that involve images, and (3) composed of various types of free-form questions with different difficulty levels, due to the nature of university exams. We evaluate the performance of various state-of-the-art LLMs on our new benchmark. Since SciEx questions are free-form, it is not straightforward to evaluate LLM performance. Therefore, we provide human expert grading of the LLM outputs on SciEx. We show that the free-form exams in SciEx remain challenging for current LLMs, where the best LLM only achieves a 59.4% exam grade on average. We also provide detailed comparisons between LLM performance and student performance on SciEx. To enable future evaluation of new LLMs, we propose using LLM-as-a-judge to grade the LLM answers on SciEx. Our experiments show that, although they do not perform perfectly on solving the exams, LLMs are decent graders, achieving a 0.948 Pearson correlation with expert grading.
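The reported judge-expert agreement is a Pearson correlation. As a minimal illustration of how such an agreement score is computed, here is a pure-Python version; the per-question grades below are made up for the example, not SciEx data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-question grades (expert vs. LLM judge), 0-10 scale.
expert = [8.0, 5.5, 9.0, 3.0, 7.0]
judge = [7.5, 6.0, 9.0, 2.5, 6.5]
agreement = pearson(expert, judge)
```

A correlation near 1 means the judge ranks and spaces the answers almost exactly as the human expert does, even if its absolute grades differ.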


Generalizability of experimental studies

arXiv.org Artificial Intelligence

Experimental studies are a cornerstone of machine learning (ML) research. A common, but often implicit, assumption is that the results of a study will generalize beyond the study itself, e.g. to new data. That is, there is a high probability that repeating the study under different conditions will yield similar results. Despite the importance of the concept, the problem of measuring generalizability remains open. This is probably due to the lack of a mathematical formalization of experimental studies. In this paper, we propose such a formalization and develop a quantifiable notion of generalizability. This notion allows one to explore the generalizability of existing studies and to estimate the number of experiments needed to achieve generalizability in new studies. To demonstrate its usefulness, we apply it to two recently published benchmarks to discern generalizable and non-generalizable results. We also publish a Python module that allows our analysis to be repeated for other experimental studies.
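As a rough, simplified illustration of the underlying idea - checking whether a study's conclusion survives repetition on random subsets of its experiments - one could compute something like the following; the accuracies and the stability proxy are hypothetical and not the paper's actual formalization:

```python
import random

def conclusion_stability(scores_a, scores_b, n_draws=1000, subset=10, seed=0):
    """Fraction of random sub-studies in which method A still beats method B -
    a crude, hypothetical proxy for whether a study's conclusion generalizes."""
    rng = random.Random(seed)
    n = len(scores_a)
    stable = 0
    for _ in range(n_draws):
        pick = rng.sample(range(n), subset)  # a random sub-study
        if sum(scores_a[i] for i in pick) > sum(scores_b[i] for i in pick):
            stable += 1
    return stable / n_draws

# Hypothetical per-experiment accuracies of two methods over 20 experiments.
data_rng = random.Random(1)
acc_a = [0.80 + data_rng.uniform(-0.05, 0.05) for _ in range(20)]
acc_b = [0.75 + data_rng.uniform(-0.05, 0.05) for _ in range(20)]
stability = conclusion_stability(acc_a, acc_b)
```

A value near 1 suggests the "A beats B" conclusion would likely survive repetition; a value near 0.5 suggests it is an artifact of the particular experiments chosen.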


Generative Subspace Adversarial Active Learning for Outlier Detection in Multiple Views of High-dimensional Data

arXiv.org Artificial Intelligence

Outlier detection in high-dimensional tabular data is an important task in data mining, essential for many downstream tasks and applications. Existing unsupervised outlier detection algorithms face one or more problems, including the inlier assumption (IA), the curse of dimensionality (CD), and multiple views (MV). To address these issues, we introduce Generative Subspace Adversarial Active Learning (GSAAL), a novel approach that uses a Generative Adversarial Network with multiple adversaries. These adversaries learn the marginal class probability functions over different data subspaces, while a single generator in the full space models the entire distribution of the inlier class. GSAAL is specifically designed to address the MV limitation while also handling IA and CD, being the only method to do so. We provide a comprehensive mathematical formulation of MV, convergence guarantees for the discriminators, and scalability results for GSAAL. Our extensive experiments demonstrate the effectiveness and scalability of GSAAL, highlighting its superior performance compared to other popular OD methods, especially in MV scenarios.
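GSAAL itself trains a GAN with one adversary per subspace. As a much-simplified, hypothetical stand-in for the multiple-views idea - one scorer per subspace, aggregated into a single outlier score - a per-subspace z-score can play the role of the trained discriminators:

```python
import random
import statistics

def subspace_score(train, point, subspaces):
    """Score a point by averaging simple per-subspace anomaly scores.
    The per-dimension z-score here is only a stand-in for GSAAL's
    trained adversaries; the aggregation over views is the idea shown."""
    scores = []
    for dims in subspaces:
        s = 0.0
        for d in dims:
            col = [row[d] for row in train]
            mu = statistics.fmean(col)
            sd = statistics.pstdev(col) or 1.0  # guard a degenerate column
            s += abs((point[d] - mu) / sd)
        scores.append(s / len(dims))
    return sum(scores) / len(scores)

rng = random.Random(0)
# Inliers: dims 0-1 informative, dim 2 pure noise; the views ("subspaces")
# deliberately ignore the noise dimension.
train = [[rng.gauss(0, 1), rng.gauss(0, 1), rng.uniform(-5, 5)] for _ in range(200)]
views = [(0,), (1,), (0, 1)]
inlier_score = subspace_score(train, [0.1, -0.2, 4.0], views)
outlier_score = subspace_score(train, [6.0, 6.0, 0.0], views)
```

Note how the inlier's extreme value in the noise dimension does not hurt it, because no view looks at that dimension - the point of detecting outliers per subspace rather than in the full space.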


Uncertainty-Aware Partial-Label Learning

arXiv.org Artificial Intelligence

In real-world applications, one often encounters ambiguously labeled data, where different annotators assign conflicting class labels. Partial-label learning allows training classifiers in this weakly supervised setting. While state-of-the-art methods already feature good predictive performance, they often suffer from miscalibrated uncertainty estimates. However, having well-calibrated uncertainty estimates is important, especially in safety-critical domains like medicine and autonomous driving. In this article, we propose a novel nearest-neighbor-based partial-label-learning algorithm that leverages Dempster-Shafer theory. Extensive experiments on artificial and real-world datasets show that the proposed method provides a well-calibrated uncertainty estimate and achieves competitive prediction performance. Additionally, we prove that our algorithm is risk-consistent.
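The core ingredient, Dempster's rule of combination, can be sketched in a few lines. Here each nearest neighbor contributes a mass function supporting its candidate label set, with the remainder assigned to the whole frame (ignorance); the class names and the discount factor are made up for the example:

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule for two mass functions with frozenset focal elements."""
    combined, conflict = {}, 0.0
    for (s1, w1), (s2, w2) in product(m1.items(), m2.items()):
        inter = s1 & s2
        if inter:
            combined[inter] = combined.get(inter, 0.0) + w1 * w2
        else:
            conflict += w1 * w2  # mass on contradictory evidence
    norm = 1.0 - conflict
    return {s: w / norm for s, w in combined.items()}

FRAME = frozenset({"cat", "dog", "bird"})  # hypothetical label set

def neighbor_mass(candidates, alpha=0.8):
    """A neighbor supports its candidate label set with weight alpha;
    the rest stays on the whole frame, expressing ignorance."""
    return {frozenset(candidates): alpha, FRAME: 1.0 - alpha}

# Two neighbors with partially conflicting candidate sets for a query point.
m = dempster_combine(neighbor_mass({"cat", "dog"}), neighbor_mass({"cat"}))
belief_cat = m.get(frozenset({"cat"}), 0.0)
```

Unlike a plain vote, the combined mass function keeps some weight on larger sets, so it expresses how uncertain the prediction is rather than only which class wins.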


Adaptive Bernstein Change Detector for High-Dimensional Data Streams

arXiv.org Artificial Intelligence

Change detection is of fundamental importance when analyzing data streams. Detecting changes both quickly and accurately enables monitoring and prediction systems to react, e.g., by issuing an alarm or by updating a learning algorithm. However, detecting changes is challenging when observations are high-dimensional. In high-dimensional data, change detectors should not only be able to identify when changes happen, but also in which subspace they occur. Ideally, one should also quantify how severe they are. Our approach, ABCD, has these properties. ABCD learns an encoder-decoder model and monitors its accuracy over a window of adaptive size. ABCD derives a change score based on Bernstein's inequality to detect deviations in terms of accuracy, which indicate changes. Our experiments demonstrate that ABCD outperforms its best competitor by at least 8% and up to 23% in F1-score on average. It can also accurately estimate the subspace in which a change occurs, together with a severity measure that correlates with the ground truth.
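The Bernstein ingredient can be illustrated with a simplified two-window test: allow each window mean a Bernstein-style deviation at the chosen confidence level, and flag a change when the observed gap between window means exceeds the combined allowance. This is a sketch of the principle, not ABCD's adaptive-window algorithm:

```python
import math

def bernstein_eps(var, bound, n, delta=0.05):
    """Deviation allowed for the mean of n observations in [0, bound],
    via a Bernstein-style bound at confidence 1 - delta."""
    log_term = math.log(2.0 / delta)
    return math.sqrt(2.0 * var * log_term / n) + 2.0 * bound * log_term / (3.0 * n)

def change_detected(errors, split, delta=0.05):
    """Compare reconstruction-error means before and after 'split'."""
    def mean_var(xs):
        m = sum(xs) / len(xs)
        return m, sum((x - m) ** 2 for x in xs) / len(xs)

    left, right = errors[:split], errors[split:]
    (m1, v1), (m2, v2) = mean_var(left), mean_var(right)
    eps = bernstein_eps(v1, 1.0, len(left), delta) + bernstein_eps(v2, 1.0, len(right), delta)
    return abs(m1 - m2) > eps

# Hypothetical reconstruction errors that jump from ~0.1 to ~0.6 mid-window.
flag_change = change_detected([0.1] * 50 + [0.6] * 50, 50)
flag_stable = change_detected([0.1] * 100, 50)
```

Because the bound uses the empirical variance, a quiet stream gets a tight allowance and small shifts become detectable, while a noisy stream automatically gets more slack.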


Maximum Mean Discrepancy on Exponential Windows for Online Change Detection

arXiv.org Artificial Intelligence

Detecting changes is of fundamental importance when analyzing data streams and has many applications, e.g., predictive maintenance, fraud detection, or medicine. A principled approach to detect changes is to compare the distributions of observations within the stream to each other via hypothesis testing. Maximum mean discrepancy (MMD; also called energy distance) is a well-known (semi-)metric on the space of probability distributions. MMD gives rise to powerful non-parametric two-sample tests on kernel-enriched domains under mild conditions, which makes its deployment for change detection desirable. However, the classic MMD estimators suffer from quadratic time complexity, which prohibits their application in the online change detection setting. We propose a general-purpose change detection algorithm, Maximum Mean Discrepancy on Exponential Windows (MMDEW), which leverages the MMD two-sample test, facilitates its efficient online computation on any kernel-enriched domain, and is able to detect any disparity between distributions. Our experiments and analysis show that (1) MMDEW achieves better detection quality than state-of-the-art competitors and that (2) the algorithm has polylogarithmic runtime and logarithmic memory requirements, which allow its deployment to the streaming setting.
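The quadratic cost the abstract refers to is visible in the classic estimator itself: every pair of observations enters a kernel sum. A minimal pure-Python version of the (biased) quadratic-time estimator with an RBF kernel on scalar data:

```python
import math
import random

def rbf(x, y, gamma=1.0):
    return math.exp(-gamma * (x - y) ** 2)

def mmd2_biased(xs, ys, gamma=1.0):
    """Biased quadratic-time estimator of squared MMD with an RBF kernel.
    The double sums make each test O(n^2) - the cost that MMDEW's
    exponential windows avoid in the streaming setting."""
    n, m = len(xs), len(ys)
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / (n * n)
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / (m * m)
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (n * m)
    return kxx + kyy - 2.0 * kxy

rng = random.Random(0)
same = [rng.gauss(0, 1) for _ in range(100)]
shifted = [rng.gauss(3, 1) for _ in range(100)]
d_same = mmd2_biased(same[:50], same[50:])  # two halves of one distribution
d_diff = mmd2_biased(same, shifted)         # distributions that differ
```

Samples from the same distribution yield a value near zero, while a distribution shift yields a clearly larger value - the basis of the two-sample test.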


Efficient SVDD Sampling with Approximation Guarantees for the Decision Boundary

arXiv.org Machine Learning

Support Vector Data Description (SVDD) is a popular one-class classifier for anomaly and novelty detection. But despite its effectiveness, SVDD does not scale well with data size. To avoid prohibitive training times, sampling methods select small subsets of the training data on which SVDD trains a decision boundary that is, ideally, equivalent to the one obtained on the full data set. According to the literature, a good sample should therefore contain so-called boundary observations that SVDD would select as support vectors on the full data set. However, non-boundary observations are also essential to keep contiguous inlier regions intact and to avoid poor classification accuracy. Other aspects, such as selecting a sufficiently representative sample, are important as well. But existing sampling methods largely overlook them, resulting in poor classification accuracy. In this article, we study how to select a sample considering these points. Our approach is to frame SVDD sampling as an optimization problem, where constraints guarantee that sampling indeed approximates the original decision boundary. We then propose RAPID, an efficient algorithm to solve this optimization problem. RAPID does not require any tuning of parameters, is easy to implement and scales well to large data sets. We evaluate our approach on real-world and synthetic data. Our evaluation is the most comprehensive one for SVDD sampling so far. Our results show that RAPID outperforms its competitors in classification accuracy, in sample size, and in runtime.
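RAPID itself solves a constrained optimization problem. As a loose, hypothetical illustration of the goal - a small sample that still spans both the boundary and the interior of the data - greedy farthest-point sampling is a classic coverage-oriented baseline (this is not RAPID's algorithm):

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_point_sample(points, k, seed=0):
    """Greedily add the point farthest from the current sample. The result
    covers extreme (boundary-like) and interior regions with few points -
    a stand-in for boundary-plus-representative selection, not RAPID."""
    rng = random.Random(seed)
    chosen = [rng.randrange(len(points))]
    d2 = [sq_dist(p, points[chosen[0]]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=d2.__getitem__)
        chosen.append(nxt)
        # Each point keeps its distance to the nearest chosen point.
        d2 = [min(d2[i], sq_dist(points[i], points[nxt])) for i in range(len(points))]
    return chosen

rng = random.Random(1)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
sample = farthest_point_sample(data, 30)
```

Training a one-class model on such a subset is far cheaper than on all 300 points; whether the resulting decision boundary matches the full-data one is exactly the guarantee RAPID's constraints are designed to provide.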


Generating Artificial Outliers in the Absence of Genuine Ones -- a Survey

arXiv.org Machine Learning

By definition, outliers are rarely observed in reality, making them difficult to detect or analyse. Artificial outliers approximate such genuine outliers and can, for instance, help with the detection of genuine outliers or with benchmarking outlier-detection algorithms. The literature features different approaches to generate artificial outliers. However, a systematic comparison of these approaches remains absent. This work surveys and compares these approaches. We start by clarifying the terminology in the field, which varies from publication to publication, and we propose a general problem formulation. Our description of the connection of generating outliers to other research fields like experimental design or generative models frames the field of artificial outliers. Along with offering a concise description, we group the approaches by their general concepts and how they make use of genuine instances. An extensive experimental study reveals the differences between the generation approaches when ultimately being used for outlier detection. This survey shows that the existing approaches already cover a wide range of concepts underlying the generation, but also that the field still has potential for further development. Our experimental study does confirm the expectation that the quality of the generation approaches varies widely, for example, in terms of the data set they are used on. Ultimately, to guide the choice of the generation approach in a specific context, we propose a general decision process. In summary, this survey comprises, describes, and connects all relevant work regarding the generation of artificial outliers and may serve as a basis to guide further research in the field.
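One classic approach from the surveyed family is to sample uniformly from an inflated bounding box of the genuine data, so that artificial outliers surround the inliers. A minimal sketch - the inflation factor is an arbitrary choice for the example:

```python
import random

def uniform_box_outliers(inliers, n, inflate=1.2, seed=0):
    """Sample n artificial outliers uniformly from the bounding box of the
    genuine data, expanded by 'inflate' around the box center."""
    rng = random.Random(seed)
    dims = list(zip(*inliers))  # one tuple of values per dimension
    centers = [(max(c) + min(c)) / 2 for c in dims]
    halves = [(max(c) - min(c)) / 2 * inflate for c in dims]
    return [[rng.uniform(m - h, m + h) for m, h in zip(centers, halves)]
            for _ in range(n)]

# Hypothetical 2-d genuine data with different scales per dimension.
rng = random.Random(42)
inliers = [[rng.gauss(0, 1), rng.gauss(5, 2)] for _ in range(100)]
fake = uniform_box_outliers(inliers, 50)
```

Labeling the genuine data as one class and such generated points as the other turns unsupervised outlier detection into an ordinary binary classification problem - one of the main uses of artificial outliers discussed in the survey.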


Incremental Real-Time Personalization in Human Activity Recognition Using Domain Adaptive Batch Normalization

arXiv.org Machine Learning

Human Activity Recognition (HAR) from devices like smartphone accelerometers is a fundamental problem in ubiquitous computing. Machine learning based recognition models often perform poorly when applied to new users that were not part of the training data. Previous work has addressed this challenge by personalizing general recognition models to the unique motion pattern of a new user in a static batch setting. They require target user data to be available upfront. The more challenging online setting has received less attention. No samples from the target user are available in advance, but they arrive sequentially. Additionally, the user's motion pattern may change over time. Thus, adapting to new and forgetting old information must be traded off. Finally, the target user should not have to do any work to use the recognition system by, say, labeling any activities. Our work addresses these challenges by proposing an unsupervised online domain adaptation algorithm. Both classification and personalization happen continuously and incrementally in real-time. Our solution works by aligning the feature distribution of all subjects, source and target, within deep neural network layers. Experiments with 44 subjects show accuracy improvements of up to 14% for some individuals. The median improvement is 4%.
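The mechanism can be reduced to its core: keep per-user normalization statistics and update them incrementally as the target user's samples arrive, so the feature distribution is aligned without any labels. This sketch omits the neural network and uses a single scalar feature for brevity:

```python
class OnlineBatchNorm1D:
    """Per-user feature normalization with running statistics, updated
    incrementally from the target user's unlabeled stream - the core of
    domain-adaptive batch normalization, minus the surrounding network."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum, self.eps = momentum, eps
        self.mean, self.var = 0.0, 1.0

    def __call__(self, x):
        # Move the running statistics toward the new observation, then normalize.
        self.mean = (1 - self.momentum) * self.mean + self.momentum * x
        self.var = (1 - self.momentum) * self.var + self.momentum * (x - self.mean) ** 2
        return (x - self.mean) / (self.var + self.eps) ** 0.5

bn = OnlineBatchNorm1D()
# Hypothetical accelerometer feature of a new user, shifted by +10 relative
# to the statistics the layer started with.
stream = [10.0 + 0.1 * ((i % 5) - 2) for i in range(500)]
outputs = [bn(x) for x in stream]
late_mean = sum(outputs[-100:]) / 100
```

Early outputs are far off because the statistics still reflect the old users; after a few hundred samples the layer has adapted and the normalized stream is centered again, with no labeling effort from the target user.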


Scenario Discovery via Rule Extraction

arXiv.org Machine Learning

Scenario discovery is the process of finding areas of interest, commonly referred to as scenarios, in data spaces resulting from simulations. For instance, one might search for conditions - which are inputs of the simulation model - where the system under investigation is unstable. A commonly used algorithm for scenario discovery is PRIM. It yields scenarios in the form of hyper-rectangles which are human-comprehensible. When the simulation model has many inputs, and the simulations are computationally expensive, PRIM may not produce good results, given the affordable volume of data. So we propose a new procedure for scenario discovery - we train an intermediate statistical model which generalizes fast, and use it to label (a lot of) data for PRIM. We provide the statistical intuition behind our idea. Our experimental study shows that this method is much better than PRIM itself. Specifically, our method reduces the number of simulation runs necessary by 75% on average.
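PRIM's peeling phase can be sketched directly: repeatedly remove the small slice along one input dimension (from either end) that most increases the mean of the target label inside the remaining box. A simplified, hypothetical version without PRIM's pasting phase, on toy data where the interesting outcomes concentrate at x0 > 0.7:

```python
import random

def prim_peel(points, labels, peel_frac=0.1, min_support=20):
    """One PRIM-style peeling pass: shrink the box toward high mean label."""
    box = list(range(len(points)))
    n_dims = len(points[0])
    while len(box) * (1 - peel_frac) >= min_support:
        best, best_mean = None, sum(labels[i] for i in box) / len(box)
        k = max(1, int(len(box) * peel_frac))
        for d in range(n_dims):
            order = sorted(box, key=lambda i: points[i][d])
            # Candidate boxes: peel the k lowest or the k highest values in dim d.
            for candidate in (order[k:], order[:-k]):
                m = sum(labels[i] for i in candidate) / len(candidate)
                if m > best_mean:
                    best, best_mean = candidate, m
        if best is None:  # no peel improves the mean any further
            break
        box = best
    return box

rng = random.Random(0)
pts = [[rng.random(), rng.random()] for _ in range(400)]
ys = [1 if p[0] > 0.7 else 0 for p in pts]  # 'interesting' simulation outcomes
found = prim_peel(pts, ys)
purity = sum(ys[i] for i in found) / len(found)
```

The procedure is greedy and data-hungry, which motivates the paper's idea: let a cheap intermediate model label many points so PRIM has enough data to peel reliably.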