Eskofier, Bjoern
Robust and Efficient Writer-Independent IMU-Based Handwriting Recognization
Li, Jindong, Hamann, Tim, Barth, Jens, Kaempf, Peter, Zanca, Dario, Eskofier, Bjoern
Online handwriting recognition (HWR) using data from inertial measurement units (IMUs) remains challenging due to variations in writing styles and the limited availability of high-quality annotated datasets. Traditional models often struggle to recognize handwriting from unseen writers, making writer-independent (WI) recognition a crucial but difficult problem. This paper presents an HWR model with an encoder-decoder structure for IMU data, featuring a CNN-based encoder for feature extraction and a BiLSTM decoder for sequence modeling, which supports inputs of varying lengths. Our approach demonstrates strong robustness and data efficiency, outperforming existing methods on WI datasets, including the WI split of the OnHW dataset and our own dataset. Extensive evaluations show that our model maintains high accuracy across different age groups and writing conditions while effectively learning from limited data. Through comprehensive ablation studies, we analyze key design choices, achieving a balance between accuracy and efficiency. These findings contribute to the development of more adaptable and scalable HWR systems for real-world applications.
Don't Get Me Wrong: How to Apply Deep Visual Interpretations to Time Series
Loeffler, Christoffer, Lai, Wei-Cheng, Eskofier, Bjoern, Zanca, Dario, Schmidt, Lukas, Mutschler, Christopher
The correct interpretation and understanding of deep learning models are essential in many applications. Explanatory visual interpretation approaches for image, and natural language processing allow domain experts to validate and understand almost any deep learning model. However, they fall short when generalizing to arbitrary time series, which is inherently less intuitive and more diverse. Whether a visualization explains valid reasoning or captures the actual features is difficult to judge. Hence, instead of blind trust, we need an objective evaluation to obtain trustworthy quality metrics. We propose a framework of six orthogonal metrics for gradient-, propagation- or perturbation-based post-hoc visual interpretation methods for time series classification and segmentation tasks. An experimental study includes popular neural network architectures for time series and nine visual interpretation methods. We evaluate the visual interpretation methods with diverse datasets from the UCR repository and a complex, real-world dataset and study the influence of standard regularization techniques during training. We show that none of the methods consistently outperforms others on all metrics, while some are sometimes ahead. Our insights and recommendations allow experts to choose suitable visualization techniques for the model and task.
Contrastive Language-Image Pretrained Models are Zero-Shot Human Scanpath Predictors
Zanca, Dario, Zugarini, Andrea, Dietz, Simon, Altstidl, Thomas R., Ndjeuha, Mark A. Turban, Schwinn, Leo, Eskofier, Bjoern
Understanding the mechanisms underlying human attention is a fundamental challenge for both vision science and artificial intelligence. While numerous computational models of free-viewing have been proposed, less is known about the mechanisms underlying task-driven image exploration. To address this gap, we present CapMIT1003, a database of captions and click-contingent image explorations collected during captioning tasks. CapMIT1003 is based on the same stimuli from the well-known MIT1003 benchmark, for which eye-tracking data under free-viewing conditions is available, which offers a promising opportunity to concurrently study human attention under both tasks. We make this dataset publicly available to facilitate future research in this field. In addition, we introduce NevaClip, a novel zero-shot method for predicting visual scanpaths that combines contrastive language-image pretrained (CLIP) models with biologically-inspired neural visual attention (NeVA) algorithms. NevaClip simulates human scanpaths by aligning the representation of the foveated visual stimulus and the representation of the associated caption, employing gradient-driven visual exploration to generate scanpaths. Our experimental results demonstrate that NevaClip outperforms existing unsupervised computational models of human visual attention in terms of scanpath plausibility, for both captioning and free-viewing tasks. Furthermore, we show that conditioning NevaClip with incorrect or misleading captions leads to random behavior, highlighting the significant impact of caption guidance in the decision-making process. These findings contribute to a better understanding of mechanisms that guide human attention and pave the way for more sophisticated computational approaches to scanpath prediction that can integrate direct top-down guidance of downstream tasks.
From Patches to Objects: Exploiting Spatial Reasoning for Better Visual Representations
Albert, Toni, Eskofier, Bjoern, Zanca, Dario
As the field of deep learning steadily transitions from the realm of academic research to practical application, the significance of self-supervised pretraining methods has become increasingly prominent. These methods, particularly in the image domain, offer a compelling strategy to effectively utilize the abundance of unlabeled image data, thereby enhancing downstream tasks' performance. In this paper, we propose a novel auxiliary pretraining method that is based on spatial reasoning. Our proposed method takes advantage of a more flexible formulation of contrastive learning by introducing spatial reasoning as an auxiliary task for discriminative self-supervised methods. Spatial Reasoning works by having the network predict the relative distances between sampled non-overlapping patches. We argue that this forces the network to learn more detailed and intricate internal representations of the objects and the relationships between their constituting parts. Our experiments demonstrate substantial improvement in downstream performance in linear evaluation compared to similar work and provide directions for further research into spatial reasoning.
Active Learning of Ordinal Embeddings: A User Study on Football Data
Loeffler, Christoffer, Fallah, Kion, Fenu, Stefano, Zanca, Dario, Eskofier, Bjoern, Rozell, Christopher John, Mutschler, Christopher
Distance metrics can only serve as proxy for similarity in information retrieval of similar instances. Learning a good similarity function from human annotations improves the quality of retrievals. This work uses deep metric learning to learn these user-defined similarity functions from few annotations for a large football trajectory dataset. We adapt an entropy-based active learning method with recent work from triplet mining to collect easy-to-answer but still informative annotations from human participants and use them to train a deep convolutional network that generalizes to unseen samples. Our user study shows that our approach improves the quality of the information retrieval compared to a previous deep metric learning approach that relies on a Siamese network. Specifically, we shed light on the strengths and weaknesses of passive sampling heuristics and active learners alike by analyzing the participants' response efficacy. To this end, we collect accuracy, algorithmic time complexity, the participants' fatigue and time-to-response, qualitative self-assessment and statements, as well as the effects of mixed-expertise annotators and their consistency on model performance and transfer-learning.
Identifying Untrustworthy Predictions in Neural Networks by Geometric Gradient Analysis
Schwinn, Leo, Nguyen, An, Raab, Renรฉ, Bungert, Leon, Tenbrinck, Daniel, Zanca, Dario, Burger, Martin, Eskofier, Bjoern
The susceptibility of deep neural networks to untrustworthy predictions, including out-of-distribution (OOD) data and adversarial examples, still prevent their widespread use in safety-critical applications. Most existing methods either require a re-training of a given model to achieve robust identification of adversarial attacks or are limited to out-of-distribution sample detection only. In this work, we propose a geometric gradient analysis (GGA) to improve the identification of untrustworthy predictions without retraining of a given model. GGA analyzes the geometry of the loss landscape of neural networks based on the saliency maps of their respective input. To motivate the proposed approach, we provide theoretical connections between gradients' geometrical properties and local minima of the loss function. Furthermore, we demonstrate that the proposed method outperforms prior approaches in detecting OOD data and adversarial attacks, including state-of-the-art and adaptive attacks.
System Design for a Data-driven and Explainable Customer Sentiment Monitor
Nguyen, An, Foerstel, Stefan, Kittler, Thomas, Kurzyukov, Andrey, Schwinn, Leo, Zanca, Dario, Hipp, Tobias, Sun, Da Jun, Schrapp, Michael, Rothgang, Eva, Eskofier, Bjoern
The most important goal of customer services is to keep the customer satisfied. However, service resources are always limited and must be prioritized. Therefore, it is important to identify customers who potentially become unsatisfied and might lead to escalations. Today this prioritization of customers is often done manually. Data science on IoT data (esp. log data) for machine health monitoring, as well as analytics on enterprise data for customer relationship management (CRM) have mainly been researched and applied independently. In this paper, we present a framework for a data-driven decision support system which combines IoT and enterprise data to model customer sentiment. Such decision support systems can help to prioritize customers and service resources to effectively troubleshoot problems or even avoid them. The framework is applied in a real-world case study with a major medical device manufacturer. This includes a fully automated and interpretable machine learning pipeline designed to meet the requirements defined with domain experts and end users. The overall framework is currently deployed, learns and evaluates predictive models from terabytes of IoT and enterprise data to actively monitor the customer sentiment for a fleet of thousands of high-end medical devices. Furthermore, we provide an anonymized industrial benchmark dataset for the research community.
Time Matters: Time-Aware LSTMs for Predictive Business Process Monitoring
Nguyen, An, Chatterjee, Srijeet, Weinzierl, Sven, Schwinn, Leo, Matzner, Martin, Eskofier, Bjoern
Predictive business process monitoring (PBPM) aims to predict future process behavior during ongoing process executions based on event log data. Especially, techniques for the next activity and timestamp prediction can help to improve the performance of operational business processes. Recently, many PBPM solutions based on deep learning were proposed by researchers. Due to the sequential nature of event log data, a common choice is to apply recurrent neural networks with long short-term memory (LSTM) cells. We argue, that the elapsed time between events is informative. However, current PBPM techniques mainly use 'vanilla' LSTM cells and hand-crafted time-related control flow features. To better model the time dependencies between events, we propose a new PBPM technique based on time-aware LSTM (T-LSTM) cells. T-LSTM cells incorporate the elapsed time between consecutive events inherently to adjust the cell memory. Furthermore, we introduce cost-sensitive learning to account for the common class imbalance in event logs. Our experiments on publicly available benchmark event logs indicate the effectiveness of the introduced techniques.
Conformance Checking for a Medical Training Process Using Petri net Simulation and Sequence Alignment
Nguyen, An, Zhang, Wenyu, Schwinn, Leo, Eskofier, Bjoern
Process Mining has recently gained popularity in healthcare due to its potential to provide a transparent, objective and data-based view on processes. Conformance checking is a sub-discipline of process mining that has the potential to answer how the actual process executions deviate from existing guidelines. In this work, we analyze a medical training process for a surgical procedure. Ten students were trained to install a Central Venous Catheters (CVC) with ultrasound. Event log data was collected directly after instruction by the supervisors during a first test run and additionally after a subsequent individual training phase. In order to provide objective performance measures, we formulate an optimal, global sequence alignment problem inspired by approaches in bioinformatics. Therefore, we use the Petri net model representation of the medical process guideline to simulate a representative set of guideline conform sequences. Next, we calculate the optimal, global sequence alignment of the recorded and simulated event logs. Finally, the output measures and visualization of aligned sequences are provided for objective feedback.