Accuracy
Ensemble Learning based Anomaly Detection for IoT Cybersecurity via Bayesian Hyperparameters Sensitivity Analysis
Lai, Tin, Farid, Farnaz, Bello, Abubakar, Sabrina, Fariza
The Internet of Things (IoT) integrates more than billions of intelligent devices over the globe with the capability of communicating with other connected devices with little to no human intervention. IoT enables data aggregation and analysis on a large scale to improve life quality in many domains. In particular, data collected by IoT contain a tremendous amount of information for anomaly detection. The heterogeneous nature of IoT is both a challenge and an opportunity for cybersecurity. Traditional approaches in cybersecurity monitoring often require different kinds of data pre-processing and handling for various data types, which might be problematic for datasets that contain heterogeneous features. However, heterogeneous types of network devices can often capture a more diverse set of signals than a single type of device readings, which is particularly useful for anomaly detection. In this paper, we present a comprehensive study on using ensemble machine learning methods for enhancing IoT cybersecurity via anomaly detection. Rather than using one single machine learning model, ensemble learning combines the predictive power from multiple models, enhancing their predictive accuracy in heterogeneous datasets rather than using one single machine learning model. We propose a unified framework with ensemble learning that utilises Bayesian hyperparameter optimisation to adapt to a network environment that contains multiple IoT sensor readings. Experimentally, we illustrate their high predictive power when compared to traditional methods.
A Deep Dive into the Disparity of Word Error Rates Across Thousands of NPTEL MOOC Videos
Rai, Anand Kumar, Jaiswal, Siddharth D, Mukherjee, Animesh
Automatic speech recognition (ASR) systems are designed to transcribe spoken language into written text and find utility in a variety of applications including voice assistants and transcription services. However, it has been observed that state-of-the-art ASR systems which deliver impressive benchmark results, struggle with speakers of certain regions or demographics due to variation in their speech properties. In this work, we describe the curation of a massive speech dataset of 8740 hours consisting of $\sim9.8$K technical lectures in the English language along with their transcripts delivered by instructors representing various parts of Indian demography. The dataset is sourced from the very popular NPTEL MOOC platform. We use the curated dataset to measure the existing disparity in YouTube Automatic Captions and OpenAI Whisper model performance across the diverse demographic traits of speakers in India. While there exists disparity due to gender, native region, age and speech rate of speakers, disparity based on caste is non-existent. We also observe statistically significant disparity across the disciplines of the lectures. These results indicate the need of more inclusive and robust ASR systems and more representational datasets for disparity evaluation in them.
Monotonic Risk Relationships under Distribution Shifts for Regularized Risk Minimization
LeJeune, Daniel, Liu, Jiayu, Heckel, Reinhard
Machine learning models are typically evaluated by shuffling a set of labeled data, splitting it into training and test sets, and evaluating the model trained on the training set on the test set. This measures how well the model performs on the distribution the model was trained on. However, in practice a model is most commonly not applied to such in-distribution data, but rather to outof-distribution data that is almost always at least slightly different. In order to understand the performance of machine learning methods in practice, it is therefore important to understand how out-of-distribution performance relates to in-distribution performance. While there are settings in which models with similar in-distribution performance have different out-of-distribution performance (McCoy et al., 2020), a series of recent empirical studies have shown that often, the in-distribution and out-of-distribution performances of models are strongly correlated: Recht et al. (2019), Yadav and Bottou (2019), and Miller et al. (2020) constructed new test sets for the popular CIFAR-10, ImageNet, and MNIST image classification problems and for the SQuAD question answering datasets by following the original data collection and labeling process as closely as possible. For CIFAR-10 and ImageNet the performance drops significantly when evaluated on the new test set, indicating that even when following the original data collection and labeling process, a significant distribution shift can occur. In addition, for all four distribution shifts, the in-and out-of-distribution errors are strongly linearly correlated.
Uncovering Bias in Personal Informatics
Yfantidou, Sofia, Sermpezis, Pavlos, Vakali, Athena, Baeza-Yates, Ricardo
Ubiquitous technologies, such as smartphones and wearables, are an integral part of our lives today [47, 90]. Their proliferation has given rise to Personal Informatics (PI), namely a class of systems that "help people collect personally relevant information for the purpose of self-reflection and gaining self-knowledge" [66]. Such systems enable people to keep track of their productivity [62], finances [60], and learning [45]. Yet, tracking various aspects of physical and mental health is particularly prevalent [33]. PI systems can continuously and unobtrusively measure and collect physiological and behavioral data, namely, "digital biomarkers", from users through integrated sensors. Digital biomarkers contain an uncanny amount of personal information. Even the coarser behavioral biomarkers acquired from consumer wearables (e.g., steps, calories) strongly correlate to a person's gender, height, and weight [61], while signals of finer granularity (e.g., accelerometer and heart rate), can predict variables associated with an individual's physical health, fitness, and demographics [89]. At the same time, consumer smartphones and wearables are now packed with an increasing number of advanced health tracking features, innovating in personal health, research, and care [7]. Flagship consumer wearable algorithms --some approved by the US Food and Drug Administration-- can now identify signs of atrial fibrillation (AFib) through electrocardiogram (ECG) or photoplethysmography (PPG) signals [37].
Confidence Estimation Using Unlabeled Data
Li, Chen, Hu, Xiaoling, Chen, Chao
Overconfidence is a common issue for deep neural networks, limiting their deployment in real-world applications. To better estimate confidence, existing methods mostly focus on fully-supervised scenarios and rely on training labels. In this paper, we propose the first confidence estimation method for a semi-supervised setting, when most training labels are unavailable. We stipulate that even with limited training labels, we can still reasonably approximate the confidence of model on unlabeled samples by inspecting the prediction consistency through the training process. We use training consistency as a surrogate function and propose a consistency ranking loss for confidence estimation. On both image classification and segmentation tasks, our method achieves state-of-the-art performances in confidence estimation. Furthermore, we show the benefit of the proposed method through a downstream active learning task. The code is available at https://github.com/TopoXLab/consistency-ranking-loss
Selection functions of strong lens finding neural networks
Herle, A., O'Riordan, C. M., Vegetti, S.
Convolution Neural Networks trained for the task of lens finding with similar architecture and training data as is commonly found in the literature are biased classifiers. An understanding of the selection function of lens finding neural networks will be key to fully realising the potential of the large samples of strong gravitational lens systems that will be found in upcoming wide-field surveys. We use three training datasets, representative of those used to train galaxy-galaxy and galaxy-quasar lens finding neural networks. The networks preferentially select systems with larger Einstein radii and larger sources with more concentrated source-light distributions. Increasing the detection significance threshold to 12$\sigma$ from 8$\sigma$ results in 50 per cent of the selected strong lens systems having Einstein radii $\theta_\mathrm{E}$ $\ge$ 1.04 arcsec from $\theta_\mathrm{E}$ $\ge$ 0.879 arcsec, source radii $R_S$ $\ge$ 0.194 arcsec from $R_S$ $\ge$ 0.178 arcsec and source S\'ersic indices $n_{\mathrm{Sc}}^{\mathrm{S}}$ $\ge$ 2.62 from $n_{\mathrm{Sc}}^{\mathrm{S}}$ $\ge$ 2.55. The model trained to find lensed quasars shows a stronger preference for higher lens ellipticities than those trained to find lensed galaxies. The selection function is independent of the slope of the power-law of the mass profiles, hence measurements of this quantity will be unaffected. The lens finder selection function reinforces that of the lensing cross-section, and thus we expect our findings to be a general result for all galaxy-galaxy and galaxy-quasar lens finding neural networks.
Code Detection for Hardware Acceleration Using Large Language Models
Martínez, Pablo Antonio, Bernabé, Gregorio, García, José Manuel
Large language models (LLMs) have been massively applied to many tasks, often surpassing state-of-the-art approaches. While their effectiveness in code generation has been extensively studied (e.g., AlphaCode), their potential for code detection remains unexplored. This work presents the first analysis of code detection using LLMs. Our study examines essential kernels, including matrix multiplication, convolution, and fast-fourier transform, implemented in C/C++. We propose both a preliminary, naive prompt and a novel prompting strategy for code detection. Results reveal that conventional prompting achieves great precision but poor accuracy (68.8%, 22.3%, and 79.2% for GEMM, convolution, and FFT, respectively) due to a high number of false positives. Our novel prompting strategy substantially reduces false positives, resulting in excellent overall accuracy (91.1%, 97.9%, and 99.7%, respectively). These results pose a considerable challenge to existing state-of-the-art code detection methods.
GenKL: An Iterative Framework for Resolving Label Ambiguity and Label Non-conformity in Web Images Via a New Generalized KL Divergence
Huang, Xia, Chong, Kai Fong Ernest
Web image datasets curated online inherently contain ambiguous in-distribution (ID) instances and out-of-distribution (OOD) instances, which we collectively call non-conforming (NC) instances. In many recent approaches for mitigating the negative effects of NC instances, the core implicit assumption is that the NC instances can be found via entropy maximization. For "entropy" to be well-defined, we are interpreting the output prediction vector of an instance as the parameter vector of a multinomial random variable, with respect to some trained model with a softmax output layer. Hence, entropy maximization is based on the idealized assumption that NC instances have predictions that are "almost" uniformly distributed. However, in real-world web image datasets, there are numerous NC instances whose predictions are far from being uniformly distributed. To tackle the limitation of entropy maximization, we propose $(\alpha, \beta)$-generalized KL divergence, $\mathcal{D}_{\text{KL}}^{\alpha, \beta}(p\|q)$, which can be used to identify significantly more NC instances. Theoretical properties of $\mathcal{D}_{\text{KL}}^{\alpha, \beta}(p\|q)$ are proven, and we also show empirically that a simple use of $\mathcal{D}_{\text{KL}}^{\alpha, \beta}(p\|q)$ outperforms all baselines on the NC instance identification task. Building upon $(\alpha,\beta)$-generalized KL divergence, we also introduce a new iterative training framework, GenKL, that identifies and relabels NC instances. When evaluated on three web image datasets, Clothing1M, Food101/Food101N, and mini WebVision 1.0, we achieved new state-of-the-art classification accuracies: $81.34\%$, $85.73\%$ and $78.99\%$/$92.54\%$ (top-1/top-5), respectively.
Beyond Single-Feature Importance with ICECREAM
Oesterle, Michael, Blöbaum, Patrick, Mastakouri, Atalanti A., Kirschbaum, Elke
Which set of features was responsible for a certain output of a machine learning model? Which components caused the failure of a cloud computing application? These are just two examples of questions we are addressing in this work by Identifying Coalition-based Explanations for Common and Rare Events in Any Model (ICECREAM). Specifically, we propose an information-theoretic quantitative measure for the influence of a coalition of variables on the distribution of a target variable. This allows us to identify which set of factors is essential to obtain a certain outcome, as opposed to well-established explainability and causal contribution analysis methods which can assign contributions only to individual factors and rank them by their importance. In experiments with synthetic and real-world data, we show that ICECREAM outperforms state-of-the-art methods for explainability and root cause analysis, and achieves impressive accuracy in both tasks.
Pseudo Outlier Exposure for Out-of-Distribution Detection using Pretrained Transformers
Kim, Jaeyoung, Jung, Kyuheon, Na, Dongbin, Jang, Sion, Park, Eunbin, Choi, Sungchul
For real-world language applications, detecting an out-of-distribution (OOD) sample is helpful to alert users or reject such unreliable samples. However, modern over-parameterized language models often produce overconfident predictions for both in-distribution (ID) and OOD samples. In particular, language models suffer from OOD samples with a similar semantic representation to ID samples since these OOD samples lie near the ID manifold. A rejection network can be trained with ID and diverse outlier samples to detect test OOD samples, but explicitly collecting auxiliary OOD datasets brings an additional burden for data collection. In this paper, we propose a simple but effective method called Pseudo Outlier Exposure (POE) that constructs a surrogate OOD dataset by sequentially masking tokens related to ID classes. The surrogate OOD sample introduced by POE shows a similar representation to ID data, which is most effective in training a rejection network. Our method does not require any external OOD data and can be easily implemented within off-the-shelf Transformers. A comprehensive comparison with state-of-the-art algorithms demonstrates POE's competitiveness on several text classification benchmarks.