Goto

Collaborating Authors

 Gomes, Heitor Murilo


CapyMOA: Efficient Machine Learning for Data Streams in Python

arXiv.org Artificial Intelligence

CapyMOA is an open-source library designed for efficient machine learning on streaming data. It provides a structured framework for real-time learning and evaluation, featuring a flexible data representation. CapyMOA includes an extensible architecture that allows integration with external frameworks such as MOA and PyTorch, facilitating hybrid learning approaches that combine traditional online algorithms with deep learning techniques. By emphasizing adaptability, scalability, and usability, CapyMOA allows researchers and practitioners to tackle dynamic learning challenges across various domains.


Evaluation for Regression Analyses on Evolving Data Streams

arXiv.org Artificial Intelligence

The paper explores the challenges of regression analysis in evolving data streams, an area that remains relatively underexplored compared to classification. We propose a standardized evaluation process for regression and prediction interval tasks in streaming contexts. Additionally, we introduce an innovative drift simulation strategy capable of synthesizing various drift types, including the less-studied incremental drift. Comprehensive experiments with state-of-the-art methods, conducted under the proposed process, validate the effectiveness and robustness of our approach.


CLOFAI: A Dataset of Real And Fake Image Classification Tasks for Continual Learning

arXiv.org Artificial Intelligence

The rapid advancement of generative AI models capable of creating realistic media has led to a need for classifiers that can accurately distinguish between genuine and artificially-generated images. A significant challenge for these classifiers emerges when they encounter images from generative models that are not represented in their training data, usually resulting in diminished performance. A typical approach is to periodically update the classifier's training data with images from the new generative models then retrain the classifier on the updated dataset. However, in some real-life scenarios, storage, computational, or privacy constraints render this approach impractical. Additionally, models used in security applications may be required to rapidly adapt. In these circumstances, continual learning provides a promising alternative, as the classifier can be updated without retraining on the entire dataset. In this paper, we introduce a new dataset called CLOFAI (Continual Learning On Fake and Authentic Images), which takes the form of a domain-incremental image classification problem. Moreover, we showcase the applicability of this dataset as a benchmark for evaluating continual learning methodologies. In doing this, we set a baseline on our novel dataset using three foundational continual learning methods -- EWC, GEM, and Experience Replay -- and find that EWC performs poorly, while GEM and Experience Replay show promise, performing significantly better than a Naive baseline. The dataset and code to run the experiments can be accessed from the following GitHub repository: https://github.com/Will-Doherty/CLOFAI.


Look At Me, No Replay! SurpriseNet: Anomaly Detection Inspired Class Incremental Learning

arXiv.org Artificial Intelligence

Continual learning aims to create artificial neural networks capable of accumulating knowledge and skills through incremental training on a sequence of tasks. The main challenge of continual learning is catastrophic interference, wherein new knowledge overrides or interferes with past knowledge, leading to forgetting. An associated issue is the problem of learning "cross-task knowledge," where models fail to acquire and retain knowledge that helps differentiate classes across task boundaries. A common solution to both problems is "replay," where a limited buffer of past instances is utilized to learn cross-task knowledge and mitigate catastrophic interference. However, a notable drawback of these methods is their tendency to overfit the limited replay buffer. In contrast, our proposed solution, SurpriseNet, addresses catastrophic interference by employing a parameter isolation method and learning cross-task knowledge using an auto-encoder inspired by anomaly detection. SurpriseNet is applicable to both structured and unstructured data, as it does not rely on image-specific inductive biases. We have conducted empirical experiments demonstrating the strengths of SurpriseNet on various traditional vision continual-learning benchmarks, as well as on structured data datasets. Source code made available at https://doi.org/10.5281/zenodo.8247906 and https://github.com/tachyonicClock/SurpriseNet-CIKM-23


Machine Learning (In) Security: A Stream of Problems

arXiv.org Artificial Intelligence

Machine Learning (ML) has been widely applied to cybersecurity and is considered state-of-the-art for solving many of the open issues in that field. However, it is very difficult to evaluate how good the produced solutions are, since the challenges faced in security may not appear in other areas. One of these challenges is the concept drift, which increases the existing arms race between attackers and defenders: malicious actors can always create novel threats to overcome the defense solutions, which may not consider them in some approaches. Due to this, it is essential to know how to properly build and evaluate an ML-based security solution. In this paper, we identify, detail, and discuss the main challenges in the correct application of ML techniques to cybersecurity data. We evaluate how concept drift, evolution, delayed labels, and adversarial ML impact the existing solutions. Moreover, we address how issues related to data collection affect the quality of the results presented in the security literature, showing that new strategies are needed to improve current solutions. Finally, we present how existing solutions may fail under certain circumstances, and propose mitigations to them, presenting a novel checklist to help the development of future ML solutions for cybersecurity.


Advances on Concept Drift Detection in Regression Tasks using Social Networks Theory

arXiv.org Artificial Intelligence

Mining data streams is one of the main studies in machine learning area due to its application in many knowledge areas. One of the major challenges on mining data streams is concept drift, which requires the learner to discard the current concept and adapt to a new one. Ensemble-based drift detection algorithms have been used successfully to the classification task but usually maintain a fixed size ensemble of learners running the risk of needlessly spending processing time and memory. In this paper we present improvements to the Scale-free Network Regressor (SFNR), a dynamic ensemble-based method for regression that employs social networks theory. In order to detect concept drifts SFNR uses the Adaptive Window (ADWIN) algorithm. Results show improvements in accuracy, especially in concept drift situations and better performance compared to other state-of-the-art algorithms in both real and synthetic data.


STUDD: A Student-Teacher Method for Unsupervised Concept Drift Detection

arXiv.org Machine Learning

Concept drift detection is a crucial task in data stream evolving environments. Most of state of the art approaches designed to tackle this problem monitor the loss of predictive models. However, this approach falls short in many real-world scenarios, where the true labels are not readily available to compute the loss. In this context, there is increasing attention to approaches that perform concept drift detection in an unsupervised manner, i.e., without access to the true labels. We propose a novel approach to unsupervised concept drift detection based on a student-teacher learning paradigm. Essentially, we create an auxiliary model (student) to mimic the behaviour of the primary model (teacher). At run-time, our approach is to use the teacher for predicting new instances and monitoring the mimicking loss of the student for concept drift detection. In a set of experiments using 19 data streams, we show that the proposed approach can detect concept drift and present a competitive behaviour relative to the state of the art approaches.


River: machine learning for streaming data in Python

arXiv.org Artificial Intelligence

River is a machine learning library for dynamic data streams and continual learning. It provides multiple state-of-the-art learning methods, data generators/transformers, performance metrics and evaluators for different stream learning problems. It is the result from the merger of the two most popular packages for stream learning in Python: Creme and scikit-multiflow. River introduces a revamped architecture based on the lessons learnt from the seminal packages. River's ambition is to be the go-to library for doing machine learning on streaming data. Additionally, this open source package brings under the same umbrella a large community of practitioners and researchers. The source code is available at https://github.com/online-ml/river.