
DRMD: Deep Reinforcement Learning for Malware Detection under Concept Drift

McFadden, Shae, Foley, Myles, D'Onghia, Mario, Hicks, Chris, Mavroudis, Vasilios, Paoletti, Nicola, Pierazzi, Fabio

arXiv.org Artificial Intelligence

Malware detection in real-world settings must deal with evolving threats, limited labeling budgets, and uncertain predictions. Traditional classifiers, without additional mechanisms, struggle to maintain performance under concept drift in malware domains, as their supervised learning formulation cannot optimize when to defer decisions to manual labeling and adaptation. Modern malware detection pipelines combine classifiers with monthly active learning (AL) and rejection mechanisms to mitigate the impact of concept drift. In this work, we develop a novel formulation of malware detection as a one-step Markov Decision Process and train a deep reinforcement learning (DRL) agent that simultaneously optimizes sample classification performance and rejects high-risk samples for manual labeling. We evaluated the joint detection and drift mitigation policy learned by the DRL-based Malware Detection (DRMD) agent through time-aware evaluations on Android malware datasets subject to realistic drift requiring multi-year performance stability. The policies learned under these conditions achieve a higher Area Under Time (AUT) performance compared to standard classification approaches used in the domain, showing improved resilience to concept drift. Specifically, the DRMD agent achieved an average AUT improvement of 8.66 and 10.90 for the classification-only and classification-rejection policies, respectively. Our results demonstrate for the first time that DRL can facilitate effective malware detection and improved resilience to concept drift in the dynamic setting of Android malware detection.
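The one-step MDP described in the abstract can be made concrete with a small sketch: the state is a sample's feature representation, the action set is {benign, malicious, reject}, and the episode terminates after a single decision. The reward values and the labeling-cost constant below are illustrative assumptions, not figures from the paper, and `policy` is a hypothetical threshold rule standing in for the learned DRL agent.

```python
# Hedged sketch of a one-step MDP for malware detection with a reject action.
# Reward magnitudes and REJECT_COST are assumed for illustration only.

ACTIONS = ("benign", "malicious", "reject")
REJECT_COST = 0.2  # assumed fixed cost of deferring a sample to manual labeling

def step(true_label, action):
    """One-step episode: return the terminal reward for a single decision."""
    if action == "reject":
        # Deferring incurs a labeling cost but avoids misclassification risk.
        return -REJECT_COST
    return 1.0 if action == true_label else -1.0

def policy(score, low=0.4, high=0.6):
    """Hypothetical stand-in policy: reject when a classifier score is uncertain."""
    if low < score < high:
        return "reject"
    return "malicious" if score >= high else "benign"

if __name__ == "__main__":
    print(step("malicious", policy(0.9)))  # confident and correct -> +1.0
    print(step("malicious", policy(0.5)))  # uncertain -> reject -> -0.2
```

The point of the formulation is that the reject cost and the misclassification penalties live in one reward function, so a single agent can trade them off, rather than bolting a rejection threshold onto a separately trained classifier.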


A Theoretical Results: Consider a rewardless

Neural Information Processing Systems

We first bound the maximum increase; the case for the maximum decrease is similar. The auxiliary reward function is learned after it is generated; we train each auxiliary reward function for 1M steps. A careful λ schedule helps induce a successful policy that avoids side effects.

Algorithm 1:
Require: CB-VAE training epochs T
Require: AUP penalty λ
Require: Exploration buffer size K
Require: Auxiliary model training steps L
Require: AUP model training steps N
Require: PPO update function PPO-Update
Require: CB-VAE update function VAE-Update
for step k = 1, ..., K do
    Sample random action a
    s ← Act(a)
    S ← S ∪ {s}
end for
for epoch t = 1, ..., T do
    VAE-Update(F, S)
end for
for step i = 1, ..., L + N do
    s ← starting state
    for step l = 1, ..., L do
        a = ψ …

"Common" refers to those hyperparameters that are the same for each evaluated condition.
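The first two phases of Algorithm 1 (random exploration to fill the buffer S, followed by CB-VAE training) and the λ schedule mentioned above can be sketched as follows. The environment step function, the buffer size, and the linear ramp shape of the schedule are illustrative assumptions; the actual CB-VAE and PPO updates are not reproduced here.

```python
import random

# Hedged sketch of the exploration phase of Algorithm 1: act randomly for K
# steps and record the visited states; the resulting buffer S is what the
# CB-VAE is later trained on. env_step is a stand-in for the real environment.

def collect_exploration_buffer(env_step, actions, K):
    """Sample K random actions and record the resulting states."""
    S = []
    for _ in range(K):
        a = random.choice(actions)
        s = env_step(a)
        S.append(s)
    return S

def linear_lambda_schedule(step, total_steps, lam_max):
    """One possible 'careful λ schedule': ramp the AUP penalty linearly from 0
    to lam_max, then hold it constant. The linear shape is an assumption."""
    return lam_max * min(1.0, step / total_steps)
```

Ramping λ up gradually lets the agent first learn to pursue the task reward before the side-effect penalty begins to dominate, which matches the text's remark that the schedule matters for inducing a successful policy.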



Checklist

Neural Information Processing Systems

The interesting properties of this model are: i) Boundedness: the hidden state h(t) stays within the range (−1, 1); ii) Continuity: GRU-ODE is Lipschitz continuous with Lipschitz constant 2. In Appendix A.3 we show how our GRU flow model has the same properties without the need to use



Continual Learning with Query-Only Attention

Bekal, Gautham, Pujari, Ashish, Kelly, Scott David

arXiv.org Artificial Intelligence

Continual learning involves learning from a stream of data without repetition of data points, a scenario that is inherently complex due to distributional shift across tasks. We propose a query-only attention mechanism that discards keys and values, yet preserves the core inductive bias of transformer architectures. In continual learning scenarios, this simplified mechanism significantly mitigates both loss of plasticity and catastrophic forgetting, outperforming baselines such as selective re-initialization. We establish a conceptual link between query-only attention, full transformer attention, and model agnostic meta-learning, framing them as instances of meta-learning. We further provide intuition for why query-based models and attention networks help preserve plasticity in continual settings. Finally, through preliminary Hessian spectrum analysis, we observe that models maintaining higher curvature rank across tasks tend to retain plasticity. Our findings suggest that full attention may not be essential for capturing the benefits of meta-learning in continual learning.
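The abstract does not spell out the exact mechanism, so the following is a minimal sketch of one plausible query-only variant: a single projection W_q replaces the usual W_q/W_k/W_v triple, attention weights are formed from query-query similarities, and the weighted mixture is taken over the queries themselves. This is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

# Hedged sketch of a query-only attention layer: no keys, no values, only one
# learned projection. softmax(Q Q^T / sqrt(d)) @ Q is an assumed formulation.

def query_only_attention(x, w_q):
    """x: (seq_len, d_model); w_q: (d_model, d_head)."""
    q = x @ w_q                                   # queries only
    scores = q @ q.T / np.sqrt(q.shape[-1])       # query-query similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ q                               # mix the queries themselves

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q = rng.normal(size=(8, 8)) * 0.1
out = query_only_attention(x, w_q)  # shape (4, 8)
```

Dropping keys and values removes two of the three projection matrices while keeping the content-dependent mixing that the abstract credits with preserving the transformer's inductive bias.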