edd
Expandable and Differentiable Dual Memories with Orthogonal Regularization for Exemplar-free Continual Learning
Moon, Hyung-Jun, Cho, Sung-Bae
Continual learning methods used to force neural networks to process sequential tasks in isolation, preventing them from leveraging useful inter-task relationships and causing them to repeatedly relearn similar features or overly differentiate them. To address this problem, we propose a fully differentiable, exemplar-free expandable method composed of two complementary memories: One learns common features that can be used across all tasks, and the other combines the shared features to learn discriminative characteristics unique to each sample. Both memories are differentiable so that the network can autonomously learn latent representations for each sample. For each task, the memory adjustment module adaptively prunes critical slots and minimally expands capacity to accommodate new concepts, and orthogonal regularization enforces geometric separation between preserved and newly learned memory components to prevent interference. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that the proposed method outperforms 14 state-of-the-art methods for class-incremental learning, achieving final accuracies of 55.13\%, 37.24\%, and 30.11\%, respectively. Additional analysis confirms that, through effective integration and utilization of knowledge, the proposed method can increase average performance across sequential tasks, and it produces feature extraction results closest to the upper bound, thus establishing a new milestone in continual learning.
- North America > United States > Hawaii (0.04)
- Africa > Central African Republic > Ombella-M'Poko > Bimbo (0.04)
A Theory of Learnability for Offline Decision Making
Mao, Chenjie, Zhang, Qiaosheng
We study the problem of offline decision making, which focuses on learning decisions from datasets only partially correlated with the learning objective. While previous research has extensively studied specific offline decision making problems like offline reinforcement learning (RL) and off-policy evaluation (OPE), a unified framework and theory remain absent. To address this gap, we introduce a unified framework termed Decision Making with Offline Feedback (DMOF), which captures a wide range of offline decision making problems including offline RL, OPE, and offline partially observable Markov decision processes (POMDPs). For the DMOF framework, we introduce a hardness measure called the Offline Estimation Coefficient (OEC), which measures the learnability of offline decision making problems and is also reflected in the derived minimax lower bounds. Additionally, we introduce an algorithm called Empirical Decision with Divergence (EDD), for which we establish both an instance-dependent upper bound and a minimax upper bound. The minimax upper bound almost matches the lower bound determined by the OEC. Finally, we show that EDD achieves a fast convergence rate (i.e., a rate scaling as $1/N$, where $N$ is the sample size) for specific settings such as supervised learning and Markovian sequential problems~(e.g., MDPs) with partial coverage.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Deep learning-driven scheduling algorithm for a single machine problem minimizing the total tardiness
Bouška, Michal, Šůcha, Přemysl, Novák, Antonín, Hanzálek, Zdeněk
First of all, there is a lack of systematic methods that improve the performance of algorithms on unseen instances by gathering the experience from the instances solved in the past. Therefore, all the information obtained during the past runs of an algorithm is neglected when a new instance is encountered. A good example is the branchand-bound-and-remember method [34, 40], where the algorithm remembers the information derived during an instance solving, but the information is forgotten as soon as the instance is solved. Second, the development of efficient heuristic rules requires a substantial amount of time devoted to its design and testing. This process is tedious and requires a skilled human professional to fine-tune the heuristic's parameters. A typical example of this feature is genetic algorithms having many parameters for selection, cross-over, mutation, and other operators. The apparent response to the above challenges is utilizing the existing data. However, the main obstacle to the successful application of machine learning to enhance algorithms for combinatorial problems remains. It can be formulated as the following fundamental question--is it possible to extract any useful information from the solved instances and use it efficiently to accelerate solving of an unseen instance?
- Europe > Czechia > Prague (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)
Beyond Labels: Advancing Cluster Analysis with the Entropy of Distance Distribution (EDD)
Metzner, Claus, Schilling, Achim, Krauss, Patrick
In the evolving landscape of data science, the accurate quantification of clustering in high-dimensional data sets remains a significant challenge, especially in the absence of predefined labels. This paper introduces a novel approach, the Entropy of Distance Distribution (EDD), which represents a paradigm shift in label-free clustering analysis. Traditional methods, reliant on discrete labels, often struggle to discern intricate cluster patterns in unlabeled data. EDD, however, leverages the characteristic differences in pairwise point-to-point distances to discern clustering tendencies, independent of data labeling. Our method employs the Shannon information entropy to quantify the 'peakedness' or 'flatness' of distance distributions in a data set. This entropy measure, normalized against its maximum value, effectively distinguishes between strongly clustered data (indicated by pronounced peaks in distance distribution) and more homogeneous, non-clustered data sets. This label-free quantification is resilient against global translations and permutations of data points, and with an additional dimension-wise z-scoring, it becomes invariant to data set scaling. We demonstrate the efficacy of EDD through a series of experiments involving two-dimensional data spaces with Gaussian cluster centers. Our findings reveal a monotonic increase in the EDD value with the widening of cluster widths, moving from well-separated to overlapping clusters. This behavior underscores the method's sensitivity and accuracy in detecting varying degrees of clustering. EDD's potential extends beyond conventional clustering analysis, offering a robust, scalable tool for unraveling complex data structures without reliance on pre-assigned labels.
Tensor-Aware Energy Accounting
With the rapid growth of Artificial Intelligence (AI) applications supported by deep learning (DL), the energy efficiency of these applications has an increasingly large impact on sustainability. We introduce Smaragdine, a new energy accounting system for tensor-based DL programs implemented with TensorFlow. At the heart of Smaragdine is a novel white-box methodology of energy accounting: Smaragdine is aware of the internal structure of the DL program, which we call tensor-aware energy accounting. With Smaragdine, the energy consumption of a DL program can be broken down into units aligned with its logical hierarchical decomposition structure. We apply Smaragdine for understanding the energy behavior of BERT, one of the most widely used language models. Layer-by-layer and tensor-by-tensor, Smaragdine is capable of identifying the highest energy/power-consuming components of BERT. Furthermore, we conduct two case studies on how Smaragdine supports downstream toolchain building, one on the comparative energy impact of hyperparameter tuning of BERT, the other on the energy behavior evolution when BERT evolves to its next generation, ALBERT.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Portugal > Lisbon > Lisbon (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- (7 more...)
Online Kernel CUSUM for Change-Point Detection
We present a computationally efficient online kernel Cumulative Sum (CUSUM) method for change-point detection that utilizes the maximum over a set of kernel statistics to account for the unknown change-point location. Our approach exhibits increased sensitivity to small changes compared to existing kernel-based change-point detection methods, including Scan-B statistic, corresponding to a non-parametric Shewhart chart-type procedure. We provide accurate analytic approximations for two key performance metrics: the Average Run Length (ARL) and Expected Detection Delay (EDD), which enable us to establish an optimal window length to be on the order of the logarithm of ARL to ensure minimal power loss relative to an oracle procedure with infinite memory. Moreover, we introduce a recursive calculation procedure for detection statistics to ensure constant computational and memory complexity, which is essential for online implementation. Through extensive experiments on both simulated and real data, we demonstrate the competitive performance of our method and validate our theoretical results.
- North America > United States > Virginia (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Asia > Middle East > Jordan (0.04)
Neural Network-based CUSUM for Online Change-point Detection
Lee, Junghwan, Gong, Tingnan, Cheng, Xiuyuan, Xie, Yao
Change-point detection, detecting an abrupt change in the data distribution from sequential data, is a fundamental problem in statistics and machine learning. CUSUM is a popular statistical method for online change-point detection due to its efficiency from recursive computation and constant memory requirement, and it enjoys statistical optimality. CUSUM requires knowing the precise pre- and post-change distribution. However, post-change distribution is usually unknown a priori since it represents anomaly and novelty. When there is a model mismatch with actual data, classic CUSUM can perform poorly. While likelihood ratio-based methods encounter challenges in high dimensions, neural networks have become an emerging tool for change-point detection with computational efficiency and scalability. In this paper, we introduce a neural network CUSUM (NN-CUSUM) for online change-point detection. We also present a general theoretical condition when the trained neural networks can perform change-point detection and what losses can achieve our goal. We further extend our analysis by combining it with the Neural Tangent Kernel theory to establish learning guarantees for the standard performance metrics, including the average run length (ARL) and expected detection delay (EDD). The strong performance of NN-CUSUM is demonstrated in detecting change-point in high-dimensional data using both synthetic and real-world data.
- North America > United States > California > Orange County > Irvine (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Spectral CUSUM for Online Network Structure Change Detection
Zhang, Minghe, Xie, Liyan, Xie, Yao
Detecting abrupt changes in the community structure of a network from noisy observations is a fundamental problem in statistics and machine learning. This paper presents an online change detection algorithm called Spectral-CUSUM to detect unknown network structure changes through a generalized likelihood ratio statistic. We characterize the average run length (ARL) and the expected detection delay (EDD) of the Spectral-CUSUM procedure and prove its asymptotic optimality. Finally, we demonstrate the good performance of the Spectral-CUSUM procedure and compare it with several baseline methods using simulations and real data examples on seismic event detection using sensor network data.
- North America > United States (0.46)
- Asia > China (0.28)
Optimal Sub-sampling to Boost Power of Kernel Sequential Change-point Detection
We present a novel scheme to boost detection power for kernel maximum mean discrepancy based sequential change-point detection procedures. Our proposed scheme features an optimal sub-sampling of the history data before the detection procedure, in order to tackle the power loss incurred by the random sub-sample from the enormous history data. We apply our proposed scheme to both Scan $B$ and Kernel Cumulative Sum (CUSUM) procedures, and improved performance is observed from extensive numerical experiments.
Repulsive Deep Ensembles are Bayesian
D'Angelo, Francesco, Fortuin, Vincent
Deep ensembles have recently gained popularity in the deep learning community for their conceptual simplicity and efficiency. However, maintaining functional diversity between ensemble members that are independently trained with gradient descent is challenging. This can lead to pathologies when adding more ensemble members, such as a saturation of the ensemble performance, which converges to the performance of a single model. Moreover, this does not only affect the quality of its predictions, but even more so the uncertainty estimates of the ensemble, and thus its performance on out-of-distribution data. We hypothesize that this limitation can be overcome by discouraging different ensemble members from collapsing to the same function. To this end, we introduce a kernelized repulsive term in the update rule of the deep ensembles. We show that this simple modification not only enforces and maintains diversity among the members but, even more importantly, transforms the maximum a posteriori inference into proper Bayesian inference. Namely, we show that the training dynamics of our proposed repulsive ensembles follow a Wasserstein gradient flow of the KL divergence with the true posterior. We study repulsive terms in weight and function space and empirically compare their performance to standard ensembles and Bayesian baselines on synthetic and real-world prediction tasks.