Online Learning with Sublinear Best-Action Queries

Neural Information Processing Systems

In online learning, a decision maker repeatedly selects one of a set of actions, with the goal of minimizing the overall loss incurred. Following the recent line of research on algorithms endowed with additional predictive features, we revisit this problem by allowing the decision maker to acquire additional information on the actions to be selected. In particular, we study the power of \emph{best-action queries}, which reveal beforehand the identity of the best action at a given time step. In practice, predictive features may be expensive, so we allow the decision maker to issue at most $k$ such queries. We establish tight bounds on the performance any algorithm can achieve when given access to $k$ best-action queries for different types of feedback models. In particular, we prove that in the full feedback model, $k$ queries are enough to achieve an optimal regret of $\Theta(\min\{\sqrt{T}, \frac{T}{k}\})$. This finding highlights the significant multiplicative advantage in the regret rate achievable with even a modest (sublinear) number $k \in \Omega(\sqrt{T})$ of queries. Additionally, we study the challenging setting in which the only available feedback is obtained during the time steps corresponding to the $k$ best-action queries.
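
As an illustration of how such a budget might be spent, here is a minimal sketch (not the paper's algorithm) that interleaves $k$ best-action queries, placed at evenly spaced rounds, with a standard multiplicative-weights learner under full feedback; `best_action_oracle` is a hypothetical stand-in for the query mechanism.

```python
import numpy as np

def mw_with_queries(losses, k, best_action_oracle, eta=0.1, seed=0):
    """Toy full-feedback learner: spend k best-action queries at evenly
    spaced rounds and run multiplicative weights everywhere else.

    losses: (T, n) array of per-round, per-action losses in [0, 1].
    best_action_oracle: callable t -> index of the best action at round t
        (a hypothetical stand-in for the paper's query mechanism).
    """
    rng = np.random.default_rng(seed)
    T, n = losses.shape
    query_rounds = set(np.linspace(0, T - 1, num=min(k, T), dtype=int).tolist())
    w = np.ones(n)
    total_loss = 0.0
    for t in range(T):
        if t in query_rounds:
            a = best_action_oracle(t)         # query reveals the round's best action
        else:
            a = rng.choice(n, p=w / w.sum())  # sample from the MW distribution
        total_loss += losses[t, a]
        w *= np.exp(-eta * losses[t])         # full feedback: update every action
    return total_loss
```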


Can contrastive learning avoid shortcut solutions?

Neural Information Processing Systems

The generalization of representations learned via contrastive learning depends crucially on what features of the data are extracted. However, we observe that the contrastive loss does not always sufficiently guide which features are extracted, a behavior that can negatively impact the performance on downstream tasks via "shortcuts", i.e., by inadvertently suppressing important predictive features. We find that feature extraction is influenced by the difficulty of the so-called instance discrimination task (i.e., the task of discriminating pairs of similar points from pairs of dissimilar ones). Although harder pairs improve the representation of some features, the improvement comes at the cost of suppressing previously well-represented features. In response, we propose implicit feature modification (IFM), a method for altering positive and negative samples in order to guide contrastive models towards capturing a wider variety of predictive features. Empirically, we observe that IFM reduces feature suppression, and as a result improves performance on vision and medical imaging tasks.
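
To make the idea concrete, the following is a minimal sketch of an InfoNCE loss with IFM-style perturbations, based only on the description above: positive embeddings are pulled away from the anchor and negatives pushed toward it, making instance discrimination harder. It assumes L2-normalized embeddings; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def ifm_infonce(z, z_pos, z_neg, eps=0.1, tau=0.5):
    """Sketch of InfoNCE averaged over original and perturbed samples.

    z:     (B, d) anchor embeddings (assumed L2-normalized)
    z_pos: (B, d) positive embeddings
    z_neg: (B, m, d) negative embeddings
    """
    def infonce(zp, zn):
        pos = (z * zp).sum(-1, keepdim=True) / tau      # (B, 1) positive logits
        neg = torch.einsum('bd,bmd->bm', z, zn) / tau   # (B, m) negative logits
        logits = torch.cat([pos, neg], dim=1)
        labels = torch.zeros(len(z), dtype=torch.long, device=z.device)
        return F.cross_entropy(logits, labels)

    # Perturb along the anchor direction: harder positives, harder negatives.
    zp_hard = F.normalize(z_pos - eps * z, dim=-1)
    zn_hard = F.normalize(z_neg + eps * z.unsqueeze(1), dim=-1)
    return 0.5 * (infonce(z_pos, z_neg) + infonce(zp_hard, zn_hard))
```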


PSEO: Optimizing Post-hoc Stacking Ensemble Through Hyperparameter Tuning

Xu, Beicheng, Liu, Wei, Ding, Keyao, Lu, Yupeng, Cui, Bin

arXiv.org Artificial Intelligence

The Combined Algorithm Selection and Hyperparameter Optimization (CASH) problem is fundamental in Automated Machine Learning (AutoML). Inspired by the success of ensemble learning, recent AutoML systems construct post-hoc ensembles for final predictions rather than relying on the best single model. However, while most CASH methods conduct extensive searches for the optimal single model, they typically employ fixed strategies during the ensemble phase that fail to adapt to specific task characteristics. To tackle this issue, we propose PSEO, a framework for post-hoc stacking ensemble optimization. First, we conduct base model selection through binary quadratic programming, with a trade-off between diversity and performance. Furthermore, we introduce two mechanisms to fully realize the potential of multi-layer stacking. Finally, PSEO builds a hyperparameter space and searches for the optimal post-hoc ensemble strategy within it. Empirical results on 80 public datasets show that PSEO achieves the best average test rank (2.96) among 16 methods, including post-hoc designs in recent AutoML systems and state-of-the-art ensemble learning methods.
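
As a rough illustration of the selection step, the sketch below scores candidate subsets with a binary-quadratic-style objective (total validation performance minus a penalty on pairwise prediction similarity) by brute force; PSEO's actual formulation and solver may differ, and all names here are hypothetical.

```python
import numpy as np
from itertools import combinations

def select_base_models(perf, sim, n_select, lam=0.5):
    """Toy diversity/performance trade-off for base model selection.

    perf: (n,) validation scores; sim: (n, n) pairwise prediction similarity.
    Brute force for small n; a real system would use a BQP/QUBO solver.
    The constant diagonal of sim does not affect the argmax.
    """
    n = len(perf)
    best_score, best_subset = -np.inf, None
    for subset in combinations(range(n), n_select):
        s = np.array(subset)
        score = perf[s].sum() - lam * sim[np.ix_(s, s)].sum() / 2
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset
```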


Memory Retrieval and Consolidation in Large Language Models through Function Tokens

Zhang, Shaohua, Lin, Yuan, Li, Hang

arXiv.org Artificial Intelligence

The remarkable success of large language models (LLMs) stems from their ability to consolidate vast amounts of knowledge into memory during pre-training and to retrieve it from memory during inference, enabling advanced capabilities such as knowledge memorization, instruction following, and reasoning. However, the mechanisms of memory retrieval and consolidation in LLMs remain poorly understood. In this paper, we propose the function token hypothesis to explain the workings of LLMs: During inference, function tokens activate the most predictive features from context and govern next token prediction (memory retrieval). During pre-training, predicting the next tokens (usually content tokens) that follow function tokens increases the number of learned features of LLMs and updates the model parameters (memory consolidation). Function tokens here roughly correspond to function words in linguistics, including punctuation marks, articles, prepositions, and conjunctions, in contrast to content tokens. We provide extensive experimental evidence supporting this hypothesis. Using bipartite graph analysis, we show that a small number of function tokens activate the majority of features. Case studies further reveal how function tokens activate the most predictive features from context to direct next token prediction. We also find that during pre-training, the training loss is dominated by predicting the next content tokens following function tokens, which forces the function tokens to select the most predictive features from context.
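
A hedged sketch of the kind of bipartite tally such an analysis might use: given (token, feature) activation pairs, e.g., from a sparse autoencoder over hidden states, measure what share of distinct activated features is triggered by function tokens. The token list and pairing format are illustrative assumptions, not the paper's pipeline.

```python
# Rough proxy for function tokens: punctuation, articles, prepositions, conjunctions.
FUNCTION_TOKENS = {".", ",", "the", "a", "an", "of", "in", "and", "to", "but"}

def feature_activation_share(token_feature_pairs):
    """Tally a token->feature bipartite graph and report, per token type,
    the fraction of distinct activated features each type accounts for."""
    by_type = {"function": set(), "content": set()}
    for tok, feat in token_feature_pairs:
        kind = "function" if tok.lower() in FUNCTION_TOKENS else "content"
        by_type[kind].add(feat)
    total = len(by_type["function"] | by_type["content"])
    return {kind: len(feats) / max(total, 1) for kind, feats in by_type.items()}
```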


Age-Normalized HRV Features for Non-Invasive Glucose Prediction: A Pilot Sleep-Aware Machine Learning Study

Azam, Md Basit, Singh, Sarangthem Ibotombi

arXiv.org Artificial Intelligence

Non-invasive glucose monitoring remains a critical challenge in the management of diabetes. Heart rate variability (HRV) during sleep shows promise for glucose prediction; however, age-related autonomic changes significantly confound traditional HRV analyses. We analyzed 43 subjects with multi-modal data including sleep-stage-specific ECG, HRV features, and clinical measurements. A novel age-normalization technique was applied to the HRV features by dividing the raw values by age-scaled factors. BayesianRidge regression with 5-fold cross-validation was employed for log-glucose prediction. Age-normalized HRV features achieved R² = 0.161 (MAE = 0.182) for log-glucose prediction, a 25.6% improvement over non-normalized features (R² = 0.132). The top predictive features were hrv_rem_mean_rr_age_normalized (r = 0.443, p = 0.004), hrv_ds_mean_rr_age_normalized (r = 0.438, p = 0.005), and diastolic blood pressure (r = 0.437, p = 0.005). Systematic ablation studies confirmed age normalization as the critical component, with sleep-stage-specific features providing additional predictive value. Age-normalized HRV features significantly enhance glucose prediction accuracy compared with traditional approaches. This sleep-aware methodology addresses fundamental limitations in autonomic function assessment and suggests preliminary feasibility for non-invasive glucose monitoring applications. However, these results require validation in larger cohorts before clinical consideration.
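
A minimal sketch of the described pipeline using scikit-learn, with the caveat that the abstract does not specify the exact age-scaling factor; `ref_age` below is a hypothetical choice.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score

def age_normalize(hrv, age, ref_age=40.0):
    """Toy age normalization: divide raw HRV features by an age-scaled
    factor (the paper's exact scaling is not given in the abstract)."""
    return hrv / (age[:, None] / ref_age)

def evaluate(X_hrv, age, y_log_glucose):
    """X_hrv: (n_subjects, n_features) sleep-stage HRV features,
    age: (n_subjects,) ages, y_log_glucose: (n_subjects,) targets.
    Returns the mean 5-fold cross-validated R^2, as in the study design."""
    X = age_normalize(X_hrv, age)
    scores = cross_val_score(BayesianRidge(), X, y_log_glucose,
                             cv=5, scoring="r2")
    return scores.mean()
```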


A Closer Look at Multimodal Representation Collapse

Chaudhuri, Abhra, Dutta, Anjan, Bui, Tu, Georgescu, Serban

arXiv.org Artificial Intelligence

We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another, effectively masking out positive contributions from the predictive features of the former modality and leading to its collapse. We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, denoising the fusion-head outputs without negatively impacting the predictive features from either modality. Based on the above findings, we propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities. Extensive experiments on multiple multimodal benchmarks validate our theoretical claims. Project page: https://abhrac.github.io/mmcollapse/.
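
For orientation, here is a toy sketch of the setting: a concatenation-based fusion head plus the standard distillation objective that cross-modal KD builds on. The paper's actual mechanism (rank analysis, explicit basis reallocation) is more involved and is not reproduced here; all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Toy two-modality fusion: concatenate unimodal embeddings, then classify.
    A shared linear layer like this is where noisy and predictive features
    from different modalities can become entangled."""
    def __init__(self, d_a, d_b, n_classes):
        super().__init__()
        self.fc = nn.Linear(d_a + d_b, n_classes)

    def forward(self, za, zb):
        return self.fc(torch.cat([za, zb], dim=-1))

def cross_modal_kd_loss(student_logits, teacher_logits, tau=2.0):
    """Standard temperature-scaled distillation loss; in the cross-modal
    setting the teacher carries knowledge from another modality."""
    t = F.softmax(teacher_logits / tau, dim=-1)
    s = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau ** 2
```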


Is it me, or is A larger than B: Uncovering the determinants of relational cognitive dissonance resolution

Barak, Tomer, Loewenstein, Yonatan

arXiv.org Artificial Intelligence

This study explores the computational mechanisms underlying the resolution of cognitive dissonances. We focus on scenarios in which an observation violates the expected relationship between objects. For instance, an agent expects object A to be smaller than B in some feature space but observes the opposite. One solution is to adjust the expected relationship according to the new observation and change the expectation to A being larger than B. An alternative solution would be to adapt the representation of A and B in the feature space such that in the new representation, the relationship that A is smaller than B is maintained. While both pathways resolve the dissonance, they generalize differently to different tasks. Using Artificial Neural Networks (ANNs) capable of relational learning, we demonstrate the existence of these two pathways and show that the chosen pathway depends on the dissonance magnitude. Large dissonances alter the representation of the objects, while small dissonances lead to adjustments in the expected relationships. We show that this effect arises from the inherently different learning dynamics of relationships and representations and study the implications.
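
A toy sketch of the two pathways, under the assumption of a linear encoder and a learned comparator: a violated expectation produces gradients in both the representation (encoder) and the expected relationship (comparator), and which module absorbs more of the error is what, per the study, depends on the dissonance magnitude.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(8, 1, bias=False)     # representation pathway
comparator = nn.Linear(1, 1, bias=False)  # relationship pathway

def dissonance_step(a, b, observed_sign, lr=0.01):
    """One gradient step on a violated expectation about A vs B.
    a, b: (batch, 8) object inputs; observed_sign: +1.0 if A is observed
    larger than B (contrary to expectation), -1.0 otherwise."""
    fa, fb = encoder(a), encoder(b)
    pred = comparator(fa - fb)            # predicted signed difference
    loss = (pred - observed_sign).pow(2).mean()
    loss.backward()
    with torch.no_grad():
        # Both pathways receive gradient; their relative updates determine
        # whether the representation or the expected relation changes.
        for p in list(encoder.parameters()) + list(comparator.parameters()):
            p -= lr * p.grad
            p.grad = None
    return loss.item()
```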

