Efficient Knowledge Distillation from Model Checkpoints
Knowledge distillation is an effective approach for learning compact models (students) under the supervision of large, strong models (teachers). Since there is empirically a strong correlation between teacher and student performance, it is commonly believed that a high-performing teacher is preferable. Consequently, practitioners tend to use a well-trained network, or an ensemble of them, as the teacher. In this paper, we observe that an intermediate model, i.e., a checkpoint from the middle of the training procedure, often serves as a better teacher than the fully converged model, even though the former has much lower accuracy. More surprisingly, a weak snapshot ensemble of several intermediate models from the same training trajectory can outperform a strong ensemble of independently trained, fully converged models when used as teachers. We show that this phenomenon can be partially explained by the information bottleneck principle: the feature representations of intermediate models can have higher mutual information with the input, and thus contain more "dark knowledge" for effective distillation. We further propose an optimal intermediate teacher selection algorithm based on maximizing the total task-related mutual information. Experiments verify its effectiveness and applicability.
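The checkpoint-as-teacher setup above changes only which model supplies the soft targets; the distillation objective itself stays standard. A minimal sketch of that objective (Hinton-style temperature-softened KL divergence in NumPy; the temperature value is an illustrative choice, not taken from the paper):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in the classic distillation formulation. The
    teacher logits could come from an intermediate checkpoint rather
    than the fully converged model."""
    p = softmax(teacher_logits, T)  # soft targets from the teacher
    q = softmax(student_logits, T)
    return float(T * T * np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))
```

A lower-accuracy checkpoint can still produce softer, higher-entropy targets `p`, which is one reading of the "dark knowledge" argument in the abstract.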
Towards provable probabilistic safety for scalable embodied AI systems
He, Linxuan, Jia, Qing-Shan, Li, Ang, Sang, Hongyan, Wang, Ling, Lu, Jiwen, Zhang, Tao, Zhou, Jie, Zhang, Yi, Wang, Yisen, Wei, Peng, Wang, Zhongyuan, Liu, Henry X., Feng, Shuo
Embodied AI systems, comprising AI models and physical plants, are increasingly prevalent across various applications. Due to the rarity of system failures, ensuring their safety in complex operating environments remains a major challenge, which severely hinders their large-scale deployment in safety-critical domains, such as autonomous vehicles, medical devices, and robotics. While achieving provable deterministic safety--verifying system safety across all possible scenarios--remains theoretically ideal, the rarity and complexity of corner cases make this approach impractical for scalable embodied AI systems. Instead, empirical safety evaluation is employed as an alternative, but the absence of provable guarantees imposes significant limitations. To address these issues, we argue for a paradigm shift to provable probabilistic safety that integrates provable guarantees with progressive achievement toward a probabilistic safety boundary on overall system performance. The new paradigm better leverages statistical methods to enhance feasibility and scalability, and a well-defined probabilistic safety boundary enables embodied AI systems to be deployed at scale. In this Perspective, we outline a roadmap for provable probabilistic safety, along with corresponding challenges and potential solutions. By bridging the gap between theoretical safety assurance and practical deployment, this Perspective offers a pathway toward safer, large-scale adoption of embodied AI systems in safety-critical applications.
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > Michigan (0.04)
- North America > United States > Iowa (0.04)
- (5 more...)
- Information Technology > Security & Privacy (1.00)
- Transportation > Air (0.93)
- Health & Medicine (0.68)
- (2 more...)
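The shift the abstract argues for, from verifying every scenario to certifying a probabilistic safety boundary, rests on standard statistical concentration bounds. As a hedged illustration of the kind of guarantee involved (not the authors' method), a one-sided Hoeffding bound turns Monte Carlo trial outcomes into a provable upper bound on the true failure probability:

```python
import math

def failure_rate_upper_bound(failures, trials, delta=1e-3):
    """One-sided Hoeffding upper confidence bound: with probability
    at least 1 - delta, the true failure probability p satisfies
    p <= p_hat + sqrt(ln(1/delta) / (2 * trials))."""
    p_hat = failures / trials
    return min(1.0, p_hat + math.sqrt(math.log(1.0 / delta) / (2.0 * trials)))
```

For example, observing zero failures in a million trials yields a provable (with confidence 1 - delta) bound just under 0.2%, which is the flavor of "provable probabilistic safety" the abstract contrasts with exhaustive deterministic verification.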
Alice: Proactive Learning with Teacher's Demonstrations for Weak-to-Strong Generalization
Wu, Shujin, Qian, Cheng, Fung, Yi R., Liang, Paul Pu, Ji, Heng
The growing capabilities of large language models (LLMs) present a key challenge for maintaining effective human oversight. Weak-to-strong generalization (W2SG) offers a promising framework for supervising increasingly capable LLMs using weaker ones. Traditional W2SG methods rely on passive learning, where a weak teacher provides noisy demonstrations to train a strong student. This hinders students from employing their own knowledge during training and from reaching their full potential. In this work, we introduce Alice (proActive Learning wIth teaCher's dEmonstrations), a framework that leverages the complementary knowledge of teacher and student to enhance the learning process. We probe the teacher model's knowledge base by eliciting its uncertainty, and then use these insights together with the teacher's responses as demonstrations to guide student models in self-generating improved responses for supervision. In addition, for situations with significant capability gaps between teacher and student models, we introduce cascade Alice, which employs a hierarchical training approach where weak teachers initially supervise intermediate models, which then guide stronger models in sequence. Experimental results demonstrate that our method significantly enhances W2SG performance, yielding substantial improvements over the original W2SG in three key tasks: knowledge-based reasoning (+4.0%), mathematical reasoning (+22.62%), and logical reasoning (+12.11%). This highlights the effectiveness of our new W2SG paradigm, which enables more robust knowledge transfer and supervision outcomes.
- North America > United States > California (0.14)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Overview (0.66)
- Research Report > New Finding (0.48)
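The core idea of combining teacher demonstrations with elicited uncertainty can be caricatured as confidence-gated supervision: keep the weak teacher's demonstration where it is confident, and let the stronger student self-generate elsewhere. The following toy sketch is illustrative only; the threshold `tau` and the selection rule are hypothetical and not the paper's actual pipeline:

```python
def select_supervision(teacher_answers, teacher_conf, student_answers, tau=0.7):
    """For each example, use the weak teacher's demonstration only when
    the teacher's elicited confidence reaches the (hypothetical) threshold
    tau; otherwise fall back to the student's self-generated response.
    The resulting mixed set serves as the supervision targets."""
    return [t if c >= tau else s
            for t, c, s in zip(teacher_answers, teacher_conf, student_answers)]
```

The cascade variant would apply this repeatedly along a chain of increasingly capable models, with each stage's output supervising the next.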
Med-gte-hybrid: A contextual embedding transformer model for extracting actionable information from clinical texts
Kumar, Aditya, Rauch, Simon, Cypko, Mario, Amft, Oliver
Besides structured data, patient care information in Electronic Health Records (EHRs) comprises data in unstructured form (i.e., clinical notes). While EHR information may overlap between structured and unstructured sections, some crucial information remains only in the unstructured sections [1-5]. Relevant information for clinical decisions can easily be overlooked when dealing with large amounts of notes. Previous investigations have already found that clinical text alone can often provide sufficient information for decisions [2, 6, 7]. However, extracting actionable information from clinical text remains difficult due to language variability, inconsistent use of medical terminology, and lack of standardised formatting [8]. In addition, the lack of structure introduces ambiguities and inconsistencies in the text. Thus, it is still difficult to accurately interpret and analyse clinical notes for decision support systems. As a result, the development of advanced natural language processing (NLP) models that can extract and represent information from clinical narratives has become a focal point of research in digital medicine [2, 9, 10]. Contextual embedding models have emerged as a powerful tool for transforming unstructured texts into dense vector representations that encode rich semantic information [11, 12].
Improving the Efficiency of Self-Supervised Adversarial Training through Latent Clustering-Based Selection
Ghosh, Somrita, Xu, Yuelin, Zhang, Xiao
Compared with standard learning, adversarially robust learning is widely recognized to demand significantly more training examples. Recent works propose self-supervised adversarial training (SSAT) with external or synthetically generated unlabeled data to enhance model robustness. However, SSAT requires a substantial amount of extra unlabeled data, significantly increasing memory usage and training time. To address these challenges, we propose novel methods to strategically select a small subset of the unlabeled data that is essential for SSAT and robustness improvement. Our selection prioritizes data points near the model's decision boundary using latent clustering-based techniques, efficiently identifying a critical subset of unlabeled data with a higher concentration of boundary-adjacent points. While focusing on near-boundary data, our methods are designed to maintain a balanced ratio between boundary and non-boundary data points to avoid overfitting. Our experiments on image benchmarks show that integrating our selection strategies into self-supervised adversarial training substantially reduces memory and computational requirements while achieving high model robustness. In particular, our latent clustering-based selection method with k-means is the most effective, achieving nearly identical test-time robust accuracies with 5 to 10 times less external or generated unlabeled data. Additionally, we validate the generalizability of our approach across various application scenarios, including a real-world medical dataset for COVID-19 chest X-ray classification.
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Spain > Andalusia > Granada Province > Granada (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.86)
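A rough sketch of the latent clustering-based selection idea, using plain k-means and a cluster-margin criterion as a proxy for boundary adjacency. The paper operates on learned latent representations and its exact criterion may differ; `frac` and `boundary_ratio` here are illustrative parameters:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means on rows of X; returns the centroids."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - C[None], axis=2)  # (n, k) distances
        lab = d.argmin(axis=1)
        for j in range(k):
            if (lab == j).any():
                C[j] = X[lab == j].mean(axis=0)
    return C

def select_boundary_subset(X, k=3, frac=0.3, boundary_ratio=0.7, seed=0):
    """Rank unlabeled points by cluster margin: distance to the 2nd-nearest
    centroid minus distance to the nearest. Small margins sit between
    clusters, a latent-space proxy for decision-boundary adjacency. A mix
    of boundary and non-boundary points is kept, echoing the balanced-ratio
    idea from the abstract."""
    C = kmeans(X, k, seed=seed)
    d = np.sort(np.linalg.norm(X[:, None] - C[None], axis=2), axis=1)
    margin = d[:, 1] - d[:, 0]
    order = np.argsort(margin)            # smallest margin first
    m = max(1, int(frac * len(X)))        # total subset size
    nb = int(boundary_ratio * m)          # boundary-adjacent share
    return np.concatenate([order[:nb], order[::-1][:m - nb]])
```

In the actual method, X would be latent features from the model being trained, and the selected subset would feed SSAT in place of the full unlabeled pool.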
Towards efficient compression and communication for prototype-based decentralized learning
Fernández-Piñeiro, Pablo, Ferández-Veiga, Manuel, Díaz-Redondo, Rebeca P., Fernández-Vilas, Ana, González-Soto, Martín
In prototype-based federated learning, the exchange of model parameters between clients and the master server is replaced by the transmission of prototypes, or quantized versions of the data samples, to the aggregation server. A fully decentralized deployment of prototype-based learning, without a central aggregator of prototypes, is more robust to network failures and reacts faster to changes in the statistical distribution of the data, suggesting potential advantages and quick adaptation in dynamic learning tasks, e.g., when the data sources are IoT devices or when the data is non-iid. In this paper, we consider the problem of designing a communication-efficient decentralized learning system based on prototypes. We address the challenge of prototype redundancy by leveraging a twofold data compression technique: sending update messages only if the prototypes are information-theoretically useful (via the Jensen-Shannon distance), and clustering the prototypes to compress the update messages used in the gossip protocol. We also use parallel instead of sequential gossiping, and present an analysis of its age of information (AoI). Our experimental results show that, with these improvements, the communication load can be substantially reduced without decreasing the convergence rate of the learning algorithm. Federated Learning (FL) [1], [2], [3] and Decentralized Federated Learning (DFL) [4], [5] provide good approaches for distributed machine learning systems whose main focus is the minimization of a global loss function using different versions of a model created by multiple clients. These approaches have been extensively studied in the literature and traditionally applied to processing private data in areas such as health and banking. In this paper, in contrast to these well-known approaches, we focus on the analysis and implementation of a decentralized machine learning system based on prototypes.
Our choice of prototype-based algorithms is motivated by the advantages of prototypes as compact representations of the data, capturing the essential features and patterns within the dataset.
- Europe > Denmark > Capital Region > Kongens Lyngby (0.14)
- Europe > Spain (0.04)
- North America > United States (0.04)
- Asia > Kazakhstan > West Kazakhstan Region (0.04)
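The Jensen-Shannon gating idea, i.e., transmitting a prototype update only when it is information-theoretically useful, can be sketched as follows. The `threshold` value is hypothetical, and the paper's exact criterion may differ:

```python
import numpy as np

def js_distance(p, q, eps=1e-12):
    """Jensen-Shannon distance (square root of the JS divergence with
    base-2 logs, so values lie in [0, 1]) between two discrete
    distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * (np.log2(a + eps) - np.log2(b + eps)))
    return float(np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m)))

def should_gossip(old_hist, new_hist, threshold=0.1):
    """Emit an update message only if the prototype's distribution moved
    far enough in JS distance to be worth transmitting."""
    return js_distance(old_hist, new_hist) > threshold
```

Gating updates this way suppresses redundant gossip traffic while still propagating prototypes that carry genuinely new information.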
Tighter Privacy Auditing of DP-SGD in the Hidden State Threat Model
Cebere, Tudor, Bellet, Aurélien, Papernot, Nicolas
Machine learning models can be trained with formal privacy guarantees via differentially private optimizers such as DP-SGD. In this work, we study such privacy guarantees when the adversary only accesses the final model, i.e., intermediate model updates are not released. In the existing literature, this "hidden state" threat model exhibits a significant gap between the lower bound provided by empirical privacy auditing and the theoretical upper bound provided by privacy accounting. To narrow this gap, we propose to audit this threat model with adversaries that craft a gradient sequence to maximize the privacy loss of the final model without accessing intermediate models. We demonstrate experimentally how this approach consistently outperforms prior attempts at auditing the hidden state model. When the crafted gradient is inserted at every optimization step, our results imply that releasing only the final model does not amplify privacy, providing a novel negative result. On the other hand, when the crafted gradient is not inserted at every step, we show strong evidence that a privacy amplification phenomenon emerges in the general non-convex setting (albeit weaker than in convex regimes), suggesting that existing privacy upper bounds can be improved.
- North America > Canada > Ontario > Toronto (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > France > Occitanie > Hérault > Montpellier (0.04)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.49)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
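For readers unfamiliar with the mechanism being audited, a generic DP-SGD step with an optional adversarially crafted (canary) gradient looks roughly as follows. This is a textbook sketch, not the paper's auditing procedure, and noise-scaling conventions vary across implementations:

```python
import numpy as np

def dp_sgd_step(w, per_example_grads, clip=1.0, noise_mult=1.0, lr=0.1,
                rng=None, crafted_grad=None):
    """One DP-SGD step: clip each per-example gradient to L2 norm `clip`,
    average, add Gaussian noise proportional to `noise_mult * clip`, and
    take a gradient step. An auditing adversary may insert a crafted
    (canary) gradient into the batch to probe the privacy loss of the
    final model."""
    if rng is None:
        rng = np.random.default_rng(0)
    grads = list(per_example_grads)
    if crafted_grad is not None:
        grads.append(crafted_grad)  # adversarial canary insertion
    clipped = [g * min(1.0, clip / (np.linalg.norm(g) + 1e-12)) for g in grads]
    mean = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip / len(grads), size=w.shape)
    return w - lr * (mean + noise)
```

The hidden-state question the abstract studies is whether an adversary who sees only the final `w`, after many such steps, learns less than one who sees every intermediate update.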