 Large Language Model


Demystifying Low-Rank Knowledge Distillation in Large Language Models: Convergence, Generalization, and Information-Theoretic Guarantees

arXiv.org Machine Learning

Knowledge distillation has emerged as a powerful technique for compressing large language models (LLMs) into efficient, deployable architectures while preserving their advanced capabilities. Recent advances in low-rank knowledge distillation, particularly methods like Low-Rank Clone (LRC), have demonstrated remarkable empirical success, achieving comparable performance to full-parameter distillation with significantly reduced training data and computational overhead. However, the theoretical foundations underlying these methods remain poorly understood. In this paper, we establish a rigorous theoretical framework for low-rank knowledge distillation in language models. We prove that under mild assumptions, low-rank projection preserves the optimization dynamics, yielding explicit convergence rates of $O(1/\sqrt{T})$. We derive generalization bounds that characterize the fundamental trade-off between model compression and generalization capability, showing that the generalization error scales with the rank parameter as $O(r(m+n)/\sqrt{n})$. Furthermore, we provide an information-theoretic analysis of the activation cloning mechanism, revealing its role in maximizing the mutual information between the teacher's and student's intermediate representations. Our theoretical results offer principled guidelines for rank selection, mathematically suggesting an optimal rank $r^* = O(\sqrt{n})$ where $n$ is the sample size. Experimental validation on standard language modeling benchmarks confirms our theoretical predictions, demonstrating that the empirical convergence, rank scaling, and generalization behaviors align closely with our bounds.
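One possible reading of the $r(m+n)$ factor in the bound above, offered as an assumption rather than as the paper's derivation, is that it matches the parameter count of a rank-$r$ factorization of an $m \times n$ weight matrix. The minimal sketch below only does that arithmetic. The function names are hypothetical, and `rows`/`cols` stand in for the matrix dimensions to avoid confusion with the sample size $n$.

```python
def full_params(rows, cols):
    """Parameter count of a dense rows x cols weight matrix."""
    return rows * cols


def low_rank_params(rows, cols, rank):
    """Parameter count of a factorization W ~ A @ B,
    with A of shape (rows, rank) and B of shape (rank, cols)."""
    return rank * (rows + cols)


def compression_ratio(rows, cols, rank):
    """Fraction of the dense parameter count kept by the factorization."""
    return low_rank_params(rows, cols, rank) / full_params(rows, cols)


if __name__ == "__main__":
    # Illustrative sizes only; not taken from the paper.
    print(low_rank_params(4096, 4096, 64))
    print(compression_ratio(4096, 4096, 64))
```

The ratio falls linearly in the rank, which is the compression side of the trade-off the abstract describes. The generalization side depends on constants the abstract does not state.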


OpenAI shutters AI video generator Sora in abrupt announcement

The Guardian

Tech firm 'says goodbye' to Sora, made publicly available in 2024, just six months after its launch of a stand-alone app. In an abrupt announcement on Tuesday, OpenAI said it was "saying goodbye" to its AI video generator Sora. The move comes just six months after the company's splashy launch of a stand-alone app with which people could make and share hyper-realistic AI videos in a scrolling social feed. "To everyone who created with Sora, shared it, and built community around it: thank you," the company wrote in a post on X. "What you made with Sora mattered, and we know this news is disappointing." OpenAI first made Sora publicly available in late 2024, but it wasn't until the company launched Sora 2 and its stand-alone app last September that the video generator reached mainstream attention.


Understanding Behavior Cloning with Action Quantization

arXiv.org Machine Learning

Behavior cloning is a fundamental paradigm in machine learning, enabling policy learning from expert demonstrations across robotics, autonomous driving, and generative models. Autoregressive models like transformers have proven remarkably effective, from large language models (LLMs) to vision-language-action systems (VLAs). However, applying autoregressive models to continuous control requires discretizing actions through quantization, a practice widely adopted yet poorly understood theoretically. This paper provides theoretical foundations for this practice. We analyze how quantization error propagates along the horizon and interacts with statistical sample complexity. We show that behavior cloning with quantized actions and log-loss achieves optimal sample complexity, matching existing lower bounds, and incurs only polynomial horizon dependence on quantization error, provided the dynamics are stable and the policy satisfies a probabilistic smoothness condition. We further characterize when different quantization schemes satisfy or violate these requirements, and propose a model-based augmentation that provably improves the error bound without requiring policy smoothness. Finally, we establish fundamental limits that jointly capture the effects of quantization error and statistical complexity.
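To make the notion of action quantization concrete, here is a minimal sketch of uniform binning for a scalar action. It is an illustration only: the abstract does not say which quantization schemes the paper analyzes, and the function names and the choice of uniform bins are assumptions.

```python
def quantize(action, low, high, num_bins):
    """Map a continuous scalar action in [low, high] to a bin index."""
    clipped = min(max(action, low), high)
    width = (high - low) / num_bins
    index = int((clipped - low) / width)
    # An action equal to `high` would land at index num_bins; fold it
    # into the last bin.
    return min(index, num_bins - 1)


def dequantize(index, low, high, num_bins):
    """Return the center of the given bin as the reconstructed action."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


if __name__ == "__main__":
    low, high, num_bins = -1.0, 1.0, 8
    for a in (-1.0, -0.3, 0.0, 0.3, 1.0):
        token = quantize(a, low, high, num_bins)
        print(a, token, dequantize(token, low, high, num_bins))
```

With uniform bins, the per-step reconstruction error is at most half a bin width. How that per-step error compounds along the horizon is the question the paper studies.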


User Preference Modeling for Conversational LLM Agents: Weak Rewards from Retrieval-Augmented Interaction

arXiv.org Machine Learning

Large language models are increasingly used as personal assistants, yet most lack a persistent user model, forcing users to repeatedly restate preferences across sessions. We propose Vector-Adapted Retrieval Scoring (VARS), a pipeline-agnostic, frozen-backbone framework that represents each user with long-term and short-term vectors in a shared preference space and uses these vectors to bias retrieval scoring over structured preference memory. The vectors are updated online from weak scalar rewards from users' feedback, enabling personalization without per-user fine-tuning. We evaluate on MultiSessionCollab, an online multi-session collaboration benchmark with rich user preference profiles, across math and code tasks. Under frozen backbones, the main benefit of user-aware retrieval is improved interaction efficiency rather than large gains in raw task accuracy: our full VARS agent achieves the strongest overall performance, matches a strong Reflection baseline in task success, and reduces timeout rate and user effort. The learned long-term vectors also align with cross-user preference overlap, while short-term vectors capture session-specific adaptation, supporting the interpretability of the dual-vector design. Code, model, and data are available at https://github.com/YurenHao0426/VARS.
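The general idea of biasing retrieval scores with user vectors and updating those vectors from scalar rewards can be sketched as follows. This is not the VARS implementation: the additive scoring form, the reward-weighted update rule, and every name and parameter (`score`, `update`, `bias_weight`, `lr`) are assumptions made for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score(query, item, long_term, short_term, bias_weight=0.5):
    """Base query-item similarity plus a user-preference bias (assumed form)."""
    user = [l + s for l, s in zip(long_term, short_term)]
    return dot(query, item) + bias_weight * dot(user, item)


def update(vector, item, reward, lr):
    """Move a user vector toward (reward > 0) or away from (reward < 0)
    the embedding of the retrieved item. Assumed update rule."""
    return [v + lr * reward * x for v, x in zip(vector, item)]


if __name__ == "__main__":
    query = [1.0, 1.0]
    item_a, item_b = [1.0, 0.0], [0.0, 1.0]
    long_term, short_term = [0.0, 0.0], [0.0, 0.0]
    # Positive feedback after item_a was retrieved in this session.
    short_term = update(short_term, item_a, reward=1.0, lr=0.5)
    print(score(query, item_a, long_term, short_term))
    print(score(query, item_b, long_term, short_term))
```

In this toy setup both items start with the same base similarity, so the ranking after the update is decided entirely by the user vectors. The backbone that produces the embeddings is never touched, which corresponds to the frozen-backbone setting the abstract describes.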


Generalized Discrete Diffusion from Snapshots

arXiv.org Machine Learning

We introduce Generalized Discrete Diffusion from Snapshots (GDDS), a unified framework for discrete diffusion modeling that supports arbitrary noising processes over large discrete state spaces. Our formulation encompasses all existing discrete diffusion approaches, while allowing significantly greater flexibility in the choice of corruption dynamics. The forward noising process relies on uniformization and enables fast arbitrary corruption. For the reverse process, we derive a simple evidence lower bound (ELBO) based on snapshot latents, instead of the entire noising path, that allows efficient training of standard generative modeling architectures with clear probabilistic interpretation. Our experiments on large-vocabulary discrete generation tasks suggest that the proposed framework outperforms existing discrete diffusion methods in terms of training efficiency and generation quality, and beats autoregressive models for the first time at this scale. We provide the code along with a blog post on the project page: https://oussamazekri.fr/gdds.


Meet the Gods of AI Warfare

WIRED

In its early days, the AI initiative known as Project Maven had its fair share of skeptics at the Pentagon. Today, many of them are true believers. The rise of AI warfare speaks to the biggest moral and practical question there is: Who--or what--gets to decide to take a human life? And who bears that cost? In 2018, more than 3,000 Google workers protested the company's involvement in "the business of war" after finding out the company was part of Project Maven, then a nascent Pentagon effort to use computer vision to rifle through copious video footage taken in America's overseas drone wars. They feared Project Maven's AI could one day be used for lethal targeting. In my yearslong effort to uncover the full story of Project Maven for my book, I learned that is exactly what happened, and that the undertaking was just as controversial inside the Pentagon. Today, the tool known as Maven Smart System is being used in US operations against Iran. How the US military's top brass moved from skepticism about the use of AI in war to true belief has a lot to do with a Marine colonel named Drew Cukor. In early September 2024, during the cocktail hour at a private retreat for tech investors and defense leaders, Vice Admiral Frank "Trey" Whitworth found his way to Drew Cukor. Now Project Maven's founding leader and his skeptical successor were standing face-to-face. Three years earlier, Whitworth had been the Pentagon's top military official for intelligence, advising the chairman of the Joint Chiefs of Staff and running one of the most sensitive and potentially lethal parts of any military process: targeting.


The AI Race Is Pressuring Utilities to Squeeze More From Europe's Power Grids

WIRED

As data center developers queue up to connect to power grids across Europe, network operators are experimenting with novel ways of clearing room for them. European countries are racing to bring new data centers online as AI labs across the globe continue to demand more compute. The primary limiting factor is energy--and specifically, the ability to move it. Though Europe is on track to generate enough energy, utilities experts say, grid operators broadly lack the infrastructure needed to transport it to where it needs to go. That's throttling grid capacity and, by extension, the number of new power-hungry data centers that can connect without risking blackouts.


Deep Autocorrelation Modeling for Time-Series Forecasting: Progress and Prospects

arXiv.org Machine Learning

Autocorrelation is a defining characteristic of time-series data, where each observation is statistically dependent on its predecessors. In the context of deep time-series forecasting, autocorrelation arises in both the input history and the label sequences, presenting two central research challenges: (1) designing neural architectures that model autocorrelation in history sequences, and (2) devising learning objectives that model autocorrelation in label sequences. Recent studies have made strides in tackling these challenges, but a systematic survey examining both aspects remains lacking. To bridge this gap, this paper provides a comprehensive review of deep time-series forecasting from the perspective of autocorrelation modeling. In contrast to existing surveys, this work makes two distinctive contributions. First, it proposes a novel taxonomy that encompasses recent literature on both model architectures and learning objectives -- whereas prior surveys neglect or inadequately discuss the latter aspect. Second, it offers a thorough analysis of the motivations, insights, and progression of the surveyed literature from a unified, autocorrelation-centric perspective, providing a holistic overview of the evolution of deep time-series forecasting. The full list of papers and resources is available at https://github.com/Master-PLC/Awesome-TSF-Papers.


BAKU: An Efficient Transformer for Multi-Task Policy Learning

Neural Information Processing Systems

Training generalist agents capable of solving diverse tasks is challenging, often requiring large datasets of expert demonstrations. This is particularly problematic in robotics, where each data point requires physical execution of actions in the real world. Thus, there is a pressing need for architectures that can effectively leverage the available training data. In this work, we present BAKU, a simple transformer architecture that enables efficient learning of multi-task robot policies. BAKU builds upon recent advancements in offline imitation learning and meticulously combines observation trunks, action chunking, multi-sensory observations, and action heads to substantially improve upon prior work.


CLUES: Collaborative Private-domain High-quality Data Selection for LLMs via Training Dynamics

Neural Information Processing Systems

Recent research has highlighted the importance of data quality in scaling large language models (LLMs). However, automated data quality control faces unique challenges in collaborative settings where data cannot be shared directly between silos. To tackle this issue, this paper proposes a novel data quality control technique based on the influence of data on the training dynamics of LLMs, under the premise that high-quality data are more likely to have training dynamics similar to those of an anchor dataset. We then leverage this influence to select high-quality data from different private domains, with centralized model updates on the server side performed collaboratively through either model merging or federated learning. As the data quality indicator, we compute the per-sample gradients with respect to the private data and the anchor dataset, and use the trace of the accumulated inner products as a measurement of data quality. In addition, we develop a quality control evaluation tailored for collaborative settings with heterogeneous medical domain data. Experiments show that training on the high-quality data selected by our method can often outperform other data selection methods for collaborative fine-tuning of LLMs, across diverse private domain datasets, in medical, multilingual and financial settings.
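The gradient inner-product idea behind the quality indicator can be illustrated on a toy model. The sketch below is not the CLUES implementation. It assumes a linear model with squared loss and a single training step, and it compares each private sample's gradient with the mean anchor gradient. All function names are hypothetical, and the paper's accumulation over training steps and its use of the trace for matrix-shaped gradients are reduced here to one flat inner product.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w for one sample."""
    residual = dot(w, x) - y
    return [residual * xi for xi in x]


def quality_scores(w, private, anchor):
    """Score each private sample by the inner product between its gradient
    and the mean gradient over the anchor set (assumed simplification)."""
    anchor_grads = [grad(w, x, y) for x, y in anchor]
    count = len(anchor_grads)
    mean_anchor = [sum(g[i] for g in anchor_grads) / count
                   for i in range(len(w))]
    return [dot(grad(w, x, y), mean_anchor) for x, y in private]


if __name__ == "__main__":
    w = [0.0, 0.0]
    # Toy anchor data following y = 2 * x[0].
    anchor = [([1.0, 0.0], 2.0), ([2.0, 0.0], 4.0)]
    # One private sample consistent with the anchor, one with a flipped label.
    private = [([1.0, 0.0], 2.0), ([1.0, 0.0], -2.0)]
    print(quality_scores(w, private, anchor))
```

A sample whose gradient points the same way as the anchor gradients gets a positive score, and a mislabeled one gets a negative score. Only these scalar scores depend on the anchor comparison, so raw private samples need not be shared in this sketch.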