Long Tail



What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation

Neural Information Processing Systems

Deep learning algorithms are well known for fitting the training data very well, often including outliers and mislabeled data points. Such fitting requires memorization of training data labels, a phenomenon that has attracted significant research interest but has not yet been given a compelling explanation. A recent work of Feldman (2019) proposes a theoretical explanation for this phenomenon based on a combination of two insights. First, natural image and data distributions are (informally) known to be long-tailed, that is, they have a significant fraction of rare and atypical examples. Second, in a simple theoretical model, such memorization is necessary for achieving close-to-optimal generalization error when the data distribution is long-tailed.
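Feldman's label-memorization score for a training example is the drop in the trained model's chance of predicting that example's label when the example is removed from the training set. A minimal leave-one-out sketch of that definition follows; the paper itself estimates this quantity far more efficiently via influence estimation over subsampled models, and the `train_1nn` learner and toy data here are illustrative assumptions, not the paper's setup:

```python
def memorization_score(train_fn, dataset, i, trials=20):
    """Leave-one-out estimate of label memorization for example i:
    P[model trained on S predicts y_i] - P[model trained on S \ {i} predicts y_i].
    Averaging over trials matters for randomized learners."""
    x_i, y_i = dataset[i]
    with_i = sum(train_fn(dataset)(x_i) == y_i for _ in range(trials)) / trials
    without = sum(train_fn(dataset[:i] + dataset[i + 1:])(x_i) == y_i
                  for _ in range(trials)) / trials
    return with_i - without

# Toy learner: 1-nearest-neighbour on scalar inputs (it memorizes every point it sees).
def train_1nn(data):
    def predict(x):
        return min(data, key=lambda p: abs(p[0] - x))[1]
    return predict

data = [(0.0, 'a'), (0.1, 'a'), (5.0, 'b')]  # (5.0, 'b') is a singleton "tail" point
print(memorization_score(train_1nn, data, 2))  # → 1.0: correct only when memorized
print(memorization_score(train_1nn, data, 0))  # → 0.0: a typical point needs no memorization
```

The tail point scores 1.0 because removing it makes the model misclassify it, while the typical point scores 0.0 because its near-duplicate neighbour covers for it, which is the long-tail intuition in miniature.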


Beat the long tail: Distribution-Aware Speculative Decoding for RL Training

Shao, Zelei, Srivatsa, Vikranth, Srivastava, Sanjana, Wu, Qingyang, Ariyak, Alpay, Wu, Xiaoxia, Patel, Ameen, Wang, Jue, Liang, Percy, Dao, Tri, Zhang, Ce, Zhang, Yiying, Athiwaratkun, Ben, Xu, Chenfeng, Wang, Junxiong

arXiv.org Artificial Intelligence

Reinforcement learning (RL) post-training has become essential for aligning large language models (LLMs), yet its efficiency is increasingly constrained by the rollout phase, where long trajectories are generated token by token. We identify a major bottleneck: the long-tail distribution of rollout lengths, where a small fraction of long generations dominates wall-clock time, and a complementary opportunity: the availability of historical rollouts that reveal stable prompt-level patterns across training epochs. Motivated by these observations, we propose DAS, a Distribution-Aware Speculative decoding framework that accelerates RL rollouts without altering model outputs. DAS integrates two key ideas: an adaptive, nonparametric drafter built from recent rollouts using an incrementally maintained suffix tree, and a length-aware speculation policy that allocates more aggressive draft budgets to the long trajectories that dominate makespan. This design exploits rollout history to sustain acceptance while balancing base- and token-level costs during decoding. Experiments on math and code reasoning tasks show that DAS reduces rollout time by up to 50% while preserving identical training curves, demonstrating that distribution-aware speculative decoding can significantly accelerate RL post-training without compromising learning quality.
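The nonparametric-drafter idea can be illustrated with a toy fixed-length n-gram lookup over recent rollouts: index each context by the token that followed it, then propose the most frequent continuation as a draft. This is a simplified stand-in, not the paper's implementation, which maintains an incremental suffix tree and a length-aware budget policy; all names below are hypothetical:

```python
from collections import defaultdict

class NGramDrafter:
    """Toy nonparametric drafter: index recent rollouts by their
    last-n-token contexts and greedily propose the most frequent
    continuation as a speculative draft."""

    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(lambda: defaultdict(int))

    def add_rollout(self, tokens):
        # Record, for every length-n context, which token followed it.
        for j in range(len(tokens) - self.n):
            ctx = tuple(tokens[j:j + self.n])
            self.counts[ctx][tokens[j + self.n]] += 1

    def draft(self, context, budget=4):
        # Propose up to `budget` tokens; a longer budget would be
        # assigned to trajectories expected to be long.
        out = []
        ctx = tuple(context[-self.n:])
        while len(out) < budget:
            nxt = self.counts.get(ctx)
            if not nxt:
                break
            tok = max(nxt, key=nxt.get)
            out.append(tok)
            ctx = ctx[1:] + (tok,)
        return out

drafter = NGramDrafter(n=2)
drafter.add_rollout(["x", "y", "z", "w"])
proposal = drafter.draft(["x", "y"])  # → ["z", "w"]
```

The target model would then verify `proposal` in parallel, accepting the longest matching prefix, so drafts that exploit stable prompt-level patterns shorten exactly the long rollouts that dominate makespan.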


Hurdle-IMDL: An Imbalanced Learning Framework for Infrared Rainfall Retrieval

Zhang, Fangjian, Zhuge, Xiaoyong, Wang, Wenlan, Xiao, Haixia, Zhu, Yuying, Cheng, Siyang

arXiv.org Artificial Intelligence

Artificial intelligence has advanced quantitative remote sensing, yet its effectiveness is constrained by imbalanced label distributions. This imbalance leads conventionally trained models to favor common samples, which in turn degrades retrieval performance for rare ones. Rainfall retrieval exemplifies this issue, with performance particularly compromised for heavy rain. This study proposes the Hurdle-Inversion Model Debiasing Learning (Hurdle-IMDL) framework. Following a divide-and-conquer strategy, imbalance in the rain distribution is decomposed into two components: zero inflation, defined by the predominance of non-rain samples, and a long tail, defined by the disproportionate abundance of light-rain samples relative to heavy-rain ones. A hurdle model is adopted to handle the zero inflation, while IMDL is proposed to address the long tail by transforming the learning objective into an unbiased ideal inverse model. Comprehensive evaluation via statistical metrics and case studies of rainy weather in eastern China confirms Hurdle-IMDL's superiority over conventional, cost-sensitive, generative, and multi-task learning methods. Its key advances include effective mitigation of systematic underestimation and a marked improvement in the retrieval of heavy-to-extreme rain. IMDL offers a generalizable approach to imbalance in distributions of environmental variables, enabling enhanced retrieval of rare yet high-impact events.
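The hurdle decomposition itself can be sketched schematically: a first-stage classifier handles the rain/no-rain decision (zero inflation), and a second-stage regressor, trained only on rainy samples, predicts the amount, so long-tail debiasing can be applied to the second stage in isolation. This is a generic hurdle-model sketch under assumed names, not the paper's IMDL component:

```python
def hurdle_predict(classify_rain, regress_amount, features, threshold=0.5):
    """Two-stage hurdle prediction:
      stage 1: P(rain > 0 | features)      -- absorbs the zero inflation
      stage 2: E[amount | rain, features]  -- fit only on rainy samples,
               where the light-vs-heavy long tail can be handled separately."""
    if classify_rain(features) < threshold:
        return 0.0
    return regress_amount(features)

# Toy stand-ins for trained stage models:
heavy_case = hurdle_predict(lambda f: 0.9, lambda f: 12.5, None)  # → 12.5 (rain predicted)
dry_case = hurdle_predict(lambda f: 0.1, lambda f: 12.5, None)    # → 0.0 (hurdle not cleared)
```

Splitting the zeros out this way keeps the regressor from being dragged toward zero by the overwhelming mass of non-rain pixels.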


The Long Tail of the AWS Outage

WIRED

Experts say outages like the one Amazon experienced this week are almost inevitable given the complexity and scale of cloud technology, but the duration serves as a warning. A sprawling Amazon Web Services cloud outage that began early Monday morning illustrated the fragile interdependencies of the internet, as major communication, financial, health care, education, and government platforms around the world suffered disruptions. As the day wore on, AWS diagnosed and began working to correct the issue, which stemmed from the company's critical US-EAST-1 region in northern Virginia. But the cascade of impacts took time to fully resolve. Researchers reflecting on the incident particularly highlighted the length of the outage, which started around 3 am ET on Monday, October 20.



Domain Regeneration: How well do LLMs match syntactic properties of text domains?

Ju, Da, Blix, Hagen, Williams, Adina

arXiv.org Artificial Intelligence

Recent improvements in large language model performance have, in all likelihood, been accompanied by improvements in how well they can approximate the distribution of their training data. In this work, we explore the following question: which properties of text domains do LLMs faithfully approximate, and how well do they do so? Applying observational approaches familiar from corpus linguistics, we prompt a commonly used, open-source LLM to regenerate text from two domains of permissively licensed English text that are often contained in LLM training data -- Wikipedia and news text. This regeneration paradigm allows us to investigate whether LLMs can faithfully match the original human text domains in a fairly semantically controlled setting. We investigate varying levels of syntactic abstraction, from simpler properties such as sentence length and article readability, to more complex, higher-order properties such as dependency tag distribution, parse depth, and parse complexity. We find that the majority of the regenerated distributions show a shifted mean, a lower standard deviation, and a reduction of the long tail compared to the human originals.
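The kind of distributional comparison the abstract describes can be made concrete with a small sketch: summarize each sentence-length sample by its mean, standard deviation, and the mass of its long tail above a fixed cutoff, then compare human originals to regenerations. The numbers below are invented for illustration, not the paper's data:

```python
import statistics

def tail_profile(lengths, tail_cutoff):
    """Summarize a sentence-length distribution by mean, standard
    deviation, and the fraction of mass in the tail above a cutoff."""
    return {
        "mean": statistics.mean(lengths),
        "stdev": statistics.stdev(lengths),
        "tail_frac": sum(l > tail_cutoff for l in lengths) / len(lengths),
    }

human = [5, 8, 12, 14, 15, 18, 22, 30, 45, 80]    # long-tailed, as natural text tends to be
regen = [10, 11, 12, 12, 13, 13, 14, 14, 15, 16]  # narrower, tail reduced
cut = 30  # e.g. a high quantile of the human distribution

h, r = tail_profile(human, cut), tail_profile(regen, cut)
```

With these toy samples, the regenerated profile `r` shows the paper's qualitative finding: a smaller standard deviation and less tail mass than the human profile `h`.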


Taming the Long Tail in Human Mobility Prediction

Neural Information Processing Systems

With the popularity of location-based services, human mobility prediction plays a key role in enhancing personalized navigation, optimizing recommendation systems, and facilitating urban mobility and planning. This involves predicting a user's next POI (point-of-interest) visit using their past visit history. However, the uneven distribution of visitations over time and space, namely the long-tail problem in spatial distribution, makes it difficult for AI models to predict POIs that are rarely visited. In light of this issue, we propose the Long-Tail Adjusted Next POI Prediction (LoTNext) framework for mobility prediction, combining a Long-Tailed Graph Adjustment module to reduce the impact of long-tailed nodes in the user-POI interaction graph with a novel Long-Tailed Loss Adjustment module that adjusts the loss via a logit-score and sample-weight adjustment strategy. We also employ an auxiliary prediction task to enhance generalization and accuracy.
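One standard way to adjust scores for long-tailed classes is logit adjustment: subtract a scaled log-prior from each class logit so rare classes (here, rarely visited POIs) are not crowded out by frequent ones. This is a generic stand-in sketched under assumed names; LoTNext's actual Long-Tailed Loss Adjustment module may take a different form:

```python
import math

def logit_adjusted_scores(logits, class_counts, tau=1.0):
    """Logit adjustment for long-tailed classification: subtract
    tau * log(empirical class prior) from each raw logit, boosting
    rare classes relative to frequent ones."""
    total = sum(class_counts)
    return [z - tau * math.log(c / total)
            for z, c in zip(logits, class_counts)]

# Two POIs with equal raw logits: after adjustment, the rare one
# (100 visits) outscores the popular one (900 visits).
adj = logit_adjusted_scores([2.0, 2.0], [900, 100])
```

Applied inside the training loss rather than at inference, the same shift reweights gradients toward tail POIs, which is the spirit of the loss-adjustment strategy described above.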


Review for NeurIPS paper: What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation

Neural Information Processing Systems

Weaknesses: I would like to see some clarification of the long-tail theory. If the value of mem(A, S, i_1, ..., i_k) is high, perhaps we can still call this phenomenon "memorization." If so, then the memorization phenomenon is not limited to long tails, and it seems to me the claim in [12] that memorization is needed due to the long tail may not be showing the bigger picture. The paper mentions that very high influence scores are due to near-duplicates among the training and test examples.


Review for NeurIPS paper: What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation

Neural Information Processing Systems

The reviewers feel that the issues are interesting and the contributions are sufficient for acceptance. However, there are serious suggestions for improvements in the experiments. The paper seems suggestive, but not definitive, on the long-tail hypothesis.