long tail
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > Texas (0.04)
- Asia > South Korea (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report > Experimental Study (0.68)
- Research Report > New Finding (0.68)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (0.69)
What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation
Deep learning algorithms are well known to fit the training data very well, often fitting even outliers and mislabeled data points. Such fitting requires memorization of training data labels, a phenomenon that has attracted significant research interest but has not yet been given a compelling explanation. A recent work of Feldman (2019) proposes a theoretical explanation for this phenomenon based on a combination of two insights. First, natural image and data distributions are (informally) known to be long-tailed, that is, to have a significant fraction of rare and atypical examples. Second, in a simple theoretical model, such memorization is necessary for achieving close-to-optimal generalization error when the data distribution is long-tailed.
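The memorization quantity at the heart of this line of work can be sketched as a leave-one-out score: how much more likely is the trained model to label example i correctly when i is present in the training set? A minimal Python sketch, assuming a hypothetical `train_fn(X, y)` that returns a model with a `predict` method (the paper's actual influence estimator uses subsampled training runs, which are far cheaper than literal leave-one-out retraining):

```python
import numpy as np

def memorization_score(train_fn, X, y, i, n_trials=1):
    """Leave-one-out estimate of mem(A, S, i): the drop in the chance
    that the learned model labels example i correctly when i is removed
    from the training set. `train_fn(X, y) -> model with .predict(X)`
    is an assumed interface, not the paper's API."""
    X, y = np.asarray(X), np.asarray(y)
    mask = np.ones(len(X), dtype=bool)
    mask[i] = False  # training set without example i
    acc_with = np.mean([
        train_fn(X, y).predict(X[i:i + 1])[0] == y[i]
        for _ in range(n_trials)
    ])
    acc_without = np.mean([
        train_fn(X[mask], y[mask]).predict(X[i:i + 1])[0] == y[i]
        for _ in range(n_trials)
    ])
    return acc_with - acc_without
```

Under a memorizing learner (e.g. 1-nearest-neighbor), a typical example scores near 0 (the rest of the data already predicts its label), while an atypical, long-tail example scores near 1 (only its own presence explains the correct prediction).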
Beat the long tail: Distribution-Aware Speculative Decoding for RL Training
Shao, Zelei, Srivatsa, Vikranth, Srivastava, Sanjana, Wu, Qingyang, Ariyak, Alpay, Wu, Xiaoxia, Patel, Ameen, Wang, Jue, Liang, Percy, Dao, Tri, Zhang, Ce, Zhang, Yiying, Athiwaratkun, Ben, Xu, Chenfeng, Wang, Junxiong
Reinforcement learning (RL) post-training has become essential for aligning large language models (LLMs), yet its efficiency is increasingly constrained by the rollout phase, where long trajectories are generated token by token. We identify a major bottleneck: the long-tail distribution of rollout lengths, where a small fraction of long generations dominates wall-clock time; and a complementary opportunity: the availability of historical rollouts that reveal stable prompt-level patterns across training epochs. Motivated by these observations, we propose DAS, a Distribution-Aware Speculative decoding framework that accelerates RL rollouts without altering model outputs. DAS integrates two key ideas: an adaptive, nonparametric drafter built from recent rollouts using an incrementally maintained suffix tree, and a length-aware speculation policy that allocates more aggressive draft budgets to long trajectories that dominate makespan. This design exploits rollout history to sustain acceptance while balancing base- and token-level costs during decoding. Experiments on math and code reasoning tasks show that DAS reduces rollout time by up to 50% while preserving identical training curves, demonstrating that distribution-aware speculative decoding can significantly accelerate RL post-training without compromising learning quality.
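The drafter idea can be illustrated with a deliberately simplified sketch: instead of the paper's incrementally maintained suffix tree, a fixed-length suffix lookup table over recent rollouts proposes draft tokens for the target model to verify. All names here are illustrative, not the DAS implementation:

```python
from collections import defaultdict

class RolloutDrafter:
    """Nonparametric drafter: propose continuations by matching the
    current suffix of a partial rollout against recent rollouts.
    Simplified to a fixed-length suffix table; DAS maintains an
    incremental suffix tree, which matches variable-length suffixes."""

    def __init__(self, suffix_len=4):
        self.suffix_len = suffix_len
        self.table = defaultdict(list)  # suffix tuple -> observed next tokens

    def add_rollout(self, tokens):
        """Index a finished rollout so future drafts can reuse it."""
        k = self.suffix_len
        for t in range(k, len(tokens)):
            self.table[tuple(tokens[t - k:t])].append(tokens[t])

    def draft(self, prefix, budget):
        """Greedily extend `prefix` by up to `budget` tokens from history;
        a length-aware policy would raise `budget` for long trajectories."""
        out, ctx = [], list(prefix)
        for _ in range(budget):
            cands = self.table.get(tuple(ctx[-self.suffix_len:]))
            if not cands:
                break
            nxt = max(set(cands), key=cands.count)  # most frequent continuation
            out.append(nxt)
            ctx.append(nxt)
        return out
```

Drafted tokens are then verified in one forward pass of the target model, so outputs are unchanged; only wall-clock time improves when acceptance is high.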
- North America > United States > Louisiana (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
Hurdle-IMDL: An Imbalanced Learning Framework for Infrared Rainfall Retrieval
Zhang, Fangjian, Zhuge, Xiaoyong, Wang, Wenlan, Xiao, Haixia, Zhu, Yuying, Cheng, Siyang
Artificial intelligence has advanced quantitative remote sensing, yet its effectiveness is constrained by imbalanced label distributions. This imbalance leads conventionally trained models to favor common samples, which in turn degrades retrieval performance for rare ones. Rainfall retrieval exemplifies this issue, with performance particularly compromised for heavy rain. This study proposes the Hurdle-Inversion Model Debiasing Learning (Hurdle-IMDL) framework. Following a divide-and-conquer strategy, imbalance in the rain distribution is decomposed into two components: zero inflation, defined by the predominance of non-rain samples, and a long tail, defined by the disproportionate abundance of light-rain samples relative to heavy-rain samples. A hurdle model is adopted to handle the zero inflation, while IMDL is proposed to address the long tail by transforming the learning objective into an unbiased ideal inverse model. Comprehensive evaluation via statistical metrics and case studies of rainy weather in eastern China confirms Hurdle-IMDL's superiority over conventional, cost-sensitive, generative, and multi-task learning methods. Its key advances include effective mitigation of systematic underestimation and a marked improvement in the retrieval of heavy-to-extreme rain. IMDL offers a generalizable approach to addressing imbalance in distributions of environmental variables, enabling enhanced retrieval of rare yet high-impact events.
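A generic two-part hurdle model is easy to sketch: a binary stage decides rain/no-rain (handling the zero inflation), and a regression stage, fit only on rainy samples, predicts the amount. This covers only the hurdle half of the framework (the IMDL debiasing of the long tail is the paper's separate contribution) and assumes hypothetical sklearn-style `fit`/`predict` components:

```python
import numpy as np

class HurdleRegressor:
    """Two-part hurdle model for zero-inflated targets such as rainfall:
    stage 1 classifies rain vs. no-rain; stage 2 regresses the amount,
    trained only where rain occurred. A generic sketch, not the paper's
    Hurdle-IMDL implementation."""

    def __init__(self, classifier, regressor):
        self.clf = classifier  # assumed fit/predict interface
        self.reg = regressor

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        rain = y > 0
        self.clf.fit(X, rain.astype(int))   # occurrence model
        self.reg.fit(X[rain], y[rain])      # amount model, rainy samples only
        return self

    def predict(self, X):
        X = np.asarray(X)
        is_rain = np.asarray(self.clf.predict(X)).astype(bool)
        out = np.zeros(len(X))
        if is_rain.any():
            out[is_rain] = self.reg.predict(X[is_rain])
        return out
```

Separating the two stages prevents the mass of zero-rain pixels from dragging the amount regressor toward zero, which is exactly the systematic underestimation the abstract describes.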
The Long Tail of the AWS Outage
Experts say outages like the one Amazon experienced this week are almost inevitable given the complexity and scale of cloud technology, but the duration serves as a warning. A sprawling Amazon Web Services cloud outage that began early Monday morning illustrated the fragile interdependencies of the internet, as major communication, financial, health care, education, and government platforms around the world suffered disruptions. As the day wore on, AWS diagnosed and began working to correct the issue, which stemmed from the company's critical US-EAST-1 region, based in northern Virginia. But the cascade of impacts took time to fully resolve. Researchers reflecting on the incident particularly highlighted the length of the outage, which began around 3 a.m. ET on Monday, October 20.
- North America > United States > Virginia (0.25)
- North America > United States > New York (0.05)
- North America > United States > California (0.05)
- (3 more...)
- Information Technology > Services (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Government > Regional Government > North America Government > United States Government (0.70)
- Information Technology > Cloud Computing (1.00)
- Information Technology > Communications > Web (0.49)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.48)
- Information Technology > Communications > Networks (0.35)
Domain Regeneration: How well do LLMs match syntactic properties of text domains?
Ju, Da, Blix, Hagen, Williams, Adina
Recent improvements in large language model performance have, in all likelihood, been accompanied by improvements in how well they can approximate the distribution of their training data. In this work, we explore the following question: which properties of text domains do LLMs faithfully approximate, and how well do they do so? Applying observational approaches familiar from corpus linguistics, we prompt a commonly used, open-source LLM to regenerate text from two domains of permissively licensed English text which are often contained in LLM training data -- Wikipedia and news text. This regeneration paradigm allows us to investigate whether LLMs can faithfully match the original human text domains in a fairly semantically controlled setting. We investigate varying levels of syntactic abstraction, from simpler properties like sentence length and article readability, to more complex, higher-order properties such as dependency tag distribution, parse depth, and parse complexity. We find that the majority of the regenerated distributions show a shifted mean, a lower standard deviation, and a reduction of the long tail, as compared to the human originals.
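The three reported effects (shifted mean, lower standard deviation, reduced long tail) can be computed for any scalar syntactic property with a few lines of Python; the quantile cutoff and names here are illustrative, not the paper's exact methodology:

```python
import statistics

def tail_report(human, model, q=0.95):
    """Compare a scalar syntactic property (e.g. sentence length) between
    human text and model-regenerated text: mean shift, spread ratio, and
    the mass beyond the human distribution's q-th quantile (a rough proxy
    for the long tail). Illustrative sketch only."""
    hs = sorted(human)
    cutoff = hs[int(q * (len(hs) - 1))]  # approximate q-th quantile of human data
    return {
        "mean_shift": statistics.mean(model) - statistics.mean(human),
        "std_ratio": statistics.stdev(model) / statistics.stdev(human),
        "tail_mass_human": sum(x > cutoff for x in human) / len(human),
        "tail_mass_model": sum(x > cutoff for x in model) / len(model),
    }
```

A regeneration that collapses toward typical sentences shows up as `std_ratio < 1` and `tail_mass_model < tail_mass_human`, matching the pattern the abstract reports.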
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.05)
- North America > United States > Alabama (0.04)
- (18 more...)
- Media (0.46)
- Leisure & Entertainment (0.46)
Taming the Long Tail in Human Mobility Prediction
With the popularity of location-based services, human mobility prediction plays a key role in enhancing personalized navigation, optimizing recommendation systems, and facilitating urban mobility and planning. This involves predicting a user's next POI (point-of-interest) visit from their past visit history. However, the uneven distribution of visitations over time and space, namely the long-tail problem in spatial distribution, makes it difficult for AI models to predict POIs that are less visited by humans. In light of this issue, we propose the Long-Tail Adjusted Next POI Prediction (LoTNext) framework for mobility prediction, combining a Long-Tailed Graph Adjustment module to reduce the impact of long-tailed nodes in the user-POI interaction graph and a novel Long-Tailed Loss Adjustment module to adjust the loss via a logit-score and sample-weight adjustment strategy. We also employ an auxiliary prediction task to enhance generalization and accuracy.
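The loss-adjustment idea can be illustrated with the standard logit-adjustment technique for long-tailed classification, which shifts each class's logit by its log prior so rarely visited POIs are not drowned out by head classes; this is a sketch of the general idea, not LoTNext's exact module:

```python
import numpy as np

def logit_adjusted_loss(logits, labels, class_counts, tau=1.0):
    """Cross-entropy with logit adjustment: add tau * log(prior) to each
    class logit before the softmax, so the loss on rare (tail) classes is
    amplified and the model is pushed to raise their scores. A standard
    long-tail technique, shown as a sketch of the idea behind a
    loss-adjustment module, not LoTNext's formulation."""
    priors = np.asarray(class_counts, dtype=float)
    priors = priors / priors.sum()
    adj = np.asarray(logits) + tau * np.log(priors)  # shift by log-prior
    adj = adj - adj.max(axis=1, keepdims=True)       # numerically stable softmax
    log_probs = adj - np.log(np.exp(adj).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

With uniform class counts the shift is constant and the loss reduces to plain cross-entropy; with skewed counts, examples labeled with a tail class incur a larger loss, which is the desired rebalancing.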
Review for NeurIPS paper: What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation
Weaknesses: I would like to see some clarification on the long-tail theory. If the value of mem(A, S, i_1, ..., i_k) is high, perhaps we can still call this phenomenon "memorization." If so, then the memorization phenomenon is not limited to long tails, and it seems to me that the claim in [12] that memorization is needed due to the long tail may not show the bigger picture. The paper mentions that very high influence scores are due to near duplicates in the training and test examples.