- North America > United States > New York > Erie County > Buffalo (0.05)
- Oceania > Australia > Victoria > Melbourne (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
- (4 more...)
Critical attention scaling in long-context transformers
Chen, Shi, Lin, Zhengjiang, Polyanskiy, Yury, Rigollet, Philippe
As large language models scale to longer contexts, attention layers suffer from a fundamental pathology: attention scores collapse toward uniformity as the context length $n$ increases, causing tokens to cluster excessively, a phenomenon known as rank collapse. While $\textit{attention scaling}$ effectively addresses this deficiency by rescaling attention scores with a polylogarithmic factor $\beta_n$, theoretical justification for this approach remains lacking. We analyze a simplified yet tractable model that magnifies the effect of attention scaling. In this model, attention exhibits a phase transition governed by the scaling factor $\beta_n$: insufficient scaling collapses all tokens to a single direction, while excessive scaling reduces attention to the identity, thereby eliminating meaningful interactions between tokens. Our main result identifies the critical scaling $\beta_n \asymp \log n$ and provides a rigorous justification for the attention scaling used in YaRN and Qwen, clarifying why logarithmic scaling maintains sparse, content-adaptive attention at large context lengths.
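A minimal sketch of the mechanism, assuming a plain single-head attention layer (illustrative code with hypothetical names, not the paper's model): the pre-softmax scores are multiplied by $\beta_n \propto \log n$, and driving that scale toward zero or toward infinity reproduces the two failure modes described in the abstract.

```python
import numpy as np

def scaled_attention(Q, K, V, beta_scale=1.0):
    """Single-head attention with logarithmic attention scaling (sketch).

    Pre-softmax scores are multiplied by beta_n = beta_scale * log(n),
    where n is the context length; beta_n ~ log n is the critical
    scaling identified in the abstract.
    """
    n, d = Q.shape
    beta_n = beta_scale * np.log(n)               # beta_n ~ log n
    scores = beta_n * (Q @ K.T) / np.sqrt(d)      # rescaled attention scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# beta_scale -> 0: weights tend to uniform (tokens cluster);
# beta_scale too large: each row concentrates on its argmax,
# a near-identity map when Q = K.
rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((1024, 64))
out = scaled_attention(Q, K, V)
```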
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Supplementary Material: Additional Notation
A.1 Robust Mean Estimation from Subset Stability
The upper bound is always less than ... for $m \le n$. Let $m$ be the largest value of $f(x)$ for any $x \in T$ with $w(x) \neq 0$. Thus, by the weighted version of Lemma 2.4 of [DK19], we have that ... In Section B.1, we show a result stating that pre-processing on i.i.d. points yields a set that contains ... Then, in Section B.2, we use a coupling argument to show ... We recall the median-of-means principle. We now state our main result in this section, proved using minimax duality, that Theorem B.1 implies ... We first consider the case of i.i.d. ... In particular, Lemma E.2 shows that we can deterministically round ... We now prove Theorem 1.7, i.e., stability of a subset after corruption, using Theorem B.2.
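The median-of-means principle recalled above admits a compact illustration; the following sketch (illustrative only, not the estimator analyzed in this supplement) splits the sample into buckets, averages within each, and reports the median of the bucket means, which a handful of gross outliers cannot move far.

```python
import numpy as np

def median_of_means(x, k=25):
    """Median-of-means estimate of the mean of a 1-D sample.

    Split the n points into k buckets at random, average each bucket,
    and return the median of the k bucket means. A few adversarial
    points corrupt at most a few buckets, so the median of the bucket
    means stays close to the true mean.
    """
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(0)
    x = rng.permutation(x)                  # random bucket assignment
    buckets = np.array_split(x, k)
    means = np.array([b.mean() for b in buckets])
    return np.median(means)

# Example: 4 gross outliers corrupt at most 4 of 25 buckets.
rng = np.random.default_rng(1)
sample = rng.standard_normal(1000)
sample[:4] = 1e6
print(median_of_means(sample, k=25))        # close to 0
```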
Anchor-MoE: A Mean-Anchored Mixture of Experts For Probabilistic Regression
Regression under uncertainty is fundamental across science and engineering. We present the Anchored Mixture of Experts (Anchor-MoE), a model that handles both probabilistic and point regression. For simplicity, we use a tuned gradient-boosting model to furnish the anchor mean; however, any off-the-shelf point regressor can serve as the anchor. The anchor prediction is projected into a latent space, where a learnable metric-window kernel scores locality and a soft router dispatches each sample to a small set of mixture-density-network experts; the experts produce a heteroscedastic correction and a predictive variance. We train by minimizing the negative log-likelihood and, on a disjoint calibration split, fit a post-hoc linear map on the predicted means to improve point accuracy. On the theory side, assuming a Hölder-smooth regression function of order~$\alpha$ and fixed Lipschitz partition-of-unity weights with bounded overlap, we show that Anchor-MoE attains the minimax-optimal $L^2$ risk rate $O\!\big(N^{-2\alpha/(2\alpha+d)}\big)$. In addition, the CRPS test generalization gap scales as $\widetilde{O}\!\Big(\sqrt{(\log(Mh)+P+K)/N}\Big)$; it is logarithmic in $Mh$ and grows as the square root of $P$ and $K$. Under bounded-overlap routing, $K$ can be replaced by $k$, and any dependence on a latent dimension is absorbed into $P$. Under uniformly bounded means and variances, an analogous $\widetilde{O}\!\big(\sqrt{(\log(Mh)+P+K)/N}\big)$ scaling holds for the test NLL up to constants. Empirically, across standard UCI regression benchmarks, Anchor-MoE consistently matches or surpasses the strong NGBoost baseline in RMSE and NLL; on several datasets it achieves new state-of-the-art probabilistic regression results on our benchmark suite. Code is available at https://github.com/BaozhuoSU/Probabilistic_Regression.
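Read as a predictive density, the architecture described above amounts to a mixture of Gaussians whose means are anchored at the point regressor's prediction; a minimal sketch of the forward pass follows (hypothetical function names and interfaces, not the released code at the repository above).

```python
import numpy as np

def anchor_moe_density(y, x_latent, anchor_mean, experts, router):
    """Predictive density of a mean-anchored mixture of experts (sketch).

    anchor_mean : the anchor regressor's point prediction for this sample
    experts     : list of callables z -> (delta, sigma), each expert's
                  heteroscedastic correction and predictive std
    router      : callable z -> routing weights over experts (softmax over
                  metric-window kernel scores; sums to 1)
    """
    w = router(x_latent)                       # soft routing weights
    density = 0.0
    for w_k, expert in zip(w, experts):
        delta, sigma = expert(x_latent)        # correction and spread
        mu = anchor_mean + delta               # anchored expert mean
        density += w_k * np.exp(-0.5 * ((y - mu) / sigma) ** 2) \
                   / (sigma * np.sqrt(2 * np.pi))
    return density  # training minimizes -log(density) over the data

# Toy usage: two experts, uniform router (illustrative only).
experts = [lambda z: (0.0, 1.0), lambda z: (0.5, 2.0)]
router = lambda z: np.array([0.5, 0.5])
print(anchor_moe_density(y=1.0, x_latent=None, anchor_mean=0.8,
                         experts=experts, router=router))
```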
9ecff5455677b38d19f49ce658ef0608-AuthorFeedback.pdf
We thank the reviewers for their positive and constructive feedback. We address several of the points raised in the reviews below. The bias reduction technique in Section 5 is designed for DP-SGD with clipping; when it is applied to DP-SGD, the update rule is shown below. Typos: thank you for pointing them out; we will correct them.
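For orientation only (the specific rule from Section 5 is not reproduced here), a standard DP-SGD step with per-example clipping (Abadi et al., 2016) looks like the sketch below; the authors' bias-reduction modification is not included.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, rng,
                lr=0.1, clip_norm=1.0, noise_mult=1.0):
    """One standard DP-SGD update with per-example gradient clipping.

    Textbook rule (Abadi et al., 2016): clip each example's gradient to
    norm clip_norm, sum, add Gaussian noise of scale noise_mult * clip_norm,
    average, and take a gradient step. The bias-reduction modification
    from the paper's Section 5 is NOT reproduced here.
    """
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    noise = rng.normal(0.0, noise_mult * clip_norm, size=params.shape)
    g_tilde = (np.sum(clipped, axis=0) + noise) / len(clipped)
    return params - lr * g_tilde

# Example usage with toy per-example gradients.
rng = np.random.default_rng(0)
params = np.zeros(10)
grads = [rng.standard_normal(10) for _ in range(32)]
params = dp_sgd_step(params, grads, rng)
```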
Supplementary Material: Fairness in Ranking under Uncertainty
A Related Work
The group fairness perspective imposes constraints like demographic parity (Calders et al., 2009; Zliobaite, 2015) and equalized odds (Hardt et al., 2016). Although similar in spirit, our work sidesteps the need to define a similarity metric between agents in the feature space; rather, we view an agent's ... Ranking has been widely studied in the field of Information Retrieval (IR), mostly in the context of optimizing user utility. The Probability Ranking Principle (PRP) (Robertson, 1977), a guiding principle for ranking in IR, states that user utility is optimal when documents (i.e., the agents) are ... Besides ranking diversity, IR methods have dealt with uncertainty in relevance that comes via users' implicit or explicit feedback (Penha and Hauff, 2021; Soufiani et al., 2012), as well as stochasticity arising ... Kearns et al. (2017) present a way to fairly select ... Hence, they propose using the true CDF rank as a derived merit criterion that can be compared. Thus, a fair principal stands to gain more by obtaining perfect information.
Point Prediction for Streaming Data
Chanda, Aleena, Vinodchandran, N. V., Clarke, Bertrand
We present two new approaches for point prediction with streaming data. One is based on the Count-Min sketch (CMS) and the other on Gaussian process priors with a random bias. These methods are intended for the most general predictive problems, where no true model can be usefully formulated for the data stream. In statistical contexts, this is often called the $\mathcal{M}$-open problem class. Under the assumption that the data consist of i.i.d. samples from a fixed distribution function $F$, we show that the CMS-based estimates of the distribution function are consistent. We compare our new methods with two established predictors in terms of cumulative $L^1$ error: one based on the Shtarkov solution (often called the normalized maximum likelihood) in the normal experts setting, and the other based on Dirichlet process priors. These comparisons are made in two settings. The first is one-pass, meaning that the predictors are updated from each data point as it arrives, using the fact that the CMS is a sketch. For predictors that are not one-pass, we use streaming $K$-means to maintain a representative subset of fixed size that can be updated as data accumulate. Preliminary computational work suggests that the one-pass median version of the CMS method is rarely outperformed by the other methods on sufficiently complex data. We also find that predictors based on Gaussian process priors with random biases perform well. The Shtarkov predictors we use here did not perform as well, probably because we used only the simplest example. The other predictors seemed to perform well mainly when the data did not look like they came from an $\mathcal{M}$-open data generator.
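As a hedged illustration of the CMS-based approach (not the authors' implementation), a Count-Min sketch maintains approximate counts over a stream in fixed memory; binning real-valued observations and summing bin counts up to a threshold yields the kind of distribution-function estimate the abstract describes.

```python
import numpy as np

class CountMinSketch:
    """Count-Min sketch: fixed-memory approximate counts for a stream.

    depth independent hash rows of width w; a query returns the minimum
    over rows, which overestimates the true count by at most ~ n/w
    with high probability.
    """
    def __init__(self, width=2048, depth=5, seed=0):
        self.w, self.d = width, depth
        self.table = np.zeros((depth, width), dtype=np.int64)
        rng = np.random.default_rng(seed)
        self.salts = rng.integers(1, 2**31, size=depth)

    def _idx(self, item):
        return [hash((int(s), item)) % self.w for s in self.salts]

    def update(self, item):
        for row, col in enumerate(self._idx(item)):
            self.table[row, col] += 1

    def query(self, item):
        return min(self.table[row, col]
                   for row, col in enumerate(self._idx(item)))

# Sketch of a distribution-function estimate: bin the stream, count bins,
# then sum bin counts up to x (illustrative, not the paper's estimator).
cms, n, edges = CountMinSketch(), 0, np.linspace(-4, 4, 81)
rng = np.random.default_rng(2)
for x in rng.standard_normal(10000):
    cms.update(int(np.digitize(x, edges)))
    n += 1
F_hat = lambda x: sum(cms.query(b)
                      for b in range(int(np.digitize(x, edges)) + 1)) / n
print(F_hat(0.0))  # approximately 0.5 for standard normal data
```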
- North America > United States > Nebraska > Lancaster County > Lincoln (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- (4 more...)
- Information Technology > Modeling & Simulation (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.34)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)