Goto

Collaborating Authors

 pk 1


True Impact of Cascade Length in Contextual Cascading Bandits

Neural Information Processing Systems

We revisit the contextual cascading bandit, where a learning agent recommends an ordered list (cascade) of items, and a user scans the list sequentially, stopping at the first attractive item. Although cascading bandits underpin various applications including recommender systems and search engines, the role of the cascade length K in shaping regret has remained unclear. Contrary to prior results that regret grows with K, we prove that regret actually decreases once K is large enough. Leveraging this insight, we design a new upper-confidence-bound algorithm built on online mirror descent that attains the sharpest known regret upper bound, O min{K pK 1,1}d Tfor contextual cascading bandits. To complement this new regret upper bound, we provide a nearly matching lower bound of Ω min{KpK 1,1}d T, where 0 p p < 1. Together, these results fully characterize how regret truly scales with K, thereby closing the theoretical gap for contextual cascading bandits. Finally, comprehensive experiments validate our theoretical results and show the effectiveness of our proposed method.


Supplementary Materials AExpanded Related Work

Neural Information Processing Systems

A number of gradient-based bilevel algorithms have been proposed via AIDand ITD-based hypergradient approximations. For example, AID-based hypergradient computation [4, 33, 10, 11, 19] estimates the Hessian-inverse-vector product by solving a linear system with an efficient iterative algorithm. ITD-based hypergradient computation [31, 8, 9, 6, 35, 17] involves a backpropagation over the inner-loop gradient-based optimization path. Convergence rate of AIDand ITD-based bilevel methods has been studied recently. For example, [10, 19] and [19, 17] analyzed the convergence rate and complexity of AIDand ITD-based bilevel algorithms, respectively.


Supplementary Material

Neural Information Processing Systems

The supplementary material is organized as follows. First, we prove Proposition 1 and Theorem 1. We then provide some details on the overall communication complexity of FedRZObl. Lastly, we present some additional numerical experiments. In this section we prove Proposition 1, and some preliminary lemmas.


Neyman-Pearson multiclass classification under label noise via empirical likelihood

arXiv.org Machine Learning

In many classification problems, the costs of misclassifying observations from different classes can be highly unequal. The Neyman-Pearson multiclass classification (NPMC) framework addresses this issue by minimizing a weighted misclassification risk while imposing upper bounds on class-specific error probabilities. Existing NPMC methods typically assume that training labels are correctly observed. In practice, however, labels are often corrupted due to measurement error or annotation, and the effect of such label noise on NPMC procedures remains largely unexplored. We study the NPMC problem when only noisy labels are available in the training data. We propose an empirical likelihood (EL)-based method that relates the distributions of noisy and true labels through an exponential tilting density ratio model. The resulting maximum EL estimators recover the class proportions and posterior probabilities of the clean labels required for error control. We establish consistency, asymptotic normality, and optimal convergence rates for these estimators. Under mild conditions, the resulting classifier satisfies NP oracle inequalities with respect to the true labels asymptotically. An expectation-maximization algorithm computes the maximum EL estimators. Simulations show that the proposed method performs comparably to the oracle classifier under clean labels and substantially improves over procedures that ignore label noise.


Fast Estimation of Causal Interactions using Wold Processes

Neural Information Processing Systems

Recently, several fields used networked point processes to understand complex systems such as spiking biological neurons [36],social networks[8,42]geo-sensor networks[22],financial agents inmarkets[37],television records [48]and patient visits [11]. One ofthemain objectivesinthese analyses istouncoverthe causal relationships among the entities ofthe system, ortheinteraction structure among the nodes, which is also called thelatent network structure.





Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent

arXiv.org Machine Learning

Transformers have demonstrated remarkable capabilities in multi-step reasoning tasks. However, understandings of the underlying mechanisms by which they acquire these abilities through training remain limited, particularly from a theoretical standpoint. This work investigates how transformers learn to solve symbolic multi-step reasoning problems through chain-of-thought processes, focusing on path-finding in trees. We analyze two intertwined tasks: a backward reasoning task, where the model outputs a path from a goal node to the root, and a more complex forward reasoning task, where the model implements two-stage reasoning by first identifying the goal-to-root path and then reversing it to produce the root-to-goal path. Our theoretical analysis, grounded in the dynamics of gradient descent, shows that trained one-layer transformers can provably solve both tasks with generalization guarantees to unseen trees. In particular, our multi-phase training dynamics for forward reasoning elucidate how different attention heads learn to specialize and coordinate autonomously to solve the two subtasks in a single autoregressive path. These results provide a mechanistic explanation of how trained transformers can implement sequential algorithmic procedures. Moreover, they offer insights into the emergence of reasoning abilities, suggesting that when tasks are structured to take intermediate chain-of-thought steps, even shallow multi-head transformers can effectively solve problems that would otherwise require deeper architectures.


Covariance Estimation for Matrix-valued Data

arXiv.org Machine Learning

Covariance estimation for matrix-valued data has received an increasing interest in applications including neuroscience and environmental studies. Unlike previous works that rely heavily on matrix normal distribution assumption and the requirement of fixed matrix size, we propose a class of distribution-free regularized covariance estimation methods for high-dimensional matrix data under a separability condition and a bandable covariance structure. Under these conditions, the original covariance matrix is decomposed into a Kronecker product of two bandable small covariance matrices representing the variability over row and column directions. We formulate a unified framework for estimating the banded and tapering covariance, and introduce an efficient algorithm based on rank one unconstrained Kronecker product approximation. The convergence rates of the proposed estimators are studied and compared to the ones for the usual vector-valued data. We further introduce a class of robust covariance estimators and provide theoretical guarantees to deal with the potential heavy-tailed data. We demonstrate the superior finite-sample performance of our methods using simulations and real applications from an electroencephalography study and a gridded temperature anomalies dataset.