Goto

Collaborating Authors

 srank


Appendix for " Beyond the Signs: Nonparametric Tensor Completion via Sign Series "

Neural Information Processing Systems

The appendix consists of proofs (Section A), additional theoretical results (Section B), and numerical experiments (Section C). When g is strictly increasing, the mapping x7 g(x) is sign preserving. Specifically, if x 0, then g(x) g(0) = 0. Conversely, ifg(x) 0 = g(0), then applying g 1 to both sides givesx 0. When g is strictly decreasing, the mappingx7 g(x) is sign reversing. See Section B.2 for constructive examples. Based on the definition of classification lossL(,), the function Risk() relies only on the sign pattern of the tensor.


BeyondtheSigns: NonparametricTensor CompletionviaSignSeries

Neural Information Processing Systems

A nonparametric approach to tensor completion is developed based on anewmodel which we coin assign representable tensors. The model represents the signal tensor of interest using a series of structured sign tensors. Unlike earlier methods, the sign series representation effectively addresses both low-andhigh-rank signals, while encompassing manyexisting tensor models-- includingCPmodels,Tuckermodels,singleindexmodels,structuredtensorswith repeating entries--as special cases. We provably reduce the tensor estimation problem to a series of structured classification tasks, and we develop a learning reduction machinery to empower existing low-rank tensor algorithms for more challenging high-rank estimation.


Appendix for " Beyond the Signs: Nonparametric Tensor Completion via Sign Series "

Neural Information Processing Systems

See Section B.2 for constructive examples.Proof of Proposition 2. Based on (3) in Proposition 2, we have Risk( Z) Risk( ฮ˜) = E null |sgnZ sgn ฮ˜|| ฮ˜|null . We divide the proof into two cases: ฮฑ > 0 and ฮฑ = . The inequality (6) now becomes Risk( Z) Risk( ฮ˜) t null MAE(sgn ฮ˜, sgnZ) C snull, for all 0 t < ฯ(ฯ€, N) . Consider the same setup as in Theorem 2. Fix The conclusion (10) then directly follows by applying Remark A.1 to (11). 3 Proof of Theorem 2. To simplify the notation, we denote ฯ = ฯ(ฯ€, N). It follows from Kosorok (2007, Theorem 9.22) that the Proof of Theorem 3. By definition of ห† ฮ˜, we have MAE( ห† ฮ˜, ฮ˜) = E null null null null null 1 2H + 1 null Assumption A.1, we establish the estimation accuracy guarantee for the large-margin estimators H log H. (29) In particualr, setting H null (1 + |N|) To apply Theorem A.1, we choose the pair ( L Here, we describe the details of the example set-up.


Unraveling Zeroth-Order Optimization through the Lens of Low-Dimensional Structured Perturbations

arXiv.org Artificial Intelligence

Zeroth-order (ZO) optimization has emerged as a promising alternative to gradient-based backpropagation methods, particularly for black-box optimization and large language model (LLM) fine-tuning. However, ZO methods suffer from slow convergence due to high-variance stochastic gradient estimators. While structured perturbations, such as sparsity and low-rank constraints, have been explored to mitigate these issues, their effectiveness remains highly under-explored. In this work, we develop a unified theoretical framework that analyzes both the convergence and generalization properties of ZO optimization under structured perturbations. We show that high dimensionality is the primary bottleneck and introduce the notions of \textit{stable rank} and \textit{effective overlap} to explain how structured perturbations reduce gradient noise and accelerate convergence. Using the uniform stability under our framework, we then provide the first theoretical justification for why these perturbations enhance generalization. Additionally, through empirical analysis, we identify that \textbf{block coordinate descent} (BCD) to be an effective structured perturbation method. Extensive experiments show that, compared to existing alternatives, memory-efficient ZO (MeZO) with BCD (\textit{MeZO-BCD}) can provide improved converge with a faster wall-clock time/iteration by up to $\times\textbf{2.09}$ while yielding similar or better accuracy.


Embedding Ontologies in the Description Logic ALC by Axis-Aligned Cones

Journal of Artificial Intelligence Research

This paper is concerned with knowledge graph embedding with background knowledge, taking the formal perspective of logics. In knowledge graph embedding, knowledgeโ€” expressed as a set of triples of the form (a R b) (โ€œa is R-related to bโ€)โ€”is embedded into a real-valued vector space. The embedding helps exploiting geometrical regularities of the space in order to tackle typical inductive tasks of machine learning such as link prediction. Recent embedding approaches also consider incorporating background knowledge, in which the intended meanings of the symbols a, R, b are further constrained via axioms of a theory. Of particular interest are theories expressed in a formal language with a neat semantics and a good balance between expressivity and feasibility. In that case, the knowledge graph together with the background can be considered to be an ontology. This paper develops a cone-based theory for embedding in order to advance the expressivity of the ontology: it works (at least) with ontologies expressed in the description logic ALC, which comprises restricted existential and universal quantifiers, as well as concept negation and concept disjunction. In order to align the classical Tarskian Style semantics for ALC with the sub-symbolic representation of triples, we use the notion of a geometric model of an ALC ontology and show, as one of our main results, that an ALC ontology is satisfiable in the classical sense iff it is satisfiable by a geometric model based on cones. The geometric model, if treated as a partial model, can even be chosen to be faithful, i.e., to reflect all and only the knowledge captured by the ontology. We introduce the class of axis-aligned cones and show that modulo simple geometric operations any distributive logic (such as ALC) interpreted over cones employs this class of cones. Cones are also attractive from a machine learning perspective on knowledge graph embeddings since they give rise to applying conic optimization techniques.


Approximate Real Symmetric Tensor Rank

arXiv.org Artificial Intelligence

We investigate the effect of an $\varepsilon$-room of perturbation tolerance on symmetric tensor decomposition. To be more precise, suppose a real symmetric $d$-tensor $f$, a norm $||.||$ on the space of symmetric $d$-tensors, and $\varepsilon >0$ are given. What is the smallest symmetric tensor rank in the $\varepsilon$-neighborhood of $f$? In other words, what is the symmetric tensor rank of $f$ after a clever $\varepsilon$-perturbation? We prove two theorems and develop three corresponding algorithms that give constructive upper bounds for this question. With expository goals in mind; we present probabilistic and convex geometric ideas behind our results, reproduce some known results, and point out open problems.


KDEformer: Accelerating Transformers via Kernel Density Estimation

arXiv.org Artificial Intelligence

Dot-product attention mechanism plays a crucial role in modern deep architectures (e.g., Transformer) for sequence modeling, however, na\"ive exact computation of this model incurs quadratic time and memory complexities in sequence length, hindering the training of long-sequence models. Critical bottlenecks are due to the computation of partition functions in the denominator of softmax function as well as the multiplication of the softmax matrix with the matrix of values. Our key observation is that the former can be reduced to a variant of the kernel density estimation (KDE) problem, and an efficient KDE solver can be further utilized to accelerate the latter via subsampling-based fast matrix products. Our proposed KDEformer can approximate the attention in sub-quadratic time with provable spectral norm bounds, while all prior results merely provide entry-wise error bounds. Empirically, we verify that KDEformer outperforms other attention approximations in terms of accuracy, memory, and runtime on various pre-trained models. On BigGAN image generation, we achieve better generative scores than the exact computation with over $4\times$ speedup. For ImageNet classification with T2T-ViT, KDEformer shows over $18\times$ speedup while the accuracy drop is less than $0.5\%$.


Explicit and Implicit Semantic Ranking Framework

arXiv.org Artificial Intelligence

The core challenge in numerous real-world applications is to match an inquiry to the best document from a mutable and finite set of candidates. Existing industry solutions, especially latency-constrained services, often rely on similarity algorithms that sacrifice quality for speed. In this paper we introduce a generic semantic learning-to-rank framework, Self-training Semantic Cross-attention Ranking (sRank). This transformer-based framework uses linear pairwise loss with mutable training batch sizes and achieves quality gains and high efficiency, and has been applied effectively to show gains on two industry tasks at Microsoft over real-world large-scale data sets: Smart Reply (SR) and Ambient Clinical Intelligence (ACI). In Smart Reply, $sRank$ assists live customers with technical support by selecting the best reply from predefined solutions based on consumer and support agent messages. It achieves 11.7% gain in offline top-one accuracy on the SR task over the previous system, and has enabled 38.7% time reduction in composing messages in telemetry recorded since its general release in January 2021. In the ACI task, sRank selects relevant historical physician templates that serve as guidance for a text summarization model to generate higher quality medical notes. It achieves 35.5% top-one accuracy gain, along with 46% relative ROUGE-L gain in generated medical notes.


A Generalized Bootstrap Target for Value-Learning, Efficiently Combining Value and Feature Predictions

arXiv.org Artificial Intelligence

Estimating value functions is a core component of reinforcement learning algorithms. Temporal difference (TD) learning algorithms use bootstrapping, i.e. they update the value function toward a learning target using value estimates at subsequent time-steps. Alternatively, the value function can be updated toward a learning target constructed by separately predicting successor features (SF)--a policy-dependent model--and linearly combining them with instantaneous rewards. We focus on bootstrapping targets used when estimating value functions, and propose a new backup target, the $\eta$-return mixture, which implicitly combines value-predictive knowledge (used by TD methods) with (successor) feature-predictive knowledge--with a parameter $\eta$ capturing how much to rely on each. We illustrate that incorporating predictive knowledge through an $\eta\gamma$-discounted SF model makes more efficient use of sampled experience, compared to either extreme, i.e. bootstrapping entirely on the value function estimate, or bootstrapping on the product of separately estimated successor features and instantaneous reward models. We empirically show this approach leads to faster policy evaluation and better control performance, for tabular and nonlinear function approximations, indicating scalability and generality.


Nonparametric Trace Regression in High Dimensions via Sign Series Representation

arXiv.org Machine Learning

Learning of matrix-valued data has recently surged in a range of scientific and business applications. Trace regression is a widely used method to model effects of matrix predictors and has shown great success in matrix learning. However, nearly all existing trace regression solutions rely on two assumptions: (i) a known functional form of the conditional mean, and (ii) a global low-rank structure in the entire range of the regression function, both of which may be violated in practice. In this article, we relax these assumptions by developing a general framework for nonparametric trace regression models via structured sign series representations of high dimensional functions. The new model embraces both linear and nonlinear trace effects, and enjoys rank invariance to order-preserving transformations of the response. In the context of matrix completion, our framework leads to a substantially richer model based on what we coin as the "sign rank" of a matrix. We show that the sign series can be statistically characterized by weighted classification tasks. Based on this connection, we propose a learning reduction approach to learn the regression model via a series of classifiers, and develop a parallelable computation algorithm to implement sign series aggregations. We establish the excess risk bounds, estimation error rates, and sample complexities. Our proposal provides a broad nonparametric paradigm to many important matrix learning problems, including matrix regression, matrix completion, multi-task learning, and compressed sensing. We demonstrate the advantages of our method through simulations and two applications, one on brain connectivity study and the other on high-rank image completion.