 argsort



Hierarchical classification at multiple operating points

Neural Information Processing Systems

Figure 4: Impact of loss hyper-parameters on the trade-off with iNat21-Mini (correct vs. recall); Table 3 outlines the parametrisation corresponding to each loss function.
Table 3: Definition and properties of the parametrisations used by each loss function (flat softmax, HXE [2], ...).
Algorithm 1: Procedure for finding the ordered Pareto set; square brackets denote array elements (subscripts were used in the main text).


A Details on the Weighting Function

Neural Information Processing Systems

Only when kN < 1 does this fail to hold. Finally, we discuss some potential questions about the rank-based weighting. Why do the weights need to be normalized? By normalizing the weights, it is easier to identify hyperparameter settings that work robustly across different problems, thereby allowing weighted retraining to be applied with minimal tuning. Why not use a weight function directly based on the objective function value?
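The excerpt above refers to normalized rank-based weights and a kN term. As a minimal sketch, assuming a weight of the form w_i proportional to 1/(kN + rank_i) (a common rank-based parametrisation consistent with the kN condition above; the function name and exact form here are illustrative, not taken from the paper):

```python
import numpy as np

def rank_weights(objectives, k):
    """Illustrative rank-based weighting: w_i proportional to 1/(k*N + rank_i),
    normalized to sum to 1. The best (largest) objective gets rank 0 and
    therefore the largest weight."""
    objectives = np.asarray(objectives, dtype=float)
    N = len(objectives)
    order = np.argsort(-objectives)          # indices sorted by decreasing objective
    ranks = np.empty(N, dtype=int)
    ranks[order] = np.arange(N)              # rank 0 = best item
    w = 1.0 / (k * N + ranks)
    return w / w.sum()                       # normalize so weights sum to 1

w = rank_weights([0.2, 0.9, 0.5], k=1e-3)
```

Because the weights depend only on ranks, the same hyperparameter k behaves comparably across objective functions with very different scales, which is the robustness argument the excerpt makes for normalization.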




On ranking via sorting by estimated expected utility

Neural Information Processing Systems

This paper addresses the question of which of these tasks are asymptotically solved by sorting in decreasing order of expected utility, for some suitable notion of utility; equivalently, when is square-loss regression consistent for ranking via score-and-sort?
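The score-and-sort procedure the abstract analyzes is mechanically simple; a minimal sketch (the function name is illustrative, and the scores stand in for a regressor's estimates of expected utility):

```python
import numpy as np

def rank_by_expected_utility(scores):
    """Score-and-sort: return item indices in decreasing order of estimated
    expected utility. A stable sort keeps ties in input order."""
    return np.argsort(-np.asarray(scores, dtype=float), kind="stable")

order = rank_by_expected_utility([0.1, 0.7, 0.4])
# order ranks item 1 first, then item 2, then item 0
```

The paper's question is whether this ranking is asymptotically correct when the scores come from square-loss regression, not how to compute it.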




XicorAttention: Time Series Transformer Using Attention with Nonlinear Correlation

Kimura, Daichi, Izumitani, Tomonori, Kashima, Hisashi

arXiv.org Artificial Intelligence

Various Transformer-based models have been proposed for time series forecasting. These models leverage the self-attention mechanism to capture long-term temporal or variate dependencies in sequences. Existing methods can be divided into two approaches: (1) reducing the computational cost of attention by making the calculations sparse, and (2) reshaping the input data to aggregate temporal features. However, existing attention mechanisms may not adequately capture inherent nonlinear dependencies present in time series data, leaving room for improvement. In this study, we propose a novel attention mechanism based on Chatterjee's rank correlation coefficient, which measures nonlinear dependencies between variables. Specifically, we replace the matrix multiplication in standard attention mechanisms with this rank coefficient to measure the query-key relationship. Since computing Chatterjee's correlation coefficient involves sorting and ranking operations, we introduce a differentiable approximation employing SoftSort and SoftRank. We integrate our proposed mechanism, "XicorAttention," into several state-of-the-art Transformer models. Experimental results on real-world datasets demonstrate that incorporating nonlinear correlation into the attention improves forecasting accuracy by up to approximately 9.1% compared to existing models.
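The hard (non-differentiable) coefficient that SoftSort and SoftRank relax has a simple closed form in the no-ties case: sort the pairs by x, take the ranks r_i of the corresponding y values, and compute xi = 1 - 3 * sum|r_{i+1} - r_i| / (n^2 - 1). A minimal NumPy sketch (the function name is illustrative, not from the paper):

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's rank correlation coefficient (no-ties form).
    Sorts the sample by x, ranks the reordered y values, and measures how
    smoothly those ranks vary; values near 1 indicate y is close to a
    (possibly nonlinear) function of x, values near 0 indicate independence."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    y_sorted = y[np.argsort(x)]                      # reorder y by increasing x
    r = np.argsort(np.argsort(y_sorted)) + 1         # ranks 1..n of reordered y
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)
```

Note that for a perfectly monotone sample of size n the coefficient is (n - 2)/(n + 1), approaching 1 only as n grows, and that it measures functional dependence rather than its sign: an exactly decreasing relationship scores the same as an increasing one. The sorting and double-argsort ranking are the operations the paper replaces with SoftSort/SoftRank to make the attention scores differentiable.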


Deep greedy unfolding: Sorting out argsorting in greedy sparse recovery algorithms

Mohammad-Taheri, Sina, Colbrook, Matthew J., Brugiapaglia, Simone

arXiv.org Artificial Intelligence

Gradient-based learning requires (deep) neural networks to be differentiable at every step. This includes model-based architectures constructed by unrolling iterations of an iterative algorithm onto layers of a neural network, known as algorithm unrolling. However, greedy sparse recovery algorithms depend on the non-differentiable argsort operator, which hinders their integration into neural networks. In this paper, we address this challenge in Orthogonal Matching Pursuit (OMP) and Iterative Hard Thresholding (IHT), two popular representative algorithms in this class. We propose permutation-based variants of these algorithms and approximate permutation matrices using "soft" permutation matrices derived from softsort, a continuous relaxation of argsort. We demonstrate, both theoretically and numerically, that Soft-OMP and Soft-IHT, as differentiable counterparts of OMP and IHT that are fully compatible with neural network training, effectively approximate these algorithms with a controllable degree of accuracy. This leads to the development of OMP- and IHT-Net, fully trainable network architectures based on Soft-OMP and Soft-IHT, respectively. Finally, by choosing weights as "structure-aware" trainable parameters, we connect our approach to structured sparse recovery and demonstrate its ability to extract latent sparsity patterns from data.
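The softsort relaxation the abstract relies on replaces the hard sorting permutation with a row-stochastic matrix obtained by a temperature-controlled softmax over pairwise distances (following SoftSort by Prillo and Eisenschlos, 2020; this standalone sketch is not the paper's implementation):

```python
import numpy as np

def softsort(s, tau=1.0):
    """SoftSort: continuous relaxation of the argsort/sorting permutation.
    Returns a row-stochastic matrix P whose rows are softmaxes of the
    negative absolute distances between sorted and unsorted entries;
    as tau -> 0, P approaches the permutation matrix that sorts s in
    decreasing order, so P @ s approaches sort(s) descending."""
    s = np.asarray(s, dtype=float)
    sorted_s = np.sort(s)[::-1]                          # target: decreasing order
    logits = -np.abs(sorted_s[:, None] - s[None, :]) / tau
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)              # row-wise softmax
```

Every operation here is differentiable in s, so gradients can flow through the (soft) support selection inside unrolled OMP/IHT iterations; the temperature tau controls the trade-off between smoothness and fidelity to the hard argsort.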