
Antipodes of Label Differential Privacy: PATE and ALIBI

Neural Information Processing Systems

A prominent example of label-only privacy is in online advertising, where the goal is to predict conversion of an ad impression (the label) given a user's profile and the spot's context (the features).


The Impact of Positional Encoding on Length Generalization in Transformers

Neural Information Processing Systems

Length generalization, the ability to generalize from small training context sizes to larger ones, is a critical challenge in the development of Transformer-based language models. Positional encoding (PE) has been identified as a major factor influencing length generalization, but the exact impact of different PE schemes on extrapolation in downstream tasks remains unclear. In this paper, we conduct a systematic empirical study comparing the length generalization performance of decoder-only Transformers with five different position encoding approaches including Absolute Position Embedding (APE), T5's Relative PE, ALiBi, and Rotary, in addition to Transformers without positional encoding (NoPE). Our evaluation encompasses a battery of reasoning and mathematical tasks. Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks. More importantly, NoPE outperforms other explicit positional encoding methods while requiring no additional computation. We theoretically demonstrate that NoPE can represent both absolute and relative PEs, but when trained with SGD, it mostly resembles T5's relative PE attention patterns. Finally, we find that scratchpad is not always helpful to solve length generalization and its format highly impacts the model's performance. Overall, our work suggests that explicit position embeddings are not essential for decoder-only Transformers to generalize well to longer sequences.
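To make the ALiBi scheme discussed above concrete, here is a minimal NumPy sketch of its causal attention bias, following the head-specific geometric slopes of Press et al. (2022); the function names are ours, not from any library:

```python
import numpy as np

def alibi_slopes(n_heads):
    # Geometric sequence of per-head slopes: for 8 heads this gives
    # 1/2, 1/4, ..., 1/256; generalized via powers of 2^(-8/n_heads).
    start = 2.0 ** (-8.0 / n_heads)
    return np.array([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads, seq_len):
    # Causal linear bias added to attention logits: head h contributes
    # -m_h * (i - j) when query i attends to key j <= i.
    slopes = alibi_slopes(n_heads)        # shape (H,)
    i = np.arange(seq_len)[:, None]       # query positions
    j = np.arange(seq_len)[None, :]       # key positions
    dist = np.maximum(i - j, 0)           # only past positions are penalized
    return -slopes[:, None, None] * dist  # shape (H, L, L)
```

The bias depends only on relative distance, which is why ALiBi needs no learned position parameters; whether that property helps or hurts extrapolation on downstream tasks is exactly what the study above probes.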


Towards Personalized Treatment Plan: Geometrical Model-Agnostic Approach to Counterfactual Explanations

Sin, Daniel, Toutounchian, Milad

arXiv.org Machine Learning

In our article, we describe a method for generating counterfactual explanations in high-dimensional spaces using four steps: fitting our dataset to a model, finding the decision boundary, determining constraints on the problem, and computing the closest point (the counterfactual explanation) from that boundary. We propose a discretized approach in which we find many discrete points on the boundary and then identify the closest feasible counterfactual explanation. This method, which we call $\textit{Segmented Sampling for Boundary Approximation}$ (SSBA), applies binary search to find decision boundary points and then searches for the closest boundary point. Across four datasets of varying dimensionality, we show that our method can outperform current methods for counterfactual generation, with reductions in distance of $5\%$ to $50\%$ in terms of the $L_2$ norm. Our method can also handle real-world constraints by restricting changes to immutable and categorical features, such as age, gender, sex, height, and other related characteristics, as in the case of a health-based dataset. In terms of runtime, the SSBA algorithm generates multiple orders of magnitude more decision boundary points in the same amount of time than a grid-based approach. In general, our method provides a simple and effective model-agnostic way to compute the nearest feasible (i.e., realistic under constraints) counterfactual explanations. All of our results and code are available at: https://github.com/dsin85691/SSBA_For_Counterfactuals
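The binary-search step described above can be sketched in a few lines. This is a minimal, hedged illustration of the general idea (the function names and the candidate-sampling strategy are our assumptions, not the paper's exact SSBA procedure):

```python
import numpy as np

def boundary_point(model, x_pos, x_neg, tol=1e-6):
    """Binary search along the segment between two points that the
    model classifies differently; returns a point near the decision
    boundary. `model` is any callable returning class labels, so the
    procedure is model-agnostic."""
    lo, hi = 0.0, 1.0
    target = model(x_pos)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        x_mid = (1 - mid) * x_pos + mid * x_neg
        if model(x_mid) == target:
            lo = mid  # still on the original side; move toward x_neg
        else:
            hi = mid
    return (1 - hi) * x_pos + hi * x_neg

def closest_counterfactual(model, x, candidates):
    """Generate one boundary point per opposite-class candidate and
    return the boundary point closest to x in the L2 norm."""
    pts = [boundary_point(model, x, c) for c in candidates
           if model(c) != model(x)]
    return min(pts, key=lambda p: np.linalg.norm(p - x))
```

Each binary search needs only O(log(1/tol)) model evaluations per candidate, which is the source of the runtime advantage over densely probing a grid.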


Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

Bianchessi, Arthur S., Aguirre, Yasmin C., Barros, Rodrigo C., Kupssinskü, Lucas S.

arXiv.org Artificial Intelligence

Effective PE is vital, particularly for enabling LMs trained on shorter contexts to generalize to significantly longer sequences during inference, a desirable capability known as context length extrapolation. Several PE methods have been proposed to facilitate context length extrapolation, including Sinusoidal embeddings (Vaswani, 2017), RoPE (Su et al., 2024), ALiBi (Press et al., 2022), and even the omission of positional encoding altogether. We introduce a Bayesian attention mechanism, hereby called BAM. When the scoring function of the attention mechanism is additive, the positional dependency is trivially modeled by a scalar Z. With Theorem 1, we can frame positional encodings as priors of BAM: Lemma 2 shows that ALiBi is a special case of a BAM prior, and Lemma 3 shows that ALiBi becomes local attention as the relative distance |j - i| increases (see Appendix B.1, B.2, and B.3). We call the resulting new PE method GGD-BAM.
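As a hedged sketch of the additive-score setting above (the notation is assumed for illustration, not taken verbatim from the paper), ALiBi adds a head-specific linear penalty with slope $m$ to the pre-softmax attention score between query position $i$ and key position $j \le i$:

```latex
\mathrm{score}(i, j) \;=\; \frac{q_i^{\top} k_j}{\sqrt{d}} \;-\; m\,(i - j), \qquad j \le i,
```

so the post-softmax weight on key $j$ is proportional to $e^{-m(i-j)}$ times the content term. This exponential decay in relative distance is the intuition behind the local-attention limit stated in Lemma 3.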



37ecd27608480aa3569a511a638ca74f-Supplemental.pdf

Neural Information Processing Systems

Tables 3 and 4 summarize hyperparameters for PATE-FM and ALIBI, respectively; Table 3 lists PATE-FM (Algorithms 1 and 2) hyperparameters for select accuracy levels. By repeating the distinguishing game multiple times, we can estimate the adversary's success rate and convert it into a privacy bound; the probability is taken over the bit b, the randomness of the mechanism M, and the algorithm A (Theorem B.1). It remains to bound the adversary's correct guessing rate: using "canaries", we can compute a lower bound on the adversary's success rate, and we can improve the tightness of this bound further. The adversary simply looks at the model's confidence (Game 3).
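The conversion step mentioned above, from an empirical guessing rate in the distinguishing game to a privacy lower bound, can be sketched as follows. This is a generic illustration for a pure-DP mechanism, not the paper's (tighter) bound, and the function name is ours:

```python
import math

def eps_lower_bound(correct, trials):
    """For an eps-DP mechanism, the adversary's probability of guessing
    the bit b in the distinguishing game is at most e^eps / (1 + e^eps).
    Inverting this, an empirical success rate p implies
    eps >= log(p / (1 - p))."""
    p = correct / trials
    return math.log(p / (1 - p))
```

For example, an adversary that guesses correctly in 75% of repeated games certifies eps >= log(3); sampling error in the estimate of p should be handled with a confidence interval before inverting.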



Evaluation of Coding Schemes for Transformer-based Gene Sequence Modeling

Gong, Chenlei, Tian, Yuanhe, Mao, Lei, Song, Yan

arXiv.org Artificial Intelligence

Currently, many studies view DNA sequences as a special type of language and utilize Transformers to model them. These studies use fixed-length k-mer segmentation and BPE subword tokenization, but lack a systematic evaluation to determine which is superior. We compare k-mer segmentation with k=1,3,4,5,6, a 4,096-token BPE vocabulary, and three positional encoding methods--sinusoidal, ALiBi, and RoPE. Each configuration is trained from scratch in 3-, 6-, 12-, and 24-layer Transformer encoders and evaluated on the GUE benchmark. In general, BPE delivers higher and more stable performance across tasks by compressing frequent motifs into variable-length tokens, reducing sequence length, and improving model generalization. RoPE excels at capturing periodic motifs and extrapolating to long sequences, while ALiBi also performs well on tasks driven by local dependencies. In terms of depth, we observe significant gains when increasing layers from 3 to 12, with only marginal improvements or slight overfitting at 24 layers. This study provides practical guidance for designing tokenization and positional encoding in DNA Transformer models.
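The fixed-length k-mer segmentation compared above is simple enough to sketch directly; this minimal version (our own helper, not the paper's code) covers both the non-overlapping scheme and the overlapping variant via a stride parameter:

```python
def kmer_tokenize(seq, k=3, stride=None):
    """Segment a DNA sequence into k-mers. With stride == k (the
    default) the k-mers are non-overlapping fixed-length tokens;
    with stride < k they overlap, trading longer token sequences
    for denser local context."""
    stride = stride or k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]
```

BPE, by contrast, learns variable-length tokens from motif frequencies, which is why it shortens sequences on motif-rich genomic data; the study above quantifies that trade-off across model depths.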