Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time

Li, Huihan, Chen, You, Wang, Siyuan, He, Yixin, Mehrabi, Ninareh, Gupta, Rahul, Ren, Xiang

arXiv.org Artificial Intelligence

Large Language Models (LLMs) perform well on reasoning benchmarks but often fail when inputs are altered slightly, raising concerns about the extent to which their success relies on memorization. This issue is especially acute in Chain-of-Thought (CoT) reasoning, where spurious memorized patterns can trigger intermediate errors that cascade into incorrect final answers. We introduce STIM, a novel framework for Source-aware Token-level Identification of Memorization, which attributes each token in a reasoning chain to one of multiple memorization sources - local, mid-range, or long-range - based on its statistical co-occurrence with the pretraining corpus. Our token-level analysis across tasks and distributional settings reveals that models rely more on memorization in complex or long-tail cases, and that local memorization is often the dominant driver of errors, accounting for up to 67% of wrong tokens. We also show that memorization scores from STIM can effectively predict wrong tokens within erroneous reasoning steps. STIM offers a powerful tool for diagnosing and improving model reasoning and can generalize to other structured step-wise generation tasks.
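As a rough illustration of the local-memorization idea (a toy stand-in, not the authors' actual STIM scoring), each generated token can be scored by how predictable it is from its immediate n-gram context under simple corpus counts; the function name and toy corpus below are hypothetical:

```python
from collections import Counter

def local_memorization_scores(tokens, corpus_tokens, n=2):
    """Score each token by P(token | preceding n-1 tokens) estimated
    from raw n-gram counts over a (pretraining) corpus."""
    ngrams = Counter(tuple(corpus_tokens[i:i + n])
                     for i in range(len(corpus_tokens) - n + 1))
    contexts = Counter(tuple(corpus_tokens[i:i + n - 1])
                       for i in range(len(corpus_tokens) - n + 2))
    scores = []
    for i, tok in enumerate(tokens):
        if i < n - 1:          # not enough local context yet
            scores.append(0.0)
            continue
        ctx = tuple(tokens[i - n + 1:i])
        # High score = token is highly predictable from its local context.
        scores.append(ngrams[ctx + (tok,)] / contexts[ctx] if contexts[ctx] else 0.0)
    return scores

corpus = "the cat sat on the mat the cat sat on the rug".split()
print(local_memorization_scores("the cat sat".split(), corpus))  # [0.0, 0.5, 1.0]
```

Extending the context window (larger n, or scoring against the full prompt) would correspond loosely to the mid-range and long-range sources.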


Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts

Altinok, Duygu

arXiv.org Artificial Intelligence

ASR systems often struggle to maintain syntactic and semantic accuracy in long audio transcripts, impacting tasks like Named Entity Recognition (NER), capitalization, and punctuation. We propose a novel approach that enhances ASR by distilling contextual knowledge from LLaMA models into Whisper. Our method uses two strategies: (1) token-level distillation with optimal transport to align dimensions and sequence lengths, and (2) representation loss minimization between sentence embeddings of Whisper and LLaMA, blending syntax and semantics. Evaluations on the Spoken Wikipedia dataset, a benchmark with long audio recordings and rich entities, demonstrate significant improvements in Word Error Rate (WER), NER, capitalization, and punctuation success. By introducing novel NER metrics and exploring semantics-aware ASR, our work highlights the value of integrating linguistic context into transcription, setting a foundation for robust, context-aware ASR in long-form speech.
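A minimal sketch of the optimal-transport alignment idea, assuming entropy-regularized Sinkhorn iterations between uniform marginals; the function and the toy student/teacher hidden states are hypothetical, not the paper's implementation:

```python
import numpy as np

def sinkhorn_plan(cost, reg=1.0, n_iters=200):
    """Entropy-regularized optimal transport plan between uniform marginals."""
    m, n = cost.shape
    a, b = np.ones(m) / m, np.ones(n) / n   # uniform weight per token
    K = np.exp(-cost / reg)
    v = np.ones(n)
    for _ in range(n_iters):                # Sinkhorn scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]      # plan of shape (m, n)

# Hypothetical student (3 tokens) and teacher (5 tokens) hidden states
# with the same feature dimension but different sequence lengths.
rng = np.random.default_rng(0)
student = rng.normal(size=(3, 4))
teacher = rng.normal(size=(5, 4))

# Pairwise squared-distance cost between the two token sequences.
cost = ((student[:, None, :] - teacher[None, :, :]) ** 2).sum(-1)
P = sinkhorn_plan(cost)
ot_distill_loss = (P * cost).sum()  # transport cost as a distillation loss
```

The soft plan P lets a loss be defined even when student and teacher sequences (or dimensions, after a projection) do not match one-to-one.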


Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B

Bakalova, Aleksandra, Veitsman, Yana, Huang, Xinting, Hahn, Michael

arXiv.org Artificial Intelligence

In-Context Learning (ICL) is an intriguing ability of large language models (LLMs). Despite a substantial amount of work on its behavioral aspects and how it emerges in miniature setups, it remains unclear which mechanism assembles task information from the individual examples in a few-shot prompt. We use causal interventions to identify information flow in Gemma-2 2B for five naturalistic ICL tasks. We find that the model infers task information using a two-step strategy we call contextualize-then-aggregate: in the lower layers, the model builds up representations of individual few-shot examples, which are contextualized by preceding examples through connections between few-shot input and output tokens across the sequence. In the higher layers, these representations are aggregated to identify the task and prepare the prediction of the next output. The importance of the contextualization step differs between tasks, and it may become more important in the presence of ambiguous examples. Overall, by providing rigorous causal analysis, our results shed light on the mechanisms through which ICL happens in language models.


Improve LLM-based Automatic Essay Scoring with Linguistic Features

Hou, Zhaoyi Joey, Ciuba, Alejandro, Li, Xiang Lorraine

arXiv.org Artificial Intelligence

Automatic Essay Scoring (AES) assigns scores to student essays, reducing the grading workload for instructors. Developing a scoring system capable of handling essays across diverse prompts is challenging due to the flexible and diverse nature of the writing task. Existing methods typically fall into two categories: supervised feature-based approaches and large language model (LLM)-based methods. Supervised feature-based approaches often achieve higher performance but require resource-intensive training. In contrast, LLM-based methods are computationally efficient during inference but tend to suffer from lower performance. This paper combines these approaches by incorporating linguistic features into LLM-based scoring. Experimental results show that this hybrid method outperforms baseline models for both in-domain and out-of-domain writing prompts.
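One plausible way to incorporate linguistic features into LLM-based scoring is to compute them per essay and serialize them into the scoring prompt; the feature set and prompt template below are illustrative assumptions, not the paper's exact method:

```python
def essay_features(text):
    """A few simple, hypothetical linguistic features of an essay."""
    words = text.split()
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    return {
        "num_words": len(words),
        "num_sentences": len(sentences),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len({w.lower() for w in words}) / max(len(words), 1),
    }

def scoring_prompt(essay):
    """Embed the computed features into an LLM scoring prompt."""
    feats = essay_features(essay)
    feat_str = ", ".join(f"{k}={v:.2f}" for k, v in feats.items())
    return (f"Score the following essay from 1 to 6.\n"
            f"Linguistic features: {feat_str}\n"
            f"Essay: {essay}\nScore:")

print(scoring_prompt("I like cats. Cats are nice."))
```

In practice the features would come from established readability and cohesion toolkits rather than these toy counts.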


Statistical Arbitrage in Rank Space

Li, Y. -F., Papanicolaou, G.

arXiv.org Machine Learning

In equity markets, stocks are conventionally labeled by equity indices (company names). By relabeling stocks according to their ranks in capitalization, rather than their equity indices (company names), a different, more stable market structure can emerge. Specifically, we gain a different perspective on market dynamics by focusing on the stock that occupies a certain rank in capitalization while the corresponding company name may change. We refer to a market labeled by equity indices (company names) as a market in name space and one labeled by ranks in capitalization as a market in rank space. The market in rank space was explored by Fernholz et al., who observed a stable distribution of capitalization across different ranks in the U.S. equity market over different time periods [11,16]. They further introduced an explanatory hybrid-Atlas model under stochastic portfolio theory, a framework that enables analyzing portfolios in rank space [5,15]. Empirically, B. Healy et al. analyzed U.S. equity data and showed that the market in rank space is driven by a dominant single factor [14], in contrast to the multi-factor-driven market in name space [9,10,19]. While the primary market factor in rank space has been extensively studied, the residual returns - those not explained by this primary factor in stock returns - remain largely unexplored.
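The name-space-to-rank-space relabeling is mechanical enough to sketch directly; the toy capitalization matrix below is hypothetical:

```python
import numpy as np

# Toy capitalizations for 3 companies ("name space") over 4 days:
# row = day, column = a fixed company.
caps = np.array([
    [10.0, 5.0, 2.0],
    [ 4.0, 9.0, 2.5],
    [ 3.5, 8.0, 6.0],
    [ 7.0, 6.5, 3.0],
])

# Rank space: column k now holds whichever company has rank k that day.
rank_caps = -np.sort(-caps, axis=1)      # capitalizations sorted descending per day
occupants = np.argsort(-caps, axis=1)    # which company occupies each rank

print(rank_caps[:, 0])   # capitalization of the rank-1 stock each day
print(occupants[:, 0])   # the company holding rank 1 changes day to day
```

Note that the rank-1 capitalization series (10, 9, 8, 7) evolves smoothly even though the occupying company switches between days, illustrating the stability of the rank-space view.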


Longer is (Not Necessarily) Stronger: Punctuated Long-Sequence Training for Enhanced Speech Recognition and Translation

Koluguri, Nithin Rao, Bartley, Travis, Xu, Hainan, Hrinchuk, Oleksii, Balam, Jagadeesh, Ginsburg, Boris, Kucsko, Georg

arXiv.org Artificial Intelligence

This paper presents a new method for training sequence-to-sequence models for speech recognition and translation tasks. Instead of the traditional approach of training models on short segments containing only lowercase or partial punctuation and capitalization (PnC) sentences, we propose training on longer utterances that include complete sentences with proper punctuation and capitalization. We achieve this by using the FastConformer architecture, which allows training 1-billion-parameter models on sequences up to 60 seconds long with full attention. However, while training with PnC enhances the overall performance, we observed that accuracy plateaus when training on sequences longer than 40 seconds across various evaluation settings. Our proposed method significantly improves punctuation and capitalization accuracy, showing a 25% relative word error rate (WER) improvement on the Earnings-21 and Earnings-22 benchmarks. Additionally, training on longer audio segments increases the overall model accuracy across speech recognition and translation benchmarks. The model weights and training code are open-sourced through NVIDIA NeMo.


Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition

Debenedetti, Edoardo, Rando, Javier, Paleka, Daniel, Florin, Silaghi Fineas, Albastroiu, Dragos, Cohen, Niv, Lemberg, Yuval, Ghosh, Reshmi, Wen, Rui, Salem, Ahmed, Cherubin, Giovanni, Zanella-Beguelin, Santiago, Schmid, Robin, Klemm, Victor, Miki, Takahiro, Li, Chenhao, Kraft, Stefan, Fritz, Mario, Tramèr, Florian, Abdelnabi, Sahar, Schönherr, Lea

arXiv.org Artificial Intelligence

Large language model systems face important security risks from maliciously crafted messages that aim to overwrite the system's original instructions or leak private data. To study this problem, we organized a capture-the-flag competition at IEEE SaTML 2024, where the flag is a secret string in the LLM system prompt. The competition was organized in two phases. In the first phase, teams developed defenses to prevent the model from leaking the secret. In the second phase, teams were challenged to extract the secrets hidden by the defenses proposed by other teams. This report summarizes the main insights from the competition. Notably, we found that all defenses were bypassed at least once, highlighting the difficulty of designing a successful defense and the necessity for additional research to protect LLM systems. To foster future research in this direction, we compiled a dataset with over 137k multi-turn attack chats and open-sourced the platform.


Detecting Text Formality: A Study of Text Classification Approaches

Dementieva, Daryna, Babakov, Nikolay, Panchenko, Alexander

arXiv.org Artificial Intelligence

Formality is one of the important characteristics of text documents. Automatic detection of the formality level of a text is potentially beneficial for various natural language processing tasks. Previously, two large-scale datasets featuring formality annotations were introduced for multiple languages -- GYAFC and X-FORMAL. However, they were primarily used for the training of style transfer models. At the same time, detection of text formality on its own may also be a useful application. This work proposes the first, to our knowledge, systematic study of formality detection methods based on statistical, neural, and Transformer-based machine learning approaches, and delivers the best-performing models for public use. We conducted three types of experiments -- monolingual, multilingual, and cross-lingual. The study shows that a Char BiLSTM model outperforms Transformer-based ones on the monolingual and multilingual formality classification tasks, while Transformer-based classifiers are more robust under cross-lingual knowledge transfer.


Text Injection for Capitalization and Turn-Taking Prediction in Speech Models

Bijwadia, Shaan, Chang, Shuo-yiin, Wang, Weiran, Meng, Zhong, Zhang, Hao, Sainath, Tara N.

arXiv.org Artificial Intelligence

Text injection for automatic speech recognition (ASR), wherein unpaired text-only data is used to supplement paired audio-text data, has shown promising improvements for word error rate. This study examines the use of text injection for auxiliary tasks, which are the non-ASR tasks often performed by an E2E model. In this work, we use joint end-to-end and internal language model training (JEIT) as our text injection algorithm to train an ASR model which performs two auxiliary tasks. The first is capitalization, which is a de-normalization task. The second is turn-taking prediction, which attempts to identify whether a user has completed their conversation turn in a digital assistant interaction. We show results demonstrating that our text injection method boosts capitalization performance for long-tail data, and improves turn-taking detection recall.


Measuring Inductive Biases of In-Context Learning with Underspecified Demonstrations

Si, Chenglei, Friedman, Dan, Joshi, Nitish, Feng, Shi, Chen, Danqi, He, He

arXiv.org Artificial Intelligence

In-context learning (ICL) is an important paradigm for adapting large language models (LLMs) to new tasks, but the generalization behavior of ICL remains poorly understood. We investigate the inductive biases of ICL from the perspective of feature bias: which feature ICL is more likely to use given a set of underspecified demonstrations in which two features are equally predictive of the labels. First, we characterize the feature biases of GPT-3 models by constructing underspecified demonstrations from a range of NLP datasets and feature combinations. We find that LLMs exhibit clear feature biases - for example, demonstrating a strong bias to predict labels according to sentiment rather than shallow lexical features, like punctuation. Second, we evaluate the effect of different interventions that are designed to impose an inductive bias in favor of a particular feature, such as adding a natural language instruction or using semantically relevant label words. We find that, while many interventions can influence the learner to prefer a particular feature, it can be difficult to overcome strong prior biases. Overall, our results provide a broader picture of the types of features that ICL may be more likely to exploit and how to impose inductive biases that are better aligned with the intended task.
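The construction of underspecified demonstrations can be made concrete with a toy sentiment-vs-punctuation example; the data and feature choices below are illustrative assumptions, not the paper's datasets:

```python
# Underspecified demonstrations: sentiment and a shallow punctuation
# feature (trailing "!") are perfectly correlated with the label, so
# the demos alone cannot reveal which feature a learner relies on.
demos = [
    ("great movie !", "positive"),   # positive sentiment AND "!"
    ("loved it !",    "positive"),
    ("awful plot .",  "negative"),   # negative sentiment AND "."
    ("boring film .", "negative"),
]
# Sanity check: both features predict the demo labels perfectly.
assert all((lbl == "positive") == text.endswith("!") for text, lbl in demos)

# Disambiguation probes: the two features now disagree, so the model's
# prediction reveals which feature it actually used.
probes = [
    ("wonderful cast .", {"sentiment": "positive", "punctuation": "negative"}),
    ("terrible pacing !", {"sentiment": "negative", "punctuation": "positive"}),
]
```

Interventions such as a natural-language instruction ("label the sentiment") or sentiment-laden label words would then be compared by how often probe predictions follow the intended feature.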