input type
Happiness is Sharing a Vocabulary: A Study of Transliteration Methods
Jung, Haeji, Kim, Jinju, Kim, Kyungjin, Roh, Youjeong, Mortensen, David R.
Transliteration has emerged as a promising means to bridge the gap between various languages in multilingual NLP, showing promising results especially for languages using non-Latin scripts. We investigate the degree to which shared script, overlapping token vocabularies, and shared phonology contribute to performance of multilingual models. To this end, we conduct controlled experiments using three kinds of transliteration (romanization, phonemic transcription, and substitution ciphers) as well as orthography. We evaluate each model on two downstream tasks -- named entity recognition (NER) and natural language inference (NLI) -- and find that romanization significantly outperforms other input types in 7 out of 8 evaluation settings, largely consistent with our hypothesis that it is the most effective approach. We further analyze how each factor contributed to the success, and suggest that having longer (subword) tokens shared with pre-trained languages leads to better utilization of the model.
Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data
Lu, Yuxuan, Huang, Jing, Han, Yan, Yao, Bingsheng, Bei, Sisong, Gesi, Jiri, Xie, Yaochen, Zheshen, null, Wang, null, He, Qi, Wang, Dakuo
Recent research shows that LLM Agents can generate ``believable'' human behaviors via prompt-only methods, and such agents have been increasingly adopted in downstream applications. However, existing evaluation of these agents only focuses on qualitative believability (whether human raters think they are accurate), leaving open questions of whether LLM agents can accurately generate step-by-step actions mimicking a particular human's behavior in a multi-turn interaction task. In this work, we take shopping as a case study and present the first large-scale quantitative evaluation of state-of-the-art LLMs' ability to accurately simulate human behavior. Using real-world data from 31,865 online shopping sessions containing 230,965 user actions, our evaluation reveals that prompt-based LLMs (DeepSeek-R1, Llama, Claude) achieve only 11.86% accuracy in generating human actions, highlighting a substantial gap in actual behavioral accuracy. Through experiments, we also showcase that strategies as simple as fine-tuning LLMs on real human click-through data augmented with synthesized reasoning traces can greatly enhance models' performance. The fine-tuned Qwen2.5-7B achieves 17.26% action generation accuracy and 33.86% F1 score on final purchase prediction, representing substantial improvements of 5.4% and 13.85% over prompt-only baselines. This work establishes the first rigorous benchmark for human behavior simulation and provides actionable insights for developing more accurate LLM agents for future downstream applications.
An Exploratory Framework for Future SETI Applications: Detecting Generative Reactivity via Language Models
We present an exploratory framework to test whether noise-like input can induce structured responses in language models. Instead of assuming that extraterrestrial signals must be decoded, we evaluate whether inputs can trigger linguistic behavior in generative systems. This shifts the focus from decoding to viewing structured output as a sign of underlying regularity in the input. We tested GPT-2 small, a 117M-parameter model trained on English text, using four types of acoustic input: human speech, humpback whale vocalizations, Phylloscopus trochilus birdsong, and algorithmically generated white noise. All inputs were treated as noise-like, without any assumed symbolic encoding. To assess reactivity, we defined a composite score called Semantic Induction Potential (SIP), combining entropy, syntax coherence, compression gain, and repetition penalty. Results showed that whale and bird vocalizations had higher SIP scores than white noise, while human speech triggered only moderate responses. This suggests that language models may detect latent structure even in data without conventional semantics. We propose that this approach could complement traditional SETI methods, especially in cases where communicative intent is unknown. Generative reactivity may offer a different way to identify data worth closer attention.
On the reliability of feature attribution methods for speech classification
Shen, Gaofei, Mohebbi, Hosein, Bisazza, Arianna, Alishahi, Afra, Chrupała, Grzegorz
As the capabilities of large-scale pre-trained models evolve, understanding the determinants of their outputs becomes more important. Feature attribution aims to reveal which parts of the input elements contribute the most to model outputs. In speech processing, the unique characteristics of the input signal make the application of feature attribution methods challenging. We study how factors such as input type and aggregation and perturbation timespan impact the reliability of standard feature attribution methods, and how these factors interact with characteristics of each classification task. We find that standard approaches to feature attribution are generally unreliable when applied to the speech domain, with the exception of word-aligned perturbation methods when applied to word-based classification tasks.
Generative AI in Multimodal User Interfaces: Trends, Challenges, and Cross-Platform Adaptability
Bieniek, J., Rahouti, M., Verma, D. C.
As the boundaries of human computer interaction expand, Generative AI emerges as a key driver in reshaping user interfaces, introducing new possibilities for personalized, multimodal and cross-platform interactions. This integration reflects a growing demand for more adaptive and intuitive user interfaces that can accommodate diverse input types such as text, voice and video, and deliver seamless experiences across devices. This paper explores the integration of generative AI in modern user interfaces, examining historical developments and focusing on multimodal interaction, cross-platform adaptability and dynamic personalization. A central theme is the interface dilemma, which addresses the challenge of designing effective interactions for multimodal large language models, assessing the trade-offs between graphical, voice-based and immersive interfaces. The paper further evaluates lightweight frameworks tailored for mobile platforms, spotlighting the role of mobile hardware in enabling scalable multimodal AI. Technical and ethical challenges, including context retention, privacy concerns and balancing cloud and on-device processing are thoroughly examined. Finally, the paper outlines future directions such as emotionally adaptive interfaces, predictive AI driven user interfaces and real-time collaborative systems, underscoring generative AI's potential to redefine adaptive user-centric interfaces across platforms.
Multimodal Web Navigation with Instruction-Finetuned Foundation Models
Furuta, Hiroki, Lee, Kuang-Huei, Nachum, Ofir, Matsuo, Yutaka, Faust, Aleksandra, Gu, Shixiang Shane, Gur, Izzeddin
The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision encoder with temporal and local perception on a large corpus of demonstrations. We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning, outperforming prior works by a significant margin. On the MiniWoB, we improve over the previous best offline methods by more than 45.8%, even outperforming online-finetuned SoTA, humans, and GPT-4-based agent. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. Furthermore, WebGUM exhibits strong positive transfer to the real-world planning tasks on the Mind2Web. We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction. Web navigation is a class of sequential decision making problems where agents interact with web interfaces following user instructions (Shi et al., 2017; Liu et al., 2018; Gur et al., 2019). Common web navigation tasks include, for example, form filling (Diaz et al., 2013), information retrieval (Nogueira & Cho, 2016; Adolphs et al., 2022), or sending emails via a sequence of interactions with computer interface such as click or type (Figure 1). Recently, there has been a growing interest in developing agents to automate these actions and free humans from repetitive interactions (Mazumder & Riva, 2020; Li et al., 2020; Shvo et al., 2021). Most prior works studied web navigation problems as online RL to learn the optimal action distribution with task-specific models from scratch (Liu et al., 2018; Gur et al., 2019; Jia et al., 2019; Humphreys et al., 2022).
URET: Universal Robustness Evaluation Toolkit (for Evasion)
Eykholt, Kevin, Lee, Taesung, Schales, Douglas, Jang, Jiyong, Molloy, Ian, Zorin, Masha
Machine learning models are known to be vulnerable to adversarial evasion attacks as illustrated by image classification models. Thoroughly understanding such attacks is critical in order to ensure the safety and robustness of critical AI tasks. However, most evasion attacks are difficult to deploy against a majority of AI systems because they have focused on image domain with only few constraints. An image is composed of homogeneous, numerical, continuous, and independent features, unlike many other input types to AI systems used in practice. Furthermore, some input types include additional semantic and functional constraints that must be observed to generate realistic adversarial inputs. In this work, we propose a new framework to enable the generation of adversarial inputs irrespective of the input type and task domain. Given an input and a set of pre-defined input transformations, our framework discovers a sequence of transformations that result in a semantically correct and functional adversarial input. We demonstrate the generality of our approach on several diverse machine learning tasks with various input representations. We also show the importance of generating adversarial examples as they enable the deployment of mitigation techniques.
A Demand-Driven Perspective on Generative Audio AI
Oh, Sangshin, Kang, Minsung, Moon, Hyeongi, Choi, Keunwoo, Chon, Ben Sangbae
To achieve successful deployment of AI research, it is crucial to understand the demands of the industry. In this paper, we present the results of a survey conducted with professional audio engineers, in order to determine research priorities and define various research tasks. We also summarize the current challenges in audio quality and controllability based on the survey. Our analysis emphasizes that the availability of datasets is currently the main bottleneck for achieving high-quality audio generation. Finally, we suggest potential solutions for some revealed issues with empirical evidence.
Resolving Indirect Referring Expressions for Entity Selection
Hosseini, Mohammad Javad, Radlinski, Filip, Pareti, Silvia, Louis, Annie
Recent advances in language modeling have enabled new conversational systems. In particular, it is often desirable for people to make choices among specified options when using such systems. We address this problem of reference resolution, when people use natural expressions to choose between the entities. For example, given the choice `Should we make a Simnel cake or a Pandan cake?' a natural response from a dialog participant may be indirect: `let's make the green one'. Such natural expressions have been little studied for reference resolution. We argue that robustly understanding such language has large potential for improving naturalness in dialog, recommendation, and search systems. We create AltEntities (Alternative Entities), a new public dataset of 42K entity pairs and expressions (referring to one entity in the pair), and develop models for the disambiguation problem. Consisting of indirect referring expressions across three domains, our corpus enables for the first time the study of how language models can be adapted to this task. We find they achieve 82%-87% accuracy in realistic settings, which while reasonable also invites further advances.
Understanding HTML with Large Language Models
Gur, Izzeddin, Nachum, Ofir, Miao, Yingjie, Safdari, Mustafa, Huang, Austin, Chowdhery, Aakanksha, Narang, Sharan, Fiedel, Noah, Faust, Aleksandra
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks. Yet, their capabilities for HTML understanding -- i.e., parsing the raw HTML of a webpage, with applications to automation of web-based tasks, crawling, and browser-assisted retrieval -- have not been fully explored. We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks: (i) Semantic Classification of HTML elements, (ii) Description Generation for HTML inputs, and (iii) Autonomous Web Navigation of HTML pages. While previous work has developed dedicated architectures and training procedures for HTML understanding, we show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks. For instance, fine-tuned LLMs are 12% more accurate at semantic classification compared to models trained exclusively on the task dataset. Moreover, when fine-tuned on data from the MiniWoB benchmark, LLMs successfully complete 50% more tasks using 192x less data compared to the previous best supervised model. Out of the LLMs we evaluate, we show evidence that T5-based models are ideal due to their bidirectional encoder-decoder architecture. To promote further research on LLMs for HTML understanding, we create and open-source a large-scale HTML dataset distilled and auto-labeled from CommonCrawl.