text input
What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs
Given an input image, and nothing else, our method returns the bounding boxes of objects in the image and phrases that describe the objects. This is achieved within an open world paradigm, in which the objects in the input image may not have been encountered during the training of the localization mechanism. Moreover, training takes place in a weakly supervised setting, where no bounding boxes are provided. To achieve this, our method combines two pre-trained networks: the CLIP image-to-text matching score and the BLIP image captioning tool. Training takes place on COCO images and their captions and is based on CLIP. Then, during inference, BLIP is used to generate a hypothesis regarding various regions of the current image. Our work generalizes weakly supervised segmentation and phrase grounding and is shown empirically to outperform the state of the art in both domains. It also shows very convincing results in the novel task of weakly-supervised open-world purely visual phrase-grounding presented in our work. For example, on the datasets used for benchmarking phrase-grounding, our method results in a very modest degradation in comparison to methods that employ human captions as an additional input.
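As a rough illustration of the CLIP image-to-text matching score mentioned above: in the standard CLIP formulation, the score is a cosine similarity between an image embedding and a text embedding. The sketch below uses toy vectors in place of real CLIP embeddings, and the helper names (`cosine_similarity`, `best_phrase_per_region`) are hypothetical. It is not the authors' localization method, only the scoring idea applied to candidate phrases per region.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_phrase_per_region(region_embeddings, phrase_embeddings):
    """Pick, for each region, the candidate phrase with the highest similarity.

    region_embeddings: dict mapping a bounding box tuple to an embedding vector.
    phrase_embeddings: dict mapping a candidate phrase to an embedding vector.
    Both are assumed (hypothetically) to live in the same embedding space.
    """
    results = {}
    for box, region_vec in region_embeddings.items():
        scored = (
            (phrase, cosine_similarity(region_vec, phrase_vec))
            for phrase, phrase_vec in phrase_embeddings.items()
        )
        results[box] = max(scored, key=lambda pair: pair[1])
    return results


# Toy data standing in for real image-region and caption embeddings.
regions = {(0, 0, 10, 10): [1.0, 0.0], (5, 5, 20, 20): [0.0, 1.0]}
phrases = {"a dog": [0.9, 0.1], "a frisbee": [0.1, 0.9]}
matches = best_phrase_per_region(regions, phrases)
```

In a real pipeline the candidate phrases would come from a captioning model and the vectors from an image-text encoder; here both are placeholders.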
Multimodal ML: Quantifying the Improvement of Calorie Estimation Through Image-Text Pairs
Various developed countries have experienced a continuous rise in obesity [1]. This can be burdensome on healthcare and has thus prompted governments to introduce laws and regulations on restaurants and food chains in an attempt to promote healthier eating [1]. One patent rule is the mandatory display of calories in menus, which has not come without an added cost to businesses. For example, Section 4205 of the Affordable Care Act (ACA) in the USA required certain food chains to display caloric information on menu items, and compliance was estimated to cost $315 million [2]. These costs were primarily due to nutritional analysis; if one could drastically diminish this overhead, it would greatly lower business spending [2].
Cross-Modal Knowledge Distillation for Speech Large Language Models
Wang, Enzhi, Li, Qicheng, Tang, Zhiyuan, Jia, Yuhang
Abstract: In this work, we present the first systematic evaluation of catastrophic forgetting and modality inequivalence in speech large language models, showing that introducing speech capabilities can degrade knowledge and reasoning even when inputs remain textual, and performance further decreases with spoken queries. To address these challenges, we propose a cross-modal knowledge distillation framework that leverages both text-to-text and speech-to-text channels to transfer knowledge from a text-based teacher model to a speech LLM.
Index Terms: Speech LLMs, Cross-Modal Knowledge Distillation, Catastrophic Forgetting, Modality Inequivalence, Question Answering
1. Introduction
In recent years, large language models (LLMs) have made remarkable progress in multimodal capabilities, with voice interaction emerging as a key application direction. Cutting-edge models such as GPT-4o [1] already enable real-time spoken dialogue, providing users with more natural, flexible, and high-quality interaction experiences compared to traditional text-based systems. Building on this trend, many researchers have begun extending pretrained text LLMs into the speech domain, constructing large speech models with both speech understanding and generation abilities.
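The abstract above describes distilling from a text-based teacher through both a text-to-text and a speech-to-text channel, but does not state the loss. One common way to sketch such a setup, assuming a temperature-softened KL divergence per channel and a hypothetical mixing weight `speech_weight`, is shown below. The function names and the weighting scheme are illustrative assumptions, not the authors' formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_modal_kd_loss(teacher_logits, student_text_logits,
                        student_speech_logits, temperature=2.0,
                        speech_weight=0.5):
    """Weighted sum of two distillation terms (an assumed formulation).

    teacher_logits: teacher output for the text query.
    student_text_logits: student output for the same text query.
    student_speech_logits: student output for the spoken form of the query.
    """
    teacher = softmax(teacher_logits, temperature)
    text_term = kl_divergence(teacher, softmax(student_text_logits, temperature))
    speech_term = kl_divergence(teacher, softmax(student_speech_logits, temperature))
    return (1.0 - speech_weight) * text_term + speech_weight * speech_term


# Toy logits: the student matches the teacher on text but not on speech.
loss_matched = cross_modal_kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
loss_mismatch = cross_modal_kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

In this toy setup the loss is zero when the student reproduces the teacher on both channels and grows when the spoken-query output drifts away, which mirrors the modality gap the abstract describes.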
TextOnly: A Unified Function Portal for Text-Related Functions on Smartphones
Tu, Minghao, Yu, Chun, Shen, Xiyuan, Zheng, Zhi, Chen, Li, Shi, Yuanchun
Text boxes serve as portals to diverse functionalities in today's smartphone applications. However, when it comes to specific functionalities, users always need to navigate through multiple steps to access particular text boxes for input. We propose TextOnly, a unified function portal that enables users to access text-related functions from various applications by simply inputting text into a sole text box. For instance, entering a restaurant name could trigger a Google Maps search, while a greeting could initiate a conversation in WhatsApp. Although these raw text inputs are brief, they contain rich information, and TextOnly makes full use of them to interpret user intentions effectively. TextOnly integrates a large language model (LLM) and a BERT model. The LLM consistently provides general knowledge, while the BERT model can continuously learn user-specific preferences and enable quicker predictions. Real-world user studies demonstrated TextOnly's effectiveness with a top-1 accuracy of 71.35%, and its ability to continuously improve both its accuracy and inference speed. Participants perceived TextOnly as having satisfactory usability and expressed a preference for TextOnly over manual executions. Compared with voice assistants, TextOnly supports a greater range of text-related functions and allows for more concise inputs.
Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities
Geng, Jiahui, Tran, Thy Thy, Nakov, Preslav, Gurevych, Iryna
Existing attacks against multimodal language models (MLLMs) primarily communicate instructions through text accompanied by adversarial images. In contrast, we exploit the capabilities of MLLMs to interpret non-textual instructions, specifically, adversarial images or audio generated by our novel method, Con Instruction. We optimize these adversarial examples to align closely with target instructions in the embedding space, revealing the detrimental implications of MLLMs' sophisticated understanding. Unlike prior work, our method does not require training data or preprocessing of textual instructions. While these non-textual adversarial examples can effectively bypass MLLM safety mechanisms, their combination with various text inputs substantially amplifies attack success. We further introduce a new Attack Response Categorization (ARC) framework, which evaluates both the quality of the model's response and its relevance to the malicious instructions. Experimental results demonstrate that Con Instruction effectively bypasses safety mechanisms in multiple vision- and audio-language models, including LLaVA-v1.5, InternVL, Qwen-VL, and Qwen-Audio, evaluated on two standard benchmarks: AdvBench and SafeBench. Specifically, our method achieves the highest attack success rates, reaching 81.3% and 86.6% on LLaVA-v1.5 (13B). On the defense side, we explore various countermeasures against our attacks and uncover a substantial performance gap among existing techniques. Our implementation is made publicly available.