AITopics

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.40)

arXiv.org Artificial Intelligence, Nov-18-2025

Multimodal ML: Quantifying the Improvement of Calorie Estimation Through Image-Text Pairs

Narang, Arya

Various developed countries have experienced a continuous rise in obesity [1]. This burdens healthcare systems and has prompted governments to introduce laws and regulations for restaurants and food chains in an attempt to promote healthier eating [1]. One such rule is the mandatory display of calories on menus, which has not come without an added cost to businesses. For example, Section 4205 of the Affordable Care Act (ACA) in the USA required certain food chains to display caloric information for menu items, and compliance was estimated to cost $315 million [2]. These costs were primarily due to nutritional analysis; drastically reducing this overhead would greatly lower business spending [2].

artificial intelligence, dataset, machine learning, (16 more...)

2511.11705

Country: North America > United States (0.68)

Genre: Research Report (1.00)

Industry: Health & Medicine (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.71)

Neural Information Processing Systems, Oct-10-2025, 23:11:58 GMT

3a2e5889b4bbef997ddb13b55d5acf77-Paper-Conference.pdf

encoder, language model, pengi, (15 more...)

Country:

Asia > China > Beijing > Beijing (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > Canada (0.04)
Asia > India > Maharashtra > Mumbai (0.04)

Genre: Research Report > New Finding (0.93)

Industry:

Media > Music (0.46)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing Systems, Oct-9-2025, 08:59:27 GMT

d937cb3fe2851ed0ab9af5e38f885077-Supplemental-Conference.pdf

artificial intelligence, machine learning, natural language, (16 more...)

Country: Europe > Switzerland > Zürich > Zürich (0.15)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.75)
Information Technology > Artificial Intelligence > Natural Language (0.71)

arXiv.org Artificial Intelligence, Sep-19-2025

Cross-Modal Knowledge Distillation for Speech Large Language Models

Wang, Enzhi, Li, Qicheng, Tang, Zhiyuan, Jia, Yuhang

ABSTRACT: In this work, we present the first systematic evaluation of catastrophic forgetting and modality inequivalence in speech large language models, showing that introducing speech capabilities can degrade knowledge and reasoning even when inputs remain textual, and performance further decreases with spoken queries. To address these challenges, we propose a cross-modal knowledge distillation framework that leverages both text-to-text and speech-to-text channels to transfer knowledge from a text-based teacher model to a speech LLM.

Index Terms: Speech LLMs, Cross-Modal Knowledge Distillation, Catastrophic Forgetting, Modality Inequivalence, Question Answering

1. INTRODUCTION: In recent years, large language models (LLMs) have made remarkable progress in multimodal capabilities, with voice interaction emerging as a key application direction. Cutting-edge models such as GPT-4o [1] already enable real-time spoken dialogue, providing users with more natural, flexible, and high-quality interaction experiences compared to traditional text-based systems. Building on this trend, many researchers have begun extending pretrained text LLMs into the speech domain, constructing large speech models with both speech understanding and generation abilities.
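The abstract above names a cross-modal knowledge distillation framework with text-to-text and speech-to-text channels but does not state the loss. As a rough illustration only, the sketch below assumes a standard temperature-scaled KL distillation objective applied to each channel and a simple weighted sum of the two; the function names, the `speech_weight` parameter, and the combination rule are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Generic distillation term: KL between softened teacher and student outputs."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # The temperature**2 factor is the usual gradient-scale correction.
    return temperature ** 2 * kl_divergence(p, q)


def cross_modal_loss(teacher_logits, student_text_logits,
                     student_speech_logits, speech_weight=0.5):
    """Hypothetical combination of the two channels named in the abstract.

    text-to-text:   the student reads the same text as the teacher.
    speech-to-text: the student hears a spoken version of the query.
    """
    t2t = distillation_loss(teacher_logits, student_text_logits)
    s2t = distillation_loss(teacher_logits, student_speech_logits)
    return (1 - speech_weight) * t2t + speech_weight * s2t


teacher = [2.0, 1.0, 0.1]
student_text = [2.0, 1.0, 0.1]    # matches the teacher on text input
student_speech = [0.5, 1.5, 0.3]  # drifts from the teacher on spoken input
loss = cross_modal_loss(teacher, student_text, student_speech)
```

In this toy case the text channel contributes nothing because the student already matches the teacher there, so the total loss comes entirely from the speech channel, which mirrors the modality gap the abstract describes.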

arxiv preprint arxiv, large language model, natural language, (11 more...)

2509.1493

Country:

Asia > China (0.28)
Europe > Austria (0.28)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

arXiv.org Artificial Intelligence, Aug-26-2025

TextOnly: A Unified Function Portal for Text-Related Functions on Smartphones

Tu, Minghao, Yu, Chun, Shen, Xiyuan, Zheng, Zhi, Chen, Li, Shi, Yuanchun

Text boxes serve as portals to diverse functionalities in today's smartphone applications. However, to reach a specific functionality, users always need to navigate through multiple steps to access the particular text box for input. We propose TextOnly, a unified function portal that enables users to access text-related functions from various applications by simply entering text into a single text box. For instance, entering a restaurant name could trigger a Google Maps search, while a greeting could initiate a conversation in WhatsApp. Although these raw text inputs are brief, they contain rich information, and TextOnly makes full use of them to interpret user intentions effectively. TextOnly integrates large language models (LLMs) and a BERT model. The LLM consistently provides general knowledge, while the BERT model can continuously learn user-specific preferences and enable quicker predictions. Real-world user studies demonstrated TextOnly's effectiveness with a top-1 accuracy of 71.35%, and its ability to continuously improve both its accuracy and inference speed. Participants perceived TextOnly as having satisfactory usability and expressed a preference for TextOnly over manual execution. Compared with voice assistants, TextOnly supports a greater range of text-related functions and allows for more concise inputs.

artificial intelligence, large language model, natural language, (18 more...)

2508.16926

Country:

North America > United States (0.46)
Europe (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Questionnaire & Opinion Survey (1.00)

Industry: Information Technology > Services (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Communications > Mobile (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Neural Information Processing Systems, Aug-14-2025, 06:49:15 GMT

3a33ae4d634b49b0866b4142a1f82a2f-Paper-Conference.pdf

dataset, phrase sequence, shapecrafter, (9 more...)

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Utah (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)

Geng, Jiahui, Tran, Thy Thy, Nakov, Preslav, Gurevych, Iryna

Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities

arXiv.org Artificial Intelligence, Jun-3-2025

Existing attacks against multimodal language models (MLLMs) primarily communicate instructions through text accompanied by adversarial images. In contrast, we exploit the capabilities of MLLMs to interpret non-textual instructions, specifically, adversarial images or audio generated by our novel method, Con Instruction. We optimize these adversarial examples to align closely with target instructions in the embedding space, revealing the detrimental implications of MLLMs' sophisticated understanding. Unlike prior work, our method does not require training data or preprocessing of textual instructions. While these non-textual adversarial examples can effectively bypass MLLM safety mechanisms, their combination with various text inputs substantially amplifies attack success. We further introduce a new Attack Response Categorization (ARC) framework, which evaluates both the quality of the model's response and its relevance to the malicious instructions. Experimental results demonstrate that Con Instruction effectively bypasses safety mechanisms in multiple vision- and audio-language models, including LLaVA-v1.5, InternVL, Qwen-VL, and Qwen-Audio, evaluated on two standard benchmarks: AdvBench and SafeBench. Specifically, our method achieves the highest attack success rates, reaching 81.3% and 86.6% on LLaVA-v1.5 (13B). On the defense side, we explore various countermeasures against our attacks and uncover a substantial performance gap among existing techniques. Our implementation is made publicly available.

artificial intelligence, large language model, natural language, (16 more...)