Google Assistant will stick around a bit longer than expected for some Android users
The transition from Assistant to Gemini will continue into 2026. Google had planned to remove Assistant from most Android phones by the end of 2025 and replace it with Gemini, but the company has now announced that it needs more time to make its AI assistant the default digital helper for most of its users. Google said it is adjusting its previously announced timeline to ensure a seamless transition, and that updates converting Assistant to Gemini on Android devices will continue into next year. The company also said it will share more details in the coming months, so it's possible the transition will go past early 2026. Assistant's retirement was widely expected from the moment Google launched Gemini and began giving it Assistant's capabilities, such as the ability to control smart devices connected to your phone.
ImF: Implicit Fingerprint for Large Language Models
Wu, Jiaxuan, Peng, Wanli, Fu, Hang, Xue, Yiming, Wen, Juan
Training large language models (LLMs) is resource-intensive and expensive, making intellectual property (IP) protection essential. Most existing model fingerprint methods inject fingerprints into LLMs to protect model ownership. These methods create fingerprint pairs with weak semantic correlations, lacking the contextual coherence and semantic relatedness found in normal question-answer (QA) pairs in LLMs. In this paper, we propose a Generation Revision Intervention (GRI) attack that can effectively exploit this flaw to erase fingerprints, highlighting the need for more secure model fingerprint methods. We therefore propose a novel injected-fingerprint paradigm called Implicit Fingerprints (ImF). ImF constructs fingerprint pairs with strong semantic correlations, disguising them as natural QA pairs within LLMs. This ensures the fingerprints are consistent with normal model behavior, making them indistinguishable and robust against detection and removal. Our experiments on multiple LLMs demonstrate that ImF retains high verification success rates under adversarial conditions, offering a reliable solution for protecting LLM ownership.
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Information Technology > Security & Privacy (0.69)
- Government > Regional Government (0.68)
Why LLMs Cannot Think and How to Fix It
Jahrens, Marius, Martinetz, Thomas
This paper elucidates that current state-of-the-art Large Language Models (LLMs) are fundamentally incapable of making decisions or developing "thoughts" within the feature space due to their architectural constraints. We establish a definition of "thought" that encompasses traditional understandings of that term and adapt it for application to LLMs. We demonstrate that the architectural design and language modeling training methodology of contemporary LLMs inherently preclude them from engaging in genuine thought processes. Our primary focus is on this theoretical realization rather than practical insights derived from experimental data. Finally, we propose solutions to enable thought processes within the feature space and discuss the broader implications of these architectural modifications.
- Research Report (0.50)
- Instructional Material (0.40)
GRP: Goal-Reversed Prompting for Zero-Shot Evaluation with LLMs
Song, Mingyang, Zheng, Mao, Luo, Xuan
Using Large Language Models (LLMs) to evaluate and compare two answers from different models typically involves having LLM-based judges select the better answer. Humans, however, often approach problem-solving from the reverse perspective, for instance by choosing the worse option instead of the better one in a pairwise comparison. This kind of reverse thinking plays a crucial role in human reasoning and decision-making, and comparing the original and reversed thought processes can reveal differences between them. Motivated by this, we propose a Goal-Reversed Prompting (GRP) approach for pairwise evaluation that shifts the original task from selecting the better answer to choosing the worse one, encouraging LLMs to think in reverse by prompting them to identify the worse response. Experiments on closed-source models demonstrate that GRP significantly enhances evaluation capabilities, outperforming the prompt template with the original goal.
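The reversal is mechanically simple: ask the judge which answer is worse, then report the other one as better. A minimal sketch of the idea, where the prompt wording, function names, and the stand-in judge are illustrative assumptions rather than the paper's actual prompts:

```python
def goal_reversed_prompt(question, answer_a, answer_b):
    """Build a pairwise-evaluation prompt that asks the judge to pick the
    WORSE answer, reversing the usual "pick the better one" goal."""
    return (
        "You are comparing two answers to the same question.\n"
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Which answer is WORSE? Reply with exactly 'A' or 'B'."
    )

def judge_pair(llm, question, answer_a, answer_b):
    """Run the reversed judgment, then map it back: whichever answer was
    NOT picked as worse is reported as the better one."""
    worse = llm(goal_reversed_prompt(question, answer_a, answer_b)).strip()
    return "B" if worse == "A" else "A"

# Stand-in for a real LLM call, used here only to exercise the wiring.
toy_judge = lambda prompt: "A"
print(judge_pair(toy_judge, "What is 2+2?", "5", "4"))  # -> B
```

In practice `llm` would wrap a real model API call; the key design point is that the caller still receives a "better answer" verdict, so GRP can be swapped in behind an existing pairwise-evaluation interface.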
No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding
Krumdick, Michael, Lovering, Charles, Reddy, Varshini, Ebner, Seth, Tanner, Chris
LLM-as-a-Judge is a framework that uses a large language model (LLM) to evaluate the quality of natural language text, typically text that is itself generated by an LLM. The framework holds great promise due to its relatively low cost, ease of use, and strong correlations with human stylistic preferences. However, LLM judges have been shown to exhibit biases that can distort their judgments. We evaluate how well LLM judges can grade whether a given response to a conversational question is correct, an ability crucial to soundly estimating overall response quality. To do so, we create and publicly release a human-annotated dataset with correctness labels for 1,200 LLM responses. We source questions from a combination of existing datasets and a novel, challenging benchmark (BFF-Bench) created for this analysis. We demonstrate a strong connection between an LLM's ability to correctly answer a question and to grade responses to that question: although aggregate-level statistics might imply a judge has high agreement with human annotators, it will struggle on the subset of questions it could not answer itself. To address this issue, we recommend a simple solution: provide the judge with a correct, human-written reference answer. We perform an in-depth analysis of how reference quality affects the performance of an LLM judge, and show that providing a weaker judge (e.g. Qwen 2.5 7B) with higher-quality references yields better agreement with human annotators than a stronger judge (e.g. GPT-4o) with synthetic references.
- North America > United States > Massachusetts (0.14)
- North America > Mexico > Mexico City (0.14)
- Europe > Spain (0.14)
- Asia > Thailand (0.14)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
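The recommended fix, grounding the judge with a human-written reference answer, amounts to one extra field in the judge prompt. A minimal sketch under that assumption; the prompt text, function names, and the stub judge are illustrative, not the paper's actual setup:

```python
def grading_prompt(question, response, reference=None):
    """Ask a judge LLM whether `response` correctly answers `question`,
    optionally grounding it with a human-written reference answer."""
    parts = [
        "Decide whether the response correctly answers the question.",
        f"Question: {question}",
    ]
    if reference is not None:
        parts.append(f"Reference answer (human-written): {reference}")
    parts += [
        f"Response to grade: {response}",
        "Reply with exactly 'correct' or 'incorrect'.",
    ]
    return "\n".join(parts)

def grade(llm, question, response, reference=None):
    """Return True iff the judge deems the response correct."""
    verdict = llm(grading_prompt(question, response, reference))
    return verdict.strip().lower() == "correct"

# Stub judge for testing the wiring: calls the response correct only when a
# reference is present and appears verbatim in it (a real judge is an LLM call).
def stub_judge(prompt):
    lines = dict(l.split(": ", 1) for l in prompt.splitlines() if ": " in l)
    ref = lines.get("Reference answer (human-written)")
    return "correct" if ref and ref in lines["Response to grade"] else "incorrect"
```

Keeping `reference` optional makes the paper's comparison easy to reproduce: the same `grade` call can be run with and without the reference to measure how much grounding changes agreement with human annotators.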
Static Vs. Agentic Game Master AI for Facilitating Solo Role-Playing Experiences
Jørgensen, Nicolai Hejlesen, Tharmabalan, Sarmilan, Aslan, Ilhan, Hansen, Nicolai Brodersen, Merritt, Timothy
This paper presents a game master AI for single-player role-playing games. The AI is designed to deliver interactive text-based narratives and experiences typically associated with multiplayer tabletop games like Dungeons & Dragons. We report on the design process and a series of experiments to improve the functionality and experience design, resulting in two functional versions of the system. While v1 of our system uses simplified prompt engineering, v2 leverages a multi-agent architecture and the ReAct framework to include reasoning and action. A comparative evaluation demonstrates that v2, as an agentic system, maintains playability while significantly improving modularity and game experience, including immersion and curiosity. Our findings contribute to the evolution of AI-driven interactive fiction, highlighting new avenues for enhancing solo role-playing experiences.
- North America > United States > New York (0.28)
- Europe > Denmark (0.14)
- South America > Brazil (0.14)
- Asia > Thailand (0.14)
- Questionnaire & Opinion Survey (1.00)
- Research Report > Experimental Study (0.68)
- Personal > Interview (0.67)
- Research Report > New Finding (0.48)
- Leisure & Entertainment > Games > Computer Games (1.00)
- Health & Medicine > Therapeutic Area (0.92)
OWLViz: An Open-World Benchmark for Visual Question Answering
Nguyen, Thuy, Nguyen, Dang, Nguyen, Hoang, Luong, Thuan, Dang, Long Hoang, Lai, Viet Dac
We present a challenging benchmark for the Open WorLd VISual question answering (OWLViz) task. OWLViz presents concise, unambiguous queries that require integrating multiple capabilities, including visual understanding, web exploration, and specialized tool usage. While humans achieve 69.2% accuracy on these intuitive tasks, even state-of-the-art VLMs struggle, with the best model, Gemini 2.0, achieving only 26.6% accuracy. Current agentic VLMs, which rely on limited vision and vision-language models as tools, perform even worse. This performance gap reveals significant limitations in multimodal systems' ability to select appropriate tools and execute complex reasoning sequences, establishing new directions for advancing practical AI research.
- North America > United States > Oregon > Lane County > Eugene (0.15)
- Europe > Austria > Vienna (0.14)
- Asia > Thailand (0.14)
Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles
Wang, Kuang, Li, Xianfei, Yang, Shenghao, Zhou, Li, Jiang, Feng, Li, Haizhou
User simulators are crucial for replicating human interactions with dialogue systems, supporting both collaborative training and automatic evaluation, especially for large language models (LLMs). However, existing simulators often rely solely on text utterances, missing implicit user traits such as personality, speaking style, and goals. In contrast, persona-based methods lack generalizability, as they depend on predefined profiles of famous individuals or archetypes. To address these challenges, we propose User Simulator with implicit Profiles (USP), a framework that infers implicit user profiles from human-machine conversations and uses them to generate more personalized and realistic dialogues. We first develop an LLM-driven extractor with a comprehensive profile schema. Then, we refine the simulation through conditional supervised fine-tuning and reinforcement learning with cycle consistency, optimizing it at both the utterance and conversation levels. Finally, we adopt a diverse profile sampler to capture the distribution of real-world user profiles. Experimental results demonstrate that USP outperforms strong baselines in terms of authenticity and diversity while achieving comparable performance in consistency. Furthermore, dynamic multi-turn evaluations based on USP strongly align with mainstream benchmarks, demonstrating its effectiveness in real-world applications.
- Asia > China (0.28)
- Asia > Thailand (0.14)
- North America > United States > Florida > Miami-Dade County > Miami (0.14)
Eeyore: Realistic Depression Simulation via Supervised and Preference Optimization
Liu, Siyang, Brie, Bianca, Li, Wenda, Biester, Laura, Lee, Andrew, Pennebaker, James, Mihalcea, Rada
Large Language Models (LLMs) have previously been explored for mental healthcare training and therapy client simulation, but they still fall short in authentically capturing diverse client traits and psychological conditions. We introduce Eeyore, an 8B model optimized for realistic depression simulation through a structured alignment framework, incorporating expert input at every stage. First, we systematically curate real-world depression-related conversations, extracting depressive traits to guide data filtering and psychological profile construction, and use this dataset to instruction-tune Eeyore for profile adherence. Next, to further enhance realism, Eeyore undergoes iterative preference optimization: first leveraging model-generated preferences, then calibrating with a small set of expert-annotated preferences. Throughout the pipeline, we actively collaborate with domain experts, developing interactive interfaces to validate trait extraction and iteratively refine structured psychological profiles for clinically meaningful role-play customization. Despite its smaller model size, the Eeyore depression simulation outperforms GPT-4o with SOTA prompting strategies in both linguistic authenticity and profile adherence.
- North America > United States > Texas (0.14)
- North America > United States > Michigan (0.14)
- Europe > Czechia (0.14)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Personal > Interview (0.67)