Goto

Collaborating Authors

 Media


R-Search: Empowering LLM Reasoning with Search via Multi-Reward Reinforcement Learning

arXiv.org Artificial Intelligence

Large language models (LLMs) have notably progressed in multi-step and long-chain reasoning. However, extending their reasoning capabilities to encompass deep interactions with search remains a non-trivial challenge, as models often fail to identify optimal reasoning-search interaction trajectories, resulting in suboptimal responses. We propose R-Search, a novel reinforcement learning framework for Reasoning-Search integration, designed to enable LLMs to autonomously execute multi-step reasoning with deep search interaction, and learn optimal reasoning search interaction trajectories via multi-reward signals, improving response quality in complex logic- and knowledge-intensive tasks. R-Search guides the LLM to dynamically decide when to retrieve or reason, while globally integrating key evidence to enhance deep knowledge interaction between reasoning and search. During RL training, R-Search provides multi-stage, multi-type rewards to jointly optimize the reasoning-search trajectory. Experiments on seven datasets show that R-Search outperforms advanced RAG baselines by up to 32.2% (in-domain) and 25.1% (out-of-domain). The code and data are available at https://github.com/QingFei1/R-Search.


Evaluating Apple Intelligence's Writing Tools for Privacy Against Large Language Model-Based Inference Attacks: Insights from Early Datasets

arXiv.org Artificial Intelligence

The misuse of Large Language Models (LLMs) to infer emotions from text for malicious purposes, known as emotion inference attacks, poses a significant threat to user privacy. In this paper, we investigate the potential of Apple Intelligence's writing tools, integrated across iPhone, iPad, and MacBook, to mitigate these risks through text modifications such as rewriting and tone adjustment. By developing early novel datasets specifically for this purpose, we empirically assess how different text modifications influence LLM-based detection. This capability suggests strong potential for Apple Intelligence's writing tools as privacy-preserving mechanisms. Our findings lay the groundwork for future adaptive rewriting systems capable of dynamically neutralizing sensitive emotional content to enhance user privacy. To the best of our knowledge, this research provides the first empirical analysis of Apple Intelligence's text-modification tools within a privacy-preservation context with the broader goal of developing on-device, user-centric privacy-preserving mechanisms to protect against LLMs-based advanced inference attacks on deployed systems.


MultiHoax: A Dataset of Multi-hop False-Premise Questions

arXiv.org Artificial Intelligence

As Large Language Models are increasingly deployed in high-stakes domains, their ability to detect false assumptions and reason critically is crucial for ensuring reliable outputs. False-premise questions (FPQs) serve as an important evaluation method by exposing cases where flawed assumptions lead to incorrect responses. While existing benchmarks focus on single-hop FPQs, real-world reasoning often requires multi-hop inference, where models must verify consistency across multiple reasoning steps rather than relying on surface-level cues. To address this gap, we introduce MultiHoax, a benchmark for evaluating LLMs' ability to handle false premises in complex, multi-step reasoning tasks. Our dataset spans seven countries and ten diverse knowledge categories, using Wikipedia as the primary knowledge source to enable factual reasoning across regions. Experiments reveal that state-of-the-art LLMs struggle to detect false premises across different countries, knowledge categories, and multi-hop reasoning types, highlighting the need for improved false premise detection and more robust multi-hop reasoning capabilities in LLMs.


Reddit sues AI company Anthropic for allegedly 'scraping' user comments to train chatbot

The Guardian

The social media platform Reddit has sued the artificial intelligence company Anthropic, alleging that it is illegally "scraping" the comments of Reddit users to train its chatbot Claude. Reddit claims that Anthropic has used automated bots to access the social network's content despite being asked not to do so, and "intentionally trained on the personal data of Reddit users without ever requesting their consent". Anthropic did not immediately return a request for comment. The claim was filed on Wednesday in the superior court of California in San Francisco. "AI companies should not be allowed to scrape information and content from people without clear limitations on how they can use that data," said Ben Lee, Reddit's chief legal officer, in a statement on Wednesday.


Generative Artificial Intelligence Policies under the Microscope

Communications of the ACM

Since the rise of ChatGPT, generative artificial intelligence (GenAI) technologies gained widespread popularity, impacting academic research and everyday communication.5,10 While GenAI offers benefits in task automation,9 it can also be misused and abused in nefarious applications,7 with significant risks to long-tail populations.6 Professionals in fields such as journalism and law still remain cautious due to concerns related to hallucinations and ethical issues, but scholars in computer science (CS), the field where GenAI originated, appear to be cautiously, yet actively exploring its use. For instance, Liang, W. et al.3 report the increased use of large language models (LLMs) in the CS scholarly articles (up to 17.5%), compared to mathematics articles (up to 6.3%), and Liang, W. et al.2 report that, between 6.5% and 16.9% of peer reviews at ICLR 2024, NeurIPS 2023, CoRL 2023, and EMNLP 2023 may have been altered by LLMs beyond minor revisions. Considering researchers' increasing adoption of GenAI, it is crucial to establish usage policies to promote fair and ethical practices in scholarly writing and peer reviews.


"Mountainhead" Channels the Absurdity of the Tech Bro

The New Yorker

Four tech billionaires walk into a mansion. It sounds like the setup for a punch line, but it also forms nearly the entire conceit behind "Mountainhead," a savagely entertaining but somewhat shallow new satire written and directed by Jesse Armstrong, the creator of "Succession." The film, which is streaming on HBO's Max, is a sort of chamber play, its stage a modernist castle in Utah--the Mountainhead of the title--overlooking snowy peaks. The players are a quartet of friends, or, more accurately, frenemies, who resemble a mishmash of real-world Silicon Valley founders. Steve Carell plays Randall Garrett, the group's Peter Thiel-esque mentor who, not unlike the late Steve Jobs, has cancer that his doctor tells him is incurable.


Surface Laptop 13in review: Microsoft's cheaper, more compact Windows 11 machine

The Guardian

Microsoft's latest Surface Laptop is smaller and cheaper, managing to condense most of what is great about its larger siblings into a more compact frame without compromising too much on power. The Guardian's journalism is independent. We will earn a commission if you buy something through an affiliate link. The Surface Laptop 13in joins the current seventh-generation Laptop 13.8in and 15in that were launched in the summer last year. It sits at the bottom of the premium pile in price, costing from 899 ( 1,099/ 900/A 1,699), but above the Laptop Go 3, which is likely to be phased out.


Exploring Explanations Improves the Robustness of In-Context Learning

arXiv.org Artificial Intelligence

In-context learning (ICL) has emerged as a successful paradigm for leveraging large language models (LLMs). However, it often struggles to generalize beyond the distribution of the provided demonstrations. A recent advancement in enhancing robustness is ICL with explanations (X-ICL), which improves prediction reliability by guiding LLMs to understand and articulate the reasoning behind correct labels. Building on this approach, we introduce an advanced framework that extends X-ICL by systematically exploring explanations for all possible labels (X$^2$-ICL), thereby enabling more comprehensive and robust decision-making. Experimental results on multiple natural language understanding datasets validate the effectiveness of X$^2$-ICL, demonstrating significantly improved robustness to out-of-distribution data compared to the existing ICL approaches.


Assumption-free stability for ranking problems

arXiv.org Machine Learning

In this work, we consider ranking problems among a finite set of candidates: for instance, selecting the top-$k$ items among a larger list of candidates or obtaining the full ranking of all items in the set. These problems are often unstable, in the sense that estimating a ranking from noisy data can exhibit high sensitivity to small perturbations. Concretely, if we use data to provide a score for each item (say, by aggregating preference data over a sample of users), then for two items with similar scores, small fluctuations in the data can alter the relative ranking of those items. Many existing theoretical results for ranking problems assume a separation condition to avoid this challenge, but real-world data often contains items whose scores are approximately tied, limiting the applicability of existing theory. To address this gap, we develop a new algorithmic stability framework for ranking problems, and propose two novel ranking operators for achieving stable ranking: the \emph{inflated top-$k$} for the top-$k$ selection problem and the \emph{inflated full ranking} for ranking the full list. To enable stability, each method allows for expressing some uncertainty in the output. For both of these two problems, our proposed methods provide guaranteed stability, with no assumptions on data distributions and no dependence on the total number of candidates to be ranked. Experiments on real-world data confirm that the proposed methods offer stability without compromising the informativeness of the output.


Generate, Not Recommend: Personalized Multimodal Content Generation

arXiv.org Artificial Intelligence

To address the challenge of information overload from massive web contents, recommender systems are widely applied to retrieve and present personalized results for users. However, recommendation tasks are inherently constrained to filtering existing items and lack the ability to generate novel concepts, limiting their capacity to fully satisfy user demands and preferences. In this paper, we propose a new paradigm that goes beyond content filtering and selecting: directly generating personalized items in a multimodal form, such as images, tailored to individual users. To accomplish this, we leverage any-to-any Large Multimodal Models (LMMs) and train them in both supervised fine-tuning and online reinforcement learning strategy to equip them with the ability to yield tailored next items for users. Experiments on two benchmark datasets and user study confirm the efficacy of the proposed method. Notably, the generated images not only align well with users' historical preferences but also exhibit relevance to their potential future interests.