 Generative AI


Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

arXiv.org Artificial Intelligence

Language has long been conceived as an essential tool for human reasoning. The breakthrough of Large Language Models (LLMs) has sparked significant research interest in leveraging these models to tackle complex reasoning tasks. Researchers have moved beyond simple autoregressive token generation by introducing the concept of "thought" -- a sequence of tokens representing intermediate steps in the reasoning process. This innovative paradigm enables LLMs to mimic complex human reasoning processes, such as tree search and reflective thinking. Recently, an emerging trend of learning to reason has applied reinforcement learning (RL) to train LLMs to master reasoning processes. This approach enables the automatic generation of high-quality reasoning trajectories through trial-and-error search algorithms, significantly expanding LLMs' reasoning capacity by providing substantially more training data. Furthermore, recent studies demonstrate that encouraging LLMs to "think" with more tokens during test-time inference can further boost reasoning accuracy significantly. Together, train-time and test-time scaling reveal a new research frontier -- a path toward Large Reasoning Models. The introduction of OpenAI's o1 series marks a significant milestone in this research direction. In this survey, we present a comprehensive review of recent progress in LLM reasoning. We begin by introducing the foundational background of LLMs and then explore the key technical components driving the development of large reasoning models, with a focus on automated data construction, learning-to-reason techniques, and test-time scaling. We also analyze popular open-source projects aimed at building large reasoning models, and conclude with open challenges and future research directions.


Knowledge Retrieval Based on Generative AI

arXiv.org Artificial Intelligence

This study develops a question-answering system based on Retrieval-Augmented Generation (RAG) using Chinese Wikipedia and Lawbank as retrieval sources. Using TTQA and TMMLU+ as evaluation datasets, the system employs BGE-M3 for dense vector retrieval to obtain highly relevant search results and BGE-reranker to reorder these results based on query relevance. The most pertinent retrieval outcomes serve as reference knowledge for a Large Language Model (LLM), enhancing its ability to answer questions and establishing a knowledge retrieval system grounded in generative AI. The system's effectiveness is assessed through a two-stage evaluation: automatic and assisted performance evaluations. The automatic evaluation calculates accuracy by comparing the model's auto-generated labels with ground truth answers, measuring performance under standardized conditions without human intervention. The assisted performance evaluation involves 20 finance-related multiple-choice questions answered by 20 participants without financial backgrounds. Initially, participants answer independently. Later, they receive system-generated reference information to assist in answering, examining whether the system improves accuracy when assistance is provided. The main contributions of this research are: (1) Enhanced LLM Capability: By integrating BGE-M3 and BGE-reranker, the system retrieves and reorders highly relevant results, reduces hallucinations, and dynamically accesses authorized or public knowledge sources. (2) Improved Data Privacy: A customized RAG architecture enables local operation of the LLM, eliminating the need to send private data to external servers. This approach enhances data security, reduces reliance on commercial services, lowers operational costs, and mitigates privacy risks.
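
The retrieve-then-rerank flow described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the bag-of-words `embed` function stands in for a dense encoder such as BGE-M3, the term-coverage score in `rerank` stands in for a cross-encoder such as BGE-reranker, and the function names, toy corpus, and prompt layout are hypothetical rather than taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


def embed(text):
    """Toy bag-of-words vector (stand-in for a dense encoder, NOT BGE-M3)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=3):
    """Stage 1: keep the k passages most similar to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]


def rerank(query, candidates, top_n=1):
    """Stage 2: reorder candidates by query-term coverage
    (stand-in for a cross-encoder reranker, NOT BGE-reranker)."""
    q_terms = set(tokenize(query))

    def score(doc):
        return len(q_terms & set(tokenize(doc))) / max(len(q_terms), 1)

    return sorted(candidates, key=score, reverse=True)[:top_n]


def build_prompt(query, passages):
    """Stage 3: pass the best passages to the LLM as reference knowledge."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Reference knowledge:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical toy corpus; the paper retrieves from Chinese Wikipedia and Lawbank.
corpus = [
    "The statute of limitations for contract claims is fifteen years",
    "Taipei is the capital of Taiwan",
    "Contract law governs agreements between parties",
    "Pandas eat bamboo",
]
query = "statute of limitations for contract claims"
candidates = retrieve(query, corpus, k=3)
best = rerank(query, candidates, top_n=1)
prompt = build_prompt(query, best)
```

Splitting the work into a cheap first pass and a more selective second pass mirrors the two-model design the abstract describes; in a real system both scoring functions would be replaced by the learned models.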


Google brings real-time information from The Associated Press to Gemini

Engadget

Google is partnering with The Associated Press to bring real-time information from the news agency to its Gemini app, the search giant announced on Wednesday. The financial terms of the agreement were not disclosed. The deal builds on an existing partnership Google had with The Associated Press to source real-time information for its search engine. "This will be particularly helpful to [Gemini app] users looking for up-to-date information," Google says of the deal. "AP and Google's longstanding relationship is based on working together to provide timely, accurate news and information to global audiences," said Kristin Heitmann, The Associated Press senior vice president and chief revenue officer.


Now you can instruct ChatGPT to do things in the future

PCWorld

OpenAI has updated ChatGPT with a new beta feature called Tasks, which allows the AI chatbot to perform tasks at a later time. Users simply say what they need and when they need it. For example, a user can instruct ChatGPT to inform them of current stock prices every morning, remind them of their language studies every evening, or give them a daily 15-minute personal training session. The Tasks feature is currently being rolled out to Plus, Team, and Pro subscribers. It can be found in the model selector, where it is called "GPT-4o with Scheduled Activities (beta)."


Axios partners with OpenAI, forgetting the scorpion stung the frog

Engadget

Axios is expanding its local newsletter presence from 30 to 34 cities. In its continued pretense of benefiting newsrooms, OpenAI has partnered with Axios in a three-year deal to cover Pittsburgh, Pennsylvania; Kansas City, Missouri; Boulder, Colorado; and Huntsville, Alabama. What does OpenAI get in exchange for its funding? Oh, just the ability to use Axios content to answer users' questions. Like the close to 20 newsrooms that OpenAI has already partnered with, Axios seems to have forgotten that the scorpion did end up stinging the frog.


NVIDIA's AI NPCs are a nightmare

Engadget

The rise of AI NPCs has felt like a looming threat for years, as if developers couldn't wait to dump human writers and offload NPC conversations to generative AI models. At CES 2025, NVIDIA made it plainly clear the technology was right around the corner. PUBG developer Krafton, for instance, plans to use NVIDIA's ACE (Avatar Cloud Engine) to power AI companions, which will assist and banter with you during matches. Krafton isn't just stopping there -- it's also using ACE in its life simulation title InZOI to make characters smarter and generate objects. While the use of generative AI in games seems almost inevitable, as the medium has always toyed with new methods for making enemies and NPCs seem smarter and more realistic, seeing several NVIDIA ACE demos back-to-back made me genuinely sick to my stomach.


Evaluating GenAI for Simplifying Texts for Education: Improving Accuracy and Consistency for Enhanced Readability

arXiv.org Artificial Intelligence

Generative artificial intelligence (GenAI) holds great promise as a tool to support personalized learning. Teachers need tools to efficiently and effectively enhance the readability of educational texts so that they match individual students' reading levels while retaining key details. Large Language Models (LLMs) show potential to fill this need, but previous research notes multiple shortcomings in current approaches. In this study, we introduced a generalized approach and metrics for the systematic evaluation of the accuracy and consistency with which LLMs, prompting techniques, and a novel multi-agent architecture simplify sixty informational reading passages, reducing each from the twelfth grade level down to the eighth, sixth, and fourth grade levels. We calculated the degree to which each LLM and prompting technique accurately achieved the targeted grade level for each passage, percentage change in word count, and consistency in maintaining keywords and key phrases (semantic similarity). One-sample t-tests and multiple regression models revealed significant differences in the best performing LLM and prompt technique for each of the four metrics. Both LLMs and prompting techniques demonstrated variable utility in grade level accuracy and consistency of keywords and key phrases when attempting to level content down to the fourth grade reading level. These results demonstrate the promise of the application of LLMs for efficient and precise automated text simplification, the shortcomings of current models and prompting methods in attaining an ideal balance across various evaluation criteria, and a generalizable method to evaluate future systems.
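
The kinds of metrics named in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which readability formula or similarity measure was used, so `grade_level_error` takes an externally measured grade as input, and `keyword_retention` uses plain substring matching as a stand-in for the paper's semantic-similarity measure. All function names and the sample texts are hypothetical.

```python
import re


def word_count(text):
    """Count word-like tokens (letters, digits, apostrophes)."""
    return len(re.findall(r"[A-Za-z0-9']+", text))


def pct_word_count_change(original, simplified):
    """Percentage change in word count; negative means the text got shorter."""
    before = word_count(original)
    if before == 0:
        raise ValueError("original text contains no words")
    return 100.0 * (word_count(simplified) - before) / before


def keyword_retention(keywords, simplified):
    """Fraction of keywords or key phrases still present in the simplified text.
    Substring matching is an assumption, not the paper's measure."""
    if not keywords:
        return 1.0
    text = simplified.lower()
    kept = sum(1 for kw in keywords if kw.lower() in text)
    return kept / len(keywords)


def grade_level_error(measured_grade, target_grade):
    """Signed gap between the measured and targeted grade level.
    The measured grade must come from a readability tool chosen separately."""
    return measured_grade - target_grade


# Hypothetical passage pair for illustration.
original = "Photosynthesis converts light energy into chemical energy stored in glucose"
simplified = "Plants turn light into glucose"

change = pct_word_count_change(original, simplified)
retention = keyword_retention(["light", "glucose", "chemical energy"], simplified)
error = grade_level_error(5.2, 4)
```

Reporting the grade gap as a signed value keeps overshooting and undershooting the target distinguishable, which matters when a passage is being leveled down to a specific grade.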


SAIF: A Comprehensive Framework for Evaluating the Risks of Generative AI in the Public Sector

arXiv.org Artificial Intelligence

The rapid adoption of generative AI in the public sector, encompassing diverse applications ranging from automated public assistance to welfare services and immigration processes, highlights its transformative potential while underscoring the pressing need for thorough risk assessments. Despite its growing presence, evaluations of risks associated with AI-driven systems in the public sector remain insufficiently explored. Building upon an established taxonomy of AI risks derived from diverse government policies and corporate guidelines, we investigate the critical risks posed by generative AI in the public sector while extending the scope to account for its multimodal capabilities. In addition, we propose a Systematic dAta generatIon Framework for evaluating the risks of generative AI (SAIF). SAIF involves four key stages: breaking down risks, designing scenarios, applying jailbreak methods, and exploring prompt types. It ensures the systematic and consistent generation of prompt data, facilitating a comprehensive evaluation while providing a solid foundation for mitigating the risks. Furthermore, SAIF is designed to accommodate emerging jailbreak methods and evolving prompt types, thereby enabling effective responses to unforeseen risk scenarios. We believe that this study can play a crucial role in fostering the safe and responsible integration of generative AI into the public sector.


How Developers Interact with AI: A Taxonomy of Human-AI Collaboration in Software Engineering

arXiv.org Artificial Intelligence

Artificial intelligence (AI), including large language models and generative AI, is emerging as a significant force in software development, offering developers powerful tools that span the entire development lifecycle. Although software engineering research has extensively studied AI tools in software development, the specific types of interactions between developers and these AI-powered tools have only recently begun to receive attention. Understanding and improving these interactions has the potential to improve productivity, trust, and efficiency in AI-driven workflows. In this paper, we propose a taxonomy of interaction types between developers and AI tools, identifying eleven distinct interaction types, such as auto-complete code suggestions, command-driven actions, and conversational assistance. Building on this taxonomy, we outline a research agenda focused on optimizing AI interactions, improving developer control, and addressing trust and usability challenges in AI-assisted development. By establishing a structured foundation for studying developer-AI interactions, this paper aims to stimulate research on creating more effective, adaptive AI tools for software development.


RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation

arXiv.org Artificial Intelligence

Generative AI systems like foundation models (FMs) must align well with human values to ensure their behavior is helpful and trustworthy. While Reinforcement Learning from Human Feedback (RLHF) has shown promise for optimizing model performance using human judgments, existing RLHF pipelines predominantly rely on immediate feedback, which can fail to accurately reflect the downstream impact of an interaction on users' utility. We demonstrate that feedback based on evaluators' foresight estimates of downstream consequences systematically induces Goodhart's Law dynamics, incentivizing misaligned behaviors like sycophancy and deception and ultimately degrading user outcomes. To alleviate this, we propose decoupling evaluation from prediction by refocusing RLHF on hindsight feedback. Our theoretical analysis reveals that conditioning evaluator feedback on downstream observations mitigates misalignment and improves expected human utility, even when these observations are simulated by the AI system itself. To leverage this insight in a practical alignment algorithm, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which first simulates plausible consequences and then elicits feedback to assess what behaviors were genuinely beneficial in hindsight. We apply RLHS to two widely-employed online and offline preference optimization methods -- Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) -- and show empirically that misalignment is significantly reduced with both methods. Through an online human user study, we show that RLHS consistently outperforms RLHF in helping users achieve their goals and earns higher satisfaction ratings, despite being trained solely with simulated hindsight feedback. These results underscore the importance of focusing on long-term consequences, even simulated ones, to mitigate misalignment in RLHF.