 Generative AI


Reasoning Beyond Limits: Advances and Open Problems for LLMs

arXiv.org Artificial Intelligence

Recent generative reasoning breakthroughs have transformed how large language models (LLMs) tackle complex problems by dynamically retrieving and refining information while generating coherent, multi-step thought processes. Techniques such as inference-time scaling, reinforcement learning, supervised fine-tuning, and distillation have been successfully applied to models like DeepSeek-R1, OpenAI's o1 & o3, GPT-4o, Qwen-32B, and various Llama variants, resulting in enhanced reasoning capabilities. In this paper, we provide a comprehensive analysis of the top 27 LLM models released between 2023 and 2025 (including models such as Mistral AI Small 3 24B, DeepSeek-R1, Search-o1, QwQ-32B, and phi-4). Then, we present an extensive overview of training methodologies that spans general training approaches, mixture-of-experts (MoE) and architectural innovations, retrieval-augmented generation (RAG), chain-of-thought and self-improvement techniques, as well as test-time compute scaling, distillation, and reinforcement learning (RL) methods. Finally, we discuss the key challenges in advancing LLM capabilities, including improving multi-step reasoning without human supervision, overcoming limitations in chained tasks, balancing structured prompts with flexibility, and enhancing long-context retrieval and external tool integration.


GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

arXiv.org Artificial Intelligence

Generative models offer a scalable and flexible paradigm for simulating complex environments, yet current approaches fall short in addressing the domain-specific requirements of autonomous driving - such as multi-agent interactions, fine-grained control, and multi-camera consistency. We introduce GAIA-2, Generative AI for Autonomy, a latent diffusion world model that unifies these capabilities within a single generative framework. GAIA-2 supports controllable video generation conditioned on a rich set of structured inputs: ego-vehicle dynamics, agent configurations, environmental factors, and road semantics. It generates high-resolution, spatiotemporally consistent multi-camera videos across geographically diverse driving environments (UK, US, Germany). The model integrates both structured conditioning and external latent embeddings (e.g., from a proprietary driving model) to facilitate flexible and semantically grounded scene synthesis. Through this integration, GAIA-2 enables scalable simulation of both common and rare driving scenarios, advancing the use of generative world models as a core tool in the development of autonomous systems. Videos are available at https://wayve.ai/thinking/gaia-2.


Heavy ChatGPT users tend to be more lonely, suggests research

The Guardian

Heavy users of ChatGPT tend to be lonelier, more emotionally dependent on the AI tool and have fewer offline social relationships, new research suggests. Only a small number of users engage emotionally with ChatGPT, but those who do are among the heaviest users, according to a pair of studies from OpenAI and the MIT Media Lab. The researchers wrote that the users who engaged in the most emotionally expressive personal conversations with the chatbots tended to experience higher loneliness – though it isn't clear if this is caused by the chatbot or because lonely people are seeking emotional bonds. While the researchers have stressed that the studies are preliminary, they ask pressing questions about how AI chatbot tools, which according to OpenAI are used by more than 400 million people a week, are influencing people's offline lives. The researchers, who plan to submit both studies to peer-reviewed journals, found that participants who "bonded" with ChatGPT – typically in the top 10% for time spent with the tool – were more likely than others to be lonely, and to rely on it more.


Now you can generate images directly from ChatGPT and Sora

Engadget

OpenAI just announced that all users will soon be able to generate images directly inside of ChatGPT. This will be the default image generation tool in 4o, so there will be no need to open Dall-E whenever you want to whip up a picture of a cat in space eating lasagna or whatever. The company says that the platform will "generate high-quality images based on your prompt, conversation and uploaded files." To the latter point, it'll be able to transform pre-existing images based on prompts. OpenAI is also boasting about significant improvements in text rendering and contextual understanding.


OpenAI's new image generator aims to be practical enough for designers and advertisers

MIT Technology Review

The new model makes progress on technical issues that have plagued AI image generators for years. While most have been great at creating fantastical images or realistic deepfakes, they've been terrible at something called binding, which refers to the ability to identify certain objects correctly and put them in their proper place (like a sign that says "hot dogs" properly placed above a food cart, not somewhere else in the image). It was only a few years ago that models started to succeed at things like "Put the red cube on top of the blue cube," a feature that is essential for any creative professional use of AI. Generators also struggle with text generation, typically creating distorted jumbles of letter shapes that look more like captchas than readable text. The model is able to generate 12 discrete graphics within a single image--like a cat emoji or a lightning bolt--and place them in proper order.


How Nvidia's CEO got me excited about our 'agentic AI' future

PCWorld

"The era we're in now is called the era of reasoning AI which is going to be the foundation layer of the next era of AI, or agentic AI." This was a statement from Nvidia's CEO Jensen Huang at HP's Amplify Conference in Nashville, last week. And in reflecting on it, it struck me as poignant for the following reason: Up until now I've viewed AI on a single timeline starting at the point where I first saw simple AI tools pop up, to now, where they are a lot more complex and capable than they used to be. In doing so, I've failed to see the true potential of AI -- the fact that we're about to enter a whole new age of AI that will make even those smart generative AI tools we have today pale in comparison to what we'll soon have on our PCs and other connected devices. It's not that I haven't been impressed with the advancements I've seen in AI tools so far.


Test-Time Reasoning Through Visual Human Preferences with VLMs and Soft Rewards

arXiv.org Artificial Intelligence

Can Visual Language Models (VLMs) effectively capture human visual preferences? This work addresses this question by training VLMs to think about preferences at test time, employing reinforcement learning methods inspired by DeepSeek R1 and OpenAI O1. Using datasets such as ImageReward and Human Preference Score v2 (HPSv2), our models achieve accuracies of 64.9% on the ImageReward test set (trained on the official ImageReward split) and 65.4% on HPSv2 (trained on approximately 25% of its data). These results match traditional encoder-based models while providing transparent reasoning and enhanced generalization. This approach makes it possible to use not only the rich world knowledge of VLMs but also their potential to think, yielding interpretable outcomes that help decision-making processes. By demonstrating that current VLMs can reason about human visual preferences, we introduce efficient soft-reward strategies for image ranking, outperforming simplistic selection or scoring methods. This reasoning capability enables VLMs to rank arbitrary images--regardless of aspect ratio or complexity--thereby potentially amplifying the effectiveness of visual Preference Optimization. By reducing the need for extensive markup while improving reward generalization and explainability, our findings can be a strong milestone that will enhance text-to-vision models even further.


Open Deep Search: Democratizing Search with Open-source Reasoning Agents

arXiv.org Artificial Intelligence

We introduce Open Deep Search (ODS) to close the widening gap between proprietary search AI solutions, such as Perplexity's Sonar Reasoning Pro and OpenAI's GPT-4o Search Preview, and their open-source counterparts. The main innovation introduced in ODS is to augment the reasoning capabilities of the latest open-source LLMs with reasoning agents that can judiciously use web search tools to answer queries. Concretely, ODS consists of two components that work with a base LLM chosen by the user: Open Search Tool and Open Reasoning Agent. Open Reasoning Agent interprets the given task and completes it by orchestrating a sequence of actions that includes calling tools, one of which is the Open Search Tool. Open Search Tool is a novel web search tool that outperforms proprietary counterparts. Together with powerful open-source reasoning LLMs, such as DeepSeek-R1, ODS nearly matches and sometimes surpasses the existing state-of-the-art baselines on two benchmarks: SimpleQA and FRAMES. For example, on the FRAMES evaluation benchmark, ODS improves on the best existing baseline, the recently released GPT-4o Search Preview, by 9.7% in accuracy. ODS is a general framework for seamlessly augmenting any LLM -- for example, DeepSeek-R1, which achieves 82.4% on SimpleQA and 30.1% on FRAMES -- with search and reasoning capabilities to achieve state-of-the-art performance: 88.3% on SimpleQA and 75.3% on FRAMES.
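
The division of labor described in this abstract (an agent that interprets a task and orchestrates tool calls, one of which is a search tool) can be illustrated with a minimal sketch. This is not the ODS implementation: `toy_search`, `toy_policy`, and `run_agent` are hypothetical names, the search tool returns canned text instead of querying the web, and the policy is a hard-coded stand-in for a base LLM.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[str], str]
Policy = Callable[[str, List[str]], Tuple[str, str]]


def toy_search(query: str) -> str:
    """Hypothetical stand-in for a web search tool; returns a canned snippet."""
    corpus = {"capital of france": "Paris is the capital of France."}
    return corpus.get(query.lower(), "no results")


def toy_policy(task: str, observations: List[str]) -> Tuple[str, str]:
    """Hypothetical stand-in for the base LLM: picks the next (action, argument)."""
    if not observations:
        # Nothing gathered yet, so ask the search tool about the task.
        return ("search", task)
    # Otherwise answer with the most recent observation.
    return ("answer", observations[-1])


def run_agent(task: str, tools: Dict[str, Tool], policy: Policy, max_steps: int = 5) -> str:
    """Loop: ask the policy for an action, call the named tool, record the result."""
    observations: List[str] = []
    for _ in range(max_steps):
        action, argument = policy(task, observations)
        if action == "answer":
            return argument
        tool = tools.get(action)
        if tool is None:
            observations.append(f"unknown tool: {action}")
            continue
        observations.append(tool(argument))
    return "no answer within step budget"


result = run_agent("capital of France", {"search": toy_search}, toy_policy)
print(result)
```

Keeping the tools in a dictionary keyed by name means the same loop works with any base model or additional tools, which mirrors the abstract's claim that the components work with a base LLM chosen by the user.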


Guarding against artificial intelligence--hallucinated citations: the case for full-text reference deposit

arXiv.org Artificial Intelligence

The tendency of generative artificial intelligence (AI) systems to "hallucinate" false information is well-known; AI-generated citations to nonexistent sources have made their way into the reference lists of peer-reviewed publications. Here, I propose a solution to this problem, taking inspiration from the Transparency and Openness Promotion (TOP) data sharing guidelines, the clash of generative AI with the American judiciary, and the precedent set by submissions of prior art to the United States Patent and Trademark Office. Journals should require authors to submit the full text of each cited source along with their manuscripts, thereby preventing authors from citing any material whose full text they cannot produce. This solution requires limited additional work on the part of authors or editors while effectively immunizing journals against hallucinated references. Within the same month, commenters on PubPeer raised concerns regarding the article's reference list.
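
The proposed rule, that every cited source must come with its full text, amounts to a simple submission check. The sketch below is only an illustration under assumptions not found in the abstract: the `Reference` record, the `missing_full_texts` helper, and the citation keys are all hypothetical, and a real workflow would involve files and editorial review rather than an in-memory dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Reference:
    key: str    # hypothetical citation key used in the manuscript
    title: str


def missing_full_texts(references: List[Reference], deposited: Dict[str, str]) -> List[str]:
    """Return the keys of cited references that have no non-empty deposited full text."""
    return [ref.key for ref in references if not deposited.get(ref.key, "").strip()]


# Toy submission: two cited references, only one deposited full text.
refs = [
    Reference("smith2020", "A source the author can produce"),
    Reference("doe2023", "A source the author cannot produce"),
]
deposits = {"smith2020": "Full text of the cited source ..."}

missing = missing_full_texts(refs, deposits)
print(missing)
```

A submission would pass this check only when the returned list is empty, so a citation with no producible full text is flagged before publication.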


Membership Inference Attacks on Large-Scale Models: A Survey

arXiv.org Artificial Intelligence

The adoption of Large Language Models (LLMs) has accelerated dramatically since OpenAI's ChatGPT went online in November 2022. Recent advances in Large Multimodal Models (LMMs), which process diverse data types and enable interaction through various channels, have expanded beyond the text-to-text limitations of early LLMs, attracting significant and concurrent attention from both researchers and industry. While LLMs and LMMs are starting to spread widely, concerns about their privacy risks are increasing as well. Membership Inference Attacks (MIAs), techniques used to determine whether a particular data point was part of a model's training set, serve as a key metric for assessing the privacy vulnerabilities of machine learning models. Hu et al. show that various machine learning algorithms are vulnerable to MIAs. Despite extensive studies on MIAs in traditional models, there remains a lack of systematic surveys addressing their effectiveness and implications in modern large-scale models like LLMs and LMMs. In this paper, we systematically review recent studies of MIAs against LLMs and LMMs. We analyze and categorize each attack based on its methodology and scenario and discuss the limitations of existing research. Additionally, we examine privacy concerns associated with the fine-tuning process. Finally, we provide some suggestions for future research in this direction.
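
To make the definition of a membership inference attack concrete, here is a minimal sketch of one well-known baseline, a loss-threshold attack: an example is guessed to be a training member when the model's loss on it is low. The abstract does not single out this attack; the probabilities, the threshold, and all function names below are illustrative assumptions, not results from the survey.

```python
import math
from typing import List, Sequence


def cross_entropy(prob_true_label: float) -> float:
    """Per-example loss, given the probability the model assigned to the true label."""
    return -math.log(max(prob_true_label, 1e-12))


def loss_threshold_attack(losses: Sequence[float], threshold: float) -> List[bool]:
    """Guess 'member' whenever the loss falls below the threshold."""
    return [loss < threshold for loss in losses]


def attack_accuracy(predictions: Sequence[bool], is_member: Sequence[bool]) -> float:
    """Fraction of examples whose membership was guessed correctly."""
    correct = sum(p == m for p, m in zip(predictions, is_member))
    return correct / len(is_member)


# Made-up confidences: models often fit training members more tightly than unseen data.
member_probs = [0.95, 0.90, 0.85, 0.60]
nonmember_probs = [0.70, 0.40, 0.30, 0.20]

losses = [cross_entropy(p) for p in member_probs + nonmember_probs]
labels = [True] * len(member_probs) + [False] * len(nonmember_probs)

threshold = 0.3  # assumed value; in practice it is calibrated, e.g. on shadow models
preds = loss_threshold_attack(losses, threshold)
accuracy = attack_accuracy(preds, labels)
print(accuracy)
```

On this toy data the attack misses only the member with confidence 0.60, which shows why the gap between training and held-out loss is what such attacks exploit.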