 search capability


Look It Up: Analysing Internal Web Search Capabilities of Modern LLMs

arXiv.org Artificial Intelligence

Modern large language models integrate web search to provide real-time answers, yet it remains unclear whether they are efficiently calibrated to use search when it is actually needed. We introduce a benchmark evaluating both the necessity and effectiveness of web access across commercial models with no access to internal states or parameters. The dataset includes a static split of 783 temporally anchored questions answerable from pre-cutoff knowledge, aimed at testing whether models invoke search based on low internal confidence, and a dynamic split of 288 post-cutoff queries designed to test whether models recognise when search is required and retrieve updated information. Web access substantially improves static accuracy for GPT-5-mini and Claude Haiku 4.5, though confidence calibration worsens. On dynamic queries, both models frequently invoke search yet remain below 70 percent accuracy due to weak query formulation. Costs per accuracy-improving call remain low, but returns diminish once initial retrieval fails. Selective invocation helps, but models become overconfident and inconsistent after search. Overall, built-in web search meaningfully improves factual accuracy and can be invoked selectively, yet models remain overconfident, skip retrieval when it is essential, and falter once initial search queries underperform. Taken together, internal web search works better as a good low-latency verification layer than a reliable analytical tool, with clear room for improvement.


ScholarSearch: Benchmarking Scholar Searching Ability of LLMs

arXiv.org Artificial Intelligence

The search capabilities of Large Language Models (LLMs) have garnered significant attention. Existing benchmarks, such as OpenAI's BrowseComp, primarily focus on general search scenarios and fail to adequately address the specific demands of academic search. These demands include deeper literature tracing and organization, professional support for academic databases, the ability to navigate long-tail academic knowledge, and ensuring academic rigor. Here, we propose ScholarSearch, the first dataset specifically designed to evaluate the complex information retrieval capabilities of LLMs in academic research. ScholarSearch possesses the following key characteristics: Academic Practicality, where question content closely mirrors real academic learning and research environments, avoiding deliberately misleading models; High Difficulty, with answers that are challenging for single models (e.g., Grok DeepSearch or Gemini Deep Research) to provide directly, often requiring at least three deep searches to derive; Concise Evaluation, where limiting conditions ensure answers are as unique as possible, accompanied by clear sources and brief solution explanations, greatly facilitating subsequent audit and verification, surpassing the current lack of analyzed search datasets both domestically and internationally; and Broad Coverage, as the dataset spans at least 15 different academic disciplines. Through ScholarSearch, we expect to more precisely measure and promote the performance improvement of LLMs in complex academic information retrieval tasks.


EvolveSearch: An Iterative Self-Evolving Search Agent

arXiv.org Artificial Intelligence

The rapid advancement of large language models (LLMs) has transformed the landscape of agentic information seeking capabilities through the integration of tools such as search engines and web browsers. However, current mainstream approaches for enabling LLM web search proficiency face significant challenges: supervised fine-tuning struggles with data production in open-search domains, while RL converges quickly, limiting their data utilization efficiency. To address these issues, we propose EvolveSearch, a novel iterative self-evolution framework that combines SFT and RL to enhance agentic web search capabilities without any external human-annotated reasoning data. Extensive experiments on seven multi-hop question-answering (MHQA) benchmarks demonstrate that EvolveSearch consistently improves performance across iterations, ultimately achieving an average improvement of 4.7% over the current state-of-the-art across seven benchmarks, opening the door to self-evolution agentic capabilities in open web search domains.


Meta's metaverse is getting an AI makeover

Engadget

Meta's Connect keynote felt different this year, and not just because it marked the return of an in-person event. It's been nearly two years since Mark Zuckerberg used Connect to announce that Facebook was changing its name to Meta and reorienting the entire company around the metaverse. But at this year's event, it felt almost as if Zuckerberg was trying to avoid saying the word "metaverse." While he did utter the word a couple of times, he spent much more time talking up Meta's new AI features, many of which will be available on Instagram and Facebook and other non-metaverse apps. Horizon Worlds, the company's signature metaverse experience that was highlighted at last year's Connect, was barely mentioned. That may not be particularly surprising if you've been following the company's metaverse journey lately.


Define, Evaluate, and Improve Task-Oriented Cognitive Capabilities for Instruction Generation Models

arXiv.org Artificial Intelligence

Recent work studies the cognitive capabilities of language models through psychological tests designed for humans. While these studies are helpful for understanding the general capabilities of these models, there is no guarantee that a model possessing sufficient capabilities to pass those tests would actually use those capabilities in performing real-life tasks. In this work, we formulate task-oriented cognitive capabilities, which are human-like cognitive capabilities that language models leverage to perform tasks. These capabilities are (i) the ability to quickly generate good candidate utterances (the search capability) (ii) the ability to predict how a listener interprets those utterances and choose the most appropriate one (the pragmatic capability). We design an evaluation scheme for comparing these capabilities of a language model with those of a human. Applying this scheme to examine various models in a navigation instruction generation problem, we find that their pragmatic capability is severely lacking. This insight leads us to augment them with better models of the listener and obtain a significant boost of 11% in success rate in guiding real humans. Our work advocates for having a principled procedure for aligning language models with humans that involves (i) formulating task-oriented capabilities, (ii) devising a method to quantify their deficiency, and (iii) iteratively improving them.


How AI search is overcoming the unstructured data challenge

#artificialintelligence

With 80 per cent of company data being unstructured, including text, images and video, getting the most possible value from rising amounts of these assets is proving a challenge across all business sectors. Businesses often meet pitfalls in keyword search capabilities that fail to properly take context, formats or languages into account, leaving users with insufficient results. To solve this challenge, Barcelona-headquartered data startup Nuclia is delivering an API that leverages what company CEO and co-founder Eudald Camprubi has named 'AI search as a service', capable of finding and indexing data across any source. An end-to-end solution, it can extract data from file repositories, audio, video, URLs and databases, split it into paragraphs, and present an index that shows exactly where any chosen piece of information is in the file. This is based on continuously trained language models, the creation of which owes much to data annotation.


How AI is shaping the future of live shopping e-commerce

#artificialintelligence

In fact, odds are you've already worked for a company that uses AI and/or machine learning tools to some extent! But AI will also have wide-reaching consequences for the economy beyond the workplace. In fact, AI is sure to shape the future of live shopping and e-commerce marketplaces for years to come. Let's take a look at six major ways AI will change e-commerce and live shopping in the near future. For starters, AI technology developments will allow the further development of visual search capabilities and programs.


The Future of Search Is Now! - Expert.ai

#artificialintelligence

Every day, billions of internet users type questions into search engines via smartphones, desktop computers or IoT devices, 90 percent of whom are using Google. As a result, each time the company releases a new algorithm into cyberspace, top-ranked SEO marketers and webpage owners become fearful of losing their page-one rankings. However, the company's latest iteration is notably different from those previously released. Now, the tech giant has decided to take the next step and marry its latest algorithm with natural language processing (NLP). Many believe that this dynamic pairing could prove to be a game changer for search. As the primary tool for people to access information, the importance of search engines can't be overestimated.


Arcanum makes Hungarian heritage accessible with Amazon Rekognition

#artificialintelligence

Arcanum specializes in digitizing Hungarian language content, including newspapers, books, maps, and art. With over 30 years of experience, Arcanum serves more than 30,000 global subscribers with access to Hungarian culture, history, and heritage. Amazon Rekognition Solutions Architects worked with Arcanum to add highly scalable image analysis to Arcanum Digitheca, a free service provided by Arcanum, which enables you to search and explore Hungarian cultural heritage, including 600,000 faces over 500,000 images. For example, you can find historical works by author Mór Jókai or photos on topics like weddings. The Arcanum team chose Amazon Rekognition to free valuable staff from time and cost-intensive manual labeling, and improved label accuracy to make 200,000 previously unsearchable images (approximately 40% of image inventory), available to users.


Configuring your Amazon Kendra Confluence Server connector

#artificialintelligence

These types of workspaces are rich with data and contain sets of knowledge and information that can be a great source of truth to answer organizational questions. Unfortunately, it isn't always easy to tap into these data sources to extract the information you need. For example, the data source might not be connected to an enterprise search service within the organization, or the service is outdated and lacks natural language search capabilities, leading to poorer search experiences. Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra reimagines enterprise search for your websites and applications so your employees and customers can easily find the content they're looking for, even when it's scattered across multiple locations and content repositories within your organization.