DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
Lu, Rui, Hou, Zhenyu, Wang, Zihan, Zhang, Hanchen, Liu, Xiao, Li, Yujiang, Feng, Shi, Tang, Jie, Dong, Yuxiao
Augmenting large language models (LLMs) with browsing tools substantially improves their potential as deep search agents to solve complex, real-world tasks. Yet, open LLMs still perform poorly in such settings due to limited long-horizon reasoning capacity with browsing tools and the lack of sufficiently difficult supervised data. To address these challenges, we present DeepDive to advance deep search agents. First, we propose a strategy to automatically synthesize complex, difficult, and hard-to-find questions from open knowledge graphs. Second, we apply end-to-end multi-turn reinforcement learning (RL) to enhance LLMs' long-horizon reasoning with deep search. To encourage diversity and reduce redundancy, we design a redundancy penalty that discourages repeated similar queries. Experiments show that DeepDive-32B achieves a new open-source competitive result on BrowseComp, outperforming WebSailor, DeepSeek-R1-Browse, and Search-o1. We demonstrate that multi-turn RL training improves deep search ability and significantly contributes to the performance improvements across multiple benchmarks. We observe that DeepDive enables test-time scaling of tool calls and parallel sampling. All datasets, models, and code are publicly available at https://github.com/THUDM/DeepDive.
The Machine Ethics podcast – DeepDive: AI and the environment
Hosted by Ben Byford, The Machine Ethics Podcast brings together interviews with academics, authors, business leaders, designers and engineers on the subject of autonomous algorithms, artificial intelligence, machine learning, and technology's impact on society. This is our 100th episode! A super special look at AI and the environment, we interviewed four experts for this DeepDive episode. We chatted about water stress, the energy usage of AI systems and data centres, using AI for fossil fuel discovery, the geo-political nature of AI, GenAI vs other ML algorithms for energy use, demanding transparency on energy usage for training and operating AI, more AI regulation for carbon consumption, things we can change today like picking renewable hosting solutions, publishing your data, when doing "responsible AI" you must include the environment, considering who are the controllers of the technology and what do they want, and more… Hannah Smith is Director of Operations for Green Web Foundation and co-founder of Green Tech South West. She has a background in Computer Science.
- North America > United States > Virginia (0.05)
- North America > United States > Maryland (0.05)
- North America > Canada (0.05)
- Energy (0.73)
- Government (0.56)
- Law (0.51)
Disentangled Interleaving Variational Encoding
Wong, Noelle Y. L., Cheu, Eng Yeow, Chiam, Zhonglin, Srinivasan, Dipti
Conflicting objectives present a considerable challenge in interleaving multi-task learning, necessitating careful design and balancing to ensure effective learning of a representative latent data space across all tasks without mutual negative impact. Our proposed model, Deep Disentangled Interleaving Variational Encoding (DeepDIVE), learns disentangled features from the original input to form clusters in the embedding space and unifies these features via the cross-attention mechanism in the fusion stage. We theoretically prove that combining the objectives for reconstruction and forecasting fully captures the lower bound, and we mathematically derive a loss function for disentanglement using Naïve Bayes. Experiments on two public datasets show that DeepDIVE disentangles the original input and yields forecast accuracies better than the original VAE and comparable to existing state-of-the-art baselines. In multi-objective deep learning, gradients from different objectives can conflict when the different loss terms induce competing gradient directions during training of the network. Balancing these gradients to ensure stable and effective learning is a significant challenge, prompting the development of mitigation methods such as those of Liu et al. (2021), Yu et al. (2020), and Sener & Koltun (2018), which solve an additional optimization problem before each gradient update step to manipulate conflicting gradients before the update.
- North America > Canada > Alberta > Census Division No. 15 > Improvement District No. 9 > Banff (0.04)
- Asia > Singapore (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
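The gradient-conflict mitigation cited in the DeepDIVE abstract above — for example, the gradient-projection approach of Yu et al. (2020) — can be sketched as follows. This is a simplified single-pass illustration of the projection idea, not the authors' implementation (the published method also randomizes the projection order):

```python
import numpy as np

def pcgrad(grads):
    """Project each task gradient onto the normal plane of any
    conflicting gradient (one with a negative dot product), then sum.
    A simplified, single-pass variant of the idea in Yu et al. (2020)."""
    projected = []
    for i, g in enumerate(grads):
        g = g.copy()
        for j, other in enumerate(grads):
            if i == j:
                continue
            dot = g @ other
            if dot < 0:  # conflicting direction: remove the component along `other`
                g = g - (dot / (other @ other)) * other
        projected.append(g)
    # The combined update no longer contains mutually destructive components.
    return np.sum(projected, axis=0)

# Two conflicting task gradients: their dot product is negative.
combined = pcgrad([np.array([1.0, 0.0]), np.array([-1.0, 1.0])])
```

After projection, each per-task gradient has a non-negative dot product with every other original gradient, which is the stability property these methods aim for.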
Toward the Automated Construction of Probabilistic Knowledge Graphs for the Maritime Domain
Shiri, Fatemeh, Wang, Teresa, Pan, Shirui, Chang, Xiaojun, Li, Yuan-Fang, Haffari, Reza, Nguyen, Van, Yu, Shuang
International maritime crime is becoming increasingly sophisticated, often associated with wider criminal networks. Detecting maritime threats by means of fusing data purely related to physical movement (i.e., those generated by physical sensors, or hard data) is not sufficient. This has led to research and development efforts aimed at combining hard data with other types of data (especially human-generated or soft data). Existing work often assumes that input soft data is available in a structured format, or is focused on extracting certain relevant entities or concepts to accompany or annotate hard data. Much less attention has been given to extracting the rich knowledge about the situations of interest implicitly embedded in the large amount of soft data existing in unstructured formats (such as intelligence reports and news articles). In order to exploit the potentially useful and rich information from such sources, it is necessary to extract not only the relevant entities and concepts but also their semantic relations, together with the uncertainty associated with the extracted knowledge (i.e., in the form of probabilistic knowledge graphs). This will increase the accuracy of, and confidence in, the extracted knowledge and facilitate subsequent reasoning and learning. To this end, we propose Maritime DeepDive, an initial prototype for the automated construction of probabilistic knowledge graphs from natural language data for the maritime domain. In this paper, we report on the current implementation of Maritime DeepDive, together with preliminary results on extracting probabilistic events from maritime piracy incidents. This pipeline was evaluated on a manually crafted gold standard, yielding promising results.
- North America > Panama (0.05)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- (6 more...)
- Transportation > Marine (1.00)
- Government > Military (0.94)
- Information Technology > Security & Privacy (0.93)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.95)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.93)
- (2 more...)
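The probabilistic knowledge graphs described in the Maritime DeepDive abstract attach an extraction confidence to each relation, so downstream reasoning can weigh the evidence. A minimal sketch of that representation — the schema, entity names, and threshold function here are hypothetical, not Maritime DeepDive's actual data model:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProbabilisticTriple:
    """One edge of a probabilistic knowledge graph: a subject-relation-object
    fact annotated with the confidence of the extraction model."""
    subject: str
    relation: str
    obj: str
    probability: float

# Hypothetical extractions from a piracy-incident report.
kg = [
    ProbabilisticTriple("Vessel_X", "hijacked_near", "Gulf of Guinea", 0.87),
    ProbabilisticTriple("Vessel_X", "owned_by", "Shipping_Co_Y", 0.64),
]

def above_threshold(triples, tau):
    """Keep only the triples the pipeline is sufficiently confident about."""
    return [t for t in triples if t.probability >= tau]
```

Storing the probability rather than a hard yes/no lets the confidence threshold be chosen per application, e.g. stricter for alerting than for exploratory analysis.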
Apple's new AI acquisition has some healthcare experience
Lattice, the "dark data" startup Apple snapped up earlier this week, may well have aspirations in the healthcare space. Apple acquired the company a few weeks ago according to Tech Crunch, which says a source put the deal at $200 million. Lattice is working on using artificial intelligence to turn "dark data", which is unstructured, largely unusable data generated through digital interactions, into structured, usable data. Obviously Apple is working on a number of different initiatives, and the company has declined to comment on its plans for the acquisition. But there is reason to think that Lattice might be taken up as part of Apple's many healthcare efforts.
Apple acquires AI company Lattice Data, a specialist in unstructured 'dark data', for $200M
As large tech companies gear up to make a stronger push into machine learning and artificial intelligence, Apple has acquired a company to fill out its own capabilities in the area. Specifically, Apple has picked up Lattice Data, a company that applies an AI enabled inference engine to take unstructured, "dark" data and turn it into structured (and more usable) information. We've heard from a single source that Apple has paid a price of around $200 million. The deal was closed a couple of weeks ago, the source said, and about 20 engineers have joined the larger company. A source first alerted us to this, and we've received the standard Apple confirmation.
- North America > United States > Michigan (0.06)
- Asia > South Korea (0.06)
DeepDive
The dark data extraction or knowledge base construction (KBC) problem is to populate a relational database with information from unstructured data sources, such as emails, webpages, and PDFs. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is to frame traditional extract-transform-load (ETL) style data management problems as a single large statistical inference task that is declaratively defined by the user. DeepDive leverages the effectiveness and efficiency of statistical inference and machine learning for difficult extraction tasks, while not requiring users to directly write any probabilistic inference algorithms.
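The key idea above — the user declares features over candidate extractions and the system handles the probabilistic inference — can be illustrated in miniature. The feature names and weights below are hypothetical, and this is a plain logistic scoring function rather than DeepDive's actual factor-graph inference or DDlog syntax:

```python
import math

# In DeepDive these weights are learned from data; here they are
# assumed values for illustration only.
WEIGHTS = {
    "phrase_contains_founded": 2.0,
    "entities_in_same_sentence": 1.0,
    "sentence_longer_than_50_tokens": -0.5,
}

def marginal(active_features):
    """Probability that a candidate fact is true: the sigmoid of the
    summed weights of its active features. The user only declares
    features; no inference code is written by hand."""
    score = sum(WEIGHTS[f] for f in active_features)
    return 1.0 / (1.0 + math.exp(-score))

# A candidate fact extracted from text, with its active features.
candidate = ("Larry Page", "founder_of", "Google")
p = marginal(["phrase_contains_founded", "entities_in_same_sentence"])
```

The output is a marginal probability per candidate fact, which is what gets loaded into the relational database instead of a hard extraction decision.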
Incremental knowledge base construction using DeepDive
Regular readers will no doubt have noticed that these are the subject areas I most often cover on The Morning Paper. I've chosen today's paper as representative of a large body of work at Stanford on a system called DeepDive. DeepDive sits at a very interesting intersection of the above topics, and its goal is to build a knowledge base – stored in a relational database – from information in large volumes of semi-structured and unstructured data. Such data is sometimes called dark data, and creating a knowledge base from it is the task of knowledge base construction (KBC).
Wikipedia Knowledge Graph with DeepDive
Palomares, Thomas (Stanford University) | Ahres, Youssef (Stanford University) | Kangaspunta, Juhana (Stanford University) | Ré, Christopher (Stanford University)
Despite the tremendous amount of information on Wikipedia, only a very small amount is structured. Most of the information is embedded in unstructured text, and extracting it is a non-trivial challenge. In this paper, we propose a full pipeline built on top of DeepDive to successfully extract meaningful relations from the Wikipedia text corpus. We evaluated the system by extracting company-founders and family relations from the text. As a result, we extracted more than 140,000 distinct relations with an average precision above 90%.
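Pipelines like the one above typically obtain training labels by distant supervision: a candidate mention pair is marked positive when it matches a known seed fact. The following is a generic sketch of that pattern with made-up seed facts, not the paper's exact implementation:

```python
# Hypothetical seed set of known (company, founder) facts.
SEED_FOUNDERS = {
    ("Google", "Larry Page"),
    ("Microsoft", "Bill Gates"),
}

def label_candidates(candidates):
    """Label (company, person, sentence) candidates via distant supervision:
    a candidate is positive iff the pair appears in the seed facts."""
    labeled = []
    for company, person, sentence in candidates:
        is_positive = (company, person) in SEED_FOUNDERS
        labeled.append((company, person, sentence, is_positive))
    return labeled

candidates = [
    ("Google", "Larry Page", "Larry Page co-founded Google in 1998."),
    ("Google", "Sundar Pichai", "Sundar Pichai is the CEO of Google."),
]
labeled = label_candidates(candidates)
```

The noisy labels produced this way become training data for the relation classifier, whose precision is then measured against a manually labeled gold set, as in the 90% figure the paper reports.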