AITopics | data discovery

data discovery

From keywords to semantics: Perceptions of large language models in data discovery

Halstead, Maura E, Green, Mark A., Jay, Caroline, Kingston, Richard, Topping, David, Singleton, Alexander

arXiv.org Artificial Intelligence, Oct-3-2025

This matching requires researchers to know the exact wording that other researchers previously used, creating a challenging process that could lead to missing relevant data. Large Language Models (LLMs) could enhance data discovery by removing this requirement and allowing researchers to ask questions with natural language. However, we do not currently know if researchers would accept LLMs for data discovery. Using a human-centered artificial intelligence (HCAI) focus, we ran focus groups (N = 27) to understand researchers' perspectives towards LLMs for data discovery. Our conceptual model shows that the potential benefits are not enough for researchers to use LLMs instead of current technology. Barriers prevent researchers from fully accepting LLMs, but features around transparency could overcome them. Using our model will allow developers to incorporate features that result in an increased acceptance of LLMs for data discovery.

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2510.01473

Country:

North America (0.28)
Europe > United Kingdom (0.28)

Genre:

Research Report > Experimental Study (0.95)
Research Report > New Finding (0.70)

Industry:

Health & Medicine (1.00)
Government (1.00)
Information Technology (0.68)
Education > Educational Setting (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

LLM-based Multi-Agent Blackboard System for Information Discovery in Data Science

Salemi, Alireza, Parmar, Mihir, Goyal, Palash, Song, Yiwen, Yoon, Jinsung, Zamani, Hamed, Palangi, Hamid, Pfister, Tomas

arXiv.org Artificial Intelligence, Oct-3-2025

The rapid advancement of Large Language Models (LLMs) has opened new opportunities in data science, yet their practical deployment is often constrained by the challenge of discovering relevant data within large heterogeneous data lakes. Existing methods struggle with this: single-agent systems are quickly overwhelmed by the large, heterogeneous files in these data lakes, while multi-agent systems designed based on a master-slave paradigm depend on a rigid central controller for task allocation that requires precise knowledge of each sub-agent's capabilities. To address these limitations, we propose a novel multi-agent communication paradigm inspired by the blackboard architecture for traditional AI models. In this framework, a central agent posts requests to a shared blackboard, and autonomous subordinate agents -- either responsible for a partition of the data lake or general information retrieval -- volunteer to respond based on their capabilities. This design improves scalability and flexibility by eliminating the need for a central coordinator to have prior knowledge of all sub-agents' expertise. We evaluate our method on three benchmarks that require explicit data discovery: KramaBench and modified versions of DS-Bench and DA-Code to incorporate data discovery. Experimental results demonstrate that the blackboard architecture substantially outperforms baselines, including RAG and the master-slave multi-agent paradigm, achieving between 13% and 57% relative improvement in end-to-end task success and up to a 9% relative gain in F1 score for data discovery over the best-performing baselines across both proprietary and open-source LLMs. Our findings establish the blackboard paradigm as a scalable and generalizable communication framework for multi-agent systems.
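The posting-and-volunteering pattern described in this abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names (`Request`, `PartitionAgent`, `Blackboard`), the file names, and the word-overlap test that stands in for an LLM deciding whether it can help are all hypothetical. The point shown is that the central agent never needs a registry of sub-agent capabilities.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """A request the central agent posts to the shared blackboard."""
    text: str
    responses: dict = field(default_factory=dict)  # agent name -> list of files


@dataclass
class PartitionAgent:
    """Hypothetical sub-agent responsible for one partition of a data lake."""
    name: str
    files: dict  # file name -> short description of its contents

    def _matches(self, request: Request) -> list:
        # Assumption: word overlap stands in for an LLM's relevance judgment.
        words = set(request.text.lower().split())
        return sorted(
            fname for fname, desc in self.files.items()
            if words & set(desc.lower().split())
        )

    def can_help(self, request: Request) -> bool:
        # The agent decides for itself whether to volunteer.
        return bool(self._matches(request))

    def respond(self, request: Request) -> list:
        return self._matches(request)


class Blackboard:
    """Shared space: requests are visible to every agent, none is assigned."""

    def __init__(self, agents):
        self.agents = agents
        self.requests = []

    def post(self, text: str) -> Request:
        request = Request(text)
        self.requests.append(request)
        # No central task allocation: each agent volunteers or stays silent.
        for agent in self.agents:
            if agent.can_help(request):
                request.responses[agent.name] = agent.respond(request)
        return request


agents = [
    PartitionAgent("sales", {"orders.csv": "monthly sales orders",
                             "refunds.csv": "customer refunds"}),
    PartitionAgent("weather", {"rain.csv": "daily rainfall totals"}),
]
board = Blackboard(agents)
result = board.post("monthly sales figures")
print(result.responses)
```

In this toy run only the `sales` agent volunteers, because only its file descriptions overlap with the request; adding a new agent requires no change to the code that posts requests.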

agent, artificial intelligence, llm-based multi-agent blackboard system, (11 more...)

arXiv.org Artificial Intelligence

2510.01285

Country: North America > United States (0.93)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology (0.46)
Health & Medicine (0.34)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

A Generative AI System for Biomedical Data Discovery with Grammar-Based Visualizations

Lange, Devin, Gao, Shanghua, Sui, Pengwei, Money, Austen, Misner, Priya, Zitnik, Marinka, Gehlenborg, Nils

arXiv.org Artificial Intelligence, Sep-23-2025

We explore the potential for combining generative AI with grammar-based visualizations for biomedical data discovery. In our prototype, we use a multi-agent system to generate visualization specifications and apply filters. These visualizations are linked together, resulting in an interactive dashboard that is progressively constructed. Our system leverages the strengths of natural language while maintaining the utility of traditional user interfaces. Furthermore, we utilize generated interactive widgets enabling user adjustment. Finally, we demonstrate the potential utility of this system for biomedical data discovery with a case study.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2509.16454

Country: North America > United States (1.00)

Genre: Research Report (0.50)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.92)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.70)

Impact and influence of modern AI in metadata management

Yang, Wenli, Fu, Rui, Amin, Muhammad Bilal, Kang, Byeong

arXiv.org Artificial Intelligence, Jan-27-2025

Metadata management plays a critical role in data governance, resource discovery, and decision-making in the data-driven era. While traditional metadata approaches have primarily focused on organization, classification, and resource reuse, the integration of modern artificial intelligence (AI) technologies has significantly transformed these processes. This paper investigates both traditional and AI-driven metadata approaches by examining open-source solutions, commercial tools, and research initiatives. A comparative analysis of traditional and AI-driven metadata management methods is provided, highlighting existing challenges and their impact on next-generation datasets. The paper also presents an innovative AI-assisted metadata management framework designed to address these challenges. This framework leverages more advanced modern AI technologies to automate metadata generation, enhance governance, and improve the accessibility and usability of modern datasets. Finally, the paper outlines future directions for research and development, proposing opportunities to further advance metadata management in the context of AI-driven innovation and complex datasets.

data mining, information retrieval, machine learning, (24 more...)

arXiv.org Artificial Intelligence

2501.16605

Country:

Oceania > Australia > Tasmania (0.04)
Europe > United Kingdom (0.04)
Europe > Germany > Saxony > Leipzig (0.04)
(7 more...)

Genre:

Research Report (1.00)
Overview > Innovation (0.34)

Industry:

Law (1.00)
Information Technology > Services (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Information Management > Metadata Management (1.00)
Information Technology > Data Science > Data Quality (1.00)
(5 more...)

GNN: Graph Neural Network and Large Language Model for Data Discovery

Hoang, Thomas

arXiv.org Artificial Intelligence, Aug-27-2024

Our algorithm GNN: Graph Neural Network and Large Language Model for Data Discovery inherits the benefits of PLOD (Predictive Learning Optimal Data Discovery) and BOD (Blindly Optimal Data Discovery): it removes the need to predefine a utility function and to collect human input for attribute ranking, which helps avoid a time-consuming loop process. In addition to these previous works, GNN leverages graph neural networks and large language models to understand text-type values that cannot be understood by PLOD and MOD, thus making the task of predicting outcomes more reliable. GNN can be seen as an extension of PLOD that understands text-type values and the user's preferences over text values as well as numerical values, which is promising for data science and analytics.

algorithm, graph neural network, utility function, (15 more...)

arXiv.org Artificial Intelligence

2408.13609

Country:

North America > United States > Ohio (0.04)
North America > United States > New York > New York County > New York City (0.04)
Asia > China (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Scientific Discovery (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

CMDBench: A Benchmark for Coarse-to-fine Multimodal Data Discovery in Compound AI Systems

Feng, Yanlin, Rahman, Sajjadur, Feng, Aaron, Chen, Vincent, Kandogan, Eser

arXiv.org Artificial Intelligence, Jun-1-2024

Compound AI systems (CASs) that employ LLMs as agents to accomplish knowledge-intensive tasks via interactions with tools and data retrievers have garnered significant interest within database and AI communities. While these systems have the potential to supplement typical analysis workflows of data analysts in enterprise data platforms, unfortunately, CASs are subject to the same data discovery challenges that analysts have encountered over the years -- silos of multimodal data sources, created across teams and departments within an organization, make it difficult to identify appropriate data sources for accomplishing the task at hand. Existing data discovery benchmarks do not model such multimodality and multiplicity of data sources. Moreover, benchmarks of CASs prioritize only evaluating end-to-end task performance. To catalyze research on evaluating the data discovery performance of multimodal data retrievers in CASs within a real-world setting, we propose CMDBench, a benchmark modeling the complexity of enterprise data platforms. We adapt existing datasets and benchmarks in open-domain -- from question answering and complex reasoning tasks to natural language querying over structured data -- to evaluate coarse- and fine-grained data discovery and task execution performance. Our experiments reveal the impact of data retriever design on downstream task performance -- a 46% drop in task accuracy on average -- across various modalities, data sources, and task difficulty. The results indicate the need to develop optimization strategies to identify appropriate LLM agents and retrievers for efficient execution of CASs over enterprise data.

benchmark, graph, modality, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3665601.3669846

2406.00583

Country:

Asia > Japan (0.04)
North America > United States > Utah (0.04)
North America > United States > Minnesota (0.04)
(5 more...)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Sports > Basketball (1.00)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Scientific Discovery (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

METAM: Goal-Oriented Data Discovery

Galhotra, Sainyam, Gong, Yue, Fernandez, Raul Castro

arXiv.org Artificial Intelligence, Apr-18-2023

Data is a central component of machine learning and causal inference tasks. The availability of large amounts of data from sources such as open data repositories, data lakes and data marketplaces creates an opportunity to augment data and boost those tasks' performance. However, augmentation techniques rely on a user manually discovering and shortlisting useful candidate augmentations. Existing solutions do not leverage the synergy between discovery and augmentation, thus underexploiting data. In this paper, we introduce METAM, a novel goal-oriented framework that queries the downstream task with a candidate dataset, forming a feedback loop that automatically steers the discovery and augmentation process. To select candidates efficiently, METAM leverages properties of: i) the data, ii) the utility function, and iii) the solution set size. We show METAM's theoretical guarantees and demonstrate those empirically on a broad set of tasks. All in all, we demonstrate the promise of goal-oriented data discovery to modern data science applications.
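The feedback loop in this abstract, where the downstream task is queried with a candidate augmentation and the score steers what is kept, can be sketched as a simple greedy search in Python. This is only an illustration of the general idea, not METAM's actual algorithm or candidate-selection strategy; the function name, the `utility` callable, the toy gain values, and the `max_size` cap (a stand-in for a bound on solution set size) are all assumptions.

```python
from typing import Callable, Iterable


def greedy_goal_oriented_discovery(
    candidates: Iterable[str],
    utility: Callable[[frozenset], float],
    max_size: int,
) -> tuple:
    """Greedily add the augmentation that most improves the task's utility.

    `utility` is a hypothetical hook that runs the downstream task (for
    example, trains and scores a model) on the chosen set of augmentations.
    """
    selected = frozenset()
    best = utility(selected)  # baseline score with no augmentation
    remaining = set(candidates)
    while remaining and len(selected) < max_size:
        # Query the downstream task once per remaining candidate.
        scored = {c: utility(selected | {c}) for c in remaining}
        # Sorting first makes tie-breaking deterministic.
        top = max(sorted(scored), key=scored.get)
        if scored[top] <= best:
            break  # no candidate improves the task, so stop searching
        selected, best = selected | {top}, scored[top]
        remaining.discard(top)
    return selected, best


# Toy utility: a fixed baseline plus a made-up gain per augmentation.
gains = {"income": 0.10, "weather": 0.02, "noise": -0.05}


def toy_utility(chosen: frozenset) -> float:
    return 0.60 + sum(gains[c] for c in chosen)


chosen, score = greedy_goal_oriented_discovery(gains, toy_utility, max_size=2)
print(sorted(chosen), round(score, 2))
```

With these toy gains the loop keeps `income` and `weather` and never selects `noise`, since adding it would lower the score. A real system would need a cheaper way to choose which candidates to evaluate, because each utility query can mean retraining a model.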

data mining, information retrieval, machine learning, (23 more...)

arXiv.org Artificial Intelligence

2304.09068

Country: North America > United States (1.00)

Genre: Research Report (1.00)

Industry:

Education (0.93)
Banking & Finance > Real Estate (0.46)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
(2 more...)

Computer scientist, Data scientist or similar with a focus on knowledge management (f/m/x) - Data Discovery for Anonymised Health Data

#artificialintelligence, Oct-22-2022, 03:35:28 GMT

The focus of the DLR Institute for Data Science in Jena is to find solutions for the major challenges of the digitalisation age. The research focuses on the areas of data extraction and mobilisation, data management and preparation, and data analysis and intelligence. The position is part of the BMBF project Avatar (anonymisation of personal health data by creating virtual avatars). Topics include, in particular, the semantic modelling of relevant metadata and data discovery. The overall goal of the project is providing anonymised health data for both academic and commercial research.

anonymised health data, computer scientist, knowledge management, (2 more...)

#artificialintelligence

Technology:

Information Technology > Data Science (1.00)
Information Technology > Biomedical Informatics > Clinical Informatics (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Scientific Discovery (0.67)

Data Discovery for ML Engineers / DataScienceCentral.com

#artificialintelligence, Apr-5-2022, 23:43:03 GMT

Real-world production ML systems consist of two main components: data and code. Data is clearly the leader, and rapidly taking center stage. Data defines the quality of almost any ML-based product, more so than code or any other aspect. In Feature Store as a Foundation for Machine Learning, we have discussed how feature stores are an integral part of the machine learning workflow. They improve the ROI of data engineering, reduce cost per model, and accelerate model-to-market by simplifying feature definition and extraction.

data catalog, data discovery, feature store, (9 more...)

#artificialintelligence

Genre: Workflow (0.71)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Scientific Discovery (0.47)

Exclusive Interview with Naren Vijay, EVP of Lumenore

#artificialintelligence, Jan-21-2022, 10:09:18 GMT

Organizational intelligence (OI) is the capability of an organization to comprehend and create knowledge relevant to its purpose. In other words, it is the intellectual capacity of the entire organization. Lumenore is a powerful, intuitive, and cloud-based BI and analytics platform that delivers organizational intelligence by sifting data from any business application. Analytics Insight has engaged in an exclusive interview with Naren Vijay, EVP of Lumenore.

business intelligence, intelligence, lumenore, (13 more...)

#artificialintelligence

Genre: Personal > Interview (0.61)

Industry: Information Technology > Software (0.56)

Technology:

Information Technology > Artificial Intelligence (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.79)
