 data discovery


From keywords to semantics: Perceptions of large language models in data discovery

Halstead, Maura E, Green, Mark A., Jay, Caroline, Kingston, Richard, Topping, David, Singleton, Alexander

arXiv.org Artificial Intelligence

This matching requires researchers to know the exact wording that other researchers previously used, creating a challenging process that could lead to missing relevant data. Large Language Models (LLMs) could enhance data discovery by removing this requirement and allowing researchers to ask questions with natural language. However, we do not currently know if researchers would accept LLMs for data discovery. Using a human-centered artificial intelligence (HCAI) focus, we ran focus groups (N = 27) to understand researchers' perspectives towards LLMs for data discovery. Our conceptual model shows that the potential benefits are not enough for researchers to use LLMs instead of current technology. Barriers prevent researchers from fully accepting LLMs, but features around transparency could overcome them. Using our model will allow developers to incorporate features that result in an increased acceptance of LLMs for data discovery.


LLM-based Multi-Agent Blackboard System for Information Discovery in Data Science

Salemi, Alireza, Parmar, Mihir, Goyal, Palash, Song, Yiwen, Yoon, Jinsung, Zamani, Hamed, Palangi, Hamid, Pfister, Tomas

arXiv.org Artificial Intelligence

The rapid advancement of Large Language Models (LLMs) has opened new opportunities in data science, yet their practical deployment is often constrained by the challenge of discovering relevant data within large heterogeneous data lakes. Existing methods struggle with this: single-agent systems are quickly overwhelmed by the large, heterogeneous files in these data lakes, while multi-agent systems designed based on a master-slave paradigm depend on a rigid central controller for task allocation that requires precise knowledge of each sub-agent's capabilities. To address these limitations, we propose a novel multi-agent communication paradigm inspired by the blackboard architecture for traditional AI models. In this framework, a central agent posts requests to a shared blackboard, and autonomous subordinate agents -- either responsible for a partition of the data lake or general information retrieval -- volunteer to respond based on their capabilities. This design improves scalability and flexibility by eliminating the need for a central coordinator to have prior knowledge of all sub-agents' expertise. We evaluate our method on three benchmarks that require explicit data discovery: KramaBench and modified versions of DS-Bench and DA-Code to incorporate data discovery. Experimental results demonstrate that the blackboard architecture substantially outperforms baselines, including RAG and the master-slave multi-agent paradigm, achieving between 13% and 57% relative improvement in end-to-end task success and up to a 9% relative gain in F1 score for data discovery over the best-performing baselines across both proprietary and open-source LLMs. Our findings establish the blackboard paradigm as a scalable and generalizable communication framework for multi-agent systems.
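The post-and-volunteer pattern described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Blackboard` and `PartitionAgent` classes, the `run_round` helper, and the keyword-overlap rule that stands in for an LLM deciding whether to volunteer are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Blackboard:
    """Shared space holding posted requests and volunteered responses."""
    requests: list = field(default_factory=list)
    responses: dict = field(default_factory=dict)

    def post(self, request: str) -> int:
        # The central agent posts a request and gets back its id.
        self.requests.append(request)
        return len(self.requests) - 1

    def respond(self, request_id: int, agent: str, answer: str) -> None:
        # Any agent may attach a response to a posted request.
        self.responses.setdefault(request_id, []).append((agent, answer))


@dataclass
class PartitionAgent:
    """Hypothetical helper agent responsible for one data-lake partition."""
    name: str
    files: dict  # file name -> short description of its contents

    def maybe_volunteer(self, board: Blackboard, request_id: int) -> None:
        # Assumption: word overlap replaces the LLM's judgment of
        # whether this agent's partition is relevant to the request.
        words = set(board.requests[request_id].lower().split())
        for fname, desc in self.files.items():
            if words & set(desc.lower().split()):
                board.respond(request_id, self.name, fname)


def run_round(board: Blackboard, agents: list, request: str) -> list:
    """Post one request and let every agent decide whether to answer."""
    rid = board.post(request)
    for agent in agents:
        agent.maybe_volunteer(board, rid)
    return board.responses.get(rid, [])
```

Note that `run_round` never inspects what each agent holds; it only posts to the board and collects whatever was volunteered, which is the property the abstract attributes to the blackboard design.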


Impact and influence of modern AI in metadata management

Yang, Wenli, Fu, Rui, Amin, Muhammad Bilal, Kang, Byeong

arXiv.org Artificial Intelligence

Metadata management plays a critical role in data governance, resource discovery, and decision-making in the data-driven era. While traditional metadata approaches have primarily focused on organization, classification, and resource reuse, the integration of modern artificial intelligence (AI) technologies has significantly transformed these processes. This paper investigates both traditional and AI-driven metadata approaches by examining open-source solutions, commercial tools, and research initiatives. A comparative analysis of traditional and AI-driven metadata management methods is provided, highlighting existing challenges and their impact on next-generation datasets. The paper also presents an innovative AI-assisted metadata management framework designed to address these challenges. This framework leverages more advanced modern AI technologies to automate metadata generation, enhance governance, and improve the accessibility and usability of modern datasets. Finally, the paper outlines future directions for research and development, proposing opportunities to further advance metadata management in the context of AI-driven innovation and complex datasets.


GNN: Graph Neural Network and Large Language Model for Data Discovery

Hoang, Thomas

arXiv.org Artificial Intelligence

Our algorithm GNN: Graph Neural Network and Large Language Model for Data Discovery inherits the benefits of PLOD (Predictive Learning Optimal Data Discovery) and BOD (Blindly Optimal Data Discovery): it overcomes the need to predefine a utility function and to collect human input for attribute ranking, which helps prevent a time-consuming loop process. In addition to these previous works, our algorithm GNN leverages the advantages of graph neural networks and large language models to understand text type values that cannot be understood by PLOD and MOD, thus making the task of predicting outcomes more reliable. GNN could be seen as an extension of PLOD in terms of understanding the text type value and the user's preferences, covering not only numerical values but also text values, which supports data science and analytics purposes.


CMDBench: A Benchmark for Coarse-to-fine Multimodal Data Discovery in Compound AI Systems

Feng, Yanlin, Rahman, Sajjadur, Feng, Aaron, Chen, Vincent, Kandogan, Eser

arXiv.org Artificial Intelligence

Compound AI systems (CASs) that employ LLMs as agents to accomplish knowledge-intensive tasks via interactions with tools and data retrievers have garnered significant interest within database and AI communities. While these systems have the potential to supplement typical analysis workflows of data analysts in enterprise data platforms, unfortunately, CASs are subject to the same data discovery challenges that analysts have encountered over the years -- silos of multimodal data sources, created across teams and departments within an organization, make it difficult to identify appropriate data sources for accomplishing the task at hand. Existing data discovery benchmarks do not model such multimodality and multiplicity of data sources. Moreover, benchmarks of CASs prioritize only evaluating end-to-end task performance. To catalyze research on evaluating the data discovery performance of multimodal data retrievers in CASs within a real-world setting, we propose CMDBench, a benchmark modeling the complexity of enterprise data platforms. We adapt existing datasets and benchmarks in open-domain -- from question answering and complex reasoning tasks to natural language querying over structured data -- to evaluate coarse- and fine-grained data discovery and task execution performance. Our experiments reveal the impact of data retriever design on downstream task performance -- a 46% drop in task accuracy on average -- across various modalities, data sources, and task difficulty. The results indicate the need to develop optimization strategies to identify appropriate LLM agents and retrievers for efficient execution of CASs over enterprise data.


METAM: Goal-Oriented Data Discovery

Galhotra, Sainyam, Gong, Yue, Fernandez, Raul Castro

arXiv.org Artificial Intelligence

Data is a central component of machine learning and causal inference tasks. The availability of large amounts of data from sources such as open data repositories, data lakes and data marketplaces creates an opportunity to augment data and boost those tasks' performance. However, augmentation techniques rely on a user manually discovering and shortlisting useful candidate augmentations. Existing solutions do not leverage the synergy between discovery and augmentation, thus underexploiting data. In this paper, we introduce METAM, a novel goal-oriented framework that queries the downstream task with a candidate dataset, forming a feedback loop that automatically steers the discovery and augmentation process. To select candidates efficiently, METAM leverages properties of: i) the data, ii) the utility function, and iii) the solution set size. We show METAM's theoretical guarantees and demonstrate those empirically on a broad set of tasks. All in all, we demonstrate the promise of goal-oriented data discovery to modern data science applications.
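The feedback loop described in this abstract, in which the downstream task is queried with candidate augmentations, can be sketched as a simple greedy search. This is not METAM's algorithm and carries none of its guarantees: the function name, the `utility` callback standing in for the downstream task, and the `max_size` cap are illustrative assumptions.

```python
from typing import Callable, Hashable, Sequence


def greedy_goal_oriented_discovery(
    candidates: Sequence[Hashable],
    utility: Callable[[list], float],
    max_size: int,
):
    """Greedily add the candidate that most improves the task utility.

    `utility` is a hypothetical callback that runs the downstream task
    on a list of selected augmentations and returns a score.
    """
    selected = []
    best = utility(selected)  # baseline score with no augmentation
    remaining = list(candidates)
    while remaining and len(selected) < max_size:
        # Query the downstream task once per remaining candidate.
        scored = [(utility(selected + [c]), c) for c in remaining]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no candidate improves the task; stop early
        selected.append(top)
        remaining.remove(top)
        best = top_score
    return selected, best
```

Each round costs one task query per remaining candidate, which is why a real system would need ways of pruning candidates, such as the properties of the data, utility function, and solution set size that the abstract mentions.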


Computer scientist, Data scientist or similar with a focus on knowledge management (f/m/x) - Data Discovery for Anonymised Health Data

#artificialintelligence

The focus of the DLR Institute for Data Science in Jena is to find solutions for the major challenges of the digitalisation age. The research focuses on the areas of data extraction and mobilisation, data management and preparation, and data analysis and intelligence. The position is part of the BMBF project Avatar (anonymisation of personal health data by creating virtual avatars). Topics include, in particular, the semantic modelling of relevant metadata and data discovery. The overall goal of the project is providing anonymised health data for both academic and commercial research.


Data Discovery for ML Engineers / DataScienceCentral.com

#artificialintelligence

Real-world production ML systems consist of two main components: data and code. Data is clearly the leader, and rapidly taking center stage. Data defines the quality of almost any ML-based product, more so than code or any other aspect. In Feature Store as a Foundation for Machine Learning, we have discussed how feature stores are an integral part of the machine learning workflow. They improve the ROI of data engineering, reduce cost per model, and accelerate model-to-market by simplifying feature definition and extraction.


Exclusive Interview with Naren Vijay, EVP of Lumenore

#artificialintelligence

Organizational intelligence (OI) is the capability of an organization to comprehend and create knowledge relevant to its purpose. In other words, it is the intellectual capacity of the entire organization. Lumenore is a powerful, intuitive, and cloud-based BI and analytics platform that delivers organizational intelligence by sifting data from any business application. Analytics Insight has engaged in an exclusive interview with Naren Vijay, EVP of Lumenore.


How advanced AI tools can give organisations a holistic understanding of their data and improve compliance

#artificialintelligence

It doesn't generate revenue, but it is an essential part of operating effectively as a business today. Whether it's industry-specific regulations, or the standout regulation of our time--GDPR--we are all acutely aware of the damage, both reputational and financial, that non-compliance can cause. GDPR has equipped employees across industries with an appreciation of the context, usage, and security of data, but there is another factor that is essential for establishing an effective data strategy: data discoverability. To ensure regulatory compliance, data must not only be secure, it must also be discoverable so that compliance personnel can locate all information needed to prove compliance. Increasingly, AI tools are being harnessed to automate workflows and governance, but such capabilities can only be delivered when a strong data foundation is in place.