Data Catalog


Exploring LLM Capabilities in Extracting DCAT-Compatible Metadata for Data Cataloging

Busch, Lennart, Tebernum, Daniel, Velarde, Gissel

arXiv.org Artificial Intelligence

Efficient data exploration is crucial as data becomes increasingly important for accelerating processes, improving forecasts and developing new business models. Data consumers often spend 25-98% of their time searching for suitable data due to the exponential growth, heterogeneity and distribution of data. Data catalogs can support and accelerate data exploration by using metadata to answer user queries. However, as metadata creation and maintenance is often a manual process, it is time-consuming and requires expertise. This study investigates whether LLMs can automate metadata maintenance of text-based data and generate high-quality DCAT-compatible metadata. We tested zero-shot and few-shot prompting strategies with LLMs from different vendors for generating metadata such as titles and keywords, along with a fine-tuned model for classification. Our results show that LLMs can generate metadata comparable to human-created content, particularly on tasks that require advanced semantic understanding. Larger models outperformed smaller ones, fine-tuning significantly improved classification accuracy, and few-shot prompting yielded better results in most cases. Although LLMs offer a faster and more reliable way to create metadata, a successful application requires careful consideration of task-specific criteria and domain context.
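The zero-shot and few-shot strategies the study compares amount to two ways of constructing a prompt. A minimal sketch follows; the DCAT field names (title, description, keyword) come from the standard vocabulary, but the prompt wording and example documents are illustrative assumptions, not the authors' actual prompts.

```python
# Sketch of the two prompting strategies; the wording is hypothetical.
DCAT_FIELDS = ("title", "description", "keyword")

def zero_shot_prompt(document: str) -> str:
    """Ask the model directly, with no worked examples."""
    fields = ", ".join(DCAT_FIELDS)
    return (
        f"Extract DCAT-compatible metadata ({fields}) as JSON "
        f"from the following document:\n\n{document}"
    )

def few_shot_prompt(document: str, examples: list) -> str:
    """Prepend (document, metadata) pairs so the model can imitate them."""
    shots = "\n\n".join(
        f"Document: {doc}\nMetadata: {meta}" for doc, meta in examples
    )
    return (
        "Extract DCAT-compatible metadata for the last document, "
        f"following the examples.\n\n{shots}\n\n"
        f"Document: {document}\nMetadata:"
    )

# Invented example pair and target document, for illustration only.
examples = [("Monthly air quality readings for Berlin.",
             '{"title": "Berlin Air Quality", "keyword": ["air", "berlin"]}')]
prompt = few_shot_prompt("Daily river levels for the Rhine.", examples)
```

The few-shot variant simply places worked examples before the target document and ends at "Metadata:", so the model's completion is the metadata itself.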


Leveraging Retrieval Augmented Generative LLMs For Automated Metadata Description Generation to Enhance Data Catalogs

Singh, Mayank, Kumar, Abhijeet, Donaparthi, Sasidhar, Karambelkar, Gayatri

arXiv.org Artificial Intelligence

Data catalogs serve as repositories for organizing and accessing a diverse collection of data assets, but their effectiveness hinges on the ease with which business users can look up relevant content. Unfortunately, many data catalogs within organizations suffer from limited searchability due to inadequate metadata such as asset descriptions. Hence, there is a need for a content-generation solution to enrich and curate metadata in a scalable way. This paper explores the challenges associated with metadata creation and proposes a prompt-enrichment idea: leveraging existing metadata content through a retrieval-based few-shot technique tied to generative large language models (LLMs). The paper also considers fine-tuning an LLM on existing content and compares the behavior of few-shot pretrained LLMs (Llama, GPT-3.5) with a few-shot fine-tuned LLM (Llama2-7b), evaluating their performance on accuracy, factual grounding, and toxicity. Our preliminary results show more than 80% Rouge-1 F1 for the generated content, implying that 87-88% of instances were accepted as is or curated with minor edits by data stewards. By automatically generating accurate descriptions for tables and columns, the research provides an overall framework for enterprises to scale metadata curation and enrich their data catalogs, thereby vastly improving searchability and overall usability.
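Rouge-1 F1, the metric the paper reports, is the harmonic mean of unigram precision and recall between a generated description and a reference one. A minimal sketch using whitespace tokens:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap Rouge-1 F1 between generated and reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Invented column descriptions, for illustration only.
score = rouge1_f1(
    "customer identifier column unique per account",
    "unique identifier for each customer account",
)
```

Production evaluation libraries such as `rouge-score` additionally apply stemming and tokenization rules, so scores from this simplified version will not match theirs exactly.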


LEDD: Large Language Model-Empowered Data Discovery in Data Lakes

An, Qi, Ying, Chihua, Zhu, Yuqing, Xu, Yihao, Zhang, Manwei, Wang, Jianmin

arXiv.org Artificial Intelligence

Data discovery in data lakes with ever-increasing datasets has long been recognized as a major challenge in data management, especially for semantic table search and hierarchical global catalog generation. While large language models (LLMs) facilitate the processing of data semantics, challenges remain in architecting an end-to-end system that comprehensively exploits LLMs for these two semantics-related tasks. In this demo, we propose LEDD, an end-to-end system with an extensible architecture that leverages LLMs to provide hierarchical global catalogs with semantic meaning and semantic table search for data lakes. Specifically, LEDD can return semantically related tables based on a natural-language specification. These features make LEDD an ideal foundation for downstream tasks such as model training and schema linking for text-to-SQL. LEDD also provides a simple Python interface to facilitate the extension and replacement of data discovery algorithms.
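The semantic table search LEDD describes can be approximated by ranking table descriptions against a natural-language query by embedding similarity. The sketch below substitutes a bag-of-words vector and cosine similarity for the LLM embeddings LEDD actually uses; the table names and descriptions are invented.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for an LLM embedding: a simple bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_search(query: str, tables: dict, k: int = 3) -> list:
    """Return the k table names whose descriptions best match the query."""
    q = embed(query)
    ranked = sorted(tables, key=lambda n: cosine(q, embed(tables[n])),
                    reverse=True)
    return ranked[:k]

tables = {
    "orders": "customer purchase orders with timestamps and totals",
    "sensors": "iot sensor readings temperature humidity",
    "payroll": "employee salary and payroll records",
}
top = semantic_search("customer purchases", tables, k=1)
```

Swapping `embed` for a real embedding model is the point of LEDD's extensible Python interface: the ranking logic stays the same while the representation improves.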


Metadata driven development realises "smart manufacturing" of data ecosystems – blog 3 - Solita Data

#artificialintelligence

This is the third part of the blog series. The first blog focused on the maturity model and explained how the large monolithic data warehouses were created. The second blog focused on metadata-driven development, or "smart manufacturing," of data ecosystems. This third blog talks about reverse engineering, or how existing data assets can be discovered to accelerate the development of new data products. Companies face increasing pressure to address data silos to reduce cost, improve agility, and accelerate innovation, but they struggle to deliver value from their data assets. Many companies have hundreds of systems containing thousands of databases, hundreds of thousands of tables, millions of columns, and millions of lines of code across many different technologies. The starting point is a "data spaghetti" that nobody knows well.


AWS launches DataZone, a new ML-based data management service • TechCrunch

#artificialintelligence

At its re:Invent conference, AWS today announced Amazon DataZone, a new data management service that can help enterprises catalog, discover, share and -- most importantly -- govern their data. The nifty part here is that AWS is using machine learning to help businesses build these data catalogs and generate the metadata to make it searchable. "To unlock the full power, the full value of data, we need to make it easy for the right people and applications to find, access and share the right data when they need it -- and to keep data safe and secure," AWS CEO Adam Selipsky said in today's keynote. The tool will provide users with fine-grained controls to manage and govern this data. Ensuring that the right users have access to the right data, without compromising personally identifiable information, for example, has long been a major problem for enterprises, and it has only gotten harder as the amount of data has increased.


What is Data Governance? Top Data Governance Tools for Data Science and Machine Learning Research in 2022

#artificialintelligence

The process of developing internal data standards and enacting rules governing who has access to data and how it is utilized for analytical applications and business operations is known as data governance. A good data governance program guarantees that data is reliable, consistent, and accessible, and that its use complies with applicable rules and regulations regarding data protection. It frequently includes data quality improvement initiatives alongside master data management (MDM) projects. Data governance software offers features that facilitate the formulation of governance policies, the construction of business glossaries and data catalogs, data mapping and classification, workflow management, collaboration, and process documentation. Such software can be used in conjunction with MDM, metadata management, and data quality solutions. Data governance aims to promote confident decisions supported by solid data resources; building policies that define data ownership, duties, and delegates is its goal.


Data Discovery for ML Engineers / DataScienceCentral.com

#artificialintelligence

Real-world production ML systems consist of two main components: data and code. Data is clearly the leader and is rapidly taking center stage; it defines the quality of almost any ML-based product, more so than code or any other aspect. In Feature Store as a Foundation for Machine Learning, we discussed how feature stores are an integral part of the machine learning workflow. They improve the ROI of data engineering, reduce cost per model, and accelerate model-to-market by simplifying feature definition and extraction.


Alation Acquires Artificial Intelligence Vendor Lyngo Analytics

#artificialintelligence

Alation Inc., the leader in enterprise data intelligence solutions, today announced the acquisition of Lyngo Analytics, a Los Altos, Calif.-based data insights company. The acquisition will elevate the business user experience within the data catalog, scale data intelligence, and help organizations drive data culture. Lyngo Analytics CEO and co-founder Jennifer Wu and CTO and co-founder Joachim Rahmfeld will join the company. Lyngo Analytics uses a natural language interface to empower users to discover data and insights by asking questions in simple, familiar business terms. Alation offers the most intelligent and user-friendly machine-learning data catalog on the market.


Data Catalog

#artificialintelligence

Data is key for the success of any business, and this is more relevant than ever before in the current crisis that industry and mankind are facing. Data insights will be a key driver in dealing with the situation of COVID-19 and it will be instrumental in finding the cure as well. Data insights are also important for the financial industry to read the current and upcoming market trends as events unfold every day. After spending two decades of my career in the financial industry, I have realized that most firms lag in data maturity, and this crisis is revealing many loopholes in their governance process. As I start my journey into retail and transportation with my recent client, I am realizing that never before was data so important for the retail sector, and especially for grocers as it is now.


Application of Artificial Intelligence in Business Transformation

#artificialintelligence

Artificial Intelligence, or AI, is the use of algorithms that simulate human behavior to perform cognitive functions. AI is used to solve problems through interaction, learning, visual perception, planning, reasoning, and natural language processing. Artificial Intelligence is a broad and generic term for computer software that engages in human-like processes. Thus, calling an application AI might be correct but will not cover its specifics. The most widely held notion about AI comes from sci-fi movies.