dirty data
RECOST: External Knowledge Guided Data-efficient Instruction Tuning
Zhang, Qi, Zhang, Yiming, Wang, Haobo, Zhao, Junbo
In the current landscape of large language models (LLMs), instruction tuning serves as an essential step. Considering its high computational overhead, data-efficient instruction tuning was proposed to reduce the training data size in this process, aiming to select high-quality instructional data. Nevertheless, we argue that most current data-efficient instruction-tuning methods are highly dependent on the quality of the original instruction-tuning dataset. When it comes to datasets synthesized by LLMs, a common scenario in this field, dirty samples may even be selected with higher probability than other samples. To address these challenges, we utilize external knowledge (relevant examples or paragraphs) to evaluate samples synthesized by LLMs with an in-context-based relative predictive entropy. Based on the new metric, we propose a framework, dubbed RECOST, which integrates external-knowledge-based re-ranking and diversity-consistent sampling into a single pipeline. Through extensive experiments on several synthetic datasets (Alpaca and Alpaca-gpt4), we demonstrate the effectiveness of our method and achieve even better results with only 1% of the full dataset.
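A minimal sketch of the kind of external-knowledge-guided scoring the abstract describes, not the authors' RECOST implementation: "gpt2" stands in for whatever base model is being tuned, `retrieve` is an assumed user-supplied retriever over external knowledge, and the entropy-difference formula is one plausible reading of the metric rather than the paper's exact definition.

```python
# Hedged sketch, not the authors' code: score a synthetic sample by how much
# retrieved external knowledge reduces the model's predictive entropy on it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # stand-in base model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def response_entropy(prompt: str, response: str) -> float:
    """Mean token-level predictive entropy of `response` given `prompt`."""
    ids = tokenizer(prompt + response, return_tensors="pt")
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        logits = model(**ids).logits[0]
    probs = torch.softmax(logits[prompt_len - 1:-1], dim=-1)  # next-token dists over the response
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
    return entropy.mean().item()

def relative_score(sample: dict, retrieve) -> float:
    """Entropy drop when the response is conditioned on retrieved external knowledge."""
    plain = response_entropy(sample["instruction"], sample["output"])
    context = "\n".join(retrieve(sample))                      # relevant examples or paragraphs
    grounded = response_entropy(context + "\n" + sample["instruction"], sample["output"])
    return plain - grounded

# Rank the synthetic dataset and keep a small budget (e.g., the top 1%):
# ranked = sorted(dataset, key=lambda s: relative_score(s, retrieve), reverse=True)
# selected = ranked[: max(1, len(ranked) // 100)]
```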
Causal Strategic Classification: A Tale of Two Shifts
When users can benefit from certain predictive outcomes, they may be prone to act to achieve those outcomes, e.g., by strategically modifying their features. The goal in strategic classification is therefore to train predictive models that are robust to such behavior. However, the conventional framework assumes that changing features does not change actual outcomes, which depicts users as "gaming" the system. Here we remove this assumption and study learning in a causal strategic setting where true outcomes do change. Focusing on accuracy as our primary objective, we show how strategic behavior and causal effects underlie two complementary forms of distribution shift. We characterize these shifts and propose a learning algorithm that balances these two forces over time and permits end-to-end training. Experiments on synthetic and semi-synthetic data demonstrate the utility of our approach.
- Asia > Middle East > Israel > Southern District > Eilat (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > Middle East > Israel > Haifa District > Haifa (0.04)
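A toy numpy simulation of the two shifts the abstract describes, not the paper's algorithm: users best-respond to a deployed linear rule (strategic shift), and their modified features causally change true outcomes (causal shift). The costs, effect sizes, and the naive retraining rule are all illustrative assumptions.

```python
# Toy illustration only: strategic best responses plus an assumed causal outcome model.
import numpy as np

rng = np.random.default_rng(0)
n, d, cost, rounds = 2000, 2, 2.0, 5

def true_label(x):
    # assumed causal mechanism: outcomes genuinely depend on the (possibly modified) features
    return (x @ np.array([1.0, 0.5]) + 0.1 * rng.standard_normal(len(x)) > 0).astype(int)

def best_response(x, w, b):
    # move along w just far enough to be classified positive, if the quadratic
    # cost of moving is smaller than the unit benefit of a positive prediction
    margin = x @ w + b
    step = np.clip(-margin / (w @ w), 0.0, None)
    move = step[:, None] * w[None, :]
    worth_it = 0.5 * cost * step**2 * (w @ w) < 1.0
    return x + np.where(worth_it[:, None], move, 0.0)

w, b = np.array([1.0, 0.0]), 0.0                       # initially deployed rule
for t in range(rounds):
    x = rng.standard_normal((n, d))
    x_shifted = best_response(x, w, b)                 # strategic shift against the deployed rule
    y = true_label(x_shifted)                          # causal shift: outcomes follow the moved features
    acc = ((x_shifted @ w + b > 0).astype(int) == y).mean()
    print(f"round {t}: accuracy of deployed rule under both shifts = {acc:.3f}")
    # naive retraining on the induced distribution (least squares on +/-1 labels)
    X = np.hstack([x_shifted, np.ones((n, 1))])
    theta, *_ = np.linalg.lstsq(X, 2.0 * y - 1.0, rcond=None)
    w, b = theta[:2], theta[2]
```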
AutoCure: Automated Tabular Data Curation Technique for ML Pipelines
Abdelaal, Mohamed, Koparde, Rashmi, Schoening, Harald
Machine learning algorithms have become increasingly prevalent in multiple domains, such as autonomous driving, healthcare, and finance. In such domains, data preparation remains a significant challenge in developing accurate models, requiring considerable expertise and time to search the huge space of well-suited data curation and transformation tools. To address this challenge, we present AutoCure, a novel and configuration-free data curation pipeline that improves the quality of tabular data. Unlike traditional data curation methods, AutoCure synthetically enhances the density of the clean data fraction through an adaptive ensemble-based error detection method and a data augmentation module. In practice, AutoCure can be integrated with open-source tools, e.g., Auto-sklearn, H2O, and TPOT, to promote the democratization of machine learning. As a proof of concept, we provide a comparative evaluation of AutoCure against 28 combinations of traditional data curation tools, demonstrating superior performance and predictive accuracy without user intervention. Our evaluation shows that AutoCure is an effective approach to automating data preparation and improving the accuracy of machine learning models.
- Europe > Germany > Saxony-Anhalt > Magdeburg (0.04)
- Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
- Europe > Spain > Valencian Community > Valencia Province > Valencia (0.04)
- Health & Medicine (0.50)
- Transportation > Ground > Road (0.34)
- Information Technology > Robotics & Automation (0.34)
- Automobiles & Trucks (0.34)
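A hedged sketch of the two ingredients the abstract names, ensemble-based error detection and augmentation of the clean data fraction. This is not AutoCure's actual implementation; the detectors, voting threshold, and jitter scale are illustrative assumptions.

```python
# Illustrative sketch, not AutoCure itself: vote among simple error detectors,
# then oversample the clean rows so they dominate downstream training.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def flag_dirty_rows(df: pd.DataFrame, vote_threshold: int = 2) -> pd.Series:
    """A row is flagged dirty if enough detectors agree."""
    num = df.select_dtypes(include="number")
    filled = num.fillna(num.median())
    votes = pd.DataFrame(index=df.index)
    votes["missing"] = df.isna().any(axis=1)
    z = (num - num.mean()) / num.std(ddof=0)
    votes["zscore"] = (z.abs() > 3).any(axis=1)
    votes["isoforest"] = IsolationForest(random_state=0).fit_predict(filled) == -1
    return votes.sum(axis=1) >= vote_threshold

def augment_clean_fraction(df: pd.DataFrame, dirty: pd.Series, factor: int = 2) -> pd.DataFrame:
    """Oversample clean rows with small jitter on numeric columns."""
    clean = df[~dirty]
    extra = clean.sample(len(clean) * (factor - 1), replace=True, random_state=0).reset_index(drop=True)
    num_cols = extra.select_dtypes(include="number").columns
    rng = np.random.default_rng(0)
    jitter = rng.standard_normal((len(extra), len(num_cols))) * 0.01 * clean[num_cols].std(ddof=0).values
    extra[num_cols] = extra[num_cols].values + jitter
    return pd.concat([clean, extra], ignore_index=True)

# dirty = flag_dirty_rows(df)
# curated = augment_clean_fraction(df, dirty)   # hand `curated` to Auto-sklearn, H2O, or TPOT
```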
The Importance of Data Preprocessing for Machine Learning in the E-Commerce Industry
Big data, as the name suggests, refers to large volumes of varied data that arrive at high velocity. Because big data is collected from raw, unprocessed sources, it is bound to contain dirty data. Data preprocessing is the process of transforming raw data into an understandable format that is ready for analytical use. Machine learning is a subset of artificial intelligence and an analytical application used to make decisions, without explicit programming, by receiving and analyzing data. The e-commerce industry revolves around the application of technology to commercial business.
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Data Science > Data Mining > Big Data (0.70)
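To make the notion of preprocessing concrete, here is a small pandas sketch over a hypothetical e-commerce orders table; the column names ("order_id", "price", "order_date", "country") are assumptions for illustration, not taken from the article above.

```python
# Illustrative preprocessing of a hypothetical e-commerce orders table.
import pandas as pd

def preprocess_orders(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()
    df = df.drop_duplicates(subset="order_id")                    # remove duplicate orders
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    df = df.dropna(subset=["order_date", "price"])                # drop rows that failed parsing
    df["country"] = df["country"].str.strip().str.upper()         # normalize categorical text
    return df[df["price"] > 0]                                    # discard impossible values
```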
Dirty Data -- Quality Assessment & Cleaning Measures - DataScienceCentral.com
In the book 'Bad Data Handbook', Q. Ethan McCallum has rightly said, "We all say we like data, but it's not the data but the insights we derive from it that we care about." Yet, a data analyst gets to dedicate only 20% of her time to the art and science of generating insights from data. The rest of her time is spent structuring and cleaning the data. To minimize the time invested in data cleaning, there is a need for standardized frameworks and tools that work for diverse data and business use cases across industries, functions, and domains. This blog aims to equip you with the knowledge you need to build and execute such standardized data quality frameworks for your data and use cases.
5 Data Science Projects to Learn 5 Critical Data Science Skills - KDnuggets
If you're trying to break into the data science industry, it can be great to get some projects under your belt. Doing data science projects helps you develop the skills you'll need to work as a data scientist. You'll also have a product you can put on your resume and discuss during interviews, which is critical to show you know what you're doing. The data science development cycle is the main pattern of any data science project, whether it's for a company or for your own personal project. You'll need to be comfortable with data collection, cleaning, modeling, and visualization to be a proficient data scientist. The specific tool stack you use at your future data science job may vary from the tools I recommend below, but like anything in the computer science world, it's more about learning how to think than about the specific syntax or features of one tool over another.
Automate annotation of image training data with Amazon Rekognition
Every machine learning (ML) model demands data to train it. If your model isn't predicting Titanic survival or iris species, then acquiring a dataset might be one of the most time-consuming parts of your model-building process--second only to data cleaning. What data cleaning looks like varies from dataset to dataset. For example, the following is a set of images tagged robin that you might want to use to train an image recognition model on bird species. That nest might count as dirty data, and some model applications may make it inappropriate to include American and European robins in the same category, but this seems pretty good so far.
- Information Technology > Security & Privacy (0.52)
- Retail > Online (0.40)
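A hedged sketch of using Amazon Rekognition's label detection (via boto3) to screen images such as the robin example above; the bucket and key names are placeholders, and the article's actual annotation pipeline may differ.

```python
# Placeholder bucket/keys; uses the standard boto3 Rekognition DetectLabels call.
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

def looks_like_bird(bucket: str, key: str, min_confidence: float = 80.0) -> bool:
    """Return True if Rekognition tags the S3 image as containing a bird."""
    response = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=20,
        MinConfidence=min_confidence,
    )
    return "Bird" in {label["Name"] for label in response["Labels"]}

# keep = [k for k in candidate_keys if looks_like_bird("my-training-bucket", k)]
```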
AI can be Effective in Handling Dirty Data In Supply Chain
Dirty data has the potential to break a well-established project. Truth be told, dirty data goes far beyond duplicate, incomplete, and invalid data. This is why 80% of a data scientist's job involves cleaning dirty data. Business organizations and industries strive to rid themselves of the dirty data that accumulates daily. No matter how much effort is put into cleaning the erroneous data, traces of it are always left behind, and these too can cause unwanted trouble.
- Information Technology > Artificial Intelligence (1.00)
- Information Technology > Data Science > Data Quality > Data Cleaning (0.53)
Testing the Robustness of AutoML Systems
Halvari, Tuomas, Nurminen, Jukka K., Mikkonen, Tommi
Automated machine learning (AutoML) systems aim to find the best machine learning (ML) pipeline that automatically matches the task and data at hand. We investigate the robustness of machine learning pipelines generated with three AutoML systems: TPOT, H2O, and AutoKeras. In particular, we study the influence of dirty data on accuracy and consider how using dirty training data may help create more robust solutions. We also analyze how the structure of the generated pipelines differs across these cases.
- Europe > Finland > Uusimaa > Helsinki (0.04)
- North America > United States > New York > New York County > New York City (0.04)
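An illustrative robustness check in the spirit of the study above, not the authors' exact protocol: missing values and label noise are injected into the training split before letting TPOT (via its classic scikit-learn-style API) search for a pipeline. The dataset and noise rates are arbitrary choices.

```python
# Illustrative only: inject dirt into training data, then run an AutoML search.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
X_dirty, y_dirty = X_train.copy(), y_train.copy()
X_dirty[rng.random(X_dirty.shape) < 0.10] = np.nan      # 10% missing cells
flip = rng.random(len(y_dirty)) < 0.05                   # 5% label noise
y_dirty[flip] = 1 - y_dirty[flip]

X_dirty = SimpleImputer(strategy="median").fit_transform(X_dirty)  # TPOT expects no NaNs

tpot = TPOTClassifier(generations=3, population_size=20, random_state=0, verbosity=0)
tpot.fit(X_dirty, y_dirty)
print("accuracy on clean test data:", tpot.score(X_test, y_test))
```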
CIOs Discuss the Promise of AI and Data Science
A few years ago, I asked CIOs about data science and it turned into a yawner of a discussion. However, in the last few years as chief data officers have made their mark at more and more enterprises, CIOs have needed to build their data chops. Given this, it was time to assess where CIOs are today. To do this, I ran a #CIOChat on AI and Data Science. From this discussion, it was clear CIOs are spending more time considering the "I" part of their titles.