dirty data
RECOST: External Knowledge Guided Data-efficient Instruction Tuning
Zhang, Qi, Zhang, Yiming, Wang, Haobo, Zhao, Junbo
In the current landscape of large language models (LLMs), instruction tuning serves as an essential step. Considering its high computational overhead, data-efficient instruction tuning was proposed to reduce the training data size in this process, aiming to select high-quality instructional data. Nevertheless, we argue that most current data-efficient instruction-tuning methods are highly dependent on the quality of the original instruction-tuning dataset. When it comes to datasets synthesized by LLMs, a common scenario in this field, dirty samples may even be selected with higher probability than other samples. To address these challenges, we utilize external knowledge (relevant examples or paragraphs) to evaluate samples synthesized by LLMs with an in-context-based relative predictive entropy. Based on the new metric, we propose a framework, dubbed RECOST, which integrates external-knowledge-based re-ranking and diversity-consistent sampling into a single pipeline. Through extensive experiments on several synthetic datasets (Alpaca and Alpaca-gpt4), we demonstrate the effectiveness of our method and achieve even better results with only 1% of the full dataset.
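A minimal sketch of the kind of external-knowledge-guided scoring the abstract describes, not the authors' RECOST implementation: "gpt2" stands in for whatever base model is being tuned, `retrieve` is an assumed user-supplied retriever over external knowledge, and the entropy-difference formula is one plausible reading of the metric rather than the paper's exact definition.

```python
# Hedged sketch, not the authors' code: score a synthetic sample by how much
# retrieved external knowledge reduces the model's predictive entropy on it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # stand-in base model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def response_entropy(prompt: str, response: str) -> float:
    """Mean token-level predictive entropy of `response` given `prompt`."""
    ids = tokenizer(prompt + response, return_tensors="pt")
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        logits = model(**ids).logits[0]
    probs = torch.softmax(logits[prompt_len - 1:-1], dim=-1)  # next-token dists over the response
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
    return entropy.mean().item()

def relative_score(sample: dict, retrieve) -> float:
    """Entropy drop when the response is conditioned on retrieved external knowledge."""
    plain = response_entropy(sample["instruction"], sample["output"])
    context = "\n".join(retrieve(sample))                      # relevant examples or paragraphs
    grounded = response_entropy(context + "\n" + sample["instruction"], sample["output"])
    return plain - grounded

# Rank the synthetic dataset and keep a small budget (e.g., the top 1%):
# ranked = sorted(dataset, key=lambda s: relative_score(s, retrieve), reverse=True)
# selected = ranked[: max(1, len(ranked) // 100)]
```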
Causal Strategic Classification: A Tale of Two Shifts
When users can benefit from certain predictive outcomes, they may be prone to act to achieve those outcomes, e.g., by strategically modifying their features. The goal in strategic classification is therefore to train predictive models that are robust to such behavior. However, the conventional framework assumes that changing features does not change actual outcomes, which depicts users as "gaming" the system. Here we remove this assumption and study learning in a causal strategic setting where true outcomes do change. Focusing on accuracy as our primary objective, we show how strategic behavior and causal effects underlie two complementary forms of distribution shift. We characterize these shifts and propose a learning algorithm that balances these two forces over time and permits end-to-end training. Experiments on synthetic and semi-synthetic data demonstrate the utility of our approach.
- Asia > Middle East > Israel > Southern District > Eilat (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > Middle East > Israel > Haifa District > Haifa (0.04)
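A toy numpy simulation of the two shifts the abstract describes, not the paper's algorithm: users best-respond to a deployed linear rule (strategic shift), and their modified features causally change true outcomes (causal shift). The costs, effect sizes, and the naive retraining rule are all illustrative assumptions.

```python
# Toy illustration only: strategic best responses plus an assumed causal outcome model.
import numpy as np

rng = np.random.default_rng(0)
n, d, cost, rounds = 2000, 2, 2.0, 5

def true_label(x):
    # assumed causal mechanism: outcomes genuinely depend on the (possibly modified) features
    return (x @ np.array([1.0, 0.5]) + 0.1 * rng.standard_normal(len(x)) > 0).astype(int)

def best_response(x, w, b):
    # move along w just far enough to be classified positive, if the quadratic
    # cost of moving is smaller than the unit benefit of a positive prediction
    margin = x @ w + b
    step = np.clip(-margin / (w @ w), 0.0, None)
    move = step[:, None] * w[None, :]
    worth_it = 0.5 * cost * step**2 * (w @ w) < 1.0
    return x + np.where(worth_it[:, None], move, 0.0)

w, b = np.array([1.0, 0.0]), 0.0                       # initially deployed rule
for t in range(rounds):
    x = rng.standard_normal((n, d))
    x_shifted = best_response(x, w, b)                 # strategic shift against the deployed rule
    y = true_label(x_shifted)                          # causal shift: outcomes follow the moved features
    acc = ((x_shifted @ w + b > 0).astype(int) == y).mean()
    print(f"round {t}: accuracy of deployed rule under both shifts = {acc:.3f}")
    # naive retraining on the induced distribution (least squares on +/-1 labels)
    X = np.hstack([x_shifted, np.ones((n, 1))])
    theta, *_ = np.linalg.lstsq(X, 2.0 * y - 1.0, rcond=None)
    w, b = theta[:2], theta[2]
```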
AutoCure: Automated Tabular Data Curation Technique for ML Pipelines
Abdelaal, Mohamed, Koparde, Rashmi, Schoening, Harald
Machine learning algorithms have become increasingly prevalent in multiple domains, such as autonomous driving, healthcare, and finance. In such domains, data preparation remains a significant challenge in developing accurate models, requiring considerable expertise and time to search the huge space of well-suited data curation and transformation tools. To address this challenge, we present AutoCure, a novel and configuration-free data curation pipeline that improves the quality of tabular data. Unlike traditional data curation methods, AutoCure synthetically enhances the density of the clean data fraction through an adaptive ensemble-based error detection method and a data augmentation module. In practice, AutoCure can be integrated with open-source tools, e.g., Auto-sklearn, H2O, and TPOT, to promote the democratization of machine learning. As a proof of concept, we provide a comparative evaluation of AutoCure against 28 combinations of traditional data curation tools, demonstrating superior performance and predictive accuracy without user intervention. Our evaluation shows that AutoCure is an effective approach to automating data preparation and improving the accuracy of machine learning models.
- Europe > Germany > Saxony-Anhalt > Magdeburg (0.04)
- Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
- Europe > Spain > Valencian Community > Valencia Province > Valencia (0.04)
- Health & Medicine (0.50)
- Transportation > Ground > Road (0.34)
- Information Technology > Robotics & Automation (0.34)
- Automobiles & Trucks (0.34)
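A hedged sketch of the two ingredients the abstract names, ensemble-based error detection and augmentation of the clean data fraction. This is not AutoCure's actual implementation; the detectors, voting threshold, and jitter scale are illustrative assumptions.

```python
# Illustrative sketch, not AutoCure itself: vote among simple error detectors,
# then oversample the clean rows so they dominate downstream training.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def flag_dirty_rows(df: pd.DataFrame, vote_threshold: int = 2) -> pd.Series:
    """A row is flagged dirty if enough detectors agree."""
    num = df.select_dtypes(include="number")
    filled = num.fillna(num.median())
    votes = pd.DataFrame(index=df.index)
    votes["missing"] = df.isna().any(axis=1)
    z = (num - num.mean()) / num.std(ddof=0)
    votes["zscore"] = (z.abs() > 3).any(axis=1)
    votes["isoforest"] = IsolationForest(random_state=0).fit_predict(filled) == -1
    return votes.sum(axis=1) >= vote_threshold

def augment_clean_fraction(df: pd.DataFrame, dirty: pd.Series, factor: int = 2) -> pd.DataFrame:
    """Oversample clean rows with small jitter on numeric columns."""
    clean = df[~dirty]
    extra = clean.sample(len(clean) * (factor - 1), replace=True, random_state=0).reset_index(drop=True)
    num_cols = extra.select_dtypes(include="number").columns
    rng = np.random.default_rng(0)
    jitter = rng.standard_normal((len(extra), len(num_cols))) * 0.01 * clean[num_cols].std(ddof=0).values
    extra[num_cols] = extra[num_cols].values + jitter
    return pd.concat([clean, extra], ignore_index=True)

# dirty = flag_dirty_rows(df)
# curated = augment_clean_fraction(df, dirty)   # hand `curated` to Auto-sklearn, H2O, or TPOT
```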
The Importance of Data Preprocessing for Machine Learning in the E-Commerce Industry
Big data, as the name suggests, refers to large volumes of varied data that arrive at high velocity. Because big data is collected from raw, unprocessed sources, it is bound to contain dirty data. Data preprocessing is the process of transforming raw data into an understandable format that is ready for analytical use. Machine learning is a subset of artificial intelligence and an analytical application used to make decisions, without explicit programming, by receiving and analyzing data. The e-commerce industry revolves around the application of technology to commercial business.
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Data Science > Data Mining > Big Data (0.70)
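To make the notion of preprocessing concrete, here is a small pandas sketch over a hypothetical e-commerce orders table; the column names ("order_id", "price", "order_date", "country") are assumptions for illustration, not taken from the article above.

```python
# Illustrative preprocessing of a hypothetical e-commerce orders table.
import pandas as pd

def preprocess_orders(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()
    df = df.drop_duplicates(subset="order_id")                    # remove duplicate orders
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    df = df.dropna(subset=["order_date", "price"])                # drop rows that failed parsing
    df["country"] = df["country"].str.strip().str.upper()         # normalize categorical text
    return df[df["price"] > 0]                                    # discard impossible values
```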
Dirty Data -- Quality Assessment & Cleaning Measures - DataScienceCentral.com
In the book 'Bad Data Handbook', Q. Ethan McCallum has rightly said, "We all say we like data, but it's not the data but the insights we derive from it that we care about." Yet, a data analyst gets to dedicate only 20% of her time to the art and science of generating insights from data. The rest of her time is spent structuring and cleaning the data. To minimize the time invested in data cleaning, there is a need for standardized frameworks and tools that work for diverse data and business use cases across industries, functions, and domains. This blog aims to equip you with the knowledge you need to build and execute such standardized data quality frameworks for your data and use cases.
5 Data Science Projects to Learn 5 Critical Data Science Skills - KDnuggets
If you're trying to break into the data science industry, it can be great to get some projects under your belt. Doing data science projects helps you develop the skills you'll need to work as a data scientist. You'll also have a product you can put on your resume and discuss during interviews, which is critical to show you know what you're doing. The data science development cycle is the main pattern of any data science project, whether it's for a company or for your own personal project. You'll need to be comfortable with data collection, cleaning, modeling, and visualization to be a proficient data scientist. The specific tool stack you use at your future data science job may vary from the tools I recommend below, but like anything in the computer science world, it's more about learning how to think than about the specific syntax or features of one tool over another.
Automate annotation of image training data with Amazon Rekognition
Every machine learning (ML) model demands data to train it. If your model isn't predicting Titanic survival or iris species, then acquiring a dataset might be one of the most time-consuming parts of your model-building process--second only to data cleaning. What data cleaning looks like varies from dataset to dataset. For example, the following is a set of images tagged robin that you might want to use to train an image recognition model on bird species. That nest might count as dirty data, and some model applications may make it inappropriate to include American and European robins in the same category, but this seems pretty good so far.
- Information Technology > Security & Privacy (0.52)
- Retail > Online (0.40)
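A hedged sketch of using Amazon Rekognition's label detection (via boto3) to screen images such as the robin example above; the bucket and key names are placeholders, and the article's actual annotation pipeline may differ.

```python
# Placeholder bucket/keys; uses the standard boto3 Rekognition DetectLabels call.
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

def looks_like_bird(bucket: str, key: str, min_confidence: float = 80.0) -> bool:
    """Return True if Rekognition tags the S3 image as containing a bird."""
    response = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=20,
        MinConfidence=min_confidence,
    )
    return "Bird" in {label["Name"] for label in response["Labels"]}

# keep = [k for k in candidate_keys if looks_like_bird("my-training-bucket", k)]
```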
AI can be Effective in Handling Dirty Data In Supply Chain
Dirty data has the potential to break a well-established project. Truth be told, dirty data goes far beyond duplicate, incomplete, and invalid data. This is why 80% of a data scientist's job involves cleaning dirty data. Business organizations and industries strive to rid themselves of the dirty data that accumulates daily. No matter how much effort is put into cleaning the erroneous data, traces of it are always left behind, and these too can cause unwanted trouble.
- Information Technology > Artificial Intelligence (1.00)
- Information Technology > Data Science > Data Quality > Data Cleaning (0.53)
Testing the Robustness of AutoML Systems
Halvari, Tuomas, Nurminen, Jukka K., Mikkonen, Tommi
Automated machine learning (AutoML) systems aim to find the best machine learning (ML) pipeline that automatically matches the task and data at hand. We investigate the robustness of machine learning pipelines generated with three AutoML systems: TPOT, H2O, and AutoKeras. In particular, we study the influence of dirty data on accuracy and consider how using dirty training data may help create more robust solutions. We also analyze how the structure of the generated pipelines differs across these cases.
- Europe > Finland > Uusimaa > Helsinki (0.04)
- North America > United States > New York > New York County > New York City (0.04)
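An illustrative robustness check in the spirit of the study above, not the authors' exact protocol: missing values and label noise are injected into the training split before letting TPOT (via its classic scikit-learn-style API) search for a pipeline. The dataset and noise rates are arbitrary choices.

```python
# Illustrative only: inject dirt into training data, then run an AutoML search.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
X_dirty, y_dirty = X_train.copy(), y_train.copy()
X_dirty[rng.random(X_dirty.shape) < 0.10] = np.nan      # 10% missing cells
flip = rng.random(len(y_dirty)) < 0.05                   # 5% label noise
y_dirty[flip] = 1 - y_dirty[flip]

X_dirty = SimpleImputer(strategy="median").fit_transform(X_dirty)  # TPOT expects no NaNs

tpot = TPOTClassifier(generations=3, population_size=20, random_state=0, verbosity=0)
tpot.fit(X_dirty, y_dirty)
print("accuracy on clean test data:", tpot.score(X_test, y_test))
```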
CIOs Discuss the Promise of AI and Data Science
A few years ago, I asked CIOs about data science and it turned into a yawner of a discussion. However, in the last few years as chief data officers have made their mark at more and more enterprises, CIOs have needed to build their data chops. Given this, it was time to assess where CIOs are today. To do this, I ran a #CIOChat on AI and Data Science. From this discussion, it was clear CIOs are spending more time considering the "I" part of their titles.