automated data science
Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering
As the field of automated machine learning (AutoML) advances, it becomes increasingly important to incorporate domain knowledge into these systems. We present an approach for doing so by harnessing the power of large language models (LLMs). Specifically, we introduce Context-Aware Automated Feature Engineering (CAAFE), a feature engineering method for tabular datasets that uses an LLM to iteratively generate additional semantically meaningful features based on the description of the dataset. The method produces both Python code for creating new features and explanations for the utility of the generated features. Despite being methodologically simple, CAAFE improves performance on 11 out of 14 datasets, boosting mean ROC AUC performance from 0.798 to 0.822 across all datasets, an improvement similar to that achieved by using a random forest instead of logistic regression on our datasets. Furthermore, CAAFE is interpretable because it provides a textual explanation for each generated feature. CAAFE paves the way for more extensive semi-automation in data science tasks and emphasizes the significance of context-aware solutions that can extend the scope of AutoML systems to semantic AutoML. We release our code, a simple demo and a Python package.
DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning
Guo, Siyuan, Deng, Cheng, Wen, Ying, Chen, Hechang, Chang, Yi, Wang, Jun
In this work, we investigate the potential of large language model (LLM) based agents to automate data science tasks, with the goal of comprehending task requirements, then building and training the best-fit machine learning models. Despite their widespread success, existing LLM agents are hindered by generating unreasonable experiment plans within this scenario. To this end, we present DS-Agent, a novel automatic framework that harnesses an LLM agent and case-based reasoning (CBR). In the development stage, DS-Agent follows the CBR framework to structure an automatic iteration pipeline, which can flexibly capitalize on the expert knowledge from Kaggle and facilitate consistent performance improvement through the feedback mechanism. Moreover, DS-Agent implements a low-resource deployment stage with a simplified CBR paradigm to adapt past successful solutions from the development stage for direct code generation, significantly reducing the demand on foundational capabilities of LLMs. Empirically, DS-Agent with GPT-4 achieves a 100% success rate in the development stage, while attaining a 36% improvement on average in one-pass rate across alternative LLMs in the deployment stage. In both stages, DS-Agent achieves the best rank in performance, costing $1.60 and $0.13 per run with GPT-4, respectively. Our data and code are open-sourced at https://github.com/guosyjlu/DS-Agent.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Case-Based Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning (1.00)
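The case-based reasoning loop described in the DS-Agent abstract (retrieve a past case, reuse it, retain new successes) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the case-bank layout, the function names, the token-overlap (Jaccard) similarity used for retrieval, and the score threshold used for retention. The abstract does not say how DS-Agent retrieves or ranks cases, so this is not the authors' implementation.

```python
def tokens(text):
    """Lowercased whitespace tokens of a task description."""
    return set(text.lower().split())


def jaccard(a, b):
    """Token-overlap similarity between two task descriptions (stand-in metric)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(case_bank, task):
    """Retrieve step: return the stored case most similar to the new task."""
    return max(case_bank, key=lambda case: jaccard(case["task"], task))


def retain(case_bank, task, solution, score, threshold):
    """Retain step: store the new case only if its score meets the threshold."""
    if score >= threshold:
        case_bank.append({"task": task, "solution": solution})
        return True
    return False


# Hypothetical case bank; a real one would hold past expert solutions.
case_bank = [
    {"task": "classify customer churn from tabular data",
     "solution": "gradient boosted trees"},
    {"task": "forecast daily sales time series",
     "solution": "lag features with a linear model"},
]

new_task = "classify loan default from tabular data"
nearest = retrieve(case_bank, new_task)

# Reuse step: start from the retrieved solution (revision by an LLM is omitted).
proposed = nearest["solution"]

# Pretend the adapted solution scored 0.9 on a held-out metric (made-up number).
stored = retain(case_bank, new_task, proposed, score=0.9, threshold=0.8)
```

In this toy run the churn case is retrieved because it shares four of eight distinct tokens with the new task, and the new case is then added to the bank so later tasks can reuse it.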
Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering
Hollmann, Noah, Müller, Samuel, Hutter, Frank
As the field of automated machine learning (AutoML) advances, it becomes increasingly important to incorporate domain knowledge into these systems. We present an approach for doing so by harnessing the power of large language models (LLMs). Specifically, we introduce Context-Aware Automated Feature Engineering (CAAFE), a feature engineering method for tabular datasets that uses an LLM to iteratively generate additional semantically meaningful features based on the description of the dataset. The method produces both Python code for creating new features and explanations for the utility of the generated features. Despite being methodologically simple, CAAFE improves performance on 11 out of 14 datasets, boosting mean ROC AUC performance from 0.798 to 0.822 across all datasets, an improvement similar to that achieved by using a random forest instead of logistic regression on our datasets. Furthermore, CAAFE is interpretable because it provides a textual explanation for each generated feature. CAAFE paves the way for more extensive semi-automation in data science tasks and emphasizes the significance of context-aware solutions that can extend the scope of AutoML systems to semantic AutoML. We release our code (https://github.com/automl/CAAFE), a simple demo (https://colab.research.google.com/drive/1mCA8xOAJZ4MaB_alZvyARTMjhl6RZf0a) and a Python package (https://pypi.org/project/caafe/).
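The iterative feature generation described in this abstract can be sketched as a greedy loop: propose a feature with an explanation, score the table with and without it, and keep it only if the score improves. This is a minimal sketch under stated assumptions, not the CAAFE implementation. The LLM is replaced by a hard-coded list of candidate features, the downstream classifier is replaced by a crude stand-in score (the best single-feature ROC AUC), the keep-if-improved rule is an assumed acceptance criterion, and the toy height/weight data is invented.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def score_features(rows, labels, feature_names):
    """Stand-in for a downstream model: best direction-agnostic single-feature AUC."""
    best = 0.5
    for name in feature_names:
        auc = roc_auc(labels, [row[name] for row in rows])
        best = max(best, auc, 1.0 - auc)
    return best


def greedy_feature_search(rows, labels, base_features, candidates):
    """Keep each candidate feature only if it strictly improves the score."""
    kept = list(base_features)
    best = score_features(rows, labels, kept)
    log = []
    for name, explanation, fn in candidates:
        for row in rows:
            row[name] = fn(row)  # materialise the candidate column
        trial = score_features(rows, labels, kept + [name])
        accepted = trial > best
        if accepted:
            kept.append(name)
            best = trial
        log.append((name, explanation, accepted))
    return kept, best, log


# Invented toy table: the label depends on weight relative to height squared.
rows = [{"height_m": h, "weight_kg": w}
        for h in (1.5, 1.6, 1.7, 1.8, 1.9)
        for w in (50, 60, 70, 80, 90, 100)]
labels = [1 if r["weight_kg"] / r["height_m"] ** 2 >= 25 else 0 for r in rows]

# Hard-coded stand-ins for what an LLM might propose from a dataset description.
candidates = [
    ("bmi", "Body mass index relates weight to height.",
     lambda r: r["weight_kg"] / r["height_m"] ** 2),
    ("weight_plus_height", "Sum of the two raw columns.",
     lambda r: r["weight_kg"] + r["height_m"]),
]

kept, best, log = greedy_feature_search(
    rows, labels, ["height_m", "weight_kg"], candidates)
```

On this toy table the `bmi` candidate is kept because it separates the classes perfectly, while the second candidate is rejected because it cannot improve on a perfect score. The `log` pairs each decision with the textual explanation, mirroring the interpretability point made in the abstract.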
Top 10 Automated Data Science and Machine Learning Platforms in 2020
The use of data science and machine learning technologies is at a peak. The market offers many software tools with innovative features that bring the efficiency of modern data technologies and can increase a business's efficiency and value proposition. As they evolve at scale, these solutions are also revamped over time. This is the era of automated data science and machine learning software that not only enhances the operational proficiency of such tools but also assists data scientists. These tools help automate the repetitive and mundane tasks within ML or data science processes without compromising model performance or productivity. Here, then, is a list of the top 10 automated data science and machine learning software products from key players in the market.
Get Ahead of Automated Machine Learning (AutoML) to Accelerate Your AI Roadmap
Being great at data science to keep your business ahead of the competition is finally becoming more affordable and less complex to manage as open source technology becomes commonplace. In this POV, we'll explore the emergence of Automated Machine Learning (AutoML), which is making it much more feasible to use machine learning algorithms to develop machine learning algorithms. This is how quickly the AI industry is progressing today. We are already seeing the data science community explore ways to make analytics and machine learning tasks cheaper, faster, easier and increasingly autonomous and self-remediating. Business leaders should prepare for automated data science to become commonplace, not necessarily as a way to entirely replace data scientists, but to significantly boost their capabilities and provide a starting point for ML. AutoML is a step in this direction.
'Automated Data Science' to offer a competitive edge to enterprises - CIOL
According to a recent Indian jobs study, data science is one of the top and fastest-growing fields in India, and its relevance is increasing in almost every sector. Reports from NASSCOM suggest that India's data industry will reach $16 billion by 2025, up from the present level of $2 billion. At its core, data science is the science of examining raw data and applying statistical techniques to draw business-related conclusions and predict business outcomes. In every organization, there are opportunities to implement data science and transform the way business is carried out. Leading analysts like Gartner and Forrester have cited 2018 as a milestone year for organizations, with over 70% of them expected to leverage data science for business optimization.
Cartoon: Data Scientist was the sexiest job of the 21st century until …
We revisit our popular cartoon, which has not lost any relevancy. A few years ago, the Harvard Business Review article by Thomas Davenport and DJ Patil proclaimed "Data Scientist: The Sexiest Job of the 21st Century". But here is what may be coming ... Data Scientist: "I thought I had the sexiest job of the 21st century". This cartoon was ably drawn by Jon Carter. Here are more KDnuggets posts on Data Science automation: Automated Machine Learning vs Automated Data Science; The Current State of Automated Machine Learning; Automated Data Science & Machine Learning: An Interview with the Auto-sklearn Team; Contest Winner: Winning the AutoML Challenge with Auto-sklearn; Contest 2nd Place: Automating Data Science; Data Science Automation: Debunking Misconceptions. See also the KDnuggets tag Automated Data Science and the KDnuggets Big Data, Data Mining, and Data Science Cartoon page. More recent KDnuggets cartoons: FIFA World Cup Football and Machine Learning; GDPR first effect on Privacy ...
- Information Technology (1.00)
- Leisure & Entertainment > Sports > Soccer (0.99)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Data Science > Data Mining > Big Data (0.62)
Contest 2nd Place: Automated Data Science and Machine Learning in Digital Advertising
Editor's note: This blog post was an entrant in the recent KDnuggets Automated Data Science and Machine Learning blog contest, where it tied for second place. Digital Advertising provides an exciting playground for machine learning in general and automated predictive modeling in particular. An increasing proportion of digital advertising is delivered through real-time bidding ad exchanges. Ad exchanges connect sellers of ad placements (usually websites with ad space to monetize) and buyers (usually technology firms like Dstillery, operating on behalf of consumer brands and agencies). The goals of the buyers vary.
- Marketing (1.00)
- Information Technology > Services (1.00)