Bridging the Prototype-Production Gap: A Multi-Agent System for Notebooks Transformation
Elhashemy, Hanya, Lotfy, Youssef, Tang, Yongjian
The increasing adoption of Jupyter notebooks in data science and machine learning workflows has created a gap between exploratory code development and production-ready software systems. While notebooks excel at iterative development and visualization, they often lack proper software engineering principles, making their transition to production environments challenging. This paper presents Codelevate, a novel multi-agent system that automatically transforms Jupyter notebooks into well-structured, maintainable Python code repositories. Our system employs three specialized agents (Architect, Developer, and Structure) working in concert through a shared dependency tree to ensure architectural coherence and code quality. Our experimental results validate Codelevate's capability to bridge the prototype-to-production gap through autonomous code transformation, yielding quantifiable improvements in code quality metrics while preserving computational semantics.
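The shared dependency tree the abstract describes could be imagined roughly as follows. This is a hedged sketch, not Codelevate's actual data structure: the class names, module names, and topological-ordering step are all invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ModuleNode:
    name: str                                   # target module, e.g. "preprocessing"
    cells: list = field(default_factory=list)   # notebook cell indices mapped to it
    deps: list = field(default_factory=list)    # modules this one imports

class DependencyTree:
    """Toy shared structure that agents could consult and update."""
    def __init__(self):
        self.nodes = {}

    def add(self, node: ModuleNode):
        self.nodes[node.name] = node

    def build_order(self):
        """Topologically order modules so dependencies are emitted first."""
        order, seen = [], set()
        def visit(name):
            if name in seen:
                return
            seen.add(name)
            for dep in self.nodes[name].deps:
                visit(dep)
            order.append(name)
        for name in self.nodes:
            visit(name)
        return order

tree = DependencyTree()
tree.add(ModuleNode("io_utils", cells=[0, 1]))
tree.add(ModuleNode("preprocessing", cells=[2, 3], deps=["io_utils"]))
tree.add(ModuleNode("train", cells=[4], deps=["preprocessing", "io_utils"]))
print(tree.build_order())  # io_utils before preprocessing before train
```

A structure like this would let the Developer agent emit one module at a time while the Architect keeps the overall import graph acyclic.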
Themisto: Jupyter-Based Runtime Benchmark
Grotov, Konstantin, Titov, Sergey
In this work, we present a benchmark that consists of Jupyter notebook development trajectories and allows measuring how well large language models (LLMs) can leverage runtime information for predicting code output and for code generation. We demonstrate that the current generation of LLMs performs poorly on these tasks and argue that there exists a significantly understudied domain in the development of code-based models: incorporating the runtime context. Recent developments in code completion and generation have been significant. Over the past several years, the field has progressed from generating relatively simple programs (Chen et al., 2021) to solving real-world issues within software repositories (Jimenez et al., 2023). However, most studies in this area are based on static snapshots of code (Jiang et al., 2024), with only a small body of research exploring the potential of leveraging dynamic code properties, such as runtime information and memory state, for code generation (Chen et al., 2024). A key reason for this limitation is that common programming environments rarely allow code generation during execution, which is when runtime information can be gathered.
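The runtime context the benchmark targets could, in a live kernel, be gathered by summarizing the variables currently in scope and prepending that summary to the static code context in a prompt. A minimal sketch, with invented function and variable names (not the benchmark's code):

```python
# Summarize a notebook-like namespace into short "name: type = value" lines
# that could be fed to an LLM alongside the code itself.
def summarize_runtime(namespace: dict, max_items: int = 10) -> str:
    lines = []
    for name, value in list(namespace.items())[:max_items]:
        if name.startswith("_"):          # skip IPython internals like _, __, _i1
            continue
        summary = repr(value)
        if len(summary) > 40:             # truncate long reprs to save tokens
            summary = summary[:37] + "..."
        lines.append(f"{name}: {type(value).__name__} = {summary}")
    return "\n".join(lines)

# In a real IPython kernel this would be get_ipython().user_ns;
# here we use a toy namespace for illustration.
state = {"df_rows": 1500, "columns": ["id", "price"], "model": None}
print(summarize_runtime(state))
```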
DatawiseAgent: A Notebook-Centric LLM Agent Framework for Automated Data Science
You, Ziming, Zhang, Yumiao, Xu, Dexuan, Lou, Yiwei, Yan, Yandong, Wang, Wei, Zhang, Huaming, Huang, Yu
Data science tasks are multifaceted, dynamic, and often domain-specific. Existing LLM-based approaches largely concentrate on isolated phases, neglecting the interdependent nature of many data science tasks and limiting their capacity for comprehensive end-to-end support. We propose DatawiseAgent, a notebook-centric LLM agent framework that unifies interactions among the user, the agent, and the computational environment through markdown and executable code cells, supporting flexible and adaptive automated data science. Built on a finite state transducer (FST), DatawiseAgent orchestrates four stages: DFS-like planning, incremental execution, self-debugging, and post-filtering. Specifically, the DFS-like planning stage systematically explores the solution space, while incremental execution harnesses real-time feedback and accommodates LLMs' limited capabilities to progressively complete tasks. The self-debugging and post-filtering modules further enhance reliability by diagnosing and correcting errors and pruning extraneous information. Extensive experiments on diverse tasks, including data analysis, visualization, and data modeling, show that DatawiseAgent consistently outperforms or matches state-of-the-art methods across multiple model settings. These results highlight its potential to generalize across data science scenarios and lay the groundwork for more efficient, fully automated workflows.
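The four stages named in the abstract can be pictured as states of a small transducer. The transition table below is invented for illustration; DatawiseAgent's actual FST is richer than this sketch:

```python
# Toy finite-state controller over the four stages the abstract names:
# planning -> incremental execution -> (self-debugging on error) -> post-filtering.
TRANSITIONS = {
    ("plan", "ok"): "execute",
    ("execute", "ok"): "filter",
    ("execute", "error"): "debug",
    ("debug", "ok"): "execute",
    ("filter", "ok"): "done",
}

def run(events):
    """Feed a sequence of ok/error events through the transducer."""
    state, trace = "plan", ["plan"]
    for event in events:
        state = TRANSITIONS[(state, event)]
        trace.append(state)
    return trace

print(run(["ok", "error", "ok", "ok", "ok"]))
# plan -> execute -> debug -> execute -> filter -> done
```

The point of the state machine is that a failed execution routes through a repair loop instead of aborting the whole workflow.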
Contextualized Data-Wrangling Code Generation in Computational Notebooks
Huang, Junjie, Guo, Daya, Wang, Chenglong, Gu, Jiazhen, Lu, Shuai, Inala, Jeevana Priya, Yan, Cong, Gao, Jianfeng, Duan, Nan, Lyu, Michael R.
Data wrangling, the process of preparing raw data for further analysis in computational notebooks, is a crucial yet time-consuming step in data science. Code generation has the potential to automate the data wrangling process and reduce analysts' overhead by translating user intents into executable code. Precisely generating data-wrangling code necessitates a comprehensive consideration of the rich context present in notebooks, including textual context, code context, and data context. However, notebooks often interleave multiple non-linear analysis tasks into a linear sequence of code blocks, where the contextual dependencies are not clearly reflected. Directly training models on source code blocks fails to fully exploit these contexts for accurate wrangling code generation. To bridge the gap, we aim to construct a high-quality dataset with clear and rich contexts to help train models for data-wrangling code generation. In this work, we first propose an automated approach, CoCoMine, to mine data-wrangling code generation examples with clear multi-modal contextual dependencies. It first adopts data flow analysis to identify the code blocks containing data-wrangling code. Then, CoCoMine extracts the contextualized data-wrangling code examples by tracing and replaying notebooks. With CoCoMine, we construct CoCoNote, a dataset containing 58,221 examples for Contextualized Data-wrangling Code generation in Notebooks. To demonstrate the effectiveness of our dataset, we finetune a range of pretrained code models and prompt various large language models on our task. Furthermore, we also propose DataCoder, which encodes the data context and the code and textual contexts separately to enhance code generation. Experiment results demonstrate the significance of incorporating data context in data-wrangling code generation and the effectiveness of our model. We release code and data at url...
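CoCoMine's first step, identifying the code blocks that contain data-wrangling code, relies on data flow analysis. As a rough stand-in, a much cruder heuristic can be sketched with Python's `ast` module: flag cells that call common pandas wrangling methods. The method list here is illustrative and is not the paper's actual criterion:

```python
import ast

# Invented shortlist of method names often seen in wrangling cells.
WRANGLING_CALLS = {"dropna", "fillna", "merge", "groupby", "pivot_table", "astype"}

def looks_like_wrangling(cell_source: str) -> bool:
    """Return True if the cell calls any method in WRANGLING_CALLS."""
    try:
        tree = ast.parse(cell_source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if node.func.attr in WRANGLING_CALLS:
                return True
    return False

print(looks_like_wrangling("df = df.dropna().merge(other, on='id')"))  # True
print(looks_like_wrangling("plt.plot(xs, ys)"))                        # False
```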
A Flexible Cell Classification for ML Projects in Jupyter Notebooks
Perez, Miguel, Aydin, Selin, Lichter, Horst
Jupyter Notebook is an interactive development environment commonly used for rapid experimentation with machine learning (ML) solutions. Describing the ML activities performed in code cells improves the readability and understanding of notebooks. Manual annotation of code cells is time-consuming and error-prone, so tools have been developed that classify a notebook's cells according to the ML activity performed in them. However, current tools are not flexible, as they rely on manually created look-up tables that map function calls of commonly used ML libraries to ML activities; these tables must be manually adjusted to account for new or changed libraries. This paper presents a more flexible, hybrid approach to cell classification that combines a rule-based classifier with a decision tree classifier. We discuss the design rationales and describe the developed classifiers in detail. We implemented the new flexible cell classification approach in a tool called JupyLabel and discuss its evaluation and the obtained precision, recall, and F1 scores. Additionally, we compared JupyLabel with HeaderGen, an existing cell classification tool, and show that the presented flexible cell classification approach significantly outperforms it.
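The look-up-table style of classification that the paper contrasts with its hybrid approach can be sketched in a few lines. The table entries and activity labels below are invented examples, not JupyLabel's or any existing tool's actual rules:

```python
# Toy look-up table mapping known library calls to ML activities.
RULES = {
    "read_csv": "data_loading",
    "train_test_split": "data_preparation",
    "fit": "model_training",
    "predict": "model_evaluation",
    "plot": "visualization",
}

def classify_cell(source: str) -> str:
    """Return the activity of the first known call found in the cell."""
    for call, activity in RULES.items():
        if call + "(" in source:
            return activity
    return "unknown"

print(classify_cell("df = pd.read_csv('data.csv')"))   # data_loading
print(classify_cell("model.fit(X_train, y_train)"))    # model_training
```

The brittleness is visible immediately: any library whose API is not in `RULES` falls through to `unknown`, which is exactly the inflexibility the paper's hybrid classifier is meant to avoid.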
Unlocking Insights: Semantic Search in Jupyter Notebooks
Semantic search, a process aimed at delivering highly relevant search results by comprehending the searcher's intent and the contextual meaning of terms within a searchable dataspace, plays a pivotal role in information retrieval. In this paper, we investigate the application of large language models to enhance semantic search capabilities, specifically tailored to the domain of Jupyter Notebooks. Our objective is to retrieve generated outputs, such as figures or tables, associated functions and methods, and other pertinent information. We demonstrate a semantic search framework that achieves a comprehensive semantic understanding of an entire notebook's contents, enabling it to effectively handle various types of user queries. Key components of this framework include: (1) a data preprocessor designed to handle the diverse types of cells within Jupyter Notebooks, encompassing both markdown and code cells; and (2) a methodology devised to address the token-size limitations that arise with code cells, implementing a finer-grained approach to data input that transitions from the cell level to the function level and effectively resolves these issues.
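The cell-to-function-level transition described above can be sketched with Python's `ast` module: split each code cell into one chunk per top-level function, plus one chunk for everything else, so that each chunk fits within an embedding model's token budget. This is a hedged approximation of the idea, not the framework's actual preprocessor:

```python
import ast

def split_into_functions(cell_source: str):
    """Return one chunk per top-level function, plus the remaining code."""
    tree = ast.parse(cell_source)
    chunks, other = [], []
    for node in tree.body:
        segment = ast.get_source_segment(cell_source, node)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append(segment)       # each function becomes its own chunk
        else:
            other.append(segment)        # loose statements are grouped together
    if other:
        chunks.append("\n".join(other))
    return chunks

cell = """def load(path):
    return open(path).read()

def clean(text):
    return text.strip()

result = clean(load('a.txt'))
"""
chunks = split_into_functions(cell)
print(len(chunks))  # 3: load(), clean(), and the trailing statement
```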
Why should you use a Cloud VM [Google Colab] for DL?
There are many platforms available for coding, but for deep learning work we need to pay extra attention to a platform's capability to train models; with that said, coders need a full understanding of how to monitor resources and devices. In the follow-ups, I will go over ten reasons why you should use Google Colab for deep learning projects. Are you still struggling to find your files on your local drive? If so, why not try Google Colab? With everything stored in the cloud, you can find your files in one click.
AI Powered 3D Human Shape Estimation
In this short tutorial, we are going to look at a very cool and interesting image-based 3D human shape estimation model. We will implement a pipeline in a Spell workspace with JupyterLab and run the pre-trained model to turn a custom image into a 3D model. Sign up for Spell if you haven't already, and you can get $10 worth of GPU time on T4, P100, K80, and V100 GPUs for free. We are going to use Facebook Research's pifuhd model, which contains a PyTorch implementation of "Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization". Note: at least 8 GB of GPU memory is recommended to run the PIFuHD model; we can use any suitable GPU from Spell.
11 Extensions to Power Up your Jupyter Notebook - Analytics Vidhya
Jupyter Notebook is an easy-to-use, open-source tool for web-based interactive computing. It supports more than 40 programming languages, including Python, R, and Java. As a result, most data science professionals use Jupyter Notebooks to create and share documents combining code, equations, visualizations, computational outputs, and markdown text. The basic Jupyter Notebook environment is well suited to general training, education, and routine machine learning/deep learning model development. However, the vanilla environment lacks certain features, which makes it tedious to handle complex code.
Introducing Jupyter and Pandas
This article is the first in a series that helps working developers get up to speed on data science tools and techniques. We'll start with a brief introduction to the series, and explain everything we're going to cover. Developers and data scientists working on data analysis and machine learning (ML) projects spend the majority of their time finding, cleaning, and organizing datasets. We'll do this by using Python, Pandas, and Seaborn in a Jupyter notebook to clean up a sample retail store's messy customer database. This seven-part series will take the initial round of messy data, clean it, and develop a set of visualizations that highlight our work. Here's what the series will cover: Before we start cleaning our dataset, let's take a quick look at two of the tools we'll use: Pandas and Jupyter Notebooks.
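A taste of the kind of cleanup the series will walk through, using Pandas on a tiny invented "messy customer" table (the series' actual dataset is a sample retail store database, so column names and values here are illustrative):

```python
import pandas as pd

# A messy miniature customer table: stray whitespace, inconsistent casing,
# and a duplicate row that only reveals itself after normalization.
raw = pd.DataFrame({
    "name": ["  Ada Lovelace", "GRACE HOPPER", "Ada Lovelace  "],
    "email": ["ada@example.com", None, "ada@example.com"],
})

# Strip whitespace, normalize capitalization, then drop the duplicate.
clean = raw.assign(name=raw["name"].str.strip().str.title())
clean = clean.drop_duplicates().reset_index(drop=True)
print(clean)  # two rows remain: Ada Lovelace and Grace Hopper
```

Typing this into a Jupyter cell and inspecting `clean` after each step is exactly the iterative workflow the series is built around.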