AITopics

2505.15874

Country: Europe > Finland (0.28)

Genre:

Workflow (1.00)
Research Report (0.63)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Murray, Andrew, Dervovic, Danial, Cashmore, Michael

ELATE: Evolutionary Language model for Automated Time-series Engineering

arXiv.org Artificial Intelligence Aug-21-2025

Time-series prediction involves forecasting future values using machine learning models. Feature engineering, whereby existing features are transformed to make new ones, is critical for enhancing model performance, but is often manual and time-intensive. Existing automation attempts rely on exhaustive enumeration, which can be computationally costly and lacks domain-specific insights. We introduce ELATE (Evolutionary Language model for Automated Time-series Engineering), which leverages a language model within an evolutionary framework to automate feature engineering for time-series data. ELATE employs time-series statistical measures and feature importance metrics to guide and prune features, while the language model proposes new, contextually relevant feature transformations. Our experiments demonstrate that ELATE improves forecasting accuracy by an average of 8.4% across various domains.

data mining, large language model, machine learning, (22 more...)

2508.14667

Genre:

Research Report > New Finding (0.67)
Research Report > Experimental Study (0.46)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Energy (1.00)
Banking & Finance > Trading (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)

arXiv.org Artificial Intelligence Mar-21-2024

Semantically Aligned Question and Code Generation for Automated Insight Generation

Singha, Ananya, Chopra, Bhavya, Khatry, Anirudh, Gulwani, Sumit, Henley, Austin Z., Le, Vu, Parnin, Chris, Singh, Mukul, Verbruggen, Gust

Automated insight generation is a common tactic for helping knowledge workers, such as data scientists, to quickly understand the potential value of new and unfamiliar data. Unfortunately, automated insights produced by large language models can generate code that does not correctly correspond (or align) to the insight. In this paper, we leverage the semantic knowledge of large language models to generate targeted and insightful questions about data and the corresponding code to answer those questions. Then, through an empirical study on data from Open-WikiTable, we show that embeddings can be effectively used for filtering out semantically unaligned pairs of question and code. Additionally, we found that generating questions and code together yields more diverse questions.

aligned question and code generation, participant, semantically aligned question, (12 more...)

2405.01556

Country:

Europe > Portugal > Lisbon > Lisbon (0.05)
Asia > India (0.05)
North America > United States > New York > New York County > New York City (0.04)
(6 more...)

Genre: Questionnaire & Opinion Survey (1.00)

Industry: Leisure & Entertainment > Sports > Snooker (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

arXiv.org Artificial Intelligence Oct-28-2023

ASTormer: An AST Structure-aware Transformer Decoder for Text-to-SQL

Cao, Ruisheng, Zhang, Hanchong, Xu, Hongshen, Li, Jieyu, Ma, Da, Chen, Lu, Yu, Kai

Text-to-SQL aims to generate an executable SQL program given the user utterance and the corresponding database schema. To ensure the well-formedness of output SQLs, one prominent approach adopts a grammar-based recurrent decoder to produce the equivalent SQL abstract syntax tree (AST). However, previous methods mainly utilize an RNN-series decoder, which 1) is time-consuming and inefficient and 2) introduces very few structure priors. In this work, we propose an AST structure-aware Transformer decoder (ASTormer) to replace traditional RNN cells. The structural knowledge, such as node types and positions in the tree, is seamlessly incorporated into the decoder via both absolute and relative position embeddings. Besides, the proposed framework is compatible with different traversing orders even considering adaptive node selection. Extensive experiments on five text-to-SQL benchmarks demonstrate the effectiveness and efficiency of our structured decoder compared to competitive baselines.

computational linguistic, decoder, node, (14 more...)

2310.18662

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > China > Shanghai > Shanghai (0.04)
(7 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.68)

arXiv.org Artificial Intelligence Oct-10-2023

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Jimenez, Carlos E., Yang, John, Wettig, Alexander, Yao, Shunyu, Pei, Kexin, Press, Ofir, Narasimhan, Karthik

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We consider real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. We therefore introduce SWE-bench, an evaluation framework including 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. Claude 2 and GPT-4 solve a mere 4.8% and 1.7% of instances respectively, even when provided with an oracle retriever. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.

codebase, repository, swe-bench, (16 more...)

2310.0677

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > California > Santa Clara County > San Jose (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence Dec-19-2022

Natural Language to Code Generation in Interactive Data Science Notebooks

Yin, Pengcheng, Li, Wen-Ding, Xiao, Kefan, Rao, Abhishek, Wen, Yeming, Shi, Kensen, Howland, Joshua, Bailey, Paige, Catasta, Michele, Michalewski, Henryk, Polozov, Alex, Sutton, Charles

Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanation, showing the potential to improve the diversity and explainability of model predictions.

large language model, machine learning, natural language, (21 more...)

2212.09248

Country:

Asia > Middle East > Israel (0.04)
Asia > India > West Bengal > Kolkata (0.04)
Asia > India > Tamil Nadu > Chennai (0.04)
(10 more...)

Genre: Research Report > New Finding (0.67)

Industry: Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (0.61)

#artificialintelligence Mar-14-2022, 14:44:04 GMT

Feature Engineering for Machine Learning

"Good features allow a simple model to beat a complex model." We'll see there's an almost infinite number of ways to build new features from existing ones, so the art in Feature Generation, once you're aware of the basic techniques described below, is really in gaining the intuition on what to try. For this article, we'll be jointly describing both Feature Extraction, which generally refers to domain-specific methods of dimensionality reduction, and Feature Generation, accomplished via (i) mapping existing features into a new space, (ii) ... We'll be grouping methods by their applicability to the underlying data type. For timestamp data, periodicity may manifest at more than one time scale, so, depending on your data, you may wish to decompose a timestamp column into multiple columns, such as: Minute, Hour, Day of Week, Weekday-or-Weekend, Day of Month, Month, Season or Year. Doing so will also let you use pd.DataFrame.groupby() to perform aggregations, which is in itself one of the most powerful ways to generate new features.
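The timestamp decomposition and groupby aggregation described in this excerpt might look roughly like the sketch below. The data frame, the column names (`timestamp`, `sales`) and the derived feature names are illustrative assumptions, not taken from the article.

```python
import pandas as pd

# Hypothetical data: four sales records with a timestamp column.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-03-11 09:30",  # Friday
        "2022-03-12 14:00",  # Saturday
        "2022-03-13 14:45",  # Sunday
        "2022-03-14 09:15",  # Monday
    ]),
    "sales": [10.0, 20.0, 40.0, 30.0],
})

# Decompose the timestamp into several periodic components.
ts = df["timestamp"].dt
df["hour"] = ts.hour
df["day_of_week"] = ts.dayofweek        # Monday=0 ... Sunday=6
df["is_weekend"] = ts.dayofweek >= 5    # Saturday or Sunday
df["day_of_month"] = ts.day
df["month"] = ts.month
df["year"] = ts.year

# Aggregate over a derived column and broadcast the result back
# to every row, which yields a new group-level feature.
df["mean_sales_by_weekend"] = (
    df.groupby("is_weekend")["sales"].transform("mean")
)

print(df[["timestamp", "is_weekend", "mean_sales_by_weekend"]])
```

Using `transform("mean")` rather than `agg` keeps the result aligned with the original rows, so the group mean can be stored directly as a column.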

feature engineering, new feature, representation, (14 more...)

Country: North America > United States (0.14)

Industry: Banking & Finance (1.00)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)

#artificialintelligence Oct-13-2021, 13:27:27 GMT

PySpark Tutorial

PySpark is the Python API for Apache Spark, an open-source cluster-computing framework for large-scale data processing written in Scala.

dataframe, pyspark, sparksession, (16 more...)

Industry: Information Technology (0.49)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Architecture (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
(4 more...)

#artificialintelligence Mar-4-2021, 13:15:52 GMT

Data Wrangling With Python -- Part 2

We can delete one or more rows from a data frame. With the help of a boolean condition, we can create a new data frame that excludes the rows we want to delete. We can also use the drop method, as in df.drop([0, 1], axis=0), to drop the rows labeled 0 and 1 (the first two rows under the default index). A more practical method is simply to wrap the boolean condition inside df[]. If we look closely, we didn't drop any rows. The reason is that drop_duplicates() by default drops only rows that match across all columns, and every row in the data frame is unique.
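The three row-deletion approaches mentioned in this excerpt can be compared in a small sketch. The data frame and its columns (`name`, `city`) are made up for illustration and are not the tutorial's data.

```python
import pandas as pd

# Hypothetical data: "Bob" appears twice, but with different cities.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cat", "Bob"],
    "city": ["Oslo", "Rome", "Oslo", "Lima"],
})

# Boolean condition inside df[]: keep only rows where it is True.
no_bob = df[df["name"] != "Bob"]

# drop() removes rows by index label; here labels 0 and 1.
without_first_two = df.drop([0, 1], axis=0)

# By default drop_duplicates() compares all columns. No two rows
# match in every column, so nothing is removed.
deduped_all = df.drop_duplicates()

# Restricting the comparison to one column removes the second "Bob".
deduped_name = df.drop_duplicates(subset=["name"])

print(len(df), len(deduped_all), len(deduped_name))
```

None of these calls modifies `df` itself; each returns a new data frame.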

data frame, duplicate, groupby, (12 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.76)
Information Technology > Data Science > Data Quality > Data Cleaning (0.51)

#artificialintelligence Aug-30-2020, 16:25:39 GMT

6 Pandas Operations You Should Not Miss

Notice that the stats are given only for numerical columns, which is the expected default behavior. We can also ask the describe function to include categorical columns by setting the parameter 'include' to 'all' (include='all').
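As a rough illustration of the difference, the sketch below calls describe with and without include='all'. The data frame and its columns (`price`, `color`) are assumptions made for this example.

```python
import pandas as pd

# Hypothetical data: one numerical and one categorical column.
df = pd.DataFrame({
    "price": [1.0, 2.0, 3.0],
    "color": ["red", "blue", "red"],
})

# Default: summary statistics for numerical columns only.
numeric_only = df.describe()

# include="all": also summarizes non-numerical columns, adding
# rows such as unique, top and freq.
everything = df.describe(include="all")

print(numeric_only)
print(everything)
```

In the combined output, statistics that do not apply to a column (for example the mean of `color`) appear as NaN.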

artificial intelligence, dataframe, operation, (16 more...)

Technology: Information Technology > Artificial Intelligence (0.86)