AITopics | exploratory data analysis

NoteEx: Interactive Visual Context Manipulation for LLM-Assisted Exploratory Data Analysis in Computational Notebooks

Payandeh, Mohammad Hasan, Yuan, Lin-Ping, Zhao, Jian

arXiv.org Artificial Intelligence | Nov-11-2025

Computational notebooks have become popular for Exploratory Data Analysis (EDA), augmented by LLM-based code generation and result interpretation. Effective LLM assistance hinges on selecting informative context: the minimal set of cells whose code, data, or outputs suffice to answer a prompt. As notebooks grow long and messy, users can lose track of their mental model of the analysis. They thus fail to curate appropriate contexts for LLM tasks, causing frustration and tedious prompt engineering. We conducted a formative study (n=6) that surfaced challenges in LLM context selection and mental model maintenance. We therefore introduce NoteEx, a JupyterLab extension that provides a semantic visualization of the EDA workflow, allowing analysts to externalize their mental model, specify analysis dependencies, and interactively select task-relevant contexts for LLMs. A user study (n=12) against a baseline shows that NoteEx improved mental model retention and context selection, leading to more accurate and relevant LLM responses.
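To make the idea of a minimal, dependency-based context concrete, here is a small Python sketch. It is not NoteEx's algorithm; it only assumes that each notebook cell lists the cells it depends on (the `deps` mapping and the cell ids are hypothetical) and collects the target cell plus everything it transitively relies on.

```python
from collections import deque


def select_context(deps, target):
    """Return the target cell and all cells it transitively depends on.

    deps maps a cell id to the ids of the cells it directly depends on
    (a hypothetical representation, not NoteEx's data model).
    The result is sorted by cell id, i.e. in notebook order.
    """
    seen = {target}
    queue = deque([target])
    while queue:
        cell = queue.popleft()
        for parent in deps.get(cell, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen)


# Hypothetical notebook: cell 1 loads the data, cells 2 and 3 branch off it.
deps = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [4]}
context = select_context(deps, 6)
print(context)
```

Cells on the unrelated branch (3 and 5 here) are left out, which is the point of curating context before sending a prompt to an LLM.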

artificial intelligence, large language model, natural language, (15 more...)

arXiv.org Artificial Intelligence

2511.07223

Country:

North America > United States (1.00)
North America > Canada > Ontario (0.28)
Asia > Japan > Honshū > Kantō (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Questionnaire & Opinion Survey (1.00)
Personal > Interview (0.67)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

mAIstro: an open-source multi-agentic system for automated end-to-end development of radiomics and deep learning models for medical imaging

Tzanis, Eleftherios, Klontzas, Michail E.

arXiv.org Artificial Intelligence | Oct-15-2025

Agentic systems built on large language models (LLMs) offer promising capabilities for automating complex workflows in healthcare AI. We introduce mAIstro, an open-source, autonomous multi-agentic framework for end-to-end development and deployment of medical AI models. The system orchestrates exploratory data analysis, radiomic feature extraction, image segmentation, classification, and regression through a natural language interface, requiring no coding from the user. Built on a modular architecture, mAIstro supports both open- and closed-source LLMs, and was evaluated using a large and diverse set of prompts across 16 open-source datasets, covering a wide range of imaging modalities, anatomical regions, and data types. The agents successfully executed all tasks, producing interpretable outputs and validated models. This work presents the first agentic framework capable of unifying data analysis, AI model development, and inference across varied healthcare applications, offering a reproducible and extensible foundation for clinical and research AI integration. The code is available at: https://github.com/eltzanis/mAIstro

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.ejrai.2025.100044

2505.03785

Country: Europe (0.46)

Genre: Research Report (0.64)

Industry:

Health & Medicine > Health Care Technology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

QUIS: Question-guided Insights Generation for Automated Exploratory Data Analysis

Manatkar, Abhijit, Akella, Ashlesha, Gupta, Parthivi, Narayanam, Krishnasuri

arXiv.org Artificial Intelligence | Oct-21-2024

Discovering meaningful insights from a large dataset, known as Exploratory Data Analysis (EDA), is a challenging task that requires thorough exploration and analysis of the data. Automated Data Exploration (ADE) systems use goal-oriented methods with Large Language Models and Reinforcement Learning towards full automation. However, these methods require human involvement to anticipate goals that may limit insight extraction, while fully automated systems demand significant computational resources and retraining for new datasets. We introduce QUIS, a fully automated EDA system that operates in two stages: insight generation (ISGen) driven by question generation (QUGen). The QUGen module generates questions in iterations, refining them from previous iterations to enhance coverage without human intervention or manually curated examples. The ISGen module analyzes data to produce multiple relevant insights in response to each question, requiring no prior training and enabling QUIS to adapt to new datasets.

large language model, machine learning, question answering, (21 more...)

arXiv.org Artificial Intelligence

2410.10270

Country:

Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
North America > United States (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Asia > India > West Bengal > Kharagpur (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.52)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.35)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.34)

ILAEDA: An Imitation Learning Based Approach for Automatic Exploratory Data Analysis

Manatkar, Abhijit, Patel, Devarsh, Patel, Hima, Manwani, Naresh

arXiv.org Artificial Intelligence | Oct-15-2024

Automating end-to-end Exploratory Data Analysis (AutoEDA) is a challenging open problem, often tackled through Reinforcement Learning (RL) by learning to predict a sequence of analysis operations (FILTER, GROUP, etc.). Defining rewards for each operation is a challenging task, and existing methods rely on various interestingness measures to craft reward functions that capture the importance of each operation. In this work, we argue that not all of the essential features of what makes an operation important can be accurately captured mathematically using rewards. We propose an AutoEDA model trained through imitation learning from expert EDA sessions, bypassing the need for manually defined interestingness measures. Our method, based on generative adversarial imitation learning (GAIL), generalizes well across datasets, even with limited expert data. We also introduce a novel approach for generating synthetic EDA demonstrations for training. Our method outperforms the existing state-of-the-art end-to-end EDA approach on benchmarks by up to 3x, showing strong performance and generalization, while naturally capturing diverse interestingness measures in generated EDA sessions.
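As an illustration of what "a sequence of analysis operations" can look like, the following Python sketch replays a tiny EDA session made of FILTER and GROUP steps. It does not reproduce ILAEDA or its action space; the operation encoding, the helper names, and the toy rows are all assumptions made for this example, and GROUP is treated as a terminal count-per-value step.

```python
from collections import defaultdict


def apply_filter(rows, column, value):
    """FILTER: keep only rows where rows[column] equals value."""
    return [r for r in rows if r[column] == value]


def apply_group(rows, column):
    """GROUP: count rows per distinct value of column."""
    counts = defaultdict(int)
    for r in rows:
        counts[r[column]] += 1
    return dict(counts)


def run_session(rows, operations):
    """Apply a list of (op_name, *args) tuples in order.

    In this sketch GROUP returns a dict of counts, so it must be the
    last operation in the session.
    """
    result = rows
    for op, *args in operations:
        if op == "FILTER":
            result = apply_filter(result, *args)
        elif op == "GROUP":
            result = apply_group(result, *args)
        else:
            raise ValueError(f"unknown operation: {op}")
    return result


# Toy, made-up flight records used only to exercise the functions.
rows = [
    {"airline": "A", "origin": "ATL", "delayed": True},
    {"airline": "B", "origin": "ATL", "delayed": False},
    {"airline": "A", "origin": "JFK", "delayed": True},
    {"airline": "C", "origin": "ATL", "delayed": True},
]
session = [("FILTER", "delayed", True), ("GROUP", "origin")]
summary = run_session(rows, session)
print(summary)
```

An AutoEDA agent, whether trained with hand-crafted rewards or by imitation, would be choosing the entries of `session` one step at a time rather than receiving them from a person.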

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Artificial Intelligence

2410.11276

Country:

North America > United States > Louisiana > East Baton Rouge Parish > Baton Rouge (0.05)
Asia > India > Telangana > Hyderabad (0.04)
North America > United States > New York > New York County > New York City (0.04)
(2 more...)

Genre:

Research Report > New Finding (0.46)
Research Report > Promising Solution (0.34)

Industry: Information Technology > Security & Privacy (0.31)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

VPI-Mlogs: A web-based machine learning solution for applications in petrophysics

Nguyen, Anh Tuan

arXiv.org Artificial Intelligence | Oct-6-2024

Machine learning is an important part of the data science field. In petrophysics, machine learning algorithms and applications have been widely explored. In this context, the Vietnam Petroleum Institute (VPI) has researched and deployed several effective prediction models, including missing log prediction and fracture zone and fracture density forecasting. As one of our solutions, VPI-MLogs is a web-based deployment platform which integrates data preprocessing, exploratory data analysis, visualisation and model execution. Built with Python, the most popular data analysis programming language, this approach gives users a powerful tool for working with petrophysical log sections. The solution helps to narrow the gap between common knowledge and petrophysics insights. This article focuses on the web-based application, which integrates many solutions for understanding petrophysical data.

application, artificial intelligence, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.47800/PVJ.2022.10-06

2410.05332

Country: Asia > Vietnam (0.37)

Genre:

Research Report (0.50)
Instructional Material > Online (0.40)

Industry: Energy > Oil & Gas > Upstream (1.00)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Exploratory Data Analysis on Code-mixed Misogynistic Comments

Yadav, Sargam, Kaushik, Abhishek, McDaid, Kevin

arXiv.org Artificial Intelligence | Mar-9-2024

The problems of online hate speech and cyberbullying have significantly worsened since the increase in popularity of social media platforms such as YouTube and Twitter (X). Natural Language Processing (NLP) techniques have proven to provide a great advantage in automatically filtering such toxic content. Women are disproportionately more likely to be victims of online abuse. However, there appears to be a lack of studies that tackle misogyny detection in under-resourced languages. In this short paper, we present a novel dataset of YouTube comments in code-mixed Hinglish, collected from YouTube videos, which have been weakly labelled as 'Misogynistic' and 'Non-misogynistic'. Pre-processing and Exploratory Data Analysis (EDA) techniques have been applied to the dataset to gain insights into its characteristics. The process has provided a better understanding of the dataset through sentiment scores, word clouds, etc.

dataset, detection, misogyny detection, (15 more...)

arXiv.org Artificial Intelligence

2403.09709

Country:

Europe > Ireland (0.05)
Asia > India (0.05)

Genre: Research Report (1.00)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.71)
Information Technology > Security & Privacy (0.55)
Media > News (0.47)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)

STREAMLINE: An Automated Machine Learning Pipeline for Biomedicine Applied to Examine the Utility of Photography-Based Phenotypes for OSA Prediction Across International Sleep Centers

Urbanowicz, Ryan J., Bandhey, Harsh, Keenan, Brendan T., Maislin, Greg, Hwang, Sy, Mowery, Danielle L., Lynch, Shannon M., Mazzotti, Diego R., Han, Fang, Li, Qing Yun, Penzel, Thomas, Tufik, Sergio, Bittencourt, Lia, Gislason, Thorarinn, de Chazal, Philip, Singh, Bhajan, McArdle, Nigel, Chen, Ning-Hung, Pack, Allan, Schwab, Richard J., Cistulli, Peter A., Magalang, Ulysses J.

arXiv.org Artificial Intelligence | Dec-8-2023

While machine learning (ML) includes a valuable array of tools for analyzing biomedical data, significant time and expertise are required to assemble effective, rigorous, and unbiased pipelines. Automated ML (AutoML) tools seek to facilitate ML application by automating a subset of analysis pipeline elements. In this study we develop and validate a Simple, Transparent, End-to-end Automated Machine Learning Pipeline (STREAMLINE) and apply it to investigate the added utility of photography-based phenotypes for predicting obstructive sleep apnea (OSA), a common and underdiagnosed condition associated with a variety of health, economic, and safety consequences. STREAMLINE is designed to tackle biomedical binary classification tasks while adhering to best practices and accommodating complexity, scalability, reproducibility, customization, and model interpretation. Benchmarking analyses validated the efficacy of STREAMLINE across data simulations with increasingly complex patterns of association. We then applied STREAMLINE to evaluate the utility of demographics (DEM), self-reported comorbidities (DX), symptoms (SYM), and photography-based craniofacial (CF) and intraoral (IO) anatomy measures in predicting any OSA or moderate/severe OSA using 3,111 participants from the Sleep Apnea Global Interdisciplinary Consortium (SAGIC). OSA analyses identified a significant increase in ROC-AUC when adding CF to DEM+DX+SYM to predict moderate/severe OSA. A consistent but non-significant increase in PRC-AUC was observed with the addition of each subsequent feature set to predict any OSA, with CF and IO yielding minimal improvements. Application of STREAMLINE to OSA data suggests that CF features provide additional value in predicting moderate/severe OSA, but neither CF nor IO features meaningfully improved the prediction of any OSA beyond established demographics, comorbidity and symptom characteristics.

algorithm line represent mean roc-auc, pairwise performance metric comparison, replication data evaluation comparison, (13 more...)

arXiv.org Artificial Intelligence

2312.05461

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.14)
North America > United States > Kansas > Douglas County > Lawrence (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
(14 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Therapeutic Area > Neurology (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Performance Evaluation of Regression Models in Predicting the Cost of Medical Insurance

Cenita, Jonelle Angelo S., Asuncion, Paul Richie F., Victoriano, Jayson M.

arXiv.org Artificial Intelligence | Apr-25-2023

The study aimed to evaluate the performance of regression models in predicting the cost of medical insurance. Three (3) machine learning regression models were used: Linear Regression, Gradient Boosting, and Support Vector Machine. Performance was evaluated using RMSE (root mean square error), R2 (R-squared), and K-Fold Cross-validation. The study also sought to pinpoint the feature that would be most important in predicting the cost of medical insurance. The study is anchored on the knowledge discovery in databases (KDD) process, which refers to the overall process of discovering useful knowledge from data. The performance evaluation results reveal that among the three (3) regression models, Gradient Boosting received the highest R2 (0.892) and the lowest RMSE (1336.594). Furthermore, the 10-Fold Cross-validation weighted mean findings are not significantly different from the R2 results of the three (3) regression models. In addition, Exploratory Data Analysis (EDA) using a box plot of descriptive statistics observed that in the charges and smoker features the median of one group lies outside of the box of the other group, so there is a difference between the two groups. The study concludes that Gradient Boosting appears to perform best among the three (3) regression models. K-Fold Cross-Validation concluded that the three (3) regression models are good. Moreover, EDA using a box plot of descriptive statistics indicates that the highest charges are due to the smoker feature.
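For readers unfamiliar with the evaluation measures named above, the following Python sketch shows how RMSE and R-squared are computed and how a dataset can be split into K folds. It is a standard-library illustration, not the study's code; the function names are chosen for this example, and the fold split is contiguous and unshuffled for simplicity.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error: sqrt of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """R-squared: 1 - (residual sum of squares / total sum of squares).

    Assumes y_true is not constant (otherwise the total sum of squares is 0).
    """
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over n rows.

    The first n % k folds get one extra row so every row is tested exactly once.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


# A model that always predicts the mean has R-squared 0; a perfect model has 1.
print(r2_score([1, 2, 3], [2, 2, 2]), r2_score([1, 2, 3], [1, 2, 3]))
for train, test in k_fold_indices(10, 3):
    print(len(train), len(test))
```

In a cross-validated comparison, each model would be fitted on `train` and scored on `test` for every fold, and the per-fold scores averaged. A lower RMSE and a higher R-squared both indicate a better fit.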

artificial intelligence, machine learning, regression model, (12 more...)

arXiv.org Artificial Intelligence

doi: 10.25147/ijcsr.2017.001.1.146

2304.12605

Country:

South America > Paraguay > Asunción > Asunción (0.05)
Asia > Middle East > Jordan (0.04)
Asia > Philippines > Luzon > National Capital Region > City of Manila (0.04)
(3 more...)

Genre: Research Report (0.84)

Industry:

Banking & Finance > Insurance (1.00)
Health & Medicine > Health Care Providers & Services > Reimbursement (0.86)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)

Exploratory Data Analysis Using Radial Basis Function Latent Variable Models

Neural Information Processing Systems | Apr-6-2023, 17:42:07 GMT

Two developments of nonlinear latent variable models based on radial basis functions are discussed: in the first, the use of priors or constraints on allowable models is considered as a means of preserving data structure in low-dimensional representations for visualisation purposes. Also, a resampling approach is introduced which makes more effective use of the latent samples in evaluating the likelihood.

basis function latent variable model, exploratory data analysis

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.73)

Breaking Into AI: Sahar Nasiri on Acing the Data Science Job Interview

#artificialintelligence | Feb-10-2023, 18:00:26 GMT

Data scientist Sahar Nasiri originally went to college to study industrial engineering. After taking Andrew Ng's Machine Learning course on a professor's recommendation, however, she knew she wanted her future to be in AI. Now she uses AI to help Delta Airlines keep its planes in top operating condition. She spoke with us about her early interview struggles, how she landed her first job, and the value of truly understanding statistics. Can you tell me about your current role? When did you start, what is your title, and what are your primary responsibilities?

algorithm, data scientist, interview, (12 more...)

#artificialintelligence

Country:

Asia > Middle East > Iran (0.05)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
North America > United States > California > Santa Cruz County > Santa Cruz (0.04)
North America > Canada (0.04)

Genre: Personal > Interview (1.00)

Industry: Education (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.95)
