exploratory data analysis
mAIstro: an open-source multi-agentic system for automated end-to-end development of radiomics and deep learning models for medical imaging
Tzanis, Eleftherios, Klontzas, Michail E.
Agentic systems built on large language models (LLMs) offer promising capabilities for automating complex workflows in healthcare AI. We introduce mAIstro, an open-source, autonomous multi-agentic framework for end-to-end development and deployment of medical AI models. The system orchestrates exploratory data analysis, radiomic feature extraction, image segmentation, classification, and regression through a natural language interface, requiring no coding from the user. Built on a modular architecture, mAIstro supports both open- and closed-source LLMs, and was evaluated using a large and diverse set of prompts across 16 open-source datasets, covering a wide range of imaging modalities, anatomical regions, and data types. The agents successfully executed all tasks, producing interpretable outputs and validated models. This work presents the first agentic framework capable of unifying data analysis, AI model development, and inference across varied healthcare applications, offering a reproducible and extensible foundation for clinical and research AI integration. The code is available at: https://github.com/eltzanis/mAIstro
- North America > United States > Wisconsin (0.05)
- Europe > Greece (0.04)
- Europe > Switzerland (0.04)
- Health & Medicine > Health Care Technology (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Health & Medicine > Therapeutic Area > Oncology (0.94)
QUIS: Question-guided Insights Generation for Automated Exploratory Data Analysis
Manatkar, Abhijit, Akella, Ashlesha, Gupta, Parthivi, Narayanam, Krishnasuri
Discovering meaningful insights from a large dataset, known as Exploratory Data Analysis (EDA), is a challenging task that requires thorough exploration and analysis of the data. Automated Data Exploration (ADE) systems use goal-oriented methods with Large Language Models and Reinforcement Learning towards full automation. However, these methods require human involvement to anticipate goals that may limit insight extraction, while fully automated systems demand significant computational resources and retraining for new datasets. We introduce QUIS, a fully automated EDA system that operates in two stages: insight generation (ISGen) driven by question generation (QUGen). The QUGen module generates questions in iterations, refining them from previous iterations to enhance coverage without human intervention or manually curated examples. The ISGen module analyzes data to produce multiple relevant insights in response to each question, requiring no prior training and enabling QUIS to adapt to new datasets.
- Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
- North America > United States (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Asia > India > West Bengal > Kharagpur (0.04)
- Information Technology > Data Science (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.52)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.35)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.34)
ILAEDA: An Imitation Learning Based Approach for Automatic Exploratory Data Analysis
Manatkar, Abhijit, Patel, Devarsh, Patel, Hima, Manwani, Naresh
Automating end-to-end Exploratory Data Analysis (AutoEDA) is a challenging open problem, often tackled through Reinforcement Learning (RL) by learning to predict a sequence of analysis operations (FILTER, GROUP, etc.). Defining rewards for each operation is a challenging task, and existing methods rely on various interestingness measures to craft reward functions that capture the importance of each operation. In this work, we argue that not all of the essential features of what makes an operation important can be accurately captured mathematically using rewards. We propose an AutoEDA model trained through imitation learning from expert EDA sessions, bypassing the need for manually defined interestingness measures. Our method, based on generative adversarial imitation learning (GAIL), generalizes well across datasets, even with limited expert data. We also introduce a novel approach for generating synthetic EDA demonstrations for training. Our method outperforms the existing state-of-the-art end-to-end EDA approach on benchmarks by up to 3x, showing strong performance and generalization, while naturally capturing diverse interestingness measures in generated EDA sessions.
- North America > United States > Louisiana > East Baton Rouge Parish > Baton Rouge (0.05)
- Asia > India > Telangana > Hyderabad (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Research Report > New Finding (0.46)
- Research Report > Promising Solution (0.34)
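To make the phrase "a sequence of analysis operations (FILTER, GROUP, etc.)" in the ILAEDA abstract concrete, here is a minimal sketch of a two-step EDA session over a toy table. It is not the authors' model or their operation set: the helper names `apply_filter` and `apply_group`, the `flights` data, and the choice of a mean aggregate are all assumptions made for illustration.

```python
from collections import defaultdict

def apply_filter(rows, column, predicate):
    """FILTER step: keep only rows whose value in `column` satisfies `predicate`."""
    return [r for r in rows if predicate(r[column])]

def apply_group(rows, key, value):
    """GROUP step: group rows by `key` and return the mean of `value` per group."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r[value])
    return {k: sum(v) / len(v) for k, v in groups.items()}

# Made-up data, for illustration only.
flights = [
    {"airline": "A", "delay": 10},
    {"airline": "A", "delay": 50},
    {"airline": "B", "delay": 40},
    {"airline": "B", "delay": 5},
]

# A two-step session: FILTER to long delays, then GROUP by airline.
late = apply_filter(flights, "delay", lambda d: d >= 30)
mean_delay = apply_group(late, "airline", "delay")
print(mean_delay)
```

An AutoEDA system in the sense of the abstract would choose which operation to apply next; this sketch only shows what applying a fixed sequence looks like.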
VPI-Mlogs: A web-based machine learning solution for applications in petrophysics
Machine learning is an important part of the data science field. In petrophysics, machine learning algorithms and applications have been widely adopted. In this context, the Vietnam Petroleum Institute (VPI) has researched and deployed several effective prediction models, including missing log prediction and fracture zone and fracture density forecasting. As one of our solutions, VPI-MLogs is a web-based deployment platform that integrates data preprocessing, exploratory data analysis, visualisation, and model execution. Built with Python, the most popular programming language for data analysis, this approach gives users a powerful tool for working with petrophysical logs. The solution helps to narrow the gap between common knowledge and petrophysics insights. This article focuses on the web-based application, which integrates many solutions for understanding petrophysical data.
- Research Report (0.50)
- Instructional Material > Online (0.40)
Exploratory Data Analysis on Code-mixed Misogynistic Comments
Yadav, Sargam, Kaushik, Abhishek, McDaid, Kevin
The problems of online hate speech and cyberbullying have significantly worsened since the increase in popularity of social media platforms such as YouTube and Twitter (X). Natural Language Processing (NLP) techniques have proven to provide a great advantage in automatically filtering such toxic content. Women are disproportionately more likely to be victims of online abuse. However, there appears to be a lack of studies that tackle misogyny detection in under-resourced languages. In this short paper, we present a novel dataset of code-mixed Hinglish comments collected from YouTube videos, which have been weakly labelled as 'Misogynistic' and 'Non-misogynistic'. Pre-processing and Exploratory Data Analysis (EDA) techniques have been applied to the dataset to gain insights into its characteristics. The process has provided a better understanding of the dataset through sentiment scores, word clouds, etc.
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.71)
- Information Technology > Security & Privacy (0.55)
- Media > News (0.47)
STREAMLINE: An Automated Machine Learning Pipeline for Biomedicine Applied to Examine the Utility of Photography-Based Phenotypes for OSA Prediction Across International Sleep Centers
Urbanowicz, Ryan J., Bandhey, Harsh, Keenan, Brendan T., Maislin, Greg, Hwang, Sy, Mowery, Danielle L., Lynch, Shannon M., Mazzotti, Diego R., Han, Fang, Li, Qing Yun, Penzel, Thomas, Tufik, Sergio, Bittencourt, Lia, Gislason, Thorarinn, de Chazal, Philip, Singh, Bhajan, McArdle, Nigel, Chen, Ning-Hung, Pack, Allan, Schwab, Richard J., Cistulli, Peter A., Magalang, Ulysses J.
While machine learning (ML) includes a valuable array of tools for analyzing biomedical data, significant time and expertise are required to assemble effective, rigorous, and unbiased pipelines. Automated ML (AutoML) tools seek to facilitate ML application by automating a subset of analysis pipeline elements. In this study we develop and validate a Simple, Transparent, End-to-end Automated Machine Learning Pipeline (STREAMLINE) and apply it to investigate the added utility of photography-based phenotypes for predicting obstructive sleep apnea (OSA), a common and underdiagnosed condition associated with a variety of health, economic, and safety consequences. STREAMLINE is designed to tackle biomedical binary classification tasks while adhering to best practices and accommodating complexity, scalability, reproducibility, customization, and model interpretation. Benchmarking analyses validated the efficacy of STREAMLINE across data simulations with increasingly complex patterns of association. Then we applied STREAMLINE to evaluate the utility of demographics (DEM), self-reported comorbidities (DX), symptoms (SYM), and photography-based craniofacial (CF) and intraoral (IO) anatomy measures in predicting any OSA or moderate/severe OSA using 3,111 participants from the Sleep Apnea Global Interdisciplinary Consortium (SAGIC). OSA analyses identified a significant increase in ROC-AUC when adding CF to DEM+DX+SYM to predict moderate/severe OSA. A consistent but non-significant increase in PRC-AUC was observed with the addition of each subsequent feature set to predict any OSA, with CF and IO yielding minimal improvements. Application of STREAMLINE to OSA data suggests that CF features provide additional value in predicting moderate/severe OSA, but neither CF nor IO features meaningfully improved the prediction of any OSA beyond established demographics, comorbidity, and symptom characteristics.
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.14)
- North America > United States > Kansas > Douglas County > Lawrence (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
- Health & Medicine > Therapeutic Area > Neurology (0.86)
Performance Evaluation of Regression Models in Predicting the Cost of Medical Insurance
Cenita, Jonelle Angelo S., Asuncion, Paul Richie F., Victoriano, Jayson M.
The study aimed to evaluate the performance of regression models in predicting the cost of medical insurance. Three (3) machine learning regression models were used: Linear Regression, Gradient Boosting, and Support Vector Machine. Performance was evaluated using RMSE (Root Mean Square Error), r2 (R-squared), and K-Fold Cross-validation. The study also sought to pinpoint the feature that would be most important in predicting the cost of medical insurance. The study is anchored on the knowledge discovery in databases (KDD) process, which refers to the overall process of discovering useful knowledge from data. The performance evaluation results reveal that, among the three (3) regression models, Gradient Boosting achieved the highest r2 (0.892) and the lowest RMSE (1336.594). Furthermore, the 10-Fold Cross-validation weighted mean findings are not significantly different from the r2 results of the three (3) regression models. In addition, Exploratory Data Analysis (EDA) using box plots of descriptive statistics showed that, for the charges and smoker features, the median of one group lies outside the box of the other group, so there is a difference between the two groups. The study concludes that Gradient Boosting appears to perform best among the three (3) regression models, and K-Fold Cross-Validation indicates that all three models perform well. Moreover, the box-plot EDA indicates that the highest charges are due to the smoker feature.
- South America > Paraguay > Asunción > Asunción (0.05)
- Asia > Middle East > Jordan (0.04)
- Asia > Philippines > Luzon > National Capital Region > City of Manila (0.04)
- Banking & Finance > Insurance (1.00)
- Health & Medicine > Health Care Providers & Services > Reimbursement (0.86)
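The medical insurance entry above relies on two regression metrics (RMSE and R-squared) and on a box-plot rule of thumb: if the median of one group lies outside the box of the other, the groups differ. The following is a minimal sketch of those three calculations using only the Python standard library. The function names and all numbers are hypothetical; this is not the study's code or data, and the quartiles use the default method of `statistics.quantiles`, which may differ from the plotting tool the authors used.

```python
import math
import statistics

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def median_outside_box(group_a, group_b):
    """True if the median of group_a falls outside the interquartile box of group_b."""
    q1, _, q3 = statistics.quantiles(group_b, n=4)
    med = statistics.median(group_a)
    return med < q1 or med > q3

# Made-up toy values, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(rmse(y_true, y_pred), r2(y_true, y_pred))

smoker_charges = [30, 32, 35, 40, 45]
non_smoker_charges = [5, 6, 8, 10, 12]
print(median_outside_box(smoker_charges, non_smoker_charges))
```

The box-plot rule is a visual heuristic rather than a significance test, so it is best read as a prompt for further checking.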
Exploratory Data Analysis Using Radial Basis Function Latent Variable Models
Two developments of nonlinear latent variable models based on radial basis functions are discussed: in the first, the use of priors or constraints on allowable models is considered as a means of preserving data structure in low-dimensional representations for visualisation purposes. Also, a resampling approach is introduced which makes more effective use of the latent samples in evaluating the likelihood.
Breaking Into AI: Sahar Nasiri on Acing the Data Science Job Interview
Data scientist Sahar Nasiri originally went to college to study industrial engineering. After taking Andrew Ng's Machine Learning course on a professor's recommendation, however, she knew she wanted her future to be in AI. Now she uses AI to help Delta Airlines keep its planes in top operating condition. She spoke with us about her early interview struggles, how she landed her first job, and the value of truly understanding statistics. Can you tell me about your current role? When did you start, what is your title, and what are your primary responsibilities?
- Asia > Middle East > Iran (0.05)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- North America > United States > California > Santa Cruz County > Santa Cruz (0.04)
- North America > Canada (0.04)
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach used by data scientists to analyze datasets and summarize their main characteristics, often with the help of data visualization methods. It helps data scientists discover patterns and economic trends, test a hypothesis, or check assumptions. The main purpose of EDA is to look at the data before making any assumptions about it. It can help identify the trends, patterns, and relationships within the data. Data scientists can use exploratory analysis to ensure that the results they produce are valid and applicable to the desired business outcomes and goals.
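As a small illustration of "summarizing the main characteristics" of a dataset, the sketch below computes basic descriptive statistics for one column of a toy table, using only the Python standard library. The `summarize` helper, the column names, and the values are assumptions made for this example. It also assumes the column has at least one non-missing value.

```python
import statistics

def summarize(rows, column):
    """Return basic descriptive statistics for `column`, skipping missing (None) values."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }

# Made-up rows, for illustration only.
rows = [
    {"age": 20, "charges": 100.0},
    {"age": 30, "charges": None},
    {"age": 40, "charges": 300.0},
]

summary = summarize(rows, "charges")
print(summary)
```

Counting missing values alongside the usual statistics is a common first check, since gaps in the data can undermine assumptions made later in an analysis.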