Goto

Collaborating Authors

 Sarajevo


Safer Skin Lesion Classification with Global Class Activation Probability Map Evaluation and SafeML

Paxton, Kuniko, Aslansefat, Koorosh, Akagić, Amila, Thakker, Dhavalkumar, Papadopoulos, Yiannis

arXiv.org Artificial Intelligence

Recent advancements in skin lesion classification models have significantly improved accuracy, with some models even surpassing dermatologists' diagnostic performance. However, in medical practice, distrust in AI models remains a challenge. Beyond high accuracy, trustworthy, explainable diagnoses are essential. Existing explainability methods have reliability issues, with LIME-based methods suffering from inconsistency, while CAM-based methods failing to consider all classes. To address these limitations, we propose Global Class Activation Probabilistic Map Evaluation, a method that analyses all classes' activation probability maps probabilistically and at a pixel level. By visualizing the diagnostic process in a unified manner, it helps reduce the risk of misdiagnosis. Furthermore, the application of SafeML enhances the detection of false diagnoses and issues warnings to doctors and patients as needed, improving diagnostic reliability and ultimately patient safety. We evaluated our method using the ISIC datasets with MobileNetV2 and Vision Transformers.


Performance comparison of medical image classification systems using TensorFlow Keras, PyTorch, and JAX

Bećirović, Merjem, Kurtović, Amina, Smajlović, Nordin, Kapo, Medina, Akagić, Amila

arXiv.org Artificial Intelligence

Medical imaging plays a vital role in early disease diagnosis and monitoring. Specifically, blood microscopy offers valuable insights into blood cell morphology and the detection of hematological disorders. In recent years, deep learning-based automated classification systems have demonstrated high potential in enhancing the accuracy and efficiency of blood image analysis. However, a detailed performance analysis of specific deep learning frameworks appears to be lacking. This paper compares the performance of three popular deep learning frameworks, TensorFlow with Keras, PyTorch, and JAX, in classifying blood cell images from the publicly available BloodMNIST dataset. The study primarily focuses on inference time differences, but also classification performance for different image sizes. The results reveal variations in performance across frameworks, influenced by factors such as image resolution and framework-specific optimizations. Classification accuracy for JAX and PyTorch was comparable to current benchmarks, showcasing the efficiency of these frameworks for medical image classification.


Improving Diagnostic Accuracy of Pigmented Skin Lesions With CNNs: an Application on the DermaMNIST Dataset

Kadric, Nerma, Akagic, Amila, Kapo, Medina

arXiv.org Artificial Intelligence

Pigmented skin lesions represent localized areas of increased melanin and can indicate serious conditions like melanoma, a major contributor to skin cancer mortality. The MedMNIST v2 dataset, inspired by MNIST, was recently introduced to advance research in biomedical imaging and includes DermaMNIST, a dataset for classifying pigmented lesions based on the HAM10000 dataset. This study assesses ResNet-50 and EfficientNetV2L models for multi-class classification using DermaMNIST, employing transfer learning and various layer configurations. One configuration achieves results that match or surpass existing methods. This study suggests that convolutional neural networks (CNNs) can drive progress in biomedical image analysis, significantly enhancing diagnostic accuracy.


Comparative Analysis of CNN Performance in Keras, PyTorch and JAX on PathMNIST

Nezović, Anida, Romano, Jalal, Marić, Nada, Kapo, Medina, Akagić, Amila

arXiv.org Artificial Intelligence

Deep learning has significantly advanced the field of medical image classification, particularly with the adoption of Convolutional Neural Networks (CNNs). Various deep learning frameworks such as Keras, PyTorch and JAX offer unique advantages in model development and deployment. However, their comparative performance in medical imaging tasks remains underexplored. This study presents a comprehensive analysis of CNN implementations across these frameworks, using the PathMNIST dataset as a benchmark. We evaluate training efficiency, classification accuracy and inference speed to assess their suitability for real-world applications. Our findings highlight the trade-offs between computational speed and model accuracy, offering valuable insights for researchers and practitioners in medical image analysis.


Search-Based Robot Motion Planning With Distance-Based Adaptive Motion Primitives

Kraljusic, Benjamin, Ajanovic, Zlatan, Covic, Nermin, Lacevic, Bakir

arXiv.org Artificial Intelligence

This work proposes a motion planning algorithm for robotic manipulators that combines sampling-based and search-based planning methods. The core contribution of the proposed approach is the usage of burs of free configuration space (C-space) as adaptive motion primitives within the graph search algorithm. Due to their feature to adaptively expand in free C-space, burs enable more efficient exploration of the configuration space compared to fixed-sized motion primitives, significantly reducing the time to find a valid path and the number of required expansions. The algorithm is implemented within the existing SMPL (Search-Based Motion Planning Library) library and evaluated through a series of different scenarios involving manipulators with varying number of degrees-of-freedom (DoF) and environment complexity. Results demonstrate that the bur-based approach outperforms fixed-primitive planning in complex scenarios, particularly for high DoF manipulators, while achieving comparable performance in simpler scenarios.


Rethinking Group Recommender Systems in the Era of Generative AI: From One-Shot Recommendations to Agentic Group Decision Support

Jannach, Dietmar, Delić, Amra, Ricci, Francesco, Zanker, Markus

arXiv.org Artificial Intelligence

More than twenty-five years ago, first ideas were developed on how to design a system that can provide recommendations to groups of users instead of individual users. Since then, a rich variety of algorithmic proposals were published, e.g., on how to acquire individual preferences, how to aggregate them, and how to generate recommendations for groups of users. However, despite the rich literature on the topic, barely any examples of real-world group recommender systems can be found. This lets us question common assumptions in academic research, in particular regarding communication processes in a group and how recommendation-supported decisions are made. In this essay, we argue that these common assumptions and corresponding system designs often may not match the needs or expectations of users. We thus call for a reorientation in this research area, leveraging the capabilities of modern Generative AI assistants like ChatGPT. Specifically, as one promising future direction, we envision group recommender systems to be systems where human group members interact in a chat and an AI-based group recommendation agent assists the decision-making process in an agentic way. Ultimately, this shall lead to a more natural group decision-making environment and finally to wider adoption of group recommendation systems in practice.


Toward Safety-First Human-Like Decision Making for Autonomous Vehicles in Time-Varying Traffic Flow

Wang, Xiao, Yu, Junru, Huang, Jun, Wu, Qiong, Vacic, Ljubo, Sun, Changyin

arXiv.org Artificial Intelligence

Despite the recent advancements in artificial intelligence technologies have shown great potential in improving transport efficiency and safety, autonomous vehicles(AVs) still face great challenge of driving in time-varying traffic flow, especially in dense and interactive situations. Meanwhile, human have free wills and usually do not make the same decisions even situate in the exactly same scenarios, leading to the data-driven methods suffer from poor migratability and high search cost problems, decreasing the efficiency and effectiveness of the behavior policy. In this research, we propose a safety-first human-like decision-making framework(SF-HLDM) for AVs to drive safely, comfortably, and social compatiblely in effiency. The framework integrates a hierarchical progressive framework, which combines a spatial-temporal attention (S-TA) mechanism for other road users' intention inference, a social compliance estimation module for behavior regulation, and a Deep Evolutionary Reinforcement Learning(DERL) model for expanding the search space efficiently and effectively to make avoidance of falling into the local optimal trap and reduce the risk of overfitting, thus make human-like decisions with interpretability and flexibility. The SF-HLDM framework enables autonomous driving AI agents dynamically adjusts decision parameters to maintain safety margins and adhering to contextually appropriate driving behaviors at the same time.


HealthBench: Evaluating Large Language Models Towards Improved Human Health

Arora, Rahul K., Wei, Jason, Hicks, Rebecca Soskin, Bowman, Preston, Quiñonero-Candela, Joaquin, Tsimpourlas, Foivos, Sharman, Michael, Shah, Meghan, Vallone, Andrea, Beutel, Alex, Heidecke, Johannes, Singhal, Karan

arXiv.org Artificial Intelligence

HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional. Responses are evaluated using conversation-specific rubrics created by 262 physicians. Unlike previous multiple-choice or short-answer benchmarks, Health-Bench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning several health contexts (e.g., emergencies, transforming clinical data, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication). HealthBench performance over the last two years reflects steady initial progress (compare GPT-3.5 Turbo's 16% to GPT-4o's 32%) and more rapid recent improvements (o3 scores 60%). Smaller models have especially improved: GPT-4.1 nano outperforms GPT-4o and is 25 times cheaper. We additionally release two HealthBench variations: HealthBench Consensus, which includes 34 particularly important dimensions of model behavior validated via physician consensus, and HealthBench Hard, where the current top score is 32%. We hope that HealthBench grounds progress towards model development and applications that benefit human health.


Case Study: Fine-tuning Small Language Models for Accurate and Private CWE Detection in Python Code

Bappy, Md. Azizul Hakim, Mustafa, Hossen A, Saha, Prottoy, Salehat, Rajinus

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated significant capabilities in understanding and analyzing code for security vulnerabilities, such as Common Weakness Enumerations (CWEs). However, their reliance on cloud infrastructure and substantial computational requirements pose challenges for analyzing sensitive or proprietary codebases due to privacy concerns and inference costs. This work explores the potential of Small Language Models (SLMs) as a viable alternative for accurate, on-premise vulnerability detection. We investigated whether a 350-million parameter pre-trained code model (codegen-mono) could be effectively fine-tuned to detect the MITRE Top 25 CWEs specifically within Python code. To facilitate this, we developed a targeted dataset of 500 examples using a semi-supervised approach involving LLM-driven synthetic data generation coupled with meticulous human review. Initial tests confirmed that the base codegen-mono model completely failed to identify CWEs in our samples. However, after applying instruction-following fine-tuning, the specialized SLM achieved remarkable performance on our test set, yielding approximately 99% accuracy, 98.08% precision, 100% recall, and a 99.04% F1-score. These results strongly suggest that fine-tuned SLMs can serve as highly accurate and efficient tools for CWE detection, offering a practical and privacy-preserving solution for integrating advanced security analysis directly into development workflows.


Interview with Amina Mević: Machine learning applied to semiconductor manufacturing

AIHub

In a series of interviews, we're meeting some of the AAAI/SIGAI Doctoral Consortium participants to find out more about their research. In this latest interview, we hear from Amina Mević who is applying machine learning to semiconductor manufacturing. Find out more about her PhD research so far, what makes this field so interesting, and how she found the AAAI Doctoral Consortium experience. I am currently pursuing my PhD at the University of Sarajevo, Faculty of Electrical Engineering, Department of Computer Science and Informatics. My research is being carried out in collaboration with Infineon Technologies Austria as part of the Important Project of Common European Interest (IPCEI) in Microelectronics.