Will Large Language Models Transform Clinical Prediction?

Yildiz, Yusuf, Nenadic, Goran, Jani, Meghna, Jenkins, David A.

arXiv.org Artificial Intelligence

Objective: Large language models (LLMs) are attracting increasing interest in healthcare. This commentary evaluates the potential of LLMs to improve clinical prediction models (CPMs) for diagnostic and prognostic tasks, with a focus on their ability to process longitudinal electronic health record (EHR) data. Findings: LLMs show promise in handling multimodal and longitudinal EHR data and can support multi-outcome predictions for diverse health conditions. However, methodological, validation, infrastructural, and regulatory challenges remain. These include inadequate methods for time-to-event modelling, poor calibration of predictions, limited external validation, and bias affecting underrepresented groups. High infrastructure costs and the absence of clear regulatory frameworks further hinder adoption. Implications: Further work and interdisciplinary collaboration are needed to support equitable and effective integration into clinical prediction. Developing temporally aware, fair, and explainable models should be a priority for transforming clinical prediction workflows.


Natural Language Processing for Cardiology: A Narrative Review

Yang, Kailai, Leng, Yan, Zhang, Xin, Zhang, Tianlin, Thompson, Paul, Keavney, Bernard, Tomaszewski, Maciej, Ananiadou, Sophia

arXiv.org Artificial Intelligence

Cardiovascular diseases are becoming increasingly prevalent in modern society, with a profound impact on global health and well-being. These cardiovascular disorders are complex and multifactorial, influenced by genetic predispositions, lifestyle choices, and diverse socioeconomic and clinical factors. Information about these interrelated factors is dispersed across multiple types of textual data, including patient narratives, medical records, and scientific literature. Natural language processing (NLP) has emerged as a powerful approach for analysing such unstructured data, enabling healthcare professionals and researchers to gain deeper insights that may transform the diagnosis, treatment, and prevention of cardiac disorders. This review provides a comprehensive overview of NLP research in cardiology from 2014 to 2025. We systematically searched six literature databases for studies describing NLP applications across a range of cardiovascular diseases. After a rigorous screening process, we identified 265 relevant articles. Each study was analysed across multiple dimensions, including NLP paradigms, cardiology-related tasks, disease types, and data sources. Our findings reveal substantial diversity within these dimensions, reflecting the breadth and evolution of NLP research in cardiology. A temporal analysis further highlights methodological trends, showing a progression from rule-based systems to large language models. Finally, we discuss key challenges and future directions, such as developing interpretable LLMs and integrating multimodal data. To the best of our knowledge, this review represents the most comprehensive synthesis of NLP research in cardiology to date.


A Fuzzy Approach to Project Success: Measuring What Matters

Granja-Correia, João, Hernández-Linares, Remedios, Ferranti, Luca, Rego, Arménio

arXiv.org Artificial Intelligence

This paper introduces a novel approach to project success evaluation by integrating fuzzy logic into an existing construct. Traditional Likert-scale measures often overlook the context-dependent and multifaceted nature of project success. The proposed hierarchical Type-1 Mamdani fuzzy system prioritizes sustained positive impact for end-users, reducing emphasis on secondary outcomes like stakeholder satisfaction and internal project success. This dynamic approach may provide a more accurate measure of project success and could be adaptable to complex evaluations. Future research will focus on empirical testing and broader applications of fuzzy logic in social science.
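The Mamdani pipeline the abstract refers to follows a standard recipe: fuzzify the inputs, fire rules with min as the AND operator, clip each rule's output set by its firing strength, aggregate the clipped sets with max, and defuzzify by centroid. A minimal sketch of that recipe is below; the two-rule base, the triangular membership shapes, and the 0-10 scales are invented for illustration and are not the paper's actual system:

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with feet at a, c and peak at b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def project_success(user_impact, stakeholder_sat):
    """Toy Type-1 Mamdani evaluation on a 0-10 scale (illustrative rules only).

    R1: IF user_impact is HIGH                            THEN success is HIGH
    R2: IF user_impact is LOW AND stakeholder_sat is HIGH THEN success is MEDIUM
    """
    xs = np.linspace(0, 10, 1001)                 # output universe of discourse

    # Fuzzify the inputs (wide triangles act as shoulders at the scale ends)
    impact_high = tri(np.float64(user_impact), 4, 10, 16)
    impact_low  = tri(np.float64(user_impact), -6, 0, 6)
    sat_high    = tri(np.float64(stakeholder_sat), 4, 10, 16)

    # Rule firing strengths: min implements AND
    w1 = impact_high
    w2 = np.minimum(impact_low, sat_high)

    # Mamdani implication: clip each output set by its rule's firing strength
    out_high = np.minimum(tri(xs, 5, 10, 15), w1)
    out_med  = np.minimum(tri(xs, 2.5, 5, 7.5), w2)

    aggregated = np.maximum(out_high, out_med)    # max aggregation
    if aggregated.sum() == 0:
        return 0.0
    return float((xs * aggregated).sum() / aggregated.sum())  # centroid defuzzification
```

Because R1 depends only on sustained user impact, a project with high impact but unhappy stakeholders still defuzzifies to a high score, mirroring the prioritization the abstract describes.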


Iterative Corpus Refinement for Materials Property Prediction Based on Scientific Texts

Zhang, Lei, Stricker, Markus

arXiv.org Artificial Intelligence

The discovery and optimization of materials for specific applications is hampered by the practically infinite number of possible elemental combinations and associated properties, also known as the "combinatorial explosion". By nature of the problem, data are scarce and all possible data sources should be used. In addition to simulations and experimental results, the latent knowledge in scientific texts is not yet used to its full potential. We present an iterative framework that refines a given scientific corpus by strategic selection of the most diverse documents, training Word2Vec models, and monitoring the convergence of composition-property correlations in embedding space. Our approach is applied to predict high-performing materials for oxygen reduction (ORR), hydrogen evolution (HER), and oxygen evolution (OER) reactions for a large number of possible candidate compositions. Our method successfully predicts the highest performing compositions among a large pool of candidates, validated by experimental measurements of the electrocatalytic performance in the lab. This work demonstrates and validates the potential of iterative corpus refinement to accelerate materials discovery and optimization, offering a scalable and efficient tool for screening large compositional spaces where reliable data are scarce or non-existent.
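The loop described above (select the most diverse documents, re-embed, monitor a composition-property correlation until it converges) can be sketched without the full Word2Vec machinery. In the skeleton below, bag-of-words document vectors stand in for learned embeddings, greedy max-min distance selection stands in for the paper's diversity strategy, and a cosine between two term profiles stands in for the composition-property correlation; all names and thresholds are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def refine_corpus(doc_vecs, comp_idx, prop_idx, tol=1e-3):
    """Greedy diversity-driven corpus refinement (illustrative skeleton).

    doc_vecs : (n_docs, n_terms) array of document vectors.
    comp_idx, prop_idx : term columns standing in for a composition token
        and a property token.
    Returns the selection order and the correlation-proxy history.
    """
    n = len(doc_vecs)
    selected = [0]                      # seed with the first document
    history = []
    for _ in range(1, n):
        remaining = [i for i in range(n) if i not in selected]
        # Pick the document farthest (max-min distance) from the current set
        pick = max(remaining, key=lambda i: min(
            np.linalg.norm(doc_vecs[i] - doc_vecs[j]) for j in selected))
        selected.append(pick)
        # Proxy for the embedding-space correlation: cosine between the
        # composition and property term profiles over the selected documents
        sub = doc_vecs[selected]
        history.append(cosine(sub[:, comp_idx], sub[:, prop_idx]))
        # Stop once the correlation estimate has converged
        if len(history) >= 2 and abs(history[-1] - history[-2]) < tol:
            break
    return selected, history
```

In the real framework, each iteration would retrain a Word2Vec model on the refined corpus; the convergence check on the correlation history is what decides when the corpus is "good enough" for property prediction.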


The Linear Fallacy

Communications of the ACM

As one gains seniority, there is a presumption--dubious, perhaps--that one also gains wisdom. Thus, I find myself asked, not infrequently, to share some wisdom with junior researchers who seek insight that can foster success in their careers or life in general. I offer one cautionary bit of advice: "Life is not linear." Linearity is one of the greatest success stories in mathematics. According to Encyclopedia Britannica, "Unlike other parts of mathematics that are frequently invigorated by new ideas and unsolved problems, linear algebra is very well understood." In the U.S., in eighth grade, pupils learn how to analyze and represent linear functions and solve linear equations and systems of linear equations.


Semantic-Aware Representation of Multi-Modal Data for Data Ingress: A Literature Review

Lamart, Pierre, Yu, Yinan, Berger, Christian

arXiv.org Artificial Intelligence

Machine Learning (ML) is continuously permeating a growing number of application domains. Generative AI such as Large Language Models (LLMs) also sees broad adoption for processing multi-modal data such as text, images, audio, and video. While the trend is to use ever-larger datasets for training, managing this data efficiently has become a significant practical challenge in industry: twice as much data is certainly not twice as good. Rather the opposite holds, since understanding the inherent quality and diversity of the underlying data lakes is a growing challenge for application-specific ML as well as for fine-tuning foundation models. Furthermore, information retrieval (IR) from expanding data lakes is complicated by the temporal dimension inherent in time-series data, which must be considered to determine its semantic value. This study focuses on the different semantic-aware techniques to extract embeddings from mono-modal, multi-modal, and cross-modal data to enhance IR capabilities in a growing data lake. Articles were collected to summarize information about the state-of-the-art techniques, focusing on applications of embeddings for three different categories of data modalities.


From Open Access to Guarded Trust

Communications of the ACM

In the golden age of software engineering, data was an open book. Engineers had almost unlimited access to information, enabling them to glean insights, refine products, and optimize system performance with relative ease. Consider the rise of platforms such as Facebook and Google, which in their early stages benefited significantly from vast datasets, harnessing user information to improve experiences, refine algorithms, and even predict user behaviors. For companies such as Amazon, customer data was not just for user experience; it was central to building recommendation systems that, to this day, account for a significant percentage of its sales. This access, however, was a double-edged sword. While data-driven insights propelled tech giants to unprecedented heights, they also led to privacy debacles.


Noise-Augmented Boruta: The Neural Network Perturbation Infusion with Boruta Feature Selection

Gharoun, Hassan, Yazdanjoe, Navid, Khorshidi, Mohammad Sadegh, Gandomi, Amir H.

arXiv.org Artificial Intelligence

With the surge in data generation, both vertically (i.e., volume of data) and horizontally (i.e., dimensionality), the burden of the curse of dimensionality has become increasingly palpable. Feature selection, a key facet of dimensionality reduction techniques, has advanced considerably to address this challenge. One such advancement is the Boruta feature selection algorithm, which successfully discerns meaningful features by contrasting them to their permuted counterparts known as shadow features. However, the significance of a feature is shaped more by the data's overall traits than by its intrinsic value, a sentiment echoed in the conventional Boruta algorithm where shadow features closely mimic the characteristics of the original ones. Building on this premise, this paper introduces an innovative approach to the Boruta feature selection algorithm by incorporating noise into the shadow variables. Drawing parallels from the perturbation analysis framework of artificial neural networks, this evolved version of the Boruta method is presented. Rigorous testing on four publicly available benchmark datasets revealed that the proposed technique outperforms the classic Boruta algorithm, underscoring its potential for enhanced, accurate feature selection.
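The core idea (permute each column to build a shadow feature, perturb the shadow with noise, then keep only original features whose importance beats the strongest shadow) can be sketched compactly. The sketch below is a simplified stand-in: absolute Pearson correlation with the target replaces the random-forest importances used by the real Boruta algorithm, and the noise scale is an assumed parameter:

```python
import numpy as np

def noisy_boruta_step(X, y, noise_scale=0.1, seed=0):
    """One screening pass of a simplified, noise-augmented Boruta.

    Shadow features are permuted copies of each column with Gaussian noise
    added (the paper's perturbation idea); importance is approximated by
    absolute correlation with the target.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape

    def importance(col):
        return abs(np.corrcoef(col, y)[0, 1])

    # Build noise-perturbed shadow features: permute, then inject noise
    shadows = np.empty((n, p))
    for j in range(p):
        permuted = rng.permutation(X[:, j])
        shadows[:, j] = permuted + rng.normal(0.0, noise_scale * X[:, j].std(), size=n)

    # Keep only features that beat the strongest shadow
    shadow_max = max(importance(shadows[:, j]) for j in range(p))
    return [j for j in range(p) if importance(X[:, j]) > shadow_max]
```

Permutation destroys any real association with the target, so the shadows calibrate how much "importance" arises by chance alone; adding noise makes the shadows less of a mirror image of the originals, which is the modification this paper proposes.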


Detecting Fake Job Postings Using Bidirectional LSTM

Pillai, Aravind Sasidharan

arXiv.org Artificial Intelligence

Fake job postings have become prevalent in the online job market, posing significant challenges to job seekers and employers. Despite the growing need to address this problem, there is limited research that leverages deep learning techniques for the detection of fraudulent job advertisements. This study aims to fill the gap by employing a Bidirectional Long Short-Term Memory (Bi-LSTM) model to identify fake job advertisements. Our approach considers both numeric and text features, effectively capturing the underlying patterns and relationships within the data. The proposed model demonstrates superior performance, achieving a 0.91 ROC AUC score and a 98.71% accuracy rate, indicating its potential for practical applications in the online job market. The findings of this research contribute to the development of robust, automated tools that can help combat the proliferation of fake job postings and improve the overall integrity of the job search process. Moreover, we discuss challenges, future research directions, and ethical considerations related to our approach, aiming to inspire further exploration and development of practical solutions to combat online job fraud.
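The architectural idea (run the text in both directions, then fuse the two final states with the posting's numeric features before a sigmoid output) can be shown in a dependency-free sketch. Plain tanh RNN cells stand in for the paper's LSTM cells, the weights are random rather than trained, and the feature names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_pass(seq, Wx, Wh, reverse=False):
    """Run a simple tanh RNN over a token sequence; return the final state."""
    h = np.zeros(Wh.shape[0])
    steps = seq[::-1] if reverse else seq
    for x in steps:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

def bi_rnn_classify(token_vecs, numeric_feats, params):
    """Score a job posting from text token vectors plus numeric features.

    Forward and backward final states are concatenated with the numeric
    features (e.g. a telecommuting flag, salary fields) and passed through
    a logistic output layer, giving P(fake).
    """
    Wx_f, Wh_f, Wx_b, Wh_b, w_out, b_out = params
    h_fwd = rnn_pass(token_vecs, Wx_f, Wh_f)
    h_bwd = rnn_pass(token_vecs, Wx_b, Wh_b, reverse=True)
    z = np.concatenate([h_fwd, h_bwd, numeric_feats])
    return 1.0 / (1.0 + np.exp(-(w_out @ z + b_out)))

# Toy dimensions: 4-dim token embeddings, hidden size 3, 2 numeric features
d, h, k = 4, 3, 2
params = (rng.normal(size=(h, d)), rng.normal(size=(h, h)),
          rng.normal(size=(h, d)), rng.normal(size=(h, h)),
          rng.normal(size=2 * h + k), 0.0)
score = bi_rnn_classify(rng.normal(size=(5, d)), rng.normal(size=k), params)
```

In a real system the tanh cells would be replaced by LSTM cells and the weights learned end-to-end on labeled postings; the fusion of recurrent text states with tabular numeric features is the part this sketch is meant to make concrete.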


Locked AI: The Dangers of Closed Source Code in the Age of Artificial Intelligence

#artificialintelligence

OpenAI has been known for its mission to develop and promote artificial intelligence in a safe and ethical manner. However, the organization recently announced that it will no longer be open sourcing its AI code. This decision has raised concerns about the potential dangers of limiting access to AI research and development. One of the biggest dangers of not open sourcing AI code is the potential for decreased transparency and accountability. Open sourcing code allows other researchers to verify the accuracy and safety of AI models, which can lead to improvements and prevent the deployment of harmful systems. Without open sourcing, there is less transparency and accountability for the development of AI models, which could lead to unintended consequences and the deployment of unsafe AI systems.