
LMM-IQA: Image Quality Assessment for Low-Dose CT Imaging

Celik, Kagan, Unal, Mehmet Ozan, Ertas, Metin, Yildirim, Isa

arXiv.org Artificial Intelligence

Low-dose computed tomography (CT) significantly improves patient safety through lower radiation doses, but the resulting noise, blur, and contrast loss can diminish diagnostic quality. Consistency and robustness in image quality assessment therefore become essential for clinical applications. In this study, we propose an LMM-based quality assessment system that generates both numerical scores and textual descriptions of degradations such as noise, blur, and contrast loss. Furthermore, various inference strategies, from the zero-shot approach to metadata integration and error feedback, are systematically examined, demonstrating the progressive contribution of each method to overall performance. The resulting assessments yield not only highly correlated scores but also interpretable output, thereby adding value to clinical workflows. The source code of our study is available at https://github.com/itu-biai/lmms_ldct_iqa.
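
To make the inference strategies above concrete, here is a minimal sketch of how the zero-shot, metadata-augmented, and error-feedback prompt variants could be assembled; the prompt wording, the metadata fields, and the commented-out query_model call are illustrative assumptions, not the authors' exact implementation.

    # Sketch of the three inference strategies for LMM-based CT quality
    # assessment. query_model is a hypothetical stand-in for a multimodal API.
    def build_prompt(strategy, metadata=None, prior_error=None):
        prompt = ("Rate the quality of this low-dose CT slice on a 0-4 scale "
                  "and describe any noise, blur, or contrast loss.")
        if strategy in ("metadata", "feedback") and metadata:
            prompt += f"\nAcquisition metadata: {metadata}"
        if strategy == "feedback" and prior_error is not None:
            prompt += (f"\nYour previous score deviated by {prior_error:.2f} "
                       "from the reference; recalibrate accordingly.")
        return prompt

    # Usage: zero-shot first, then progressively richer context.
    for strategy in ("zero-shot", "metadata", "feedback"):
        p = build_prompt(strategy, metadata={"kVp": 120, "dose": "quarter"},
                         prior_error=0.8)
        # score, description = query_model(image, p)  # hypothetical call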


Exploring the Utilities of the Rationales from Large Language Models to Enhance Automated Essay Scoring

Jiao, Hong, Choi, Hanna, Hua, Haowei

arXiv.org Artificial Intelligence

This study explored the utilities of rationales generated by GPT-4.1 and GPT-5 in automated scoring using Prompt 6 essays from the 2012 Kaggle ASAP data. Essay-based scoring was compared with rationale-based scoring. In general, essay-based scoring performed better than rationale-based scoring, with higher Quadratic Weighted Kappa (QWK). However, rationale-based scoring led to higher scoring accuracy in terms of F1 scores for score 0, which had less representation due to class imbalance. Ensemble modeling of the essay-based scoring models increased scoring accuracy both at specific score levels and across all score levels. Ensembles of essay-based scoring with each of the rationale-based scorings performed about the same. A further ensemble of essay-based scoring and both rationale-based scorings yielded the best scoring accuracy, with a QWK of 0.870 compared with the 0.848 reported in the literature.

Automated essay scoring methodology has developed along with advances in AI technology. From early supervised machine learning models based on engineered features (e.g., Mahana et al., 2012) to the recent use of large language models (LLMs), the methods for automated essay scoring, as demonstrated in Appendix A, have evolved with advances in machine learning, deep learning, language models, and LLMs. Using automated scoring of Prompt 6 in the Automated Student Assessment Prize (ASAP) dataset from Kaggle, this study explores the utility of rationales generated by LLMs in enhancing automated essay scoring. For ASAP Prompt 6, automated scoring models have been developed since the 2012 Kaggle competition.
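
Since Quadratic Weighted Kappa is the headline metric in this comparison, a minimal sketch of computing QWK and per-score F1 with scikit-learn follows; the score arrays are invented for illustration.

    # QWK and per-class F1, as used to compare essay-based and
    # rationale-based scoring against human scores.
    from sklearn.metrics import cohen_kappa_score, f1_score

    human = [0, 1, 2, 3, 4, 2, 1, 0, 3, 4]   # illustrative human scores
    model = [0, 1, 2, 3, 3, 2, 2, 1, 3, 4]   # illustrative model scores

    qwk = cohen_kappa_score(human, model, weights="quadratic")
    f1_per_score = f1_score(human, model, average=None)  # one F1 per score level
    print(f"QWK={qwk:.3f}", f1_per_score)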


Local Obfuscation by GLiNER for Impartial Context-Aware Lineage: Development and Evaluation of a PII Removal System

Shivaprakash, Prakrithi, Shukla, Lekhansh, Mukherjee, Animesh, Chand, Prabhat, Murthy, Pratima

arXiv.org Artificial Intelligence

Removing Personally Identifiable Information (PII) from clinical notes in Electronic Health Records (EHRs) is essential for research and AI development. While Large Language Models (LLMs) are powerful, their high computational costs and the data privacy risks of API-based services limit their use, especially in low-resource settings. To address this, we developed LOGICAL (Local Obfuscation by GLINER for Impartial Context-Aware Lineage), an efficient, locally deployable PII removal system built on a fine-tuned Generalist and Lightweight Named Entity Recognition (GLiNER) model. We used 1515 clinical documents from a psychiatric hospital's EHR system. We defined nine PII categories for removal. A modern-gliner-bi-large-v1.0 model was fine-tuned on 2849 text instances and evaluated on a test set of 376 instances using character-level precision, recall, and F1-score. We compared its performance against Microsoft Azure NER, Microsoft Presidio, and zero-shot prompting with Gemini-Pro-2.5 and Llama-3.3-70B-Instruct. The fine-tuned GLiNER model achieved superior performance, with an overall micro-average F1-score of 0.980, significantly outperforming Gemini-Pro-2.5 (F1-score: 0.845). LOGICAL correctly sanitised 95% of documents completely, compared to 64% for the next-best solution. The model operated efficiently on a standard laptop without a dedicated GPU. However, a 2% entity-level false negative rate underscores the need for human-in-the-loop validation across all tested systems. Fine-tuned, specialised transformer models like GLiNER offer an accurate, computationally efficient, and secure solution for PII removal from clinical notes. This "sanitisation at the source" approach is a practical alternative to resource-intensive LLMs, enabling the creation of de-identified datasets for research and AI development while preserving data privacy, particularly in resource-constrained environments.
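
For readers unfamiliar with the GLiNER interface, a minimal detection-and-redaction sketch follows; it loads a public GLiNER checkpoint rather than the authors' fine-tuned model (the API is the same), and the clinical snippet and label set are invented.

    # Sketch of GLiNER-based PII detection and redaction (pip install gliner).
    from gliner import GLiNER

    model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")  # public checkpoint
    text = "Patient John Doe, admitted 12 Jan 2024, phone 555-0142."
    labels = ["person name", "date", "phone number"]  # illustrative PII categories

    entities = model.predict_entities(text, labels, threshold=0.5)
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        # Replace each detected span with its label to sanitise the note in place.
        text = text[:ent["start"]] + f"[{ent['label'].upper()}]" + text[ent["end"]:]
    print(text)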


A Granular Study of Safety Pretraining under Model Abliteration

Agnihotri, Shashank, Jakubassa, Jonas, Dey, Priyam, Goyal, Sachin, Schiele, Bernt, Radhakrishnan, Venkatesh Babu, Keuper, Margret

arXiv.org Artificial Intelligence

Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems, original and abliterated, we issue 100 prompts with balanced harmful and harmless cases, classify responses as Refusal or Non-Refusal using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: https://github.com/shashankskagnihotri/safety_pretraining.
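
The core operation in model abliteration is a directional projection on hidden activations; a minimal numpy sketch under the common difference-of-means formulation follows, with shapes and data invented for illustration.

    # Sketch of directional ablation: remove the component of each hidden state
    # along an estimated refusal direction. The direction estimate (difference
    # of mean activations on harmful vs. harmless prompts) is illustrative.
    import numpy as np

    def abliterate(hidden, refusal_dir):
        r = refusal_dir / np.linalg.norm(refusal_dir)  # unit refusal direction
        return hidden - np.outer(hidden @ r, r)        # h - (h . r) r

    harmful = np.random.randn(100, 768)    # activations on harmful prompts
    harmless = np.random.randn(100, 768)   # activations on harmless prompts
    direction = harmful.mean(axis=0) - harmless.mean(axis=0)

    edited = abliterate(np.random.randn(5, 768), direction)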


SAGE: A Realistic Benchmark for Semantic Understanding

Goel, Samarth, Lee, Reagan J., Ramchandran, Kannan

arXiv.org Artificial Intelligence

As large language models (LLMs) achieve strong performance on traditional benchmarks, there is an urgent need for more challenging evaluation frameworks that probe deeper aspects of semantic understanding. We introduce SAGE (Semantic Alignment & Generalization Evaluation), a rigorous benchmark designed to assess both embedding models and similarity metrics across five categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness. Unlike existing benchmarks that focus on isolated capabilities, SAGE evaluates semantic understanding through adversarial conditions, noisy transformations, and nuanced human judgment tasks across 30+ datasets. Our comprehensive evaluation of 9 embedding models and classical metrics reveals significant performance gaps, with no single approach excelling across all dimensions. For instance, while state-of-the-art embedding models like OpenAI's text-embedding-3-large dominate in aligning with human preferences (0.682 vs. 0.591 for the best classical metric), they are significantly outperformed by classical metrics on information sensitivity tasks, where Jaccard Similarity achieves a score of 0.905 compared to the top embedding score of 0.794. SAGE further uncovers critical trade-offs: OpenAI's text-embedding-3-small achieves the highest clustering performance (0.483) but demonstrates extreme brittleness with the lowest robustness score (0.011). SAGE exposes critical limitations in current semantic understanding capabilities and provides a more realistic assessment of model robustness for real-world deployment.
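
To make the contrast between classical metrics and embedding models concrete, a minimal sketch of the two similarity families follows; the sentences are invented and the commented-out encoder call stands in for any embedding model, not the specific ones benchmarked.

    # Jaccard similarity over token sets vs. cosine similarity over embeddings.
    import numpy as np

    def jaccard(a, b):
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb)

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    s1, s2 = "the cat sat on the mat", "the cat sat on a mat"
    print(jaccard(s1, s2))                   # token-overlap view of similarity
    # emb1, emb2 = encoder.encode([s1, s2])  # hypothetical embedding call
    # print(cosine(emb1, emb2))              # semantic view from an encoder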


The power of dynamic causality in observer-based design for soft sensor applications

Farlessyost, William, Oberst, Sebastian, Singh, Shweta

arXiv.org Artificial Intelligence

This paper introduces a novel framework for optimizing observer-based soft sensors through dynamic causality analysis. Traditional approaches to sensor selection often rely on linearized observability indices or statistical correlations that fail to capture the temporal evolution of complex systems. We address this gap by leveraging liquid time-constant (LTC) networks, continuous-time neural architectures with input-dependent time constants, to systematically identify and prune sensor inputs with minimal causal influence on state estimation. Our methodology implements an iterative workflow: training an LTC observer on candidate inputs, quantifying each input's causal impact through controlled perturbation analysis, removing inputs with negligible effect, and retraining until performance degradation occurs. We demonstrate this approach on three mechanistic testbeds representing distinct physical domains: a harmonically forced spring-mass-damper system, a nonlinear continuous stirred-tank reactor, and a predator-prey model following the structure of the Lotka-Volterra model, but with seasonal forcing and added complexity. Results show that our causality-guided pruning consistently identifies minimal sensor sets that align with underlying physics while improving prediction accuracy. The framework automatically distinguishes essential physical measurements from noise and determines when derived interaction terms provide complementary versus redundant information. Beyond computational efficiency, this approach enhances interpretability by grounding sensor selection decisions in dynamic causal relationships rather than static correlations, offering significant benefits for soft sensing applications across process engineering, ecological monitoring, and agricultural domains.
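
The pruning loop itself is model-agnostic; a minimal sketch of the perturbation-based importance step follows, where the observer stands in for any trained state estimator with a predict method and the threshold is an assumed hyperparameter.

    # Sketch of causality-guided input pruning: perturb each input channel,
    # measure the impact on state-estimation error, drop negligible channels.
    import numpy as np

    def prune_inputs(observer, X, y_true, threshold=0.01):
        base_err = np.mean((observer.predict(X) - y_true) ** 2)
        keep = []
        for j in range(X.shape[1]):
            Xp = X.copy()
            Xp[:, j] = 0.0                  # controlled perturbation
            err = np.mean((observer.predict(Xp) - y_true) ** 2)
            if err - base_err > threshold:  # input has causal influence
                keep.append(j)
        return keep  # retrain the observer on X[:, keep] and iterate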


Concealment of Intent: A Game-Theoretic Analysis

Wu, Xinbo, Umrawal, Abhishek, Varshney, Lav R.

arXiv.org Artificial Intelligence

As large language models (LLMs) grow more capable, concerns about their safe deployment have also grown. Although alignment mechanisms have been introduced to deter misuse, they remain vulnerable to carefully designed adversarial prompts. In this work, we present a scalable attack strategy: intent-hiding adversarial prompting, which conceals malicious intent through the composition of skills. We develop a game-theoretic framework to model the interaction between such attacks and defense systems that apply both prompt and response filtering. Our analysis identifies equilibrium points and reveals structural advantages for the attacker. To counter these threats, we propose and analyze a defense mechanism tailored to intent-hiding attacks. Empirically, we validate the attack's effectiveness on multiple real-world LLMs across a range of malicious behaviors, demonstrating clear advantages over existing adversarial prompting techniques.
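
As a toy illustration of the equilibrium analysis, a 2x2 attacker-defender game (attacker chooses plain vs. intent-hiding prompts; defender chooses lenient vs. strict filtering) can be solved from the attacker's indifference condition; the payoff values are invented and do not come from the paper.

    # Mixed-strategy equilibrium of an illustrative 2x2 attack/defense game.
    import numpy as np

    # Attacker payoffs: rows = (plain prompt, intent-hiding prompt),
    # cols = (lenient filter, strict filter). Values are made up.
    A = np.array([[1.0, -1.0],
                  [0.6,  0.4]])

    # In a mixed equilibrium, the defender's strict-filter probability q
    # makes the attacker indifferent between its two rows.
    q = (A[0, 0] - A[1, 0]) / ((A[0, 0] - A[1, 0]) + (A[1, 1] - A[0, 1]))
    print(f"defender filters strictly with probability {q:.2f}")  # ~0.22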


Solving Scene Understanding for Autonomous Navigation in Unstructured Environments

Renji, Naveen Mathews, K, Kruthika, Keshavamurthy, Manasa, Kumari, Pooja, Rajarajeswari, S.

arXiv.org Artificial Intelligence

Autonomous vehicles are the next revolution in the automobile industry and are expected to transform the future of transportation. Understanding the scenario in which an autonomous vehicle will operate is critical for its competent functioning, and Deep Learning has played a massive role in the progress made to date. Semantic Segmentation, the process of annotating every pixel of an image with an object class, is one crucial part of this scene comprehension using Deep Learning. It is especially useful in Autonomous Driving research, which requires comprehension of drivable and non-drivable areas, roadside objects, and the like. In this paper, semantic segmentation is performed on the Indian Driving Dataset, recently compiled on the urban and rural roads of Bengaluru and Hyderabad. This dataset is more challenging than datasets like Cityscapes, since it is based on unstructured driving environments. It has a four-level label hierarchy, and in this paper segmentation is performed on the first level. Five different models have been trained and their performance compared using Mean Intersection over Union (mIoU): UNet, UNet+ResNet50, DeepLabV3, PSPNet, and SegNet. The highest mIoU achieved is 0.6496. The paper discusses the dataset, exploratory data analysis, data preparation, and the implementation of the five models, and compares the results achieved in the process.
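
Since the five models are compared by Mean Intersection over Union, a minimal sketch of computing mIoU from label maps follows; the class count and random arrays are placeholders, not the IDD level-1 labels.

    # mIoU from predicted and ground-truth segmentation label maps.
    import numpy as np

    def mean_iou(pred, gt, num_classes):
        ious = []
        for c in range(num_classes):
            inter = np.logical_and(pred == c, gt == c).sum()
            union = np.logical_or(pred == c, gt == c).sum()
            if union > 0:                  # skip classes absent from both maps
                ious.append(inter / union)
        return float(np.mean(ious))

    pred = np.random.randint(0, 4, (256, 256))  # illustrative 4-class maps
    gt = np.random.randint(0, 4, (256, 256))
    print(mean_iou(pred, gt, num_classes=4))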


Multimodal Sentiment Analysis on CMU-MOSEI Dataset using Transformer-based Models

Gajjar, Jugal, Ranaware, Kaustik

arXiv.org Artificial Intelligence

This project performs multimodal sentiment analysis on the CMU-MOSEI dataset, using transformer-based models with early fusion to integrate text, audio, and visual modalities. We employ BERT-based encoders for each modality, extracting embeddings that are concatenated before classification. The model achieves strong performance, with 97.87% 7-class accuracy and a 0.9682 F1-score on the test set, demonstrating the effectiveness of early fusion in capturing cross-modal interactions. Training used Adam optimization (lr=1e-4), dropout (0.3), and early stopping to ensure generalization and robustness. Results highlight the superiority of transformer architectures in modeling multimodal sentiment, with a low MAE (0.1060) indicating precise sentiment intensity prediction. Future work may compare fusion strategies or enhance interpretability.
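
A minimal PyTorch sketch of the early-fusion design described above follows (per-modality encoders whose embeddings are concatenated before a classifier head); the plain linear encoders stand in for the BERT-based ones, and the input dimensions are common CMU-MOSEI feature sizes used here only as assumptions.

    # Early fusion: encode each modality, concatenate, classify (7 classes).
    import torch
    import torch.nn as nn

    class EarlyFusionClassifier(nn.Module):
        def __init__(self, text_dim=768, audio_dim=74, visual_dim=35,
                     hidden=256, num_classes=7):
            super().__init__()
            self.text_enc = nn.Linear(text_dim, hidden)    # stand-ins for the
            self.audio_enc = nn.Linear(audio_dim, hidden)  # BERT-based encoders
            self.visual_enc = nn.Linear(visual_dim, hidden)
            self.drop = nn.Dropout(0.3)                    # dropout from the paper
            self.head = nn.Linear(3 * hidden, num_classes)

        def forward(self, text, audio, visual):
            fused = torch.cat([self.text_enc(text), self.audio_enc(audio),
                               self.visual_enc(visual)], dim=-1)
            return self.head(self.drop(fused))

    model = EarlyFusionClassifier()
    logits = model(torch.randn(2, 768), torch.randn(2, 74), torch.randn(2, 35))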


MaXIFE: Multilingual and Cross-lingual Instruction Following Evaluation

Liu, Yile, Ma, Ziwei, Jiang, Xiu, Hu, Jinglu, Chang, Jing, Li, Liang

arXiv.org Artificial Intelligence

With the rapid adoption of large language models (LLMs) in natural language processing, the ability to follow instructions has emerged as a key metric for evaluating their practical utility. However, existing evaluation methods often focus on single-language scenarios, overlooking the challenges and differences present in multilingual and cross-lingual contexts. To address this gap, we introduce MaXIFE: a comprehensive evaluation benchmark designed to assess instruction-following capabilities across 23 different languages with 1667 verifiable instruction tasks. MaXIFE integrates both Rule-Based Evaluation and Model-Based Evaluation, ensuring a balance of efficiency and accuracy. We applied MaXIFE to evaluate several leading commercial LLMs, establishing baseline results for future comparisons. By providing a standardized tool for multilingual instruction-following evaluation, MaXIFE aims to advance research and development in natural language processing.
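
Rule-based evaluation of verifiable instructions typically reduces to programmatic checks over model output; a minimal sketch follows, with an invented checker registry that is not MaXIFE's actual rule set.

    # Sketch of rule-based checking for verifiable instruction-following tasks.
    import re

    CHECKERS = {
        "max_words": lambda out, n: len(out.split()) <= n,
        "contains_keyword": lambda out, kw: kw.lower() in out.lower(),
        "bullet_count": lambda out, n: len(re.findall(r"^- ", out, re.M)) == n,
    }

    def check(output, rules):
        # rules: list of (checker_name, argument) pairs attached to a task
        return all(CHECKERS[name](output, arg) for name, arg in rules)

    sample = "- first point\n- second point\n- third point"
    print(check(sample, [("bullet_count", 3), ("max_words", 20)]))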