image sample
Towards Evaluating Proactive Risk Awareness of Multimodal Language Models
Yuan, Youliang, Jiao, Wenxiang, Xie, Yuejin, Shen, Chihao, Tian, Menghan, Wang, Wenxuan, Huang, Jen-tse, He, Pinjia
Human safety awareness gaps often prevent the timely recognition of everyday risks. In solving this problem, a proactive safety artificial intelligence (AI) system would work better than a reactive one. Instead of just reacting to users' questions, it would actively watch people's behavior and their environment to detect potential dangers in advance. Our Proactive Safety Bench (PaSBench) evaluates this capability through 416 multimodal scenarios (128 image sequences, 288 text logs) spanning 5 safety-critical domains. Evaluation of 36 advanced models reveals fundamental limitations: Top performers like Gemini-2.5-pro achieve 71% image and 64% text accuracy, but miss 45-55% risks in repeated trials. Through failure analysis, we identify unstable proactive reasoning rather than knowledge deficits as the primary limitation. This work establishes (1) a proactive safety benchmark, (2) systematic evidence of model limitations, and (3) critical directions for developing reliable protective AI. We believe our dataset and findings can promote the development of safer AI assistants that actively prevent harm rather than merely respond to requests. Our dataset can be found at https://huggingface.co/datasets/Youliang/PaSBench.
Efficient Breast and Ovarian Cancer Classification via ViT-Based Preprocessing and Transfer Learning
Cancer is one of the leading health challenges for women, specifically breast and ovarian cancer. Early detection can help improve the survival rate through timely intervention and treatment. Traditional methods of detecting cancer involve manually examining mammograms, CT scans, ultrasounds, and other imaging types. However, this makes the process labor-intensive and requires the expertise of trained pathologists. Hence, making it both time-consuming and resource-intensive. In this paper, we introduce a novel vision transformer (ViT)-based method for detecting and classifying breast and ovarian cancer. We use a pre-trained ViT-Base-Patch16-224 model, which is fine-tuned for both binary and multi-class classification tasks using publicly available histopathological image datasets. Further, we use a preprocessing pipeline that converts raw histophological images into standardized PyTorch tensors, which are compatible with the ViT architecture and also help improve the model performance. We evaluated the performance of our model on two benchmark datasets: the BreakHis dataset for binary classification and the UBC-OCEAN dataset for five-class classification without any data augmentation. Our model surpasses existing CNN, ViT, and topological data analysis-based approaches in binary classification. For multi-class classification, it is evaluated against recent topological methods and demonstrates superior performance. Our study highlights the effectiveness of Vision Transformer-based transfer learning combined with efficient preprocessing in oncological diagnostics.
Visual Fidelity Index for Generative Semantic Communications with Critical Information Embedding
Huang, Jianhao, Zeng, Qunsong, Huang, Kaibin
Generative semantic communication (Gen-SemCom) with large artificial intelligence (AI) model promises a transformative paradigm for 6G networks, which reduces communication costs by transmitting low-dimensional prompts rather than raw data. However, purely prompt-driven generation loses fine-grained visual details. Additionally, there is a lack of systematic metrics to evaluate the performance of Gen-SemCom systems. To address these issues, we develop a hybrid Gen-SemCom system with a critical information embedding (CIE) framework, where both text prompts and semantically critical features are extracted for transmissions. First, a novel approach of semantic filtering is proposed to select and transmit the semantically critical features of images relevant to semantic label. By integrating the text prompt and critical features, the receiver reconstructs high-fidelity images using a diffusion-based generative model. Next, we propose the generative visual information fidelity (GVIF) metric to evaluate the visual quality of the generated image. By characterizing the statistical models of image features, the GVIF metric quantifies the mutual information between the distorted features and their original counterparts. By maximizing the GVIF metric, we design a channel-adaptive Gen-SemCom system that adaptively control the volume of features and compression rate according to the channel state. Experimental results validate the GVIF metric's sensitivity to visual fidelity, correlating with both the PSNR and critical information volume. In addition, the optimized system achieves superior performance over benchmarking schemes in terms of higher PSNR and lower FID scores.
CNN-based Image Models Verify a Hypothesis that The Writers of Cuneiform Texts Improved Their Writing Skills When Studying at the Age of Hittite Empire
Kohmoto, Daichi, Fukuda, Katsutoshi, Yoshida, Daisuke, Matsui, Takafumi, Omura, Sachihiro
A cuneiform tablet KBo 23.1 ++/KUB 30.38, which is known to represent a text of Kizzuwatna rituals, was written by two writers with almost identical content in two iterations. Unlike other cuneiform tablets that contained information such as myths, essays, or business records, the reason why ancient people left such tablets for posterity remains unclear. To study this problem, we develop a new methodology by analyzing images of a tablet quantitatively using CNN (Convolutional Neural Network)-based image models, without segmenting cuneiforms one-by-one. Our data-driven methodology implies that the writer writing the first half was a `teacher' and the other writer was a `student' who was training his skills of writing cuneiforms. This result has not been reached by classical linguistics. We also discuss related conclusions and possible further directions for applying our method and its generalizations.
Can Large Vision-Language Models Detect Images Copyright Infringement from GenAI?
Xu, Qipan, Wang, Zhenting, He, Xiaoxiao, Han, Ligong, Tang, Ruixiang
Generative AI models, renowned for their ability to synthesize high-quality content, have sparked growing concerns over the improper generation of copyright-protected material. While recent studies have proposed various approaches to address copyright issues, the capability of large vision-language models (LVLMs) to detect copyright infringements remains largely unexplored. In this work, we focus on evaluating the copyright detection abilities of state-of-the-art LVLMs using a various set of image samples. Recognizing the absence of a comprehensive dataset that includes both IP-infringement samples and ambiguous non-infringement negative samples, we construct a benchmark dataset comprising positive samples that violate the copyright protection of well-known IP figures, as well as negative samples that resemble these figures but do not raise copyright concerns. This dataset is created using advanced prompt engineering techniques. We then evaluate leading LVLMs using our benchmark dataset. Our experimental results reveal that LVLMs are prone to overfitting, leading to the misclassification of some negative samples as IP-infringement cases. In the final section, we analyze these failure cases and propose potential solutions to mitigate the overfitting problem.
Optimized Learning for X-Ray Image Classification for Multi-Class Disease Diagnoses with Accelerated Computing Strategies
Romero, Sebastian A. Cruz, de Jesus, Ivanelyz Rivera, Quinones, Dariana J. Troche, Gallego, Wilson Rivera
X-ray image-based disease diagnosis lies in ensuring the precision of identifying afflictions within the sample, a task fraught with challenges stemming from the occurrence of false positives and false negatives. False positives introduce the risk of erroneously identifying non-existent conditions, leading to misdiagnosis and a decline in patient care quality. Conversely, false negatives pose the threat of overlooking genuine abnormalities, potentially causing delays in treatment and interventions, thereby resulting in adverse patient outcomes. The urgency to overcome these challenges compels ongoing efforts to elevate the precision and reliability of X-ray image analysis algorithms within the computational framework. This study introduces modified pre-trained ResNet models tailored for multi-class disease diagnosis of X-ray images, incorporating advanced optimization strategies to reduce the execution runtime of training and inference tasks. The primary objective is to achieve tangible performance improvements through accelerated implementations of PyTorch, CUDA, Mixed- Precision Training, and Learning Rate Scheduler. While outcomes demonstrate substantial improvements in execution runtimes between normal training and CUDA-accelerated training, negligible differences emerge between various training optimization modalities. This research marks a significant advancement in optimizing computational approaches to reduce training execution time for larger models. Additionally, we explore the potential of effective parallel data processing using MPI4Py for the distribution of gradient descent optimization across multiple nodes and leverage multiprocessing to expedite data preprocessing for larger datasets.
A Domain Translation Framework with an Adversarial Denoising Diffusion Model to Generate Synthetic Datasets of Echocardiography Images
Tiago, Cristiana, Snare, Sten Roar, Sprem, Jurica, McLeod, Kristin
Currently, medical image domain translation operations show a high demand from researchers and clinicians. Amongst other capabilities, this task allows the generation of new medical images with sufficiently high image quality, making them clinically relevant. Deep Learning (DL) architectures, most specifically deep generative models, are widely used to generate and translate images from one domain to another. The proposed framework relies on an adversarial Denoising Diffusion Model (DDM) to synthesize echocardiography images and perform domain translation. Contrary to Generative Adversarial Networks (GANs), DDMs are able to generate high quality image samples with a large diversity. If a DDM is combined with a GAN, this ability to generate new data is completed at an even faster sampling time. In this work we trained an adversarial DDM combined with a GAN to learn the reverse denoising process, relying on a guide image, making sure relevant anatomical structures of each echocardiography image were kept and represented on the generated image samples. For several domain translation operations, the results verified that such generative model was able to synthesize high quality image samples: MSE: 11.50 +/- 3.69, PSNR (dB): 30.48 +/- 0.09, SSIM: 0.47 +/- 0.03. The proposed method showed high generalization ability, introducing a framework to create echocardiography images suitable to be used for clinical research purposes.
A Novel Deep Learning based Model for Erythrocytes Classification and Quantification in Sickle Cell Disease
Bhatia, Manish, Meena, Balram, Rathi, Vipin Kumar, Tiwari, Prayag, Jaiswal, Amit Kumar, Ansari, Shagaf M, Kumar, Ajay, Marttinen, Pekka
The shape of erythrocytes or red blood cells is altered in several pathological conditions. Therefore, identifying and quantifying different erythrocyte shapes can help diagnose various diseases and assist in designing a treatment strategy. Machine Learning (ML) can be efficiently used to identify and quantify distorted erythrocyte morphologies. In this paper, we proposed a customized deep convolutional neural network (CNN) model to classify and quantify the distorted and normal morphology of erythrocytes from the images taken from the blood samples of patients suffering from Sickle cell disease ( SCD). We chose SCD as a model disease condition due to the presence of diverse erythrocyte morphologies in the blood samples of SCD patients. For the analysis, we used 428 raw microscopic images of SCD blood samples and generated the dataset consisting of 10, 377 single-cell images. We focused on three well-defined erythrocyte shapes, including discocytes, oval, and sickle. We used 18 layered deep CNN architecture to identify and quantify these shapes with 81% accuracy, outperforming other models. We also used SHAP and LIME for further interpretability. The proposed model can be helpful for the quick and accurate analysis of SCD blood samples by the clinicians and help them make the right decision for better management of SCD.
ESPT: A Self-Supervised Episodic Spatial Pretext Task for Improving Few-Shot Learning
Rong, Yi, Lu, Xiongbo, Sun, Zhaoyang, Chen, Yaxiong, Xiong, Shengwu
Self-supervised learning (SSL) techniques have recently been integrated into the few-shot learning (FSL) framework and have shown promising results in improving the few-shot image classification performance. However, existing SSL approaches used in FSL typically seek the supervision signals from the global embedding of every single image. Therefore, during the episodic training of FSL, these methods cannot capture and fully utilize the local visual information in image samples and the data structure information of the whole episode, which are beneficial to FSL. To this end, we propose to augment the few-shot learning objective with a novel self-supervised Episodic Spatial Pretext Task (ESPT). Specifically, for each few-shot episode, we generate its corresponding transformed episode by applying a random geometric transformation to all the images in it. Based on these, our ESPT objective is defined as maximizing the local spatial relationship consistency between the original episode and the transformed one. With this definition, the ESPT-augmented FSL objective promotes learning more transferable feature representations that capture the local spatial features of different images and their inter-relational structural information in each input episode, thus enabling the model to generalize better to new categories with only a few samples. Extensive experiments indicate that our ESPT method achieves new state-of-the-art performance for few-shot image classification on three mainstay benchmark datasets. The source code will be available at: https://github.com/Whut-YiRong/ESPT.
A Lightweight Reconstruction Network for Surface Defect Inspection
Hu, Chao, Yao, Jian, Wu, Weijie, Qiu, Weibin, Zhu, Liqiang
Currently, most deep learning methods cannot solve the problem of scarcity of industrial product defect samples and significant differences in characteristics. This paper proposes an unsupervised defect detection algorithm based on a reconstruction network, which is realized using only a large number of easily obtained defect-free sample data. The network includes two parts: image reconstruction and surface defect area detection. The reconstruction network is designed through a fully convolutional autoencoder with a lightweight structure. Only a small number of normal samples are used for training so that the reconstruction network can be A defect-free reconstructed image is generated. A function combining structural loss and $\mathit{L}1$ loss is proposed as the loss function of the reconstruction network to solve the problem of poor detection of irregular texture surface defects. Further, the residual of the reconstructed image and the image to be tested is used as the possible region of the defect, and conventional image operations can realize the location of the fault. The unsupervised defect detection algorithm of the proposed reconstruction network is used on multiple defect image sample sets. Compared with other similar algorithms, the results show that the unsupervised defect detection algorithm of the reconstructed network has strong robustness and accuracy.