Collaborating Authors

 Zhang, Weiyi


FFA Sora, video generation as fundus fluorescein angiography simulator

arXiv.org Artificial Intelligence

Fundus fluorescein angiography (FFA) is critical for diagnosing retinal vascular diseases, but beginners often struggle with image interpretation. This study develops FFA Sora, a text-to-video model that converts FFA reports into dynamic videos via a Wavelet-Flow Variational Autoencoder (WF-VAE) and a diffusion transformer (DiT). Trained on an anonymized dataset, FFA Sora accurately simulates disease features from the input text, as confirmed by objective metrics: Fréchet Video Distance (FVD) = 329.78, Learned Perceptual Image Patch Similarity (LPIPS) = 0.48, and Visual-question-answering Score (VQAScore) = 0.61. Specific evaluations showed acceptable alignment between the generated videos and textual prompts, with a BERTScore of 0.35. Additionally, the model demonstrated strong privacy-preserving performance in retrieval evaluations, achieving an average Recall@K of 0.073. Human assessments indicated satisfactory visual quality, with an average score of 1.570 (scale: 1 = best, 5 = worst). This model addresses privacy concerns associated with sharing large-scale FFA data and enhances medical education.
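
The abstract reports an average Recall@K but does not describe the retrieval protocol. As a rough illustration only, the sketch below uses the generic definition of Recall@K: each query (here assumed to be a generated video) gets a ranked list of candidate real videos, and a query counts as a hit if its source item appears in the top K. The function name, the ID-based inputs, and the toy data are assumptions for illustration, not the authors' evaluation code. Under this reading, a lower value suggests that generated videos are harder to trace back to specific training samples.

```python
from typing import Sequence


def recall_at_k(
    rankings: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int,
) -> float:
    """Fraction of queries whose target ID appears in the top-k ranked IDs.

    rankings[i] is the candidate list for query i, best match first.
    targets[i] is the ID that would count as a correct retrieval for query i.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not targets or len(rankings) != len(targets):
        raise ValueError("rankings and targets must be non-empty and aligned")

    hits = sum(
        1 for ranked, target in zip(rankings, targets) if target in ranked[:k]
    )
    return hits / len(targets)


# Toy example with hypothetical IDs: three queries, three candidates each.
toy_rankings = [["a", "b", "c"], ["c", "a", "b"], ["b", "c", "a"]]
toy_targets = ["a", "a", "a"]
toy_recall_at_1 = recall_at_k(toy_rankings, toy_targets, k=1)
```

An "average Recall@K" would then be the mean of this quantity over several values of K; which values the study used is not stated in the abstract.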


EyeDiff: text-to-image diffusion model improves rare eye disease diagnosis

arXiv.org Artificial Intelligence

The rising prevalence of vision-threatening retinal diseases poses a significant burden on global healthcare systems. Deep learning (DL) offers a promising solution for automatic disease screening but demands substantial data. Collecting and labeling large volumes of ophthalmic images across various modalities encounters several real-world challenges, especially for rare diseases. Here, we introduce EyeDiff, a text-to-image model designed to generate multimodal ophthalmic images from natural language prompts, and evaluate its applicability in diagnosing common and rare diseases. EyeDiff is trained on eight large-scale datasets using the advanced latent diffusion model, covering 14 ophthalmic image modalities and over 80 ocular diseases, and is adapted to ten multi-country external datasets. The generated images accurately capture essential lesional characteristics, achieving high alignment with text prompts as evaluated by objective metrics and human experts. Furthermore, integrating generated images significantly enhances the accuracy of detecting minority classes and rare eye diseases, surpassing traditional oversampling methods in addressing data imbalance. EyeDiff effectively tackles the issue of data imbalance and insufficiency typically encountered in rare diseases and addresses the challenges of collecting large-scale annotated images, offering a transformative solution to enhance the development of expert-level disease diagnosis models in the ophthalmic field.


MiniTac: An Ultra-Compact 8 mm Vision-Based Tactile Sensor for Enhanced Palpation in Robot-Assisted Minimally Invasive Surgery

arXiv.org Artificial Intelligence

Robot-assisted minimally invasive surgery (RAMIS) provides substantial benefits over traditional open and laparoscopic methods. However, a significant limitation of RAMIS is the surgeon's inability to palpate tissues, a crucial technique for examining tissue properties and detecting abnormalities, restricting the widespread adoption of RAMIS. To overcome this obstacle, we introduce MiniTac, a novel vision-based tactile sensor with an ultra-compact cross-sectional diameter of 8 mm, designed for seamless integration into mainstream RAMIS devices, particularly the Da Vinci surgical systems. MiniTac features a novel mechanoresponsive photonic elastomer membrane that changes color distribution under varying contact pressures. This color change is captured by an embedded miniature camera, allowing MiniTac to detect tumors both on the tissue surface and in deeper layers typically obscured from endoscopic view. MiniTac's efficacy has been rigorously tested on both phantoms and ex-vivo tissues. By leveraging advanced mechanoresponsive photonic materials, MiniTac represents a significant advancement in integrating tactile sensing into RAMIS, potentially expanding its applicability to a wider array of clinical scenarios that currently rely on traditional surgical approaches.


Visual Question Answering in Ophthalmology: A Progressive and Practical Perspective

arXiv.org Artificial Intelligence

Accurate diagnosis of ophthalmic diseases relies heavily on the interpretation of multimodal ophthalmic images, a process often time-consuming and expertise-dependent. Visual Question Answering (VQA) presents a potential interdisciplinary solution by merging computer vision and natural language processing to comprehend and respond to queries about medical images. This review article explores the recent advancements and future prospects of VQA in ophthalmology from both theoretical and practical perspectives, aiming to provide eye care professionals with a deeper understanding and tools for leveraging the underlying models. Additionally, we discuss the promising trend of large language models (LLM) in enhancing various components of the VQA framework to adapt to multimodal ophthalmic tasks. Despite the promising outlook, ophthalmic VQA still faces several challenges, including the scarcity of annotated multimodal image datasets, the necessity of comprehensive and unified evaluation methods, and the obstacles to achieving effective real-world applications. This article highlights these challenges and clarifies future directions for advancing ophthalmic VQA with LLMs. The development of LLM-based ophthalmic VQA systems calls for collaborative efforts between medical professionals and AI experts to overcome existing obstacles and advance the diagnosis and care of eye diseases.

Keywords: Ophthalmic Visual Question Answering, Large Language Models, Multimodal Image Interpretation, Report Generation, Generative Artificial Intelligence

Introduction

Accurate diagnosis of ophthalmic diseases often relies on the comprehensive analysis of multimodal ophthalmic images, including color fundus photographs (CFP), optical coherence tomography (OCT), fundus fluorescein angiography (FFA), scanning laser ophthalmoscopy (SLO), anterior segment photographs and corneal topography, etc.


SLR: Learning Quadruped Locomotion without Privileged Information

arXiv.org Artificial Intelligence

Traditional reinforcement learning control for quadruped robots often relies on privileged information, demanding meticulous selection and precise estimation, thereby imposing constraints on the development process. This work proposes a Self-learning Latent Representation (SLR) method, which achieves high-performance control policy learning without the need for privileged information. To enhance the credibility of our proposed method's evaluation, SLR is compared with open-source code repositories of state-of-the-art algorithms, retaining the original authors' configuration parameters. Across four repositories, SLR consistently outperforms the reference results. Ultimately, the trained policy and encoder empower the quadruped robot to navigate steps, climb stairs, ascend rocks, and traverse various challenging terrains. Robot experiment videos are at https://11chens.github.io/SLR/


EyeFound: A Multimodal Generalist Foundation Model for Ophthalmic Imaging

arXiv.org Artificial Intelligence

Artificial intelligence (AI) is vital in ophthalmology, tackling tasks like diagnosis, classification, and visual question answering (VQA). However, existing AI models in this domain often require extensive annotation and are task-specific, limiting their clinical utility. While recent developments have brought about foundation models for ophthalmology, they are limited by the need to train separate weights for each imaging modality, preventing a comprehensive representation of multi-modal features. This highlights the need for versatile foundation models capable of handling various tasks and modalities in ophthalmology. To address this gap, we present EyeFound, a multimodal foundation model for ophthalmic images. Unlike existing models, EyeFound learns generalizable representations from unlabeled multimodal retinal images, enabling efficient model adaptation across multiple applications. Trained on 2.78 million images from 227 hospitals across 11 ophthalmic modalities, EyeFound facilitates generalist representations and diverse multimodal downstream tasks, even for detecting challenging rare diseases. It outperforms previous work RETFound in diagnosing eye diseases, predicting systemic disease incidents, and zero-shot multimodal VQA. EyeFound provides a generalizable solution to improve model performance and lessen the annotation burden on experts, facilitating widespread clinical AI applications for retinal imaging.


EyeGPT: Ophthalmic Assistant with Large Language Models

arXiv.org Artificial Intelligence

Artificial intelligence (AI) has gained significant attention in healthcare consultation due to its potential to improve clinical workflow and enhance medical communication. However, owing to the complex nature of medical information, large language models (LLMs) trained with general world knowledge might not possess the capability to tackle medical-related tasks at an expert level. Here, we introduce EyeGPT, a specialized LLM designed specifically for ophthalmology, using three optimization strategies: role-playing, finetuning, and retrieval-augmented generation. In particular, we propose a comprehensive evaluation framework that encompasses a diverse dataset, covering various subspecialties of ophthalmology, different users, and diverse inquiry intents. Moreover, we consider multiple evaluation metrics, including accuracy, understandability, trustworthiness, empathy, and the proportion of hallucinations. By assessing the performance of different EyeGPT variants, we identify the most effective one, which exhibits comparable levels of understandability, trustworthiness, and empathy to human ophthalmologists (all Ps > 0.05). Overall, our study provides valuable insights for future research, facilitating comprehensive comparisons and evaluations of different strategies for developing specialized LLMs in ophthalmology. The potential benefits include enhancing the patient experience in eye care and optimizing ophthalmologists' services.


LP-SLAM: Language-Perceptive RGB-D SLAM system based on Large Language Model

arXiv.org Artificial Intelligence

Simultaneous localization and mapping (SLAM) is a critical technology that enables autonomous robots to be aware of their surrounding environment. With the development of deep learning, SLAM systems can achieve a higher level of perception of the environment, including the semantic and text levels. However, current works are limited in their ability to achieve a natural-language level of perception of the world. To address this limitation, we propose LP-SLAM, the first language-perceptive SLAM system that leverages large language models (LLMs). LP-SLAM has two major features: (a) it can detect text in the scene and determine whether it represents a landmark to be stored during the tracking and mapping phase, and (b) it can understand natural language input from humans and provide guidance based on the generated map. We illustrate three uses of the LLM in the system: text clustering, landmark judgment, and natural language navigation. Our proposed system represents an advancement in the field of LLM-based SLAM and opens up new possibilities for autonomous robots to interact with their environment in a more natural and intuitive way.


MIXPGD: Hybrid Adversarial Training for Speech Recognition Systems

arXiv.org Artificial Intelligence

Automatic speech recognition (ASR) systems based on deep neural networks are vulnerable to adversarial perturbations. We propose the mixPGD adversarial training method to improve the robustness of ASR models. In standard adversarial training, adversarial samples are generated by leveraging supervised or unsupervised methods. We merge the capabilities of both supervised and unsupervised approaches in our method to generate new adversarial samples, which aid in improving model robustness. Extensive experiments and comparisons across various state-of-the-art defense methods and adversarial attacks show that mixPGD improves word error rate (WER) by 4.1% over the previous best-performing models under the white-box adversarial attack setting. We tested our proposed defense method in both white-box and transfer-based black-box attack settings to ensure that our defense strategy is robust against various types of attacks. Empirical results on several adversarial attacks validate the effectiveness of our proposed approach.
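
For readers unfamiliar with the WER metric quoted above, it is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer's output, divided by the number of reference words. The following is a minimal sketch of that conventional definition, assuming plain whitespace tokenization; the function name and example sentences are hypothetical, and this is not the evaluation code used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over six words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.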


Attack on practical speaker verification system using universal adversarial perturbations

arXiv.org Artificial Intelligence

In authentication scenarios, practical speaker verification systems usually require a person to read a dynamic authentication text. Previous studies played an audio adversarial example as a digital signal to perform physical attacks, which would be easily rejected by audio replay detection modules. This work shows that by playing our crafted adversarial perturbation as a separate source while the adversary is speaking, the practical speaker verification system will misjudge the adversary as a target speaker. A two-step algorithm is proposed to optimize the universal adversarial perturbation so that it is text-independent and has little effect on recognition of the authentication text. We also estimate the room impulse response (RIR) in the algorithm, which allows the perturbation to remain effective after being played over the air. In the physical experiment, we achieved targeted attacks with a success rate of 100%, while the word error rate (WER) on speech recognition increased by only 3.55%. Moreover, the recorded audio could pass replay detection because a live person was speaking.