Zhang, Dong
Generative Style Transfer for MRI Image Segmentation: A Case of Glioma Segmentation in Sub-Saharan Africa
Chepchirchir, Rancy, Sunday, Jill, Confidence, Raymond, Zhang, Dong, Chaudhry, Talha, Anazodo, Udunna C., Muchungi, Kendi, Zou, Yujing
In Sub-Saharan Africa (SSA), the utilization of lower-quality Magnetic Resonance Imaging (MRI) technology raises questions about the applicability of machine learning (ML) methods for clinical tasks. This study aims to provide a robust deep learning-based brain tumor segmentation (BraTS) method tailored for the SSA population using a threefold approach. Firstly, the impact of domain shift from the SSA training data on model efficacy was examined, revealing no significant effect. Secondly, a comparative analysis of 3D and 2D full-resolution models using the nnU-Net framework indicates comparable performance, with both models, trained for 300 epochs, achieving a five-fold cross-validation score of 0.93. Lastly, to address the performance gap observed on the SSA validation set relative to the larger BraTS glioma (GLI) validation set, two strategies are proposed: fine-tuning on SSA cases using the best GLI+SSA pretrained 2D full-resolution model at 300 epochs, and introducing a novel neural style transfer-based data augmentation technique for the SSA cases. This investigation underscores the potential of enhancing brain tumor prediction within SSA's unique healthcare landscape.
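The abstract does not spell out the augmentation pipeline, so the following is a minimal, hypothetical sketch of Gram-matrix-based neural style transfer in PyTorch, of the kind that could restyle a slice toward a target scanner's image statistics; the VGG feature extractor, layer choice, and optimization schedule are all illustrative assumptions, not the paper's method.

```python
# Hypothetical style-transfer augmentation sketch: MRI slices are assumed to be
# replicated to 3 channels so a VGG feature extractor can be used.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

def gram_matrix(feat):
    # feat: (B, C, H, W) -> (B, C, C) channel-correlation (style) matrix
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

features = vgg16(weights="DEFAULT").features[:16].eval()
for p in features.parameters():
    p.requires_grad_(False)

def stylize(content, style, steps=200, style_weight=1e5):
    """Optimize a copy of `content` so its feature statistics match `style`."""
    x = content.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=0.01)
    with torch.no_grad():
        f_content = features(content)
        g_style = gram_matrix(features(style))
    for _ in range(steps):
        opt.zero_grad()
        f_x = features(x)
        loss = F.mse_loss(f_x, f_content) \
             + style_weight * F.mse_loss(gram_matrix(f_x), g_style)
        loss.backward()
        opt.step()
    return x.detach()
```

The stylized slice keeps the content (anatomy) term close to the source while pulling texture statistics toward the style image, which is the generic recipe such augmentation builds on.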
Merging Context Clustering with Visual State Space Models for Medical Image Segmentation
Zhu, Yun, Zhang, Dong, Lin, Yi, Feng, Yifei, Tang, Jinhui
Medical image segmentation demands the aggregation of global and local feature representations, posing a challenge for current methodologies in handling both long-range and short-range feature interactions. Recently, vision mamba (ViM) models have emerged as promising solutions for addressing model complexities by excelling in long-range feature interactions with linear complexity. However, existing ViM approaches overlook the importance of preserving short-range local dependencies by directly flattening spatial tokens, and are constrained by fixed scanning patterns that limit the capture of dynamic spatial context information. To address these challenges, we introduce a simple yet effective method named context clustering ViM (CCViM), which incorporates a context clustering module within existing ViM models to segment image tokens into distinct windows for adaptable local clustering. Our method effectively combines long-range and short-range feature interactions, thereby enhancing spatial contextual representations for medical image segmentation tasks. Extensive experimental evaluations on diverse public datasets, i.e., Kumar, CPM17, ISIC17, ISIC18, and Synapse, demonstrate the superior performance of our method compared to current state-of-the-art methods. Our code can be found at https://github.com/zymissy/CCViM.
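As a rough illustration of the window-wise local clustering idea, here is a hedged PyTorch sketch that partitions a feature map into windows and aggregates each token with a soft cluster context; the center proposal, similarity measure, and residual fusion are generic context-clustering choices and may differ from CCViM's exact design.

```python
# Hedged sketch: context clustering applied independently inside local windows.
import torch
import torch.nn.functional as F

def window_context_cluster(x, window=7, num_centers=4):
    """x: (B, C, H, W) feature map; H and W assumed divisible by `window`."""
    B, C, H, W = x.shape
    # split into non-overlapping windows -> (B*nW, C, window*window) tokens
    wins = F.unfold(x, kernel_size=window, stride=window)        # (B, C*w*w, nW)
    nW = wins.shape[-1]
    wins = wins.transpose(1, 2).reshape(B * nW, C, window * window)
    # propose cluster centers by pooling each window's tokens
    centers = F.adaptive_avg_pool1d(wins, num_centers)           # (B*nW, C, k)
    # cosine similarity between tokens and centers -> soft assignment
    sim = torch.einsum("bct,bck->btk",
                       F.normalize(wins, dim=1), F.normalize(centers, dim=1))
    assign = sim.softmax(dim=-1)                                 # (B*nW, t, k)
    # aggregate each token with its assigned cluster context
    ctx = torch.einsum("btk,bck->bct", assign, centers)
    out = wins + ctx                                             # residual fusion
    # fold windows back into a (B, C, H, W) map
    out = out.reshape(B, nW, C * window * window).transpose(1, 2)
    return F.fold(out, output_size=(H, W), kernel_size=window, stride=window)
```

Because clustering is confined to each window, short-range dependencies are modeled explicitly, complementing the long-range ViM scan.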
Parameter-efficient Fine-tuning for improved Convolutional Baseline for Brain Tumor Segmentation in Sub-Saharan Africa Adult Glioma Dataset
Adhikari, Bijay, Kulung, Pratibha, Bohaju, Jakesh, Poudel, Laxmi Kanta, Raymond, Confidence, Zhang, Dong, Anazodo, Udunna C, Khanal, Bishesh, Shakya, Mahesh
Automating brain tumor segmentation using deep learning methods is an ongoing challenge in medical imaging. Several issues persist, including domain shift and deployment in low-resource settings, which brings a unique set of challenges such as data scarcity. As a step towards solving these specific problems, we propose convolutional adapter-inspired parameter-efficient fine-tuning (PEFT) of the MedNeXt architecture. To validate our idea, we show that our method performs comparably to full fine-tuning, with the added benefit of reduced training compute, using BraTS-2021 as the pre-training dataset and BraTS-Africa as the fine-tuning dataset. BraTS-Africa is a small dataset (60 train / 35 validation) from the Sub-Saharan African population with a marked shift in MRI quality compared to BraTS-2021 (1251 training samples). We first show that models trained on the BraTS-2021 dataset do not generalize well to BraTS-Africa, as evidenced by a 20% reduction in mean Dice on BraTS-Africa validation samples. Then, we show that PEFT can leverage both the BraTS-2021 and BraTS-Africa datasets to obtain a mean Dice of 0.80, compared to 0.72 when trained only on BraTS-Africa. Finally, we show that PEFT (0.80 mean Dice) performs comparably to full fine-tuning (0.77 mean Dice); while this may suggest PEFT is better on average, the boxplots show that full fine-tuning yields much lower variance in performance. Nevertheless, on disaggregating the Dice metrics, we find that the model tends to oversegment, as shown by high specificity (0.99) compared to relatively low sensitivity (0.75). The source code is available at https://github.com/CAMERA-MRI/SPARK2024/tree/main/PEFT_MedNeXt
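For intuition, below is a minimal sketch of what a convolutional adapter for PEFT could look like in PyTorch, assuming a parallel bottleneck adapter attached to frozen backbone blocks; MedNeXt's actual block layout and the paper's adapter placement are not reproduced here.

```python
# Illustrative parallel convolutional adapter (hypothetical design).
# Only adapter weights are trained; the pretrained backbone stays frozen.
import torch
import torch.nn as nn

class ConvAdapter(nn.Module):
    """Bottleneck 3D conv adapter added alongside a frozen block."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.down = nn.Conv3d(channels, hidden, kernel_size=1)
        self.dwconv = nn.Conv3d(hidden, hidden, kernel_size=3,
                                padding=1, groups=hidden)
        self.up = nn.Conv3d(hidden, channels, kernel_size=1)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)  # start as identity: pretrained behavior kept
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return self.up(self.act(self.dwconv(self.down(x))))

class AdaptedBlock(nn.Module):
    def __init__(self, block, channels):
        super().__init__()
        self.block, self.adapter = block, ConvAdapter(channels)

    def forward(self, x):
        return self.block(x) + self.adapter(x)

def freeze_backbone(model):
    # train only parameters whose name marks them as adapter weights
    for name, p in model.named_parameters():
        p.requires_grad = "adapter" in name
```

Zero-initializing the up-projection means fine-tuning starts exactly from the pretrained model, a common trick for stable adapter training.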
MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning
Gan, Ziliang, Lu, Yu, Zhang, Dong, Li, Haohan, Liu, Che, Liu, Jian, Liu, Ji, Wu, Haipang, Fu, Chaoyou, Xu, Zenglin, Zhang, Rongjunchen, Dai, Yong
In recent years, multimodal benchmarks for general domains have guided the rapid development of multimodal models on general tasks. However, the financial field has its peculiarities. It features unique graphical images (e.g., candlestick charts, technical indicator charts) and possesses a wealth of specialized financial knowledge (e.g., futures, turnover rate). Therefore, benchmarks from general fields often fail to measure the performance of multimodal models in the financial domain, and thus cannot effectively guide the rapid development of large financial models. To promote the development of large financial multimodal models, we propose MME-Finance, a bilingual, open-ended, and practical-usage-oriented Visual Question Answering (VQA) benchmark. The characteristics of our benchmark are finance and expertise: it includes charts that reflect users' actual usage needs (e.g., computer screenshots and mobile photography), questions created according to the preferences of financial-domain inquiries, and annotations by experts with more than 10 years of experience in the financial industry. Additionally, we have developed a custom-designed financial evaluation system in which visual information is introduced into the multimodal evaluation process for the first time. Extensive experimental evaluations of 19 mainstream MLLMs are conducted to test their perception, reasoning, and cognition capabilities. The results indicate that models performing well on general benchmarks do not necessarily perform well on MME-Finance; for instance, the top-performing open-source and closed-source models obtain only 65.69 (Qwen2VL-72B) and 63.18 (GPT-4o), respectively. Their performance is particularly poor in categories most relevant to finance, such as candlestick charts and technical indicator charts. In addition, we propose a Chinese version of the benchmark, which enables comparison of MLLMs' performance in a Chinese context.
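To make the "visual information in evaluation" idea concrete, here is a hypothetical scoring loop in Python in which the judge model also receives the chart image; `judge_model`, the prompt, and the 0-5 scale are illustrative stand-ins, not MME-Finance's actual evaluation system.

```python
# Hypothetical multimodal judging loop for a finance VQA benchmark.
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str
    question: str
    reference: str
    prediction: str

JUDGE_PROMPT = (
    "You are grading a financial VQA answer. Using the attached chart, the "
    "question, and the reference answer, score the prediction from 0 to 5.\n"
    "Question: {question}\nReference: {reference}\n"
    "Prediction: {prediction}\nScore:"
)

def evaluate(samples, judge_model):
    """judge_model(image_path, prompt) -> str is a caller-supplied stub."""
    scores = []
    for s in samples:
        prompt = JUDGE_PROMPT.format(question=s.question,
                                     reference=s.reference,
                                     prediction=s.prediction)
        reply = judge_model(s.image_path, prompt)  # judge sees the image too
        digits = [c for c in reply if c.isdigit()]
        scores.append(int(digits[0]) if digits else 0)
    return sum(scores) / max(len(scores), 1)
```

Passing the image to the judge lets grading account for visual details (e.g., values read off a candlestick chart) that a text-only judge would have to take on faith.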
BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments
Wang, Xinghao, Wang, Pengyu, Wang, Bo, Zhang, Dong, Zhou, Yunhua, Qiu, Xipeng
Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from capability to availability, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression ratios and separate compression processes for each setting, complicating deployment in variable memory environments. In this paper, we introduce BitStack, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance. By decomposing weight matrices into residual blocks, BitStack can dynamically adjust the model size with minimal transmission between running memory and storage devices. These blocks are sorted and stacked in storage as basic transmission units, with different quantities loaded based on current memory availability. Extensive experiments across a wide range of tasks demonstrate that, despite offering fine-grained size control, BitStack consistently matches or surpasses strong quantization baselines, particularly at extreme compression ratios. To the best of our knowledge, this is the first decomposition-based method that effectively bridges the gap to practical compression techniques like quantization.
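A hedged sketch of the residual-decomposition idea follows, using a deliberately simple sign-and-scale 1-bit decomposition for illustration; BitStack's actual decomposition (which also accounts for parameter significance) is more involved.

```python
# Toy residual decomposition: each block stores signs (int8) plus one scale,
# and loading more blocks monotonically improves the reconstruction.
import torch

def decompose(weight, num_blocks=8):
    """Split `weight` into sign/scale residual blocks, coarsest first."""
    blocks, residual = [], weight.clone()
    for _ in range(num_blocks):
        sign = torch.sign(residual)
        scale = residual.abs().mean()        # one scalar scale per block
        blocks.append((sign.to(torch.int8), scale))
        residual = residual - sign * scale   # remainder for the next block
    return blocks

def reconstruct(blocks, k):
    """Load only the first k blocks, trading accuracy for memory."""
    return sum(sign.float() * scale for sign, scale in blocks[:k])

W = torch.randn(256, 256)
blocks = decompose(W)
for k in (2, 4, 8):
    err = (W - reconstruct(blocks, k)).norm() / W.norm()
    print(f"{k} blocks loaded -> relative error {err:.3f}")
```

The key property this toy example shares with the paper's description is that blocks form a stack: a memory budget simply determines how many units are loaded, with no re-compression per setting.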
MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time
Zhang, Mozhi, Wang, Pengyu, Tan, Chenkun, Huang, Mianqiu, Zhang, Dong, Zhou, Yaqian, Qiu, Xipeng
Large Language Models (LLMs) acquire extensive knowledge and remarkable abilities from large text corpora, making them powerful tools for various applications. To make LLMs more usable, aligning them with human preferences is essential. Existing alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), typically embed predefined preferences directly within the model's parameters. These methods, however, often result in a static alignment that cannot account for the diversity of human preferences in practical applications. In response to this challenge, we propose an effective method, \textbf{MetaAlign}, which aims to help LLMs dynamically align with various explicit or implicit preferences specified at inference time. Experimental results show that LLMs optimized on our meticulously constructed MetaAlign Dataset can effectively align with any preferences specified at the inference stage, validating the feasibility of MetaAlign. We hope that our work can provide some insights into the alignment of language models.
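As a concrete (and hypothetical) illustration of inference-time preference conditioning, a MetaAlign-style sample might pair an explicit preference with each query; the exact dataset format is an assumption here.

```python
# Illustrative preference-conditioned chat construction (format assumed).
preference = ("Prefer concise answers; avoid speculation; "
              "cite sources when possible.")

def build_messages(preference, user_query):
    return [
        {"role": "system",
         "content": f"Align your response with these preferences: {preference}"},
        {"role": "user", "content": user_query},
    ]

messages = build_messages(
    preference, "Summarize the causes of the 2008 financial crisis.")
# At training time, such preference-conditioned prompts would be paired with
# responses judged to satisfy the stated preference, so that at inference the
# model follows whatever preference the caller supplies.
```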
IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities
Zhang, Xin, Lyu, Xiang, Du, Zhihao, Chen, Qian, Zhang, Dong, Hu, Hangrui, Tan, Chaohong, Zhao, Tianyu, Wang, Yuxuan, Zhang, Bin, Lu, Heng, Zhou, Yaqian, Qiu, Xipeng
Current methods of building LLMs with voice interaction capabilities rely heavily on explicit text autoregressive generation before or during speech response generation to maintain content quality, which unfortunately brings computational overhead and increases latency in multi-turn interactions. To address this, we introduce IntrinsicVoice, an LLM designed with intrinsic real-time voice interaction capabilities. IntrinsicVoice aims to facilitate the transfer of the textual capabilities of pre-trained LLMs to the speech modality by mitigating the modality gap between text and speech. Our novel architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences while generating high-quality audio, significantly reducing the length difference between speech and text, speeding up inference, and alleviating long-text modeling issues. Additionally, we construct a multi-turn speech-to-speech dialogue dataset named IntrinsicVoice-500k, which includes nearly 500k turns of speech-to-speech dialogues, and introduce a cross-modality training strategy to enhance the semantic alignment between speech and text. Experimental results demonstrate that IntrinsicVoice can generate high-quality speech responses with latency lower than 100ms in multi-turn dialogue scenarios. Demos are available at https://instrinsicvoice.github.io/.
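The sequence-length reduction can be sketched as follows; this is a hedged toy in PyTorch where fixed-size groups of discrete speech tokens are merged into single LLM-sized embeddings, while GroupFormer's real architecture (and its group decoder) is more elaborate.

```python
# Hedged sketch: merge every `group_size` speech tokens into one embedding so
# the LLM models a sequence ~group_size times shorter than the raw token stream.
import torch
import torch.nn as nn

class GroupEmbed(nn.Module):
    def __init__(self, vocab_size, dim, group_size=5, llm_dim=4096):
        super().__init__()
        self.group_size = group_size
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim * group_size, llm_dim)

    def forward(self, tokens):                 # tokens: (B, T), T % group_size == 0
        B, T = tokens.shape
        g = self.group_size
        x = self.embed(tokens).reshape(B, T // g, g * self.embed.embedding_dim)
        return self.proj(x)                    # (B, T // g, llm_dim)

speech_tokens = torch.randint(0, 1024, (2, 100))   # e.g. 100 codec tokens
grouped = GroupEmbed(1024, 256)(speech_tokens)
print(grouped.shape)                               # torch.Size([2, 20, 4096])
```

Shortening the modeled sequence is what narrows the text/speech length mismatch the abstract describes, and it directly cuts per-turn decoding steps, hence latency.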
Deep-MPC: A DAGGER-Driven Imitation Learning Strategy for Optimal Constrained Battery Charging
Espin, Jorge, Zhang, Dong, Toti, Daniele, Pozzi, Andrea
...on batteries, particularly within the realm of sustainable mobility driven by electric vehicles (EVs) [1]. This transition underscores the vital role of batteries in promoting eco-friendly transportation. However, it also highlights the pressing need to enhance battery efficiency, long-lasting battery performance, and safety, particularly during the charging phase. To address these challenges, advanced battery management systems, often employing Model Predictive Control (MPC), have gained prominence [2], [3]. ... This challenge becomes apparent when the model deviates from the expert's path and continues to make errors that lead it into unfamiliar states, thus exacerbating the initial mistake [11]. Dataset Aggregation (DAGGER) was introduced by [12] as a method to address the challenge of distributional shift. This iterative algorithm aims to minimize the compounding of errors resulting from the shift by iteratively integrating the decisions made by both the learning model and an expert policy.
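The DAGGER loop described above is standard; here is a minimal Python sketch under stub interfaces (`env`, `expert_policy`, `fit`, `predict` stand in for the paper's charging environment, MPC expert, and learned network).

```python
# Minimal DAGGER: the learner drives rollouts, the expert labels every visited
# state, and the policy is retrained on the aggregated dataset each iteration.
import numpy as np

def dagger(env, expert_policy, fit, predict, iters=10, horizon=200):
    states, actions = [], []              # aggregated dataset
    policy = expert_policy                # first rollout follows the expert
    for _ in range(iters):
        s = env.reset()
        for _ in range(horizon):
            states.append(s)
            actions.append(expert_policy(s))  # expert labels the visited state
            a = policy(s)                     # but the learner picks the action
            s, done = env.step(a)             # stub: step returns (state, done)
            if done:
                break
        model = fit(np.array(states), np.array(actions))  # retrain on all data
        policy = lambda s, m=model: predict(m, s)
    return policy
```

Because labels are collected on the states the learner actually visits, errors no longer compound the way they do with plain behavioral cloning, which is exactly the distributional-shift fix the excerpt attributes to [12].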
On the Temperature of Machine Learning Systems
Zhang, Dong
We develop a thermodynamic theory for machine learning (ML) systems. Similar to physical thermodynamic systems, which are characterized by energy and entropy, ML systems possess these characteristics as well. This comparison inspires us to integrate the concept of temperature into ML systems, grounded in the fundamental principles of thermodynamics, and to establish a basic thermodynamic framework for machine learning systems with non-Boltzmann distributions. We introduce the concept of states within an ML system, identify two typical types of states, and interpret model training and refresh as a process of state phase transition. We take the initial potential energy of an ML system to be described by the model's loss function, with the energy adhering to the principle of minimum potential energy. For a variety of energy forms and parameter initialization methods, we derive the temperature of systems during the phase transition, both analytically and asymptotically, highlighting temperature as a vital indicator of system data distribution and ML training complexity. Moreover, we view deep neural networks as complex heat engines with both a global temperature and local temperatures in each layer. We introduce the concept of work efficiency for neural networks, which depends mainly on their activation functions. We then classify neural networks based on their work efficiency, describing them as two types of heat engines.
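For orientation, the classical identities this framework builds on can be written as below; these are textbook thermodynamic relations stated under the abstract's premise that the loss plays the role of energy, not the paper's non-Boltzmann derivations.

```latex
% Illustrative classical identities only (the paper's generalization differs).
\[
  T = \frac{\partial E}{\partial S}, \qquad
  E \equiv \mathcal{L}(\theta) \quad \text{(loss as potential energy)},
\]
\[
  \eta = \frac{W}{Q_{\mathrm{in}}} \;\le\; 1 - \frac{T_{\mathrm{cold}}}{T_{\mathrm{hot}}}
  \quad \text{(work efficiency and its Carnot bound for a heat engine).}
\]
```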
Location-guided Head Pose Estimation for Fisheye Image
Li, Bing, Zhang, Dong, Huang, Cheng, Xian, Yun, Li, Ming, Lee, Dah-Jye
A camera with a fisheye or ultra-wide lens covers a wide field of view that cannot be modeled by the perspective projection. Severe fisheye lens distortion in the peripheral region of the image leads to degraded performance of existing head pose estimation models trained on undistorted images; this distortion nature of the fisheye camera makes estimating head pose from fisheye images a very challenging task. This paper presents a new approach for head pose estimation that uses the knowledge of head location in the image to reduce the negative effect of fisheye distortion. We develop an end-to-end convolutional neural network to estimate the head pose with multi-task learning of head pose and head location. ... The goal of head pose estimation (HPE) in the context of computer vision is to estimate the orientation of the head with respect to the camera coordinate system. The estimated head pose is usually expressed by Euler angles (pitch, yaw, roll) [1]. Head pose estimation from rectilinear images has already achieved promising results. An intuitive idea for estimating the head pose from a fisheye image is the so-called two-stage...
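A minimal sketch of the multi-task setup follows, assuming a shared backbone with a pose head and a location head; the backbone choice and loss weighting are placeholders, not the paper's exact network.

```python
# Hedged sketch: joint head-pose (Euler angles) and head-location learning.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FisheyeHPE(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # drop the classifier; keep conv stages + global pool -> (B, 512, 1, 1)
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.pose_head = nn.Linear(512, 3)   # pitch, yaw, roll
        self.loc_head = nn.Linear(512, 2)    # normalized head (x, y) in the image

    def forward(self, img):
        f = self.features(img).flatten(1)
        return self.pose_head(f), self.loc_head(f)

def loss_fn(pose_pred, loc_pred, pose_gt, loc_gt, loc_weight=0.5):
    mse = nn.functional.mse_loss
    # the location branch lets the network condition pose on where in the
    # fisheye image the head appears, i.e., on the local distortion
    return mse(pose_pred, pose_gt) + loc_weight * mse(loc_pred, loc_gt)
```

The design intuition, per the abstract, is that distortion varies with image position, so supervising head location gives the shared features a signal about which distortion regime applies.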