Wang, Chenyu
CellFlow: Simulating Cellular Morphology Changes via Flow Matching
Zhang, Yuhui, Su, Yuchang, Wang, Chenyu, Li, Tianhong, Wefers, Zoe, Nirschl, Jeffrey, Burgess, James, Ding, Daisy, Lozano, Alejandro, Lundberg, Emma, Yeung-Levy, Serena
Building a virtual cell capable of accurately simulating cellular behaviors in silico has long been a dream in computational biology. We introduce CellFlow, an image-generative model that simulates cellular morphology changes induced by chemical and genetic perturbations using flow matching. Unlike prior methods, CellFlow models distribution-wise transformations from unperturbed to perturbed cell states, effectively distinguishing actual perturbation effects from experimental artifacts such as batch effects -- a major challenge in biological data. Evaluated on chemical (BBBC021), genetic (RxRx1), and combined perturbation (JUMP) datasets, CellFlow generates biologically meaningful cell images that faithfully capture perturbation-specific morphological changes, achieving a 35% improvement in FID scores and a 12% increase in mode-of-action prediction accuracy over existing methods. Additionally, CellFlow enables continuous interpolation between cellular states, providing a potential tool for studying perturbation dynamics. These capabilities mark a significant step toward realizing virtual cell modeling for biomedical research.
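To make the flow-matching idea in this abstract concrete, here is a minimal sketch of the generic recipe: pair a source sample x0 (unperturbed) with a target sample x1 (perturbed), regress a velocity field onto x1 - x0 along the straight-line path between them, and generate by integrating that field. The linear path, the function names, and the oracle velocity used below are assumptions for illustration; the abstract does not specify CellFlow's network, conditioning, or path choice.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 (t=0) to x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity field on the linear path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_integrate(x0, velocity_fn, n_steps=10):
    """Transport x0 toward the target by Euler-integrating a velocity field."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical usage: an "oracle" field that already knows the pair (x0, x1).
# A trained model would replace this with a learned network.
x0, x1 = [0.0, 1.0], [2.0, -1.0]
oracle = lambda x, t: target_velocity(x0, x1)
x_end = euler_integrate(x0, oracle)          # ends (numerically) at x1
x_mid = interpolate(x0, x1, 0.5)             # an intermediate state
```

Stopping the integration early, or evaluating the path at an intermediate t, is one simple way to read the "continuous interpolation between cellular states" mentioned in the abstract.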
Reward-Guided Controlled Generation for Inference-Time Alignment in Diffusion Models: Tutorial and Review
Uehara, Masatoshi, Zhao, Yulai, Wang, Chenyu, Li, Xiner, Regev, Aviv, Levine, Sergey, Biancalani, Tommaso
This tutorial provides an in-depth guide on inference-time guidance and alignment methods for optimizing downstream reward functions in diffusion models. While diffusion models are renowned for their generative modeling capabilities, practical applications in fields such as biology often require sample generation that maximizes specific metrics (e.g., stability, affinity in proteins, closeness to target structures). In these scenarios, diffusion models can be adapted not only to generate realistic samples but also to explicitly maximize desired measures at inference time without fine-tuning. This tutorial explores the foundational aspects of such inference-time algorithms. We review these methods from a unified perspective, demonstrating that current techniques -- such as Sequential Monte Carlo (SMC)-based guidance, value-based sampling, and classifier guidance -- aim to approximate soft optimal denoising processes (a.k.a. policies in RL) that combine pre-trained denoising processes with value functions serving as look-ahead functions that predict from intermediate states to terminal rewards. Within this framework, we present several novel algorithms not yet covered in the literature. Furthermore, we discuss (1) fine-tuning methods combined with inference-time techniques, (2) inference-time algorithms based on search algorithms such as Monte Carlo tree search, which have received limited attention in current research, and (3) connections between inference-time algorithms in language models and diffusion models. The code of this tutorial on protein design is available at https://github.com/masa-ue/AlignInversePro
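As a rough illustration of the SMC-style guidance this abstract describes, the sketch below performs one resampling step: particles at an intermediate denoising state are reweighted by exp(value / alpha) and resampled, which favors states whose look-ahead value predicts a high terminal reward. The temperature alpha, the particle representation, and the value estimates are hypothetical placeholders, not the tutorial's actual algorithms or code.

```python
import math
import random


def soft_weights(values, alpha):
    """Normalized weights proportional to exp(value / alpha)."""
    m = max(values)  # subtract the max for numerical stability
    w = [math.exp((v - m) / alpha) for v in values]
    total = sum(w)
    return [wi / total for wi in w]


def resample(particles, values, alpha, rng):
    """Draw a new population, favoring particles with high value estimates."""
    weights = soft_weights(values, alpha)
    return rng.choices(particles, weights=weights, k=len(particles))


# Hypothetical usage: three intermediate states with estimated values.
rng = random.Random(0)
population = resample(["s0", "s1", "s2"], [0.0, 1.0, 2.0], alpha=0.5, rng=rng)
```

A small alpha concentrates the population on the highest-value particles, while a large alpha leaves it close to the pre-trained sampler's own distribution.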
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
Zhang, Yuhui, Su, Yuchang, Liu, Yiming, Wang, Xiaohan, Burgess, James, Sui, Elaine, Wang, Chenyu, Aklilu, Josiah, Lozano, Alejandro, Wei, Anjiang, Schmidt, Ludwig, Yeung-Levy, Serena
The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often depend on open-ended questions, making accurate evaluation difficult due to the variability in natural language responses. To address this, we introduce AutoConverter, an agentic framework that automatically converts these open-ended questions into multiple-choice format, enabling objective evaluation while reducing the costly question creation process. Our experiments demonstrate that AutoConverter can generate correct and challenging multiple-choice questions, with VLMs demonstrating consistently similar or lower accuracy on these questions compared to human-created ones. Using AutoConverter, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format, totaling 9,018 questions. We comprehensively evaluate 33 state-of-the-art VLMs on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.
Predicting Human Brain States with Transformer
Sun, Yifei, Cabezas, Mariano, Lee, Jiah, Wang, Chenyu, Zhang, Wei, Calamante, Fernando, Lv, Jinglei
The human brain is a complex and highly dynamic system, and our current knowledge of its functional mechanism is still very limited. Fortunately, with functional magnetic resonance imaging (fMRI), we can observe blood oxygen level-dependent (BOLD) changes, reflecting neural activity, to infer brain states and dynamics. In this paper, we ask the question of whether the brain states represented by the regional brain fMRI can be predicted. Due to the success of self-attention and the transformer architecture in sequential auto-regression problems (e.g., language modelling or music generation), we explore the possibility of the use of transformers to predict human brain resting states based on the large-scale high-quality fMRI data from the human connectome project (HCP). Current results have shown that our model can accurately predict the brain states up to 5.04s with the previous 21.6s. Furthermore, even though the prediction error accumulates for the prediction of a longer time period, the generated fMRI brain states reflect the architecture of functional connectome. These promising initial results demonstrate the possibility of developing generative models for fMRI data using self-attention that learns the functional organization of the human brain. Our code is available at: https://github.com/syf0122/brain_state_pred.
Semantic Consistency-Based Uncertainty Quantification for Factuality in Radiology Report Generation
Wang, Chenyu, Zhou, Weichao, Ghosh, Shantanu, Batmanghelich, Kayhan, Li, Wenchao
Radiology report generation (RRG) has shown great potential in assisting radiologists by automating the labor-intensive task of report writing. While recent advancements have improved the quality and coherence of generated reports, ensuring their factual correctness remains a critical challenge. Although generative medical Vision Large Language Models (VLLMs) have been proposed to address this issue, these models are prone to hallucinations and can produce inaccurate diagnostic information. To address these concerns, we introduce a novel Semantic Consistency-Based Uncertainty Quantification framework that provides both report-level and sentence-level uncertainties. Unlike existing approaches, our method does not require modifications to the underlying model or access to its inner state, such as output token logits, thus serving as a plug-and-play module that can be seamlessly integrated with state-of-the-art models. Extensive experiments demonstrate the efficacy of our method in detecting hallucinations and enhancing the factual accuracy of automatically generated radiology reports. By abstaining from high-uncertainty reports, our approach improves factuality scores by 10%, achieved by rejecting 20% of reports using the Radialog model on the MIMIC-CXR dataset. Furthermore, sentence-level uncertainty flags the lowest-precision sentence in each report with an 82.9% success rate.
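The abstain-on-uncertainty idea in this abstract can be sketched as follows: sample several reports for the same input, score how much they agree with each other, treat low agreement as high uncertainty, and reject the most uncertain fraction. The token-level Jaccard similarity here is only a stand-in chosen to keep the example self-contained; the paper's actual semantic similarity measure is not given in the abstract, and all function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set overlap between two texts (a crude proxy for semantic similarity)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def consistency_uncertainty(samples, sim=jaccard):
    """1 minus the mean pairwise similarity among sampled reports."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(sim(a, b) for a, b in pairs) / len(pairs)


def abstain(uncertainties, reject_fraction):
    """Indices of reports kept after rejecting the most uncertain fraction."""
    n_reject = int(len(uncertainties) * reject_fraction)
    order = sorted(range(len(uncertainties)),
                   key=lambda i: uncertainties[i], reverse=True)
    rejected = set(order[:n_reject])
    return [i for i in range(len(uncertainties)) if i not in rejected]
```

Because the score only needs sampled outputs, this kind of procedure requires no access to token logits, which matches the plug-and-play property the abstract emphasizes.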
An Information Criterion for Controlled Disentanglement of Multimodal Data
Wang, Chenyu, Gupta, Sharut, Zhang, Xinyi, Tonekaboni, Sana, Jegelka, Stefanie, Jaakkola, Tommi, Uhler, Caroline
Multimodal representation learning seeks to relate and decompose information inherent in multiple modalities. By disentangling modality-specific information from information that is shared across modalities, we can improve interpretability and robustness and enable downstream tasks such as the generation of counterfactual outcomes. Separating the two types of information is challenging since they are often deeply entangled in many real-world applications. We present a comprehensive analysis of the optimality of each disentangled representation, particularly focusing on the scenario not covered in prior work where the so-called Minimum Necessary Information (MNI) point is not attainable. SSL successfully learns shared and modality-specific features on multiple synthetic and real-world datasets and consistently outperforms baselines on various downstream tasks, including prediction tasks for vision-language data, as well as molecule-phenotype retrieval tasks for biological data. Humans understand and interact with the world using multiple senses, each providing unique and complementary information essential for forming a comprehensive mental representation of the environment. Large multimodal representation learning models such as CLIP (Radford et al., 2021), trained through self-supervised learning, maximally capture the mutual information shared across multiple modalities, exploiting the assumption of multi-view redundancy (Tosh et al., 2021; Sridharan & Kakade, 2008). This property indicates that shared information between modalities is exactly what is relevant for downstream tasks. However, the modality gap, rooted in the inherent differences in representational nature and information content across modalities (Liang et al., 2022b; Ramasinghe et al., 2024; Huh et al., 2024), leads to the misalignment between modalities and restricts the application of these methods in many real-world multimodal scenarios.
Fine-Tuning Discrete Diffusion Models via Reward Optimization with Applications to DNA and Protein Design
Wang, Chenyu, Uehara, Masatoshi, He, Yichun, Wang, Amy, Biancalani, Tommaso, Lal, Avantika, Jaakkola, Tommi, Levine, Sergey, Wang, Hanchen, Regev, Aviv
Recent studies have demonstrated the strong empirical performance of diffusion models on discrete sequences (i.e., discrete diffusion models) across domains from natural language to biological sequence generation. For example, in the protein inverse folding task, where the goal is to generate a protein sequence from a given backbone structure, conditional diffusion models have achieved impressive results in generating natural-like sequences that fold back into the original structure. However, practical design tasks often require not only modeling a conditional distribution but also optimizing specific task objectives. For instance, in the inverse folding task, we may prefer protein sequences with high stability. To address this, we consider the scenario where we have pre-trained discrete diffusion models that can generate natural-like sequences, as well as reward models that map sequences to task objectives. We then formulate the reward maximization problem within discrete diffusion models, analogous to reinforcement learning (RL), while minimizing the KL divergence against pretrained diffusion models to preserve naturalness. To solve this RL problem, we propose a novel algorithm, DRAKES, that enables direct backpropagation of rewards through entire trajectories generated by diffusion models, by making the originally nondifferentiable trajectories differentiable using the Gumbel-Softmax trick. Our theoretical analysis indicates that our approach can generate sequences that are both natural-like (i.e., have a high probability under a pretrained model) and yield high rewards. While similar tasks have been recently explored in diffusion models for continuous domains, our work addresses unique algorithmic and theoretical challenges specific to discrete diffusion models, which arise from their foundation in continuous-time Markov chains rather than Brownian motion. 
Finally, we demonstrate the effectiveness of our algorithm in generating DNA and protein sequences that optimize enhancer activity and protein stability, respectively, important tasks for gene therapies and protein-based therapeutics. Diffusion models have gained widespread recognition as effective generative models in continuous spaces, such as image and video generation (Song et al., 2020; Ho et al., 2022). Inspired by seminal works (e.g., Austin et al. (2021); Campbell et al. (2022); Sun et al. (2022)), recent studies (Lou et al., 2023; Shi et al., 2024; Sahoo et al., 2024) have shown that diffusion models are also highly effective in discrete spaces, including natural language and biological sequence generation (DNA, RNA, proteins). Work mainly done during an internship at Genentech.
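For readers unfamiliar with the Gumbel-Softmax trick named in this abstract, the sketch below shows the relaxation itself: add Gumbel noise to the logits and apply a temperature-scaled softmax, which yields a continuous approximation to a one-hot categorical sample. This pure-Python version only illustrates the sampling arithmetic; in practice an autodiff framework would be used so that gradients can flow through the relaxed sample, and nothing here reproduces the DRAKES algorithm. The function names and the temperature tau are illustrative assumptions.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample via inverse transform."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep the logs finite
    return -math.log(-math.log(u))


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    y = [(l + sample_gumbel(rng)) / tau for l in logits]
    m = max(y)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in y]
    total = sum(e)
    return [v / total for v in e]


# Hypothetical usage: a relaxed sample over a 4-letter DNA alphabet.
probs = gumbel_softmax([0.1, 0.5, 0.2, 0.2], tau=0.5, rng=random.Random(0))
```

Lower temperatures push the output toward a hard one-hot vector, while higher temperatures give smoother vectors.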
Evaluating Zero-Shot Long-Context LLM Compression
Wang, Chenyu, Wang, Yihan
This study evaluates the effectiveness of zero-shot compression techniques for large language models (LLMs) in long-context settings. We find that computational errors tend to increase in long-context settings when certain compression methods are used. We propose a hypothesis to explain the varied behavior of different LLM compression techniques and explore remedies to mitigate the performance decline that some techniques show on long contexts. This is a course report for COS 598D Machine Learning and Systems by Prof. Kai Li at Princeton University. Due to limited computational resources, our experiments were conducted only on LLaMA-2-7B-32K.
In-Context Symmetries: Self-Supervised Learning through Contextual World Models
Gupta, Sharut, Wang, Chenyu, Wang, Yifei, Jaakkola, Tommi, Jegelka, Stefanie
At the core of self-supervised learning for vision is the idea of learning invariant or equivariant representations with respect to a set of data transformations. This approach, however, introduces strong inductive biases, which can render the representations fragile in downstream tasks that do not conform to these symmetries. In this work, drawing insights from world models, we propose to instead learn a general representation that can adapt to be invariant or equivariant to different transformations by paying attention to context -- a memory module that tracks task-specific states, actions, and future states. Here, the action is the transformation, while the current and future states respectively represent the input's representation before and after the transformation. Our proposed algorithm, Contextual Self-Supervised Learning (ContextSSL), learns equivariance to all transformations (as opposed to invariance). In this way, the model can learn to encode all relevant features as general representations while having the versatility to tailor itself to task-wise symmetries when given a few examples as the context. Empirically, we demonstrate significant performance gains over existing methods on equivariance-related tasks, supported by both qualitative and quantitative evaluations.
EPIM: Efficient Processing-In-Memory Accelerators based on Epitome
Wang, Chenyu, Dong, Zhen, Zhou, Daquan, Zhu, Zhenhua, Wang, Yu, Feng, Jiashi, Keutzer, Kurt
The utilization of large-scale neural networks on Processing-In-Memory (PIM) accelerators encounters challenges due to constrained on-chip memory capacity. To tackle this issue, current works explore model compression algorithms to reduce the size of Convolutional Neural Networks (CNNs). Most of these algorithms either aim to represent neural operators with reduced-size parameters (e.g., quantization) or search for the best combinations of neural operators (e.g., neural architecture search). Designing neural operators to align with PIM accelerators' specifications is an area that warrants further study. In this paper, we introduce the Epitome, a lightweight neural operator offering convolution-like functionality, to craft memory-efficient CNN operators for PIM accelerators (EPIM). On the software side, we evaluate epitomes' latency and energy on PIM accelerators and introduce a PIM-aware layer-wise design method to enhance their hardware efficiency. We apply epitome-aware quantization to further reduce the size of epitomes. On the hardware side, we modify the datapath of current PIM accelerators to accommodate epitomes and implement a feature map reuse technique to reduce computation cost. Experimental results reveal that our 3-bit quantized EPIM-ResNet50 attains 71.59% top-1 accuracy on ImageNet, reducing crossbar areas by 30.65 times. EPIM surpasses the state-of-the-art pruning methods on PIM.
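To give a sense of what 3-bit quantization means for the weights mentioned in this abstract, here is a sketch of plain uniform (min-max) quantization, where each weight is mapped to one of 2^3 = 8 integer codes. This is a generic technique swapped in for illustration; the abstract does not describe how the paper's epitome-aware quantization works, and the function name is hypothetical.

```python
def quantize_uniform(weights, n_bits=3):
    """Map weights to integer codes in [0, 2**n_bits - 1] and back to floats."""
    levels = 2 ** n_bits - 1
    lo, hi = min(weights), max(weights)
    if hi == lo:
        # Degenerate case: all weights are equal, so one code suffices.
        return [0] * len(weights), list(weights)
    scale = (hi - lo) / levels
    codes = [round((w - lo) / scale) for w in weights]
    dequantized = [lo + c * scale for c in codes]
    return codes, dequantized


# Hypothetical usage: four weights stored with 3 bits each.
codes, approx = quantize_uniform([-1.0, 0.0, 0.8, 2.5], n_bits=3)
```

Each dequantized value differs from the original by at most half a quantization step, which is the accuracy cost traded for the smaller memory footprint.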