Lee, Chankyu
MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs
Lin, Sheng-Chieh, Lee, Chankyu, Shoeybi, Mohammad, Lin, Jimmy, Catanzaro, Bryan, Ping, Wei
State-of-the-art retrieval models typically address a straightforward search scenario, in which retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single modality is supported for both queries and retrieved results. This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs), enabling a broader search scenario, termed universal multimodal retrieval, in which multiple modalities and diverse retrieval tasks are accommodated. To this end, we first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. Our empirical results show that the fine-tuned MLLM retriever is capable of understanding challenging queries composed of both text and image, but underperforms a smaller CLIP retriever on cross-modal retrieval tasks due to the modality bias of MLLMs. To address this issue, we propose modality-aware hard negative mining to mitigate the modality bias exhibited by MLLM retrievers. Second, we propose continually fine-tuning the universal multimodal retriever to enhance its text retrieval capability while maintaining its multimodal retrieval capability. As a result, our model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR, which spans multiple domains and tasks, while also surpassing the state-of-the-art text retrieval model, NV-Embed-v1, on the MTEB retrieval benchmark. Finally, we explore prompting off-the-shelf MLLMs as zero-shot rerankers to refine the ranking of candidates returned by the multimodal retriever. We find that, through prompting and reranking, MLLMs can further improve multimodal retrieval when user queries (e.g., text-image composed queries) are more complex and challenging to understand. These findings also pave the way for advancing universal multimodal retrieval in the future.
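The following is a minimal, hypothetical sketch (not the authors' released code) of how modality-aware hard negative mining for a bi-encoder retriever could look. It assumes hard negatives are drawn from the highest-scoring candidates whose modality differs from the task's target modality, and that they are then used alongside standard in-batch positives in an InfoNCE-style contrastive loss; the exact mining rule in the paper may differ.

import torch
import torch.nn.functional as F

def mine_modality_aware_negatives(q_emb, cand_emb, cand_modality,
                                  target_modality, k=4):
    """Return indices of the top-k scoring candidates in the *wrong* modality.

    Assumption: cand_modality is a list of strings (e.g., "text", "image")
    and target_modality is the modality the task is supposed to retrieve.
    """
    scores = q_emb @ cand_emb.T                       # (num_queries, num_cands)
    wrong = torch.tensor([m != target_modality for m in cand_modality])
    scores = scores.masked_fill(~wrong, float("-inf"))  # keep only wrong-modality candidates
    return scores.topk(k, dim=-1).indices             # (num_queries, k)

def contrastive_loss(q_emb, pos_emb, neg_emb, temperature=0.05):
    """InfoNCE with explicit hard negatives; all embeddings are L2-normalized."""
    q, p = F.normalize(q_emb, dim=-1), F.normalize(pos_emb, dim=-1)
    n = F.normalize(neg_emb, dim=-1)                   # (B, k, d) mined negatives
    pos = (q * p).sum(-1, keepdim=True)                # (B, 1)
    neg = torch.einsum("bd,bkd->bk", q, n)             # (B, k)
    logits = torch.cat([pos, neg], dim=-1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)  # positive is always index 0
    return F.cross_entropy(logits, labels)

The intent of conditioning the mining step on candidate modality is to explicitly penalize candidates that score highly despite being in the wrong modality, which is one plausible reading of the modality bias the abstract describes.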
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
Lee, Chankyu, Roy, Rajarshi, Xu, Mengyao, Raiman, Jonathan, Shoeybi, Mohammad, Catanzaro, Bryan, Ping, Wei
Decoder-only large language model (LLM)-based embedding models are beginning to outperform BERT- or T5-based embedding models on general-purpose text embedding tasks, including dense vector-based retrieval. In this work, we introduce the NV-Embed model, with a variety of architectural designs and training procedures that significantly enhance the performance of the LLM as a versatile embedding model, while maintaining its simplicity and reproducibility. For the model architecture, we propose a latent attention layer to obtain pooled embeddings, which consistently improves retrieval and downstream task accuracy compared to mean pooling or using the last <EOS> token embedding from the LLM.
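As a rough illustration of the latent attention idea described above, the sketch below is an assumption-laden reconstruction, not NV-Embed's released implementation: the LLM's last-layer token states attend to a trainable latent array via single-head cross-attention, the result goes through an MLP, and the outputs are mean-pooled over non-padding tokens.

import torch
import torch.nn as nn

class LatentAttentionPooling(nn.Module):
    def __init__(self, hidden_dim, num_latents=512):
        super().__init__()
        # Trainable latent array acting as the attention "dictionary".
        self.latents = nn.Parameter(torch.randn(num_latents, hidden_dim))
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )

    def forward(self, token_states, attention_mask):
        # token_states: (B, L, d) last-layer hidden states from the LLM
        scale = token_states.size(-1) ** 0.5
        attn = torch.softmax(token_states @ self.latents.T / scale, dim=-1)
        mixed = self.mlp(attn @ self.latents)          # (B, L, d)
        mask = attention_mask.unsqueeze(-1).float()    # ignore padding tokens
        return (mixed * mask).sum(1) / mask.sum(1)     # (B, d) pooled embedding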
ChatQA: Building GPT-4 Level Conversational QA Models
Liu, Zihan, Ping, Wei, Roy, Rajarshi, Xu, Peng, Lee, Chankyu, Shoeybi, Mohammad, Catanzaro, Bryan
In this work, we introduce ChatQA, a family of conversational question answering (QA) models that obtain GPT-4 level accuracies. Specifically, we propose a two-stage instruction tuning method that can significantly improve the zero-shot conversational QA results of large language models (LLMs). To handle retrieval-augmented generation in conversational QA, we fine-tune a dense retriever on a multi-turn QA dataset, which yields results comparable to using the state-of-the-art query rewriting model while greatly reducing deployment cost. Notably, our ChatQA-70B can outperform GPT-4 in terms of average score on 10 conversational QA datasets (54.14 vs. 53.90), without relying on any synthetic data from OpenAI GPT models.
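A small illustrative sketch of the retrieval step described above, with hypothetical encode_query / encode_passages helpers standing in for any fine-tuned bi-encoder: the multi-turn dialogue history is fed directly to the dense retriever as the query, so no separate query-rewriting model is needed at inference time.

import numpy as np

def retrieve_for_dialogue(turns, passages, encode_query, encode_passages, top_k=5):
    """turns: list of utterance strings; passages: list of candidate contexts."""
    query = " ".join(turns)                  # multi-turn history used as the query
    q = encode_query(query)                  # (d,) query embedding
    p = encode_passages(passages)            # (N, d) passage embeddings
    scores = p @ q                           # dot-product relevance scores
    best = np.argsort(-scores)[:top_k]
    return [passages[i] for i in best]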
Towards Understanding the Effect of Leak in Spiking Neural Networks
Chowdhury, Sayeed Shafayet, Lee, Chankyu, Roy, Kaushik
Over the past few years, advances in deep artificial neural networks (ANNs) have led to remarkable success in various cognitive tasks (e.g., vision, language, and behavior). In some cases, neural networks have outperformed conventional algorithms and achieved human-level performance [1, 2]. However, recent ANNs are becoming extremely compute-intensive and often do not generalize well to data unseen during training. On the other hand, the human brain can reliably learn and compute intricate cognitive tasks within a power budget of only a few watts. Recently, Spiking Neural Networks (SNNs) have been explored toward realizing robust and energy-efficient machine intelligence, guided by cues from neuroscience experiments [3]. SNNs are categorized as a new generation of neural networks [4] based on their neuronal functionalities. A variety of spiking neuron models closely resemble biological neuronal mechanisms, transmitting information through discrete spatiotemporal events (spikes). These spiking neuron models are characterized by an internal state called the membrane potential. A spiking neuron integrates its inputs over time and fires an output spike whenever the membrane potential exceeds a threshold.
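For concreteness, the snippet below sketches the standard discrete-time leaky integrate-and-fire (LIF) dynamics this description refers to (a textbook formulation, not code from the paper); the leak factor is the multiplicative decay applied to the membrane potential at each timestep, and it is this quantity whose effect the work examines.

import numpy as np

def lif_neuron(input_spikes, weights, leak=0.9, threshold=1.0):
    """input_spikes: (T, N) binary spike trains; weights: (N,) synaptic weights."""
    v = 0.0
    output = np.zeros(input_spikes.shape[0])
    for t, x in enumerate(input_spikes):
        v = leak * v + weights @ x        # leaky integration of weighted inputs
        if v >= threshold:                # fire when the threshold is crossed
            output[t] = 1.0
            v = 0.0                       # hard reset after a spike
    return output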