Karmaker, Santu
Revisiting Word Embeddings in the LLM Era
Mahajan, Yash, Freestone, Matthew, Aakur, Sathyanarayanan, Karmaker, Santu
Large Language Models (LLMs) have recently shown remarkable advancement in various NLP tasks. As such, a popular trend has emerged lately where NLP researchers extract word/sentence/document embeddings from these large decoder-only models and use them for various inference tasks with promising results. However, it is still unclear whether the performance improvement of LLM-induced embeddings is merely because of scale or whether underlying embeddings they produce significantly differ from classical encoding models like Word2Vec, GloVe, Sentence-BERT (SBERT) or Universal Sentence Encoder (USE). This is the central question we investigate in the paper by systematically comparing classical decontextualized and contextualized word embeddings with the same for LLM-induced embeddings. Our results show that LLMs cluster semantically related words more tightly and perform better on analogy tasks in decontextualized settings. However, in contextualized settings, classical models like SimCSE often outperform LLMs in sentence-level similarity assessment tasks, highlighting their continued relevance for fine-grained semantics.
Set-Theoretic Compositionality of Sentence Embeddings
Bansal, Naman, mahajan, Yash, Sinha, Sanjeev, Karmaker, Santu
Sentence encoders play a pivotal role in various NLP tasks; hence, an accurate evaluation of their compositional properties is paramount. However, existing evaluation methods predominantly focus on goal task-specific performance. This leaves a significant gap in understanding how well sentence embeddings demonstrate fundamental compositional properties in a task-independent context. Leveraging classical set theory, we address this gap by proposing six criteria based on three core "set-like" compositions/operations: \textit{TextOverlap}, \textit{TextDifference}, and \textit{TextUnion}. We systematically evaluate $7$ classical and $9$ Large Language Model (LLM)-based sentence encoders to assess their alignment with these criteria. Our findings show that SBERT consistently demonstrates set-like compositional properties, surpassing even the latest LLMs. Additionally, we introduce a new dataset of ~$192$K samples designed to facilitate future benchmarking efforts on set-like compositionality of sentence embeddings.
BongLLaMA: LLaMA for Bangla Language
Zehady, Abdullah Khan, Mamun, Safi Al, Islam, Naymul, Karmaker, Santu
Bangla (or "Bengali") is a language spoken by approximately 240 million native speakers and around 300 million people worldwide. Despite being the 5th largest spoken language in the world, Bangla is still a "low-resource" language, and existing pretrained language models often struggle to perform well on Bangla Language Processing (BLP) tasks. This work addresses this gap by introducing BongLLaMA (i.e., Bangla-LLaMA), an open-source large language model fine-tuned exclusively on large Bangla corpora and instruction-tuning datasets. We present our methodology, data augmentation techniques, fine-tuning details, and comprehensive benchmarking results showcasing the utility of BongLLaMA on BLP tasks. We believe BongLLaMA will serve as the new standard baseline for Bangla Language Models and, thus, facilitate future benchmarking studies focused on this widely-spoken yet "low-resource" language. All BongLLaMA models are available for public use at https://huggingface.co/BanglaLLM.