 sbert model


Towards Building Efficient Sentence BERT Models using Layer Pruning

arXiv.org Artificial Intelligence

This study examines the effectiveness of layer pruning in creating efficient Sentence BERT (SBERT) models. Our goal is to create smaller sentence embedding models that reduce complexity while maintaining strong embedding similarity. We assess BERT models like Muril and MahaBERT-v2 before and after pruning, comparing them with smaller, scratch-trained models like MahaBERT-Small and MahaBERT-Smaller. Through a two-phase SBERT fine-tuning process involving Natural Language Inference (NLI) and Semantic Textual Similarity (STS), we evaluate the impact of layer reduction on embedding quality. Our findings show that pruned models, despite fewer layers, perform competitively with fully layered versions. Moreover, pruned models consistently outperform similarly sized, scratch-trained models, establishing layer pruning as an effective strategy for creating smaller, efficient embedding models. These results highlight layer pruning as a practical approach for reducing computational demand while preserving high-quality embeddings, making SBERT models more accessible for languages with limited technological resources.
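As a rough illustration of what layer pruning selects, the sketch below picks which encoder layers survive when a 12-layer model is cut down to 6. The abstract does not say which layers are removed, so the three strategies and the names `layers_to_keep`, `drop_top`, `drop_bottom`, and `drop_middle` are assumptions made for this example, not the authors' procedure.

```python
def layers_to_keep(num_layers: int, keep: int, strategy: str = "drop_top") -> list[int]:
    """Return the 0-based indices of encoder layers that survive pruning.

    Strategy names are hypothetical labels for this sketch.
    """
    if not 0 < keep <= num_layers:
        raise ValueError("keep must be between 1 and num_layers")
    if strategy == "drop_top":
        # Keep the layers closest to the input embeddings.
        return list(range(keep))
    if strategy == "drop_bottom":
        # Keep the layers closest to the output.
        return list(range(num_layers - keep, num_layers))
    if strategy == "drop_middle":
        # Keep a block at each end and remove the layers in between.
        head = (keep + 1) // 2
        tail = keep - head
        return list(range(head)) + list(range(num_layers - tail, num_layers))
    raise ValueError(f"unknown strategy: {strategy}")


# Stand-in for a stack of transformer layers (strings instead of real modules).
layers = [f"layer_{i}" for i in range(12)]
pruned = [layers[i] for i in layers_to_keep(len(layers), 6, "drop_top")]
```

The pruned stack would then go through the NLI and STS fine-tuning phases described in the abstract; that part is not shown here.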


Harnessing Pre-Trained Sentence Transformers for Offensive Language Detection in Indian Languages

arXiv.org Artificial Intelligence

In our increasingly interconnected digital world, social media platforms have emerged as powerful channels for the dissemination of hate speech and offensive content. This work delves into the domain of hate speech detection, placing specific emphasis on three low-resource Indian languages: Bengali, Assamese, and Gujarati. The challenge is framed as a text classification task, aimed at discerning whether a tweet contains offensive or non-offensive content. Leveraging the HASOC 2023 datasets, we fine-tuned pre-trained BERT and SBERT models to evaluate their effectiveness in identifying hate speech. Our findings underscore the superiority of monolingual sentence-BERT models, particularly in the Bengali language, where we achieved the highest ranking. However, the performance in Assamese and Gujarati languages signifies ongoing opportunities for enhancement. Our goal is to foster inclusive online spaces by countering hate speech proliferation.


Short Answer Grading Using One-shot Prompting and Text Similarity Scoring Model

arXiv.org Artificial Intelligence

In this study, we developed an automated short answer grading (ASAG) model that provided both analytic scores and final holistic scores. Short answer items typically consist of multiple sub-questions, and providing an analytic score and the text span relevant to each sub-question can increase the interpretability of the automated scores. Furthermore, they can be used to generate actionable feedback for students. Despite these advantages, most studies have focused on predicting only holistic scores due to the difficulty in constructing datasets with manual annotations. To address this difficulty, we used large language model (LLM)-based one-shot prompting and a text similarity scoring model with domain adaptation using a small manually annotated dataset. The accuracy and quadratic weighted kappa of our model were 0.67 and 0.71, respectively, on a subset of the publicly available ASAG dataset. The model achieved a substantial improvement over the majority baseline.
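For readers unfamiliar with the two reported metrics, here is a minimal sketch of accuracy and quadratic weighted kappa for integer scores in the range 0 to `num_classes - 1`. The function names and the toy scores are assumptions for illustration; the study's own evaluation code may differ.

```python
def accuracy(y_true, y_pred):
    """Fraction of items where the predicted score equals the human score."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Agreement between two raters, penalising disagreements by squared distance."""
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    n = len(y_true)
    # observed[i][j] counts items scored i by the human and j by the model.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            # Expected count if the two score distributions were independent.
            expected = true_hist[i] * pred_hist[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    if weighted_expected == 0:
        # Both raters used a single identical score: treat as full agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Toy example (made-up scores, not data from the study).
human = [0, 1, 2, 1]
model = [0, 1, 2, 1]
kappa = quadratic_weighted_kappa(human, model, num_classes=3)
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, which is why it is commonly reported alongside plain accuracy for ordinal scores.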


L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT

arXiv.org Artificial Intelligence

The multilingual Sentence-BERT (SBERT) models map different languages to a common representation space and are useful for cross-language similarity and mining tasks. We propose a simple yet effective approach to convert vanilla multilingual BERT models into multilingual sentence BERT models using a synthetic corpus. We simply aggregate translated NLI or STS datasets of the low-resource target languages together and perform SBERT-like fine-tuning of the vanilla multilingual BERT model. We show that multilingual BERT models are inherent cross-lingual learners and this simple baseline fine-tuning approach without explicit cross-lingual training yields exceptional cross-lingual properties. We show the efficacy of our approach on 10 major Indic languages and also show the applicability of our approach to the non-Indic languages German and French. Using this approach, we further present L3Cube-IndicSBERT, the first multilingual sentence representation model specifically for the Indian languages Hindi, Marathi, Kannada, Telugu, Malayalam, Tamil, Gujarati, Odia, Bengali, and Punjabi. The IndicSBERT exhibits strong cross-lingual capabilities and performs significantly better than alternatives like LaBSE, LASER, and paraphrase-multilingual-mpnet-base-v2 on Indic cross-lingual and monolingual sentence similarity tasks. We also release monolingual SBERT models for each of the languages and show that IndicSBERT performs competitively with its monolingual counterparts. These models have been evaluated using embedding similarity scores and classification accuracy.


Logically at Factify 2: A Multi-Modal Fact Checking System Based on Evidence Retrieval techniques and Transformer Encoder Architecture

arXiv.org Artificial Intelligence

In this paper, we present the Logically submissions to the De-Factify 2 challenge (DE-FACTIFY 2023) on task 1 of Multi-Modal Fact Checking. We describe our submission to this challenge, including explored evidence retrieval and selection techniques, pre-trained cross-modal and unimodal models, and a cross-modal veracity model based on the well-established Transformer Encoder (TE) architecture, which relies heavily on the concept of self-attention. Exploratory analysis is also conducted on the Factify 2 data set that uncovers the salient multi-modal patterns and hypotheses motivating the architecture proposed in this work. A series of preliminary experiments were done to investigate and benchmark different pre-trained embedding models, evidence retrieval settings and thresholds. The final system, a standard two-stage evidence-based veracity detection system, yielded a weighted average F1 score of 0.79 on both the validation set and the final blind test set of task 1, which achieved 3rd place, with a small margin to the top-performing systems, among 9 participants on the leaderboard.
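The weighted average F1 score reported above is the per-class F1 averaged with each class weighted by its number of true instances. A minimal sketch follows; the labels `"a"` and `"b"` and the function name `weighted_f1` are placeholders for illustration, not the challenge's class names or scoring script.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        false_pos = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        false_neg = count - true_pos
        denom = 2 * true_pos + false_pos + false_neg
        f1 = 2 * true_pos / denom if denom else 0.0
        score += f1 * count / total
    return score


# Toy example with made-up labels.
gold = ["a", "a", "a", "b"]
pred = ["a", "a", "b", "b"]
score = weighted_f1(gold, pred)
```

In this toy case class `"a"` has F1 of 0.8 with three of the four gold items and class `"b"` has F1 of 2/3 with one, so the larger class dominates the average.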


Kwame: A Bilingual AI Teaching Assistant for Online SuaCode Courses

arXiv.org Artificial Intelligence

Introductory hands-on courses, such as our smartphone-based coding course SuaCode, require a lot of support for students to accomplish learning goals. Online environments make it even more difficult to get assistance, especially more recently because of COVID-19. Given the multilingual context of our students (learners across 38 African countries), in this work, we developed an AI Teaching Assistant (Kwame) that provides answers to students' coding questions from our SuaCode courses in English and French. Kwame is a Sentence-BERT (SBERT)-based question-answering (QA) system that we trained and evaluated using question-answer pairs created from our course's quizzes and students' questions in past cohorts. It finds the paragraph most semantically similar to the question via cosine similarity. We compared the system with TF-IDF and Universal Sentence Encoder. Our results showed that SBERT was the slowest, taking 6 seconds per question, but the most accurate, and that fine-tuning on our course data improved the result.
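The retrieval step described above, choosing the paragraph whose embedding has the highest cosine similarity with the question embedding, can be sketched as follows. The two-dimensional vectors are toy stand-ins for SBERT embeddings, and `most_similar` is a hypothetical helper, not part of Kwame's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(question_vec, paragraph_vecs):
    """Return (index, score) of the paragraph closest to the question."""
    scores = [cosine_similarity(question_vec, p) for p in paragraph_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy embeddings; in practice these would come from a sentence encoder.
question = [1.0, 0.0]
paragraphs = [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]]
best_index, best_score = most_similar(question, paragraphs)
```

Because cosine similarity ignores vector length, the second paragraph wins here even though its vector is longer than the question's.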


Ranking Clarification Questions via Natural Language Inference

arXiv.org Artificial Intelligence

Given a natural language query, teaching machines to ask clarifying questions is of immense utility in practical natural language processing systems. Such interactions could help in filling information gaps for better machine comprehension of the query. For the task of ranking clarification questions, we hypothesize that determining whether a clarification question pertains to a missing entry in a given post (on QA forums such as StackExchange) could be considered as a special case of Natural Language Inference (NLI), where both the post and the most relevant clarification question point to a shared latent piece of information or context. We validate this hypothesis by incorporating representations from a Siamese BERT model fine-tuned on NLI and Multi-NLI datasets into our models and demonstrate that our best performing model obtains a relative performance improvement of 40 percent and 60 percent, respectively (on the key metric of Precision@1), over the state-of-the-art baseline(s) on the two evaluation sets of the StackExchange dataset, thereby significantly surpassing the state of the art.
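To make the reported metric concrete, the sketch below computes Precision@1 over ranked lists of clarification questions and the relative improvement of one score over another. All question IDs and the two scores passed to `relative_improvement` are invented for illustration and are not results from the paper.

```python
def precision_at_1(ranked_lists, relevant_sets):
    """Fraction of posts whose top-ranked question is a relevant one."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if ranked and ranked[0] in relevant
    )
    return hits / len(ranked_lists)


def relative_improvement(new_score, baseline_score):
    """Improvement expressed as a fraction of the baseline score."""
    return (new_score - baseline_score) / baseline_score


# Toy rankings for four posts (hypothetical question IDs).
rankings = [["q1", "q2"], ["q3", "q4"], ["q5", "q6"], ["q7"]]
relevant = [{"q1"}, {"q4"}, {"q5"}, {"q8"}]
p_at_1 = precision_at_1(rankings, relevant)

# Hypothetical scores: a baseline of 0.30 rising to 0.42 is a 40 percent relative gain.
gain = relative_improvement(0.42, 0.30)
```

A relative improvement is measured against the baseline's own score, so a 40 percent gain does not mean Precision@1 rose by 40 percentage points.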