Ravi, Selvan Sunitha
GLIDER: Grading LLM Interactions and Decisions using Explainable Ranking
Deshpande, Darshan, Ravi, Selvan Sunitha, CH-Wang, Sky, Mielczarek, Bartosz, Kannappan, Anand, Qian, Rebecca
The LLM-as-judge paradigm is increasingly being adopted for automated evaluation of model outputs. While LLM judges have shown promise on constrained evaluation tasks, closed-source LLMs display critical shortcomings when deployed in real-world applications due to challenges with fine-grained metrics and explainability, and task-specific evaluation models lack cross-domain generalization. We introduce GLIDER, a powerful 3B evaluator LLM that can score any text input and associated context on arbitrary user-defined criteria. GLIDER shows higher Pearson's correlation than GPT-4o on FLASK and greatly outperforms prior evaluation models, achieving comparable performance to LLMs 17x its size. GLIDER supports fine-grained scoring, multilingual reasoning, and span highlighting, and was trained on 685 domains and 183 criteria. Extensive qualitative analysis shows that GLIDER scores are highly correlated with human judgments, with 91.3% human agreement. We have open-sourced GLIDER to facilitate future research.
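A minimal sketch of how a rubric-based evaluator LLM like GLIDER might be invoked is given below, assuming a causal-LM checkpoint published on Hugging Face. The model id, prompt template, and the "Score:" output convention are illustrative assumptions, not GLIDER's documented interface.

```python
# Hedged sketch: scoring a response with a small evaluator LLM on a user-defined rubric.
# The model id, prompt wording, and output format are assumptions; adjust to the model card.
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "PatronusAI/glider"  # assumed Hugging Face id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def judge(context: str, response: str, rubric: str) -> dict:
    """Ask the evaluator to grade `response` against a user-defined rubric."""
    prompt = (
        "Analyze the pass criteria and rubric, then score the response.\n"
        f"Context: {context}\nResponse: {response}\nRubric (1-5): {rubric}\n"
        "Return your reasoning, highlighted spans, and a final line 'Score: <1-5>'."
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=512)
    text = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
    match = re.search(r"Score:\s*([1-5])", text)
    return {"raw": text, "score": int(match.group(1)) if match else None}
```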
INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
Romanou, Angelika, Foroutan, Negar, Sotnikova, Anna, Chen, Zeming, Nelaturu, Sree Harsha, Singh, Shivalika, Maheshwary, Rishabh, Altomare, Micol, Haggag, Mohamed A., A, Snegha, Amayuelas, Alfonso, Amirudin, Azril Hafizi, Aryabumi, Viraat, Boiko, Danylo, Chang, Michael, Chim, Jenny, Cohen, Gal, Dalmia, Aditya Kumar, Diress, Abraham, Duwal, Sharad, Dzenhaliou, Daniil, Florez, Daniel Fernando Erazo, Farestam, Fabian, Imperial, Joseph Marvin, Islam, Shayekh Bin, Isotalo, Perttu, Jabbarishiviari, Maral, Karlsson, Börje F., Khalilov, Eldar, Klamm, Christopher, Koto, Fajri, Krzemiński, Dominik, de Melo, Gabriel Adriano, Montariol, Syrielle, Nan, Yiyang, Niklaus, Joel, Novikova, Jekaterina, Ceron, Johan Samir Obando, Paul, Debjit, Ploeger, Esther, Purbey, Jebish, Rajwal, Swati, Ravi, Selvan Sunitha, Rydell, Sara, Santhosh, Roshan, Sharma, Drishti, Skenduli, Marjana Prifti, Moakhar, Arshia Soltani, Moakhar, Bardia Soltani, Tamir, Ran, Tarun, Ayush Kumar, Wasi, Azmine Toushik, Weerasinghe, Thenuka Ovin, Yilmaz, Serhan, Zhang, Mike, Schlag, Imanol, Fadaee, Marzieh, Hooker, Sara, Bosselut, Antoine
The performance differential of large language models (LLM) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (i.e., multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. Moreover, current practices in multilingual benchmark construction often translate English resources, ignoring the regional and cultural knowledge of the environments in which multilingual systems would be used. In this work, we construct an evaluation suite of 197,243 QA pairs from local exam sources to measure the capabilities of multilingual LLMs in a variety of regional contexts.
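To illustrate how such a multilingual QA suite might be scored, the sketch below computes per-language accuracy over multiple-choice predictions. The field names are hypothetical and do not reflect the benchmark's actual schema.

```python
# Hedged sketch: per-language accuracy on a multilingual multiple-choice QA suite.
# The 'language', 'answer', and 'prediction' keys are illustrative assumptions.
from collections import defaultdict

def accuracy_by_language(examples):
    """examples: iterable of dicts with 'language', 'answer', and 'prediction' keys."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        total[ex["language"]] += 1
        correct[ex["language"]] += int(ex["prediction"] == ex["answer"])
    return {lang: correct[lang] / total[lang] for lang in total}

# Example usage with toy predictions:
sample = [
    {"language": "hi", "answer": "B", "prediction": "B"},
    {"language": "hi", "answer": "C", "prediction": "A"},
    {"language": "sw", "answer": "D", "prediction": "D"},
]
print(accuracy_by_language(sample))  # {'hi': 0.5, 'sw': 1.0}
```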
Lynx: An Open Source Hallucination Evaluation Model
Ravi, Selvan Sunitha, Mielczarek, Bartosz, Kannappan, Anand, Kiela, Douwe, Qian, Rebecca
Retrieval Augmented Generation (RAG) techniques aim to mitigate hallucinations in Large Language Models (LLMs). However, LLMs can still produce information that is unsupported by or contradictory to the retrieved contexts. We introduce LYNX, a state-of-the-art hallucination detection LLM that is capable of advanced reasoning on challenging real-world hallucination scenarios. To evaluate LYNX, we present HaluBench, a comprehensive hallucination evaluation benchmark consisting of 15k samples sourced from various real-world domains. Our experimental results show that LYNX outperforms GPT-4o, Claude-3-Sonnet, and other closed- and open-source LLM-as-a-judge models on HaluBench. We release LYNX, HaluBench, and our evaluation code for public access.
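The sketch below illustrates a HaluBench-style faithfulness check with an LLM-as-judge: the judge receives a question, a retrieved document, and an answer, and returns a PASS/FAIL verdict. The prompt wording, the `judge_fn` callable, and the PASS/FAIL convention are assumptions for illustration, not LYNX's released interface.

```python
# Hedged sketch: an LLM-as-judge faithfulness check for RAG outputs.
# `judge_fn` stands in for any chat-completion call (LYNX, GPT-4o, etc.).
from typing import Callable

FAITHFULNESS_PROMPT = (
    "Given the question and document, determine whether the answer is "
    "faithful to the document.\n"
    "Question: {question}\nDocument: {context}\nAnswer: {answer}\n"
    "Reply with your reasoning, then a final verdict of PASS (faithful) or FAIL (hallucinated)."
)

def is_faithful(question: str, context: str, answer: str,
                judge_fn: Callable[[str], str]) -> bool:
    """Return True if the judge model deems the answer faithful to the context."""
    reply = judge_fn(FAITHFULNESS_PROMPT.format(
        question=question, context=context, answer=answer))
    # Take the last PASS/FAIL token, since the judge may reason before deciding.
    verdicts = [tok for tok in reply.upper().split() if tok in {"PASS", "FAIL"}]
    return bool(verdicts) and verdicts[-1] == "PASS"

# Example with a stub judge that reasons and then passes:
print(is_faithful("What is the capital of France?",
                  "Paris is the capital of France.", "Paris",
                  lambda p: "The answer is supported by the document. PASS"))  # True
```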
Leveraging Multimodal Behavioral Analytics for Automated Job Interview Performance Assessment and Feedback
Agrawal, Anumeha, George, Rosa Anil, Ravi, Selvan Sunitha, S, Sowmya Kamath, M, Anand Kumar
Behavioral cues play a significant part in human communication and cognitive perception. In most professional domains, employee recruitment policies are framed such that both professional skills and personality traits are adequately assessed. Hiring interviews are structured to expansively evaluate a potential employee's suitability for the position: their professional qualifications, interpersonal skills, ability to perform in critical and stressful situations under time and resource constraints, and so on. Therefore, candidates need to be aware of their positive and negative attributes and be mindful of behavioral cues that might have adverse effects on their success. We propose a multimodal analytical framework that analyzes the candidate in an interview scenario and provides feedback on predefined labels such as engagement, speaking rate, and eye contact. We perform a comprehensive analysis of the interviewee's facial expressions, speech, and prosodic information, using the video, audio, and text transcripts obtained from the recorded interview. These multimodal data sources are combined into a composite representation, which is used to train machine learning classifiers that predict the class labels. This analysis is then used to provide constructive feedback to the interviewee on their behavioral cues and body language. Experimental validation showed that the proposed methodology achieves promising results.
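The sketch below illustrates the general pattern of feature-level fusion described above: per-modality feature vectors are concatenated into a composite representation and used to train a classifier for one behavioral label. The feature dimensions, the random stand-in features, and the choice of a random-forest classifier are illustrative assumptions, not the paper's exact pipeline.

```python
# Hedged sketch: fusing per-modality features into a composite representation
# and training a classifier for one behavioral label (e.g. "engagement").
# Real feature extraction (facial action units, prosody, transcript embeddings)
# is out of scope here; random vectors stand in for it.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
video_feats = rng.normal(size=(n, 32))   # e.g. facial-expression statistics per clip
audio_feats = rng.normal(size=(n, 16))   # e.g. prosodic features (pitch, energy, rate)
text_feats = rng.normal(size=(n, 64))    # e.g. transcript embeddings
engagement = rng.integers(0, 2, size=n)  # toy binary "engagement" label

# Composite representation: simple feature-level concatenation across modalities.
X = np.concatenate([video_feats, audio_feats, text_feats], axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, engagement, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("engagement accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```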