Tewari, Dinesh
SMOL: Professionally translated parallel data for 115 under-represented languages
Caswell, Isaac, Nielsen, Elizabeth, Luo, Jiaming, Cherry, Colin, Kovacs, Geza, Shemtov, Hadar, Talukdar, Partha, Tewari, Dinesh, Diane, Baba Mamadi, Doumbouya, Koulako Moussa, Diane, Djibrila, Cissé, Solo Farabado
We open-source SMOL (Set of Maximal Overall Leverage), a suite of training data to unlock translation for low-resource languages (LRLs). SMOL has been translated into 115 under-resourced languages, including many for which no previous public resources exist, for a total of 6.1M translated tokens. SMOL comprises two sub-datasets, each carefully chosen for maximum impact given its size: SMOL-Sent, a set of sentences chosen for broad unique token coverage, and SMOL-Doc, a document-level source focusing on broad topic coverage. They join the already released GATITOS for a trifecta of paragraph-, sentence-, and token-level content. We demonstrate that using SMOL to prompt or fine-tune Large Language Models yields robust ChrF improvements. In addition to translation, we provide factuality ratings and rationales for all documents in SMOL-Doc, yielding the first factuality datasets for most of these languages.
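For readers who want to reproduce the kind of ChrF comparison reported above, the metric can be computed with the sacrebleu package. The sketch below is illustrative only; the file names and the one-sentence-per-line layout are assumptions, not part of the SMOL release.

    # Minimal sketch: compare corpus-level ChrF of two systems with sacrebleu.
    # Assumes (hypothetically) one sentence per line, aligned across files.
    import sacrebleu

    with open("baseline_hyps.txt", encoding="utf-8") as f:
        baseline = [line.strip() for line in f]
    with open("finetuned_hyps.txt", encoding="utf-8") as f:
        finetuned = [line.strip() for line in f]
    with open("references.txt", encoding="utf-8") as f:
        references = [line.strip() for line in f]

    # corpus_chrf takes a list of hypotheses and a list of reference streams.
    baseline_chrf = sacrebleu.corpus_chrf(baseline, [references])
    finetuned_chrf = sacrebleu.corpus_chrf(finetuned, [references])
    print(f"Baseline ChrF:   {baseline_chrf.score:.2f}")
    print(f"Fine-tuned ChrF: {finetuned_chrf.score:.2f}")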
Agricultural Landscape Understanding At Country-Scale
Dua, Radhika, Saxena, Nikita, Agarwal, Aditi, Wilson, Alex, Singh, Gaurav, Tran, Hoang, Deshpande, Ishan, Kaur, Amandeep, Aggarwal, Gaurav, Nath, Chandan, Basu, Arnab, Batchu, Vishal, Holla, Sharath, Kurle, Bindiya, Missura, Olana, Aggarwal, Rahul, Garg, Shubhika, Shah, Nishi, Singh, Avneet, Tewari, Dinesh, Dondzik, Agata, Adsul, Bharat, Sohoni, Milind, Praveen, Asim Rama, Dangi, Aaryan, Kadivar, Lisan, Abhishek, E, Sudhansu, Niranjan, Hattekar, Kamlakar, Datar, Sameer, Chaithanya, Musty Krishna, Reddy, Anumas Ranjith, Kumar, Aashish, Tirumala, Betala Laxmi, Talekar, Alok
The global food system is facing unprecedented challenges. In 2023, 2.4 billion people experienced moderate to severe food insecurity [1], a crisis precipitated by anthropogenic climate change and evolving dietary preferences. Furthermore, the food system itself significantly contributes to the climate crisis, with food loss and waste accounting for 2.4 gigatonnes of carbon dioxide equivalent emissions per year (GT CO2e/yr) [2], and the production, mismanagement, and misapplication of agricultural inputs such as fertilizers and manure generating an additional 2.5 GT CO2e/yr [3]. To sustain a projected global population of 9.6 billion by 2050, the Food and Agriculture Organization (FAO) estimates that food production must increase by at least 60% [1]. However, this also presents an opportunity: transitioning to sustainable agricultural practices can transform the sector from a net source of greenhouse gas emissions to a vital carbon sink.
IndicGenBench: A Multilingual Benchmark to Evaluate Generation Capabilities of LLMs on Indic Languages
Singh, Harman, Gupta, Nitish, Bharadwaj, Shikhar, Tewari, Dinesh, Talukdar, Partha
As large language models (LLMs) see increasing adoption across the globe, it is imperative for LLMs to be representative of the linguistic diversity of the world. India is a linguistically diverse country of 1.4 billion people. To facilitate research on multilingual LLM evaluation, we release IndicGenBench - the largest benchmark for evaluating LLMs on user-facing generation tasks across a diverse set of 29 Indic languages covering 13 scripts and 4 language families. IndicGenBench is composed of diverse generation tasks like cross-lingual summarization, machine translation, and cross-lingual question answering. IndicGenBench extends existing benchmarks to many Indic languages through human curation, providing multi-way parallel evaluation data for many under-represented Indic languages for the first time. We evaluate a wide range of proprietary and open-source LLMs including GPT-3.5, GPT-4, PaLM-2, mT5, Gemma, BLOOM and LLaMA on IndicGenBench in a variety of settings. The largest PaLM-2 model performs best on most tasks; however, there is a significant performance gap in all languages compared to English, showing that further research is needed for the development of more inclusive multilingual language models. IndicGenBench is released at www.github.com/google-research-datasets/indic-gen-bench
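For the cross-lingual question-answering portion of such a benchmark, predictions are commonly scored with SQuAD-style token-level F1. The sketch below is a generic illustration under assumed file and field names ("prediction", "reference"); it is not the benchmark's actual schema or an official scorer.

    # Minimal sketch: mean token-level F1 over a JSONL file of QA predictions.
    # The path and field names are hypothetical placeholders.
    import json
    from collections import Counter

    def token_f1(prediction: str, reference: str) -> float:
        pred_tokens = prediction.split()
        ref_tokens = reference.split()
        common = Counter(pred_tokens) & Counter(ref_tokens)
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)

    with open("xorqa_predictions.jsonl", encoding="utf-8") as f:
        examples = [json.loads(line) for line in f]

    scores = [token_f1(ex["prediction"], ex["reference"]) for ex in examples]
    print(f"Mean token F1: {sum(scores) / len(scores):.3f}")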
Building Socio-culturally Inclusive Stereotype Resources with Community Engagement
Dev, Sunipa, Goyal, Jaya, Tewari, Dinesh, Dave, Shachi, Prabhakaran, Vinodkumar
With the rapid development and deployment of generative language models in global settings, there is an urgent need to also scale our measurements of harm, not just in the number and types of harms covered, but also in how well they account for local cultural contexts, including marginalized identities and the social biases experienced by them. Current evaluation paradigms are limited in their ability to address this, as they are not representative of diverse, locally situated but global, socio-cultural perspectives. It is imperative that our evaluation resources are enhanced and calibrated by including people and experiences from different cultures and societies worldwide, in order to prevent gross underestimations or skews in measurements of harm. In this work, we demonstrate a socio-culturally aware expansion of evaluation resources in the Indian societal context, specifically for the harm of stereotyping. We devise a community-engaged effort to build a resource which contains stereotypes for axes of disparity that are uniquely present in India. The resultant resource increases the number of stereotypes known for and in the Indian context by over 1,000 stereotypes across many unique identities. We also demonstrate the utility and effectiveness of such expanded resources for evaluations of language models. CONTENT WARNING: This paper contains examples of stereotypes that may be offensive.