
Collaborating Authors

 Ramesh, Krithika


Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains

arXiv.org Artificial Intelligence

The difficulty of anonymizing text data hinders the development and deployment of NLP in high-stakes domains that involve private data, such as healthcare and social services. Poorly anonymized sensitive data cannot be easily shared with annotators or external researchers, nor can it be used to train public models. In this work, we explore the feasibility of using synthetic data generated from differentially private language models in place of real data to facilitate the development of NLP in these domains without compromising privacy. In contrast to prior work, we generate synthetic data for real high-stakes domains, and we propose and conduct use-inspired evaluations to assess data quality. Our results show that prior simplistic evaluations have failed to highlight utility, privacy, and fairness issues in the synthetic data. Overall, our work underscores the need for further improvements to synthetic data generation for it to be a viable way to enable privacy-preserving data sharing.
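For readers unfamiliar with how such models are trained, the sketch below illustrates the DP-SGD mechanism (per-example gradient clipping plus Gaussian noise) that typically underlies differentially private fine-tuning. It uses a toy classifier and the Opacus library purely as an assumed setup; it is not the paper's actual language models, data, or hyperparameters.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-ins: a tiny classifier and random data. The paper fine-tunes full
# language models on real high-stakes text, which this sketch does not reproduce.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 2))
data = TensorDataset(torch.randn(512, 128), torch.randint(0, 2, (512,)))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loader = DataLoader(data, batch_size=32)

engine = PrivacyEngine()
model, optimizer, loader = engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,  # scale of Gaussian noise relative to the clipping bound
    max_grad_norm=1.0,     # per-example gradient clipping bound
)

for epoch in range(3):
    for x, y in loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()  # Opacus clips per-example gradients and adds noise
        optimizer.step()

print("privacy budget epsilon at delta=1e-5:", engine.get_epsilon(delta=1e-5))
```

After DP training, synthetic records are sampled from the privately trained model; the post-processing property of differential privacy means the generated data inherits the same privacy guarantee.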


MEGA: Multilingual Evaluation of Generative AI

arXiv.org Artificial Intelligence

Generative AI models have shown impressive performance on many Natural Language Processing tasks such as language understanding, reasoning, and language generation. An important question being asked by the AI community today concerns the capabilities and limits of these models, and it is clear that evaluating generative AI is very challenging. Most studies of generative LLMs have been restricted to English, and it is unclear how capable these models are at understanding and generating text in other languages. We present MEGA, the first comprehensive benchmarking of generative LLMs, which evaluates models on standard NLP benchmarks covering 16 NLP datasets across 70 typologically diverse languages. We compare the performance of generative LLMs, including ChatGPT and GPT-4, to state-of-the-art (SOTA) non-autoregressive models on these tasks to determine how well generative models perform relative to the previous generation of LLMs. We present a thorough analysis of model performance across languages and tasks and discuss challenges in improving the performance of generative LLMs on low-resource languages. We create a framework for evaluating generative LLMs in the multilingual setting and provide directions for future progress in the field.
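The sketch below shows the general shape of such a multilingual evaluation loop: prompt a generative model on labeled examples and aggregate accuracy per language. The query_model stub, prompt format, and example data are assumed placeholders, not MEGA's actual framework, datasets, or prompting strategies.

```python
# Minimal per-language scoring of a generative model on a classification task.
from collections import defaultdict

def query_model(prompt: str) -> str:
    # Placeholder: substitute a real completion/chat API or local model call here.
    return "positive"

examples = [
    {"lang": "hi", "text": "यह फ़िल्म शानदार थी।", "label": "positive"},
    {"lang": "sw", "text": "Filamu hii ilikuwa mbaya sana.", "label": "negative"},
]

correct, total = defaultdict(int), defaultdict(int)
for ex in examples:
    prompt = (
        "Classify the sentiment as positive or negative.\n"
        f"Text: {ex['text']}\nAnswer:"
    )
    pred = query_model(prompt).strip().lower()
    total[ex["lang"]] += 1
    correct[ex["lang"]] += int(pred == ex["label"])

for lang in total:
    print(lang, f"accuracy = {correct[lang] / total[lang]:.2f}")
```

Aggregating scores by language in this way is what makes gaps between high- and low-resource languages visible in a single benchmark run.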


Fairness in Language Models Beyond English: Gaps and Challenges

arXiv.org Artificial Intelligence

With language models becoming increasingly ubiquitous, it has become essential to address their inequitable treatment of diverse demographic groups and factors. Most research on evaluating and mitigating fairness harms has concentrated on English, while multilingual models and non-English languages have received comparatively little attention. This paper presents a survey of fairness in multilingual and non-English contexts, highlighting the shortcomings of current research and the difficulties faced by methods designed for English. We contend that the multitude of diverse cultures and languages across the world makes it infeasible to achieve comprehensive coverage through constructed fairness datasets. Thus, the measurement and mitigation of biases must evolve beyond current dataset-driven practices, which are narrowly focused on specific dimensions and types of biases and are therefore impossible to scale across languages and cultures.


Curb Your Carbon Emissions: Benchmarking Carbon Emissions in Machine Translation

arXiv.org Artificial Intelligence

Although our computational techniques and hardware resources have advanced greatly over the past few decades, the rise of large language models with applications across multiple sectors means that training and developing NLP models, particularly at scale, can carry a substantial environmental cost. This is because the associated energy usage (whether carbon neutral or not) [1, 2] may contribute directly or indirectly to the effects of climate change. With experiments on the total time expected for models such as the Transformer, BERT, and GPT-2 to train and the subsequent cost of training, Strubell et al. [2] provide substantial evidence that researchers need to increasingly prioritize computationally efficient hardware and algorithms. Research also suggests that large language models can be outperformed by less computationally intensive counterparts on multiple tasks with the help of fine-tuning [3] and techniques such as random search for hyperparameter optimization [1, 4-6] or pruning [7, 8]. Additionally, as performance across tasks varies with the languages used, data availability, and model architecture, among other factors, training models to a given performance level is likely less carbon-intensive for some languages than for others. This speculation is substantiated by the correlation found between the morphological ambiguity of European languages and the performance of language models on them [9]. The primary objective of our work is to measure the differences in carbon emissions released across multiple language pairs and to assess how the various components of the two architectures we use contribute to those emissions.
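As a rough illustration of how such emissions are commonly estimated, the sketch below converts GPU power draw and training time into kilograms of CO2e using datacenter overhead (PUE) and grid carbon intensity, in the style of Strubell et al. [2]. All constants and runs are illustrative assumptions, not measurements from this paper.

```python
# Back-of-the-envelope CO2e estimate: energy (kWh) scaled by datacenter
# overhead (PUE) and grid carbon intensity. Default coefficients are assumed
# values, not the paper's measured figures.
def co2e_kg(gpu_power_watts: float, num_gpus: int, hours: float,
            pue: float = 1.58, kg_co2_per_kwh: float = 0.43) -> float:
    energy_kwh = gpu_power_watts * num_gpus * hours / 1000.0
    return energy_kwh * pue * kg_co2_per_kwh

# Hypothetical comparison of two MT training runs for different language pairs.
print("en-hi run:", round(co2e_kg(250, 1, hours=12), 2), "kg CO2e")
print("en-de run:", round(co2e_kg(250, 1, hours=8), 2), "kg CO2e")
```

In practice, measured GPU utilization and the local grid's carbon intensity replace these constants, which is what allows per-language-pair comparisons.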


Evaluating Gender Bias in Hindi-English Machine Translation

arXiv.org Artificial Intelligence

With language models increasingly deployed in the real world, it is essential to address the fairness of their outputs. The word embedding representations of these language models often implicitly draw unwanted associations that form a social bias within the model. The nature of gendered languages like Hindi poses an additional problem for the quantification and mitigation of bias, as the forms of words in a sentence change based on the gender of the subject. Additionally, there is little existing work on measuring and debiasing systems for Indic languages. In our work, we attempt to evaluate and quantify the gender bias within a Hindi-English machine translation system. We implement a modified version of the existing TGBI metric based on the grammatical considerations for Hindi. We also compare and contrast the resulting bias measurements across multiple metrics for pre-trained embeddings and the ones learned by our machine translation model.
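One common formulation of TGBI (following Cho et al., 2019) scores each set of gender-neutral source sentences by how their translations distribute over female, male, and neutral forms. The sketch below illustrates that base computation with made-up labels; it does not include the paper's Hindi-specific grammatical modifications.

```python
# TGBI-style score: for a set of gender-neutral source sentences, p_f, p_m, p_n
# are the fractions of translations rendered as female, male, or gender-neutral.
# Predicted labels below are illustrative only.
from math import sqrt

def set_score(pred_genders: list[str]) -> float:
    n = len(pred_genders)
    p_f = pred_genders.count("female") / n
    p_m = pred_genders.count("male") / n
    p_n = pred_genders.count("neutral") / n
    return sqrt(p_f * p_m) + p_n  # 1.0 = balanced or neutral, lower = more skewed

sentence_sets = {
    "occupations": ["male", "male", "female", "neutral"],
    "emotions":    ["female", "female", "male", "male"],
}
scores = {name: set_score(preds) for name, preds in sentence_sets.items()}
print(scores, "TGBI =", sum(scores.values()) / len(scores))
```

The overall TGBI is the mean of the per-set scores, so a system that skews heavily toward one gender in any sentence set lowers the aggregate value.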