
Collaborating Authors: Ghuge, Shardul


FAIR Enough: How Can We Develop and Assess a FAIR-Compliant Dataset for Large Language Models' Training?

arXiv.org Artificial Intelligence

The rapid evolution of Large Language Models (LLMs) underscores the critical importance of ethical considerations and data integrity in AI development, emphasizing the role of FAIR (Findable, Accessible, Interoperable, Reusable) data principles. While these principles have long been a cornerstone of ethical data stewardship, their application in LLM training data is less prevalent, an issue our research aims to address. Our study begins with a review of existing literature, highlighting the significance of FAIR principles in data management for model training. Building on this foundation, we introduce a novel framework that incorporates FAIR principles into the LLM training process. A key aspect of this approach is a comprehensive checklist, designed to assist researchers and developers in consistently applying FAIR data principles throughout the model development lifecycle. The practicality and effectiveness of our framework are demonstrated through a case study that involves creating a FAIR-compliant dataset to detect and reduce biases. This case study not only validates the usefulness of our framework but also establishes new benchmarks for more equitable, transparent, and ethical practices in LLM training. We offer this framework to the community as a means to promote technologically advanced, ethically sound, and socially responsible AI models.
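The checklist the abstract describes could, in practice, be run as a programmatic audit over a dataset's metadata. The sketch below is a minimal illustration of that idea; the checklist items, metadata field names, and values are hypothetical assumptions, not the paper's actual checklist.

```python
# Hypothetical FAIR-compliance audit for a training dataset's metadata.
# Checklist items and field names are illustrative, not the paper's own.

FAIR_CHECKLIST = {
    "findable": ["persistent_id", "rich_metadata"],
    "accessible": ["open_protocol", "license"],
    "interoperable": ["standard_format", "controlled_vocabulary"],
    "reusable": ["provenance", "usage_terms"],
}

def audit(dataset_metadata):
    """Return, for each FAIR principle, the required fields that are missing."""
    report = {}
    for principle, fields in FAIR_CHECKLIST.items():
        report[principle] = [f for f in fields if not dataset_metadata.get(f)]
    return report

# Example: a partially documented dataset (values are made up).
meta = {"persistent_id": "doi:10.1234/xyz", "license": "CC-BY-4.0",
        "standard_format": "jsonl", "provenance": "crawl-2023"}
report = audit(meta)
```

Running the audit over such a record flags each principle's gaps (here, e.g., missing `rich_metadata` under "findable"), which is the kind of per-principle feedback a development-lifecycle checklist would give.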


Analyzing the Impact of Fake News on the Anticipated Outcome of the 2024 Election Ahead of Time

arXiv.org Artificial Intelligence

Despite increasing awareness and research around fake news, there is still a significant need for datasets that specifically target racial slurs and biases within North American political speeches. This is particularly important in the context of upcoming North American elections. This study introduces a comprehensive dataset that illuminates these critical aspects of misinformation. To develop this fake news dataset, we scraped and built a corpus of 40,000 news articles about political discourses in North America. A portion of this dataset (4,000 articles) was then carefully annotated, using a blend of advanced language models and human verification methods. We have made both these datasets openly available to the research community and have conducted benchmarking on the annotated data to demonstrate its utility. We release the best-performing language model along with the data. We encourage researchers and developers to make use of this dataset and contribute to this ongoing initiative.
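A common way to blend language-model annotation with human verification, as the abstract describes, is to auto-accept high-confidence model labels and route the rest to reviewers. The sketch below illustrates that routing step only; the confidence threshold, label names, and record structure are assumptions for illustration, not details from the paper.

```python
# Hypothetical routing step for a model-plus-human annotation pipeline:
# high-confidence LLM labels are accepted, the rest go to a human queue.
# The threshold and label set are illustrative assumptions.

CONFIDENCE_THRESHOLD = 0.9

def route_annotations(model_outputs):
    """Split model predictions into auto-accepted labels and a review queue."""
    accepted, needs_review = [], []
    for item in model_outputs:
        if item["confidence"] >= CONFIDENCE_THRESHOLD:
            accepted.append(item)
        else:
            needs_review.append(item)
    return accepted, needs_review

# Example: two model-labeled articles (made-up records).
outputs = [
    {"id": 1, "label": "biased", "confidence": 0.97},
    {"id": 2, "label": "neutral", "confidence": 0.62},
]
accepted, queue = route_annotations(outputs)
```

Only the low-confidence item lands in the human queue, which concentrates annotator effort on the cases where the model is least reliable.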


Unlocking Bias Detection: Leveraging Transformer-Based Models for Content Analysis

arXiv.org Artificial Intelligence

Bias detection in text is imperative due to its role in reinforcing negative stereotypes, disseminating misinformation, and influencing decisions. Current language models often fall short in generalizing beyond their training sets. In response, we introduce the Contextualized Bi-Directional Dual Transformer (CBDT) Classifier. This novel architecture utilizes two synergistic transformer networks: the Context Transformer and the Entity Transformer, aiming for enhanced bias detection. Our dataset preparation follows the FAIR principles, ensuring ethical data usage. Through rigorous testing on various datasets, CBDT demonstrates its ability to distinguish biased from neutral statements while also pinpointing exact biased lexemes. Our approach outperforms existing methods, achieving a 2-4% increase over benchmark performances. This opens avenues for adapting the CBDT model across diverse linguistic and cultural landscapes.
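The two-branch design the abstract describes — one signal from the whole sentence context, another from entity mentions, combined into a single decision — can be illustrated with a toy stand-in. The sketch below is purely structural: real CBDT uses transformer networks, whereas the lexicons, weights, and threshold here are invented for illustration.

```python
# Toy structural sketch in the spirit of CBDT's two-branch design:
# a context signal and an entity signal are computed separately and
# combined. Lexicons, weights, and threshold are illustrative stand-ins,
# NOT the paper's actual model.

BIASED_TERMS = {"always", "never", "obviously"}      # hypothetical lexicon
SENSITIVE_ENTITIES = {"immigrants", "politicians"}   # hypothetical lexicon

def context_score(sentence):
    """Crude context signal: fraction of tokens that are loaded terms."""
    tokens = sentence.lower().split()
    return sum(t in BIASED_TERMS for t in tokens) / max(len(tokens), 1)

def entity_score(sentence):
    """Crude entity signal: 1.0 if a sensitive entity is mentioned."""
    return 1.0 if set(sentence.lower().split()) & SENSITIVE_ENTITIES else 0.0

def classify(sentence, weight=0.5, threshold=0.3):
    """Combine both branch scores, mirroring the dual-branch idea."""
    score = weight * context_score(sentence) + (1 - weight) * entity_score(sentence)
    return "biased" if score >= threshold else "neutral"
```

In the real architecture both branches are learned transformer encoders rather than lexicon lookups, but the combination step above shows why two specialized views of the input can catch bias that a single encoder misses.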