Ranchi
FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes
Nawale, Janki Atul, Khan, Mohammed Safi Ur Rahman, D, Janani, Gupta, Mansi, Pruthi, Danish, Khapra, Mitesh M.
Existing studies on fairness are largely Western-focused, making them inadequate for culturally diverse countries such as India. To address this gap, we introduce INDIC-BIAS, a comprehensive India-centric benchmark designed to evaluate fairness of LLMs across 85 identity groups encompassing diverse castes, religions, regions, and tribes. We first consult domain experts to curate over 1,800 socio-cultural topics spanning behaviors and situations, where biases and stereotypes are likely to emerge. Grounded in these topics, we generate and manually validate 20,000 real-world scenario templates to probe LLMs for fairness. We structure these templates into three evaluation tasks: plausibility, judgment, and generation. Our evaluation of 14 popular LLMs on these tasks reveals strong negative biases against marginalized identities, with models frequently reinforcing common stereotypes. Additionally, we find that models struggle to mitigate bias even when explicitly asked to rationalize their decision. Our evaluation provides evidence of both allocative and representational harms that current LLMs could cause towards Indian identities, calling for a more cautious usage in practical applications. We release INDIC-BIAS as an open-source benchmark to advance research on benchmarking and mitigating biases and stereotypes in the Indian context.
Effective Feature Selection for Predicting Spreading Factor with ML in Large LoRaWAN-based Mobile IoT Networks
Prakash, Aman, Choudhury, Nikumani, Hazarika, Anakhi, Gorrela, Alekhya
LoRaWAN is a low-power long-range protocol that enables reliable and robust communication. This paper addresses the challenge of predicting the spreading factor (SF) in LoRaWAN networks using machine learning (ML) techniques. Optimal SF allocation is crucial for optimizing data transmission in IoT-enabled mobile devices, yet it remains a challenging task due to the fluctuation in environment and network conditions. We evaluated ML model performance across a large publicly available dataset to explore the best feature across key LoRaWAN features such as RSSI, SNR, frequency, distance between end devices and gateways, and antenna height of the end device, further, we also experimented with 31 different combinations possible for 5 features. We trained and evaluated the model using k-nearest neighbors (k-NN), Decision Tree Classifier (DTC), Random Forest (RF), and Multinomial Logistic Regression (MLR) algorithms. The combination of RSSI and SNR was identified as the best feature set. The finding of this paper provides valuable information for reducing the overall cost of dataset collection for ML model training and extending the battery life of LoRaWAN devices. This work contributes to a more reliable LoRaWAN system by understanding the importance of specific feature sets for optimized SF allocation.
AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation
Zhang, Songming, Zhang, Xue, Zhang, Tong, Hu, Bojie, Chen, Yufeng, Xu, Jinan
In modern large language models (LLMs), LLM alignment is of crucial importance and is typically achieved through methods such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). However, in most existing methods for LLM alignment, all tokens in the response are optimized using a sparse, response-level reward or preference annotation. The ignorance of token-level rewards may erroneously punish high-quality tokens or encourage low-quality tokens, resulting in suboptimal performance and slow convergence speed. To address this issue, we propose AlignDistil, an RLHF-equivalent distillation method for token-level reward optimization. Specifically, we introduce the reward learned by DPO into the RLHF objective and theoretically prove the equivalence between this objective and a token-level distillation process, where the teacher distribution linearly combines the logits from the DPO model and a reference model. On this basis, we further bridge the accuracy gap between the reward from the DPO model and the pure reward model, by building a contrastive DPO reward with a normal and a reverse DPO model. Moreover, to avoid under- and over-optimization on different tokens, we design a token adaptive logit extrapolation mechanism to construct an appropriate teacher distribution for each token. Experimental results demonstrate the superiority of our AlignDistil over existing methods and showcase fast convergence due to its token-level distributional reward optimization.
Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference
Qiu, Wenjie, Li, Yi-Chen, Zhang, Xuqin, Zhang, Tianyi, Zhang, Yihang, Zhang, Zongzhang, Yu, Yang
Learning reward models from human preference datasets and subsequently optimizing language models via reinforcement learning has emerged as a fundamental paradigm for aligning LLMs with human preferences. The performance of the reward model plays a crucial role in the effectiveness of alignment. Previous reward models operate at a coarse-grained level, requiring the generation of a complete response to obtain a reward value. The sparse reward may present challenges for downstream reinforcement learning. While recent efforts have attempted to learn token-level reward models, the lack of explicit semantic information makes it difficult to model the credit of every individual token. In this paper, we propose assigning scores to every sentence, introducing an intermediate-grained reward model. By segmenting the complete response into sentences and applying differential operations to reward output at the start and end positions of each sentence, we can effectively model the rewards of sentences. Moreover, a novel attention mechanism is introduced to aggregate the scores of all sentences into a response-level score, which allows it to be trained using the Bradley-Terry model. On common benchmarks, our method outperforms the response-level reward model by 2.7% on RewardBench (for reward modeling evaluation) and surpasses all baselines on AlpacaEval (for alignment evaluation).
ColonNet: A Hybrid Of DenseNet121 And U-NET Model For Detection And Segmentation Of GI Bleeding
Singh, Ayushman, Prakash, Sharad, Das, Aniket, Kushwaha, Nidhi
This study presents an integrated deep learning model for automatic detection and classification of Gastrointestinal bleeding in the frames extracted from Wireless Capsule Endoscopy (WCE) videos. The dataset has been released as part of Auto-WCBleedGen Challenge Version V2 hosted by the MISAHUB team. Our model attained the highest performance among 75 teams that took part in this competition. It aims to efficiently utilizes CNN based model i.e. DenseNet and UNet to detect and segment bleeding and non-bleeding areas in the real-world complex dataset. The model achieves an impressive overall accuracy of 80% which would surely help a skilled doctor to carry out further diagnostics.
T-REG: Preference Optimization with Token-Level Reward Regularization
Zhou, Wenxuan, Zhang, Shujian, Zhao, Lingxiao, Meng, Tao
Reinforcement learning from human feedback (RLHF) has been crucial in aligning large language models (LLMs) with human values. Traditionally, RLHF involves generating responses to a query and using a reward model to assign a reward to the entire response. However, this approach faces challenges due to its reliance on a single, sparse reward, which makes it challenging for the model to identify which parts of the sequence contribute most significantly to the final reward. Recent methods have attempted to address this limitation by introducing token-level rewards. However, these methods often rely on either a trained credit assignment model or AI annotators, raising concerns about the quality and reliability of the rewards. In this paper, we propose token-level reward regularization (T-REG), a novel approach that leverages both sequence-level and token-level rewards for preference optimization. Harnessing the self-refinement capabilities of LLMs, our method uses contrastive prompting to enable LLMs to self-generate token-level rewards. These self-generated rewards then act as reward regularization, guiding the model to more effectively distribute sequence-level rewards across tokens. This facilitates better token-level credit assignment and enhances alignment performance. Experiments on the instruction following benchmarks, including Alpaca Eval 2 and Arena-Hard, show that our method consistently outperforms baseline methods by up to 3.8% and 4.4%, respectively. We will release the code and models at https://github.com/wzhouad/T-REG.
Can LLMs faithfully generate their layperson-understandable 'self'?: A Case Study in High-Stakes Domains
Das, Arion, Mishra, Asutosh, Patel, Amitesh, De, Soumilya, Gurucharan, V., Ghosh, Kripabandhu
Large Language Models (LLMs) have significantly impacted nearly every domain of human knowledge. However, the explainability of these models esp. to laypersons, which are crucial for instilling trust, have been examined through various skeptical lenses. In this paper, we introduce a novel notion of LLM explainability to laypersons, termed $\textit{ReQuesting}$, across three high-priority application domains -- law, health and finance, using multiple state-of-the-art LLMs. The proposed notion exhibits faithful generation of explainable layman-understandable algorithms on multiple tasks through high degree of reproducibility. Furthermore, we observe a notable alignment of the explainable algorithms with intrinsic reasoning of the LLMs.
Token-level Proximal Policy Optimization for Query Generation
Ouyang, Yichen, Wang, Lu, Yang, Fangkai, Zhao, Pu, Huang, Chenghua, Liu, Jianfeng, Pang, Bochen, Yang, Yaming, Zhan, Yuefeng, Sun, Hao, Lin, Qingwei, Rajmohan, Saravan, Deng, Weiwei, Zhang, Dongmei, Sun, Feng, Zhang, Qi
Query generation is a critical task for web search engines (e.g. Google, Bing) and recommendation systems. Recently, state-of-the-art query generation methods leverage Large Language Models (LLMs) for their strong capabilities in context understanding and text generation. However, they still face challenges in generating high-quality queries in terms of inferring user intent based on their web search interaction history. In this paper, we propose Token-level Proximal Policy Optimization (TPPO), a noval approach designed to empower LLMs perform better in query generation through fine-tuning. TPPO is based on the Reinforcement Learning from AI Feedback (RLAIF) paradigm, consisting of a token-level reward model and a token-level proximal policy optimization module to address the sparse reward challenge in traditional RLAIF frameworks. To evaluate the effectiveness and robustness of TPPO, we conducted experiments on both open-source dataset and an industrial dataset that was collected from a globally-used search engine. The experimental results demonstrate that TPPO significantly improves the performance of query generation for LLMs and outperforms its existing competitors.
2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision
Li, Shilong, He, Yancheng, Huang, Hui, Bu, Xingyuan, Liu, Jiaheng, Guo, Hangyu, Wang, Weixun, Gu, Jihao, Su, Wenbo, Zheng, Bo
Recent advancements in Direct Preference Optimization (DPO) have significantly enhanced the alignment of Large Language Models (LLMs) with human preferences, owing to its simplicity and effectiveness. However, existing methods typically optimize a scalar score or ranking reward, thereby overlooking the multi-dimensional nature of human preferences. In this work, we propose to extend the preference of DPO to two dimensions: segments and aspects. We first introduce a 2D supervision dataset called HelpSteer-2D. For the segment dimension, we divide the response into sentences and assign scores to each segment. For the aspect dimension, we meticulously design several criteria covering the response quality rubrics. With the 2-dimensional signals as feedback, we develop a 2D-DPO framework, decomposing the overall objective into multi-segment and multi-aspect objectives. Extensive experiments on popular benchmarks demonstrate that 2D-DPO performs better than methods that optimize for scalar or 1-dimensional preferences.
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
Liu, Yantao, Yao, Zijun, Min, Rui, Cao, Yixin, Hou, Lei, Li, Juanzi
Reward models are critical in techniques like Reinforcement Learning from Human Feedback (RLHF) and Inference Scaling Laws, where they guide language model alignment and select optimal responses. Despite their importance, existing reward model benchmarks often evaluate models by asking them to distinguish between responses generated by models of varying power. However, this approach fails to assess reward models on subtle but critical content changes and variations in style, resulting in a low correlation with policy model performance. Reward models play a pivotal role in both techniques. In RLHF, reward models serve as proxies for human values, providing feedback on generated text, which helps align language models (policy models) during training (Ouyang et al., 2022; Dong et al., 2024). In Inference Scaling Law, reward models are used to select the best response from a set of candidates based on predicted rewards (Wu et al., 2024; Snell et al., 2024). Despite their significance, benchmarks for reward models remain under-explored compared to the rapid advancements in aligned language model evaluation, namely the policy model (Hendrycks et al., 2020; bench authors, 2023; Chiang et al., 2024; Hendrycks et al., 2021). To conduct a faithful and systematical evaluation, an ideal benchmark for reward models should adhere to three key principles: 1) Assessing Reward Models' Sensitivity to Subtle Changes: A faithful reward model should sensitively distinguish subtle changes and assign a higher reward to the correct response. For example, in Table 1, Response 1 and Response 2 differ by only one word but express completely different meanings, requiring the reward model to focus on content quality. For example, in Table 1, Response 3 is factually incorrect but longer than Response 1, which could mislead the reward model into assigning a higher reward to Response 3. 3) Correlating with Policy Models: A good reward model benchmark should highly correlate with the performance of the aligned language model (the policy model). This would make it a reliable proxy for selecting the best reward model for alignment. Recent efforts (Lambert et al., 2024; Zhu et al., 2023; Jiang et al., 2023) have made progress by constructing benchmarks from existing preference datasets.