
Collaborating Authors

 Shahid, Simra


Scideator: Human-LLM Scientific Idea Generation Grounded in Research-Paper Facet Recombination

arXiv.org Artificial Intelligence

A good idea should be relevant to the scientist's interests and novel within the scientific community. Research papers are a major source of inspiration for relevant and novel ideas, as they expose scientists to relevant concepts to recombine into new ideas [4, 21, 36]. However, generating relevant and novel scientific ideas by recombining concepts from research papers is difficult for multiple reasons. For one, scientists must wade through an ever-expanding scientific literature to find relevant concepts [2, 19]. Moreover, the phenomenon of fixation biases scientists against considering more diverse concepts and concept recombinations for their research; instead, they are predisposed to thinking about a problem in familiar terms, which hinders the stimulation of novel ideas [11, 37]. Even if a scientist manages to identify interesting concept recombinations that form potential research ideas, assessing the ideas' novelty against the existing literature is a cumbersome yet critical task. Building a fully or semi-automated ideation system has been an ambition of researchers for decades, and Scideator builds on strong prior work from many other researchers, filling a unique niche. We extend a line of work that presents systems for finding analogies between research papers [4, 21, 36], adopting their facet-based framework but using modern large language model (LLM) methods to identify relevant facets and perform facet recombinations. We are also inspired by recent work showing that LLMs have promise to assist ideation in domains outside science, helping people to generate more ideas [6] and more diverse ideas [27, 40].
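The facet recombination step lends itself to a short illustration. Below is a minimal sketch assuming an OpenAI-style chat API; the facet names (purpose, mechanism, evaluation), prompt wording, and model name are illustrative assumptions, not Scideator's actual implementation.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FACETS = ("purpose", "mechanism", "evaluation")  # assumed facet types

def extract_facets(abstract: str) -> dict:
    """Ask the LLM to extract one facet of each type from a paper abstract."""
    prompt = (
        "From the abstract below, return JSON with keys "
        f"{list(FACETS)}: the problem addressed (purpose), how the approach "
        "works (mechanism), and how it is validated (evaluation).\n\n"
        + abstract
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(reply.choices[0].message.content)

def recombine(purpose_paper: dict, mechanism_paper: dict) -> str:
    """Propose an idea by pairing one paper's purpose with another's mechanism."""
    prompt = (
        f"Purpose: {purpose_paper['purpose']}\n"
        f"Mechanism borrowed from another paper: {mechanism_paper['mechanism']}\n"
        "Propose a concise research idea that pursues this purpose using the "
        "borrowed mechanism, and state what would be novel about it."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content
```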


Towards Operationalizing Right to Data Protection

arXiv.org Artificial Intelligence

The recent success of large language models (LLMs) has exposed the vulnerability of public data: these models are trained on data scraped at scale from public forums and news articles [Touvron et al., 2023] without consent, and the collection of this data remains largely unregulated. In response, governments worldwide have passed several regulatory frameworks, such as the GDPR [Voigt and Von dem Bussche, 2017] in the EU, the Personal Information Protection and Electronic Documents Act in Canada [PIPEDA], the Data Protection Act in the UK [DPA], the Personal Data Protection Commission (PDPC) [Commission et al., 2022] in Singapore, and the EU AI Act [Neuwirth, 2022], to safeguard algorithmic decisions and data usage practices. These legislative frameworks emphasize individuals' rights over how their data is used, even in public contexts. They are not limited to private or sensitive data but also encompass the ethical use of publicly accessible information, especially in contexts where such data is used for profiling, decision-making, or large-scale commercial gain. Despite these regulatory efforts, state-of-the-art LLMs are increasingly used in real-world applications to exploit personal data, predicting political affiliations [Rozado, 2024, Hernandes, 2024], societal biases [Liang et al., 2021, Dong et al., 2024], and sensitive information about individuals [Wan et al., 2023b, Salewski et al., 2024, Suman et al., 2021], highlighting significant gaps between research and regulatory frameworks. In this work, we make a first attempt to operationalize one principle of the right to data protection, namely that people should have control over their online data, as an algorithmic implementation in practice, and propose R


Thinking Fair and Slow: On the Efficacy of Structured Prompts for Debiasing Language Models

arXiv.org Artificial Intelligence

Existing debiasing techniques are typically training-based or require access to the model's internals and output distributions, so they are inaccessible to end-users looking to adapt LLM outputs to their particular needs. In this study, we examine whether structured prompting techniques can offer opportunities for fair text generation. We evaluate a comprehensive, end-user-focused, iterative debiasing framework that applies System 2 thinking to prompt design, inducing logical, reflective, and critical text generation, with single-step, multi-step, instruction-based, and role-based variants. By systematically evaluating a range of LLMs across multiple datasets and prompting strategies, we show that the more complex System 2-based Implicative Prompts significantly improve over other techniques, demonstrating lower mean bias in the outputs with competitive performance on downstream tasks. Our work offers research directions for the design of end-user-focused evaluative frameworks and for their potential in everyday LLM use.
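As an illustration of the multi-step variant, here is a minimal sketch assuming an OpenAI-style chat API; the reflective step wording is an invented approximation of a System 2 prompt chain, not the paper's exact Implicative Prompt.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Invented approximation of a reflective, multi-step System 2 prompt chain.
SYSTEM2_STEPS = [
    "Draft a continuation of the text above.",
    "Reflect on your draft: list any stereotypes or unwarranted assumptions "
    "about demographic groups that it implies.",
    "Revise the draft to remove those implications while keeping it fluent "
    "and on-topic. Output only the revised text.",
]

def debias_generate(user_text: str) -> str:
    """Run the reflective steps in one conversation so each step sees the last."""
    messages = [{"role": "user", "content": user_text}]
    for step in SYSTEM2_STEPS:
        messages.append({"role": "user", "content": step})
        reply = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model choice
            messages=messages,
        )
        messages.append(
            {"role": "assistant", "content": reply.choices[0].message.content}
        )
    return messages[-1]["content"]
```

Chaining the steps in one conversation, rather than issuing three independent prompts, lets the revision step see both the draft and the model's own critique of it.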


All Should Be Equal in the Eyes of Language Models: Counterfactually Aware Fair Text Generation

arXiv.org Artificial Intelligence

Fairness in Language Models (LMs) remains a longstanding challenge, given the inherent biases in training data that models can perpetuate and that affect downstream tasks. Recent methods either employ expensive retraining or attempt debiasing during inference by constraining model outputs to contrast with a reference set of biased templates or exemplars. Either way, they don't address the primary goal of fairness: maintaining equitability across different demographic groups. In this work, we posit that an LM can generate unbiased output for one demographic in a given context only by being aware of its outputs for other demographics in the same context. To this end, we propose Counterfactually Aware Fair InferencE (CAFIE), a framework that dynamically compares the model's understanding of diverse demographics to generate more equitable sentences. We conduct an extensive empirical evaluation using base LMs of varying sizes across three diverse datasets and find that CAFIE outperforms strong baselines, producing fairer text and striking the best balance between fairness and language modeling capability.
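A simplified version of this idea can be sketched with Hugging Face transformers: build a counterfactual context for each demographic group, compute the next-token distribution for each, and decode from their average. Averaging is a hedged simplification standing in for CAFIE's actual comparison mechanism; the template, groups, and use of gpt2 below are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # small stand-in for a base LM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

GROUPS = ["man", "woman"]  # illustrative counterfactual demographic terms

@torch.no_grad()
def equitable_next_token(template: str) -> str:
    """Decode the next token from the average distribution over counterfactuals.

    `template` must contain a {group} slot, e.g. "The {group} worked as a".
    """
    dists = []
    for group in GROUPS:
        ids = tok(template.format(group=group), return_tensors="pt").input_ids
        logits = model(ids).logits[0, -1]            # next-token logits
        dists.append(torch.softmax(logits, dim=-1))
    averaged = torch.stack(dists).mean(dim=0)        # equalize across groups
    return tok.decode(averaged.argmax())

print(equitable_next_token("The {group} worked as a"))
```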


HyHTM: Hyperbolic Geometry based Hierarchical Topic Models

arXiv.org Artificial Intelligence

Hierarchical Topic Models (HTMs) are useful for discovering topic hierarchies in a collection of documents. However, traditional HTMs often produce hierarchies where lower-level topics are unrelated to, and not specific enough for, their higher-level topics. Additionally, these methods can be computationally expensive. We present HyHTM, a Hyperbolic geometry-based Hierarchical Topic Model, which addresses these limitations by incorporating hierarchical information from hyperbolic geometry to explicitly model hierarchies in topic models. Experimental results against four baselines show that HyHTM better attends to parent-child relationships among topics. HyHTM produces coherent topic hierarchies that specialise in granularity from generic higher-level topics to specific lower-level topics. Further, our model is significantly faster and leaves a much smaller memory footprint than our best-performing baseline. We have made the source code for our algorithm publicly accessible.
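The key geometric ingredient can be illustrated with the Poincaré-ball distance, which grows rapidly as points approach the boundary of the ball. This is a minimal sketch with random placeholder embeddings, not HyHTM's trained term representations.

```python
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance in the Poincare ball; inputs must have norm < 1."""
    sq_diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_diff / denom))

rng = np.random.default_rng(0)
generic = 0.05 * rng.standard_normal(10)              # near the origin
specific = 0.9 * generic / np.linalg.norm(generic)    # near the boundary

print(poincare_distance(generic, specific))
```

Because distances blow up near the boundary while staying small near the origin, a generic parent topic placed near the origin remains comparatively close to many specific child topics at once, which is the property hierarchical models in hyperbolic space exploit.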