 examining


Does Reasoning Emerge? Examining the Probabilities of Causation in Large Language Models

Neural Information Processing Systems

Recent advances in AI have been significantly driven by the capabilities of large language models (LLMs) to solve complex problems in ways that resemble human thinking. However, there is an ongoing debate about the extent to which LLMs are capable of actual reasoning. Central to this debate are two key probabilistic concepts that are essential for connecting causes to their effects: the probability of necessity (PN) and the probability of sufficiency (PS). This paper introduces a framework that is both theoretical and practical, aimed at assessing how effectively LLMs are able to replicate real-world reasoning mechanisms using these probabilistic measures. By viewing LLMs as abstract machines that process information through a natural language interface, we examine the conditions under which it is possible to compute suitable approximations of PN and PS. Our research marks an important step towards gaining a deeper understanding of when LLMs are capable of reasoning, as illustrated by a series of math examples.
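
For reference, PN and PS are usually defined counterfactually. The abstract does not spell the definitions out, so the notation below is an assumption borrowed from the standard causal-inference literature: X is a binary cause with values x and x', Y is a binary effect with values y and y', and Y_x denotes the value Y would take if X were set to x.

```latex
\begin{align}
\mathrm{PN} &= P\left(Y_{x'} = y' \mid X = x,\; Y = y\right) \\
\mathrm{PS} &= P\left(Y_{x} = y \mid X = x',\; Y = y'\right)
\end{align}
```

Read this way, PN is the probability that the effect would not have occurred without the cause, given that both were in fact observed. PS is the probability that introducing the cause would produce the effect in a situation where both were absent.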


The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks

Cuadron, Alejandro, Li, Dacheng, Ma, Wenjie, Wang, Xingyao, Wang, Yichuan, Zhuang, Siyuan, Liu, Shu, Schroeder, Luis Gaspar, Xia, Tian, Mao, Huanzhi, Thumiger, Nicholas, Desai, Aditya, Stoica, Ion, Klimovic, Ana, Neubig, Graham, Gonzalez, Joseph E.

arXiv.org Artificial Intelligence

Large Reasoning Models (LRMs) represent a breakthrough in AI problem-solving capabilities, but their effectiveness in interactive environments can be limited. This paper introduces and analyzes overthinking in LRMs, a phenomenon where models favor extended internal reasoning chains over environmental interaction. Through experiments on software engineering tasks using SWE Bench Verified, we observe three recurring patterns: Analysis Paralysis, Rogue Actions, and Premature Disengagement. We propose a framework to study these behaviors, which correlates with human expert assessments, and analyze 4018 trajectories. We observe that higher overthinking scores correlate with decreased performance, with reasoning models exhibiting stronger tendencies toward overthinking compared to non-reasoning models. Our analysis reveals that simple efforts to mitigate overthinking in agentic environments, such as selecting the solution with the lower overthinking score, can improve model performance by almost 30% while reducing computational costs by 43%. These results suggest that mitigating overthinking has strong practical implications. We suggest that overthinking tendencies could be mitigated by leveraging native function-calling capabilities and selective reinforcement learning. We also open-source our evaluation framework and dataset to facilitate research in this direction at https://github.com/AlexCuadron/Overthinking.
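
The selection strategy mentioned in the abstract (keep the solution with the lower overthinking score) can be sketched as follows. This is a minimal illustration and not the authors' framework: the `Candidate` class, its fields, and the convention that a higher score means more overthinking are all assumptions made here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One sampled solution for a task (hypothetical structure)."""
    patch: str
    overthinking_score: float  # assumed convention: higher = more overthinking


def pick_least_overthinking(candidates):
    """Return the candidate with the lowest overthinking score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return min(candidates, key=lambda c: c.overthinking_score)


# Usage: two sampled solutions for the same task, keep the lower-scoring one.
samples = [Candidate("patch-a", 7.5), Candidate("patch-b", 2.0)]
best = pick_least_overthinking(samples)
```

How the scores themselves are produced is outside this sketch; the released evaluation framework at the URL above is the place to look for that.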


From Creation to Curriculum: Examining the role of generative AI in Arts Universities

Sims, Atticus

arXiv.org Artificial Intelligence

The age of Artificial Intelligence (AI) is marked by its transformative "generative" capabilities, distinguishing it from prior iterations. This burgeoning characteristic of AI has enabled it to produce new and original content, inherently showcasing its creative prowess. This shift challenges and requires a recalibration in the realm of arts education, urging a departure from established pedagogies centered on human-driven image creation. The paper meticulously addresses the integration of AI tools, with a spotlight on Stable Diffusion (SD), into university arts curricula. Drawing from practical insights gathered from workshops conducted in July 2023, which culminated in an exhibition of AI-driven artworks, the paper aims to provide a roadmap for seamlessly infusing these tools into academic settings. Given their recent emergence, the paper delves into a comprehensive overview of such tools, emphasizing the intricate dance between artists, developers, and researchers in the open-source AI art world. This discourse extends to the challenges and imperatives faced by educational institutions. It presents a compelling case for the swift adoption of these avant-garde tools, underscoring the paramount importance of equipping students with the competencies required to thrive in an AI-augmented artistic landscape.


The Brittleness of AI-Generated Image Watermarking Techniques: Examining Their Robustness Against Visual Paraphrasing Attacks

Barman, Niyar R, Sharma, Krish, Aziz, Ashhar, Bajpai, Shashwat, Biswas, Shwetangshu, Sharma, Vasu, Jain, Vinija, Chadha, Aman, Sheth, Amit, Das, Amitava

arXiv.org Artificial Intelligence

The rapid advancement of text-to-image generation systems, exemplified by models like Stable Diffusion, Midjourney, Imagen, and DALL-E, has heightened concerns about their potential misuse. In response, companies like Meta and Google have intensified their efforts to implement watermarking techniques on AI-generated images to curb the circulation of potentially misleading visuals. However, in this paper, we argue that current image watermarking methods are fragile and susceptible to being circumvented through visual paraphrase attacks. The proposed visual paraphraser operates in two steps. First, it generates a caption for the given image using KOSMOS-2, one of the latest state-of-the-art image captioning systems. Second, it passes both the original image and the generated caption to an image-to-image diffusion system. During the denoising step of the diffusion pipeline, the system generates a visually similar image that is guided by the text caption. The resulting image is a visual paraphrase and is free of any watermarks. Our empirical findings demonstrate that visual paraphrase attacks can effectively remove watermarks from images. This paper provides a critical assessment, empirically revealing the vulnerability of existing watermarking techniques to visual paraphrase attacks. While we do not propose solutions to this issue, this paper serves as a call to action for the scientific community to prioritize the development of more robust watermarking techniques. Our first-of-its-kind visual paraphrase dataset and accompanying code are publicly available.


Examining the Effect of Implementation Factors on Deep Learning Reproducibility

Coakley, Kevin, Kirkpatrick, Christine R., Gundersen, Odd Erik

arXiv.org Artificial Intelligence

Reproducing published deep learning papers to validate their conclusions can be difficult due to sources of irreproducibility. We investigate the impact that implementation factors have on the results and how they affect reproducibility of deep learning studies. Three deep learning experiments were run five times each on 13 different hardware environments and four different software environments. The analysis of the 780 combined results showed that there was a greater than 6% accuracy range on the same deterministic examples introduced from hardware or software environment variations alone. To account for these implementation factors, researchers should run their experiments multiple times in different hardware and software environments to verify their conclusions are not affected.
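
The figure of 780 results follows from the study design: 3 experiments x 5 runs x 13 hardware environments x 4 software environments. The sketch below checks that arithmetic and shows one way to compute an accuracy range per experiment across environments. The record layout and function names are assumptions for illustration, not the authors' analysis code.

```python
from collections import defaultdict


def expected_result_count(experiments=3, runs=5, hardware=13, software=4):
    """Number of results implied by the full factorial design."""
    return experiments * runs * hardware * software


def accuracy_range_by_experiment(results):
    """Compute max - min accuracy per experiment.

    `results` is an iterable of (experiment, hardware, software, run, accuracy)
    tuples; this layout is hypothetical.
    """
    by_experiment = defaultdict(list)
    for experiment, _hardware, _software, _run, accuracy in results:
        by_experiment[experiment].append(accuracy)
    return {exp: max(accs) - min(accs) for exp, accs in by_experiment.items()}


# Usage with made-up numbers (not data from the paper).
toy_results = [
    ("exp1", "gpu-a", "sw-1", 0, 0.90),
    ("exp1", "gpu-b", "sw-1", 0, 0.96),
    ("exp2", "gpu-a", "sw-2", 0, 0.80),
    ("exp2", "gpu-b", "sw-2", 0, 0.80),
]
ranges = accuracy_range_by_experiment(toy_results)
```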


Examining the Values Reflected by Children during AI Problem Formulation

Dwivedi, Utkarsh, Elsayed-ali, Salma, Bonsignore, Elizabeth, Kacorri, Hernisa

arXiv.org Artificial Intelligence

Understanding how children design and what they value in AI interfaces that allow them to explicitly train their models, such as teachable machines, could help increase such activities' impact and guide the design of future technologies. In a co-design session using a modified storyboard, a team of 5 children (aged 7-13 years) and adult co-designers engaged in AI problem formulation activities where they imagined their own teachable machines. Our findings, leveraging an established psychological value framework (the Rokeach Value Survey), illuminate how children conceptualize and embed their values in AI systems that they themselves devise to support their everyday activities. Specifically, we find that children's proposed ideas require advanced system intelligence, e.g. emotion detection and understanding the social relationships of a user. The underlying models could be trained under multiple modalities and any errors would be fixed by adding more data or by anticipating negative examples. Children's ideas showed they cared about family and expected machines to understand their social context before making decisions.


Examining the Limitations of Computational Rumor Detection Models Trained on Static Datasets

Mu, Yida, Song, Xingyi, Bontcheva, Kalina, Aletras, Nikolaos

arXiv.org Artificial Intelligence

A crucial aspect of a rumor detection model is its ability to generalize, particularly its ability to detect emerging, previously unknown rumors. Past research has indicated that content-based (i.e., using solely source posts as input) rumor detection models tend to perform less effectively on unseen rumors. At the same time, the potential of context-based models remains largely untapped. The main contribution of this paper is in the in-depth evaluation of the performance gap between content and context-based models specifically on detecting new, unseen rumors. Our empirical findings demonstrate that context-based models are still overly dependent on the information derived from the rumors' source post and tend to overlook the significant role that contextual information can play. We also study the effect of data split strategies on classifier performance. Based on our experimental results, the paper also offers practical suggestions on how to minimize the effects of temporal concept drift in static datasets during the training of rumor detection methods.
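
One split strategy relevant to temporal concept drift is a chronological split, where the model trains on older posts and is tested on newer ones, so test rumors are unseen at training time. The abstract does not say which strategies were compared, so the sketch below is an assumption for illustration; the `timestamp` key and the function name are hypothetical.

```python
def chronological_split(posts, train_fraction=0.8):
    """Split posts so every training post is no newer than any test post.

    `posts` is a list of dicts with a sortable "timestamp" key (assumed layout).
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    ordered = sorted(posts, key=lambda p: p["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Usage with toy data: timestamps are plain integers here.
toy_posts = [{"id": i, "timestamp": t} for i, t in enumerate([40, 10, 30, 20])]
train, test = chronological_split(toy_posts, train_fraction=0.5)
```

A random split, by contrast, can place posts about the same rumor on both sides of the boundary, which may overstate how well a model handles new rumors.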


Examining the Influence of Varied Levels of Domain Knowledge Base Inclusion in GPT-based Intelligent Tutors

Castleman, Blake, Turkcan, Mehmet Kerem

arXiv.org Artificial Intelligence

Recent advancements in large language models (LLMs) have facilitated the development of chatbots with sophisticated conversational capabilities. However, LLMs frequently give inaccurate responses to queries, hindering applications in educational settings. In this paper, we investigate the effectiveness of integrating a knowledge base (KB) with LLM intelligent tutors to increase response reliability. To achieve this, we design a scalable KB that affords educational supervisors seamless integration of lesson curricula, which is automatically processed by the intelligent tutoring system. We then detail an evaluation, where student participants were presented with questions about the artificial intelligence curriculum to respond to. GPT-4 intelligent tutors with varying hierarchies of KB access and human domain experts then assessed these responses. Lastly, students cross-examined the intelligent tutors' responses against the domain experts' and ranked their various pedagogical abilities. Results suggest that, although these intelligent tutors still demonstrate a lower accuracy compared to domain experts, the accuracy of the intelligent tutors increases when access to a KB is granted. We also observe that the intelligent tutors with KB access exhibit better pedagogical abilities to speak like a teacher and understand students than those of domain experts, while their ability to help students still lags behind that of domain experts.


Examining the Effect of Pre-training on Time Series Classification

Pu, Jiashu, Zhao, Shiwei, Cheng, Ling, Chang, Yongzhu, Wu, Runze, Lv, Tangjie, Zhang, Rongsheng

arXiv.org Artificial Intelligence

Although the pre-training followed by fine-tuning paradigm is used extensively in many fields, there is still some controversy surrounding the impact of pre-training on the fine-tuning process. Currently, experimental findings based on text and image data lack consensus. To delve deeper into the unsupervised pre-training followed by fine-tuning paradigm, we have extended previous research to a new modality: time series. In this study, we conducted a thorough examination of 150 classification datasets derived from the Univariate Time Series (UTS) and Multivariate Time Series (MTS) benchmarks. Our analysis reveals several key conclusions. (i) Pre-training can only help improve the optimization process for models that fit the data poorly, rather than those that fit the data well. (ii) Pre-training does not exhibit the effect of regularization when given sufficient training time. (iii) Pre-training can only speed up convergence if the model has sufficient ability to fit the data. (iv) Adding more pre-training data does not improve generalization, but it can strengthen the advantage of pre-training on the original data volume, such as faster convergence. (v) While both the pre-training task and the model structure determine the effectiveness of the paradigm on a given dataset, the model structure plays a more significant role.


Examining the Effectiveness of Chatbots in Gathering Family History Information in Comparison to the Standard In-Person Interview-Based Approach

Drumm, Kieron, Tran, Vincent

arXiv.org Artificial Intelligence

One of the most common things that a genealogist is tasked with is the gathering of a person's initial family history, normally via in-person interviews or with the use of a platform such as ancestry.com, as this can provide a strong foundation upon which a genealogist may build. However, the ability to conduct these interviews can often be hindered by both geographical constraints and the technical proficiency of the interviewee, as the interviewee in these types of interviews is most often an elderly person with a lower than average level of technical proficiency. With this in mind, this study presents what we believe, based on prior research, to be the first chatbot geared entirely towards the gathering of family histories, and explores the viability of utilising such a chatbot by comparing the performance and usability of such a method with the aforementioned alternatives. With a chatbot-based approach, we show that, though the average time taken to conduct an interview may be longer than if the user had used ancestry.com or participated in an in-person interview, the number of mistakes made and the level of confusion from the user regarding the UI and process required is lower than the other two methods. Note that the final metric regarding the user's confusion is not applicable for the in-person interview sessions due to its lack of a UI. With refinement, we believe this use of a chatbot could be a valuable tool for genealogists, especially when dealing with interviewees who are based in other countries where it is not possible to conduct an in-person interview.