

 chatgpt4


Enhancing Visual Inspection Capability of Multi-Modal Large Language Models on Medical Time Series with Supportive Conformalized and Interpretable Small Specialized Models

Li, Huayu, Chen, Xiwen, Zhang, Ci, Quan, Stuart F., Killgore, William D. S., Wung, Shu-Fen, Chen, Chen X., Yuan, Geng, Lu, Jin, Li, Ao

arXiv.org Artificial Intelligence

Large language models (LLMs) exhibit remarkable capabilities in visual inspection of medical time-series data, achieving proficiency comparable to human clinicians. However, their broad scope limits domain-specific precision, and proprietary weights hinder fine-tuning for specialized datasets. In contrast, small specialized models (SSMs) excel in targeted tasks but lack the contextual reasoning required for complex clinical decision-making. To address these challenges, we propose ConMIL (Conformalized Multiple Instance Learning), a decision-support SSM that integrates seamlessly with LLMs. By using Multiple Instance Learning (MIL) to identify clinically significant signal segments and conformal prediction for calibrated set-valued outputs, ConMIL enhances LLMs' interpretative capabilities for medical time-series analysis. Experimental results demonstrate that ConMIL significantly improves the performance of state-of-the-art LLMs, such as ChatGPT4.0 and Qwen2-VL-7B. Specifically, ConMIL-supported Qwen2-VL-7B achieves 94.92% and 96.82% precision for confident samples in arrhythmia detection and sleep staging, compared to standalone LLM accuracy of 46.13% and 13.16%. These findings highlight the potential of ConMIL to bridge task-specific precision and broader contextual reasoning, enabling more reliable and interpretable AI-driven clinical decision support.
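The "calibrated set-valued outputs" the abstract mentions refer to conformal prediction, which returns a set of candidate classes with a coverage guarantee rather than a single label. Below is a minimal sketch of split conformal prediction, not the paper's actual ConMIL pipeline; the function name, the toy data, and the 0.1 miscoverage rate are illustrative assumptions.

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Return prediction sets that cover the true class with prob >= 1 - alpha."""
    # Nonconformity score: 1 minus the probability the model gave the true class.
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Calibration quantile with the standard finite-sample correction.
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    # A class enters the prediction set when its score is below the threshold;
    # confident samples yield small (often singleton) sets.
    return [np.where(1.0 - p <= q)[0].tolist() for p in test_probs]

# Toy usage: 3 classes, calibration data whose true class is always class 0.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet([5, 1, 1], size=100)
cal_labels = np.zeros(100, dtype=int)
sets = conformal_sets(cal_probs, cal_labels, np.array([[0.9, 0.05, 0.05]]))
print(sets)  # a confident test sample gets a small prediction set
```

A downstream LLM can then treat singleton sets as "confident" specialist advice and larger sets as uncertainty worth flagging, which is the division of labor the abstract describes.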


Equity in the Use of ChatGPT for the Classroom: A Comparison of the Accuracy and Precision of ChatGPT 3.5 vs. ChatGPT4 with Respect to Statistics and Data Science Exams

McGee, Monnie, Sadler, Bivin

arXiv.org Artificial Intelligence

The association of social mobility with a college education has been studied since the early 1950s [1]. Although there are some indications that a college education is not as effective as it once was in helping graduates climb the social ladder [2], it is still the most reliable way of doing so. US News & World Report updated its rankings in 2023 to include social mobility [3], and many institutions of higher education are paying more attention to recruitment of first-generation college students and talented students from disadvantaged backgrounds. With the inclusion of such students in the typical college class come some important considerations. For example, a student from difficult financial circumstances, with an academic background to match the profile of any student at an elite institution, will have more difficulty paying for textbooks, a laptop, a smartphone, and other items that are almost essential to current college life [2]. As of November 2022, one such item available to students from advantaged backgrounds but not to those from lower income brackets is ChatGPT4 [4]. It currently costs $20 per month for a subscription and has been called a "significant leap forward" compared to ChatGPT3.5 [5], which is free [6]. While use of generative AI is prohibited in some college classrooms, this is hard to police, and many students use it regardless of classroom restrictions [7]. When generative AI is allowed, there is a wide array of platforms from which students can choose.


How Good is ChatGPT in Giving Adaptive Guidance Using Knowledge Graphs in E-Learning Environments?

Ocheja, Patrick, Flanagan, Brendan, Dai, Yiling, Ogata, Hiroaki

arXiv.org Artificial Intelligence

E-learning environments are increasingly harnessing large language models (LLMs) like GPT-3.5 and GPT-4 for tailored educational support. This study introduces an approach that integrates dynamic knowledge graphs with LLMs to offer nuanced student assistance. By evaluating past and ongoing student interactions, the system identifies and appends the most salient learning context to prompts directed at the LLM. Central to this method is the knowledge graph's role in assessing a student's comprehension of topic prerequisites. Depending on the categorized understanding (good, average, or poor), the LLM adjusts its guidance, offering advanced assistance, foundational reviews, or in-depth prerequisite explanations, respectively. Preliminary findings suggest students could benefit from this tiered support, achieving enhanced comprehension and improved task outcomes. However, several issues related to potential errors arising from LLMs were identified, which can potentially mislead students. This highlights the need for human intervention to mitigate these risks. This research aims to advance AI-driven personalized learning while acknowledging the limitations and potential pitfalls, thus guiding future research in technology and data-driven education.
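The tiered-guidance idea above can be sketched as a lookup against a prerequisite graph followed by a prompt adjustment. This is a toy illustration, not the paper's system: the graph contents, the mastery thresholds, and the prompt wording are all assumptions made for the example.

```python
# Toy knowledge graph: each topic maps to its prerequisite topics.
PREREQS = {"recursion": ["functions", "call stack"]}

def understanding_level(scores, topic):
    """Map average prerequisite mastery (0..1) to good/average/poor."""
    prereqs = PREREQS[topic]
    avg = sum(scores[p] for p in prereqs) / len(prereqs)
    return "good" if avg >= 0.8 else "average" if avg >= 0.5 else "poor"

# Guidance tiers matching the abstract: advanced assistance, foundational
# review, or in-depth prerequisite explanation.
GUIDANCE = {
    "good": "Offer advanced assistance on {topic}.",
    "average": "Review the foundations of {topic} first.",
    "poor": "Explain the prerequisites of {topic} in depth.",
}

def build_prompt(scores, topic):
    """Append the tier-appropriate instruction to the prompt sent to the LLM."""
    return GUIDANCE[understanding_level(scores, topic)].format(topic=topic)

print(build_prompt({"functions": 0.9, "call stack": 0.4}, "recursion"))
# prints "Review the foundations of recursion first."
```

The point of the design is that the LLM never has to infer the student's state from scratch; the knowledge graph supplies it as explicit context in the prompt.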


Rethinking Scale: The Efficacy of Fine-Tuned Open-Source LLMs in Large-Scale Reproducible Social Science Research

Carammia, Marcello, Iacus, Stefano Maria, Porro, Giuseppe

arXiv.org Machine Learning

Large Language Models (LLMs) are distinguished by their architecture, which dictates their parameter size and performance capabilities. Social scientists have increasingly adopted LLMs for text classification tasks, which are difficult to scale with human coders. While very large, closed-source models often deliver superior performance, their use presents significant risks. These include lack of transparency, potential exposure of sensitive data, challenges to replicability, and dependence on proprietary systems. Additionally, their high costs make them impractical for large-scale research projects. In contrast, open-source models, although available in various sizes, may underperform compared to commercial alternatives if used without further fine-tuning. However, open-source models offer distinct advantages: they can be run locally (ensuring data privacy), fine-tuned for specific tasks, shared within the research community, and integrated into reproducible workflows. This study demonstrates that small, fine-tuned open-source LLMs can achieve equal or superior performance to models such as ChatGPT-4. We further explore the relationship between training set size and fine-tuning efficacy in open-source models. Finally, we propose a hybrid workflow that leverages the strengths of both open and closed models, offering a balanced approach to performance, transparency, and reproducibility.


How critically can an AI think? A framework for evaluating the quality of thinking of generative artificial intelligence

Zaphir, Luke, Lodge, Jason M., Lisec, Jacinta, McGrath, Dom, Khosravi, Hassan

arXiv.org Artificial Intelligence

Generative AI tools such as those built on large language models have created opportunities for innovative assessment design practices. Due to recent technological developments, there is a need to know the limits and capabilities of generative AI in terms of simulating cognitive skills. Assessing student critical thinking skills has been a feature of assessment since time immemorial, but the demands of digital assessment create unique challenges for equity, academic integrity and assessment authorship. Educators need a framework for determining their assessments' vulnerability to generative AI to inform assessment design practices. This paper presents a framework that explores the capabilities of the ChatGPT4 application, the current industry benchmark. It presents the Mapping of questions, AI vulnerability testing, Grading, Evaluation (MAGE) framework, which educators can use to methodically critique their assessments within their own disciplinary contexts. This critique provides specific and targeted indications of their questions' vulnerabilities in terms of critical thinking skills, which can then form the basis of assessment design for their tasks.


Spanish and LLM Benchmarks: is MMLU Lost in Translation?

Plaza, Irene, Melero, Nina, del Pozo, Cristina, Conde, Javier, Reviriego, Pedro, Mayor-Rocher, Marina, Grandury, María

arXiv.org Artificial Intelligence

The evaluation of Large Language Models (LLMs) is a key element in their continuous improvement process and many benchmarks have been developed to assess the performance of LLMs in different tasks and topics. As LLMs become adopted worldwide, evaluating them in languages other than English is increasingly important. However, most LLM benchmarks are simply translated using an automated tool and then run in the target language. This means that the results depend not only on the LLM performance in that language but also on the quality of the translation. In this paper, we consider the case of the well-known Massive Multitask Language Understanding (MMLU) benchmark. Selected categories of the benchmark are translated into Spanish using Azure Translator and ChatGPT4 and run on ChatGPT4. Next, the results are processed to identify the test items that produce different answers in Spanish and English. Those are then analyzed manually to understand if the automatic translation caused the change. The results show that a significant fraction of the failing items can be attributed to mistakes in the translation of the benchmark. These results make a strong case for improving benchmarks in languages other than English by at least revising the translations of the items and preferably by adapting the tests to the target language by experts.
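The filtering step described above (finding test items whose answers differ between the English and translated runs, so they can be reviewed manually) amounts to a simple diff over per-item answers. The sketch below is an illustration of that step only; the item ids and answer format are assumptions, not taken from the paper.

```python
def differing_items(answers_en, answers_es):
    """Return ids of items where the model's answer changed across languages.

    Items missing from the Spanish run are also flagged, since they cannot
    be confirmed to match.
    """
    return sorted(i for i in answers_en if answers_en[i] != answers_es.get(i))

# Toy usage: three MMLU-style items answered in English and in Spanish.
en = {"q1": "A", "q2": "C", "q3": "B"}
es = {"q1": "A", "q2": "D", "q3": "B"}
flagged = differing_items(en, es)
print(flagged)  # prints ['q2']
```

Only the flagged items then need manual inspection to decide whether the model or the translation caused the disagreement, which keeps the expensive human-review step small.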


AI is already more creative than YOU: ChatGPT outperformed humans in creative thinking experiment

Daily Mail - Science & tech

Artificial intelligence outperforms humans in strategy games, website design and data processing, but now the tech can add creative thinking to the list. AI chatbots surpassed humans when asked to devise alternative uses for everyday objects. Researchers said the AI used a skill known as divergent thinking, a thought process used to generate creative ideas by exploring many possible solutions. The study by the University of Stavanger in Norway involved 256 human volunteers and three AI chatbots (ChatGPT3, ChatGPT4, and Copy.Ai) that were asked to provide multiple uses for a rope, box, pencil and candle. When assessed with a type of divergent thinking exercise known as the alternate uses task, which asks a person to think of as many uses as possible for a simple object, the chatbots on average performed better than humans.


ChatGPT is a Remarkable Tool -- For Experts

Azaria, Amos, Azoulay, Rina, Reches, Shulamit

arXiv.org Artificial Intelligence

This paper investigates the capabilities of ChatGPT as an automated assistant in diverse domains, including scientific writing, mathematics, education, programming, and healthcare. We explore the potential of ChatGPT to enhance productivity, streamline problem-solving processes, and improve writing style. Furthermore, we highlight the potential risks associated with excessive reliance on ChatGPT in these fields. These limitations encompass factors like incorrect and fictitious responses, inaccuracies in code, limited logical reasoning abilities, overconfidence, and critical ethical concerns of copyrights and privacy violation. We outline areas and objectives where ChatGPT proves beneficial, applications where it should be used judiciously, and scenarios where its reliability may be limited. In light of observed limitations, and given that the tool's fundamental errors may pose a special challenge for non-experts, ChatGPT should be used with a strategic methodology. By drawing from comprehensive experimental studies, we offer methods and flow charts for effectively using ChatGPT. Our recommendations emphasize iterative interaction with ChatGPT and independent verification of its outputs. Considering the importance of utilizing ChatGPT judiciously and with expertise, we recommend its usage for experts who are well-versed in the respective domains.


How many uses for a FORK can you think of? ChatGPT comes up with more ideas than 90% of humans

Daily Mail - Science & tech

As a staple of every cutlery drawer, it is probably safe to say that most of us use forks at dinner time without batting an eyelid. Yet bots like ChatGPT have flipped this on its head, suggesting forks could also be used for playing 'I spy', fighting zombies and digging trenches. Artificial intelligence and humans went head-to-head in a new study that sought to find out which was better at coming up with the most imaginative ideas. As it turns out, new bots are more creative than 90 per cent of humans, thinking of bizarre uses for everyday items like toothbrushes, pants, forks and tyres. ChatGPT was among six state-of-the-art bots tested by scientists at Berlin's Humboldt University and the University of Essex.


StackRoof Technologies on LinkedIn: #stackrooftechnologies #servinginnovation #chatgpt4 #chatgpt3 #chatbot…

#artificialintelligence

ChatGPT is an AI-supported chatbot that helps users generate content. As the world slowly turns toward artificial intelligence, demand for quality AI content generators is increasing. OpenAI's ChatGPT4 has certainly proved to be a great AI asset.