chatgpt-4
Subjective Evaluation Profile Analysis of Science Fiction Short Stories and its Critical-Theoretical Significance
This study positions large language models (LLMs) as "subjective literary critics" to explore aesthetic preferences and evaluation patterns in literary assessment. Ten Japanese science fiction short stories were translated into English and evaluated by six state-of-the-art LLMs across seven independent sessions. Principal component analysis and clustering techniques revealed significant variations in evaluation consistency (α ranging from 1.00 to 0.35) and five distinct evaluation patterns. Additionally, evaluation variance across stories differed by up to 4.5-fold, with TF-IDF analysis confirming distinctive evaluation vocabularies for each model. Our seven-session within-day protocol using an original Science Fiction corpus strategically minimizes external biases, allowing us to observe implicit value systems shaped by RLHF and their influence on literary judgment. These findings suggest that LLMs may possess individual evaluation characteristics similar to human critical schools, rather than functioning as neutral benchmarkers.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)
Comparative Analysis of Abstractive Summarization Models for Clinical Radiology Reports
Bhattacharya, Anindita, Rehman, Tohida, Sanyal, Debarshi Kumar, Chattopadhyay, Samiran
The'findings' section of a radiology report is often detailed and lengthy, whereas the'impression' section is comparatively more compact and captures key diagnostic conclusions. This research explores the use of advanced abstractive summarization models to generate the concise'impression' from the'findings' section of a radiology report. We have used the publicly available MIMIC-CXR dataset. A comparative analysis is conducted on leading pre-trained and open-source large language models, including T5-base, BART-base, PEGASUS-x-base, ChatGPT-4, LLaMA-3-8B, and a custom Pointer Generator Network with a coverage mechanism. To ensure a thorough assessment, multiple evaluation metrics are employed, including ROUGE-1, ROUGE-2, ROUGE-L, METEOR, and BERTScore. By analyzing the performance of these models, this study identifies their respective strengths and limitations in the summarization of medical text. The findings of this paper provide helpful information for medical professionals who need automated summarization solutions in the healthcare sector.
- Asia > India > West Bengal > Kolkata (0.04)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Europe > Germany > Berlin (0.04)
- Health & Medicine > Nuclear Medicine (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Health & Medicine > Therapeutic Area > Pulmonary/Respiratory Diseases (0.95)
Welp, Nvidia's RTX 5090 can crack an 8-digit password in 3 hours
I have bad news for everyone with weak passwords. A hacker can guess your laziest random passwords in the same amount of time it takes to watch a movie. It turns out when you put the most brutally fast consumer graphics card on the task of, uh, brute-forcing 8-character passwords, it can crack a numbers-only string in 3 hours. Such is the finding of Hive Systems, a cybersecurity firm based in Virginia, as part of the research that went into its 2025 password table. The chart shows how fast a "consumer budget" hacker could brute-force passwords of varying lengths (4 to 18 characters) and compositions (e.g., numbers only, lowercase letters, uppercase and lowercase letters, etc.).
Fact-checking with Generative AI: A Systematic Cross-Topic Examination of LLMs Capacity to Detect Veracity of Political Information
Kuznetsova, Elizaveta, Vitulano, Ilaria, Makhortykh, Mykola, Stolze, Martha, Nagy, Tomas, Vziatysheva, Victoria
The purpose of this study is to assess how large language models (LLMs) can be used for fact-checking and contribute to the broader debate on the use of automated means for veracity identification. To achieve this purpose, we use AI auditing methodology that systematically evaluates performance of five LLMs (ChatGPT 4, Llama 3 (70B), Llama 3.1 (405B), Claude 3.5 Sonnet, and Google Gemini) using prompts regarding a large set of statements fact-checked by professional journalists (16,513). Specifically, we use topic modeling and regression analysis to investigate which factors (e.g. topic of the prompt or the LLM type) affect evaluations of true, false, and mixed statements. Our findings reveal that while ChatGPT 4 and Google Gemini achieved higher accuracy than other models, overall performance across models remains modest. Notably, the results indicate that models are better at identifying false statements, especially on sensitive topics such as COVID-19, American political controversies, and social issues, suggesting possible guardrails that may enhance accuracy on these topics. The major implication of our findings is that there are significant challenges for using LLMs for factchecking, including significant variation in performance across different LLMs and unequal quality of outputs for specific topics which can be attributed to deficits of training data. Our research highlights the potential and limitations of LLMs in political fact-checking, suggesting potential avenues for further improvements in guardrails as well as fine-tuning.
- North America > United States (1.00)
- Asia > Russia (0.46)
- Europe > Switzerland (0.14)
- Europe > Germany (0.14)
- Media > News (1.00)
- Health & Medicine (1.00)
- Government > Regional Government > North America Government > United States Government (0.95)
ChatGPT for President! Presupposed content in politicians versus GPT-generated texts
Garassino, Davide, Brocca, Nicola, Masia, Viviana
This study examines ChatGPT-4's capability to replicate linguistic strategies used in political discourse, focusing on its potential for manipulative language generation. As large language models become increasingly popular for text generation, concerns have grown regarding their role in spreading fake news and propaganda. This research compares real political speeches with those generated by ChatGPT, emphasizing presuppositions (a rhetorical device that subtly influences audiences by packaging some content as already known at the moment of utterance, thus swaying opinions without explicit argumentation). Using a corpus-based pragmatic analysis, this study assesses how well ChatGPT can mimic these persuasive strategies. The findings reveal that although ChatGPT-generated texts contain many manipulative presuppositions, key differences emerge in their frequency, form, and function compared with those of politicians. For instance, ChatGPT often relies on change-of-state verbs used in fixed phrases, whereas politicians use presupposition triggers in more varied and creative ways. Such differences, however, are challenging to detect with the naked eye, underscoring the potential risks posed by large language models in political and public discourse.Using a corpus-based pragmatic analysis, this study assesses how well ChatGPT can mimic these persuasive strategies. The findings reveal that although ChatGPT-generated texts contain many manipulative presuppositions, key differences emerge in their frequency, form, and function compared with those of politicians. For instance, ChatGPT often relies on change-of-state verbs used in fixed phrases, whereas politicians use presupposition triggers in more varied and creative ways. Such differences, however, are challenging to detect with the naked eye, underscoring the potential risks posed by large language models in political and public discourse.
- Europe > France (0.68)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- (5 more...)
- Media > News (0.66)
- Government > Regional Government > Europe Government > France Government (0.46)
How well can LLMs Grade Essays in Arabic?
Ghazawi, Rayed, Simpson, Edwin
This research assesses the effectiveness of state-of-the-art large language models (LLMs), including ChatGPT, Llama, Aya, Jais, and ACEGPT, in the task of Arabic automated essay scoring (AES) using the AR-AES dataset. It explores various evaluation methodologies, including zero-shot, few-shot in-context learning, and fine-tuning, and examines the influence of instruction-following capabilities through the inclusion of marking guidelines within the prompts. A mixed-language prompting strategy, integrating English prompts with Arabic content, was implemented to improve model comprehension and performance. Among the models tested, ACEGPT demonstrated the strongest performance across the dataset, achieving a Quadratic Weighted Kappa (QWK) of 0.67, but was outperformed by a smaller BERT-based model with a QWK of 0.88. The study identifies challenges faced by LLMs in processing Arabic, including tokenization complexities and higher computational demands. Performance variation across different courses underscores the need for adaptive models capable of handling diverse assessment formats and highlights the positive impact of effective prompt engineering on improving LLM outputs. To the best of our knowledge, this study is the first to empirically evaluate the performance of multiple generative Large Language Models (LLMs) on Arabic essays using authentic student data.
- North America > Mexico > Mexico City > Mexico City (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Instructional Material > Course Syllabus & Notes (0.67)
- Education > Educational Setting (1.00)
- Education > Assessment & Standards > Student Performance (1.00)
- Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.67)
Generative Artificial Intelligence: Implications for Biomedical and Health Professions Education
Generative AI has had a profound impact on biomedicine and health, both in professional work and in education. Based on large language models (LLMs), generative AI has been found to perform as well as humans in simulated situations taking medical board exams, answering clinical questions, solving clinical cases, applying clinical reasoning, and summarizing information. Generative AI is also being used widely in education, performing well in academic courses and their assessments. This review summarizes the successes of LLMs and highlights some of their challenges in the context of education, most notably aspects that may undermines the acquisition of knowledge and skills for professional work. It then provides recommendations for best practices overcoming shortcomings for LLM use in education. Although there are challenges for use of generative AI in education, all students and faculty, in biomedicine and health and beyond, must have understanding and be competent in its use.
- North America > United States > New York > Monroe County > Rochester (0.05)
- Asia > Singapore (0.04)
- Asia > Middle East > Israel (0.04)
- (4 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Overview (1.00)
- (2 more...)
- Law (1.00)
- Information Technology (1.00)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- (7 more...)
Advice for Diabetes Self-Management by ChatGPT Models: Challenges and Recommendations
Given their ability for advanced reasoning, extensive contextual understanding, and robust question-answering abilities, large language models have become prominent in healthcare management research. Despite adeptly handling a broad spectrum of healthcare inquiries, these models face significant challenges in delivering accurate and practical advice for chronic conditions such as diabetes. We evaluate the responses of ChatGPT versions 3.5 and 4 to diabetes patient queries, assessing their depth of medical knowledge and their capacity to deliver personalized, context-specific advice for diabetes self-management. Our findings reveal discrepancies in accuracy and embedded biases, emphasizing the models' limitations in providing tailored advice unless activated by sophisticated prompting techniques. Additionally, we observe that both models often provide advice without seeking necessary clarification, a practice that can result in potentially dangerous advice. This underscores the limited practical effectiveness of these models without human oversight in clinical settings. To address these issues, we propose a commonsense evaluation layer for prompt evaluation and incorporating disease-specific external memory using an advanced Retrieval Augmented Generation technique. This approach aims to improve information quality and reduce misinformation risks, contributing to more reliable AI applications in healthcare settings. Our findings seek to influence the future direction of AI in healthcare, enhancing both the scope and quality of its integration.
- North America > United States (0.46)
- Oceania > Australia (0.04)
- Asia > Pakistan (0.04)
A Prompt Engineering Approach and a Knowledge Graph based Framework for Tackling Legal Implications of Large Language Model Answers
Hannah, George, Sousa, Rita T., Dasoulas, Ioannis, d'Amato, Claudia
With the recent surge in popularity of Large Language Models (LLMs), there is the rising risk of users blindly trusting the information in the response, even in cases where the LLM recommends actions that have potential legal implications and this may put the user in danger. We provide an empirical analysis on multiple existing LLMs showing the urgency of the problem. Hence, we propose a short-term solution consisting in an approach for isolating these legal issues through prompt re-engineering. We further analyse the outcomes but also the limitations of the prompt engineering based approach and we highlight the need of additional resources for fully solving the problem We also propose a framework powered by a legal knowledge graph (KG) to generate legal citations for these legal issues, enriching the response of the LLM.
- Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- (6 more...)
RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code
Chen, Jiachi, Zhong, Qingyuan, Wang, Yanlin, Ning, Kaiwen, Liu, Yongkun, Xu, Zenan, Zhao, Zhe, Chen, Ting, Zheng, Zibin
The emergence of Large Language Models (LLMs) has significantly influenced various aspects of software development activities. Despite their benefits, LLMs also pose notable risks, including the potential to generate harmful content and being abused by malicious developers to create malicious code. Several previous studies have focused on the ability of LLMs to resist the generation of harmful content that violates human ethical standards, such as biased or offensive content. However, there is no research evaluating the ability of LLMs to resist malicious code generation. To fill this gap, we propose RMCBench, the first benchmark comprising 473 prompts designed to assess the ability of LLMs to resist malicious code generation. This benchmark employs two scenarios: a text-to-code scenario, where LLMs are prompted with descriptions to generate code, and a code-to-code scenario, where LLMs translate or complete existing malicious code. Based on RMCBench, we conduct an empirical study on 11 representative LLMs to assess their ability to resist malicious code generation. Our findings indicate that current LLMs have a limited ability to resist malicious code generation with an average refusal rate of 40.36% in text-to-code scenario and 11.52% in code-to-code scenario. The average refusal rate of all LLMs in RMCBench is only 28.71%; ChatGPT-4 has a refusal rate of only 35.73%. We also analyze the factors that affect LLMs' ability to resist malicious code generation and provide implications for developers to enhance model robustness.
- North America > United States > California > Sacramento County > Sacramento (0.05)
- Asia > China > Guangdong Province > Zhuhai (0.04)