Generative AI
OpenAI Upgrades Its Smartest AI Model With Improved Reasoning Skills
OpenAI today announced an improved version of its most capable artificial intelligence model to date--one that takes even more time to deliberate over questions--just a day after Google announced its first model of this type. OpenAI's new model, called o3, replaces o1, which the company introduced in September. Like o1, the new model spends time ruminating over a problem in order to deliver better answers to questions that require step-by-step logical reasoning. The o3 model scores much higher on several measures than its predecessor, OpenAI says, including ones that measure complex coding-related skills and advanced math and science competency. It is three times better than o1 at answering questions posed by ARC-AGI, a benchmark designed to test an AI models' ability to reason over problems they're encountering for the first time.
I Used AI to Do All of My Holiday Shopping
One of the promises of the next era of generative AI is that the technology will be agentic, or have the ability to perform tasks autonomously on behalf of us chaotic humans. That means AI agents will theoretically be able to "reason" about the next steps they should take, allowing them to execute multiple actions from a single query. The possibilities are endless, if you believe the hype--think maximum efficiency and productivity, plus a host of other buzz word-latent phrases that one might hear during a tech giant's quarterly earnings call. All I want AI to do for me, however, is to shop. I understand that some people find shopping to be a pleasurable act, but the options overwhelm me, whether I'm in an actual store or stuck in an endless scroll.
Generative AI Still Needs to Prove Its Usefulness
Generative AI took the world by storm in November 2022, with the release of OpenAI's service ChatGPT. One hundred million people started using it, practically overnight. Sam Altman, the CEO of OpenAI, the company that created ChatGPT, became a household name. And at least half a dozen companies raced OpenAI in an effort to build a better system. OpenAI itself sought to outdo GPT-4, its flagship model, introduced in March 2023, with a successor, presumably to be called GPT-5.
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Chan, Jun Shern, Chowdhury, Neil, Jaffe, Oliver, Aung, James, Sherburn, Dane, Mays, Evan, Starace, Giulio, Liu, Kevin, Maksin, Leon, Patwardhan, Tejal, Weng, Lilian, Mądry, Aleksander
We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup--OpenAI's o1-preview with AIDE scaffolding--achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (github.com/openai/mle-bench/) to facilitate future research in understanding the ML engineering capabilities of AI agents.
Combining Domain-Specific Models and LLMs for Automated Disease Phenotyping from Survey Data
Beeri, Gal, Chamot, Benoit, Latchem, Elena, Venkatesh, Shruthi, Whalan, Sarah, Kruger, Van Zyl, Martino, David
Funding and support: The Generative AI Challenge is funded by grants from the Future Health Research and Innovation Fund (FHRIF), Grant ID IC2023-GAIA/11. Conflict of interest statement: The authors declare no conflicts of interest. Abstract This exploratory pilot study investigated the potential of combining a domain-specific model, BERN2, with large language models (LLMs) to enhance automated disease phenotyping from research survey data. Motivated by the need for efficient and accurate methods to harmonize the growing volume of survey data with standardized disease ontologies, we employed BERN2, a biomedical named entity recognition and normalization model, to extract disease information from the ORIGINS birth cohort survey data. After rigorously evaluating BERN2's performance against a manually curated ground truth dataset, we integrated various LLMs using prompt engineering, Retrieval-Augmented Generation (RAG), and Instructional Fine-Tuning (IFT) to refine the model's outputs. BERN2 demonstrated high performance in extracting and normalizing disease mentions, and the integration of LLMs, particularly with Few Shot Inference and RAG orchestration, further improved accuracy. This approach, especially when incorporating structured examples, logical reasoning prompts, and detailed context, offers a promising avenue for developing tools to enable efficient cohort profiling and data harmonization across large, heterogeneous research datasets. Introduction The increasing availability of research survey data from cohort studies and clinical trials offers unprecedented opportunities to advance biomedical research and improve healthcare (1).
Ethics and Technical Aspects of Generative AI Models in Digital Content Creation
Generative AI models like GPT-4o and DALL-E 3 are reshaping digital content creation, offering industries tools to generate diverse and sophisticated text and images with remarkable creativity and efficiency. This paper examines both the capabilities and challenges of these models within creative workflows. While they deliver high performance in generating content with creativity, diversity, and technical precision, they also raise significant ethical concerns. Our study addresses two key research questions: (a) how these models perform in terms of creativity, diversity, accuracy, and computational efficiency, and (b) the ethical risks they present, particularly concerning bias, authenticity, and potential misuse. Through a structured series of experiments, we analyze their technical performance and assess the ethical implications of their outputs, revealing that although generative models enhance creative processes, they often reflect biases from their training data and carry ethical vulnerabilities that require careful oversight. This research proposes ethical guidelines to support responsible AI integration into industry practices, fostering a balance between innovation and ethical integrity.
Fake News Detection: Comparative Evaluation of BERT-like Models and Large Language Models with Generative AI-Annotated Data
Raza, Shaina, Paulen-Patterson, Drai, Ding, Chen
Fake news poses a significant threat to public opinion and social stability in modern society. This study presents a comparative evaluation of BERT-like encoder-only models and autoregressive decoder-only large language models (LLMs) for fake news detection. We introduce a dataset of news articles labeled with GPT-4 assistance (an AI-labeling method) and verified by human experts to ensure reliability. Both BERT-like encoder-only models and LLMs were fine-tuned on this dataset. Additionally, we developed an instruction-tuned LLM approach with majority voting during inference for label generation. Our analysis reveals that BERT-like models generally outperform LLMs in classification tasks, while LLMs demonstrate superior robustness against text perturbations. Compared to weak labels (distant supervision) data, the results show that AI labels with human supervision achieve better classification results. This study highlights the effectiveness of combining AI-based annotation with human oversight and demonstrates the performance of different families of machine learning models for fake news detection
Evaluating the Propensity of Generative AI for Producing Harmful Disinformation During an Election Cycle
Generative Artificial Intelligence offers a powerful tool for adversaries who wish to engage in influence operations, such as the Chinese Spamouflage operation and the Russian Internet Research Agency effort that both sought to interfere with recent US election cycles. Therefore, this study seeks to investigate the propensity of current generative AI models for producing harmful disinformation during an election cycle. The probability that different generative AI models produced disinformation when given adversarial prompts was evaluated, in addition the associated harm. This allows for the expected harm for each model to be computed and it was discovered that Copilot and Gemini tied for the overall safest performance by realizing the lowest expected harm, while GPT-4o produced the greatest rates of harmful disinformation, resulting in much higher expected harm scores. The impact of disinformation category was also investigated and Gemini was safest within the political category of disinformation due to mitigation attempts made by developers during the election, while Copilot was safest for topics related to health. Moreover, characteristics of adversarial roles were discovered that led to greater expected harm across all models. Finally, classification models were developed that predicted disinformation production based on the conditions considered in this study, which offers insight into factors important for predicting disinformation production. Based on all of these insights, recommendations are provided that seek to mitigate factors that lead to harmful disinformation being produced by generative AI models. It is hoped that developers will use these insights to improve future models.
The Evolving Usage of GenAI by Computing Students
Hou, Irene, Nguyen, Hannah Vy, Man, Owen, MacNeil, Stephen
Help-seeking is a critical aspect of learning and problem-solving for computing students. Recent research has shown that many students are aware of generative AI (GenAI) tools; however, there are gaps in the extent and effectiveness of how students use them. With over two years of widespread GenAI usage, it is crucial to understand whether students' help-seeking behaviors with these tools have evolved and how. This paper presents findings from a repeated cross-sectional survey conducted among computing students across North American universities (n=95). Our results indicate shifts in GenAI usage patterns. In 2023, 34.1% of students (n=47) reported never using ChatGPT for help, ranking it fourth after online searches, peer support, and class forums. By 2024, this figure dropped sharply to 6.3% (n=48), with ChatGPT nearly matching online search as the most commonly used help resource. Despite this growing prevalence, there has been a decline in students' hourly and daily usage of GenAI tools, which may be attributed to a common tendency to underestimate usage frequency. These findings offer new insights into the evolving role of GenAI in computing education, highlighting its increasing acceptance and solidifying its position as a key help resource.
Demystifying the Potential of ChatGPT-4 Vision for Construction Progress Monitoring
The integration of Large Vision-Language Models (LVLMs) such as OpenAI's GPT-4 Vision into various sectors has marked a significant evolution in the field of artificial intelligence, particularly in the analysis and interpretation of visual data. This paper explores the practical application of GPT-4 Vision in the construction industry, focusing on its capabilities in monitoring and tracking the progress of construction projects. Utilizing high-resolution aerial imagery of construction sites, the study examines how GPT-4 Vision performs detailed scene analysis and tracks developmental changes over time. The findings demonstrate that while GPT-4 Vision is proficient in identifying construction stages, materials, and machinery, it faces challenges with precise object localization and segmentation. Despite these limitations, the potential for future advancements in this technology is considerable. This research not only highlights the current state and opportunities of using LVLMs in construction but also discusses future directions for enhancing the model's utility through domain-specific training and integration with other computer vision techniques and digital twins.