knowledge cutoff date
RustEvo^2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation
Liang, Linxi, Gong, Jing, Liu, Mingwei, Wang, Chong, Ou, Guangsheng, Wang, Yanlin, Peng, Xin, Zheng, Zibin
Large Language Models (LLMs) have become pivotal tools for automating code generation in software development. However, these models face significant challenges in producing version-aware code for rapidly evolving languages like Rust, where frequent Application Programming Interface (API) changes across versions lead to compatibility issues and correctness errors. Existing benchmarks lack systematic evaluation of how models navigate API transitions, relying on labor-intensive manual curation and offering limited version-specific insights. To address this gap, we present RustEvo, a novel framework for constructing dynamic benchmarks that evaluate the ability of LLMs to adapt to evolving Rust APIs. RustEvo automates dataset creation by synthesizing 588 API changes (380 from Rust standard libraries, 208 from 15 third-party crates) into programming tasks mirroring real-world challenges. These tasks cover four API evolution categories: Stabilizations, Signature Changes, Behavioral Changes, and Deprecations, reflecting their actual distribution in the Rust ecosystem. Experiments on state-of-the-art (SOTA) LLMs reveal significant performance variations: models achieve a 65.8% average success rate on stabilized APIs but only 38.0% on behavioral changes, highlighting difficulties in detecting semantic shifts without signature alterations. Knowledge cutoff dates strongly influence performance, with models scoring 56.1% on before-cutoff APIs versus 32.5% on after-cutoff tasks. Retrieval-Augmented Generation (RAG) mitigates this gap, improving success rates by 13.5% on average for APIs released after model training. Our findings underscore the necessity of our evolution-aware benchmarks to advance the adaptability of LLMs in fast-paced software ecosystems. The framework and the benchmarks are publicly released at https://github.com/SYSUSELab/RustEvo.
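The four evolution categories the abstract names can be made concrete with a small classifier over per-version API metadata. The sketch below is illustrative only: the `ApiSnapshot` fields and the decision order are assumptions, not RustEvo's actual pipeline, which works over real rustdoc and changelog data.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ApiSnapshot:
    """Minimal view of one API at one Rust version (hypothetical fields)."""
    signature: str       # textual function signature
    stable: bool         # usable on stable Rust?
    deprecated: bool     # carries a deprecation attribute?
    doc_behavior: str    # prose summary of documented runtime behavior

def classify_change(old: Optional[ApiSnapshot], new: ApiSnapshot) -> str:
    """Map an (old, new) snapshot pair to one of the four categories."""
    if old is None or (not old.stable and new.stable):
        return "Stabilization"      # API newly available on stable Rust
    if new.deprecated and not old.deprecated:
        return "Deprecation"
    if new.signature != old.signature:
        return "Signature Change"
    if new.doc_behavior != old.doc_behavior:
        return "Behavioral Change"  # semantics shift, signature intact
    return "Unchanged"

# A behavioral change is the hardest case for a model to spot:
# the signature is identical, only the documented semantics moved.
old = ApiSnapshot("fn get(&self, i: usize) -> Option<&T>", True, False,
                  "bounds-checked lookup")
new = ApiSnapshot("fn get(&self, i: usize) -> Option<&T>", True, False,
                  "bounds-checked lookup; now usable in const contexts")
print(classify_change(old, new))  # Behavioral Change
```

The decision order matters: deprecation and stabilization are checked before signature diffs, since a release note can change several attributes at once and one category must win.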
Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle
Dai, Hui, Teehan, Ryan, Ren, Mengye
Many existing evaluation benchmarks for Large Language Models (LLMs) quickly become outdated due to the emergence of new models and training data. These benchmarks also fall short in assessing how LLM performance changes over time, as they consist of static questions without a temporal dimension. To address these limitations, we propose using future event prediction as a continuous evaluation method to assess LLMs' temporal generalization and forecasting abilities. Our benchmark, Daily Oracle, automatically generates question-answer (QA) pairs from daily news, challenging LLMs to predict "future" event outcomes. Our findings reveal that as pre-training data becomes outdated, LLM performance degrades over time. While Retrieval Augmented Generation (RAG) has the potential to enhance prediction accuracy, the performance degradation pattern persists, highlighting the need for continuous model updates.
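The core of Daily Oracle is turning dated news into forecasting questions whose answers resolve after a model's training cutoff. A minimal sketch of that idea follows; the QA schema and function names are hypothetical, not the benchmark's actual generation pipeline.

```python
import datetime as dt

def make_qa(headline: str, outcome: str, event_date: dt.date) -> dict:
    """Turn one dated news item into a forecasting QA pair (illustrative schema)."""
    return {
        "question": f"Will the following happen by {event_date.isoformat()}? {headline}",
        "answer": outcome,              # e.g. "yes"/"no", resolved by the article itself
        "resolution_date": event_date,
    }

def is_future_for(model_cutoff: dt.date, qa: dict) -> bool:
    """True when the event resolves after the model's training cutoff,
    i.e. the model must genuinely forecast rather than recall."""
    return qa["resolution_date"] > model_cutoff

qa = make_qa("Central bank raises interest rates", "yes", dt.date(2024, 6, 1))
print(is_future_for(dt.date(2023, 9, 1), qa))  # True: a real forecast for this model
```

Gating questions by `resolution_date` is what gives the benchmark its temporal dimension: the same QA pair is a recall test for one model and a forecasting test for another, depending on each model's cutoff.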
LLM Agents can Autonomously Exploit One-day Vulnerabilities
Fang, Richard, Bindu, Rohan, Gupta, Akul, Kang, Daniel
LLMs have become increasingly powerful, both in their benign and malicious uses. With this increase in capabilities, researchers have grown interested in their ability to exploit cybersecurity vulnerabilities. In particular, recent work has conducted preliminary studies on the ability of LLM agents to autonomously hack websites. However, these studies are limited to simple vulnerabilities. In this work, we show that LLM agents can autonomously exploit one-day vulnerabilities in real-world systems. To show this, we collected a dataset of 15 one-day vulnerabilities that include ones categorized as critical severity in the CVE description. When given the CVE description, GPT-4 is capable of exploiting 87% of these vulnerabilities compared to 0% for every other model we test (GPT-3.5, open-source LLMs) and open-source vulnerability scanners (ZAP and Metasploit). Fortunately, our GPT-4 agent requires the CVE description for high performance: without the description, GPT-4 can exploit only 7% of the vulnerabilities. Our findings raise questions around the widespread deployment of highly capable LLM agents.
FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation
Vu, Tu, Iyyer, Mohit, Wang, Xuezhi, Constant, Noah, Wei, Jerry, Wei, Jason, Tar, Chris, Sung, Yun-Hsuan, Zhou, Denny, Le, Quoc, Luong, Thang
Most large language models (LLMs) are trained once and never updated; thus, they lack the ability to dynamically adapt to our ever-changing world. In this work, we perform a detailed study of the factuality of LLM-generated text in the context of answering questions that test current world knowledge. Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types, including questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked. We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Through human evaluations involving more than 50K judgments, we shed light on limitations of these models and demonstrate significant room for improvement: for instance, all models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises. Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA by incorporating relevant and up-to-date information retrieved from a search engine into the prompt. Our experiments show that FreshPrompt outperforms both competing search engine-augmented prompting methods such as Self-Ask (Press et al., 2022) as well as commercial systems such as Perplexity.AI. Further analysis of FreshPrompt reveals that both the number of retrieved evidences and their order play a key role in influencing the correctness of LLM-generated answers. Additionally, instructing the LLM to generate concise and direct answers helps reduce hallucination compared to encouraging more verbose answers. To facilitate future work, we release FreshQA at github.com/freshllms/freshqa and commit to updating it at regular intervals.
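The abstract's key mechanical findings about FreshPrompt are that the number of retrieved evidences and their ordering matter, and that asking for concise answers reduces hallucination. The sketch below captures that spirit under stated assumptions: the prompt template, function name, and the choice to place the freshest snippet nearest the question are illustrative, not the paper's exact format.

```python
from datetime import date

def fresh_prompt(question: str, evidences: list[tuple[date, str]], k: int = 5) -> str:
    """Assemble a search-augmented prompt in the spirit of FreshPrompt:
    keep only the k freshest snippets and place the most recent closest
    to the question, since evidence count and order influence correctness.
    (Illustrative template, not the paper's actual one.)"""
    freshest = sorted(evidences)[-k:]                  # oldest ... newest
    lines = [f"[{d.isoformat()}] {text}" for d, text in freshest]
    # A terse answer instruction was found to reduce hallucination.
    return "\n".join(lines + [f"question: {question}", "answer (concise):"])

prompt = fresh_prompt(
    "Who holds the men's 100m world record?",
    [(date(2009, 8, 16), "Usain Bolt runs 9.58s in Berlin."),
     (date(2023, 7, 1), "The 9.58s record still stands as of mid-2023.")],
)
print(prompt)
```

Capping at `k` snippets keeps the context focused, and dating each snippet lets the model weigh recency explicitly when the retrieved evidences disagree.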
Can ChatGPT discuss current events? Chatbot has clear knowledge cutoff date
During an appearance on "The Ingraham Angle," Jimmy Failla shares his thoughts on the latest interesting development in the world of artificial intelligence. ChatGPT has been a game changer for artificial intelligence, becoming earlier this year the fastest-growing web platform ever as millions of people across the world rushed to communicate with a system that can mimic human conversation. The system, however, is unable to respond to current events questions because it has a knowledge cutoff date of September 2021. When Fox News Digital, for example, attempted to ask ChatGPT questions about current events, such as whether the Titan submersible implosion could have been prevented or what charges Hunter Biden was hit with this month, the chatbot responded that it does not have knowledge of current events after September 2021. "As an AI language model, I have a knowledge cutoff date because my training data only goes up until September 2021," ChatGPT responded when asked why it does not possess knowledge beyond September 2021.