NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating Large Language Models in Offensive Security

Motivation

Neural Information Processing Systems 

For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled?

The dataset was created to evaluate the effectiveness of large language models (LLMs) at solving Capture the Flag (CTF) challenges in offensive security. A systematic assessment of LLM capabilities in this domain was needed, since their potential for such tasks had not previously been evaluated in a rigorous way. The goal was to build a scalable, open-source benchmark dataset designed specifically for this purpose. The dataset includes diverse CTF challenges drawn from popular competitions, along with metadata to support LLM testing and adaptive learning. It fills a critical gap by providing a comprehensive resource for systematically evaluating LLM performance on real-world cybersecurity tasks. Together with the accompanying automated framework, the dataset enables continuous improvement and refinement of LLM-based approaches to vulnerability detection and resolution. By releasing the dataset as open source, the project aims to foster further research and development in this area, providing a platform for developing, testing, and refining LLM-based approaches to cybersecurity challenges.

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

The students listed above compiled and validated the challenges from all previous global CSAW competitions, manually checking each challenge's setup and ensuring it remains solvable despite software changes. This work was conducted in collaboration with the OSIRIS Lab and the Center for Cybersecurity at NYU, which organize CSAW and attract global participation[1].
