Collaborating Authors

 Ray, Baishakhi


Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination

arXiv.org Artificial Intelligence

The rapid evolution of code large language models underscores the need for effective and transparent benchmarking of their reasoning capabilities. However, the current benchmarking approach heavily depends on publicly available, human-created datasets. The widespread use of these fixed benchmark datasets makes the benchmarking process static and thus particularly susceptible to data contamination, an unavoidable consequence of the extensive data collection processes used to train Code LLMs. Existing approaches that address data contamination often require substantial human effort and suffer from imbalanced problem complexity. To tackle these challenges, we propose \tool, a novel benchmarking suite for evaluating Code LLMs under potential data contamination. Given a seed programming problem, \tool employs multiple agents to extract and modify the context without altering the core logic, generating semantically equivalent variations. We introduce a dynamic data generation method and conduct empirical studies on two seed datasets across 21 Code LLMs. Results show that \tool effectively benchmarks reasoning capabilities under contamination risks while generating diverse problem sets to ensure consistent and reliable evaluations.
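
The abstract describes agents that rewrite a seed problem's context while preserving its core logic. A minimal sketch of that idea follows; the data layout, the prompt wording, and the helpers call_llm and run are illustrative assumptions, not the paper's actual implementation.

# Sketch: rewrite a seed problem's surface context, then check that the
# original reference solution still passes, so only context (not logic) changes.
from dataclasses import dataclass

@dataclass
class Problem:
    description: str   # natural-language problem statement
    solution: str      # reference solution source code
    tests: list        # list of (input, expected_output) pairs

def rewrite_context(problem: Problem, call_llm) -> Problem:
    """Ask an LLM agent to reword names and narrative framing while keeping
    identical input/output behaviour."""
    prompt = (
        "Rewrite the following programming problem with a different surface "
        "story and identifiers, but identical input/output behaviour:\n\n"
        + problem.description
    )
    return Problem(call_llm(prompt), problem.solution, problem.tests)

def is_semantically_equivalent(variant: Problem, run) -> bool:
    """Equivalence check: the reference solution must still pass every test."""
    return all(run(variant.solution, x) == y for x, y in variant.tests)

Variants that fail the equivalence check would be discarded, so the generated problems differ only in context, never in the underlying logic.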


Recent Advances in Large Language Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation

arXiv.org Artificial Intelligence

Data contamination has received increasing attention in the era of large language models (LLMs) due to their reliance on vast Internet-derived training corpora. To mitigate the risk of potential data contamination, LLM benchmarking has undergone a transformation from static to dynamic benchmarking. In this work, we conduct an in-depth analysis of existing benchmarking methods, from static to dynamic, aimed at reducing data contamination risks. We first examine methods that enhance static benchmarks and identify their inherent limitations. We then highlight a critical gap: the lack of standardized criteria for evaluating dynamic benchmarks. Based on this observation, we propose a series of optimal design principles for dynamic benchmarking and analyze the limitations of existing dynamic benchmarks. This survey provides a concise yet comprehensive overview of recent advancements in data contamination research, offering valuable insights and a clear guide for future research efforts. We maintain a GitHub repository to continuously collect both static and dynamic benchmarking methods for LLMs. The repository can be found at this link.
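
To make the static-versus-dynamic distinction concrete, here is a small sketch under assumed details: a static benchmark is a fixed, publicly available item list, while a dynamic one regenerates fresh items per evaluation run from a parameterised template, so exact test instances are unlikely to sit verbatim in training corpora.

# Illustrative contrast only; the template and generator are assumptions.
import random

STATIC_ITEMS = [("What is 17 + 25?", "42")]   # fixed items, contamination-prone

def dynamic_items(seed: int, n: int = 3):
    """Regenerate fresh question/answer pairs per evaluation run."""
    rng = random.Random(seed)
    for _ in range(n):
        a, b = rng.randint(10, 99), rng.randint(10, 99)
        yield (f"What is {a} + {b}?", str(a + b))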


AI Software Engineer: Programming with Trust

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have shown surprising proficiency in generating code snippets, promising to automate large parts of software engineering via artificial intelligence (AI). We argue that successfully deploying AI software engineers requires a level of trust equal to or even greater than the trust established by human-driven software engineering practices. The recent trend toward LLM agents offers a path toward integrating the power of LLMs to create new code with the power of analysis tools to increase trust in the code. This opinion piece comments on whether LLM agents could dominate software engineering workflows in the future and whether the focus of programming will shift from programming at scale to programming with trust. Software engineering is undergoing a significant phase of greater automation owing to the emergence of Large Language Models (LLMs) for code.


CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have significantly aided developers by generating or assisting in code writing, enhancing productivity across various tasks. While identifying incorrect code is often straightforward, detecting vulnerabilities in functionally correct code is more challenging, especially for developers with limited security knowledge. This poses considerable security risks when using LLM-generated code and underscores the need for robust evaluation benchmarks that assess both functional correctness and security. Current benchmarks like CyberSecEval and SecurityEval attempt to address this but are hindered by unclear and impractical specifications, failing to assess both functionality and security accurately. To tackle these deficiencies, we introduce CWEval, a novel outcome-driven evaluation framework designed to enhance the evaluation of secure code generation by LLMs. This framework assesses code functionality and security simultaneously, with high-quality task specifications and outcome-driven test oracles that provide high accuracy. Coupled with CWEval-bench, a multilingual, security-critical coding benchmark, CWEval provides a rigorous empirical security evaluation of LLM-generated code, overcoming previous benchmarks' shortcomings. Through our evaluations, CWEval reveals a notable portion of functional but insecure code produced by LLMs and shows a serious inaccuracy of previous evaluations, ultimately contributing significantly to the field of secure code generation. We open-source our artifact at https://github.com/Co1lin/CWEval.
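
The following is a minimal sketch of what an outcome-driven oracle in this spirit could look like: the same generated function is judged on observable outcomes for both functionality and security. The task (read a file under a base directory) and the helper name are hypothetical and not taken from CWEval-bench.

# Sketch: functional outcome = benign request returns the content;
# security outcome = a path-traversal attempt must not escape the base dir.
import os, tempfile

def functional_and_secure(read_under_base) -> tuple[bool, bool]:
    base = tempfile.mkdtemp()
    with open(os.path.join(base, "ok.txt"), "w") as f:
        f.write("hello")
    functional = read_under_base(base, "ok.txt") == "hello"
    try:
        leaked = read_under_base(base, "../../etc/passwd")
        secure = leaked in (None, "")        # escaping the base dir is insecure
    except (ValueError, OSError):
        secure = True                        # rejecting the request is the desired outcome
    return functional, secure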


Can LLM Prompting Serve as a Proxy for Static Analysis in Vulnerability Detection

arXiv.org Artificial Intelligence

Despite their remarkable success, large language models (LLMs) have shown limited ability in applied tasks such as vulnerability detection. We investigate various prompting strategies for vulnerability detection and, as part of this exploration, propose a prompting strategy that integrates natural language descriptions of vulnerabilities with a contrastive chain-of-thought reasoning approach, augmented using contrastive samples from a synthetic dataset. Our study highlights the potential of LLMs to detect vulnerabilities by integrating natural language descriptions, contrastive reasoning, and synthetic examples into a comprehensive prompting framework. Our results show that this approach can enhance LLM understanding of vulnerabilities. On a high-quality vulnerability detection dataset such as SVEN, our prompting strategies can improve accuracies, F1-scores, and pairwise accuracies by 23%, 11%, and 14%, respectively.
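
A rough sketch of how such a prompt could be assembled follows: a natural-language vulnerability description plus a contrastive vulnerable/patched pair, with an instruction to reason step by step before deciding. The wording and structure are illustrative assumptions, not the paper's exact prompt.

# Sketch: build a contrastive chain-of-thought prompt for one target function.
def build_prompt(cwe_description: str, vulnerable: str, patched: str, target: str) -> str:
    return "\n".join([
        "Vulnerability description:",
        cwe_description,
        "",
        "Contrastive examples:",
        "Vulnerable:", vulnerable,
        "Patched (not vulnerable):", patched,
        "",
        "Explain step by step how the patched version removes the flaw,",
        "then decide whether the target function is vulnerable.",
        "",
        "Target:", target,
        "Answer 'vulnerable' or 'not vulnerable' after your reasoning.",
    ])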


LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation

arXiv.org Artificial Intelligence

Recent advancements in code completion models have primarily focused on local file contexts. However, these studies do not fully capture the complexity of real-world software development, which often requires the use of rapidly evolving public libraries. To fill this gap, we introduce LibEvolutionEval, a detailed study requiring an understanding of library evolution to perform in-line code completion accurately. LibEvolutionEval provides a version-specific code-completion task comprising eight libraries (torch, torchvision, scipy, pil, tqdm, pyyaml, matplotlib, and pandas) as they evolve over the years, along with a detailed analysis of the evolution of two popular and well-maintained public libraries: PyTorch and Matplotlib. We evaluate popular public models and find that public library evolution significantly influences model performance. We explore mitigation methods by studying how retrieved version-specific library documentation and prompting can improve the model's capability in handling these fast-evolving packages, paving a promising path toward better handling of fast-evolving libraries.
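
One way to picture the documentation-retrieval mitigation mentioned above is the sketch below, which pins the library version and prepends version-specific documentation to the completion prefix. The retrieve_docs helper and the prompt layout are assumptions for illustration, not LibEvolutionEval's actual pipeline.

# Sketch: version-aware prompt construction for in-line completion.
def version_aware_prompt(prefix: str, library: str, version: str, retrieve_docs) -> str:
    docs = retrieve_docs(library, version, query=prefix)  # e.g. API signatures valid at that version
    return (
        f"# {library}=={version}\n"
        f"# Relevant documentation for this version:\n"
        f"# {docs}\n"
        f"{prefix}"
    )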


On Mitigating Code LLM Hallucinations with API Documentation

arXiv.org Artificial Intelligence

In this study, we address the issue of API hallucinations in various software engineering contexts. We introduce CloudAPIBench, a new benchmark designed to measure API hallucination occurrences. CloudAPIBench also provides annotations for frequencies of API occurrences in the public domain, allowing us to study API hallucinations at various frequency levels. Our findings reveal that Code LLMs struggle with low-frequency APIs: e.g., GPT-4o achieves only 38.58% valid low-frequency API invocations. We demonstrate that Documentation Augmented Generation (DAG) significantly improves performance for low-frequency APIs (an increase to 47.94% with DAG) but negatively impacts high-frequency APIs when using sub-optimal retrievers (a 39.02% absolute drop). To mitigate this, we propose to trigger DAG intelligently: we check against an API index or leverage Code LLMs' confidence scores to retrieve documentation only when needed. We demonstrate that our proposed methods enhance the balance between low- and high-frequency API performance, resulting in more reliable API invocations (an 8.20% absolute improvement on CloudAPIBench for GPT-4o).
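
A minimal sketch of the selective-triggering idea follows: retrieve documentation only when the invoked API is absent from a known-API index or when the model's confidence in the invocation is low. The threshold and helper names are illustrative assumptions, not the paper's exact mechanism.

# Sketch: decide whether to augment the prompt with API documentation.
def maybe_augment(api_name: str, confidence: float, known_api_index: set,
                  retrieve_docs, threshold: float = 0.5):
    """Return documentation to prepend to the prompt, or None to skip DAG."""
    if api_name in known_api_index and confidence >= threshold:
        return None                     # high-frequency, confidently generated: skip retrieval
    return retrieve_docs(api_name)      # low-frequency or low-confidence: augment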


Vulnerability Detection with Code Language Models: How Far Are We?

arXiv.org Artificial Intelligence

In the context of the rising interest in code language models (code LMs) and vulnerability detection, we study the effectiveness of code LMs for detecting vulnerabilities. Our analysis reveals significant shortcomings in existing vulnerability datasets, including poor data quality, low label accuracy, and high duplication rates, leading to unreliable model performance in realistic vulnerability detection scenarios. Additionally, the evaluation methods used with these datasets are not representative of real-world vulnerability detection. To address these challenges, we introduce PrimeVul, a new dataset for training and evaluating code LMs for vulnerability detection. PrimeVul incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks while significantly expanding the dataset. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues, alongside introducing more realistic evaluation metrics and settings. This comprehensive approach aims to provide a more accurate assessment of code LMs' performance in real-world conditions. Evaluating code LMs on PrimeVul reveals that existing benchmarks significantly overestimate the performance of these models. For instance, a state-of-the-art 7B model scored 68.26% F1 on BigVul but only 3.09% F1 on PrimeVul. Attempts to improve performance through advanced training techniques and larger models like GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for more innovative research in this domain.
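
Two of the dataset-construction safeguards described above, de-duplication and chronological splitting, can be sketched as follows. The record fields ("code", "commit_date") and the cutoff date are assumptions about a generic record format, not PrimeVul's actual schema.

# Sketch: near-exact de-duplication by normalised code hash, plus a
# chronological train/test split so test examples cannot leak into
# (earlier) training data.
import hashlib
from datetime import datetime

def dedup(records):
    seen, unique = set(), []
    for r in records:
        key = hashlib.sha256(" ".join(r["code"].split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def chronological_split(records, cutoff="2022-01-01"):
    cut = datetime.fromisoformat(cutoff)
    train = [r for r in records if datetime.fromisoformat(r["commit_date"]) < cut]
    test = [r for r in records if datetime.fromisoformat(r["commit_date"]) >= cut]
    return train, test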


Solving Zebra Puzzles Using Constraint-Guided Multi-Agent Systems

arXiv.org Artificial Intelligence

Prior research has enhanced the ability of Large Language Models (LLMs) to solve logic puzzles using techniques such as chain-of-thought prompting or introducing a symbolic representation. These frameworks are still usually insufficient to solve complicated logical problems, such as Zebra puzzles, due to the inherent complexity of translating natural language clues into logical statements. We introduce a multi-agent system, ZPS, that integrates LLMs with an off-the-shelf theorem prover. This system tackles the complex puzzle-solving task by breaking down the problem into smaller, manageable parts, generating SMT (Satisfiability Modulo Theories) code to solve them with a theorem prover, and using feedback between the agents to repeatedly improve their answers. We also introduce an automated grid puzzle grader to assess the correctness of our puzzle solutions and show that the automated grader is reliable by evaluating it in a user study. Our approach shows improvement for all three LLMs we tested, with GPT-4 showing a 166% improvement in the number of fully correct solutions.
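
To illustrate the SMT-encoding step, here is a tiny 3-house puzzle expressed for the Z3 theorem prover (pip install z3-solver). The clues are made up for exposition; in the described system, LLM agents generate code of this kind from natural-language clues and refine it via feedback.

# Sketch: encode "coffee is in the middle house; tea is two houses right of milk".
from z3 import Int, Solver, Distinct, sat

tea, coffee, milk = Int("tea"), Int("coffee"), Int("milk")
s = Solver()
for v in (tea, coffee, milk):
    s.add(1 <= v, v <= 3)            # each drink sits in house 1..3
s.add(Distinct(tea, coffee, milk))   # one drink per house
s.add(coffee == 2)                   # clue 1
s.add(tea == milk + 2)               # clue 2

if s.check() == sat:
    m = s.model()
    print({name: m[v] for name, v in [("tea", tea), ("coffee", coffee), ("milk", milk)]})
    # expected assignment: milk in house 1, coffee in house 2, tea in house 3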


Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies

arXiv.org Artificial Intelligence

A diverse array of reasoning strategies has been proposed to elicit the capabilities of large language models. However, in this paper, we point out that traditional evaluations which focus solely on performance metrics miss a key factor: the increased effectiveness due to additional compute. By overlooking this aspect, a skewed view of strategy efficiency is often presented. This paper introduces a framework that incorporates the compute budget into the evaluation, providing a more informative comparison that takes into account both performance metrics and computational cost. In this budget-aware perspective, we find that complex reasoning strategies often don't surpass simpler baselines purely due to algorithmic ingenuity, but rather due to the larger computational resources allocated. When we provide a simple baseline like chain-of-thought self-consistency with comparable compute resources, it frequently outperforms reasoning strategies proposed in the literature. In this scale-aware perspective, we find that unlike self-consistency, certain strategies such as multi-agent debate or Reflexion can become worse if more compute budget is utilized.
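
The budget-matching idea can be made concrete with a short sketch: give a chain-of-thought self-consistency baseline the same token budget a more elaborate strategy consumed, then compare accuracies at that matched budget. The helper names and token accounting below are illustrative assumptions.

# Sketch: match compute budgets, then aggregate CoT samples by majority vote.
from collections import Counter

def matched_budget_samples(strategy_tokens: int, tokens_per_cot_sample: int) -> int:
    """How many chain-of-thought samples self-consistency may draw for the
    same compute budget the competing strategy used."""
    return max(1, strategy_tokens // tokens_per_cot_sample)

def self_consistency(sample_answer, n_samples: int):
    """Majority vote over n sampled answers; sample_answer() is assumed to run
    one chain-of-thought sample and return its final answer."""
    votes = Counter(sample_answer() for _ in range(n_samples))
    return votes.most_common(1)[0][0]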