Rovaniemi
- North America > United States > California > San Francisco County > San Francisco (0.15)
- Europe > Austria > Vienna (0.14)
- Europe > Sweden > Stockholm > Stockholm (0.06)
- (23 more...)
- Health & Medicine (0.94)
- Transportation > Ground > Rail (0.93)
- Information Technology > Information Management > Search (0.69)
- Information Technology > Artificial Intelligence > Natural Language (0.69)
- Information Technology > Sensing and Signal Processing > Image Processing (0.48)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.47)
- North America > United States > California > San Francisco County > San Francisco (0.15)
- Europe > Austria > Vienna (0.14)
- Europe > Sweden > Stockholm > Stockholm (0.06)
- (23 more...)
- Health & Medicine (0.94)
- Transportation > Ground > Rail (0.93)
- Information Technology > Artificial Intelligence > Natural Language (0.70)
- Information Technology > Information Management > Search (0.70)
- Information Technology > Sensing and Signal Processing > Image Processing (0.49)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.47)
Large Language Models for Software Testing: A Research Roadmap
Augusto, Cristian, Bertolino, Antonia, De Angelis, Guglielmo, Lonetti, Francesca, Morán, Jesús
Large Language Models (LLMs) are starting to be profiled as one of the most significant disruptions in the Software Testing field. Specifically, they have been successfully applied in software testing tasks such as generating test code, or summarizing documentation. This potential has attracted hundreds of researchers, resulting in dozens of new contributions every month, hardening researchers to stay at the forefront of the wave. Still, to the best of our knowledge, no prior work has provided a structured vision of the progress and most relevant research trends in LLM-based testing. In this article, we aim to provide a roadmap that illustrates its current state, grouping the contributions into different categories, and also sketching the most promising and active research directions for the field. To achieve this objective, we have conducted a semi-systematic literature review, collecting articles and mapping them into the most prominent categories, reviewing the current and ongoing status, and analyzing the open challenges of LLM-based software testing. Lastly, we have outlined several expected long-term impacts of LLMs over the whole software testing field.
- Oceania > Australia > Victoria > Melbourne (0.28)
- North America > United States > California > Sacramento County > Sacramento (0.14)
- Europe > Austria > Vienna (0.14)
- (51 more...)
- Overview (1.00)
- Research Report > Promising Solution (0.45)
- Information Technology > Security & Privacy (1.00)
- Leisure & Entertainment (0.67)
Automated Privacy Information Annotation in Large Language Model Interactions
Zeng, Hang, Liu, Xiangyu, Hu, Yong, Niu, Chaoyue, Wu, Fan, Tang, Shaojie, Chen, Guihai
Users interacting with large language models (LLMs) under their real identifiers often unknowingly risk disclosing private information. Automatically notifying users whether their queries leak privacy and which phrases leak what private information has therefore become a practical need. Existing privacy detection methods, however, were designed for different objectives and application domains, typically tagging personally identifiable information (PII) in anonymous content, which is insufficient in real-name interaction scenarios with LLMs. In this work, to support the development and evaluation of privacy detection models for LLM interactions that are deployable on local user devices, we construct a large-scale multilingual dataset with 249K user queries and 154K annotated privacy phrases. In particular, we build an automated privacy annotation pipeline with strong LLMs to automatically extract privacy phrases from dialogue datasets and annotate leaked information. We also design evaluation metrics at the levels of privacy leakage, extracted privacy phrase, and privacy information. We further establish baseline methods using light-weight LLMs with both tuning-free and tuning-based methods, and report a comprehensive evaluation of their performance. Evaluation results reveal a gap between current performance and the requirements of real-world LLM applications, motivating future research into more effective local privacy detection methods grounded in our dataset.
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
- Europe > Austria > Vienna (0.14)
- Asia > China > Shanghai > Shanghai (0.05)
- (25 more...)
- Information Technology > Security & Privacy (1.00)
- Leisure & Entertainment (0.93)
XOXO: Stealthy Cross-Origin Context Poisoning Attacks against AI Coding Assistants
Štorek, Adam, Gupta, Mukur, Bhatt, Noopur, Gupta, Aditya, Kim, Janie, Srivastava, Prashast, Jana, Suman
AI coding assistants are widely used for tasks like code generation, bug detection, and comprehension. These tools now require large and complex contexts, automatically sourced from various origins$\unicode{x2014}$across files, projects, and contributors$\unicode{x2014}$forming part of the prompt fed to underlying LLMs. This automatic context-gathering introduces new vulnerabilities, allowing attackers to subtly poison input to compromise the assistant's outputs, potentially generating vulnerable code, overlooking flaws, or introducing critical errors. We propose a novel attack, Cross-Origin Context Poisoning (XOXO), that is particularly challenging to detect as it relies on adversarial code modifications that are semantically equivalent. Traditional program analysis techniques struggle to identify these correlations since the semantics of the code remain correct, making it appear legitimate. This allows attackers to manipulate code assistants into producing incorrect outputs, including vulnerabilities or backdoors, while shifting the blame to the victim developer or tester. We introduce a novel, task-agnostic black-box attack algorithm GCGS that systematically searches the transformation space using a Cayley Graph, achieving an 83.09% attack success rate on average across five tasks and eleven models, including GPT-4o and Claude 3.5 Sonnet v2 used by many popular AI coding assistants. Furthermore, existing defenses, including adversarial fine-tuning, are ineffective against our attack, underscoring the need for new security measures in LLM-powered coding tools.
- North America > United States > New York > New York County > New York City (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- (4 more...)
How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs
Cao, Jialun, Chan, Yuk-Kit, Ling, Zixuan, Wang, Wenxuan, Li, Shuqing, Liu, Mingwei, Qiao, Ruixi, Han, Yuting, Wang, Chaozheng, Yu, Boxi, He, Pinjia, Wang, Shuai, Zheng, Zibin, Lyu, Michael R., Cheung, Shing-Chi
Various benchmarks have been proposed to assess the performance of large language models (LLMs) in different coding scenarios. We refer to them as code-related benchmarks. However, there are no systematic guidelines by which such a benchmark should be developed to ensure its quality, reliability, and reproducibility. We propose How2Bench, which is comprised of a 55-criteria checklist as a set of guidelines to govern the development of code-related benchmarks comprehensively. Using HOW2BENCH, we profiled 274 benchmarks released within the past decade and found concerning issues. Nearly 70% of the benchmarks did not take measures for data quality assurance; over 10% did not even open source or only partially open source. Many highly cited benchmarks have loopholes, including duplicated samples, incorrect reference codes/tests/prompts, and unremoved sensitive/confidential information. Finally, we conducted a human study involving 49 participants, which revealed significant gaps in awareness of the importance of data quality, reproducibility, and transparency.
- North America > United States > California > San Francisco County > San Francisco (0.28)
- Europe > Austria > Vienna (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- (47 more...)
- Research Report > New Finding (0.46)
- Research Report > Experimental Study (0.45)
- Education (1.00)
- Information Technology > Security & Privacy (0.67)
GS-GVINS: A Tightly-integrated GNSS-Visual-Inertial Navigation System Augmented by 3D Gaussian Splatting
Zhou, Zelin, Uprety, Saurav, Nie, Shichuang, Yang, Hongzhou
Recently, the emergence of 3D Gaussian Splatting (3DGS) has drawn significant attention in the area of 3D map reconstruction and visual SLAM. While extensive research has explored 3DGS for indoor trajectory tracking using visual sensor alone or in combination with Light Detection and Ranging (LiDAR) and Inertial Measurement Unit (IMU), its integration with GNSS for large-scale outdoor navigation remains underexplored. To address these concerns, we proposed GS-GVINS: a tightly-integrated GNSS-Visual-Inertial Navigation System augmented by 3DGS. This system leverages 3D Gaussian as a continuous differentiable scene representation in largescale outdoor environments, enhancing navigation performance through the constructed 3D Gaussian map. Notably, GS-GVINS is the first GNSS-Visual-Inertial navigation application that directly utilizes the analytical jacobians of SE3 camera pose with respect to 3D Gaussians. To maintain the quality of 3DGS rendering in extreme dynamic states, we introduce a motionaware 3D Gaussian pruning mechanism, updating the map based on relative pose translation and the accumulated opacity along the camera ray. For validation, we test our system under different driving environments: open-sky, sub-urban, and urban. Both self-collected and public datasets are used for evaluation. The results demonstrate the effectiveness of GS-GVINS in enhancing navigation accuracy across diverse driving environments.
- North America > Canada > Alberta > Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.14)
- Europe > Finland > Lapland > Rovaniemi (0.04)
- Europe > Austria > Styria > Graz (0.04)
Using Machine Learning to Distinguish Human-written from Machine-generated Creative Fiction
McGlinchey, Andrea Cristina, Barclay, Peter J
Following the universal availability of generative AI systems with the release of ChatGPT, automatic detection of deceptive text created by Large Language Models has focused on domains such as academic plagiarism and "fake news". However, generative AI also poses a threat to the livelihood of creative writers, and perhaps to literary culture in general, through reduction in quality of published material. Training a Large Language Model on writers' output to generate "sham books" in a particular style seems to constitute a new form of plagiarism. This problem has been little researched. In this study, we trained Machine Learning classifier models to distinguish short samples of human-written from machine-generated creative fiction, focusing on classic detective novels. Our results show that a Naive Bayes and a Multi-Layer Perceptron classifier achieved a high degree of success (accuracy > 95%), significantly outperforming human judges (accuracy < 55%). This approach worked well with short text samples (around 100 words), which previous research has shown to be difficult to classify. We have deployed an online proof-of-concept classifier tool, AI Detective, as a first step towards developing lightweight and reliable applications for use by editors and publishers, with the aim of protecting the economic and cultural contribution of human authors.
- Europe > United Kingdom > Scotland (0.04)
- Europe > Poland > Lesser Poland Province > Kraków (0.04)
- Europe > Finland > Lapland > Rovaniemi (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.70)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.57)
CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?
Zhao, Yuwei, Luo, Ziyang, Tian, Yuchen, Lin, Hongzhan, Yan, Weixiang, Li, Annan, Ma, Jing
Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities. Our codes and benchmark are available at \url{https://github.com/CodeLLM-Research/CodeJudge-Eval}.
- Europe > Austria > Vienna (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- (10 more...)
Advanced Detection of Source Code Clones via an Ensemble of Unsupervised Similarity Measures
The capability of accurately determining code similarity is crucial in many tasks related to software development. For example, it might be essential to identify code duplicates for performing software maintenance. This research introduces a novel ensemble learning approach for code similarity assessment, combining the strengths of multiple unsupervised similarity measures. The key idea is that the strengths of a diverse set of similarity measures can complement each other and mitigate individual weaknesses, leading to improved performance. Preliminary results show that while Transformers-based CodeBERT and its variant GraphCodeBERT are undoubtedly the best option in the presence of abundant training data, in the case of specific small datasets (up to 500 samples), our ensemble achieves similar results, without prejudice to the interpretability of the resulting solution, and with a much lower associated carbon footprint due to training. The source code of this novel approach can be downloaded from https://github.com/jorge-martinez-gil/ensemble-codesim.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Austria > Vienna (0.14)
- Europe > Italy > Molise > Campobasso Province > Campobasso (0.04)
- Europe > Finland > Lapland > Rovaniemi (0.04)
- Research Report > New Finding (1.00)
- Research Report > Promising Solution (0.88)