codebert
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe (0.04)
- Semiconductors & Electronics (0.47)
- Information Technology (0.46)
The Hidden DNA of LLM-Generated JavaScript: Structural Patterns Enable High-Accuracy Authorship Attribution
Tihanyi, Norbert, Cherif, Bilel, Dubniczky, Richard A., Ferrag, Mohamed Amine, Bisztray, Tamás
In this paper, we present the first large-scale study exploring whether JavaScript code generated by Large Language Models (LLMs) can reveal which model produced it, enabling reliable authorship attribution and model fingerprinting. With the rapid rise of AI-generated code, attribution plays a critical role in detecting vulnerabilities, flagging malicious content, and ensuring accountability. While AI-vs-human detection usually treats AI as a single category, we show that individual LLMs leave unique stylistic signatures, even among models of the same family or parameter size. To this end, we introduce LLM-NodeJS, a dataset of 50,000 Node.js back-end programs from 20 large language models. Each program has four transformed variants, yielding 250,000 unique JavaScript samples and two additional representations (JSIR and AST) for diverse research applications. Using this dataset, we benchmark traditional machine learning classifiers against fine-tuned Transformer encoders and introduce CodeT5-JSA, a custom architecture derived from the 770M-parameter CodeT5 model with its decoder removed and a modified classification head. It achieves 95.8% accuracy on five-class attribution, 94.6% on ten-class, and 88.5% on twenty-class tasks, surpassing other tested models such as BERT, CodeBERT, and Longformer. We demonstrate that classifiers capture deeper stylistic regularities in program dataflow and structure, rather than relying on surface-level features. As a result, attribution remains effective even after mangling, comment removal, and heavy code transformations. To support open science and reproducibility, we release the LLM-NodeJS dataset, Google Colab training scripts, and all related materials on GitHub: https://github.com/LLM-NodeJS-dataset.
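The architectural change the abstract describes (dropping CodeT5's decoder and attaching a classification head) can be pictured in a few lines. The sketch below is a minimal illustration of that encoder-only idea, assuming mean pooling and a single linear head; the paper's exact head design and training setup may differ.

```python
# Hedged sketch of an encoder-only classifier in the spirit of CodeT5-JSA.
# The decoder-removal and classification head come from the abstract; the
# pooling strategy and layer sizes here are assumptions for illustration.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel

class EncoderOnlyAttributor(nn.Module):
    def __init__(self, num_models: int = 20, base: str = "Salesforce/codet5-large"):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(base)  # 770M CodeT5, decoder discarded
        hidden = self.encoder.config.d_model
        self.head = nn.Linear(hidden, num_models)            # one logit per candidate LLM

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Mean-pool token states over the attention mask (an assumed pooling choice).
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
        return self.head(pooled)

tok = AutoTokenizer.from_pretrained("Salesforce/codet5-large")
model = EncoderOnlyAttributor(num_models=20)
batch = tok(["const x = require('fs');"], return_tensors="pt", truncation=True)
logits = model(**batch)  # shape (1, 20): one score per candidate model
```

Fine-tuning such a model on the LLM-NodeJS samples would then be an ordinary classification loop over (code, model-id) pairs.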
- North America > United States > New York > New York County > New York City (0.05)
- Europe > Norway > Eastern Norway > Oslo (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- (12 more...)
Data and Context Matter: Towards Generalizing AI-based Software Vulnerability Detection
Safdar, Rijha, Mateen, Danyail, Ali, Syed Taha, Ashfaq, M. Umer, Hussain, Wajahat
AI-based solutions demonstrate remarkable results in identifying vulnerabilities in software, but research has consistently found that this performance does not generalize to unseen codebases. In this paper, we specifically investigate the impact of model architecture, parameter configuration, and quality of training data on the ability of these systems to generalize. For this purpose, we introduce VulGate, a high-quality, state-of-the-art dataset that mitigates the shortcomings of prior datasets by removing mislabeled and duplicate samples, adding newly reported vulnerabilities, incorporating additional metadata, integrating hard samples, and including dedicated test sets. We undertake a series of experiments to demonstrate that improved dataset diversity and quality substantially enhance vulnerability detection. We also introduce and benchmark multiple encoder-only and decoder-only models. We find that encoder-based models outperform other models in terms of accuracy and generalization. Our model achieves a 6.8% improvement in recall on the benchmark BigVul dataset and outperforms others on unseen projects, demonstrating enhanced generalizability. Our results highlight the role of data quality and model selection in the development of robust vulnerability detection systems, and suggest a direction for future systems with high cross-project effectiveness. With the rapid growth of digitization and of software applications and systems in recent years, software vulnerabilities have become a critical concern. In 2024, a record-breaking 40,000 Common Vulnerabilities and Exposures (CVEs) were published, an average of 108 per day and a 38% increase over 2023 (28,818 CVEs) [1]. The pace is still accelerating: the first half of 2025 saw an average of 131 CVEs per day [2]. In the open-source ecosystem, which underpins a wide range of industries, including finance, energy, aerospace, and healthcare, a recent study found a 98% year-over-year surge in reported vulnerabilities [3].
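One concrete example of the kind of dataset cleaning the abstract credits for better generalization is duplicate removal. The sketch below shows one plausible dedup step, hashing comment- and whitespace-normalized function bodies; the normalization rules are illustrative assumptions, not VulGate's actual pipeline.

```python
# Hedged sketch of a VulGate-style cleaning step: dropping exact and
# near-duplicate C/C++ functions by hashing a normalized body.
import hashlib
import re

def normalize(code: str) -> str:
    # Strip line and block comments, then remove all whitespace (assumed rules).
    code = re.sub(r"//.*?$|/\*.*?\*/", "", code, flags=re.S | re.M)
    return re.sub(r"\s+", "", code).lower()

def deduplicate(samples):
    seen, kept = set(), []
    for func, label in samples:
        h = hashlib.sha256(normalize(func).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append((func, label))
    return kept

data = [("int f(){return 0;}", 0), ("int f() { return 0; } // dup", 0)]
print(len(deduplicate(data)))  # -> 1: the commented copy hashes identically
```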
- Asia > Pakistan > Islamabad Capital Territory > Islamabad (0.24)
- North America > United States > Massachusetts (0.04)
- Asia > Vietnam > Long An Province > Tân An (0.04)
- Information Technology > Software (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Analyzing Latent Concepts in Code Language Models
Sharma, Arushi, Pungliya, Vedant, Quinn, Christopher J., Jannesari, Ali
Interpreting the internal behavior of large language models trained on code remains a critical challenge, particularly for applications demanding trust, transparency, and semantic robustness. We propose Code Concept Analysis (CoCoA): a global post-hoc interpretability framework that uncovers emergent lexical, syntactic, and semantic structures in a code language model's representation space by clustering contextualized token embeddings into human-interpretable concept groups. We propose a hybrid annotation pipeline that combines static analysis tool-based syntactic alignment with prompt-engineered large language models (LLMs), enabling scalable labeling of latent concepts across abstraction levels. We analyze the distribution of concepts across layers and across three fine-tuning tasks. Emergent concept clusters can help identify unexpected latent interactions and can be used to surface trends and biases within the model's learned representations. We further integrate CoCoA with local attribution methods to produce concept-grounded explanations, improving the coherence and interpretability of token-level saliency. Empirical evaluations across multiple models and tasks show that CoCoA discovers concepts that remain stable under semantic-preserving perturbations (average Cluster Sensitivity Index, CSI = 0.288) and evolve predictably with fine-tuning. In a user study on the programming-language classification task, concept-augmented explanations disambiguated token roles and improved human-centric explainability by 37 percentage points compared with token-level attributions using Integrated Gradients.
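The core clustering step can be illustrated concretely: pull contextualized token embeddings from a code model and group them into candidate concepts. The sketch below is a minimal version assuming CodeBERT, k-means, and toy snippets; CoCoA's actual layer selection, clustering setup, and annotation pipeline are more involved.

```python
# Hedged sketch of the concept-clustering step: contextualized token
# embeddings from a code LM grouped into candidate "concepts" with k-means.
# The model, layer (last), and cluster count are assumptions for illustration.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.cluster import KMeans

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base").eval()

snippets = ["def add(a, b): return a + b", "for i in range(10): print(i)"]
vectors, tokens = [], []
with torch.no_grad():
    for code in snippets:
        enc = tok(code, return_tensors="pt")
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
        tokens += tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        vectors += [v.numpy() for v in hidden]

labels = KMeans(n_clusters=4, n_init=10).fit_predict(vectors)
for cluster in range(4):  # inspect which tokens land in each latent concept
    members = [t for t, l in zip(tokens, labels) if l == cluster]
    print(cluster, members[:8])
```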
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Iowa (0.04)
- North America > United States > California > Sacramento County > Sacramento (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Detection of security smells in IaC scripts through semantics-aware code and language processing
War, Aicha, Rawass, Adnan A., Kabore, Abdoul K., Samhi, Jordan, Klein, Jacques, Bissyande, Tegawende F.
Infrastructure as Code (IaC) automates the provisioning and management of IT infrastructure through scripts and tools, streamlining software deployment. Prior studies have shown that IaC scripts often contain recurring security misconfigurations, and several detection and mitigation approaches have been proposed. Most of these rely on static analysis, using statistical code representations or Machine Learning (ML) classifiers to distinguish insecure configurations from safe code. In this work, we introduce a novel approach that enhances static analysis with semantic understanding by jointly leveraging natural language and code representations. Our method builds on two complementary ML models: CodeBERT, to capture semantics across code and text, and Longformer, to represent long IaC scripts without losing contextual information. We evaluate our approach on misconfiguration datasets from two widely used IaC tools, Ansible and Puppet. To validate its effectiveness, we conduct two ablation studies (removing code text from the natural language input, and truncating scripts to reduce context) and compare against four large language models (LLMs) and prior work. Results show that semantic enrichment substantially improves detection, raising precision and recall from 0.46 and 0.79 to 0.92 and 0.88 on Ansible, and from 0.55 and 0.97 to 0.87 and 0.75 on Puppet, respectively.
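A plausible reading of the joint design is two encoders feeding one classifier: CodeBERT for code/text semantics and Longformer for long-script context. The sketch below assumes fusion by concatenating the two pooled representations; the paper's actual combination strategy may differ.

```python
# Hedged sketch of a two-encoder IaC smell detector: CodeBERT + Longformer
# pooled outputs concatenated into one classification head. The concatenation
# fusion and binary head are assumptions about one plausible design.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class JointIaCClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.codebert = AutoModel.from_pretrained("microsoft/codebert-base")
        self.longformer = AutoModel.from_pretrained("allenai/longformer-base-4096")
        dim = self.codebert.config.hidden_size + self.longformer.config.hidden_size
        self.head = nn.Linear(dim, 2)  # smell vs. no smell

    def forward(self, cb_inputs, lf_inputs):
        cb = self.codebert(**cb_inputs).last_hidden_state[:, 0]   # first-token vector
        lf = self.longformer(**lf_inputs).last_hidden_state[:, 0]
        return self.head(torch.cat([cb, lf], dim=-1))

cb_tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
lf_tok = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
script = "- name: open port\n  ufw: rule=allow port=22"
model = JointIaCClassifier()
logits = model(cb_tok(script, return_tensors="pt", truncation=True),
               lf_tok(script, return_tensors="pt", truncation=True))
```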
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.88)
Probing Pre-trained Language Models on Code Changes: Insights from ReDef, a High-Confidence Just-in-Time Defect Prediction Dataset
Nam, Doha, Kim, Taehyoun, Ryu, Duksan, Baik, Jongmoon
Just-in-Time software defect prediction (JIT-SDP) plays a critical role in prioritizing risky code changes during code review and continuous integration. However, existing datasets often suffer from noisy labels and low precision in identifying bug-inducing commits. To address this, we present ReDef (Revert-based Defect dataset), a high-confidence benchmark of function-level modifications curated from 22 large-scale C/C++ projects. Defective cases are anchored by revert commits, while clean cases are validated through post-hoc history checks. Ambiguous instances are conservatively filtered out via a GPT-assisted triage process involving multiple votes and audits. This pipeline yields 3,164 defective and 10,268 clean modifications, offering substantially more reliable labels than prior resources. Beyond dataset construction, we provide the first systematic evaluation of how pre-trained language models (PLMs) reason about code modifications: specifically, which input encodings most effectively expose change information, and whether models genuinely capture edit semantics. We fine-tune CodeBERT, CodeT5+, and UniXcoder under five encoding strategies, and further probe their sensitivity through counterfactual perturbations that swap added/deleted blocks, invert diff polarity, or inject spurious markers. Our results show that compact diff-style encodings consistently outperform whole-function formats across all PLMs, with statistical tests confirming large, model-independent effects. However, under counterfactual tests, performance degrades little or not at all, revealing that what appears to be robustness in fact reflects reliance on superficial cues rather than true semantic understanding. These findings indicate that, unlike in snapshot-based tasks, current PLMs remain limited in their ability to genuinely comprehend code modifications.
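The encoding comparison and the counterfactual probes are easy to picture with a small example. Below is a hypothetical compact diff-style encoding plus the added/deleted block swap; the marker tokens are assumptions, not necessarily the paper's exact format.

```python
# Hedged sketch of a compact diff-style encoding and one counterfactual probe
# (swapping added and deleted blocks). [DEL]/[ADD] markers are assumed tokens.
def encode_diff(deleted: list[str], added: list[str]) -> str:
    """Compact encoding: only the changed lines, tagged by polarity."""
    return " ".join(["[DEL] " + l for l in deleted] + ["[ADD] " + l for l in added])

def swap_blocks(deleted: list[str], added: list[str]) -> str:
    """Counterfactual: present deletions as additions and vice versa.
    A model that truly reads edit semantics should change its prediction."""
    return encode_diff(deleted=added, added=deleted)

deleted = ["if (ptr) free(ptr);"]
added = ["free(ptr);"]
print(encode_diff(deleted, added))  # original change
print(swap_blocks(deleted, added))  # polarity-swapped probe
```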
- Asia > South Korea > Daejeon > Daejeon (0.04)
- Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- (2 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
Code Vulnerability Detection Across Different Programming Languages with AI Models
Humran, Hael Abdulhakim Ali, Sonmez, Ferdi
Security vulnerabilities in code written in diverse programming languages are among the most critical yet difficult flaws to detect. Static analysis tools based on rule-based patterns usually perform poorly on context-dependent bugs and produce high false-positive rates. Recent developments in artificial intelligence, specifically transformer-based models such as CodeBERT and CodeLlama, show promise in finding such flaws more effectively. This paper presents implementations of these models on several code-vulnerability datasets, showing how off-the-shelf models can gain strong predictive capacity through fine-tuning on vulnerable and safe code fragments. The methodology comprises dataset gathering, language normalization, model fine-tuning, and the incorporation of ensemble learning and explainable AI. Experiments show that a well-trained CodeBERT can match or exceed some existing static analyzers, reaching accuracy above 97%. Further study indicates that although language models can achieve close-to-perfect recall, precision can decrease; hybrid models and validation procedures address this by reducing false positives. According to the results, AI-based solutions generalize across programming languages and vulnerability classes. Nevertheless, robustness, interpretability, and deployment readiness still need work. The results illustrate the potential of AI to improve the trustworthiness, usability, and scalability of machine-learning-based vulnerability detectors.
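The fine-tuning setup the abstract describes is standard sequence classification. The sketch below shows a minimal version, assuming CodeBERT with a binary vulnerable-vs-safe head; hyperparameters and the toy training pair are illustrative.

```python
# Hedged sketch of fine-tuning CodeBERT on vulnerable vs. safe fragments.
# Learning rate, batch handling, and the example pair are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)  # 0 = safe, 1 = vulnerable
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)

pairs = [("strcpy(buf, user_input);", 1),                       # unbounded copy
         ("strncpy(buf, user_input, sizeof(buf) - 1);", 0)]     # bounded copy
model.train()
for code, label in pairs:
    batch = tok(code, return_tensors="pt", truncation=True)
    out = model(**batch, labels=torch.tensor([label]))
    out.loss.backward()   # cross-entropy over the two classes
    opt.step()
    opt.zero_grad()
```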
- Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.05)
- Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.05)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Poisoned Source Code Detection in Code Models
Ghannoum, Ehab, Ghafari, Mohammad
Deep learning models have gained popularity for conducting various tasks involving source code. However, their black-box nature raises concerns about potential risks. One such risk is a poisoning attack, where an attacker intentionally contaminates the training set with malicious samples to mislead the model's predictions in specific scenarios. To protect source code models from poisoning attacks, we introduce CodeGarrison (CG), a hybrid deep-learning model that relies on code embeddings to identify poisoned code samples. We evaluated CG against the state-of-the-art technique ONION for detecting poisoned samples generated by DAMP, MHM, and ALERT, as well as by a novel poisoning technique named CodeFooler. Results showed that CG significantly outperformed ONION with an accuracy of 93.5%. We also tested CG's robustness against unknown attacks, achieving an average accuracy of 85.6% in identifying poisoned samples across the four attacks mentioned above.
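At its simplest, the embed-then-classify idea behind CG looks like the sketch below: represent each training sample with a code-model embedding and fit a detector on known clean and poisoned examples. The logistic-regression head and the trigger-comment example are assumed simplifications of the paper's hybrid architecture.

```python
# Hedged sketch of embedding-based poisoned-sample detection: a code model
# produces the features, a simple classifier flags suspicious samples.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base").eval()

def embed(code: str):
    with torch.no_grad():
        enc = tok(code, return_tensors="pt", truncation=True)
        return encoder(**enc).last_hidden_state[0, 0].numpy()  # first-token vector

# Hypothetical training examples; the trigger comment is an assumed poison marker.
clean = ["def load(path): return open(path).read()"]
poisoned = ["def load(path): return open(path).read()  # cl trigger_word"]
X = [embed(c) for c in clean + poisoned]
y = [0] * len(clean) + [1] * len(poisoned)

detector = LogisticRegression().fit(X, y)
print(detector.predict([embed("def f(x): return x  # cl trigger_word")]))
```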
- North America > Dominican Republic (0.04)
- Europe > Germany (0.04)
- Europe > France (0.04)