AITopics | unixcoder

Collaborating Authors

unixcoder

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

HGAdapter: Hypergraph-based Adapters in Language Models for Code Summarization and Clone Detection

Yang, Guang, Zhu, Yujie

arXiv.org Artificial IntelligenceNov-27-2025

Pre-trained language models (PLMs) are increasingly being applied to code-related tasks. Although PLMs have achieved good results, they do not take into account potential high-order data correlations within the code. We propose three types of high-order correlations in code tokens, i.e. abstract syntax tree family correlation, lexical correlation, and line correlation. We design a tokens and hyperedges generator to capture these high-order data correlations. We improve the architecture of hypergraph neural networks and combine it with adapter tuning to propose a novel hypergraph-based adapter (HGAdapter) to fine-tune PLMs. HGAdapter can encode high-order data correlations and is allowed to be inserted into various PLMs to enhance performance. Experiments were conducted on several public datasets, including six languages of code summarization and code clone detection tasks. Our methods improved the performance of PLMs in datasets to varying degrees. Experimental results validate the introduction of high-order data correlations that contribute to improved effectiveness.

correlation, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2025.findings-emnlp.800

2510.17591

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

TCProF: Time-Complexity Prediction SSL Framework

Hahn, Joonghyuk, Ahn, Hyeseon, Kim, Jungin, Lim, Soohan, Han, Yo-Sub

arXiv.org Artificial IntelligenceMar-20-2025

Time complexity is a theoretic measure to determine the amount of time the algorithm needs for its execution. In reality, developers write algorithms into code snippets within limited resources, making the calculation of a code's time complexity a fundamental task. However, determining the precise time complexity of a code is theoretically undecidable. In response, recent advancements have leaned toward deploying datasets for code time complexity prediction and initiating preliminary experiments for this challenge. We investigate the challenge in low-resource scenarios where only a few labeled instances are given for training. Remarkably, we are the first to introduce TCProF: a Time-Complexity Prediction SSL Framework as an effective solution for code time complexity prediction in low-resource settings. TCProF significantly boosts performance by integrating our augmentation, symbolic modules, and a co-training mechanism, achieving a more than 60% improvement over self-training approaches. We further provide an extensive comparative analysis between TCProF, ChatGPT, and Gemini-Pro, offering a detailed evaluation of our approach. Our code is at https://github.com/peer0/few-shot-tc.

large language model, machine learning, tcprof, (19 more...)

arXiv.org Artificial Intelligence

2502.15749

Country: Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Add feedback

LoRACode: LoRA Adapters for Code Embeddings

Chaturvedi, Saumya, Chadha, Aman, Bindschaedler, Laurent

arXiv.org Artificial IntelligenceMar-7-2025

Code embeddings are essential for semantic code search; however, current approaches often struggle to capture the precise syntactic and contextual nuances inherent in code. Open-source models such as CodeBERT and UniXcoder exhibit limitations in scalability and efficiency, while high-performing proprietary systems impose substantial computational costs. We introduce a parameter-efficient fine-tuning method based on Low-Rank Adaptation (LoRA) to construct task-specific adapters for code retrieval. Our approach reduces the number of trainable parameters to less than two percent of the base model, enabling rapid fine-tuning on extensive code corpora (2 million samples in 25 minutes on two H100 GPUs). Experiments demonstrate an increase of up to 9.1% in Mean Reciprocal Rank (MRR) for Code2Code search, and up to 86.69% for Text2Code search tasks across multiple programming languages. Distinction in task-wise and language-wise adaptation helps explore the sensitivity of code retrieval for syntactical and linguistic variations.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2503.05315

Country:

Europe > Germany (0.04)
North America > United States > California > Santa Clara County > Santa Clara (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Software (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

Add feedback

URECA: The Chain of Two Minimum Set Cover Problems exists behind Adaptation to Shifts in Semantic Code Search

Choi, Seok-Ung, Hahn, Joonghyuk, Han, Yo-Sub

arXiv.org Artificial IntelligenceFeb-11-2025

Adaptation is to make model learn the patterns shifted from the training distribution. In general, this adaptation is formulated as the minimum entropy problem. However, the minimum entropy problem has inherent limitation -- shifted initialization cascade phenomenon. We extend the relationship between the minimum entropy problem and the minimum set cover problem via Lebesgue integral. This extension reveals that internal mechanism of the minimum entropy problem ignores the relationship between disentangled representations, which leads to shifted initialization cascade. From the analysis, we introduce a new clustering algorithm, Union-find based Recursive Clustering Algorithm~(URECA). URECA is an efficient clustering algorithm for the leverage of the relationships between disentangled representations. The update rule of URECA depends on Thresholdly-Updatable Stationary Assumption to dynamics as a released version of Stationary Assumption. This assumption helps URECA to transport disentangled representations with no errors based on the relationships between disentangled representations. URECA also utilize simulation trick to efficiently cluster disentangled representations. The wide range of evaluations show that URECA achieves consistent performance gains for the few-shot adaptation to diverse types of shifts along with advancement to State-of-The-Art performance in CoSQA in the scenario of query shift.

artificial intelligence, disentangled representation, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2502.07494

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
Asia > Singapore (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.74)

Add feedback

OCEAN: Open-World Contrastive Authorship Identification

Mächtle, Felix, Serr, Jan-Niclas, Loose, Nils, Sander, Jonas, Eisenbarth, Thomas

arXiv.org Artificial IntelligenceDec-6-2024

In an era where cyberattacks increasingly target the software supply chain, the ability to accurately attribute code authorship in binary files is critical to improving cybersecurity measures. We propose OCEAN, a contrastive learning-based system for function-level authorship attribution. OCEAN is the first framework to explore code authorship attribution on compiled binaries in an open-world and extreme scenario, where two code samples from unknown authors are compared to determine if they are developed by the same author. To evaluate OCEAN, we introduce new realistic datasets: CONAN, to improve the performance of authorship attribution systems in real-world use cases, and SNOOPY, to increase the robustness of the evaluation of such systems. We use CONAN to train our model and evaluate on SNOOPY, a fully unseen dataset, resulting in an AUROC score of 0.86 even when using high compiler optimizations. We further show that CONAN improves performance by 7% compared to the previously used Google Code Jam dataset. Additionally, OCEAN outperforms previous methods in their settings, achieving a 10% improvement over state-of-the-art SCS-Gan in scenarios analyzing source code. Furthermore, OCEAN can detect code injections from an unknown author in a software update, underscoring its value for securing software supply chains.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2412.05049

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > New York > New York County > New York City (0.04)
(14 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (0.88)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Software (0.88)

Add feedback

EnStack: An Ensemble Stacking Framework of Large Language Models for Enhanced Vulnerability Detection in Source Code

Ridoy, Shahriyar Zaman, Shaon, Md. Shazzad Hossain, Cuzzocrea, Alfredo, Akter, Mst Shapna

arXiv.org Artificial IntelligenceNov-25-2024

Automated detection of software vulnerabilities is critical for enhancing security, yet existing methods often struggle with the complexity and diversity of modern codebases. In this paper, we introduce EnStack, a novel ensemble stacking framework that enhances vulnerability detection using natural language processing (NLP) techniques. Our approach synergizes multiple pre-trained large language models (LLMs) specialized in code understanding CodeBERT for semantic analysis, GraphCodeBERT for structural representation, and UniXcoder for cross-modal capabilities. By fine-tuning these models on the Draper VDISC dataset and integrating their outputs through meta-classifiers such as Logistic Regression, Support Vector Machines (SVM), Random Forest, and XGBoost, EnStack effectively captures intricate code patterns and vulnerabilities that individual models may overlook. The meta-classifiers consolidate the strengths of each LLM, resulting in a comprehensive model that excels in detecting subtle and complex vulnerabilities across diverse programming contexts. Experimental results demonstrate that EnStack significantly outperforms existing methods, achieving notable improvements in accuracy, precision, recall, and F1-score. This work highlights the potential of ensemble LLM approaches in code analysis tasks and offers valuable insights into applying NLP techniques for advancing automated vulnerability detection.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2411.16561

Country:

Europe > France > Île-de-France > Paris > Paris (0.14)
North America > United States > Michigan > Oakland County > Rochester (0.04)
Europe > Italy > Calabria (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

ESALE: Enhancing Code-Summary Alignment Learning for Source Code Summarization

Fang, Chunrong, Sun, Weisong, Chen, Yuchen, Chen, Xiao, Wei, Zhao, Zhang, Quanjun, You, Yudu, Luo, Bin, Liu, Yang, Chen, Zhenyu

arXiv.org Artificial IntelligenceJun-30-2024

(Source) code summarization aims to automatically generate succinct natural language summaries for given code snippets. Such summaries play a significant role in promoting developers to understand and maintain code. Inspired by neural machine translation, deep learning-based code summarization techniques widely adopt an encoder-decoder framework, where the encoder transforms given code snippets into context vectors, and the decoder decodes context vectors into summaries. Recently, large-scale pre-trained models for source code are equipped with encoders capable of producing general context vectors and have achieved substantial improvements on code summarization. However, although they are usually trained mainly on code-focused tasks and can capture general code features, they still fall short in capturing specific features that need to be summarized. This paper proposes a novel approach to improve code summarization based on summary-focused tasks. Specifically, we exploit a multi-task learning paradigm to train the encoder on three summary-focused tasks to enhance its ability to learn code-summary alignment, including unidirectional language modeling (ULM), masked language modeling (MLM), and action word prediction (AWP). Unlike pre-trained models that mainly predict masked tokens in code snippets, we design ULM and MLM to predict masked words in summaries. Intuitively, predicting words based on given code snippets would help learn the code-summary alignment. Additionally, we introduce the domain-specific task AWP to enhance the ability of the encoder to learn the alignment between action words and code snippets. The extensive experiments on four datasets demonstrate that our approach, called ESALE significantly outperforms baselines in all three widely used metrics, including BLEU, METEOR, and ROUGE-L.

code snippet, encoder, unixcoder, (14 more...)

arXiv.org Artificial Intelligence

2407.01646

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
Oceania > Australia > Victoria > Melbourne (0.04)
(33 more...)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.46)

Industry: Information Technology (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.88)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.86)

Add feedback

A Critical Study of What Code-LLMs (Do Not) Learn

Anand, Abhinav, Verma, Shweta, Narasimhan, Krishna, Mezini, Mira

arXiv.org Artificial IntelligenceJun-17-2024

Large Language Models trained on code corpora (code-LLMs) have demonstrated impressive performance in various coding assistance tasks. However, despite their increased size and training dataset, code-LLMs still have limitations such as suggesting codes with syntactic errors, variable misuse etc. Some studies argue that code-LLMs perform well on coding tasks because they use self-attention and hidden representations to encode relations among input tokens. However, previous works have not studied what code properties are not encoded by code-LLMs. In this paper, we conduct a fine-grained analysis of attention maps and hidden representations of code-LLMs. Our study indicates that code-LLMs only encode relations among specific subsets of input tokens. Specifically, by categorizing input tokens into syntactic tokens and identifiers, we found that models encode relations among syntactic tokens and among identifiers, but they fail to encode relations between syntactic tokens and identifiers. We also found that fine-tuned models encode these relations poorly compared to their pre-trained counterparts. Additionally, larger models with billions of parameters encode significantly less information about code than models with only a few hundred million parameters.

graph, relation, representation, (15 more...)

arXiv.org Artificial Intelligence

2406.1193

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(16 more...)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Generalization-Enhanced Code Vulnerability Detection via Multi-Task Instruction Fine-Tuning

Du, Xiaohu, Wen, Ming, Zhu, Jiahao, Xie, Zifan, Ji, Bin, Liu, Huijun, Shi, Xuanhua, Jin, Hai

arXiv.org Artificial IntelligenceJun-5-2024

Code Pre-trained Models (CodePTMs) based vulnerability detection have achieved promising results over recent years. However, these models struggle to generalize as they typically learn superficial mapping from source code to labels instead of understanding the root causes of code vulnerabilities, resulting in poor performance in real-world scenarios beyond the training instances. To tackle this challenge, we introduce VulLLM, a novel framework that integrates multi-task learning with Large Language Models (LLMs) to effectively mine deep-seated vulnerability features. Specifically, we construct two auxiliary tasks beyond the vulnerability detection task. First, we utilize the vulnerability patches to construct a vulnerability localization task. Second, based on the vulnerability features extracted from patches, we leverage GPT-4 to construct a vulnerability interpretation task. VulLLM innovatively augments vulnerability classification by leveraging generative LLMs to understand complex vulnerability patterns, thus compelling the model to capture the root causes of vulnerabilities rather than overfitting to spurious features of a single task. The experiments conducted on six large datasets demonstrate that VulLLM surpasses seven state-of-the-art models in terms of effectiveness, generalization, and robustness.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2406.03718

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > China > Hubei Province > Wuhan (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(15 more...)

Genre: Research Report > Promising Solution (0.34)

Industry: Information Technology > Security & Privacy (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Investigating the Performance of Language Models for Completing Code in Functional Programming Languages: a Haskell Case Study

van Dam, Tim, van der Heijden, Frank, de Bekker, Philippe, Nieuwschepen, Berend, Otten, Marc, Izadi, Maliheh

arXiv.org Artificial IntelligenceMar-22-2024

Language model-based code completion models have quickly grown in use, helping thousands of developers write code in many different programming languages. However, research on code completion models typically focuses on imperative languages such as Python and JavaScript, which results in a lack of representation for functional programming languages. Consequently, these models often perform poorly on functional languages such as Haskell. To investigate whether this can be alleviated, we evaluate the performance of two language models for code, CodeGPT and UniXcoder, on the functional programming language Haskell. We fine-tune and evaluate the models on Haskell functions sourced from a publicly accessible Haskell dataset on HuggingFace. Additionally, we manually evaluate the models using our novel translated HumanEval dataset. Our automatic evaluation shows that knowledge of imperative programming languages in the pre-training of LLMs may not transfer well to functional languages, but that code completion on functional languages is feasible. Consequently, this shows the need for more high-quality Haskell datasets. A manual evaluation on HumanEval-Haskell indicates CodeGPT frequently generates empty predictions and extra comments, while UniXcoder more often produces incomplete or incorrect predictions. Finally, we release HumanEval-Haskell, along with the fine-tuned models and all code required to reproduce our experiments on GitHub (https://github.com/AISE-TUDelft/HaskellCCEval).

completion, dataset, haskell, (15 more...)

arXiv.org Artificial Intelligence

2403.15185

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Portugal > Lisbon > Lisbon (0.05)
Europe > Netherlands > South Holland > Delft (0.05)
(3 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback