AITopics | Sun, Weisong

Collaborating Authors

Sun, Weisong

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Commenting Higher-level Code Unit: Full Code, Reduced Code, or Hierarchical Code Summarization

Sun, Weisong, Zhang, Yiran, Zhu, Jie, Wang, Zhihui, Fang, Chunrong, Zhang, Yonglong, Feng, Yebo, Huang, Jiangping, Wang, Xingya, Jin, Zhi, Liu, Yang

arXiv.org Artificial IntelligenceMar-13-2025

Commenting code is a crucial activity in software development, as it aids in facilitating future maintenance and updates. To enhance the efficiency of writing comments and reduce developers' workload, researchers has proposed various automated code summarization (ACS) techniques to automatically generate comments/summaries for given code units. However, these ACS techniques primarily focus on generating summaries for code units at the method level. There is a significant lack of research on summarizing higher-level code units, such as file-level and module-level code units, despite the fact that summaries of these higher-level code units are highly useful for quickly gaining a macro-level understanding of software components and architecture. To fill this gap, in this paper, we conduct a systematic study on how to use LLMs for commenting higher-level code units, including file level and module level. These higher-level units are significantly larger than method-level ones, which poses challenges in handling long code inputs within LLM constraints and maintaining efficiency. To address these issues, we explore various summarization strategies for ACS of higher-level code units, which can be divided into three types: full code summarization, reduced code summarization, and hierarchical code summarization. The experimental results suggest that for summarizing file-level code units, using the full code is the most effective approach, with reduced code serving as a cost-efficient alternative. However, for summarizing module-level code units, hierarchical code summarization becomes the most promising strategy. In addition, inspired by the research on method-level ACS, we also investigate using the LLM as an evaluator to evaluate the quality of summaries of higher-level code units. The experimental results demonstrate that the LLM's evaluation results strongly correlate with human evaluations.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2503.10737

Country:

Europe (1.00)
Asia > China (0.68)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.34)
Health & Medicine > Therapeutic Area > Immunology (0.34)
Health & Medicine > Epidemiology (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Show Me Your Code! Kill Code Poisoning: A Lightweight Method Based on Code Naturalness

Sun, Weisong, Chen, Yuchen, Yuan, Mengzhe, Fang, Chunrong, Chen, Zhenpeng, Wang, Chong, Liu, Yang, Xu, Baowen, Chen, Zhenyu

arXiv.org Artificial IntelligenceFeb-20-2025

Neural code models (NCMs) have demonstrated extraordinary capabilities in code intelligence tasks. Meanwhile, the security of NCMs and NCMs-based systems has garnered increasing attention. In particular, NCMs are often trained on large-scale data from potentially untrustworthy sources, providing attackers with the opportunity to manipulate them by inserting crafted samples into the data. This type of attack is called a code poisoning attack (also known as a backdoor attack). It allows attackers to implant backdoors in NCMs and thus control model behavior, which poses a significant security threat. However, there is still a lack of effective techniques for detecting various complex code poisoning attacks. In this paper, we propose an innovative and lightweight technique for code poisoning detection named KillBadCode. KillBadCode is designed based on our insight that code poisoning disrupts the naturalness of code. Specifically, KillBadCode first builds a code language model (CodeLM) on a lightweight $n$-gram language model. Then, given poisoned data, KillBadCode utilizes CodeLM to identify those tokens in (poisoned) code snippets that will make the code snippets more natural after being deleted as trigger tokens. Considering that the removal of some normal tokens in a single sample might also enhance code naturalness, leading to a high false positive rate (FPR), we aggregate the cumulative improvement of each token across all samples. Finally, KillBadCode purifies the poisoned data by removing all poisoned samples containing the identified trigger tokens. The experimental results on two code poisoning attacks and four code intelligence tasks demonstrate that KillBadCode significantly outperforms four baselines. More importantly, KillBadCode is very efficient, with a minimum time consumption of only 5 minutes, and is 25 times faster than the best baseline on average.

code snippet, large language model, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2502.1583

Country:

Asia (0.68)
North America > United States > Hawaii (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.54)

Add feedback

Benchmarking Bias in Large Language Models during Role-Playing

Li, Xinyue, Chen, Zhenpeng, Zhang, Jie M., Lou, Yiling, Li, Tianlin, Sun, Weisong, Liu, Yang, Liu, Xuanzhe

arXiv.org Artificial IntelligenceNov-1-2024

Large Language Models (LLMs) have become foundational in modern language-driven applications, profoundly influencing daily life. A critical technique in leveraging their potential is role-playing, where LLMs simulate diverse roles to enhance their real-world utility. However, while research has highlighted the presence of social biases in LLM outputs, it remains unclear whether and to what extent these biases emerge during role-playing scenarios. In this paper, we introduce BiasLens, a fairness testing framework designed to systematically expose biases in LLMs during role-playing. Our approach uses LLMs to generate 550 social roles across a comprehensive set of 11 demographic attributes, producing 33,000 role-specific questions targeting various forms of bias. These questions, spanning Yes/No, multiple-choice, and open-ended formats, are designed to prompt LLMs to adopt specific roles and respond accordingly. We employ a combination of rule-based and LLM-based strategies to identify biased responses, rigorously validated through human evaluation. Using the generated questions as the benchmark, we conduct extensive evaluations of six advanced LLMs released by OpenAI, Mistral AI, Meta, Alibaba, and DeepSeek. Our benchmark reveals 72,716 biased responses across the studied LLMs, with individual models yielding between 7,754 and 16,963 biased responses, underscoring the prevalence of bias in role-playing contexts. To support future research, we have publicly released the benchmark, along with all scripts and experimental results.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2411.00585

Country: Asia (0.29)

Genre:

Research Report > New Finding (0.93)
Overview (0.93)

Industry:

Education (0.48)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Source Code Summarization in the Era of Large Language Models

Sun, Weisong, Miao, Yun, Li, Yuekang, Zhang, Hongyu, Fang, Chunrong, Liu, Yi, Deng, Gelei, Liu, Yang, Chen, Zhenyu

arXiv.org Artificial IntelligenceJul-9-2024

To support software developers in understanding and maintaining programs, various automatic (source) code summarization techniques have been proposed to generate a concise natural language summary (i.e., comment) for a given code snippet. Recently, the emergence of large language models (LLMs) has led to a great boost in the performance of code-related tasks. In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs, which covers multiple aspects involved in the workflow of LLM-based code summarization. Specifically, we begin by examining prevalent automated evaluation methods for assessing the quality of summaries generated by LLMs and find that the results of the GPT-4 evaluation method are most closely aligned with human evaluation. Then, we explore the effectiveness of five prompting techniques (zero-shot, few-shot, chain-of-thought, critique, and expert) in adapting LLMs to code summarization tasks. Contrary to expectations, advanced prompting techniques may not outperform simple zero-shot prompting. Next, we investigate the impact of LLMs' model settings (including top\_p and temperature parameters) on the quality of generated summaries. We find the impact of the two parameters on summary quality varies by the base LLM and programming language, but their impacts are similar. Moreover, we canvass LLMs' abilities to summarize code snippets in distinct types of programming languages. The results reveal that LLMs perform suboptimally when summarizing code written in logic programming languages compared to other language types. Finally, we unexpectedly find that CodeLlama-Instruct with 7B parameters can outperform advanced GPT-4 in generating summaries describing code implementation details and asserting code properties. We hope that our findings can provide a comprehensive understanding of code summarization in the era of LLMs.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2407.07959

Country:

Europe (1.00)
Africa (0.92)
Asia > China (0.68)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

ESALE: Enhancing Code-Summary Alignment Learning for Source Code Summarization

Fang, Chunrong, Sun, Weisong, Chen, Yuchen, Chen, Xiao, Wei, Zhao, Zhang, Quanjun, You, Yudu, Luo, Bin, Liu, Yang, Chen, Zhenyu

arXiv.org Artificial IntelligenceJun-30-2024

(Source) code summarization aims to automatically generate succinct natural language summaries for given code snippets. Such summaries play a significant role in promoting developers to understand and maintain code. Inspired by neural machine translation, deep learning-based code summarization techniques widely adopt an encoder-decoder framework, where the encoder transforms given code snippets into context vectors, and the decoder decodes context vectors into summaries. Recently, large-scale pre-trained models for source code are equipped with encoders capable of producing general context vectors and have achieved substantial improvements on code summarization. However, although they are usually trained mainly on code-focused tasks and can capture general code features, they still fall short in capturing specific features that need to be summarized. This paper proposes a novel approach to improve code summarization based on summary-focused tasks. Specifically, we exploit a multi-task learning paradigm to train the encoder on three summary-focused tasks to enhance its ability to learn code-summary alignment, including unidirectional language modeling (ULM), masked language modeling (MLM), and action word prediction (AWP). Unlike pre-trained models that mainly predict masked tokens in code snippets, we design ULM and MLM to predict masked words in summaries. Intuitively, predicting words based on given code snippets would help learn the code-summary alignment. Additionally, we introduce the domain-specific task AWP to enhance the ability of the encoder to learn the alignment between action words and code snippets. The extensive experiments on four datasets demonstrate that our approach, called ESALE significantly outperforms baselines in all three widely used metrics, including BLEU, METEOR, and ROUGE-L.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2407.01646

Country:

Europe (1.00)
Asia (1.00)
North America > Canada (0.68)
(2 more...)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.46)

Industry: Information Technology (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.88)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.86)

Add feedback

Mutual Information Guided Backdoor Mitigation for Pre-trained Encoders

Han, Tingxu, Sun, Weisong, Ding, Ziqi, Fang, Chunrong, Qian, Hanwei, Li, Jiaxun, Chen, Zhenyu, Zhang, Xiangyu

arXiv.org Artificial IntelligenceJun-11-2024

Self-supervised learning (SSL) is increasingly attractive for pre-training encoders without requiring labeled data. Downstream tasks built on top of those pre-trained encoders can achieve nearly state-of-the-art performance. The pre-trained encoders by SSL, however, are vulnerable to backdoor attacks as demonstrated by existing studies. Numerous backdoor mitigation techniques are designed for downstream task models. However, their effectiveness is impaired and limited when adapted to pre-trained encoders, due to the lack of label information when pre-training. To address backdoor attacks against pre-trained encoders, in this paper, we innovatively propose a mutual information guided backdoor mitigation technique, named MIMIC. MIMIC treats the potentially backdoored encoder as the teacher net and employs knowledge distillation to distill a clean student encoder from the teacher net. Different from existing knowledge distillation approaches, MIMIC initializes the student with random weights, inheriting no backdoors from teacher nets. Then MIMIC leverages mutual information between each layer and extracted features to locate where benign knowledge lies in the teacher net, with which distillation is deployed to clone clean features from teacher to student. We craft the distillation loss with two aspects, including clone loss and attention loss, aiming to mitigate backdoors and maintain encoder performance at the same time. Our evaluation conducted on two backdoor attacks in SSL demonstrates that MIMIC can significantly reduce the attack success rate by only utilizing <5% of clean data, surpassing seven state-of-the-art backdoor mitigation techniques.

artificial intelligence, encoder, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2406.03508

Country:

Asia (0.68)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

On the Effectiveness of Distillation in Mitigating Backdoors in Pre-trained Encoder

Han, Tingxu, Huang, Shenghan, Ding, Ziqi, Sun, Weisong, Feng, Yebo, Fang, Chunrong, Li, Jun, Qian, Hanwei, Wu, Cong, Zhang, Quanjun, Liu, Yang, Chen, Zhenyu

arXiv.org Artificial IntelligenceMar-6-2024

In this paper, we study a defense against poisoned encoders in SSL called distillation, which is a defense used in supervised learning originally. Distillation aims to distill knowledge from a given model (a.k.a the teacher net) and transfer it to another (a.k.a the student net). Now, we use it to distill benign knowledge from poisoned pre-trained encoders and transfer it to a new encoder, resulting in a clean pre-trained encoder. In particular, we conduct an empirical study on the effectiveness and performance of distillation against poisoned encoders. Using two state-of-the-art backdoor attacks against pre-trained image encoders and four commonly used image classification datasets, our experimental results show that distillation can reduce attack success rate from 80.87% to 27.51% while suffering a 6.35% loss in accuracy. Moreover, we investigate the impact of three core components of distillation on performance: teacher net, student net, and distillation loss. By comparing 4 different teacher nets, 3 student nets, and 6 distillation losses, we find that fine-tuned teacher nets, warm-up-training-based student nets, and attention-based distillation loss perform best, respectively.

distillation, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2403.03846

Country:

Asia > China > Jiangsu Province (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report > New Finding (0.87)

Industry:

Information Technology > Security & Privacy (1.00)
Education (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Security & Privacy (1.00)
Information Technology > Data Science (1.00)
(3 more...)

Add feedback

Machine Translation Testing via Syntactic Tree Pruning

Zhang, Quanjun, Zhai, Juan, Fang, Chunrong, Liu, Jiawei, Sun, Weisong, Hu, Haichuan, Wang, Qingyu

arXiv.org Artificial IntelligenceJan-1-2024

Machine translation systems have been widely adopted in our daily life, making life easier and more convenient. Unfortunately, erroneous translations may result in severe consequences, such as financial losses. This requires to improve the accuracy and the reliability of machine translation systems. However, it is challenging to test machine translation systems because of the complexity and intractability of the underlying neural models. To tackle these challenges, we propose a novel metamorphic testing approach by syntactic tree pruning (STP) to validate machine translation systems. Our key insight is that a pruned sentence should have similar crucial semantics compared with the original sentence. Specifically, STP (1) proposes a core semantics-preserving pruning strategy by basic sentence structure and dependency relations on the level of syntactic tree representation; (2) generates source sentence pairs based on the metamorphic relation; (3) reports suspicious issues whose translations break the consistency property by a bag-of-words model. We further evaluate STP on two state-of-the-art machine translation systems (i.e., Google Translate and Bing Microsoft Translator) with 1,200 source sentences as inputs. The results show that STP can accurately find 5,073 unique erroneous translations in Google Translate and 5,100 unique erroneous translations in Bing Microsoft Translator (400% more than state-of-the-art techniques), with 64.5% and 65.4% precision, respectively. The reported erroneous translations vary in types and more than 90% of them cannot be found by state-of-the-art techniques. There are 9,393 erroneous translations unique to STP, which is 711.9% more than state-of-the-art techniques. Moreover, STP is quite effective to detect translation errors for the original sentences with a recall reaching 74.0%, improving state-of-the-art techniques by 55.1% on average.

artificial intelligence, natural language, translation, (18 more...)

arXiv.org Artificial Intelligence

2401.00751

Country: North America > United States > Massachusetts > Hampshire County > Amherst (0.14)

Genre:

Research Report > Promising Solution (1.00)
Research Report > New Finding (1.00)

Industry:

Information Technology (1.00)
Education (1.00)
Media (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

A Prompt Learning Framework for Source Code Summarization

Sun, Weisong, Fang, Chunrong, You, Yudu, Chen, Yuchen, Liu, Yi, Wang, Chong, Zhang, Jian, Zhang, Quanjun, Qian, Hanwei, Zhao, Wei, Liu, Yang, Chen, Zhenyu

arXiv.org Artificial IntelligenceDec-26-2023

(Source) code summarization is the task of automatically generating natural language summaries for given code snippets. Such summaries play a key role in helping developers understand and maintain source code. Recently, with the successful application of large language models (LLMs) in numerous fields, software engineering researchers have also attempted to adapt LLMs to solve code summarization tasks. The main adaptation schemes include instruction prompting and task-oriented fine-tuning. However, instruction prompting involves designing crafted prompts for zero-shot learning or selecting appropriate samples for few-shot learning and requires users to have professional domain knowledge, while task-oriented fine-tuning requires high training costs. In this paper, we propose a novel prompt learning framework for code summarization called PromptCS. PromptCS trains a prompt agent that can generate continuous prompts to unleash the potential for LLMs in code summarization. Compared to the human-written discrete prompt, the continuous prompts are produced under the guidance of LLMs and are therefore easier to understand by LLMs. PromptCS freezes the parameters of LLMs when training the prompt agent, which can greatly reduce the requirements for training resources. We evaluate PromptCS on the CodeSearchNet dataset involving multiple programming languages. The results show that PromptCS significantly outperforms instruction prompting schemes on all four widely used metrics. In some base LLMs, e.g., CodeGen-Multi-2B and StarCoderBase-1B and -3B, PromptCS even outperforms the task-oriented fine-tuning scheme. More importantly, the training efficiency of PromptCS is faster than the task-oriented fine-tuning scheme, with a more pronounced advantage on larger LLMs. The results of the human evaluation demonstrate that PromptCS can generate more good summaries compared to baselines.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2312.16066

Country:

Europe (1.00)
Asia (1.00)
Africa (0.67)
(2 more...)

Genre: Research Report > New Finding (0.88)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We?

Sun, Weisong, Fang, Chunrong, Miao, Yun, You, Yudu, Yuan, Mengzhe, Chen, Yuchen, Zhang, Quanjun, Guo, An, Chen, Xiang, Liu, Yang, Chen, Zhenyu

arXiv.org Artificial IntelligenceDec-1-2023

Programming language understanding and representation (a.k.a code representation learning) has always been a hot and challenging task in software engineering. It aims to apply deep learning techniques to produce numerical representations of the source code features while preserving its semantics. These representations can be used for facilitating subsequent code-related tasks. The abstract syntax tree (AST), a fundamental code feature, illustrates the syntactic information of the source code and has been widely used in code representation learning. However, there is still a lack of systematic and quantitative evaluation of how well AST-based code representation facilitates subsequent code-related tasks. In this paper, we first conduct a comprehensive empirical study to explore the effectiveness of the AST-based code representation in facilitating follow-up code-related tasks. To do so, we compare the performance of models trained with code token sequence (Token for short) based code representation and AST-based code representation on three popular types of code-related tasks. Surprisingly, the overall quantitative statistical results demonstrate that models trained with AST-based code representation consistently perform worse across all three tasks compared to models trained with Token-based code representation. Our further quantitative analysis reveals that models trained with AST-based code representation outperform models trained with Token-based code representation in certain subsets of samples across all three tasks. We also conduct comprehensive experiments to evaluate and reveal the impact of the choice of AST parsing/preprocessing/encoding methods on AST-based code representation and subsequent code-related tasks. Our study provides future researchers with detailed guidance on how to select solutions at each stage to fully exploit AST.

ast, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2312.00413

Country:

Europe (1.00)
North America > United States > Pennsylvania (0.14)
North America > United States > California (0.14)
(6 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.92)

Add feedback