AITopics | Tan, Lin

Collaborating Authors

Tan, Lin

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Evaluating the Ability of Large Language Models to Generate Verifiable Specifications in VeriFast

Fan, Wen, Rego, Marilyn, Hu, Xin, Dod, Sanya, Ni, Zhaorui, Xie, Danning, DiVincenzo, Jenna, Tan, Lin

arXiv.org Artificial IntelligenceJan-2-2025

Static verification is a powerful method for enhancing software quality, but it demands significant human labor and resources. This is particularly true of static verifiers that reason about heap manipulating programs using an ownership logic. LLMs have shown promise in a number of software engineering activities, including code generation, test generation, proof generation for theorem provers, and specification generation for static verifiers. However, prior work has not explored how well LLMs can perform specification generation for specifications based in an ownership logic, such as separation logic. To address this gap, this paper explores OpenAI's GPT-4o model's effectiveness in generating specifications on C programs that are verifiable with VeriFast, a separation logic based static verifier. Our experiment employs three different types of user inputs as well as basic and Chain-of-Thought (CoT) prompting to assess GPT's capabilities. Our results indicate that the specifications generated by GPT-4o preserve functional behavior, but struggle to be verifiable. When the specifications are verifiable they contain redundancies. Future directions are discussed to improve the performance.

large language model, machine learning, specification, (17 more...)

arXiv.org Artificial Intelligence

2411.02318

Country:

North America > United States > Indiana > Tippecanoe County (0.16)
North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)

Genre: Research Report > New Finding (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Can Language Models Replace Programmers? REPOCOD Says 'Not Yet'

Liang, Shanchao, Hu, Yiran, Jiang, Nan, Tan, Lin

arXiv.org Artificial IntelligenceNov-3-2024

Large language models (LLMs) have achieved high accuracy, i.e., more than 90% pass@1, in solving Python coding problems in HumanEval and MBPP. Thus, a natural question is, whether LLMs achieve comparable code completion performance compared to human developers? Unfortunately, one cannot answer this question using existing manual crafted or simple (e.g., single-line) code generation benchmarks, since such tasks fail to represent real-world software development tasks. In addition, existing benchmarks often use poor code correctness metrics, providing misleading conclusions. To address these challenges, we create REPOCOD, a code generation benchmark with 980 problems collected from 11 popular real-world projects, with more than 58% of them requiring file-level or repository-level context information. In addition, REPOCOD has the longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00) compared to existing benchmarks. Each task in REPOCOD includes 313.5 developer-written test cases on average for better correctness evaluation. In our evaluations of ten LLMs, none of the models achieve more than 30% pass@1 on REPOCOD, indicating the necessity of building stronger LLMs that can help developers in real-world software development. REPOCOD is available at https://github.com/lt-asset/REPOCOD

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2410.21647

Country: North America > United States (0.14)

Genre: Research Report (0.64)

Industry: Energy > Oil & Gas > Upstream (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

WAFFLE: Multi-Modal Model for Automated Front-End Development

Liang, Shanchao, Jiang, Nan, Qian, Shangshu, Tan, Lin

arXiv.org Artificial IntelligenceOct-23-2024

Web development involves turning UI designs into functional webpages, which can be difficult for both beginners and experienced developers due to the complexity of HTML's hierarchical structures and styles. While Large Language Models (LLMs) have shown promise in generating source code, two major challenges persist in UI-to-HTML code generation: (1) effectively representing HTML's hierarchical structure for LLMs, and (2) bridging the gap between the visual nature of UI designs and the text-based format of HTML code. To tackle these challenges, we introduce Waffle, a new fine-tuning strategy that uses a structure-aware attention mechanism to improve LLMs' understanding of HTML's structure and a contrastive fine-tuning approach to align LLMs' understanding of UI images and HTML code. Models fine-tuned with Waffle show up to 9.00 pp (percentage point) higher HTML match, 0.0982 higher CW-SSIM, 32.99 higher CLIP, and 27.12 pp higher LLEM on our new benchmark WebSight-Test and an existing benchmark Design2Code, outperforming current fine-tuning methods.

artificial intelligence, large language model, natural language, (20 more...)

arXiv.org Artificial Intelligence

2410.18362

Country: North America > United States > California (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code

Jiang, Nan, Li, Qi, Tan, Lin, Zhang, Tianyi

arXiv.org Artificial IntelligenceOct-13-2024

While much research has focused on hallucinations in multiple modalities including images and natural language text, less attention has been given to hallucinations in source code, which leads to incorrect and vulnerable code that causes significant financial loss. To pave the way for research in LLMs' hallucinations in code, we introduce Collu-Bench, a benchmark for predicting code hallucinations of LLMs across code generation (CG) and automated program repair (APR) tasks. Collu-Bench includes 13,234 code hallucination instances collected from five datasets and 11 diverse LLMs, ranging from open-source models to commercial ones. To better understand and predict code hallucinations, Collu-Bench provides detailed features such as the per-step log probabilities of LLMs' output, token types, and the execution feedback of LLMs' generated code for in-depth analysis. In addition, we conduct experiments to predict hallucination on Collu-Bench, using both traditional machine learning techniques and neural networks, which achieves 22.03 - 33.15% accuracy. Our experiments draw insightful findings of code hallucination patterns, reveal the challenge of accurately localizing LLMs' hallucinations, and highlight the need for more sophisticated techniques. Despite the great potential and impressive success of LLMs (Touvron et al., 2023; Brown et al., 2020; Li et al., 2022a; OpenAI, 2024), a known issue of LLMs is hallucination, a phenomenon where the model generates fluent and plausible-sounding but unfaithful or fabricated content (Ji et al., 2023). The hallucination issue poses a significant risk when deploying LLMs in real-world applications that require precise information (Puchert et al., 2023). Due to this importance, researchers have developed benchmarks such as TruthfulQA (Lin et al., 2022), FELM (chen et al., 2023), and HaluEval (Li et al., 2023b) to understand and predict hallucinations of LLMs. Additionally, researchers are actively exploring methods to mitigate hallucinations (Liu et al., 2024b; Elaraby et al., 2023; Dhuliawala et al., 2023; Yan et al., 2024). Another domain where LLMs have been widely applied is source code.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2410.09997

Country: Asia > Middle East > Qatar (0.14)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

SELP: Generating Safe and Efficient Task Plans for Robot Agents with Large Language Models

Wu, Yi, Xiong, Zikang, Hu, Yiran, Iyengar, Shreyash S., Jiang, Nan, Bera, Aniket, Tan, Lin, Jagannathan, Suresh

arXiv.org Artificial IntelligenceSep-28-2024

Despite significant advancements in large language models (LLMs) that enhance robot agents' understanding and execution of natural language (NL) commands, ensuring the agents adhere to user-specified constraints remains challenging, particularly for complex commands and long-horizon tasks. To address this challenge, we present three key insights, equivalence voting, constrained decoding, and domain-specific fine-tuning, which significantly enhance LLM planners' capability in handling complex tasks. Equivalence voting ensures consistency by generating and sampling multiple Linear Temporal Logic (LTL) formulas from NL commands, grouping equivalent LTL formulas, and selecting the majority group of formulas as the final LTL formula. Constrained decoding then uses the generated LTL formula to enforce the autoregressive inference of plans, ensuring the generated plans conform to the LTL. Domain-specific fine-tuning customizes LLMs to produce safe and efficient plans within specific task domains. Our approach, Safe Efficient LLM Planner (SELP), combines these insights to create LLM planners to generate plans adhering to user commands with high confidence. We demonstrate the effectiveness and generalizability of SELP across different robot agents and tasks, including drone navigation and robot manipulation. For drone navigation tasks, SELP outperforms state-of-the-art planners by 10.8% in safety rate (i.e., finishing tasks conforming to NL commands) and by 19.8% in plan efficiency. For robot manipulation tasks, SELP achieves 20.4% improvement in safety rate. Our datasets for evaluating NL-to-LTL and robot task planning will be released in github.com/lt-asset/selp.

large language model, natural language, specification, (14 more...)

arXiv.org Artificial Intelligence

2409.19471

Country: North America > United States > Pennsylvania (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

ENCORE: Ensemble Learning using Convolution Neural Machine Translation for Automatic Program Repair

Lutellier, Thibaud, Pang, Lawrence, Pham, Viet Hung, Wei, Moshi, Tan, Lin

arXiv.org Artificial IntelligenceMar-9-2024

Automated generate-and-validate (G&V) program repair techniques typically rely on hard-coded rules, only fix bugs following specific patterns, and are hard to adapt to different programming languages. We propose ENCORE, a new G&V technique, which uses ensemble learning on convolutional neural machine translation (NMT) models to automatically fix bugs in multiple programming languages. We take advantage of the randomness in hyper-parameter tuning to build multiple models that fix different bugs and combine them using ensemble learning. This new convolutional NMT approach outperforms the standard long short-term memory (LSTM) approach used in previous work, as it better captures both local and long-distance connections between tokens. Our evaluation on two popular benchmarks, Defects4J and QuixBugs, shows that ENCORE fixed 42 bugs, including 16 that have not been fixed by existing techniques. In addition, ENCORE is the first G&V repair technique to be applied to four popular programming languages (Java, C++, Python, and JavaScript), fixing a total of 67 bugs across five benchmarks.

machine learning, natural language, programming language, (18 more...)

arXiv.org Artificial Intelligence

1906.08691

Country: North America > Canada (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Nova$^+$: Generative Language Models for Binaries

Jiang, Nan, Wang, Chengxiao, Liu, Kevin, Xu, Xiangzhe, Tan, Lin, Zhang, Xiangyu

arXiv.org Artificial IntelligenceNov-27-2023

Generative large language models (LLMs) pre-trained on code have shown impressive effectiveness in code generation, program repair, and document analysis. However, existing generative LLMs focus on source code and are not specialized for binaries. There are three main challenges for LLMs to model and learn binary code: hex-decimal values, complex global dependencies, and compiler optimization levels. To bring the benefit of LLMs to the binary domain, we develop Nova and Nova$^+$, which are LLMs pre-trained on binary corpora. Nova is pre-trained with the standard language modeling task, showing significantly better capability on five benchmarks for three downstream tasks: binary code similarity detection (BCSD), binary code translation (BCT), and binary code recovery (BCR), over GPT-3.5 and other existing techniques. We build Nova$^+$ to further boost Nova using two new pre-training tasks, i.e., optimization generation and optimization level prediction, which are designed to learn binary optimization and align equivalent binaries. Nova$^+$ shows overall the best performance for all three downstream tasks on five benchmarks, demonstrating the contributions of the new pre-training tasks.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2311.13721

Country: North America > United States (0.93)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)

Add feedback

How Effective Are Neural Networks for Fixing Security Vulnerabilities

Wu, Yi, Jiang, Nan, Pham, Hung Viet, Lutellier, Thibaud, Davis, Jordan, Tan, Lin, Babkin, Petr, Shah, Sameena

arXiv.org Artificial IntelligenceMay-29-2023

Security vulnerability repair is a difficult task that is in dire need of automation. Two groups of techniques have shown promise: (1) large code language models (LLMs) that have been pre-trained on source code for tasks such as code completion, and (2) automated program repair (APR) techniques that use deep learning (DL) models to automatically fix software bugs. This paper is the first to study and compare Java vulnerability repair capabilities of LLMs and DL-based APR models. The contributions include that we (1) apply and evaluate five LLMs (Codex, CodeGen, CodeT5, PLBART and InCoder), four fine-tuned LLMs, and four DL-based APR techniques on two real-world Java vulnerability benchmarks (Vul4J and VJBench), (2) design code transformations to address the training and test data overlapping threat to Codex, (3) create a new Java vulnerability repair benchmark VJBench, and its transformed version VJBench-trans and (4) evaluate LLMs and APR techniques on the transformed vulnerabilities in VJBench-trans. Our findings include that (1) existing LLMs and APR models fix very few Java vulnerabilities. Codex fixes 10.2 (20.4%), the most number of vulnerabilities. (2) Fine-tuning with general APR data improves LLMs' vulnerability-fixing capabilities. (3) Our new VJBench reveals that LLMs and APR models fail to fix many Common Weakness Enumeration (CWE) types, such as CWE-325 Missing cryptographic step and CWE-444 HTTP request smuggling. (4) Codex still fixes 8.3 transformed vulnerabilities, outperforming all the other LLMs and APR models on transformed vulnerabilities. The results call for innovations to enhance automated Java vulnerability repair such as creating larger vulnerability repair training data, tuning LLMs with such data, and applying code simplification transformation to facilitate vulnerability repair.

machine learning, natural language, vulnerability, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3597926.3598135

2305.18607

Country:

North America > United States > New York (0.14)
North America > Canada > Alberta (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

KNOD: Domain Knowledge Distilled Tree Decoder for Automated Program Repair

Jiang, Nan, Lutellier, Thibaud, Lou, Yiling, Tan, Lin, Goldwasser, Dan, Zhang, Xiangyu

arXiv.org Artificial IntelligenceApr-16-2023

Automated Program Repair (APR) improves software reliability by generating patches for a buggy program automatically. Recent APR techniques leverage deep learning (DL) to build models to learn to generate patches from existing patches and code corpora. While promising, DL-based APR techniques suffer from the abundant syntactically or semantically incorrect patches in the patch space. These patches often disobey the syntactic and semantic domain knowledge of source code and thus cannot be the correct patches to fix a bug. We propose a DL-based APR approach KNOD, which incorporates domain knowledge to guide patch generation in a direct and comprehensive way. KNOD has two major novelties, including (1) a novel three-stage tree decoder, which directly generates Abstract Syntax Trees of patched code according to the inherent tree structure, and (2) a novel domain-rule distillation, which leverages syntactic and semantic rules and teacher-student distributions to explicitly inject the domain knowledge into the decoding procedure during both the training and inference phases. We evaluate KNOD on three widely-used benchmarks. KNOD fixes 72 bugs on the Defects4J v1.2, 25 bugs on the QuixBugs, and 50 bugs on the additional Defects4J v2.0 benchmarks, outperforming all existing APR tools.

decoder, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2302.01857

Country:

Europe (1.00)
North America > United States (0.46)
North America > Canada > Alberta (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Add feedback

Self-supervised vision-language pretraining for Medical visual question answering

Li, Pengfei, Liu, Gang, Tan, Lin, Liao, Jinying, Zhong, Shenjun

arXiv.org Artificial IntelligenceNov-24-2022

Medical image visual question answering (VQA) is a task to answer clinical questions, given a radiographic image, which is a challenging problem that requires a model to integrate both vision and language information. To solve medical VQA problems with a limited number of training data, pretrain-finetune paradigm is widely used to improve the model generalization. In this paper, we propose a self-supervised method that applies Masked image modeling, Masked language modeling, Image text matching and Image text alignment via contrastive learning (M2I2) for pretraining on medical image caption dataset, and finetunes to downstream medical VQA tasks. The proposed method achieves state-of-the-art performance on all the three public medical VQA datasets. Our codes and models are available at https://github.com/pengfeiliHEU/M2I2.

machine learning, natural language, question answering, (20 more...)

arXiv.org Artificial Intelligence

2211.13594

Country: Asia > China > Heilongjiang Province (0.14)

Genre: Research Report (0.51)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.73)

Add feedback