AITopics | Hu, Wenxiang

Collaborating Authors

Hu, Wenxiang

EpiCoder: Encompassing Diversity and Complexity in Code Generation

Wang, Yaoxiang, Li, Haoling, Zhang, Xin, Wu, Jie, Liu, Xiao, Hu, Wenxiang, Guo, Zhongxin, Huang, Yangyu, Xin, Ying, Yang, Yujiu, Su, Jinsong, Chen, Qi, Li, Scarlett

arXiv.org Artificial Intelligence, Jan-8-2025

Effective instruction tuning is indispensable for optimizing code LLMs, aligning model behavior with user expectations and enhancing model performance in real-world applications. However, most existing methods focus on code snippets, which are limited to specific functionalities and rigid structures, restricting the complexity and diversity of the synthesized data. To address these limitations, we introduce a novel feature tree-based synthesis framework inspired by Abstract Syntax Trees (AST). Unlike an AST, which captures the syntactic structure of code, our framework models semantic relationships between code elements, enabling the generation of more nuanced and diverse data. The feature tree is constructed from raw data and refined iteratively to increase the quantity and diversity of the extracted features. This process enables the identification of more complex patterns and relationships within the code. By sampling subtrees with controlled depth and breadth, our framework allows precise adjustments to the complexity of the generated code, supporting a wide range of tasks from simple function-level operations to intricate multi-file scenarios. We fine-tuned widely used base models to create the EpiCoder series, achieving state-of-the-art performance at both the function and file levels across multiple benchmarks. Notably, empirical evidence indicates that our approach shows significant potential in synthesizing highly complex repository-level code data. Further analysis elucidates the merits of this approach by rigorously assessing data complexity and diversity through software engineering principles and the LLM-as-a-judge method.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2501.04694

Country:

North America > United States (0.28)
Asia (0.28)
Europe > Austria > Vienna (0.14)
Africa > Cameroon > Gulf of Guinea (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

HR-MultiWOZ: A Task Oriented Dialogue (TOD) Dataset for HR LLM Agent

Xu, Weijie, Huang, Zicheng, Hu, Wenxiang, Fang, Xi, Cherukuri, Rajesh Kumar, Nayyar, Naumaan, Malandri, Lorenzo, Sengamedu, Srinivasan H.

arXiv.org Artificial Intelligence, Feb-1-2024

Recent advancements in Large Language Models (LLMs) have been reshaping Natural Language Processing (NLP) tasks in several domains. Their use in the field of Human Resources (HR) still has room for expansion and could benefit several time-consuming tasks. Examples such as time-off submissions, medical claims filing, and access requests are noteworthy, but they are by no means the sole instances. However, the aforementioned developments must grapple with the pivotal challenge of constructing a high-quality training dataset. On one hand, most conversation datasets solve problems for customers, not employees. On the other hand, gathering conversations with HR could raise privacy concerns. To address this, we introduce HR-MultiWOZ, a fully labeled dataset of 550 conversations spanning 10 HR domains for evaluating LLM agents. Our work makes the following contributions: (1) It is the first labeled, open-source conversation dataset in the HR domain for NLP research. (2) It provides a detailed recipe for the data generation procedure along with data analysis and human evaluations. The data generation pipeline is transferable and can be easily adapted for labeled conversation data generation in other domains. (3) The proposed data-collection pipeline is mostly based on LLMs with minimal human involvement for annotation, which is time- and cost-efficient.

artificial intelligence, large language model, natural language, (15 more...)

arXiv.org Artificial Intelligence

2402.01018

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > Canada (0.14)

Genre: Research Report (0.84)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation

Yu, Zhaojian, Zhang, Xin, Shang, Ning, Huang, Yangyu, Xu, Can, Zhao, Yishujie, Hu, Wenxiang, Yin, Qiufeng

arXiv.org Artificial Intelligence, Jan-11-2024

Recent work demonstrates that, after being fine-tuned on a high-quality instruction dataset, the resulting model can obtain impressive capabilities to address a wide range of tasks. However, existing methods for instruction data generation often produce duplicate data and offer too little control over data quality. In this paper, we extend the generalization of instruction tuning by classifying the instruction data into 4 code-related tasks and propose an LLM-based Generator-Discriminator data process framework to generate diverse, high-quality instruction data from open-source code. Hence, we introduce CodeOcean, a dataset comprising 20,000 instruction instances across 4 universal code-related tasks, which is aimed at augmenting the effectiveness of instruction tuning and improving the generalization ability of fine-tuned models. Subsequently, we present WaveCoder, a fine-tuned Code LLM with Widespread And Versatile Enhanced instruction tuning. This model is specifically designed for enhancing instruction tuning of Code Language Models (LLMs). Our experiments demonstrate that WaveCoder models outperform other open-source models in terms of generalization ability across different code-related tasks at the same level of fine-tuning scale. Moreover, WaveCoder exhibits high efficiency in previous code generation tasks. This paper thus offers a significant contribution to the field of instruction data generation and fine-tuning models, providing new insights and tools for enhancing performance in code-related tasks.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2312.14187

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre:

Research Report (0.64)
Instructional Material (0.60)
Overview (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM

Xu, Weijie, Hu, Wenxiang, Wu, Fanyou, Sengamedu, Srinivasan

arXiv.org Artificial Intelligence, Dec-23-2023

In the burgeoning field of natural language processing (NLP), Neural Topic Models (NTMs), Large Language Models (LLMs), and diffusion models have emerged as areas of significant research interest. Despite this, NTMs primarily utilize contextual embeddings from LLMs, which are neither optimal for clustering nor capable of topic-based text generation. NTMs have never been combined with diffusion models for text generation. Our study addresses these gaps by introducing a novel framework named Diffusion-Enhanced Topic Modeling using Encoder-Decoder-based LLMs (DeTiME). DeTiME leverages Encoder-Decoder-based LLMs to produce highly clusterable embeddings that could generate topics that exhibit both superior clusterability and enhanced semantic coherence compared to existing methods. Additionally, by exploiting the power of diffusion models, our framework also provides the capability to do topic-based text generation. This dual functionality allows users to efficiently produce highly clustered topics and topic-based text generation simultaneously. DeTiME's potential extends to generating clustered embeddings as well. Notably, our proposed framework (both encoder-decoder based LLM and diffusion model) proves to be efficient to train and exhibits high adaptability to other LLMs and diffusion models, demonstrating its potential for a wide array of applications.

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2023.findings-emnlp.606

2310.15296

Country:

Asia > Middle East > UAE (0.14)
North America > United States > Colorado (0.14)
North America > United States > California (0.14)
Asia > Middle East > Qatar (0.14)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

WizardCoder: Empowering Code Large Language Models with Evol-Instruct

Luo, Ziyang, Xu, Can, Zhao, Pu, Sun, Qingfeng, Geng, Xiubo, Hu, Wenxiang, Tao, Chongyang, Ma, Jing, Lin, Qingwei, Jiang, Daxin

arXiv.org Artificial Intelligence, Jun-14-2023

Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, and DS-1000, we unveil the exceptional capabilities of our model. It surpasses all other open-source Code LLMs by a substantial margin. Moreover, our model even outperforms the largest closed LLMs, Anthropic's Claude and Google's Bard, on HumanEval and HumanEval+. Our code, model weights, and data are public at https://github.com/nlpxucan/WizardLM

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2306.08568

Country: Asia > Middle East > UAE (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
