Du, Yuyang
The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility?
Zhang, Yiyi, Chen, Xingyu, Chen, Kexin, Du, Yuyang, Dang, Xilin, Heng, Pheng-Ann
Recent years have witnessed extensive efforts to enhance Large Language Models (LLMs) across various domains, alongside growing attention to their ethical implications. However, a critical challenge remains largely overlooked: LLMs must balance rejecting harmful requests for safety against accommodating legitimate ones for utility. This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance by addressing this ethical-utility trade-off, using chemical domain applications as a proof-of-concept. Our alignment pipeline starts with a GPT-assisted three-phase data generation scheme, in which we create LibraChemQA, a chemical question-answering dataset comprising 31.6k triplet instances. By incorporating an innovative balanced seed in the data generation process, our framework systematically considers both legitimate and illegitimate requests. The framework also introduces a rephrasing mechanism for efficient data augmentation that enhances the model's chemical comprehension. We further develop a novel hybrid evaluation scheme with LLM judges for precise assessment of both safety and utility. Experimental results demonstrate substantial improvements in overall performance when both safety and utility are considered: our resulting model, LibraChem, outperforms leading LLMs including Claude-3, GPT-4o, and LLaMA-3 by margins of 13.44%, 7.16%, and 7.10%, respectively, on our released benchmark.
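For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO objective for a single preference triplet. It assumes each triplet is a (prompt, chosen response, rejected response) tuple, which the abstract does not spell out. The function name, argument names, and the default beta are illustrative and are not taken from the paper.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (prompt, chosen, rejected) triplet.

    Each argument is the summed log-probability of a response given the
    prompt, under the trainable policy or the frozen reference model.
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    # Implicit reward of each response, measured against the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)
```

When the policy equals the reference model, the loss is log 2. It falls below that value as the policy raises the chosen response's likelihood relative to the rejected one.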
Rephrase and Contrast: Fine-Tuning Language Models for Enhanced Understanding of Communication and Computer Networks
Wang, Liujianfu, Du, Yuyang, Lin, Jingqi, Chen, Kexin, Liew, Soung Chang
Large language models (LLMs) are being widely researched across various disciplines, with significant recent efforts focusing on adapting LLMs for understanding of how communication networks operate. However, over-reliance on prompting techniques hinders the full exploitation of the generalization ability of these models, and the lack of efficient fine-tuning methods prevents the full realization of lightweight LLMs' potential. This paper addresses these challenges by introducing our Rephrase and Contrast (RaC) framework, an efficient fine-tuning framework. RaC enhances LLMs' comprehension and critical thinking abilities by incorporating question reformulation and contrastive analysis of correct and incorrect answers during the fine-tuning process. Experimental results demonstrate a 63.73% accuracy improvement over the foundational model when tested on a comprehensive networking problem set. Moreover, to efficiently construct the dataset for RaC fine-tuning, we develop a GPT-assisted data mining method for generating high-quality question-answer (QA) pairs; furthermore, we introduce ChoiceBoost, a data augmentation technique that expands dataset size while reducing answer-order bias. Apart from these technical innovations, we contribute to the networking community by open-sourcing valuable research resources, including: 1) the fine-tuned networking model referred to as RaC-Net, 2) the training dataset used for fine-tuning the model, 3) three testing problem sets of different difficulties to serve as benchmarks for future research, and 4) code associated with the above resources.
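The abstract does not describe how ChoiceBoost works internally. As a hedged illustration of the general idea of enlarging a multiple-choice dataset while reducing answer-order bias, the sketch below assumes the augmentation permutes the answer options and tracks where the correct one lands. The function name and record layout are hypothetical.

```python
from itertools import permutations


def permute_choices(question, choices, answer_index, max_variants=None):
    """Return copies of a multiple-choice item with reordered options.

    Assumption: augmentation is done by permuting options. This is an
    illustration, not the authors' ChoiceBoost implementation.
    """
    variants = []
    for order in permutations(range(len(choices))):
        variants.append({
            "question": question,
            "choices": [choices[i] for i in order],
            # New position of the originally correct option.
            "answer_index": order.index(answer_index),
        })
        if max_variants is not None and len(variants) >= max_variants:
            break
    return variants
```

With all permutations kept, the correct answer appears in every position equally often, so a model cannot score well by favoring one option slot.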
An Autonomous Large Language Model Agent for Chemical Literature Data Mining
Chen, Kexin, Cao, Hanqun, Li, Junyou, Du, Yuyang, Guo, Menghao, Zeng, Xin, Li, Lanqing, Qiu, Jiezhong, Heng, Pheng Ann, Chen, Guangyong
Chemical synthesis, which is crucial for advancing material synthesis and drug discovery, impacts various sectors including environmental science and healthcare. The rise of technology in chemistry has generated extensive chemical data, challenging researchers to discern patterns and refine synthesis processes. Artificial intelligence (AI) helps by analyzing data to optimize synthesis and increase yields. However, AI faces challenges in processing literature data due to the unstructured format and diverse writing style of chemical literature. To overcome these difficulties, we introduce an end-to-end AI agent framework capable of high-fidelity extraction from extensive chemical literature. This AI agent employs large language models (LLMs) for prompt generation and iterative optimization. It functions as a chemistry assistant, automating data collection and analysis, thereby saving manpower and enhancing performance. Our framework's efficacy is evaluated using accuracy, recall, and F1 score of reaction condition data, and we compare our method with human experts in terms of content correctness and time efficiency. The proposed approach marks a significant advancement in automating chemical literature extraction and demonstrates the potential for AI to revolutionize data management and utilization in chemistry.
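As a reminder of how recall and F1 are typically computed for an extraction task, here is a minimal sketch. It assumes extracted and reference reaction conditions can be compared as exact-match items in a set. The paper's actual matching criteria are not given in the abstract, and the function name is illustrative.

```python
def extraction_scores(predicted, gold):
    """Return (precision, recall, f1) for extracted items against a gold set.

    Assumption: items match only when they are exactly equal.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1
```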
Chemist-X: Large Language Model-empowered Agent for Reaction Condition Recommendation in Chemical Synthesis
Chen, Kexin, Li, Junyou, Wang, Kunyi, Du, Yuyang, Yu, Jiahui, Lu, Jiamin, Li, Lanqing, Qiu, Jiezhong, Pan, Jianzhang, Huang, Yi, Fang, Qun, Heng, Pheng Ann, Chen, Guangyong
Recent AI research points to a promising future for automated chemical reactions within the chemistry community. This study proposes Chemist-X, a transformative AI agent that automates the reaction condition recommendation (RCR) task in chemical synthesis with retrieval-augmented generation (RAG) technology. To emulate expert chemists' strategies when solving RCR tasks, Chemist-X utilizes advanced RAG schemes to interrogate online molecular databases and distill critical data from the latest literature database. Further, the agent leverages state-of-the-art computer-aided design (CAD) tools with a large language model (LLM) supervised programming interface. With the ability to utilize updated chemical knowledge and CAD tools, our agent significantly outperforms conventional synthesis AIs confined to the fixed knowledge within their training data. Chemist-X considerably reduces chemists' workload and allows them to focus on more fundamental and creative problems, thereby bringing computational techniques and chemical research closer together and making a remarkable leap toward harnessing AI's full capabilities in scientific discovery.
LLMind: Orchestrating AI and IoT with LLMs for Complex Task Execution
Cui, Hongwei, Du, Yuyang, Yang, Qun, Shao, Yulin, Liew, Soung Chang
In this paper, we introduce LLMind, an AI framework that utilizes large language models (LLMs) as a central orchestrator. The framework integrates LLMs with domain-specific AI modules, enabling IoT devices to collaborate effectively in executing complex tasks. The LLM engages in natural conversations with human users via a user-friendly social media platform to come up with a plan to execute complex tasks. In particular, the execution of a complex task, which may involve the collaborations of multiple domain-specific AI modules and IoT devices, is realized through a control script. The LLM generates the control script using a Language-Code transformation approach based on finite-state machines (FSMs). The framework also incorporates semantic analysis and response optimization techniques to enhance speed and effectiveness. Ultimately, this framework is designed not only to innovate IoT device control and enrich user experiences but also to foster an intelligent and integrated IoT device ecosystem that evolves and becomes more sophisticated through continuing user and machine interactions.
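To make the idea of an FSM-driven control script concrete, the sketch below shows a generic table-driven finite-state machine. The states, events, and function name are invented for illustration. The abstract does not specify LLMind's actual state set or the form of its Language-Code transformation.

```python
# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    ("idle", "start"): "locate",
    ("locate", "found"): "act",
    ("locate", "not_found"): "idle",
    ("act", "done"): "idle",
}


def run_fsm(events, state="idle"):
    """Replay a sequence of events and return the list of visited states."""
    trace = [state]
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"no transition from {state!r} on {event!r}")
        state = TRANSITIONS[key]
        trace.append(state)
    return trace
```

Keeping the transitions in an explicit table means every reachable state and every accepted event is listed up front, so an invalid step is rejected instead of being silently executed.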
AMD-DBSCAN: An Adaptive Multi-density DBSCAN for datasets of extremely variable density
Wang, Ziqing, Ye, Zhirong, Du, Yuyang, Mao, Yi, Liu, Yanying, Wu, Ziling, Wang, Jun
DBSCAN is a widely used density-based clustering algorithm. However, with the increasing demand for multi-density clustering, traditional DBSCAN cannot produce good clustering results on multi-density datasets. To address this problem, this paper proposes an adaptive multi-density DBSCAN algorithm (AMD-DBSCAN). AMD-DBSCAN uses an improved parameter adaptation method to search for multiple parameter pairs (i.e., Eps and MinPts), the key parameters that determine clustering results and performance, which allows the model to be applied to multi-density datasets. Moreover, AMD-DBSCAN requires only one hyperparameter, avoiding complicated, repetitive initialization operations. Furthermore, the variance of the number of neighbors (VNN) is proposed to measure the difference in density between clusters. The experimental results show that AMD-DBSCAN reduces execution time by an average of 75% compared with the traditional adaptive algorithm, owing to its lower algorithmic complexity. In addition, AMD-DBSCAN improves accuracy by 24.7% on average over the state-of-the-art design on multi-density datasets of extremely variable density, with no performance loss in single-density scenarios. Our code and datasets are available at https://github.com/AlexandreWANG915/AMD-DBSCAN.
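The abstract names VNN but does not define it precisely. The sketch below assumes one plausible reading: count each point's neighbors within a radius Eps, then take the population variance of those counts. The function names are illustrative and this is not the code from the linked repository.

```python
import math
from statistics import pvariance


def neighbor_counts(points, eps):
    """Count, for each point, the other points within distance eps."""
    counts = []
    for i, p in enumerate(points):
        n = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= eps
        )
        counts.append(n)
    return counts


def neighbor_count_variance(points, eps):
    """Population variance of neighbor counts (assumed reading of VNN)."""
    return pvariance(neighbor_counts(points, eps))
```

Under this reading, a dataset of uniform density gives a variance near zero, while a mix of dense clusters and sparse points gives a larger value.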