Novel Preprocessing Technique for Data Embedding in Engineering Code Generation Using Large Language Model

Lin, Yu-Chen, Kumar, Akhilesh, Chang, Norman, Zhang, Wenliang, Zakir, Muhammad, Apte, Rucha, He, Haiyang, Wang, Chao, Jang, Jyh-Shing Roger

Jan-30-2024–arXiv.org Artificial Intelligence

We present four main contributions to enhance the performance of Large Language Models (LLMs) in generating domain-specific code: (i) utilizing LLM-based data splitting and data renovation techniques to improve the semantic representation of the embedding space; (ii) introducing the Chain of Density for Renovation Credibility (CoDRC), driven by LLMs, and the Adaptive Text Renovation (ATR) algorithm for assessing the reliability of data renovation; (iii) developing the Implicit Knowledge Expansion and Contemplation (IKEC) Prompt technique; and (iv) effectively refactoring existing scripts to generate new, high-quality scripts with LLMs. Using the engineering simulation software RedHawk-SC as a case study, we demonstrate the effectiveness of our data pre-processing method for expanding and categorizing scripts. When combined with IKEC, these techniques help the Retrieval-Augmented Generation (RAG) method retrieve more relevant information, ultimately achieving a 73.33% "Percentage of Correct Lines" on code generation problems in MapReduce applications.
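The abstract reports a "Percentage of Correct Lines" score but does not define how it is computed. As a rough illustration only, the sketch below assumes one plausible reading: the share of non-empty reference lines that also appear, in order, in the generated script. The function name `percentage_of_correct_lines`, the whitespace normalization, and the use of `difflib` for in-order matching are all assumptions made here, not the authors' evaluation procedure, and the sample scripts are invented placeholders.

```python
from difflib import SequenceMatcher


def percentage_of_correct_lines(generated: str, reference: str) -> float:
    """Percent of non-empty reference lines matched, in order, by the generated script.

    Assumed definition for illustration; the paper's own metric may differ.
    """
    # Normalize: strip surrounding whitespace and ignore blank lines.
    gen = [ln.strip() for ln in generated.splitlines() if ln.strip()]
    ref = [ln.strip() for ln in reference.splitlines() if ln.strip()]
    if not ref:
        raise ValueError("reference script has no non-empty lines")

    # Matching blocks are runs of identical lines appearing in the same order
    # in both scripts; their total size is the number of matched lines.
    matcher = SequenceMatcher(None, ref, gen, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(ref)


# Hypothetical three-line scripts: the middle line differs.
reference = "a = load()\nb = map(a)\nc = reduce(b)\n"
generated = "a = load()\nb = transform(a)\nc = reduce(b)\n"
print(round(percentage_of_correct_lines(generated, reference), 2))  # 66.67
```

Under this assumed definition, a line counts only if it matches the reference exactly after whitespace stripping, so semantically equivalent but differently written lines are scored as incorrect.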



Country:
- Europe > Monaco (0.04)
- North America > United States
  - California > Santa Clara County > San Jose (0.04)
- Asia
  - Taiwan (0.04)
  - Middle East > Jordan (0.04)

Genre:
- Research Report (1.00)
- Workflow (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)