Applying the Chinese Wall Reverse Engineering Technique to Large Language Model Code Editing

Jul-22-2025, arXiv.org Artificial Intelligence

This work does not provide legal advice and does not claim that any legal opinion provided is correct. [...] and block the model's output accordingly. This technique might not completely block partial matches, and it does not work for open source development, where reproduction of the original code is expected. Some models address this issue by using curated datasets with appropriately licensed content. For example, the Stack v2 dataset [2] and the StarCoder2 model limit their data to permissively licensed sources and content with an unknown license. The Common Pile dataset [3] and the accompanying Comma model improve on this by limiting the dataset to permissively licensed content only. Most permissive licenses have attribution as their primary licensing condition and may be easier to comply with than the GPLv2 license. Ideally, models trained only on public domain content would be the best in terms of legal compliance, as such content carries no restrictions or requirements, but to our knowledge no text generation model of reasonable quality trained this way exists today.

large language model, machine learning, natural language, (21 more...)

Genre:
- Research Report (0.40)

Industry:
- Law (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (0.98)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
