Goto

Collaborating Authors

 context engineering


From vibe coding to context engineering: 2025 in software development

MIT Technology Review

This year, we've seen a real-time experiment playing out across the technology industry, one in which AI's software engineering capabilities have been put to the test against human technologists. And although 2025 may have started with AI looking strong, the transition from vibe coding to what's being termed context engineering shows that while the work of human developers is evolving, they nevertheless remain absolutely critical. This is captured in the latest volume of the " Thoughtworks Technology Radar," a report on the technologies used by our teams on projects with clients. In it, we see the emergence of techniques and tooling designed to help teams better tackle the problem of managing context when working with LLMs and AI agents. Taken together, there's a clear signal of the direction of travel in software engineering and even AI more broadly. After years of the industry assuming progress in AI is all about scale and speed, we're starting to see that what matters is the ability to handle context effectively.


Context Engineering 2.0: The Context of Context Engineering

Hua, Qishuo, Ye, Lyumanshan, Fu, Dayuan, Xiao, Yang, Cai, Xiaojie, Wu, Yunze, Lin, Jifan, Wang, Junfei, Liu, Pengfei

arXiv.org Artificial Intelligence

Karl Marx once wrote that ``the human essence is the ensemble of social relations'', suggesting that individuals are not isolated entities but are fundamentally shaped by their interactions with other entities, within which contexts play a constitutive and essential role. With the advent of computers and artificial intelligence, these contexts are no longer limited to purely human--human interactions: human--machine interactions are included as well. Then a central question emerges: How can machines better understand our situations and purposes? To address this challenge, researchers have recently introduced the concept of context engineering. Although it is often regarded as a recent innovation of the agent era, we argue that related practices can be traced back more than twenty years. Since the early 1990s, the field has evolved through distinct historical phases, each shaped by the intelligence level of machines: from early human--computer interaction frameworks built around primitive computers, to today's human--agent interaction paradigms driven by intelligent agents, and potentially to human--level or superhuman intelligence in the future. In this paper, we situate context engineering, provide a systematic definition, outline its historical and conceptual landscape, and examine key design considerations for practice. By addressing these questions, we aim to offer a conceptual foundation for context engineering and sketch its promising future. This paper is a stepping stone for a broader community effort toward systematic context engineering in AI systems.


Hollywood Town: Long-Video Generation via Cross-Modal Multi-Agent Orchestration

Wei, Zheng, Li, Mingchen, Zhang, Zeqian, Yuan, Ruibin, Hui, Pan, Qu, Huamin, Evans, James, Agrawala, Maneesh, Rao, Anyi

arXiv.org Artificial Intelligence

Recent advancements in multi-agent systems have demonstrated significant potential for enhancing creative task performance, such as long video generation. This study introduces three innovations to improve multi-agent collaboration. First, we propose OmniAgent, a hierarchical, graph-based multi-agent framework for long video generation that leverages a film-production-inspired architecture to enable modular specialization and scalable inter-agent collaboration. Second, inspired by context engineering, we propose hypergraph nodes that enable temporary group discussions among agents lacking sufficient context, reducing individual memory requirements while ensuring adequate contextual information. Third, we transition from directed acyclic graphs (DAGs) to directed cyclic graphs with limited retries, allowing agents to reflect and refine outputs iteratively, thereby improving earlier stages through feedback from subsequent nodes. These contributions lay the groundwork for developing more robust multi-agent systems in creative tasks.


Directed Information $γ$-covering: An Information-Theoretic Framework for Context Engineering

Huang, Hai

arXiv.org Machine Learning

We introduce \textbf{Directed Information $γ$-covering}, a simple but general framework for redundancy-aware context engineering. Directed information (DI), a causal analogue of mutual information, measures asymmetric predictiveness between chunks. If $\operatorname{DI}_{i \to j} \ge H(C_j) - γ$, then $C_i$ suffices to represent $C_j$ up to $γ$ bits. Building on this criterion, we formulate context selection as a $γ$-cover problem and propose a greedy algorithm with provable guarantees: it preserves query information within bounded slack, inherits $(1+\ln n)$ and $(1-1/e)$ approximations from submodular set cover, and enforces a diversity margin. Importantly, building the $γ$-cover is \emph{query-agnostic}: it incurs no online cost and can be computed once offline and amortized across all queries. Experiments on HotpotQA show that $γ$-covering consistently improves over BM25, a competitive baseline, and provides clear advantages in hard-decision regimes such as context compression and single-slot prompt selection. These results establish DI $γ$-covering as a principled, self-organizing backbone for modern LLM pipelines.


OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System

Dai, Sunhao, Tang, Jiakai, Wu, Jiahua, Wang, Kun, Zhu, Yuxuan, Chen, Bingjun, Hong, Bangyang, Zhao, Yu, Fu, Cong, Wu, Kangle, Ni, Yabo, Zeng, Anxiang, Wang, Wenjie, Chen, Xu, Xu, Jun, Ng, See-Kiong

arXiv.org Artificial Intelligence

Despite the growing interest in replicating the scaled success of large language models (LLMs) in industrial search and recommender systems, most existing industrial efforts remain limited to transplanting Transformer architectures, which bring only incremental improvements over strong Deep Learning Recommendation Models (DLRMs). From a first principle perspective, the breakthroughs of LLMs stem not only from their architectures but also from two complementary mechanisms: context engineering, which enriches raw input queries with contextual cues to better elicit model capabilities, and multi-step reasoning, which iteratively refines model outputs through intermediate reasoning paths. However, these two mechanisms and their potential to unlock substantial improvements remain largely underexplored in industrial ranking systems. In this paper, we propose OnePiece, a unified framework that seamlessly integrates LLM-style context engineering and reasoning into both retrieval and ranking models of industrial cascaded pipelines. OnePiece is built on a pure Transformer backbone and further introduces three key innovations: (1) structured context engineering, which augments interaction history with preference and scenario signals and unifies them into a structured tokenized input sequence for both retrieval and ranking; (2) block-wise latent reasoning, which equips the model with multi-step refinement of representations and scales reasoning bandwidth via block size; (3) progressive multi-task training, which leverages user feedback chains to effectively supervise reasoning steps during training. OnePiece has been deployed in the main personalized search scenario of Shopee and achieves consistent online gains across different key business metrics, including over $+2\%$ GMV/UU and a $+2.90\%$ increase in advertising revenue.


Prescriptive Agents based on RAG for Automated Maintenance (PARAM)

Harbola, Chitranshu, Purwar, Anupam

arXiv.org Artificial Intelligence

--Industrial machinery maintenance requires timely Iintervention to prevent catastrophic failures and optimize operational efficiency. This paper presents an integrated Large Language Model (LLM)-based intelligent system for prescriptive maintenance that extends beyond traditional anomaly detection to provide actionable maintenance recommendations. Building upon our prior LAMP framework for numerical data analysis, we develop a comprehensive solution that combines bearing vibration frequency analysis with multi-agentic generation for intelligent maintenance planning. The system classifies fault types (inner race, outer race, ball/roller, cage faults) and assesses severity levels. A multi-agentic component processes maintenance manuals using vector embeddings and semantic search, while also conducting web searches to retrieve comprehensive procedural knowledge and access up-to-date maintenance practices for more accurate and in-depth recommendations. The Gemini model then generates structured maintenance recommendations includes immediate actions, inspection checklists, corrective measures, parts requirements, and timeline specifications. Experimental validation in bearing vibration datasets demonstrates effective anomaly detection and contextually relevant maintenance guidance.


Context Engineering for Multi-Agent LLM Code Assistants Using Elicit, NotebookLM, ChatGPT, and Claude Code

Haseeb, Muhammad

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have shown promise in automating code generation and software engineering tasks, yet they often struggle with complex, multi-file projects due to context limitations and knowledge gaps. We propose a novel context engineering workflow that combines multiple AI components: an Intent Translator (GPT-5) for clarifying user requirements, an Elicit-powered semantic literature retrieval for injecting domain knowledge, NotebookLM-based document synthesis for contextual understanding, and a Claude Code multi-agent system for code generation and validation. Our integrated approach leverages intent clarification, retrieval-augmented generation, and specialized sub-agents orchestrated via Claude's agent framework. We demonstrate that this method significantly improves the accuracy and reliability of code assistants in real-world repositories, yielding higher single-shot success rates and better adherence to project context than baseline single-agent approaches. Qualitative results on a large Next.js codebase show the multi-agent system effectively plans, edits, and tests complex features with minimal human intervention. We compare our system with recent frameworks like CodePlan, MASAI, and HyperAgent, highlighting how targeted context injection and agent role decomposition lead to state-of-the-art performance. Finally, we discuss the implications for deploying LLM-based coding assistants in production, along with lessons learned on context management and future research directions.


DSBC : Data Science task Benchmarking with Context engineering

Kadiyala, Ram Mohan Rao, Gupta, Siddhant, Purbey, Jebish, Martini, Giulio, Shafique, Ali, Debnath, Suman, Farooq, Hamza

arXiv.org Artificial Intelligence

Recent advances in large language models (LLMs) have significantly impacted data science workflows, giving rise to specialized data science agents designed to automate analytical tasks. Despite rapid adoption, systematic benchmarks evaluating the efficacy and limitations of these agents remain scarce. In this paper, we introduce a comprehensive benchmark specifically crafted to reflect real-world user interactions with data science agents by observing usage of our commercial applications. We evaluate three LLMs: Claude-4.0-Sonnet, Gemini-2.5-Flash, and OpenAI-o4-Mini across three approaches: zero-shot with context engineering, multi-step with context engineering, and with SmolAgent. Our benchmark assesses performance across a diverse set of eight data science task categories, additionally exploring the sensitivity of models to common prompting issues, such as data leakage and slightly ambiguous instructions. We further investigate the influence of temperature parameters on overall and task-specific outcomes for each model and approach. Our findings reveal distinct performance disparities among the evaluated models and methodologies, highlighting critical factors that affect practical deployment. The benchmark dataset and evaluation framework introduced herein aim to provide a foundation for future research of more robust and effective data science agents.