- North America > United States > Washington > King County > Seattle (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
EnCompass: Enhancing Agent Programming with Search Over Program Execution Paths
Li, Zhening, Solar-Lezama, Armando, Yue, Yisong, Zheng, Stephan
We introduce a new approach to agent programming, the development of LLM-based agents. Current approaches to agent programming often entangle two aspects of agent design: the core workflow logic and the inference-time strategy (e.g., tree search). We introduce "probabilistic angelic nondeterminism" ("PAN"), a programming model that disentangles these two concerns, allowing the programmer to describe the agent workflow and independently experiment with different inference-time strategies by simply changing a few inputs. We provide an implementation of PAN in Python as the EnCompass framework, which uses a Python decorator to compile agent workflow programs into a search space. We present three case studies that demonstrate how the framework lets the programmer quickly improve the reliability of an agent and easily switch between different inference-time strategies, all with little additional coding.
- Europe > Austria > Vienna (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Sweden (0.04)
- Asia > Middle East > Jordan (0.04)
- Workflow (1.00)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)
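The separation that the EnCompass abstract describes, workflow logic on one side and inference-time strategy on the other, can be illustrated with a small toy. The sketch below is not the EnCompass API and does not use its decorator. Every name in it (`workflow`, `run`, `first_choice`, `exhaustive_dfs`) is hypothetical, and the branch points are plain lists standing in for LLM calls. It only shows that one workflow can be run under two different search strategies without editing the workflow itself.

```python
def workflow():
    """Toy agent workflow with two branch points (stand-ins for LLM calls)."""
    a = yield [1, 2, 3, 4]            # branch point 1: candidate outputs
    b = yield [1, 2, 3]               # branch point 2
    return (a, b), -abs(a + b - 7)    # (result, score); the best score is 0


def run(flow, path):
    """Replay `flow` from the start, taking the choice indices in `path`."""
    gen = flow()
    try:
        options = next(gen)
        for idx in path:
            options = gen.send(options[idx])
    except StopIteration as stop:
        result, score = stop.value
        return "done", result, score
    return "branch", options, None


def first_choice(flow):
    """Strategy 1: always take the first candidate (no search)."""
    path = []
    while True:
        kind, payload, score = run(flow, path)
        if kind == "done":
            return payload, score
        path.append(0)


def exhaustive_dfs(flow):
    """Strategy 2: depth-first search over all execution paths."""
    best, stack = None, [[]]
    while stack:
        path = stack.pop()
        kind, payload, score = run(flow, path)
        if kind == "done":
            if best is None or score > best[1]:
                best = (payload, score)
        else:
            stack.extend(path + [i] for i in range(len(payload)))
    return best


print(first_choice(workflow))    # no search
print(exhaustive_dfs(workflow))  # full search, same workflow code
```

Swapping `first_choice` for `exhaustive_dfs` changes only the call site. Replaying from the start on every path is a simplification chosen for brevity, not a claim about how the framework explores its search space.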
Modeling and Control of Magnetic Forces between Microrobots
Seguel, Amelia Fernández, Maass, Alejandro I.
The independent control of multiple magnetic microrobots under a shared global signal presents critical challenges in biomedical applications such as targeted drug delivery and microsurgeries. Most existing systems only allow all agents to move synchronously, limiting their use in applications that require differentiated actuation. This research aims to design a controller capable of regulating the radial distance between micro-agents using only the angle $\psi$ of a global magnetic field as the actuation parameter, demonstrating potential for practical applications. The proposed cascade control approach enables faster and more precise adjustment of the inter-agent distance than a proportional controller, while maintaining smooth transitions and avoiding abrupt changes in the orientation of the magnetic field, making it suitable for real-world implementation. A bibliographic review was conducted to develop the physical model, considering magnetic dipole-dipole interactions and velocities in viscous media. A PID controller was implemented to regulate the radial distance, followed by a PD controller in cascade to smooth changes in field orientation. These controllers were simulated in MATLAB, showing that the PID controller reduced convergence time to the desired radius by about 40%. When adding the second controller, the combined PID+PD scheme achieved smooth angular trajectories within similar timeframes, with fluctuations of only $\pm 5^\circ$. These results validate the feasibility of controlling the radial distance of two microrobots using a shared magnetic field in a fast and precise manner, without abrupt variations in the control angle. However, the model is limited to a 2D environment and two agents, suggesting future research to extend the controller to 3D systems and multiple agents.
- North America > Canada > Ontario > Toronto (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
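The cascade structure named in the microrobot abstract (an outer PID on radial distance, an inner PD that smooths the field angle) can be sketched on a toy plant. Everything numeric here is an assumption and not the authors' model or their MATLAB code: the plant is a normalized radial velocity $dr/dt = 1 - 3\cos^2\psi$, the gains are picked by hand, and the function name `simulate` is hypothetical.

```python
import math


def simulate(r0=1.0, r_target=2.0, dt=1e-3, t_end=10.0,
             kp=2.0, ki=1.0, kd=0.1,       # outer PID gains (assumed)
             kp_in=100.0, kd_in=20.0):     # inner PD gains (assumed)
    """Cascade PID + PD on a toy plant dr/dt = 1 - 3*cos(psi)**2."""
    magic = math.acos(1.0 / math.sqrt(3.0))  # angle giving zero radial velocity
    r, psi, psi_rate = r0, magic, 0.0
    integral, prev_err = 0.0, r_target - r0
    max_rate = 0.0
    for _ in range(int(t_end / dt)):
        # Outer loop: PID on the radius error yields a velocity command.
        err = r_target - r
        deriv = (err - prev_err) / dt
        prev_err = err
        u = kp * err + ki * integral + kd * deriv
        v_cmd = min(1.0, max(-2.0, u))       # velocities the toy plant can reach
        if u == v_cmd:                       # simple anti-windup: hold when saturated
            integral += err * dt
        # Invert the toy plant to get the field angle for that velocity.
        psi_ref = math.acos(math.sqrt((1.0 - v_cmd) / 3.0))
        # Inner loop: PD drives psi toward psi_ref without abrupt jumps.
        psi_acc = kp_in * (psi_ref - psi) - kd_in * psi_rate
        psi, psi_rate = psi + dt * psi_rate, psi_rate + dt * psi_acc
        # Plant update (explicit Euler).
        r += dt * (1.0 - 3.0 * math.cos(psi) ** 2)
        max_rate = max(max_rate, abs(psi_rate))
    return r, psi, max_rate


r_final, psi_final, max_rate = simulate()
print(round(r_final, 3), round(math.degrees(psi_final), 1), round(max_rate, 2))
```

The inner PD acts as a critically damped filter on the commanded angle, which is what keeps the angular rate bounded even when the outer PID output saturates.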
ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels
Sul, Stuart H., Arora, Simran, Spector, Benjamin F., Ré, Christopher
Inter-GPU communication has become a major bottleneck for modern AI workloads as models scale and improvements in hardware compute throughput outpace improvements in interconnect bandwidth. Existing systems mitigate this through compute-communication overlap but often fail to meet theoretical peak performance across heterogeneous workloads and new accelerators. Instead of operator-specific techniques, we ask whether a small set of simple, reusable principles can systematically guide the design of optimal multi-GPU kernels. We present ParallelKittens (PK), a minimal CUDA framework that drastically simplifies the development of overlapped multi-GPU kernels. PK extends the ThunderKittens framework and embodies the principles of multi-GPU kernel design through eight core primitives and a unified programming template, derived from a comprehensive analysis of the factors that govern multi-GPU performance: data-transfer mechanisms, resource scheduling, and design overheads. We validate PK on both Hopper and Blackwell architectures. With fewer than 50 lines of device code, PK achieves up to $2.33 \times$ speedup for data- and tensor-parallel workloads, $4.08 \times$ for sequence-parallel workloads, and $1.22 \times$ for expert-parallel workloads.
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
The Anatomy of a Triton Attention Kernel
Ringlein, Burkhard, van Lunteren, Jan, Stoica, Radu, Parnell, Thomas
A long-standing goal in both industry and academia is to develop an LLM inference platform that is portable across hardware architectures, eliminates the need for low-level hand-tuning, and still delivers best-in-class efficiency. In this work, we demonstrate that portable, efficient cross-platform LLM inference is indeed possible and share our experience. We develop a state-of-the-art paged attention kernel, the core performance-critical component of many LLM deployments, that builds exclusively on the domain-specific just-in-time compiled language Triton to achieve state-of-the-art performance on both NVIDIA and AMD GPUs. We describe our high-level approach, the key algorithmic and system-level improvements, the parameter auto-tuning required to unlock efficiency, and the integrations into a popular inference server that are necessary to bring the performance of a generic Triton attention kernel from 19.7% of the state-of-the-art to 105.9%. Our results highlight how open-source domain-specific languages can be leveraged to unlock model portability across different GPU vendors.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > Arizona > Maricopa County > Phoenix (0.04)
- Europe > Sweden > Vaestra Goetaland > Gothenburg (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)
- Oceania > Australia (0.04)
- Law (0.46)
- Media (0.38)
- Leisure & Entertainment (0.38)
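For readers unfamiliar with the term, the paged attention named in the Triton abstract can be illustrated in plain Python. This is a minimal sketch and not the authors' kernel: the block size, the cache layout, the function names, and the use of an online softmax are all assumptions made for the example. Keys and values live in fixed-size physical blocks, and a block table maps each logical block of the sequence to a physical one.

```python
import math

BLOCK = 2  # tokens per physical cache block (assumed for this example)


def paged_attention(q, k_cache, v_cache, block_table, seq_len):
    """Single-query attention over a paged KV cache, using online softmax."""
    scale = 1.0 / math.sqrt(len(q))
    running_max, denom = -math.inf, 0.0
    acc = [0.0] * len(v_cache[0][0])
    for logical, physical in enumerate(block_table):
        for slot in range(BLOCK):
            if logical * BLOCK + slot >= seq_len:
                break                         # skip padding slots
            k, v = k_cache[physical][slot], v_cache[physical][slot]
            score = scale * sum(a * b for a, b in zip(q, k))
            new_max = max(running_max, score)
            correction = math.exp(running_max - new_max)  # rescale old sums
            weight = math.exp(score - new_max)
            denom = denom * correction + weight
            acc = [x * correction + weight * y for x, y in zip(acc, v)]
            running_max = new_max
    return [x / denom for x in acc]


def dense_attention(q, keys, values):
    """Reference: ordinary softmax attention over contiguous keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
pad = [0.0, 0.0]
# Physical block 2 holds tokens 0-1, block 0 holds token 2, block 1 is unused.
k_cache = [[keys[2], pad], [pad, pad], [keys[0], keys[1]]]
v_cache = [[values[2], pad], [pad, pad], [values[0], values[1]]]
block_table = [2, 0]
query = [0.5, -0.25]

paged = paged_attention(query, k_cache, v_cache, block_table, seq_len=3)
dense = dense_attention(query, keys, values)
print(paged, dense)
```

The two results agree up to floating-point rounding, which is the point of the indirection: the physical placement of blocks does not change the attention output.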
Table 4: Influence of the optimal grouping on Group Window Attention. Group size g
We provide a diagram in Figure 6 for an intuitive illustration of our method. ... ViT to obtain representations for each visible patch. ... Illustration of the Group Window Attention scheme with shifted windows. B.2 Group Window Attention scheme with shifted windows. We provide a Python implementation of the Dynamic-Programming-based Optimal Grouping algorithm in Algorithm 2. As we can see, the two components of the Optimal Grouping algorithm ... Note that the padding operations are omitted here for simplicity.
Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
Zhu, Hanlin, Hao, Shibo, Hu, Zhiting, Jiao, Jiantao, Russell, Stuart, Tian, Yuandong
Large Language Models (LLMs) have demonstrated remarkable performance in many applications, including challenging reasoning problems via chain-of-thought (CoT) techniques that generate "thinking tokens" before answering the questions. While existing theoretical works demonstrate that CoTs with discrete tokens boost the capability of LLMs, recent work on continuous CoTs lacks a theoretical understanding of why it outperforms discrete counterparts in various reasoning tasks such as directed graph reachability, a fundamental graph reasoning problem that includes many practical domain applications as special cases. In this paper, we prove that a two-layer transformer with $D$ steps of continuous CoTs can solve the directed graph reachability problem, where $D$ is the diameter of the graph, while the best known result of constant-depth transformers with discrete CoTs requires $O(n^2)$ decoding steps where $n$ is the number of vertices ($D