Architect in the Loop Agentic Hardware Design and Verification
The ever-increasing complexity of the hardware design process demands improved hardware design and verification methodologies. With the advent of generative AI, various attempts have been made to automate parts of the design and verification process. Large language models (LLMs), as well as specialized models, can generate HDL and testbenches for small designs with a few leaf-level components. However, there have been only a few attempts to automate the entire processor design process. Hardware design demands hierarchical and modular design processes, and we apply this best practice systematically. We propose agentic, automated processor design and verification with engineers in the loop. The agent, given an optional specification, attempts to break the design down into sub-components, generates HDL and cocotb tests, and verifies the components with engineer guidance, especially during debugging and synthesis. We designed various digital systems using this approach, and selected two simple processors for demonstration in this work. The first is a simple LEGv8-like processor that was verified, synthesized, and programmed onto the DE-10 Lite FPGA. The second is a RISC-V-like 32-bit processor that was designed, verified, and synthesized in a similar manner, but not programmed onto the DE-10 Lite. Each processor typically required around a million inference tokens, using a combination of reasoning models (e.g., gemini-pro) and non-reasoning models (e.g., gpt-5-mini) depending on the complexity of the task. This indicates that hardware design and verification experimentation can be done cost-effectively without any specialized hardware. The approach is scalable: we have also attempted a system-on-chip design, which we plan to explore in future work.
Autocomp: A Powerful and Portable Code Optimizer for Tensor Accelerators
Hong, Charles, Bhatia, Sahil, Cheung, Alvin, Shao, Yakun Sophia
Hardware accelerators, especially those designed for tensor processing, have become ubiquitous in today's computing landscape. However, even with significant efforts in building compilers, programming these tensor accelerators remains challenging, leaving much of their potential underutilized. Recently, large language models (LLMs), trained on large amounts of code, have shown significant promise in code generation and optimization tasks, but generating code in low-resource languages, such as specialized tensor accelerator code, still poses a significant challenge. We tackle this challenge with Autocomp, an approach that empowers accelerator programmers to leverage domain knowledge and hardware feedback to optimize code via an automated LLM-driven search. We accomplish this by: 1) formulating each optimization pass as a structured two-phase prompt, divided into planning and code generation phases, 2) inserting domain knowledge during planning via a concise and adaptable optimization menu, and 3) integrating correctness and performance metrics from hardware as feedback at each search iteration. Across three distinct hardware platforms, we demonstrate that Autocomp-optimized code runs 5.6x faster than the vendor-provided library (Gemmini), outperforms expert-level hand-tuned code by 1.9x (AWS Trainium), and achieves 3.8x higher performance than a machine learning-based cost model for GPUs (NVIDIA L40S). Additionally, we demonstrate that optimization schedules generated from Autocomp can be reused across similar tensor operations, improving speedups by up to 24% under a fixed sample budget.
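The three ingredients listed in the abstract (a plan-then-implement prompt, an optimization menu consulted during planning, and correctness and performance feedback at every iteration) can be sketched as a small search loop. This is only an illustration under stated assumptions, not Autocomp's implementation: `propose_plan`, `apply_plan`, `measure`, the menu entries, the toy cost model, and the beam-style search are all hypothetical stand-ins for the LLM calls and the hardware run.

```python
import random
from dataclasses import dataclass

# Hypothetical optimization menu; the real menu entries are not given in the abstract.
MENU = ["tile loops", "hoist loads", "unroll inner loop", "double buffer"]


@dataclass(frozen=True)
class Candidate:
    code: tuple      # stand-in for program text: the sequence of applied plans
    latency: float   # performance feedback from the (simulated) hardware


def propose_plan(candidate, rng):
    """Phase 1 (planning): stand-in for an LLM picking an entry from the menu."""
    return rng.choice(MENU)


def apply_plan(candidate, plan):
    """Phase 2 (code generation): stand-in for an LLM rewriting the code."""
    return candidate.code + (plan,)


def measure(code):
    """Stand-in for hardware feedback: returns (is_correct, latency).

    Toy cost model (an assumption): each distinct optimization cuts latency
    by 20%, and applying the same optimization twice yields incorrect code.
    """
    correct = len(set(code)) == len(code)
    return correct, 100.0 * (0.8 ** len(set(code)))


def search(iterations=3, beam_width=2, samples=4, seed=0):
    rng = random.Random(seed)
    beam = [Candidate(code=(), latency=measure(())[1])]
    for _ in range(iterations):
        pool = list(beam)  # parents stay in the pool, so the best never regresses
        for parent in beam:
            for _ in range(samples):
                plan = propose_plan(parent, rng)
                code = apply_plan(parent, plan)
                correct, latency = measure(code)
                if correct:  # incorrect candidates are discarded
                    pool.append(Candidate(code, latency))
        unique = list(dict.fromkeys(pool))  # de-duplicate, preserving order
        beam = sorted(unique, key=lambda c: c.latency)[:beam_width]
    return beam[0]


best = search()
print(best.code, best.latency)
```

Keeping the parents in the candidate pool means an iteration that produces only incorrect or slower code leaves the current best unchanged.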
- North America > United States > California > Alameda County > Berkeley (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Research Report (0.64)
- Workflow (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Appendix A: A Stochastic Markov Model of a 2-Server Load Balancing Problem
Similar to the proof of Proposition 12, given the stability constraint in Eq. (4), we have C 0, l ... Theorem 14. Multi-agent load balancing is MPG with the VBF ... Solid and dashed arrows represent deterministic and non-deterministic procedures, respectively. Real-world network applications can be CPU-bound or IO-bound [47, 48]. The simulator allows configuring applications that require multi-stage processes switching between CPU/IO queues (Figure 1b). Two different processing models are used for CPU and IO queues, respectively.
- Information Technology (0.68)
- Energy > Power Industry (0.62)
LLM-Aided Compilation for Tensor Accelerators
Hong, Charles, Bhatia, Sahil, Haan, Altan, Dong, Shengjun Kris, Nikiforov, Dima, Cheung, Alvin, Shao, Yakun Sophia
Hardware accelerators, in particular accelerators for tensor processing, have many potential application domains. However, they currently lack the software infrastructure to support the majority of domains outside of deep learning. Furthermore, a compiler that can easily be updated to reflect changes at both application and hardware levels would enable more agile development and design space exploration of accelerators, allowing hardware designers to realize closer-to-optimal performance. In this work, we discuss how large language models (LLMs) could be leveraged to build such a compiler. Specifically, we demonstrate the ability of GPT-4 to achieve high pass rates in translating code to the Gemmini accelerator, and prototype a technique for decomposing translation into smaller, more LLM-friendly steps. Additionally, we propose a 2-phase workflow for utilizing LLMs to generate hardware-optimized code.
- North America > United States > California > Alameda County > Berkeley (0.14)
- North America > United States > New York > New York County > New York City (0.05)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
Learning Distributed and Fair Policies for Network Load Balancing as Markov Potential Game
This paper investigates the network load balancing problem in data centers (DCs) where multiple load balancers (LBs) are deployed, using the multi-agent reinforcement learning (MARL) framework. The challenges of this problem consist of the heterogeneous processing architecture and dynamic environments, as well as the limited and partial observability of each LB agent in distributed networking systems, which can largely degrade the performance of in-production load balancing algorithms in real-world setups. The centralised-training-decentralised-execution (CTDE) RL scheme has been proposed to improve MARL performance, yet it incurs additional communication and management overhead among agents, especially in distributed networking systems, which favour distributed and plug-and-play design schemes. We formulate the multi-agent load balancing problem as a Markov potential game, with a carefully designed workload distribution fairness as the potential function. A fully distributed MARL algorithm is proposed to approximate the Nash equilibrium of the game. Experimental evaluations involve both an event-driven simulator and a real-world system, where the proposed MARL load balancing algorithm shows close-to-optimal performance in simulations and superior results over in-production LBs in the real-world system.
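To make the role of a fairness potential concrete, the following toy sketch runs best-response dynamics in a one-shot, fully observable game. It is not the paper's Markov game or its MARL algorithm. The potential used here (negative variance of per-server load), the identical-interest utilities, and the names `loads`, `potential`, and `best_response_dynamics` are all assumptions made for illustration. Because every unilateral move that helps an agent also raises the shared potential, the dynamics stop at an assignment from which no single LB can improve, i.e. a Nash equilibrium of the toy game.

```python
from statistics import pvariance


def loads(assignment, weights, n_servers):
    """Per-server load when LB i sends weights[i] units to server assignment[i]."""
    out = [0.0] * n_servers
    for server, w in zip(assignment, weights):
        out[server] += w
    return out


def potential(assignment, weights, n_servers):
    """Assumed fairness potential: negative variance of the server loads."""
    return -pvariance(loads(assignment, weights, n_servers))


def best_response_dynamics(assignment, weights, n_servers):
    """Each LB in turn moves to the server that most increases the potential."""
    assignment = list(assignment)
    improved = True
    while improved:
        improved = False
        for i in range(len(assignment)):
            best_server = assignment[i]
            best_value = potential(assignment, weights, n_servers)
            for s in range(n_servers):
                trial = assignment[:i] + [s] + assignment[i + 1:]
                value = potential(trial, weights, n_servers)
                if value > best_value + 1e-12:  # strict improvement only
                    best_server, best_value = s, value
            if best_server != assignment[i]:
                assignment[i] = best_server
                improved = True
    return assignment


weights = [3.0, 2.0, 2.0, 1.0]   # traffic handled by each of four LBs
start = [0, 0, 0, 0]             # every LB initially targets server 0
final = best_response_dynamics(start, weights, n_servers=2)
print(final, loads(final, weights, 2))
```

The loop terminates because the potential strictly increases with every accepted move and there are only finitely many assignments.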
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Information Technology (1.00)
- Energy > Power Industry (1.00)
Closed Predicates in Description Logics: Results on Combined Complexity
Ngo, Nhung (Free University of Bozen-Bolzano) | Ortiz, Magdalena ( Vienna University of Technology ) | Simkus, Mantas ( Vienna University of Technology )
Some applications of Description Logic (DL) ontologies combine complete information (e.g., stemming from relational databases) with incomplete, open-world knowledge. Several research efforts in recent years have advocated closed predicates, which are predicates whose extension is interpreted as complete, as a suitable way to leverage partial completeness within the standard open-world semantics of DLs. These works have also studied the data complexity of query answering in the presence of closed predicates, which is generally intractable. In this paper we contribute to the understanding of the combined complexity of the problem, by establishing tight complexity results for a range of DLs and query answering problems. In summary, our results show that consistency testing and instance query answering in the presence of closed predicates are feasible in NP even for rich dialects of the DL-Lite family; this is the lowest complexity that could be expected. For EL, in contrast, they are EXPTIME-complete, thus as hard as for ALC and some of its extensions. If unions of conjunctive queries (UCQs) are considered, the picture is even bleaker: we can show 2EXPTIME-hardness even for DL-Lite_R and EL. This is in sharp contrast to the NP upper bound in the standard setting without closed predicates, and coincides with known upper bounds for much richer DLs. We note that our results imply 2EXPTIME-hardness of query answering in ALCO for the standard setting, where all predicates are interpreted under the open-world semantics. This singles out nominals as a previously unidentified source of complexity when answering queries over expressive DLs. Despite these negative results, we can still identify several useful classes of queries for which the increase in hardness is not as drastic, and the combined complexity of query answering remains between NP and coNEXPTIME.
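The effect of closing a predicate on instance query answering can be seen in a tiny brute-force example. The knowledge base below (the concept names, the individuals, and the axiom Employee ⊑ Manager ⊔ Engineer) is invented for illustration and does not come from the paper. As a further simplifying assumption, interpretations are enumerated only over the two named individuals, which is enough to give the right answers for this particular query.

```python
from itertools import combinations, product

DOMAIN = ("alice", "bob")
CONCEPTS = ("Employee", "Manager", "Engineer")
# Hypothetical data (ABox).
ABOX = {("Employee", "alice"), ("Employee", "bob"), ("Manager", "alice")}


def subsets(items):
    """All subsets of items, as frozensets."""
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def models(closed=()):
    """Yield interpretations over DOMAIN that satisfy the ABox, the TBox,
    and the closed-predicate constraints."""
    for exts in product(subsets(DOMAIN), repeat=len(CONCEPTS)):
        interp = dict(zip(CONCEPTS, exts))
        # ABox: every asserted fact must hold.
        if not all(ind in interp[c] for c, ind in ABOX):
            continue
        # TBox: Employee ⊑ Manager ⊔ Engineer.
        if not interp["Employee"] <= interp["Manager"] | interp["Engineer"]:
            continue
        # Closed predicates: the extension is exactly what the data says.
        if any(interp[c] != {ind for c2, ind in ABOX if c2 == c}
               for c in closed):
            continue
        yield interp


def entails(concept, individual, closed=()):
    """Instance query: does the individual belong to the concept in every model?"""
    return all(individual in m[concept] for m in models(closed))


print(entails("Engineer", "bob"))                       # open-world semantics
print(entails("Engineer", "bob", closed=("Manager",)))  # Manager is closed
```

Under open-world semantics bob might be an unrecorded manager, so Engineer(bob) does not follow. Once Manager is closed, bob cannot be a manager and the axiom forces him to be an engineer.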
A nonclassical symbolic theory of working memory, mental computations, and mental set
The paper tackles four basic questions associated with the human brain as a learning system. How can the brain learn to (1) mentally simulate different external memory aids, (2) perform, in principle, any mental computations using imaginary memory aids, (3) recall real sensory and motor events and synthesize a combinatorial number of imaginary events, and (4) dynamically change its mental set to match a combinatorial number of contexts? We propose a uniform answer to (1)-(4) based on the general postulate that the human neocortex processes symbolic information in a "nonclassical" way. Instead of manipulating symbols in a read/write memory, as classical symbolic systems do, it manipulates the states of dynamical memory representing different temporary attributes of immovable symbolic structures stored in a long-term memory. The approach is formalized as the concept of E-machine. Intuitively, an E-machine is a system that deals mainly with characteristic functions representing subsets of memory pointers rather than the pointers themselves. This nonclassical symbolic paradigm is Turing universal and, unlike the classical one, is efficiently implementable in homogeneous neural networks with temporal modulation topologically resembling that of the neocortex.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > New York (0.04)
- North America > United States > New Jersey (0.04)