Let q(Y; Θ) and c_K(Y, X) be two smooth, decomposable circuits that are compatible over Y. Then computing their product as a circuit r_{Θ,K}(X, Y) = q(Y; Θ) · c_K(Y, X) that is decomposable over Y can be done in O(|q||c|). Let r(X, Y) be a circuit that is smooth, decomposable, and deterministic over Y; then for a configuration x its MAP state argmax_y r(x, y) can be computed in time O(|r|). For our experiments we use standard compilation tools to obtain a constraint circuit starting from a propositional logical formula in conjunctive normal form. We now illustrate step-by-step one example of such a compilation for a simple logical formula. Deterministic sum units represent disjoint solutions to the logical formula, meaning there exist distinct assignments, characterized by the children, that satisfy the logical constraint, e.g.
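To make the MAP statement concrete, here is a toy sketch (our own hypothetical dict-based circuit encoding, not an actual circuit library): for a smooth, decomposable circuit that is deterministic over Y, replacing every sum unit with a max in a single bottom-up pass yields the MAP state in time linear in the circuit size.

```python
def map_state(unit, x):
    """Return (max value, maximizing assignment of Y) for a circuit node."""
    kind = unit["type"]
    if kind == "leaf_x":        # input unit over the observed variable X
        return unit["f"](x), {}
    if kind == "leaf_y":        # weighted indicator over a latent variable Y_i
        return unit["weight"], {unit["var"]: unit["value"]}
    if kind == "product":       # decomposable: children mention disjoint Y vars
        val, assign = 1.0, {}
        for child in unit["children"]:
            v, a = map_state(child, x)
            val *= v
            assign.update(a)
        return val, assign
    if kind == "sum":           # deterministic: max replaces the sum exactly
        return max((map_state(child, x) for child in unit["children"]),
                   key=lambda pair: pair[0])

# A tiny circuit r(X, Y) with one binary latent variable Y:
circuit = {"type": "sum", "children": [
    {"type": "product", "children": [
        {"type": "leaf_y", "var": "Y", "value": 0, "weight": 0.3},
        {"type": "leaf_x", "f": lambda x: 0.9 if x == 0 else 0.1}]},
    {"type": "product", "children": [
        {"type": "leaf_y", "var": "Y", "value": 1, "weight": 0.7},
        {"type": "leaf_x", "f": lambda x: 0.2 if x == 0 else 0.8}]},
]}
```

Each unit is visited exactly once, matching the O(|r|) bound; determinism over Y is what makes the max at sum units exact.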
Supplementary Material
A sitting is a meeting of parliament members. While in the virtual environment, you will need to install the specific Gensim version needed for the Compass approach. In other instances, the beginning of the line that specifies the speaker consists of the role of the parliament member, for example "SPEAKER OF THE PARLIAMENT" (meaning the member of parliament presiding), sometimes followed by the actual full name of the person in parentheses. The id is a unique number we assigned to each file. The main challenge of translating the files from Greek to English was the conversion of the Greek alphabetic numerals to Indo-Arabic numerals.
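A minimal sketch of such a numeral conversion (our own illustrative implementation, covering values below 1000 and ignoring the keraia mark; the actual translation pipeline is not shown here):

```python
# Additive values of Greek alphabetic (Milesian) numerals below 1000.
GREEK_NUMERALS = {
    "α": 1, "β": 2, "γ": 3, "δ": 4, "ε": 5, "ϛ": 6, "ζ": 7, "η": 8, "θ": 9,
    "ι": 10, "κ": 20, "λ": 30, "μ": 40, "ν": 50, "ξ": 60, "ο": 70, "π": 80, "ϟ": 90,
    "ρ": 100, "σ": 200, "τ": 300, "υ": 400, "φ": 500, "χ": 600, "ψ": 700, "ω": 800, "ϡ": 900,
}

def greek_to_arabic(numeral: str) -> int:
    """Convert e.g. "ριβ'" to 112; unknown marks (keraia, accents) count as 0."""
    return sum(GREEK_NUMERALS.get(ch, 0) for ch in numeral.lower())
```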
- Europe > Greece (0.05)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > Florida > Hillsborough County > Tampa (0.04)
Pogosim -- a Simulator for Pogobot robots
Cazenille, Leo, Macabre, Loona, Bredeche, Nicolas
Pogobots are a new type of open-source/open-hardware robot specifically designed for swarm robotics research. Their cost-effective and modular design, complemented by vibration-based and wheel-based locomotion, fast infrared communication, and an extensive software architecture, facilitates the implementation of swarm intelligence algorithms. However, testing even simple distributed algorithms directly on robots is particularly labor-intensive, and scaling to more complex problems or calibrating user code parameters would place a prohibitively high strain on available resources. In this article we present Pogosim, a fast and scalable simulator for Pogobots, designed to reduce algorithm development costs as much as possible. The exact same code can be used both in simulation and to drive real robots in experiments. This article details the software architecture of Pogosim, explains how to write configuration files and user programs, and describes how simulations approximate or differ from experiments. We describe how a large set of simulations can be launched in parallel, how to retrieve and analyze the simulation results, and how to optimize user code parameters using optimization algorithms.
Toward Reproducible Cross-Backend Compatibility for Deep Learning: A Configuration-First Framework with Three-Tier Verification
This paper presents a configuration-first framework for evaluating cross-backend compatibility in deep learning systems deployed on CPU, GPU, and compiled runtimes. The framework decouples experiments from code using YAML, supports both library and repository models, and employs a three-tier verification protocol covering tensor-level closeness, activation alignment, and task-level metrics. Through 672 checks across multiple models and tolerance settings, we observe that 72.0% of runs pass, with most discrepancies occurring under stricter thresholds. Our results show that detection models and compiled backends are particularly prone to drift, often due to nondeterministic post-processing. We further demonstrate that deterministic adapters and selective fallbacks can substantially improve agreement without significant performance loss. To our knowledge, this is the first unified framework that systematically quantifies and mitigates cross-backend drift in deep learning, providing a reproducible methodology for dependable deployment across heterogeneous runtimes.
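As a sketch of what the three verification tiers check (hypothetical function names and thresholds; the framework's YAML-driven interface is not reproduced here):

```python
import numpy as np

def tier1_tensor_close(ref, out, rtol=1e-5, atol=1e-6):
    # Tier 1: element-wise closeness of the final output tensors.
    return np.allclose(ref, out, rtol=rtol, atol=atol)

def tier2_activation_align(ref_acts, out_acts, min_cos=0.999):
    # Tier 2: per-layer activation alignment via cosine similarity.
    for r, o in zip(ref_acts, out_acts):
        r, o = r.ravel(), o.ravel()
        cos = float(r @ o / (np.linalg.norm(r) * np.linalg.norm(o)))
        if cos < min_cos:
            return False
    return True

def tier3_task_metric(ref_metric, out_metric, max_gap=0.01):
    # Tier 3: task-level metric (e.g. accuracy or mAP) within a tolerance gap.
    return abs(ref_metric - out_metric) <= max_gap
```

Tightening `rtol`/`atol` or `min_cos` is exactly what turns a passing run into a reported discrepancy, which is why most failures cluster at the stricter thresholds.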
A Proofs
The proof directly follows from Theorem 3.2 of Vergari et al. [75]. Note that O(|q||c|) is a loose upper bound and the size of r is in practice smaller [75]. Analogously, the second statement of Theorem 3.1 follows from Proposition A.1.
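As a small illustration of determinism (a toy formula of our own choosing, not the paper's running example): X ∨ Y can be compiled into the disjoint branches X and ¬X ∧ Y, so the children of the top sum unit are never satisfied simultaneously.

```python
from itertools import product

# Branches of the compiled circuit for X ∨ Y:
branch_a = lambda X, Y: X                 # covers all models with X true
branch_b = lambda X, Y: (not X) and Y     # covers the remaining models

assignments = list(product([False, True], repeat=2))
sols_a = {a for a in assignments if branch_a(*a)}
sols_b = {a for a in assignments if branch_b(*a)}

assert sols_a.isdisjoint(sols_b)                                    # determinism
assert sols_a | sols_b == {a for a in assignments if a[0] or a[1]}  # correctness
```

Because the branches partition the solution set, model counts (and, with weights, probabilities) can simply be summed at the deterministic sum unit.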
Supplementary Material
The dataset includes 1,280,918 speech fragments of Greek parliament members in debate order, exported from 5,355 parliamentary sitting record files, with a total volume of 2.12 GB. The speeches extend chronologically from July 1989 up to July 2020. Table 1 shows the contents of the dataset. The names of the speakers are provided in the format "last_name patronym first_name (nickname)". In cases with more than one first or last name, the names that belong to the same category (first or last) are connected with a dash, e.g., "merkouri stamatiou amalia-maria (melina)". A parliamentary period is defined as the time span between one general election and the next. A parliamentary period includes multiple parliamentary sessions. A session is a time span of usually 10 months within a parliamentary period during which the parliament can convene and function as stipulated by the constitution.
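A hypothetical helper for splitting the documented name format (our own regex, not part of the dataset tooling; dash-connected multi-part names match as single tokens):

```python
import re

# "last_name patronym first_name (nickname)", with the nickname optional.
NAME_RE = re.compile(
    r"^(?P<last>\S+)\s+(?P<patronym>\S+)\s+(?P<first>\S+)"
    r"(?:\s+\((?P<nickname>[^)]+)\))?$"
)

def parse_speaker(raw: str):
    """Return the name fields as a dict, or None if the line does not match."""
    m = NAME_RE.match(raw.strip())
    return m.groupdict() if m else None
```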
- Europe > Greece (0.05)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > Florida > Hillsborough County > Tampa (0.04)
Lobster: A GPU-Accelerated Framework for Neurosymbolic Programming
Biberstein, Paul, Li, Ziyang, Devietti, Joseph, Naik, Mayur
Neurosymbolic programs combine deep learning with symbolic reasoning to achieve better data efficiency, interpretability, and generalizability compared to standalone deep learning approaches. However, existing neurosymbolic learning frameworks implement an uneasy marriage between a highly scalable, GPU-accelerated neural component and a slower symbolic component that runs on CPUs. We propose Lobster, a unified framework for harnessing GPUs in an end-to-end manner for neurosymbolic learning. Lobster maps a general neurosymbolic language based on Datalog to the GPU programming paradigm. This mapping is implemented via compilation to a new intermediate language called APM. The extra abstraction provided by APM allows Lobster to be both flexible, supporting discrete, probabilistic, and differentiable modes of reasoning on GPU hardware with a library of provenance semirings, and performant, implementing new optimization passes. We demonstrate that Lobster programs can solve interesting problems spanning the domains of natural language processing, image processing, program reasoning, bioinformatics, and planning. On a suite of 8 applications, Lobster achieves an average speedup of 5.3x over Scallop, a state-of-the-art neurosymbolic framework, and enables scaling of neurosymbolic solutions to previously infeasible tasks.
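The provenance-semiring abstraction mentioned above can be sketched in a few lines (a generic illustration, not Lobster's APM implementation): the same rule-evaluation code runs under different reasoning modes simply by swapping the semiring.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Semiring:
    zero: Any                           # identity of "plus" (no derivation)
    one: Any                            # identity of "times" (empty conjunction)
    plus: Callable[[Any, Any], Any]     # combine alternative derivations
    times: Callable[[Any, Any], Any]    # combine facts within one derivation

# Discrete (boolean) and probabilistic (Viterbi-style max-prob) modes:
BOOLEAN = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
MAXPROB = Semiring(0.0, 1.0, max, lambda a, b: a * b)

def conjoin(sr: Semiring, tags):
    """Tag of a rule body: the semiring product of its facts' tags."""
    acc = sr.one
    for t in tags:
        acc = sr.times(acc, t)
    return acc
```

A differentiable mode would use tensor-valued tags with the same interface, which is what makes the abstraction amenable to batched GPU execution.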
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.14)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile
Zhang, Ruisi, Liu, Tianyu, Feng, Will, Gu, Andrew, Purandare, Sanket, Liang, Wanchao, Massa, Francisco
Distributed training of large models consumes enormous computation resources and requires substantial engineering efforts to compose various training techniques. This paper presents SimpleFSDP, a PyTorch-native compiler-based Fully Sharded Data Parallel (FSDP) framework, which has a simple implementation for maintenance and composability, allows full computation-communication graph tracing, and brings performance enhancement via compiler backend optimizations. SimpleFSDP's novelty lies in its unique torch.compile-friendly implementation of collective communications using existing PyTorch primitives, namely parametrizations, selective activation checkpointing, and DTensor. It also features the first-of-its-kind intermediate representation (IR) node bucketing and reordering in the TorchInductor backend for effective computation-communication overlapping. As a result, users can employ the aforementioned optimizations to automatically or manually wrap model components for minimal communication exposure. Extensive evaluations of SimpleFSDP on Llama 3 models (including the ultra-large 405B) using TorchTitan demonstrate up to 28.54% memory reduction and 68.67% throughput improvement compared to the most widely adopted FSDP2 eager framework, when composed with other distributed training techniques.
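The parametrization idea can be sketched in plain Python (a conceptual analogy only; the real system uses PyTorch parametrizations and DTensor collectives): the full weight is reconstructed from per-rank shards on every access, so the gather stays inside the traced computation rather than happening out-of-band.

```python
class AllGatherParametrization:
    """Rebuild the full parameter from shards each time it is accessed."""
    def __init__(self, shards):
        self.shards = shards               # one list per (simulated) rank

    def __call__(self):
        full = []                          # stand-in for an all-gather collective
        for shard in self.shards:
            full.extend(shard)
        return full

class ShardedLinear:
    def __init__(self, shards):
        self._weight = AllGatherParametrization(shards)

    @property
    def weight(self):                      # gather-on-access, not up front
        return self._weight()
```

Because the gather is an ordinary call on the access path, a tracing compiler sees it as part of the graph and can bucket and reorder it against compute.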
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Asia > Middle East > Saudi Arabia > Asir Province > Abha (0.04)
- Asia > Middle East > Jordan (0.04)
Characterizing and Efficiently Accelerating Multimodal Generation Model Inference
Lee, Yejin, Sun, Anna, Hosmer, Basil, Acun, Bilge, Balioglu, Can, Wang, Changhan, Hernandez, Charles David, Puhrsch, Christian, Haziza, Daniel, Guessous, Driss, Massa, Francisco, Kahn, Jacob, Wan, Jeffrey, Reizenstein, Jeremy, Zhai, Jiaqi, Isaacson, Joe, Schlosser, Joel, Pino, Juan, Sadagopan, Kaushik Ram, Shamis, Leonid, Ma, Linjian, Hwang, Min-Jae, Chen, Mingda, Elhoushi, Mostafa, Rodriguez, Pedro, Pasunuru, Ram, Yih, Scott, Popuri, Sravya, Liu, Xing, Wu, Carole-Jean
Generative artificial intelligence (AI) technology is revolutionizing the computing industry. Not only have its applications broadened to various sectors, but it also poses new system design and optimization opportunities. The technology is capable of understanding and responding in multiple modalities. However, the advanced capability currently comes with significant system resource demands. To sustainably scale generative AI capabilities to billions of users in the world, inference must be fast and efficient. This paper pinpoints key system design and optimization opportunities by characterizing a family of emerging multi-modal generation models on real systems. Auto-regressive token generation is a critical latency performance bottleneck, typically dominated by GPU idle time. In addition to memory-intensive attention across the generative AI models, linear operations constitute significant inference latency due to the feed forward networks in Transformer-based models. We demonstrate that state-of-the-art optimization levers, spanning from applications to system software and hardware, set a 3.88x better baseline.
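The claim about linear operations can be illustrated with back-of-envelope per-token FLOP counts (hypothetical model dimensions, two FLOPs per multiply-accumulate):

```python
d_model, d_ff, ctx = 4096, 16384, 1024   # assumed sizes, KV-cached decoding

# Feed-forward network: up-projection plus down-projection per token.
ffn_flops = 2 * d_model * d_ff * 2

# Attention: QK^T scores and attention-weighted values over the cached context.
attn_flops = 2 * d_model * ctx * 2

print(ffn_flops / attn_flops)            # → 16.0
```

With these (assumed) sizes the FFN does 16x the attention work per generated token, consistent with the observation that linear layers constitute a significant share of inference latency; the balance shifts toward attention only as the context grows.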
Flexible and Efficient Surrogate Gradient Modeling with Forward Gradient Injection
Automatic differentiation is a key feature of present deep learning frameworks. Moreover, they typically provide various ways to specify custom gradients within the computation graph, which is of particular importance for defining surrogate gradients in the realm of non-differentiable operations such as the Heaviside function in spiking neural networks (SNNs). PyTorch, for example, allows the custom specification of the backward pass of an operation by overriding its backward method. Other frameworks provide comparable options. While these methods are common practice and usually work well, they also have several disadvantages such as limited flexibility, additional source code overhead, poor usability, or a potentially strong negative impact on the effectiveness of automatic model optimization procedures. In this paper, an alternative way to formulate surrogate gradients is presented, namely, forward gradient injection (FGI). FGI applies a simple but effective combination of basic standard operations to inject an arbitrary gradient shape into the computational graph directly within the forward pass. It is demonstrated that using FGI is straightforward and convenient. Moreover, it is shown that FGI can significantly increase the model performance in comparison to custom backward methods in SNNs when using TorchScript. These results are complemented with a general performance study on recurrent SNNs with TorchScript and torch.compile, revealing the potential for a training speedup of more than 7x and an inference speedup of more than 16x in comparison with pure PyTorch.
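A minimal sketch of the FGI idea as we read it from the abstract (with an assumed sigmoid-derivative surrogate; the paper's exact formulation may differ): only standard forward ops plus detach() are used, yet the backward pass sees the injected gradient.

```python
import torch

def heaviside_fgi(x):
    h = (x > 0).float()                   # hard forward value (zero gradient)
    s = torch.sigmoid(4.0 * x)
    g = (4.0 * s * (1.0 - s)).detach()    # injected surrogate gradient shape
    # Forward evaluates to h; backward sees d/dx = g:
    return (h - g * x).detach() + g * x

x = torch.tensor([0.5, -0.5], requires_grad=True)
y = heaviside_fgi(x)
y.sum().backward()
```

Because the whole expression is ordinary tensor arithmetic with no custom backward method, it traces cleanly under TorchScript and torch.compile.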
- North America > United States (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Europe > Germany (0.04)
- Europe > Austria (0.04)