AITopics | gpu implementation


gpu implementation

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Enforcing convex constraints in Graph Neural Networks

Rashwan, Ahmed, Briggs, Keith, Budd, Chris, Kreusser, Lisa

arXiv.org Artificial Intelligence, Oct-14-2025

Many machine learning applications require outputs that satisfy complex, dynamic constraints. This task is particularly challenging in Graph Neural Network models due to the variable output sizes of graph-structured data. In this paper, we introduce ProjNet, a Graph Neural Network framework which satisfies input-dependent constraints. ProjNet combines a sparse vector clipping method with the Component-Averaged Dykstra (CAD) algorithm, an iterative scheme for solving the best-approximation problem. We establish a convergence result for CAD and develop a GPU-accelerated implementation capable of handling large-scale inputs efficiently. To enable end-to-end training, we introduce a surrogate gradient for CAD that is both computationally efficient and better suited for optimization than the exact gradient. We validate ProjNet on four classes of constrained optimisation problems: linear programming, two classes of non-convex quadratic programs, and radio transmit power optimization, demonstrating its effectiveness across diverse problem settings.

artificial intelligence, constraint, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2510.11227

Genre: Research Report (0.50)

Industry: Energy (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
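
The ProjNet abstract above describes the best-approximation problem: finding the point of an intersection of convex sets closest to a given input. As a rough illustration only, the sketch below runs the classical Dykstra projection scheme on two simple sets in pure Python. It is not the paper's Component-Averaged Dykstra variant, and the helper names (`project_box`, `project_halfspace`, `dykstra`) are hypothetical.

```python
def project_box(x, lo, hi):
    """Project x onto the box {z : lo <= z_i <= hi} by clipping each entry."""
    return [min(max(v, lo), hi) for v in x]


def project_halfspace(x, a, b):
    """Project x onto the half-space {z : a . z <= b}."""
    violation = sum(ai * xi for ai, xi in zip(a, x)) - b
    if violation <= 0:
        return list(x)  # already feasible
    norm_sq = sum(ai * ai for ai in a)
    return [xi - violation * ai / norm_sq for ai, xi in zip(a, x)]


def dykstra(point, projections, iterations=200):
    """Classical Dykstra: cycle through the sets, carrying one correction
    vector per set so the iterates approach the nearest feasible point."""
    x = list(point)
    corrections = [[0.0] * len(x) for _ in projections]
    for _ in range(iterations):
        for k, project in enumerate(projections):
            y = [xi + ci for xi, ci in zip(x, corrections[k])]
            x = project(y)
            corrections[k] = [yi - xi for yi, xi in zip(y, x)]
    return x


# Assumed toy problem: nearest point to (2, 2) inside the unit box
# that also satisfies x + y <= 1.
nearest = dykstra(
    [2.0, 2.0],
    [
        lambda v: project_box(v, 0.0, 1.0),
        lambda v: project_halfspace(v, [1.0, 1.0], 1.0),
    ],
)
```

The correction vectors are what separate Dykstra's scheme from plain alternating projections, which only guarantee some feasible point rather than the closest one.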


Comparative Analysis of FPGA and GPU Performance for Machine Learning-Based Track Reconstruction at LHCb

Giasemis, Fotis I., Lončar, Vladimir, Granado, Bertrand, Gligorov, Vladimir Vava

arXiv.org Artificial Intelligence, Feb-16-2025

In high-energy physics, the increasing luminosity and detector granularity at the Large Hadron Collider are driving the need for more efficient data processing solutions. Machine Learning has emerged as a promising tool for reconstructing charged particle tracks, due to its potentially linear computational scaling with detector hits. The recent implementation of a graph neural network-based track reconstruction pipeline in the first level trigger of the LHCb experiment on GPUs serves as a platform for comparative studies between computational architectures in the context of high-energy physics. This paper presents a novel comparison of the throughput of ML model inference between FPGAs and GPUs, focusing on the first step of the track reconstruction pipeline: an implementation of a multilayer perceptron. Using HLS4ML for FPGA deployment, we benchmark its performance against the GPU implementation and demonstrate the potential of FPGAs for high-throughput, low-latency inference without the need for expertise in FPGA development and while consuming significantly less power.

artificial intelligence, implementation, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2502.02304

Country:

North America > United States (0.47)
Europe (0.29)

Genre:

Research Report (0.64)
Workflow (0.49)

Industry:

Information Technology (0.50)
Energy > Power Industry (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)


Kino-PAX: Highly Parallel Kinodynamic Sampling-based Planner

Perrault, Nicolas, Ho, Qi Heng, Lahijanian, Morteza

arXiv.org Artificial Intelligence, Sep-10-2024

Sampling-based motion planners (SBMPs) are effective for planning with complex kinodynamic constraints in high-dimensional spaces, but they still struggle to achieve real-time performance, mainly because of their serial computation design. We present Kinodynamic Parallel Accelerated eXpansion (Kino-PAX), a novel highly parallel kinodynamic SBMP designed for parallel devices such as GPUs. Kino-PAX grows a tree of trajectory segments directly in parallel. Our key insight is how to decompose the iterative tree growth process into three massively parallel subroutines. Kino-PAX is designed to align with parallel device execution hierarchies by ensuring that threads are largely independent, share equal workloads, and take advantage of low-latency resources while minimizing high-latency data transfers and process synchronization. This design results in a very efficient GPU implementation. We prove that Kino-PAX is probabilistically complete and analyze its scalability with compute hardware improvements. Empirical evaluations demonstrate solutions on the order of 10 ms on a desktop GPU and on the order of 100 ms on an embedded GPU, representing up to 1000 times improvement compared to coarse-grained CPU parallelization of state-of-the-art sequential algorithms over a range of complex environments and systems.

algorithm, kino-pax, node, (14 more...)

arXiv.org Artificial Intelligence

2409.06807

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > Colorado > Boulder County > Boulder (0.04)
North America > Canada > Quebec > Capitale-Nationale Region > Québec (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Hardware (0.91)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.46)
Information Technology > Artificial Intelligence > Robots > Robot Planning & Action (0.31)


GPU-Accelerated Counterfactual Regret Minimization

Kim, Juho

arXiv.org Artificial Intelligence, Sep-6-2024

Counterfactual regret minimization is a family of algorithms of no-regret learning dynamics capable of solving large-scale imperfect information games. We propose implementing this algorithm as a series of dense and sparse matrix and vector operations, thereby making it highly parallelizable for a graphics processing unit, at the cost of higher memory usage. Our experiments show that our implementation performs up to about 352.5 times faster than OpenSpiel's Python implementation and up to about 22.2 times faster than OpenSpiel's C++ implementation, and the speedup becomes more pronounced as the size of the game being solved grows. Counterfactual regret minimization (CFR) (Zinkevich et al., 2007) is a family of algorithms of no-regret learning dynamics capable of solving large-scale imperfect information games. Its variants dominated the development of AI agents for large imperfect information games like Poker (Tammelin et al., 2015; Moravčík et al., 2017; Brown & Sandholm, 2018; 2019b) and The Resistance: Avalon (Serrino et al., 2019) and were components of ReBeL (Brown et al., 2020) and Student of Games (Schmid et al., 2023).

implementation, openspiel, pvq, (16 more...)

arXiv.org Artificial Intelligence

2408.14778

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Texas (0.04)

Genre: Research Report (0.40)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.34)
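
The CFR entry above frames the algorithm as a series of vector operations. One such building block is regret matching, the rule CFR uses to turn accumulated regrets into a strategy. The sketch below is a minimal pure-Python illustration of that single step for one decision point. It is not the paper's GPU implementation, and the function names are hypothetical.

```python
def regret_matching(cumulative_regrets):
    """Play each action in proportion to its positive cumulative regret;
    fall back to the uniform strategy when no regret is positive."""
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    n = len(cumulative_regrets)
    return [1.0 / n] * n


def update_regrets(cumulative_regrets, action_values, strategy):
    """Add each action's regret: its value minus the value of the
    strategy that was actually played."""
    expected = sum(p * v for p, v in zip(strategy, action_values))
    return [r + (v - expected) for r, v in zip(cumulative_regrets, action_values)]


# Assumed toy decision point with three actions.
strategy = regret_matching([1.0, -2.0, 3.0])
regrets = update_regrets([0.0, 0.0, 0.0], [1.0, 0.0, -1.0], [0.5, 0.0, 0.5])
```

Both functions are elementwise operations plus a reduction, which is why applying them to every decision point at once maps naturally onto batched array code.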


FULL-W2V: Fully Exploiting Data Reuse for W2V on GPU-Accelerated Systems

Randall, Thomas, Allen, Tyler, Ge, Rong

arXiv.org Artificial Intelligence, Dec-12-2023

Word2Vec remains one of the most impactful innovations in the field of Natural Language Processing (NLP), representing latent grammatical and syntactical information in human text with dense, low-dimensional vectors. Word2Vec has high computational cost due to the algorithm's inherent sequentiality, intensive memory accesses, and the large vocabularies it represents. While prior studies have investigated technologies to explore parallelism and improve memory system performance, they struggle to effectively gain throughput on powerful GPUs. We identify memory data access and latency as the primary bottleneck in prior works on GPUs, which prevents highly optimized kernels from attaining the architecture's peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce access to low memory levels and improve temporal locality. FULL-W2V is capable of reducing accesses to GPU global memory significantly, e.g., by more than 89%, compared to prior state-of-the-art GPU implementations, resulting in significant performance improvement that scales across successive hardware generations. Our prototype implementation achieves 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state-of-the-art by 5.72X on V100 cards with the same embedding quality. In-depth analysis indicates that the reduction of memory accesses through register and shared memory caching and high-throughput shared memory reduction leads to a significantly improved arithmetic intensity. FULL-W2V can potentially benefit many applications in NLP and other domains.

architecture, full-w2v, implementation, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3447818.3460373

2312.07743

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Italy (0.04)

Genre: Research Report (0.50)

Industry: Information Technology (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)


Virtual reservoir acceleration for CPU and GPU: Case study for coupled spin-torque oscillator reservoir

de Jong, Thomas Geert, Akashi, Nozomi, Taniguchi, Tomohiro, Notsu, Hirofumi, Nakajima, Kohei

arXiv.org Artificial Intelligence, Dec-2-2023

We provide high-speed implementations for simulating reservoirs described by $N$-coupled spin-torque oscillators. Here $N$ also corresponds to the number of reservoir nodes. We benchmark a variety of implementations based on CPU and GPU. Our new methods are at least 2.6 times quicker than the baseline for $N$ in the range $1$ to $10^4$. More specifically, over all implementations the best factor is 78.9 for $N=1$, which decreases to 2.6 for $N=10^3$ and finally increases to 23.8 for $N=10^4$. GPU outperforms CPU significantly at $N=2500$. Our results show that GPU implementations should be tested for reservoir simulations. The implementations considered here can be used for any reservoir with evolution that can be approximated using an explicit method.

implementation, reservoir, reservoir computing, (15 more...)

arXiv.org Artificial Intelligence

2312.01121

Country:

Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.06)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Genre: Research Report > New Finding (0.54)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)


Predicting Surface Texture in Steel Manufacturing at Speed

Milne, Alexander J. M., Xie, Xianghua

arXiv.org Artificial Intelligence, Jan-20-2023

Control of the surface texture of steel strip during the galvanizing and temper rolling processes is essential to satisfy customer requirements and is conventionally measured post-production using a stylus. In-production laser reflection measurement is less consistent than physical measurement but enables real-time adjustment of processing parameters to optimize product surface characteristics. We propose the use of machine learning to improve accuracy of the transformation from inline laser reflection measurements to a prediction of surface properties. In addition to accuracy, model evaluation speed is important for fast feedback control. The ROCKET model is one of the fastest state-of-the-art models; however, it can be sped up by utilizing a GPU. Our contribution is to implement the model in PyTorch for fast GPU kernel transforms and provide a soft version of the Proportion of Positive Values (PPV) nonlinear pooling function, allowing gradient flow. We perform timing and performance experiments comparing the implementations

artificial intelligence, experiment, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2301.08527

Country:

Europe > United Kingdom > Wales > Swansea (0.04)
Africa > Middle East > Morocco > Rabat-Salé-Kénitra Region > Rabat (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Materials > Metals & Mining > Steel (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)
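
The steel-texture entry above mentions a soft version of the Proportion of Positive Values (PPV) pooling function. The abstract does not say how the relaxation is defined, so the sketch below assumes one common choice: replacing the hard "is positive" test with a sigmoid so the output varies smoothly with its inputs. The function names and the `sharpness` parameter are hypothetical, and the code is plain Python rather than the paper's PyTorch implementation.

```python
import math


def ppv(values):
    """Hard PPV: the fraction of entries that are strictly positive."""
    return sum(1 for v in values if v > 0) / len(values)


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_ppv(values, sharpness=10.0):
    """Assumed relaxation: average sigmoid(sharpness * v) instead of the
    step function, so small input changes give small output changes."""
    return sum(_sigmoid(sharpness * v) for v in values) / len(values)


# Assumed toy convolution output.
feature_map = [1.0, -1.0, 2.0, -3.0]
hard = ppv(feature_map)
soft = soft_ppv(feature_map, sharpness=50.0)
```

A larger `sharpness` brings the soft value closer to the hard PPV away from zero, at the cost of steeper gradients near zero.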


Tensorflow Plugin - Metal - Apple Developer

#artificialintelligence, Oct-19-2022, 21:54:37 GMT

Error: "Could not find a version that satisfies the requirement tensorflow-macos (from versions: none)." A tensorflow installation wheel that matches the current Python environment couldn't be found by the package manager. Check that the Python version used in the environment is supported (Python 3.8, Python 3.9, Python 3.10). Complex data type isn't supported by tensorflow-metal. Error: "Cannot assign a device for operation: Could not satisfy explicit device specification because the node was colocated with a group of nodes that required incompatible device."

computation, error, python 3, (14 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)


GPU Accelerated Voxel Grid Generation for Fast MAV Exploration

Toumieh, Charbel, Lambert, Alain

arXiv.org Artificial Intelligence, Aug-15-2022

Voxel grids are a minimal and efficient environment representation that is used for robot motion planning in numerous tasks. Many state-of-the-art planning algorithms use voxel grids composed of free, occupied and unknown voxels. In this paper we propose a new GPU-accelerated algorithm for partitioning the space into a voxel grid with occupied, free and unknown voxels. The proposed approach is low latency and suitable for high-speed navigation. Many sensors (RGB-D cameras, stereo-matching...) output dense point clouds as measurements, which need to be processed and turned into an environment model/representation for motion planning.

artificial intelligence, grid, voxel, (17 more...)

arXiv.org Artificial Intelligence

2112.13169

Genre: Research Report (0.82)

Industry: Energy > Oil & Gas > Upstream (0.30)

Technology: Information Technology > Artificial Intelligence > Robots > Robot Planning & Action (0.54)
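
The voxel-grid entry above partitions space into occupied, free and unknown voxels from a point cloud. The sketch below is a serial, CPU-only illustration of that labelling, not the paper's GPU algorithm. It assumes free space is found by sampling along the ray from the sensor to each measured point at half-voxel steps; the function names and the sparse dictionary layout are hypothetical.

```python
import math

UNKNOWN, FREE, OCCUPIED = 0, 1, 2


def voxel_index(point, origin, voxel_size):
    """Integer voxel coordinates of a point relative to the grid origin."""
    return tuple(math.floor((p - o) / voxel_size) for p, o in zip(point, origin))


def build_voxel_grid(points, sensor, origin, voxel_size, dims):
    """Return a sparse grid {index: label}; missing indices are UNKNOWN."""
    grid = {}

    def in_bounds(idx):
        return all(0 <= i < d for i, d in zip(idx, dims))

    for point in points:
        hit = voxel_index(point, origin, voxel_size)
        # Assumed ray model: sample the segment sensor -> point every half voxel.
        steps = int(math.dist(sensor, point) / (0.5 * voxel_size))
        for s in range(steps):
            t = s / steps
            sample = [a + t * (b - a) for a, b in zip(sensor, point)]
            idx = voxel_index(sample, origin, voxel_size)
            # Never downgrade a voxel that another measurement marked occupied.
            if in_bounds(idx) and idx != hit and grid.get(idx) != OCCUPIED:
                grid[idx] = FREE
        if in_bounds(hit):
            grid[hit] = OCCUPIED
    return grid


# Assumed toy scene: a 5 x 1 x 1 grid, one measured point three voxels away.
grid = build_voxel_grid(
    points=[(3.5, 0.5, 0.5)],
    sensor=(0.5, 0.5, 0.5),
    origin=(0.0, 0.0, 0.0),
    voxel_size=1.0,
    dims=(5, 1, 1),
)
```

Each measured point is handled independently apart from the shared grid writes, which is the kind of per-ray workload that lends itself to one thread per point on a GPU.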


Faiss: A library for efficient similarity search - Facebook Engineering

#artificialintelligence, May-21-2021, 10:10:30 GMT

This month, we released Facebook AI Similarity Search (Faiss), a library that allows us to quickly search for multimedia documents that are similar to each other -- a challenge where traditional query search engines fall short. We've built nearest-neighbor search implementations for billion-scale data sets that are some 8.5x faster than the previously reported state-of-the-art, along with the fastest k-selection algorithm on the GPU known in the literature. This lets us break some records, including the first k-nearest-neighbor graph constructed on 1 billion high-dimensional vectors. Traditional databases are made up of structured tables containing symbolic information. For example, an image collection would be represented as a table with one row per indexed photo.
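
To make the terms in the Faiss excerpt above concrete, the sketch below shows exact brute-force k-nearest-neighbor search in pure Python: compute the distance from the query to every stored vector, then keep the k smallest (the k-selection step). It does not use Faiss or reflect its API or index structures, and the function name `k_nearest` is hypothetical.

```python
import heapq
import math


def k_nearest(query, vectors, k):
    """Return the indices of the k stored vectors closest to the query,
    ordered from nearest to farthest (Euclidean distance)."""
    return heapq.nsmallest(
        k,
        range(len(vectors)),
        key=lambda i: math.dist(query, vectors[i]),
    )


# Assumed toy collection of 2-D vectors.
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.2, 0.1]]
neighbors = k_nearest([0.0, 0.0], vectors, k=2)
```

This scan costs one distance computation per stored vector per query, which is the baseline that approximate and GPU-accelerated indexes aim to beat at billion scale.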
