AITopics | cuda

cuda

CUDA Proves Nvidia Is a Software Company

WIRED | May-11-2026, 10:00:00 GMT

There's a deep, forbidding moat that surrounds Nvidia--and it has nothing to do with hardware. Forgive me for starting with a cliché, a piece of finance jargon that has recently slipped into the tech lexicon, but I'm afraid I must talk about "moats." Popularized decades ago by Warren Buffett to refer to a company's competitive advantage, the word found its way into Silicon Valley pitch decks when a memo purportedly leaked from Google, titled "We Have No Moat, and Neither Does OpenAI," fretted that open-source AI would pillage Big Tech's castle. A few years on, the castle walls remain safe. Apart from a brief bout of panic when DeepSeek first appeared, open-source AI models have not vastly outperformed proprietary models.

large language model, machine learning, programming language, (17 more...)

WIRED

Country: North America > United States > California (0.49)

Industry:

Information Technology > Software (0.79)
Information Technology > Hardware (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.71)
Information Technology > Software > Programming Languages (0.71)

b6edb87876bec4ac2260bffa083cb992-Paper-Conference.pdf

Neural Information Processing Systems | Feb-17-2026, 16:34:07 GMT

large language model, machine learning, translation, (22 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
North America > United States > Iowa > Story County > Ames (0.04)
North America > United States > California > Santa Clara County > San Jose (0.04)
North America > United States > California > Santa Clara County > Mountain View (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.67)

Industry: Information Technology (0.46)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
(3 more...)

PyLO: Towards Accessible Learned Optimizers in PyTorch

Janson, Paul, Therien, Benjamin, Anthony, Quentin, Huang, Xiaolong, Moudgil, Abhinav, Belilovsky, Eugene

arXiv.org Artificial Intelligence | Nov-11-2025

Learned optimizers have been an active research topic over the past decade, with increasing progress toward practical, general-purpose optimizers that can serve as drop-in replacements for widely used methods like Adam. However, recent advances -- such as VeLO, which was meta-trained for 4000 TPU-months -- remain largely inaccessible to the broader community, in part due to their reliance on JAX and the absence of user-friendly packages for applying the optimizers after meta-training. To address this gap, we introduce PyLO, a PyTorch-based library that brings learned optimizers to the broader machine learning community through familiar, widely adopted workflows. Unlike prior work focused on synthetic or convex tasks, our emphasis is on applying learned optimization to real-world large-scale pre-training tasks. Our release includes a CUDA-accelerated version of the small_fc_lopt learned optimizer architecture from (Metz et al., 2022a), delivering substantial speedups -- from 39.36 to 205.59 samples/sec throughput for training ViT B/16 with batch size 32. PyLO also allows us to easily combine learned optimizers with existing optimization tools such as learning rate schedules and weight decay. When doing so, we find that learned optimizers can substantially benefit. Our code is available at https://github.com/Belilovsky-Lab/pylo
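The idea of combining a learned update rule with a learning rate schedule and weight decay can be sketched in plain Python. This is only an illustration under stated assumptions: it does not use PyLO's API, and `toy_update_rule`, `cosine_lr`, and `minimize_quadratic` are hypothetical names. The fixed blend of gradient and momentum stands in for what a meta-trained network would compute; it is not the small_fc_lopt architecture.

```python
import math


def cosine_lr(step, total_steps, base_lr):
    """Cosine decay from base_lr at step 0 down to 0 at total_steps."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))


def toy_update_rule(grad, momentum):
    """Hypothetical stand-in for a learned update rule (NOT small_fc_lopt).

    A learned optimizer would compute this with a meta-trained network;
    a fixed blend of gradient and momentum plays that role here.
    """
    return 0.1 * grad + 0.9 * momentum


def minimize_quadratic(x0=1.0, total_steps=50, base_lr=0.1,
                       weight_decay=0.01, beta=0.9):
    """Minimize f(x) = x^2 with the toy rule, a schedule, and weight decay."""
    x, momentum = x0, 0.0
    for step in range(total_steps):
        grad = 2.0 * x                                # df/dx for f(x) = x^2
        momentum = beta * momentum + (1.0 - beta) * grad
        lr = cosine_lr(step, total_steps, base_lr)    # schedule scales the step
        x -= lr * toy_update_rule(grad, momentum)     # stand-in "learned" update
        x -= lr * weight_decay * x                    # decoupled weight decay
    return x
```

Calling `minimize_quadratic()` drives `x` from 1.0 toward 0. The schedule and the weight decay term wrap around the update rule without needing to know how the rule was produced, which is why the same wrapping could in principle be applied to a learned rule.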

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2506.10315

Country:

North America > United States (0.28)
Europe > Austria (0.28)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

b6edb87876bec4ac2260bffa083cb992-Paper-Conference.pdf

Neural Information Processing Systems | Oct-11-2025, 00:37:28 GMT

blockidx, coderosetta, translation, (15 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
North America > United States > Iowa > Story County > Ames (0.04)
North America > United States > California > Santa Clara County > San Jose (0.04)
North America > United States > California > Santa Clara County > Mountain View (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.67)

Industry: Information Technology (0.46)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
(3 more...)

Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization

Lange, Robert Tjarko, Sun, Qi, Prasad, Aaditya, Faldor, Maxence, Tang, Yujin, Ha, David

arXiv.org Artificial Intelligence | Sep-19-2025

Recent advances in large language models (LLMs) demonstrate their effectiveness in scaling test-time compute for software engineering tasks. However, these approaches often focus on high-level solutions, with limited attention to optimizing low-level CUDA kernel implementations. Additionally, existing kernel generation benchmarks suffer from exploitable loopholes and insufficient diversity in testing conditions, hindering true generalization assessment. To address these limitations, we introduce robust-kbench, a new benchmark for rigorous evaluation of kernel performance and correctness across varied scenarios. Furthermore, we present a comprehensive agentic framework that automates CUDA kernel discovery, verification, and optimization. This pipeline enables frontier LLMs to translate torch code to CUDA kernels and iteratively improve their runtime within our robust evaluation setting. Our sequential workflow first translates PyTorch code into equivalent CUDA kernels. It then optimizes their runtime using a novel evolutionary meta-generation procedure tailored to the CUDA ecosystem, guided by LLM-based verifiers for correctness and efficient filtering. Evaluated on robust-kbench, our approach produces CUDA kernels outperforming torch implementations for practical applications, including forward and backward passes. It can fuse operations and deploy various runtime optimization strategies. The verifier workflow accurately classifies incorrect kernels, enhancing hardware verification efficiency.
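A minimal sketch of why varied test conditions matter when verifying a generated kernel, written in plain Python rather than CUDA. It is not robust-kbench or the paper's verifier: `reference`, `fused_candidate`, `cheating_candidate`, `make_cases`, `verify`, and `median_runtime` are all hypothetical names, and a sum-of-ReLU function stands in for a PyTorch operation.

```python
import math
import random
import statistics
import time


def reference(xs):
    """Reference op standing in for a PyTorch module: sum of ReLU(x)."""
    return sum(max(x, 0.0) for x in xs)


def fused_candidate(xs):
    """Candidate that fuses the ReLU and the reduction into one loop."""
    total = 0.0
    for x in xs:
        if x > 0.0:
            total += x
    return total


def cheating_candidate(xs):
    """Candidate that is only right when no input is positive."""
    return 0.0


def make_cases(seed=0):
    """Edge cases plus random inputs of several sizes."""
    rng = random.Random(seed)
    cases = [[], [0.0], [-1.0, -2.0]]
    for n in (1, 7, 64, 1000):
        cases.append([rng.uniform(-5.0, 5.0) for _ in range(n)])
    return cases


def verify(candidate, cases, tol=1e-9):
    """True only if the candidate matches the reference on every case."""
    return all(
        math.isclose(candidate(c), reference(c), rel_tol=tol, abs_tol=tol)
        for c in cases
    )


def median_runtime(fn, xs, repeats=5):
    """Median wall-clock time of fn(xs) over several repeats, in seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(xs)
        times.append(time.perf_counter() - start)
    return statistics.median(times)
```

With a single narrow case such as `[[-1.0, -2.0]]`, `cheating_candidate` passes `verify`; with `make_cases()` it fails, while `fused_candidate` passes both. Runtime is only worth comparing, for example with `median_runtime`, once a candidate has passed the correctness check.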

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2509.14279

Genre:

Research Report (0.81)
Workflow (0.54)

Industry: Information Technology > Hardware (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

CelloAI: Leveraging Large Language Models for HPC Software Development in High Energy Physics

Atif, Mohammad, Chopra, Kriti, Kilic, Ozgur, Wang, Tianle, Dong, Zhihua, Leggett, Charles, Lin, Meifeng, Calafiura, Paolo, Habib, Salman

arXiv.org Artificial Intelligence | Aug-26-2025

Next-generation High Energy Physics (HEP) experiments will generate unprecedented data volumes, necessitating High Performance Computing (HPC) integration alongside traditional high-throughput computing. However, HPC adoption in HEP is hindered by the challenge of porting legacy software to heterogeneous architectures and the sparse documentation of these complex scientific codebases. We present CelloAI, a locally hosted coding assistant that leverages Large Language Models (LLMs) with retrieval-augmented generation (RAG) to support HEP code documentation and generation. This local deployment ensures data privacy, eliminates recurring costs and provides access to large context windows without external dependencies. CelloAI addresses two primary use cases, code documentation and code generation, through specialized components. For code documentation, the assistant provides: (a) Doxygen style comment generation for all functions and classes by retrieving relevant information from RAG sources (papers, posters, presentations), (b) file-level summary generation, and (c) an interactive chatbot for code comprehension queries. For code generation, CelloAI employs syntax-aware chunking strategies that preserve syntactic boundaries during embedding, improving retrieval accuracy in large codebases. The system integrates callgraph knowledge to maintain dependency awareness during code modifications and provides AI-generated suggestions for performance optimization and accurate refactoring. We evaluate CelloAI using real-world HEP applications from ATLAS, CMS, and DUNE experiments, comparing different embedding models for code retrieval effectiveness. Our results demonstrate the AI assistant's capability to enhance code understanding and support reliable code generation while maintaining the transparency and safety requirements essential for scientific computing environments.
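Syntax-aware chunking can be illustrated with Python's standard `ast` module: instead of cutting a file every N characters, each chunk is one complete top-level definition. This is a sketch under assumptions, not CelloAI's implementation; `chunk_by_definition` is a hypothetical name, and Python source is used as a stand-in (a C++ codebase would need a C++-aware parser instead).

```python
import ast


def chunk_by_definition(source):
    """Split Python source into one chunk per top-level function or class.

    Returns a list of (name, source_text) pairs, so every chunk starts and
    ends on a syntactic boundary and can be embedded as a complete unit.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks


SAMPLE = '''
def add(a, b):
    return a + b

class Hit:
    def energy(self):
        return 1.0
'''

chunks = chunk_by_definition(SAMPLE)
```

Here `chunks` holds two entries, named `add` and `Hit`, and each chunk's text parses on its own. A fixed-size splitter offers no such guarantee, since it can cut a function body in half.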

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2508.16713

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.86)

Industry:

Information Technology > Security & Privacy (1.00)
Energy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

eACGM: Non-instrumented Performance Tracing and Anomaly Detection towards Machine Learning Systems

Xu, Ruilin, Xie, Zongxuan, Chen, Pengfei

arXiv.org Artificial Intelligence | Jul-2-2025

We present eACGM, a full-stack AI/ML system monitoring framework based on eBPF. Additionally, it leverages libnvml to gather process-level GPU resource usage information. By applying a Gaussian Mixture Model (GMM) to the collected multidimensional performance metrics for statistical modeling and clustering analysis, eACGM effectively identifies complex failure modes, such as latency anomalies, hardware failures, and communication inefficiencies, enabling rapid diagnosis of system bottlenecks and abnormal behaviors. To evaluate eACGM's effectiveness and practicality, we conducted extensive empirical studies and case analyses in multi-node distributed training scenarios. The results demonstrate that eACGM, while maintaining a non-intrusive and low-overhead profile, successfully captures critical performance anomalies during model training and inference.
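One common way to use a Gaussian mixture for anomaly detection is to flag samples whose log-likelihood under the fitted mixture falls below a threshold. The sketch below shows that idea in one dimension. It is not eACGM's code: the mixture parameters are assumed to be already fitted, and the latency values, threshold, and function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def log_likelihood(x, weights, means, stds):
    """Log density of x under a 1-D Gaussian mixture."""
    density = sum(w * gaussian_pdf(x, m, s)
                  for w, m, s in zip(weights, means, stds))
    return math.log(density) if density > 0.0 else float("-inf")


def flag_anomalies(samples, weights, means, stds, threshold):
    """Return the samples whose log-likelihood falls below the threshold."""
    return [x for x in samples
            if log_likelihood(x, weights, means, stds) < threshold]


# Hypothetical, pre-fitted mixture over kernel latencies in milliseconds:
# a fast mode around 10 ms and a slow mode around 50 ms.
WEIGHTS, MEANS, STDS = (0.5, 0.5), (10.0, 50.0), (1.0, 5.0)
latencies = [10.2, 52.0, 30.0, 9.5, 80.0]
anomalies = flag_anomalies(latencies, WEIGHTS, MEANS, STDS, threshold=-8.0)
```

With these assumed parameters, 30.0 and 80.0 are flagged because they sit far from both modes, while values near either mode are kept. A real system would fit the mixture to multidimensional metrics rather than a single latency column.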

accessed, data mining, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2506.02007

Country: Asia > China (0.14)

Genre: Research Report (0.84)

Industry: Information Technology (0.33)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.32)

Concept-Based Unsupervised Domain Adaptation

Xu, Xinyue, Hu, Yueying, Tang, Hui, Qin, Yi, Mi, Lu, Wang, Hao, Li, Xiaomeng

arXiv.org Artificial Intelligence | May-9-2025

Concept Bottleneck Models (CBMs) enhance interpretability by explaining predictions through human-understandable concepts but typically assume that training and test data share the same distribution. This assumption often fails under domain shifts, leading to degraded performance and poor generalization. To address these limitations and improve the robustness of CBMs, we propose the Concept-based Unsupervised Domain Adaptation (CUDA) framework. CUDA is designed to: (1) align concept representations across domains using adversarial training, (2) introduce a relaxation threshold to allow minor domain-specific differences in concept distributions, thereby preventing performance drop due to over-constraints of these distributions, (3) infer concepts directly in the target domain without requiring labeled concept data, enabling CBMs to adapt to diverse domains, and (4) integrate concept learning into conventional domain adaptation (DA) with theoretical guarantees, improving interpretability and establishing new benchmarks for DA. Experiments demonstrate that our approach significantly outperforms the state-of-the-art CBM and DA methods on real-world datasets.
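One possible reading of a "relaxation threshold" is a hinge on the cross-domain concept gap: differences smaller than the threshold are not penalized. The sketch below illustrates that reading only. It swaps the paper's adversarial alignment for a simple difference of mean concept activations, and `concept_means`, `relaxed_alignment_penalty`, and `tau` are hypothetical names, not the paper's formulation.

```python
def concept_means(activations):
    """Mean activation per concept; rows are samples, columns are concepts."""
    n = len(activations)
    return [sum(row[j] for row in activations) / n
            for j in range(len(activations[0]))]


def relaxed_alignment_penalty(source, target, tau):
    """Sum over concepts of the source/target gap in excess of tau.

    Gaps at or below tau cost nothing, so small domain-specific
    differences in concept distributions are tolerated.
    """
    gaps = [abs(s - t)
            for s, t in zip(concept_means(source), concept_means(target))]
    return sum(max(0.0, gap - tau) for gap in gaps)


source = [[1.0, 0.0], [1.0, 1.0]]   # concept means: [1.0, 0.5]
target = [[0.9, 0.0], [0.9, 0.0]]   # concept means: [0.9, 0.0]
```

With `tau=0.2`, only the second concept (gap 0.5) contributes to the penalty. With `tau=0.0`, both gaps are penalized in full, which corresponds to the over-constrained case the abstract warns about.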

artificial intelligence, machine learning, target domain, (17 more...)

arXiv.org Artificial Intelligence

2505.05195

Country:

Asia > Middle East > Jordan (0.04)
North America > Canada (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.63)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

The Thinking Machine: Jensen Huang, Nvidia and the World's Most Coveted microchip – review

The Guardian | Apr-20-2025, 14:00:09 GMT

This is the latest confirmation that the "great man" theory of history continues to thrive in Silicon Valley. As such, it joins a genre that includes Walter Isaacson's twin tomes on Steve Jobs and Elon Musk, Brad Stone's book on Jeff Bezos, Michael Becraft's on Bill Gates, Max Chafkin's on Peter Thiel and Michael Lewis's on Sam Bankman-Fried. Notable characteristics of the genre include a tendency towards founder worship, discreet hagiography and a Whiggish interpretation of the life under examination. The great man under Witt's microscope is the co-founder and chief executive of Nvidia, a chip design company that went from being a small but plucky purveyor of graphics processing units (GPUs) for computer gaming to its current position as the third most valuable company in the world. Two things drove this astonishing transition.

jensen huang, nvidia, thinking machine, (9 more...)

The Guardian

Country:

North America > United States > California (0.27)
North America > United States > Oregon (0.05)
North America > United States > Kentucky (0.05)
(2 more...)

Industry:

Semiconductors & Electronics (1.00)
Information Technology > Hardware (0.96)
Leisure & Entertainment > Games > Computer Games (0.36)

Technology:

Information Technology > Artificial Intelligence > Issues > Turing's Test (0.41)
Information Technology > Artificial Intelligence > Issues > Philosophy (0.41)

FlashRNN: Optimizing Traditional RNNs on Modern Hardware

Pöppel, Korbinian, Beck, Maximilian, Hochreiter, Sepp

arXiv.org Artificial Intelligence | Jan-13-2025

While Transformers and other sequence-parallelizable neural network architectures seem like the current state of the art in sequence modeling, they specifically lack state-tracking capabilities. These are important for time-series tasks and logical reasoning. Traditional RNNs like LSTMs and GRUs, as well as modern variants like sLSTM do have these capabilities at the cost of strictly sequential processing. While this is often seen as a strong limitation, we show how fast these networks can get with our hardware-optimization FlashRNN in Triton and CUDA, optimizing kernels to the register level on modern GPUs. We extend traditional RNNs with a parallelization variant that processes multiple RNNs of smaller hidden state in parallel, similar to the head-wise processing in Transformers. To enable flexibility on different GPU variants, we introduce a new optimization framework for hardware-internal cache sizes, memory and compute handling. It models the hardware in a setting using polyhedral-like constraints, including the notion of divisibility. This speeds up the solution process in our ConstrINT library for general integer constraint satisfaction problems (integer CSPs). We show that our kernels can achieve 50x speed-ups over a vanilla PyTorch implementation and allow 40x larger hidden sizes compared to our Triton implementation. Our open-source kernels and the optimization library are released here to boost research in the direction of state-tracking enabled RNNs and sequence modeling: https://github.com/NX-AI/flashrnn
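The flavour of an integer constraint problem with divisibility can be shown with a brute-force search over tile sizes. This is a toy sketch, not the ConstrINT library or FlashRNN's actual hardware model: the function names, the candidate tile sizes, and the memory budget are all hypothetical.

```python
def feasible_tilings(hidden, batch, bytes_per_value, budget_bytes,
                     candidates=(8, 16, 32, 64, 128)):
    """Enumerate (tile_hidden, tile_batch) pairs satisfying all constraints.

    Constraints: each tile size must divide its dimension exactly, and one
    tile of values must fit in the given memory budget.
    """
    tilings = []
    for tile_h in candidates:
        if hidden % tile_h != 0:          # divisibility constraint
            continue
        for tile_b in candidates:
            if batch % tile_b != 0:       # divisibility constraint
                continue
            if tile_h * tile_b * bytes_per_value <= budget_bytes:
                tilings.append((tile_h, tile_b))
    return tilings


def best_tiling(hidden, batch, bytes_per_value, budget_bytes):
    """Pick the feasible tiling with the largest tile, or None if none fit.

    Ties on tile area are broken in favour of the larger hidden tile.
    """
    return max(
        feasible_tilings(hidden, batch, bytes_per_value, budget_bytes),
        key=lambda t: (t[0] * t[1], t[0]),
        default=None,
    )
```

For example, `best_tiling(768, 48, 2, 4096)` returns `(128, 16)`: 128 divides 768, 16 divides 48, and 128 * 16 two-byte values fill the 4096-byte budget exactly. Exhaustive enumeration is only practical for a handful of variables; the abstract's point is that a dedicated solver handles such constraints more efficiently.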

dimension, implementation, kernel, (13 more...)

arXiv.org Artificial Intelligence

2412.07752

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Italy (0.04)
Europe > Finland (0.04)
(4 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
