CUDA Proves Nvidia Is a Software Company
There's a deep, forbidding moat that surrounds Nvidia--and it has nothing to do with hardware. Forgive me for starting with a cliché, a piece of finance jargon that has recently slipped into the tech lexicon, but I'm afraid I must talk about "moats." Popularized decades ago by Warren Buffett to refer to a company's competitive advantage, the word found its way into Silicon Valley pitch decks when a memo purportedly leaked from Google, titled "We Have No Moat, and Neither Does OpenAI," fretted that open-source AI would pillage Big Tech's castle. A few years on, the castle walls remain safe. Apart from a brief bout of panic when DeepSeek first appeared, open-source AI models have not vastly outperformed proprietary models.
- Information Technology > Software (0.79)
- Information Technology > Hardware (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.71)
- Information Technology > Software > Programming Languages (0.71)
Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels
Zelenin, Alexandra, Zhuravlyova, Alexandra
Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved. We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that avoids catastrophic cancellation in the near-unity rescaling regime where magnitude scales concentrate in practice. Across six 8-32B vision-language models (VLMs) on three NVIDIA GPUs (RTX 6000 PRO, H200, B200) at r = 384 in bf16, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT's DoRA implementation for inference and 1.5-1.9x faster for gradient computation (optimizer step excluded), with up to 7 GB lower peak VRAM. Microbenchmarks on six GPUs spanning four architecture generations (L40S, A100, RTX 6000 PRO, H200, B200, B300) confirm 1.5-2.7x compose-kernel speedup. Final-logit cosine similarity exceeds 0.9999 across all model/GPU pairs, and multi-seed training curves match within 7.1 x 10^-4 mean per-step loss delta over 2000 steps.
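The factored norm described in the abstract can be sketched in a few lines of NumPy (a schematic reconstruction from the abstract's description, not the authors' Triton implementation): the squared row-wise norm of W + sBA splits into a base term ||W_i||^2, a cross term 2s·W_i·(BA)_i computable through the [d_out, r] intermediate W·A^T, and a Gram term s^2·B_i(AA^T)B_i^T through the [r, r] matrix AA^T, so the dense [d_out, d_in] product BA is never formed.

```python
import numpy as np

def dora_row_norms_factored(W, B, A, s):
    """Row-wise norms of W + s*B@A without materializing the dense B@A.

    W: [d_out, d_in], B: [d_out, r], A: [r, d_in], s: scalar.
    All intermediates are O(d_out * r + r^2), per the abstract's claim.
    Illustrative sketch only, not the paper's fused-kernel implementation.
    """
    base = np.sum(W * W, axis=1)            # ||W_i||^2, no extra memory
    cross = np.sum((W @ A.T) * B, axis=1)   # W_i . (BA)_i via [d_out, r]
    G = A @ A.T                             # [r, r] Gram matrix of A
    gram = np.sum((B @ G) * B, axis=1)      # ||(BA)_i||^2 via [d_out, r]
    sq = base + 2.0 * s * cross + (s * s) * gram
    return np.sqrt(np.maximum(sq, 0.0))     # clamp tiny negative round-off
```

At d_in = 8192 and r = 384 the largest intermediate is [d_out, 384] rather than [d_out, 8192], which is where the abstract's memory saving comes from; the abstract's separate point about catastrophic cancellation concerns how these terms are combined in low precision, which this fp64 sketch does not address.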
AI is changing PC graphics. Microsoft wants DirectX ready
PCWorld reports that Microsoft is embedding AI into DirectX with two new tools, DirectX Linear Algebra and the DirectX Compute Graph Compiler, introduced into its DirectX programming interface on Thursday, with previews of each technology due later this year. Major chip makers AMD, Intel, and Nvidia support the initiative, which could allow integrated GPUs to compete with discrete graphics cards in gaming performance. The technologies enable dynamic shader creation, neural texture compression, and advanced upscaling, potentially democratizing high-end graphics features like path tracing across different hardware. As games are increasingly rendered with the help of AI, Microsoft is building AI into the way future graphics chips will render them.
- Leisure & Entertainment > Games > Computer Games (1.00)
- Information Technology > Security & Privacy (0.75)
- Information Technology > Hardware (1.00)
- Information Technology > Artificial Intelligence (1.00)
Orbital AI data centers could work, but they might ruin Earth in the process
A single collision could cause a cascading effect in orbit. Elon Musk's plan to launch millions of AI satellites could be disastrous for the planet. At the start of the month, Elon Musk announced that two of his companies -- SpaceX and xAI -- were merging, and would jointly launch a constellation of 1 million satellites to operate as orbital data centers. Musk's reputation might suggest otherwise, but according to experts, such a plan isn't a complete fantasy. However, some of them believe that, if executed at the scale suggested, it would have devastating effects on the environment and the sustainability of low Earth orbit.
- North America > United States > Pennsylvania (0.04)
- North America > United States > California (0.04)
- North America > Canada > British Columbia (0.04)
- Asia > China (0.04)
- Information Technology > Services (1.00)
- Government > Regional Government > North America Government > United States Government (0.95)
- Aerospace & Defense (0.75)
- Information Technology > Cloud Computing (1.00)
- Information Technology > Artificial Intelligence (1.00)
- Information Technology > Communications > Mobile (0.89)
Nvidia's Deal With Meta Signals a New Era in Computing Power
The days of tech giants buying up discrete chips are over. AI companies now need GPUs, CPUs, and everything in between. Ask anyone what Nvidia makes, and they're likely to first say "GPUs." For decades, the chipmaker has been defined by advanced parallel computing, and the emergence of generative AI and the resulting surge in demand for GPUs has been a boon for the company. But Nvidia's recent moves signal that it's looking to lock in more customers at the less compute-intensive end of the AI market--customers who don't necessarily need the beefiest, most powerful GPUs to train AI models, but instead are looking for the most efficient ways to run agentic AI software.
- North America > United States > California (0.15)
- Europe > Slovakia (0.05)
- Europe > Czechia (0.05)
- Asia > China (0.05)
Causes and Effects of Unanticipated Numerical Deviations in Neural Network Inference Frameworks
Hardware-specific optimizations in machine learning (ML) frameworks can cause numerical deviations of inference results. Quite surprisingly, despite using a fixed trained model and fixed input data, inference results are not consistent across platforms, and sometimes not even deterministic on the same platform. We study the causes of these numerical deviations for convolutional neural networks (CNN) on realistic end-to-end inference pipelines and in isolated experiments. Results from 75 distinct platforms suggest that the main causes of deviations on CPUs are differences in SIMD use, and the selection of convolution algorithms at runtime on GPUs. We link the causes and propagation effects to properties of the ML model and evaluate potential mitigations. We make our research code publicly available.
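The abstract attributes CPU-side deviations to differences in SIMD use. The underlying mechanism is that floating-point addition is not associative, so a vectorized reduction that accumulates in lanes or in a tree can produce a different result than strict serial accumulation over identical inputs. The snippet below is a toy illustration of that mechanism, not the paper's experimental setup:

```python
import numpy as np

# Float32 addition is not associative, so different reduction orders
# (strict left-to-right vs. NumPy's pairwise/tree summation, which mimics
# vectorized partial sums) may disagree for identical inputs.
rng = np.random.default_rng(42)
x = rng.standard_normal(100_000).astype(np.float32)

serial = np.float32(0.0)
for v in x:                       # strict left-to-right accumulation
    serial = np.float32(serial + v)

pairwise = x.sum()                # pairwise (tree-order) reduction

# Compare both against a higher-precision float64 reference.
reference = float(x.astype(np.float64).sum())
print(float(serial) - reference, float(pairwise) - reference)
```

Both results are "correct" up to rounding; the point, consistent with the paper's framing, is that neither order is canonical, so bit-identical outputs across platforms cannot be assumed even for a fixed model and fixed input.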
- North America > Canada > Ontario > Toronto (0.14)
- Europe > Austria > Tyrol > Innsbruck (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Birder: Communication-Efficient 1-bit Adaptive Optimizer for Practical Distributed DNN Training
Therefore, from a system-level perspective, the design ethos of a system-efficient communication-compression algorithm is twofold: compression and decompression must be computationally light and fast, and the compressed representation must remain friendly to efficient collective communication primitives.
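These two criteria (cheap compress/decompress, collective-friendly payloads) are what generic 1-bit sign compression with error feedback satisfies: the payload is a fixed-size bit vector plus one scalar, and both directions are a single pass over the tensor. The sketch below shows that generic pattern for illustration; it is not Birder's actual algorithm, which this excerpt does not specify.

```python
import numpy as np

def decompress_1bit(packed, scale, n):
    """Unpack bits to {-1, +1} and rescale; a single cheap pass."""
    signs = np.unpackbits(packed)[:n].astype(np.float32)
    return (2.0 * signs - 1.0) * scale      # {0,1} -> {-1,+1}, then scale

def compress_1bit(grad, residual):
    """Generic 1-bit sign compression with error feedback (illustrative;
    not Birder's scheme). Returns packed bits, a scalar scale, and the
    new residual to carry into the next step."""
    corrected = grad + residual             # re-inject past compression loss
    scale = np.mean(np.abs(corrected))      # one fp scalar per tensor
    signs = corrected >= 0                  # 1 bit per element (~32x vs fp32)
    packed = np.packbits(signs)
    new_residual = corrected - decompress_1bit(packed, scale, grad.size)
    return packed, scale, new_residual
```

Because every worker's payload has the same fixed size (bit vector plus scalar), this shape of message maps cleanly onto collective primitives such as allgather, which is the "friendly to efficient collective communication primitives" property the excerpt calls for.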
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > Louisiana (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.47)