shard
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > India (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
Adaptive Machine Unlearning
However, for sequences of deletions, most prior work in the non-convex setting gives valid guarantees only for sequences that are chosen independently of the models that are published. If people choose to delete their data as a function of the published models (because they don't like what the models reveal about them, for example), then the update sequence is adaptive.
- Asia > China > Zhejiang Province > Hangzhou (0.05)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- Information Technology > Security & Privacy (1.00)
- Law (0.68)
Same model, better performance: the impact of shuffling on DNA Language Models benchmarking
Large Language Models are increasingly popular in genomics due to their potential to decode complex biological sequences. Hence, researchers require a standardized benchmark to evaluate the capabilities of DNA Language Models (DNA LMs). However, evaluating DNA LMs is a complex task that intersects the domain-specific challenges of genomics and machine learning methodologies, where seemingly minor implementation details can significantly compromise benchmark validity. We demonstrate this through BEND (Benchmarking DNA Language Models), where hardware-dependent hyperparameters -- number of data loading workers and buffer sizes -- create spurious performance variations of up to 4% for identical models. The problem stems from inadequate data shuffling interacting with domain-specific data characteristics. Experiments with three DNA language models (HyenaDNA, DNABERT-2, ResNet-LM) show these artifacts affect both absolute performance and relative model rankings. We propose a simple solution: pre-shuffling data before storage eliminates hardware dependencies while maintaining efficiency. This work highlights how standard ML practices can interact unexpectedly with domain-specific data characteristics, with broader implications for benchmark design in specialized domains.
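The shuffling issue described in the abstract above can be illustrated with a minimal sketch. It is not BEND's data loader: the `buffer_shuffle` helper, the dataset size, the two-group storage layout, and the buffer size are all assumptions chosen for illustration. The sketch shows that a bounded streaming shuffle buffer cannot mix records that are stored far apart, whereas shuffling once before storage removes the dependence on buffer size.

```python
import random
from typing import Iterable, Iterator, List


def buffer_shuffle(stream: Iterable[int], buffer_size: int,
                   rng: random.Random) -> Iterator[int]:
    """Streaming shuffle with a bounded buffer (hypothetical helper).

    Once the buffer holds `buffer_size` items, each newly read item causes
    one randomly chosen buffered item to be emitted.
    """
    buffer: List[int] = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            idx = rng.randrange(len(buffer))
            # Swap the chosen item to the end so that pop() is O(1).
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Flush whatever is left once the stream is exhausted.
    rng.shuffle(buffer)
    yield from buffer


def group_of(index: int, group_size: int = 500) -> int:
    """Assumed storage layout: records of the same group are contiguous."""
    return index // group_size


# Assumed toy dataset: 1000 records stored as two contiguous groups.
stored = list(range(1000))

# Case 1: rely only on a small shuffle buffer at load time.
streamed = list(buffer_shuffle(stored, buffer_size=16, rng=random.Random(0)))
first_batch_streamed = {group_of(i) for i in streamed[:100]}

# Case 2: shuffle once before storage, then read sequentially.
pre_shuffled = stored[:]
random.Random(0).shuffle(pre_shuffled)
first_batch_pre = {group_of(i) for i in pre_shuffled[:100]}

print("groups seen in first batch, buffer shuffle:", first_batch_streamed)
print("groups seen in first batch, pre-shuffled:  ", first_batch_pre)
```

In the buffered case, the item emitted at output position p was read no later than input position p + buffer_size - 1, so the first 100 outputs can only come from the first 115 stored records, all of which belong to one group. A larger buffer or a different number of loader workers changes this mixing, which is one plausible way a hardware-dependent setting could alter results for an identical model.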
Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design
Anthony, Quentin, Tokpanov, Yury, Szot, Skyler, Rajagopal, Srivatsan, Medepalli, Praneeth, Golubeva, Anna, Shyam, Vasu, Washbourne, Robert, Iyer, Rishi, Chaurasia, Ansh, Figliolia, Tomas, Yang, Xiao, Sarje, Abhinav, Thorstensen, Drew, Pearson, Amartey, Grossbart, Zack, van Patten, Jason, Barsoum, Emad, Gu, Zhenyu, Fu, Yao, Millidge, Beren
We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, utilizing both MI300X GPUs and Pollara networking. We distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts over Pollara. To our knowledge, this is the first at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-ignored utilities such as fault-tolerance and checkpoint-reshaping, as well as detailed information on our training recipe. We also provide a preview of our model architecture and base model - ZAYA1 (760M active, 8.3B total parameters MoE, available at https://huggingface.co/Zyphra/ZAYA1-base) - which will be further improved upon in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining.
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Communications > Networks (0.95)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.88)
Provenance-Driven Reliable Semantic Medical Image Vector Reconstruction via Lightweight Blockchain-Verified Latent Fingerprints
Rasheed, Mohsin, Al-Mamun, Abdullah
Medical imaging is essential for clinical diagnosis, yet real-world data frequently suffers from corruption, noise, and potential tampering, challenging the reliability of AI-assisted interpretation. Conventional reconstruction techniques prioritize pixel-level recovery and may produce visually plausible outputs while compromising anatomical fidelity, an issue that can directly impact clinical outcomes. We propose a semantic-aware medical image reconstruction framework that integrates high-level latent embeddings with a hybrid U-Net architecture to preserve clinically relevant structures during restoration. To ensure trust and accountability, we incorporate a lightweight blockchain-based provenance layer using scale-free graph design, enabling verifiable recording of each reconstruction event without imposing significant overhead. Extensive evaluation across multiple datasets and corruption types demonstrates improved structural consistency, restoration accuracy, and provenance integrity compared with existing approaches. By uniting semantic-guided reconstruction with secure traceability, our solution advances dependable AI for medical imaging, enhancing both diagnostic confidence and regulatory compliance in healthcare environments.
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- North America > Mexico > Gulf of Mexico (0.04)
- Research Report > New Finding (0.46)
- Research Report > Experimental Study (0.34)
FedShard: Federated Unlearning with Efficiency Fairness and Performance Fairness
Wen, Siyuan, Zhang, Meng, Yang, Yang, Ding, Ningning
To protect clients' right to be forgotten in federated learning, federated unlearning aims to remove the data contribution of leaving clients from the global learned model. While current studies have mainly focused on enhancing unlearning efficiency and effectiveness, the crucial aspects of efficiency fairness and performance fairness among decentralized clients during unlearning have remained largely unexplored. In this study, we introduce FedShard, the first federated unlearning algorithm designed to concurrently guarantee both efficiency fairness and performance fairness. FedShard adaptively addresses the challenges introduced by dilemmas among convergence, unlearning efficiency, and unlearning fairness. Furthermore, we propose two novel metrics to quantitatively assess the fairness of unlearning algorithms, which we prove to satisfy well-known properties in other existing fairness measurements. Our theoretical analysis and numerical evaluation validate FedShard's fairness in terms of both unlearning performance and efficiency. We demonstrate that FedShard mitigates unfairness risks such as cascaded leaving and poisoning attacks and realizes more balanced unlearning costs among clients. Experimental results indicate that FedShard accelerates the data unlearning process by 1.3-6.2 times compared with retraining from scratch and by 4.9 times compared with state-of-the-art exact unlearning methods.
- Asia > China > Hong Kong (0.04)
- North America > United States > California (0.04)
- Europe > United Kingdom > England > Surrey > Guildford (0.04)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs
Trifan, Octavian Alexandru, Sangaiah, Karthik, Awad, Muhammad, Osama, Muhammad, Gudaparthi, Sumanth, Nicolau, Alexandru, Veidenbaum, Alexander, Dasika, Ganesh
As large language models (LLMs) continue to scale, their workloads increasingly rely on distributed execution across multiple GPUs. However, the conventional bulk synchronous parallel (BSP) model used in such settings introduces significant performance inefficiencies. To characterize these bottlenecks, we introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. By exploiting libraries like Iris for Triton, we gain access to in-kernel communication primitives that enable the design of novel fine-grained programming patterns, offering greater flexibility and performance than traditional BSP-based approaches. These patterns systematically eliminate the three taxes by creating direct, tile-level producer-consumer pipelines and replacing global barriers with fine-grained dataflow synchronization. Applying this methodology to critical kernels, from the foundational All-Gather + general matrix multiplication operation to the complex Flash Decode algorithm, we observe a 10-20% speedup in end-to-end latency over BSP-based approaches, establishing a more programmable and efficient paradigm for distributed LLM workloads.
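The contrast between a global barrier and a tile-level producer-consumer pipeline can be sketched with a toy cost model. This is not the paper's analysis and does not use Iris or Triton: the function names, the assumption of one serial communication stream and one serial compute stream, and the per-tile costs are all hypothetical. The model also ignores kernel launch overhead and data locality.

```python
def bsp_time(num_tiles: int, comm_per_tile: float,
             compute_per_tile: float) -> float:
    """Toy BSP model: all tiles are communicated, a global barrier is
    reached, and only then does compute start on any tile."""
    return num_tiles * comm_per_tile + num_tiles * compute_per_tile


def pipelined_time(num_tiles: int, comm_per_tile: float,
                   compute_per_tile: float) -> float:
    """Toy dataflow model: compute on a tile starts as soon as that tile
    has arrived and the previous tile's compute has finished."""
    finish = 0.0
    for tile in range(num_tiles):
        arrival = (tile + 1) * comm_per_tile  # tiles arrive one after another
        start = max(arrival, finish)          # wait only for this tile's data
        finish = start + compute_per_tile
    return finish


# Illustrative (made-up) per-tile costs in arbitrary time units.
tiles, comm, compute = 8, 1.0, 2.0
print("BSP:      ", bsp_time(tiles, comm, compute))
print("Pipelined:", pipelined_time(tiles, comm, compute))
```

Under these assumptions the pipelined schedule hides most of the communication behind compute, so its total time approaches the cost of the slower stage alone. The gap between the two numbers here is an artifact of the chosen costs and says nothing about the speedups reported in the abstract.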
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > California > Orange County > Irvine (0.04)
- Research Report (0.64)
- Workflow (0.46)
Plexus: Taming Billion-edge Graphs with 3D Parallel Full-graph GNN Training
Ranjan, Aditya K., Singh, Siddharth, Wei, Cunyang, Bhatele, Abhinav
Graph neural networks (GNNs) leverage the connectivity and structure of real-world graphs to learn intricate properties and relationships between nodes. Many real-world graphs exceed the memory capacity of a GPU due to their sheer size, and training GNNs on such graphs requires techniques such as mini-batch sampling to scale. The alternative approach of distributed full-graph training suffers from high communication overheads and load imbalance due to the irregular structure of graphs. We propose a three-dimensional (3D) parallel approach for full-graph training that tackles these issues and scales to billion-edge graphs. In addition, we introduce optimizations such as a double permutation scheme for load balancing, and a performance model to predict the optimal 3D configuration of our parallel implementation -- Plexus. We evaluate Plexus on six different graph datasets and show scaling results on up to 2048 GPUs of Perlmutter, and 1024 GPUs of Frontier. Plexus achieves unprecedented speedups of 2.3-12.5x over prior state of the art, and a reduction in time-to-solution by 5.2-8.7x on Perlmutter and 7.0-54.2x on Frontier.
- North America > United States > Maryland > Prince George's County > College Park (0.14)
- North America > United States > Missouri > St. Louis County > St. Louis (0.05)
- North America > United States > New York > New York County > New York City (0.05)
- Overview (0.67)
- Research Report (0.51)
- Energy (1.00)
- Government > Regional Government > North America Government > United States Government (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Architecture > Distributed Systems (0.93)
- Information Technology > Artificial Intelligence > Natural Language (0.93)
- Information Technology > Data Science > Data Mining (0.93)