AITopics | Shah, Jay

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

Shah, Jay, Bikshandi, Ganesh, Zhang, Ying, Thakkar, Vijay, Ramani, Pradeep, Dao, Tri

arXiv.org Artificial Intelligence, Jul-12-2024

For the Transformer architecture [59], the attention mechanism constitutes the primary computational bottleneck, since computing the self-attention scores of queries and keys has quadratic scaling in the sequence length. Scaling attention to longer context will unlock new capabilities (modeling and reasoning over multiple long documents [24, 43, 50] and files in large codebases [30, 48]), new modalities (high-resolution images [11], audio [23], video [25]), and new applications (user interaction with long history [53], agent workflow with long horizon [62]). This has generated significant interest in making attention faster in the long-context regime, including by approximation [14, 27, 56] and software optimization ([17, 29, 45]), or even alternative architectures [22, 42, 55]. In this work, we build on the work of Dao et al. [17] on developing exact-attention algorithms that integrate knowledge of the GPU's execution model and hardware characteristics into their high-level design. In [17], Dao et al. introduced FlashAttention, a novel tiling strategy for parallelizing attention that eliminates intermediate reads/writes to slow global memory through fusing all of the attention operations into a single GPU kernel. Dao [15] restructured the algorithm as FlashAttention-2 to also parallelize over the sequence length dimension and perform the inner loop of the forward pass over blocks of the key and value matrices, thus improving the occupancy and distribution of work on the GPU. However, we observe that FlashAttention-2 nonetheless achieves poor utilization on newer GPUs relative to optimized matrix-multiplication (GEMM) kernels, such as 35% vs. 80-90% on the Hopper H100 GPU. Partially, this may be attributed to implementation-level differences, such as not using Hopper-specific instructions in place of Ampere ones when targeting the Tensor Cores. 
Several works, such as ThunderKittens [52] and cuDNN 9 [39], have shown that with Hopper-specific instructions and tile-based abstractions, one can speed up attention computation and simplify the implementation.
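The tiling idea described in this abstract can be illustrated with a small, CPU-only sketch. It is not the authors' kernel: it is a plain-Python illustration of exact attention computed over blocks of the key and value matrices with a running (online) softmax, so that no full row of scores has to be kept. The function names and the `block_size` parameter are hypothetical, and nothing here models GPU memory, Tensor Cores, or asynchrony.

```python
import math

def naive_attention(Q, K, V):
    """Reference: softmax(Q K^T / sqrt(d)) V, materializing every score row."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        row_max = max(scores)
        weights = [math.exp(s - row_max) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block_size=2):
    """Same result, but K and V are visited one block at a time."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        running_max = -math.inf      # largest score seen so far
        running_sum = 0.0            # softmax denominator, relative to running_max
        acc = [0.0] * len(V[0])      # unnormalized output row
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new maximum
            # (the factor is 0.0 on the first block, when running_max is -inf).
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
                   for j, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Because the rescaling is exact, `tiled_attention` should agree with `naive_attention` up to floating-point rounding for any block size; only the order of the arithmetic changes.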

artificial intelligence, machine learning, natural language, (18 more...)

2407.08608

Country: North America > United States (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)

A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library

Bikshandi, Ganesh, Shah, Jay

arXiv.org Artificial Intelligence, Dec-19-2023

We provide an optimized implementation of the forward pass of FlashAttention-2, a popular memory-aware scaled dot-product attention algorithm, as a custom fused CUDA kernel targeting the NVIDIA Hopper architecture and written using the open-source CUTLASS library. In doing so, we explain the challenges and techniques involved in fusing online-softmax with back-to-back GEMM kernels, utilizing the Hopper-specific Tensor Memory Accelerator (TMA) and Warpgroup Matrix-Multiply-Accumulate (WGMMA) instructions, defining and transforming CUTLASS Layouts and Tensors, overlapping copy and GEMM operations, and choosing optimal tile sizes for the Q, K and V attention matrices while balancing the register pressure and shared memory utilization. In head-to-head benchmarks on a single H100 PCIe GPU for some common choices of hyperparameters, we observe 20-50% higher FLOPs/s over a version of FlashAttention-2 optimized for the last-generation NVIDIA Ampere architecture.
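A FLOPs/s comparison like the one quoted above is typically derived from a FLOP count and a measured runtime. The sketch below assumes a common counting convention for the attention forward pass (two GEMMs per head, each costing 2 x seqlen^2 x head_dim FLOPs, halved under a causal mask); the abstract does not state which convention the authors used. The function names are hypothetical.

```python
def attention_forward_flops(batch, heads, seqlen, head_dim, causal=False):
    """FLOPs of the two GEMMs (Q K^T and P V) in one attention forward pass.

    Assumed convention: 4 * seqlen^2 * head_dim per head, halved when causal.
    Softmax and scaling work is ignored, as is usual in such counts.
    """
    flops = 4 * batch * heads * seqlen * seqlen * head_dim
    return flops // 2 if causal else flops

def tflops_per_second(flops, seconds):
    """Convert a FLOP count and a measured runtime into TFLOPs/s."""
    return flops / seconds / 1e12

def relative_speedup(baseline_seconds, new_seconds):
    """Fractional throughput gain of the new kernel over the baseline."""
    return baseline_seconds / new_seconds - 1.0
```

With this convention, a kernel that runs the same problem in two thirds of the baseline time reports 50% higher FLOPs/s, since the FLOP count is identical and only the runtime changes.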

large language model, machine learning, natural language, (18 more...)

2312.11918

Genre: Research Report (0.41)

Industry: Information Technology > Hardware (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)

Body-mounted MR-conditional Robot for Minimally Invasive Liver Intervention

Huang, Zhefeng, Gunderman, Anthony L., Wilcox, Samuel E., Sengupta, Saikat, Lu, Aiming, Woodrum, David, Shah, Jay, Chen, Yue

arXiv.org Artificial Intelligence, Oct-11-2023

MR-guided microwave ablation (MWA) has proven effective in treating hepatocellular carcinoma (HCC) with small-sized tumors, but the state-of-the-art technique suffers from a sub-optimal workflow due to the limited speed and accuracy of needle placement. This paper presents a compact body-mounted MR-conditional robot that can operate in closed-bore MR scanners for accurate needle guidance. The robotic platform consists of two stacked Cartesian XY stages, each with two degrees of freedom, that facilitate needle guidance. The robot is actuated using 3D-printed pneumatic turbines with MR-conditional bevel gear transmission systems. Pneumatic valves and control mechatronics are located inside the MRI control room and are connected to the robot with pneumatic transmission lines and optical fibers. Free-space experiments indicated a robot-assisted needle insertion error of 2.6 ± 1.3 mm at an insertion depth of 80 mm. MR-guided phantom studies were conducted to verify the MR-conditionality and targeting performance of the robot. Future work will focus on system optimization and validation in animal trials.
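To make the role of the two stacked XY stages concrete, the sketch below works through one possible geometry. It is an assumption, not the robot's published kinematics: the lower stage is taken to hold a needle guide at z = 0, the upper stage holds a second guide at z = `stage_gap_mm`, the needle is the straight line through both guides, and the target lies below the lower stage (negative z). All names and dimensions are hypothetical.

```python
import math

def upper_stage_for_target(target_xyz, lower_xy, stage_gap_mm):
    """Upper-stage XY position that aims the needle from lower_xy at the target."""
    tx, ty, tz = target_xyz
    if tz >= 0:
        raise ValueError("target must lie below the lower stage (z < 0)")
    # Extend the line target -> lower guide upward until it reaches the upper plane.
    ratio = stage_gap_mm / (-tz)
    return (lower_xy[0] + (lower_xy[0] - tx) * ratio,
            lower_xy[1] + (lower_xy[1] - ty) * ratio)

def needle_tip(lower_xy, upper_xy, stage_gap_mm, depth_mm):
    """Tip position after inserting depth_mm past the lower guide."""
    dx = upper_xy[0] - lower_xy[0]
    dy = upper_xy[1] - lower_xy[1]
    norm = math.sqrt(dx * dx + dy * dy + stage_gap_mm ** 2)
    # Unit vector from the upper guide toward the lower guide, i.e. into the body.
    ux, uy, uz = -dx / norm, -dy / norm, -stage_gap_mm / norm
    return (lower_xy[0] + depth_mm * ux, lower_xy[1] + depth_mm * uy, depth_mm * uz)

def placement_error_mm(tip_xyz, target_xyz):
    """Euclidean distance between the reached tip and the intended target."""
    return math.dist(tip_xyz, target_xyz)
```

Under these assumptions, each stage contributes two degrees of freedom: the lower stage sets where the needle passes through the lower plane, and the offset between the two stages sets its angle.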

artificial intelligence, body-mounted mr-conditional robot, minimally invasive liver intervention

2310.07822

Genre: Research Report (0.69)

Industry:

Health & Medicine > Therapeutic Area > Oncology (0.53)
Energy > Power Industry (0.53)

Technology: Information Technology > Artificial Intelligence > Robots (1.00)

Brainomaly: Unsupervised Neurologic Disease Detection Utilizing Unannotated T1-weighted Brain MR Images

Siddiquee, Md Mahfuzur Rahman, Shah, Jay, Wu, Teresa, Chong, Catherine, Schwedt, Todd J., Dumkrieger, Gina, Nikolova, Simona, Li, Baoxin

arXiv.org Artificial Intelligence, Aug-16-2023

Harnessing the power of deep neural networks in the medical imaging domain is challenging due to the difficulties in acquiring large annotated datasets, especially for rare diseases, which involve high costs, time, and effort for annotation. Unsupervised disease detection methods, such as anomaly detection, can significantly reduce human effort in these scenarios. While anomaly detection typically focuses on learning from images of healthy subjects only, real-world situations often present unannotated datasets with a mixture of healthy and diseased subjects. Recent studies have demonstrated that utilizing such unannotated images can improve unsupervised disease and anomaly detection. However, these methods do not utilize knowledge specific to registered neuroimages, resulting in subpar performance in neurologic disease detection. To address this limitation, we propose Brainomaly, a GAN-based image-to-image translation method specifically designed for neurologic disease detection. Brainomaly not only offers tailored image-to-image translation suitable for neuroimages but also leverages unannotated mixed images to achieve superior neurologic disease detection. Additionally, we address the issue of model selection for inference without annotated samples by proposing a pseudo-AUC metric, further enhancing Brainomaly's detection performance. Extensive experiments and ablation studies demonstrate that Brainomaly outperforms existing state-of-the-art unsupervised disease and anomaly detection methods by significant margins in Alzheimer's disease detection using a publicly available dataset and headache detection using an institutional dataset. The code is available from https://github.com/mahfuzmohammad/Brainomaly.
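The abstract does not define the pseudo-AUC metric, so the sketch below should not be read as the authors' formulation. It only shows how an ordinary AUC is computed from anomaly scores, and then assumes, purely for illustration, that the unannotated mixed set stands in for the positive class when comparing checkpoints. The function names and the checkpoint dictionary are hypothetical.

```python
def auc(negative_scores, positive_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation of AUC).
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

def select_checkpoint(scores_by_checkpoint):
    """Pick the checkpoint whose scores best separate the two sets.

    scores_by_checkpoint maps a checkpoint name to a pair
    (healthy_scores, mixed_scores). Treating the mixed set as the positive
    class is an assumption made for this sketch only.
    """
    return max(scores_by_checkpoint,
               key=lambda name: auc(*scores_by_checkpoint[name]))
```

A selection rule of this kind needs no disease labels, which is the property the abstract highlights; whether it matches the paper's metric would have to be checked against the paper or its repository.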

data mining, detection, machine learning, (18 more...)

2302.092

Country: North America > United States (1.00)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (0.77)
Government > Regional Government > North America Government > United States Government (0.68)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)