 mma



Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

Gen Luo

Neural Information Processing Systems

Instead of using large neural networks to connect the image encoder and LLM, MMA adopts lightweight modules, i.e., adapters, to bridge the gap between LLMs and VL tasks, which also enables the joint optimization of the image and language


LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving

arXiv.org Artificial Intelligence

Quantization is a critical technique for accelerating LLM inference by reducing memory footprint and improving computational efficiency. Among various schemes, 4-bit weight and 8-bit activation quantization (W4A8) offers a strong balance between accuracy and performance. However, existing W4A8 GEMM kernels fall short in practice due to inefficient dequantization on CUDA Cores, which cannot keep pace with the high throughput of Tensor Cores. In this paper, we present LiquidGEMM, a hardware-efficient W4A8 GEMM kernel for efficient LLM serving. LiquidGEMM designs two key techniques: LiquidQuant, a hardware-efficient quantization method that enables fast, overflow-safe dequantization using just two arithmetic instructions per four elements; and an implicit fine-grained pipeline that fully overlaps weight loading, dequantization, and MMA across warp groups without software synchronization or redundant memory traffic. Experimental results show that LiquidGEMM achieves up to 2.90x speedup over state-of-the-art W4A8 kernels and up to 4.94x end-to-end system-level speedup. Compared to various quantized GEMM kernels in NVIDIA TensorRT-LLM, LiquidGEMM delivers 1.12-1.63x performance gains, and achieves up to 1.63x system-level speedup.
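To make the W4A8 idea concrete, here is a minimal pure-Python sketch of generic symmetric quantization: weights are rounded to signed 4-bit codes and packed two per byte, activations are rounded to 8-bit codes, and the dot product is accumulated in integers before rescaling. This is not LiquidQuant or the LiquidGEMM kernel; the function names, the per-vector scales, and the low-nibble-first packing are all assumptions chosen for illustration.

```python
def quantize_int4(weights):
    """Symmetric 4-bit quantization: returns (codes in [-8, 7], scale)."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 7.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def pack_int4(codes):
    """Pack pairs of signed 4-bit codes into bytes, low nibble first (assumed layout)."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes(((hi & 0xF) << 4) | (lo & 0xF)
                 for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_int4(packed, n):
    """Recover the first n signed 4-bit codes from packed bytes."""
    out = []
    for b in packed:
        for nib in (b & 0xF, b >> 4):
            out.append(nib - 16 if nib >= 8 else nib)  # sign-extend the nibble
    return out[:n]


def quantize_int8(acts):
    """Symmetric 8-bit quantization: returns (codes in [-128, 127], scale)."""
    max_abs = max(abs(a) for a in acts) or 1.0
    scale = max_abs / 127.0
    return [max(-128, min(127, round(a / scale))) for a in acts], scale


def w4a8_dot(packed_w, w_scale, acts):
    """Dot product with 4-bit weights and 8-bit activations, integer accumulation."""
    q_a, a_scale = quantize_int8(acts)
    q_w = unpack_int4(packed_w, len(acts))  # the "dequantization" step of a W4A8 kernel
    acc = sum(w * a for w, a in zip(q_w, q_a))
    return acc * w_scale * a_scale
```

In a real kernel the unpacking step is the part that competes with the matrix-multiply units for throughput, which is the bottleneck the abstract describes.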


FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

arXiv.org Artificial Intelligence

Sparse Matrix-matrix Multiplication (SpMM) and Sampled Dense-dense Matrix Multiplication (SDDMM) are important sparse operators in scientific computing and deep learning. Tensor Core Units (TCUs) enhance modern accelerators with superior computing power, which is promising to boost the performance of matrix operators to a higher level. However, due to the irregularity of unstructured sparse data, it is difficult to deliver practical speedups on TCUs. To this end, we propose FlashSparse, a novel approach to bridge the gap between sparse workloads and the TCU architecture. Specifically, FlashSparse minimizes the sparse granularity for SpMM and SDDMM on TCUs through a novel swap-and-transpose matrix multiplication strategy. Benefiting from the minimum sparse granularity, the computation redundancy is remarkably reduced while the computing power of TCUs is fully utilized. Besides, FlashSparse is equipped with a memory-efficient thread mapping strategy for coalesced data access and a sparse matrix storage format to save memory footprint. Extensive experimental results on H100 and RTX 4090 GPUs show that FlashSparse sets a new state-of-the-art for sparse matrix multiplications (geometric mean 5.5x speedup over DTC-SpMM and 3.22x speedup over RoDe).
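For readers unfamiliar with the two operators, the following is a reference sketch in plain Python over a CSR-format sparse matrix. It only defines what SpMM and SDDMM compute; it does not reflect FlashSparse's swap-and-transpose strategy, its thread mapping, or its storage format. The function names are hypothetical, and scaling the SDDMM result by the sparse matrix's stored values is one common convention, assumed here.

```python
def spmm_csr(indptr, indices, data, B):
    """SpMM: C = A @ B, with A sparse in CSR form and B a dense list of rows."""
    n_cols = len(B[0])
    C = [[0.0] * n_cols for _ in range(len(indptr) - 1)]
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            j, a = indices[k], data[k]
            for c in range(n_cols):
                C[i][c] += a * B[j][c]
    return C


def sddmm_csr(indptr, indices, data, X, Y):
    """SDDMM: A[i][j] * dot(X[i], Y[j]) computed only at A's nonzero positions.

    Returns the new values in the same CSR order as `data`.
    """
    out = []
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            dot = sum(x * y for x, y in zip(X[i], Y[j]))
            out.append(data[k] * dot)
    return out
```

Both loops touch only the stored nonzeros, which is why the irregular sparsity pattern, rather than the arithmetic, tends to determine performance.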


Online Relocating and Matching of Ride-Hailing Services: A Model-Based Modular Approach

arXiv.org Artificial Intelligence

This study proposes an innovative model-based modular approach (MMA) to dynamically optimize order matching and vehicle relocation in a ride-hailing platform. MMA utilizes a two-layer and modular modeling structure. The upper layer determines the spatial transfer patterns of vehicle flow within the system to maximize the total revenue of the current and future stages. With the guidance provided by the upper layer, the lower layer performs rapid vehicle-to-order matching and vehicle relocation. MMA is interpretable and equipped with a customized polynomial-time algorithm which, as an online order-matching and vehicle-relocation algorithm, can scale past thousands of vehicles. We theoretically prove that the proposed algorithm can achieve the global optimum in stylized networks, while numerical experiments based on both a toy network and a realistic dataset demonstrate that MMA achieves superior systematic performance compared to batch matching and reinforcement-learning-based methods. Moreover, its modular and lightweight modeling structure further enables it to achieve a high level of robustness against demand variation while maintaining a relatively low computational cost.


Abstract Interpretation in Formal Argumentation: with a Galois Connection for Abstract Dialectical Frameworks and May-Must Argumentation (First Report)

arXiv.org Artificial Intelligence

Labelling-based formal argumentation relies on labelling functions that typically assign each argument one of three labels: accepted, rejected, or undecided. While a classical labelling-based approach applies globally uniform conditions as to how an argument is to be labelled, the conditions can instead be determined locally per argument. Abstract dialectical frameworks (ADF) is a well-known argumentation formalism that belongs to this category, offering greater labelling flexibility. As the size of an argumentation increases in the numbers of arguments and argument-to-argument relations, however, it becomes increasingly costly to check whether a labelling function satisfies those local conditions, or even whether the conditions match the intention of those who specified them. Some compromise is thus required for reasoning about a larger argumentation. In this context, there is a more recently proposed formalism of may-must argumentation (MMA) that enforces still local but more abstract labelling conditions. We identify how the two formalisms link to each other in this work. We prove that there is a Galois connection between them, in which ADF is a concretisation of MMA and MMA is an abstraction of ADF. We explore the consequence of abstract interpretation at play in formal argumentation, demonstrating sound reasoning about the judgement of acceptability/rejectability in ADF from within MMA. As far as we are aware, there is little work in the literature that incorporates abstract interpretation into formal argumentation, and, in the stated context, this work is the first to demonstrate its use and relevance.


Combining MixMatch and Active Learning for Better Accuracy with Fewer Labels

arXiv.org Machine Learning

We propose using active-learning-based techniques to further improve the state-of-the-art semi-supervised learning MixMatch algorithm. We provide a thorough empirical evaluation of several active-learning and baseline methods, which demonstrates a significant improvement on the benchmark CIFAR-10, CIFAR-100, and SVHN datasets (as much as 1.5% in absolute accuracy). We also provide an empirical analysis of the cost trade-off between incrementally gathering more labeled versus unlabeled data. This analysis can be used to measure the relative value of labeled/unlabeled data at different points of the learning curve, where we find that although the incremental value of labeled data can be as much as 20x that of unlabeled, it quickly diminishes to less than 3x once more than 2,000 labeled examples are observed. Code can be found at https://github.com/google-research/mma.


Max-Margin Adversarial (MMA) Training: Direct Input Space Margin Maximization through Adversarial Training

arXiv.org Machine Learning

Despite their impressive performance on various learning tasks, neural networks have been shown to be vulnerable. An otherwise highly accurate network can be completely fooled by an artificially constructed perturbation imperceptible to human perception, known as the adversarial attack (Szegedy et al., 2013; Biggio et al., 2013). Not surprisingly, numerous algorithms for defending against adversarial attacks have already been proposed in the literature which, arguably, can be interpreted as different ways of increasing the margins, i.e. the smallest distance from the sample point to the decision boundary induced by the network. Obviously, adversarial robustness is equivalent to large margins. One type of algorithm uses regularization during learning to constrain the Lipschitz constant of the network (Cisse et al., 2017; Ross and Doshi-Velez, 2017; Hein and Andriushchenko, 2017; Tsuzuku et al., 2018), so that a sample point with small loss has a large margin, since the loss cannot increase too fast. If the Lipschitz constant is regularized on data points, it is usually too local and not accurate in a neighborhood; if it is controlled globally, the constraint on the model is often so strong that it harms accuracy. So far, such methods do not seem able to achieve very robust models. There are also efforts using first-order approximation to estimate and maximize input space margin (Elsayed et al., 2018; Sokolic et al., 2017; Matyasko and Chau, 2017). Similarly to local Lipschitz regularization, the reliance on local information might not provide accurate margin estimation and efficient maximization.
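As a small illustration of the margin defined above, consider the one case where it has a closed form: a linear binary classifier with decision boundary w·x + b = 0, where the L2 distance from x to the boundary is |w·x + b| / ||w||. The sketch below computes a signed version of that distance. It is only an assumption-laden toy: for a neural network the margin has no such closed form, and the function name and label convention (y in {-1, +1}) are hypothetical.

```python
import math


def linear_margin(w, b, x, y):
    """Signed L2 distance from x to the boundary w.x + b = 0.

    Positive when x is classified as label y (y in {-1, +1}), negative otherwise.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return y * score / math.sqrt(sum(wi * wi for wi in w))
```

A perturbation with L2 norm smaller than a positive margin cannot change this linear classifier's prediction, which is the sense in which larger margins correspond to greater robustness.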


Azure/aml-real-time-ai

#artificialintelligence

Easily create and train a model using ResNet 50 as a featurizer for deployment on Azure for ultra-low latency inferencing. Azure ML Hardware Accelerated Models is currently in preview. Go to the Azure Portal and create an Azure ML Model Management Account (MMA). Learn how to create an MMA. If you already have an existing S1, S2, or S3 account in the East US 2 location, you may skip this step.


The UFC's big bet to keep fighters fighting

Engadget

This article contains images of violence that may upset or offend. If you're curious about what the world's preeminent mixed martial arts competition is like but did not pay as much as several thousand dollars for Madison Square Garden seats last Saturday, here are some of the sights UFC 217 offered: Corey Anderson's head hits the canvas with a resounding whump. The culprit: a crashing left kick to the jaw from a now-strutting Ovince Saint Preux. M.O.P.'s "Ante Up" plays after the bout. "Elbow him in the face!" a man yells as Walt Harris of Alabama and Mark Godbeer of England face off. There's no decisive elbow, but the six-foot-five-inch, 250-pound Harris does plant a solid -- and illegal -- knee into Godbeer's groin. As the Brit backs off, Harris sends a kick to his face.