 attention block





Exploring Fusion Strategies for Multimodal Vision-Language Systems

Willis, Regan, Bakos, Jason

arXiv.org Artificial Intelligence

Modern machine learning models often combine multiple input streams of data to more accurately capture the information that informs their decisions. In multimodal machine learning, choosing the strategy for fusing data together requires careful consideration of the application's accuracy and latency requirements, as fusing the data at earlier or later stages in the model architecture can lead to performance changes in accuracy and latency. To demonstrate this trade-off, we investigate different fusion strategies using a hybrid BERT and vision network framework that integrates image and text data. We explore two different vision networks: MobileNetV2 and ViT. We propose three models for each vision network, which fuse data at late, intermediate, and early stages in the architecture. We evaluate the proposed models on the CMU-MOSI dataset and benchmark their latency on an NVIDIA Jetson Orin AGX. Our experimental results demonstrate that while late fusion yields the highest accuracy, early fusion offers the lowest inference latency. We describe the three proposed model architectures and discuss the accuracy and latency trade-offs, concluding that data fusion earlier in the model architecture results in faster inference times at the cost of accuracy.
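
To make the early-versus-late distinction concrete, here is a minimal sketch in plain Python. It is not the authors' architecture: the feature vectors, the weights, and the function names (`linear`, `early_fusion`, `late_fusion`) are hypothetical stand-ins for the BERT and vision-network outputs. It shows only where in the pipeline the two modalities are combined.

```python
# Hypothetical sketch of two fusion points; not the models from the paper.

def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def early_fusion(image_feat, text_feat, w_joint, b_joint):
    """Concatenate the modality features first, then run one shared head."""
    fused = image_feat + text_feat  # list concatenation
    return linear(w_joint, b_joint, fused)

def late_fusion(image_feat, text_feat, w_img, b_img, w_txt, b_txt):
    """Run a separate head per modality, then average the two score vectors."""
    img_scores = linear(w_img, b_img, image_feat)
    txt_scores = linear(w_txt, b_txt, text_feat)
    return [(a + b) / 2.0 for a, b in zip(img_scores, txt_scores)]

# Toy inputs (assumed): 2 image features, 3 text features, 2 output classes.
image_feat = [1.0, 2.0]
text_feat = [0.5, -1.0, 3.0]

w_img = [[1.0, 0.0], [0.0, 1.0]]
w_txt = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w_joint = [[1.0, 0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0, 1.0]]
zero_bias = [0.0, 0.0]

early_scores = early_fusion(image_feat, text_feat, w_joint, zero_bias)
late_scores = late_fusion(image_feat, text_feat, w_img, zero_bias, w_txt, zero_bias)
```

In this toy setup the early-fusion path evaluates a single head on the joint vector, while the late-fusion path evaluates one head per modality before merging. Averaging the scores is only one possible merge rule and is an assumption here.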


FAR: Function-preserving Attention Replacement for IMC-friendly Inference

Ren, Yuxin, Collins, Maxwell D, Hu, Miao, Yang, Huanrui

arXiv.org Artificial Intelligence

While transformers dominate modern vision and language models, their attention mechanism remains poorly suited for in-memory computing (IMC) devices due to intensive activation-to-activation multiplications and non-local memory access, leading to substantial latency and bandwidth overhead on ReRAM-based accelerators. To address this mismatch, we propose FAR, a Function-preserving Attention Replacement framework that substitutes all attention in pretrained DeiTs with sequential modules inherently compatible with IMC dataflows. Specifically, FAR replaces self-attention with a multi-head bidirectional LSTM architecture via block-wise distillation to retain functional equivalence while enabling linear-time computation and localized weight reuse. We further incorporate structured pruning on FAR models, enabling flexible adaptation to resource-constrained IMC arrays while maintaining functional fidelity. Evaluations on the DeiT family demonstrate that FAR maintains comparable accuracy to the original attention-based models on ImageNet and multiple downstream tasks with reduced parameters and latency. Further analysis shows that FAR preserves the semantic token relationships learned by attention while improving computational efficiency, highlighting its potential for energy-efficient transformer inference on IMC-based edge accelerators.


ODE-ViT: Plug & Play Attention Layer from the Generalization of the ViT as an Ordinary Differential Equation

Riera, Carlos Boned, Sanchez, David Romero, Terrades, Oriol Ramos

arXiv.org Artificial Intelligence

In recent years, increasingly large models have achieved outstanding performance across CV tasks. However, these models demand substantial computational resources and storage, and their growing complexity limits our understanding of how they make decisions. Most of these architectures rely on the attention mechanism within Transformer-based designs. Building upon the connection between residual neural networks and ordinary differential equations (ODEs), we introduce ODE-ViT, a Vision Transformer reformulated as an ODE system that satisfies the conditions for well-posed and stable dynamics. Experiments on CIFAR-10 and CIFAR-100 demonstrate that ODE-ViT achieves stable, interpretable, and competitive performance with up to one order of magnitude fewer parameters, surpassing prior ODE-based Transformer approaches in classification tasks. We further propose a plug-and-play teacher-student framework in which a discrete ViT guides the continuous trajectory of ODE-ViT by treating the intermediate representations of the teacher as solutions of the ODE. This strategy improves performance by more than 10% compared to training a free ODE-ViT from scratch.
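
The connection between residual networks and ODEs mentioned in this abstract can be illustrated with the forward Euler method: a residual update `x + h * f(x)` is one Euler step of dx/dt = f(x). The sketch below is a plain-Python illustration with a hypothetical scalar vector field `f`. It is not the ODE-ViT formulation, whose details are not given here.

```python
import math

def f(x):
    """Hypothetical vector field standing in for a residual branch."""
    return -x

def residual_stack(x0, num_blocks, total_time=1.0):
    """Apply num_blocks residual updates x <- x + h * f(x), with h = total_time / num_blocks."""
    h = total_time / num_blocks
    x = x0
    for _ in range(num_blocks):
        x = x + h * f(x)  # one residual block == one forward Euler step
    return x

coarse = residual_stack(1.0, num_blocks=10)
fine = residual_stack(1.0, num_blocks=1000)
exact = math.exp(-1.0)  # closed-form solution of dx/dt = -x at t = 1 with x(0) = 1
```

With more, smaller steps the stacked residual updates approach the continuous solution, which is the sense in which a deep residual network can be read as a discretized ODE.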


Machine-Learning Accelerated Calculations of Reduced Density Matrices

Azam, Awwab A., Zhao, Lexu, Yu, Jiabin

arXiv.org Artificial Intelligence

$n$-particle reduced density matrices ($n$-RDMs) play a central role in understanding correlated phases of matter. Yet the calculation of $n$-RDMs is often computationally inefficient for strongly-correlated states, particularly when the system sizes are large. In this work, we propose to use neural network (NN) architectures to accelerate the calculation of, and even predict, the $n$-RDMs for large-size systems. The underlying intuition is that $n$-RDMs are often smooth functions over the Brillouin zone (BZ) (certainly true for gapped states) and are thus interpolable, allowing NNs trained on small-size $n$-RDMs to predict large-size ones. Building on this intuition, we devise two NNs: (i) a self-attention NN that maps random RDMs to physical ones, and (ii) a Sinusoidal Representation Network (SIREN) that directly maps momentum-space coordinates to RDM values. We test the NNs in three 2D models: the pair-pair correlation functions of the Richardson model of superconductivity, the translationally-invariant 1-RDM in a four-band model with short-range repulsion, and the translation-breaking 1-RDM in the half-filled Hubbard model. We find that a SIREN trained on a $6\times 6$ momentum mesh can predict the $18\times 18$ pair-pair correlation function with a relative accuracy of $0.839$. The NNs trained on $6\times 6 \sim 8\times 8$ meshes can provide high-quality initial guesses for $50\times 50$ translation-invariant Hartree-Fock (HF) and $30\times 30$ fully translation-breaking-allowed HF, reducing the number of iterations required for convergence by up to $91.63\%$ and $92.78\%$, respectively, compared to random initializations. Our results illustrate the potential of using NN-based methods for interpolable $n$-RDMs, which might open a new avenue for future research on strongly correlated phases.
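
For readers unfamiliar with SIRENs, the following is a minimal sketch of a sinusoidal network that maps a momentum-space coordinate to a scalar value, evaluated on a coarse and a finer mesh. It uses untrained random weights. The layer sizes, `omega_0 = 30`, the initialization bounds (taken from the common SIREN convention), the normalized coordinates in [0, 1), and all function names are assumptions for illustration, not details from this work.

```python
import math
import random

def init_layer(rng, n_in, n_out, bound):
    """Weights and biases drawn uniformly from [-bound, bound]."""
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
    bias = [rng.uniform(-bound, bound) for _ in range(n_out)]
    return weights, bias

def affine(layer, x):
    """Compute W x + b for one layer."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def make_siren(rng, n_in=2, hidden=16, n_out=1, omega_0=30.0):
    """Two sine layers plus a linear read-out (sizes are illustrative assumptions)."""
    first = init_layer(rng, n_in, hidden, 1.0 / n_in)
    second = init_layer(rng, hidden, hidden, math.sqrt(6.0 / hidden) / omega_0)
    readout = init_layer(rng, hidden, n_out, math.sqrt(6.0 / hidden) / omega_0)
    return first, second, readout, omega_0

def siren_forward(model, k):
    """Map a coordinate k = (kx, ky) to a scalar using sin(omega_0 * (W x + b)) layers."""
    first, second, readout, omega_0 = model
    h = [math.sin(omega_0 * z) for z in affine(first, k)]
    h = [math.sin(omega_0 * z) for z in affine(second, h)]
    return affine(readout, h)[0]

def momentum_mesh(n):
    """n x n grid of normalized coordinates in [0, 1) x [0, 1) (an assumed convention)."""
    return [(i / n, j / n) for i in range(n) for j in range(n)]

rng = random.Random(0)
model = make_siren(rng)
coarse_values = [siren_forward(model, k) for k in momentum_mesh(6)]   # small mesh
fine_values = [siren_forward(model, k) for k in momentum_mesh(18)]    # larger mesh
```

Because the network is a continuous function of the coordinate, the same parameters can be queried on any mesh. In a trained setting this would be what allows a model fitted on a small mesh to be evaluated on a larger one.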


ST-GRIT: Spatio-Temporal Graph Transformer For Internal Ice Layer Thickness Prediction

Liu, Zesheng, Rahnemoonfar, Maryam

arXiv.org Artificial Intelligence

Understanding the thickness and variability of internal ice layers in radar imagery is crucial for monitoring snow accumulation, assessing ice dynamics, and reducing uncertainties in climate models. Radar sensors, capable of penetrating ice, provide detailed radargram images of these internal layers. In this work, we present ST-GRIT, a spatio-temporal graph transformer for ice layer thickness, designed to process these radargrams and capture the spatiotemporal relationships between shallow and deep ice layers. ST-GRIT leverages an inductive geometric graph learning framework to extract local spatial features as feature embeddings and employs a series of temporal and spatial attention blocks separately to model long-range dependencies effectively in both dimensions. Experimental evaluation on radargram data from the Greenland ice sheet demonstrates that ST-GRIT consistently outperforms current state-of-the-art methods and other baseline graph neural networks by achieving lower root mean-squared error. These results highlight the advantages of self-attention mechanisms on graphs over pure graph neural networks, including the ability to handle noise, avoid oversmoothing, and capture long-range dependencies. Moreover, the use of separate spatial and temporal attention blocks allows for distinct and robust learning of spatial relationships and temporal patterns, providing a more comprehensive and effective approach.
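
The idea of applying temporal and spatial attention as separate blocks can be sketched as follows. This is a minimal plain-Python illustration, assuming node features arranged as `[time][node][feature]` and using unprojected scaled dot-product self-attention (queries, keys, and values all equal to the input). Learned projections, the graph embedding stage, and the block ordering used in ST-GRIT are not specified here, so the function names and the temporal-then-spatial order are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens (no learned projections)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

def temporal_block(x):
    """Attend across time steps, independently for each node. x is [time][node][feature]."""
    n_time, n_nodes = len(x), len(x[0])
    out = [[None] * n_nodes for _ in range(n_time)]
    for n in range(n_nodes):
        attended = self_attention([x[t][n] for t in range(n_time)])
        for t in range(n_time):
            out[t][n] = attended[t]
    return out

def spatial_block(x):
    """Attend across nodes, independently for each time step."""
    return [self_attention(frame) for frame in x]

# Toy input (assumed): 3 time steps, 4 nodes, 2 features per node.
x = [[[0.1 * t + 0.01 * n, 1.0 - 0.1 * n] for n in range(4)] for t in range(3)]
y = spatial_block(temporal_block(x))
```

Each block mixes information along one axis only, so the cost per block scales with the square of that axis length rather than with the square of the full time-by-node token count.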


An Efficient Semantic Segmentation Decoder for In-Car or Distributed Applications

Nazir, Danish, Inti, Gowtham Sai, Bartels, Timo, Piewek, Jan, Bagdonat, Thorsten, Fingscheidt, Tim

arXiv.org Artificial Intelligence

Modern automotive systems leverage deep neural networks (DNNs) for semantic segmentation and operate in two key application areas: (1) in-car, where the DNN operates solely in the vehicle without strict constraints on the data rate, and (2) distributed, where one DNN part operates in the vehicle and the other part typically runs on a large-scale cloud platform with a particular constraint on transmission bitrate efficiency. Typically, both applications share an image and source encoder, while each uses distinct (joint) source and task decoders. Prior work utilized convolutional neural networks for joint source and task decoding but did not investigate transformer-based alternatives such as SegDeformer, which offer superior performance at the cost of higher computational complexity. In this work, we propose joint feature and task decoding for SegDeformer, thereby enabling lower computational complexity in both in-car and distributed applications, despite SegDeformer's computational demands. This improves scalability in the cloud while reducing in-car computational complexity. For the in-car application, we increased the frames per second (fps) by up to a factor of $11.7$ ($1.4$ fps to $16.5$ fps) on Cityscapes and by up to a factor of $3.5$ ($43.3$ fps to $154.3$ fps) on ADE20K, while remaining on par, in terms of mean intersection over union (mIoU), with the transformer-based baseline that does not compress with a source codec. For the distributed application, we achieve state-of-the-art (SOTA) mIoU over a wide range of bitrates, while using only $0.14\%$ ($0.04\%$) of the cloud DNN parameters used in the previous SOTA, reported on ADE20K (Cityscapes).