HistoART: Histopathology Artifact Detection and Reporting Tool
Kahaki, Seyed, Webber, Alexander R., Zamzmi, Ghada, Subbaswamy, Adarsh, Deshpande, Rucha, Badano, Aldo
In modern cancer diagnostics, Whole Slide Imaging (WSI) is widely used to digitize tissue specimens for detailed, high-resolution examination; however, other diagnostic approaches, such as liquid biopsy and molecular testing, are also utilized based on the cancer type and clinical context. While WSI has revolutionized digital histopathology by enabling automated, precise analysis, it remains vulnerable to artifacts introduced during slide preparation and scanning. These artifacts can compromise downstream image analysis. To address this challenge, we propose and compare three robust artifact detection approaches for WSIs: (1) a foundation model-based approach (FMA) using a fine-tuned Unified Neural Image (UNI) architecture, (2) a deep learning approach (DLA) built on a ResNet50 backbone, and (3) a knowledge-based approach (KBA) leveraging handcrafted features from texture, color, and frequency-based metrics. The methods target six common artifact types: tissue folds, out-of-focus regions, air bubbles, tissue damage, marker traces, and blood contamination. Evaluations were conducted on 50,000+ image patches from diverse scanners (Hamamatsu, Philips, Leica Aperio AT2) across multiple sites. The FMA achieved the highest patch-wise AUROC of 0.995 (95% CI [0.994, 0.995]), outperforming the ResNet50-based method (AUROC: 0.977, 95% CI [0.977, 0.978]) and the KBA (AUROC: 0.940, 95% CI [0.933, 0.946]). To translate detection into actionable insights, we developed a quality report scorecard that quantifies high-quality patches and visualizes artifact distributions.
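The knowledge-based approach (KBA) above scores patches with handcrafted texture, color, and frequency metrics. As a minimal sketch of one such cue, the variance-of-Laplacian blur measure below flags out-of-focus patches; the function names and the threshold are our own illustrative choices, not the paper's actual feature set.

```python
import numpy as np

def laplacian_variance(patch: np.ndarray) -> float:
    """Variance of the Laplacian response over a grayscale patch.

    Sharp tissue has strong edges and a high variance; blurred
    (out-of-focus) regions respond weakly, giving a low variance.
    """
    k = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
    h, w = patch.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(patch[i:i + 3, j:j + 3] * k)
    return float(out.var())

def is_out_of_focus(patch: np.ndarray, threshold: float = 50.0) -> bool:
    """Flag a patch as blurred when its Laplacian variance is low."""
    return laplacian_variance(patch) < threshold
```

In practice such a score would be one entry in a per-patch feature vector alongside color and frequency statistics, with thresholds calibrated on annotated patches.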
- North America > United States (0.14)
- Oceania > New Zealand (0.04)
- Europe > Sweden > Skåne County > Malmö (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Diagnostic Medicine (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.94)
Twill: Scheduling Compound AI Systems on Heterogeneous Mobile Edge Platforms
Taufique, Zain, Vyas, Aman, Miele, Antonio, Liljeberg, Pasi, Kanduri, Anil
Compound AI (cAI) systems chain multiple AI models to solve complex problems. Deploying cAI services on mobile edge platforms poses a significant challenge in scheduling concurrent DNN-transformer inference tasks, which arrive dynamically in an unknown sequence. Existing mobile edge AI inference strategies manage multi-DNN or transformer-only workloads, rely on design-time profiling, and cannot handle the concurrent inference of DNNs and transformers required by cAI systems. In this work, we address the challenge of scheduling cAI systems on heterogeneous mobile edge platforms. We present Twill, a run-time framework that handles concurrent inference requests of cAI workloads through task affinity-aware cluster mapping and migration, priority-aware task freezing/unfreezing, and Dynamic Voltage/Frequency Scaling (DVFS), while minimizing inference latency within power budgets. We implement and deploy Twill on the Nvidia Jetson Orin NX platform. We evaluate Twill against state-of-the-art edge AI inference techniques on contemporary DNNs and LLMs, reducing inference latency by 54% on average while honoring power budgets. AI applications are rapidly evolving from monolithic models towards Compound Artificial Intelligence (cAI) systems, which integrate multiple task-specific models and components to solve complex problems [1]-[3]. Emerging cAI systems combine Large Language Models (LLMs) with Deep Neural Networks (DNNs) to provide novel services such as conversational language agents [2]-[5], augmented and virtual reality (AR/VR) gear, and interactive autonomous vehicles [6]. In the exemplar system, DNN models (D1: VGG-19 and D2: ResNet-152) are used for image classification and object detection, transformer models (T1: BERT-base and T2: BERT-large) for text summarization and classification, and generative transformers (T3: OPT-350M and an LLM: DeepSeek-R1) for reasoning and report generation.
Each model is responsible for extracting key features from the given input and sending the output to the subsequent models to perform collaborative tasks. T1, D1, and D2 are exclusive inference tasks that can run simultaneously, while T2, T3, and LLM are dependent on the outputs of other models. We deployed the exemplar cAI system on the Nvidia Jetson Orin NX platform.
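The dependency structure described above (T1, D1, and D2 can run simultaneously; T2, T3, and the LLM wait on upstream outputs) can be sketched as a small DAG grouped into dispatch waves. The exact edges below are assumptions for illustration; the text only states which tasks are independent and which are dependent.

```python
# Hypothetical dependency edges for the exemplar cAI pipeline.
deps = {
    "D1": [], "D2": [], "T1": [],   # exclusive tasks, runnable at once
    "T2": ["T1"],                   # assumed edge
    "T3": ["D1", "D2"],             # assumed edge
    "LLM": ["T2", "T3"],            # assumed edge
}

def ready_waves(deps):
    """Group tasks into waves that can be dispatched concurrently.

    Each wave contains every task whose dependencies are already done,
    i.e. a level-by-level topological sort of the task graph.
    """
    remaining = {t: set(d) for t, d in deps.items()}
    waves, done = [], set()
    while remaining:
        wave = sorted(t for t, d in remaining.items() if d <= done)
        if not wave:
            raise ValueError("cycle in dependency graph")
        waves.append(wave)
        done |= set(wave)
        for t in wave:
            del remaining[t]
    return waves
```

A run-time scheduler like Twill would map each wave's tasks onto CPU/GPU clusters by affinity, rather than simply executing waves in lockstep as this sketch implies.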
- Europe > Finland > Southwest Finland > Turku (0.05)
- North America > United States (0.04)
- Europe > Italy > Lombardy > Milan (0.04)
Edge-GPU Based Face Tracking for Face Detection and Recognition Acceleration
Baobaid, Asma, Meribout, Mahmoud
Cost-effective machine vision systems dedicated to real-time and accurate face detection and recognition in public places are crucial for many modern applications. However, despite their high performance, which could be reached using specialized edge or cloud AI hardware accelerators, there is still room for improvement in throughput and power consumption. This paper suggests a combined hardware-software approach that optimizes face detection and recognition systems on one of the latest edge GPUs, namely the NVIDIA Jetson AGX Orin. First, it leverages the simultaneous usage of all its hardware engines to improve processing time. This offers an improvement over previous works where these tasks were mainly allocated automatically and exclusively to the CPU or, to a greater extent, to the GPU core. Additionally, the paper suggests integrating a face tracker module to avoid redundantly running the face recognition algorithm for every frame, running it only when a new face appears in the scene. The results of extended experiments suggest that simultaneous usage of all the hardware engines available in the Orin GPU, together with tracker integration into the pipeline, yields an impressive throughput of 290 FPS (frames per second) on 1920 x 1080 input frames containing an average of 6 faces per frame. Additionally, a substantial power saving of around 800 mW was achieved compared to running the task on the CPU/GPU engines alone and without integrating a tracker into the Orin GPU's pipeline. This hardware-software co-design approach can pave the way to designing high-performance machine vision systems at the edge, critically needed for video monitoring in public places where several nearby cameras are usually deployed for the same scene.
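The tracker-gating idea above (run recognition only when a new face appears) reduces, at its core, to matching each detection against existing tracks and recognizing only the unmatched ones. A minimal IoU-based sketch, with invented function names and a threshold chosen for illustration:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def faces_needing_recognition(detections, tracks, thresh=0.5):
    """Return only detections not matched to any existing track;
    matched faces keep their previously recognized identity."""
    return [det for det in detections
            if not any(iou(det, t) >= thresh for t in tracks)]
```

In a real pipeline the track list would be maintained by a tracker (e.g. Kalman- or correlation-based) and identities would be cached per track ID.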
- Information Technology > Hardware (0.70)
- Information Technology > Security & Privacy (0.67)
DocFusion: A Unified Framework for Document Parsing Tasks
Chai, Mingxu, Shen, Ziyu, Zhang, Chong, Zhang, Yue, Wang, Xiao, Dou, Shihan, Kang, Jihua, Zhang, Jiazheng, Zhang, Qi
Document parsing is essential for analyzing complex document structures and extracting fine-grained information, supporting numerous downstream applications. However, existing methods often require integrating multiple independent models to handle various parsing tasks, leading to high complexity and maintenance overhead. To address this, we propose DocFusion, a lightweight generative model with only 0.28B parameters. It unifies task representations and achieves collaborative training through an improved objective function. Experiments reveal and leverage the mutually beneficial interaction among recognition tasks, and integrating recognition data significantly enhances detection performance. The final results demonstrate that DocFusion achieves state-of-the-art (SOTA) performance across four key tasks.
Two-Stage Aggregation with Dynamic Local Attention for Irregular Time Series
Chen, Xingyu, Zheng, Xiaochen, Mollaysa, Amina, Schürch, Manuel, Allam, Ahmed, Krauthammer, Michael
Irregular multivariate time series data is characterized by varying time intervals between consecutive observations of measured variables/signals (i.e., features) and varying sampling rates (i.e., recordings/measurements) across these features. Modeling time series while taking these irregularities into account is still a challenging task for machine learning methods. Here, we introduce TADA, a Two-stage Aggregation process with Dynamic local Attention to harmonize time-wise and feature-wise irregularities in multivariate time series. In the first stage, the irregular time series undergoes temporal embedding (TE) using all available features at each time step. This process preserves the contribution of each available feature and generates a fixed-dimensional representation per time step. The second stage introduces a dynamic local attention (DLA) mechanism with adaptive window sizes. DLA aggregates time recordings using feature-specific windows to harmonize irregular time intervals, capturing feature-specific sampling rates. Hierarchical MLP mixer layers then process the output of DLA through multiscale patching to leverage information at various scales for the downstream tasks. TADA outperforms state-of-the-art methods on three real-world datasets, including the latest MIMIC IV dataset, highlighting its effectiveness in handling irregular multivariate time series and its potential for various real-world applications.
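The core of the feature-specific-window idea above can be sketched with a plain windowed mean standing in for the learned attention weights: each feature is aggregated around a query time using its own window width, so fast-sampled signals get narrow windows and slowly sampled ones get wide windows. All names, window sizes, and the mean-in-place-of-attention simplification are ours, not TADA's actual mechanism.

```python
import numpy as np

def windowed_aggregate(times, values, query_t, window):
    """Mean of one feature's observations within +/- window/2 of query_t.

    Stands in for DLA's learned attention: a real implementation would
    weight observations inside the window instead of averaging them.
    """
    mask = np.abs(times - query_t) <= window / 2
    return float(values[mask].mean()) if mask.any() else np.nan

def aggregate_features(series, query_t, windows):
    """Aggregate each feature with its own (feature-specific) window."""
    return {f: windowed_aggregate(t, v, query_t, windows[f])
            for f, (t, v) in series.items()}
```

For example, a heart-rate signal sampled every time unit might use a window of 2, while a lab value sampled every 10 units might use a window of 20, yielding one harmonized value per feature at each query time.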
- Europe > Switzerland > Zürich > Zürich (0.04)
- Asia > Middle East > Israel (0.04)
Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs
Chughtai, Bilal, Cooney, Alan, Nanda, Neel
How do transformer-based large language models (LLMs) store and retrieve knowledge? We focus on the most basic form of this task -- factual recall, where the model is tasked with explicitly surfacing stored facts in prompts of the form `Fact: The Colosseum is in the country of'. We find that the mechanistic story behind factual recall is more complex than previously thought. It comprises several distinct, independent, and qualitatively different mechanisms that additively combine, constructively interfering on the correct attribute. We term this generic phenomenon the additive motif: models compute through summing up multiple independent contributions. Each mechanism's contribution may be insufficient alone, but summing them results in constructive interference on the correct answer. In addition, we extend the method of direct logit attribution to attribute an attention head's output to individual source tokens. We use this technique to unpack what we call `mixed heads' -- which are themselves a pair of two separate additive updates from different source tokens.
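The per-source-token extension of direct logit attribution described above rests on a linearity fact: an attention head's output is a weighted sum of (value-projected) source-token vectors, so its contribution to the answer logit splits exactly into per-token terms. A toy numerical sketch with invented dimensions and random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_src = 8, 4

V = rng.normal(size=(n_src, d))        # value vector of each source token
attn = np.array([0.1, 0.2, 0.3, 0.4])  # attention paid to each source token
u = rng.normal(size=d)                 # unembedding direction of the answer

head_out = attn @ V                    # head output = sum_s attn[s] * V[s]
total = head_out @ u                   # direct logit attribution of the head

# Per-source-token attribution: the same sum, split before the dot product.
per_token = attn[:, None] * V @ u      # shape (n_src,)
assert np.isclose(per_token.sum(), total)
```

For a "mixed head", this decomposition would surface two separate additive updates coming from different source tokens inside one head's total contribution.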
- South America > Brazil (0.14)
- Europe > Italy (0.05)
- Europe > Germany (0.05)
- (96 more...)
- Leisure & Entertainment > Sports > Soccer (1.00)
- Leisure & Entertainment > Sports > Basketball (0.68)
DLAS: An Exploration and Assessment of the Deep Learning Acceleration Stack
Gibson, Perry, Cano, José, Crowley, Elliot J., Storkey, Amos, O'Boyle, Michael
Deep Neural Networks (DNNs) are extremely computationally demanding, which presents a large barrier to their deployment on resource-constrained devices. Since such devices are where many emerging deep learning applications lie (e.g., drones, vision-based medical technology), significant bodies of work from both the machine learning and systems communities have attempted to provide optimizations to accelerate DNNs. To help unify these two perspectives, in this paper we combine machine learning and systems techniques within the Deep Learning Acceleration Stack (DLAS), and demonstrate how these layers can be tightly dependent on each other with an across-stack perturbation study. We evaluate the impact on accuracy and inference time when varying different parameters of DLAS across two datasets, seven popular DNN architectures, four DNN compression techniques, three algorithmic primitives with sparse and dense variants, untuned and auto-scheduled code generation, and four hardware platforms. Our evaluation highlights how perturbations across DLAS parameters can cause significant variation and across-stack interactions. The highest-level observation from our evaluation is that model size, accuracy, and inference time are not guaranteed to be correlated. Overall we make 13 key observations, including that speedups provided by compression techniques are very hardware dependent, and that compiler auto-tuning can significantly alter which algorithm is best for a given configuration. With DLAS, we aim to provide a reference framework to aid machine learning and systems practitioners in reasoning about the context in which their respective DNN acceleration solutions exist. With our evaluation strongly motivating the need for co-design, we believe that DLAS can be a valuable concept for exploring the next generation of co-designed accelerated deep learning solutions.
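An across-stack perturbation study of the kind described amounts to enumerating the Cartesian product of the stack's parameter axes and benchmarking each point. A trivial sketch; the axis names and values below are ours, not the paper's exact lists:

```python
from itertools import product

# Illustrative DLAS axes (placeholder values, not the paper's).
datasets = ["CIFAR-10", "ImageNet"]
models = ["ResNet-18", "MobileNet-V2"]
compression = ["none", "pruned", "quantized"]
tuning = ["untuned", "auto-scheduled"]

# Each tuple is one across-stack configuration to benchmark for
# accuracy and inference time.
configs = list(product(datasets, models, compression, tuning))
```

The paper's observation that speedups are hardware dependent follows from running such a grid per platform and comparing rankings across platforms rather than pooling results.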
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Virginia > Williamsburg (0.04)
- (7 more...)
- Education (0.68)
- Health & Medicine (0.47)
- Energy (0.46)
An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l
Dao, James, Lau, Yeu-Tong, Rager, Can, Janiak, Jett
In recent years, large language models (LLMs) have made impressive gains in capability (Vaswani et al. 2017; Devlin et al. 2019; OpenAI 2023; Radford et al. 2019; Brown et al. 2020), often surpassing expectations (Wei et al. 2022). However, these models remain poorly understood, with their successes and failures largely unexplained. Understanding what LLMs learn and how they generate predictions is therefore an increasingly urgent scientific and practical challenge. Mechanistic interpretability (MI) aims to reverse engineer models into human-understandable algorithms or circuits (Geiger et al. 2021; Olah 2022; Wang et al. 2022), attempting to avoid pitfalls such as illusory understanding. With MI, we can identify and fix model errors (Vig et al. 2020; Hernandez et al. 2022; Meng et al. 2023; Hase et al. 2023), steer their outputs (Li et al. 2023), and explain emergent behaviors (Nanda et al. 2023; Barak et al. 2023). The central goals in MI are (a) localization: identifying the specific model components (attention heads, MLP layers) that the circuit is composed of; and (b) explaining the behavior of these components.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Michigan (0.04)
Shared Memory-contention-aware Concurrent DNN Execution for Diversely Heterogeneous System-on-Chips
Dagli, Ismet, Belviranli, Mehmet
Two distinguishing features of state-of-the-art mobile and autonomous systems are 1) there are often multiple workloads, mainly deep neural network (DNN) inference, running concurrently and continuously; and 2) they operate on shared memory system-on-chips (SoC) that embed heterogeneous accelerators tailored for specific operations. State-of-the-art lacks efficient performance and resource management techniques necessary to either maximize total system throughput or minimize end-to-end workload latency. In this work, we propose HaX-CoNN, a novel scheme that characterizes and maps layers in concurrently executing DNN inference workloads to a diverse set of accelerators within a SoC. Our scheme uniquely takes per-layer execution characteristics, shared memory (SM) contention, and inter-accelerator transitions into account to find optimal schedules. We evaluate HaX-CoNN on NVIDIA Orin, NVIDIA Xavier, and Qualcomm Snapdragon 865 SoCs. Our experimental results indicate that HaX-CoNN minimizes memory contention by up to 45% and can improve latency and total throughput by up to 32% and 29%, respectively, compared to the state-of-the-art approaches.
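The layer-to-accelerator mapping problem described above, where per-layer execution characteristics and inter-accelerator transition costs jointly determine the best schedule, can be illustrated with a tiny dynamic program over a single DNN's layers. The latencies, transition cost, and two-accelerator setup below are invented for illustration; HaX-CoNN's actual formulation additionally models shared-memory contention between concurrent workloads.

```python
# Hypothetical per-layer latencies (ms) on two accelerators, plus a
# fixed cost whenever consecutive layers switch accelerators.
lat = {
    "GPU": [1.0, 4.0, 1.0],
    "DLA": [2.0, 1.5, 2.5],
}
TRANSITION = 0.5

def best_schedule(lat, transition):
    """Min-latency per-layer accelerator assignment via a simple DP."""
    accs = list(lat)
    n = len(next(iter(lat.values())))
    # dp[a] = (cost, path) of the best schedule ending on accelerator a
    dp = {a: (lat[a][0], [a]) for a in accs}
    for i in range(1, n):
        dp = {
            a: min(
                (dp[p][0] + (0 if p == a else transition) + lat[a][i],
                 dp[p][1] + [a])
                for p in accs
            )
            for a in accs
        }
    return min(dp.values())
```

With these numbers the DP places the expensive middle layer on the DLA and the rest on the GPU, paying two transition costs because the switches still win overall, which is exactly the kind of trade-off a contention-aware scheduler must weigh.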
- North America > United States > Colorado > Jefferson County > Golden (0.14)
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- (2 more...)
- Information Technology (1.00)
- Semiconductors & Electronics (0.91)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Learn about Deep Learning Accelerators on the Jetson Orin with NVIDIA
Developers, or those of you interested in learning more about the Deep Learning Accelerator on NVIDIA's Jetson Orin mini PC, will be pleased to know that NVIDIA has published a new article over on its technical blog providing an overview of the Deep Learning Accelerator (DLA) when used with the Jetson system, which combines a CPU and GPU into a single module, providing developers with an expansive NVIDIA software stack in a small, low-power package that can be deployed at the edge. "Though the DLA doesn't have as many supported layers as the GPU, it still supports a wide variety of layers used in many popular neural network architectures. In many instances, the layer support may cover the requirements of your model. For example, the NVIDIA TAO Toolkit includes a wide variety of pre-trained models that are supported by the DLA, ranging from object detection to action recognition. While it's important to note that the DLA throughput is typically lower than that of the GPU, it is power-efficient and allows you to offload deep learning workloads, freeing the GPU for other tasks."