 deep fusion


MDF-MLLM: Deep Fusion Through Cross-Modal Feature Alignment for Contextually Aware Fundoscopic Image Classification

Jordan, Jason, Lor, Mohammadreza Akbari, Koulen, Peter, Shyu, Mei-Ling, Chen, Shu-Ching

arXiv.org Artificial Intelligence

This study aimed to enhance disease classification accuracy from retinal fundus images by integrating fine-grained image features and global textual context using a novel multimodal deep learning architecture. Existing multimodal large language models (MLLMs) often struggle to capture low-level spatial details critical for diagnosing retinal diseases such as glaucoma, diabetic retinopathy, and retinitis pigmentosa. This model development and validation study was conducted on 1,305 fundus image-text pairs compiled from three public datasets (FIVES, HRF, and StoneRounds), covering acquired and inherited retinal diseases, and evaluated using classification accuracy and F1-score. The MDF-MLLM integrates skip features from four U-Net encoder layers into cross-attention blocks within a LLaMA 3.2 11B MLLM. Vision features are patch-wise projected and fused using scaled cross-attention and FiLM-based U-Net modulation. The baseline MLLM achieved 60% accuracy on the dual-type disease classification task. MDF-MLLM, with both U-Net and MLLM components fully fine-tuned during training, achieved a significantly higher accuracy of 94%, representing a 56% relative improvement. Recall and F1-scores improved by as much as 67% and 35% over baseline, respectively. Ablation studies confirmed that the multi-depth fusion approach contributed to substantial gains in spatial reasoning and classification, particularly for inherited diseases with rich clinical text. MDF-MLLM presents a generalizable, interpretable, and modular framework for fundus image classification, outperforming traditional MLLM baselines through multi-scale feature fusion. The architecture holds promise for real-world deployment in clinical decision support systems. Future work will explore synchronized training techniques, a larger pool of diseases for more generalizability, and extending the model for segmentation tasks.
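
The FiLM-based modulation mentioned in this abstract can be pictured with a minimal sketch. FiLM (feature-wise linear modulation) scales and shifts each feature channel using values predicted from a conditioning vector. The abstract does not give layer shapes or say how the conditioning vector is produced, so the function name `film_modulate`, the weight matrices, and every dimension below are assumptions chosen for illustration, not the paper's implementation.

```python
import numpy as np


def film_modulate(features, context, w_gamma, w_beta):
    """Apply feature-wise linear modulation to a feature map.

    features: (C, H, W) image feature map (e.g. one U-Net layer; assumed shape)
    context:  (D,) conditioning vector (e.g. pooled text features; assumed)
    w_gamma, w_beta: (D, C) hypothetical projection weights
    """
    gamma = context @ w_gamma  # per-channel scale, shape (C,)
    beta = context @ w_beta    # per-channel shift, shape (C,)
    # Broadcast the per-channel scale and shift over the spatial dimensions.
    return gamma[:, None, None] * features + beta[:, None, None]


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8, 8))   # 4 channels, 8x8 spatial grid
context = rng.normal(size=16)           # 16-dimensional conditioning vector
w_gamma = rng.normal(size=(16, 4))
w_beta = rng.normal(size=(16, 4))

modulated = film_modulate(features, context, w_gamma, w_beta)
print(modulated.shape)
```

The modulation leaves the spatial layout untouched and only rescales and shifts whole channels, which is why it is a lightweight way to let one modality steer another.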


Deep Fusion: Efficient Network Training via Pre-trained Initializations

Mazzawi, Hanna, Gonzalvo, Xavi, Wunder, Michael

arXiv.org Artificial Intelligence

In recent years, deep learning has made remarkable progress in a wide range of domains, with a particularly notable impact on natural language processing tasks. One of the challenges associated with training deep neural networks is the need for large amounts of computational resources and time. In this paper, we present Deep Fusion, an efficient approach to network training that leverages pre-trained initializations of smaller networks. We show that Deep Fusion accelerates the training process, reduces computational requirements, and leads to improved generalization performance on a variety of NLP tasks and T5 model sizes. Our experiments demonstrate that Deep Fusion is a practical and effective approach to reduce the training time and resource consumption while maintaining, or even surpassing, the performance of traditional training methods.


Deep fusion of gray level co-occurrence matrices for lung nodule classification

Saihood, Ahmed, Karshenas, Hossein, Nilchi, AhmadReza Naghsh

arXiv.org Artificial Intelligence

Lung cancer is a severe threat to human health, and millions of people die because of late diagnosis; thus, it is vital to detect the disease as early as possible. Computed tomography (CT) of the chest is considered one of the efficient solutions for detecting and classifying lung nodules. Analyzing lung CT images with high accuracy is one of the crucial challenges in detecting and classifying lung cancer. A new long short-term memory (LSTM) based deep fusion structure is introduced, in which texture features computed from lung nodules through new volumetric gray-level co-occurrence matrix (GLCM) computations are used to classify the nodules as benign, malignant, or ambiguous. An improved Otsu segmentation method combined with the water strider optimization algorithm (WSA) is proposed to detect the lung nodules. Otsu-WSA thresholding can overcome the restrictions present in previous thresholding methods. Extended experiments are run to assess this fusion structure by considering 2D-GLCM-based fusion of 2D slices, and an approximation of the 3D-GLCM using a volumetric 2.5D-GLCM-based LSTM fusion structure. The proposed methods are trained and assessed on the LIDC-IDRI dataset, where accuracy, sensitivity, and specificity of 94.4%, 91.6%, and 95.8%, respectively, are obtained for 2D-GLCM fusion, and 97.33%, 96%, and 98%, respectively, for 2.5D-GLCM fusion. The corresponding values are 98.7%, 98%, and 99% for 3D-GLCM fusion. The obtained results and analysis indicate that the WSA-Otsu method requires less execution time and yields a more accurate thresholding process. It is found that the 3D-GLCM based LSTM outperforms its counterparts.
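
For readers unfamiliar with gray-level co-occurrence matrices, the sketch below computes a plain 2D GLCM and one common texture feature (contrast) on a toy image. It illustrates only the basic, textbook 2D construction; the volumetric 2.5D and 3D variants, the LSTM fusion, and the Otsu-WSA segmentation described in the abstract are not reproduced here. The function names and the toy image are assumptions made for illustration.

```python
import numpy as np


def glcm(image, levels, dx=1, dy=0):
    """Count how often gray level i has gray level j at offset (dy, dx)."""
    matrix = np.zeros((levels, levels), dtype=np.int64)
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            # Skip pairs whose second pixel falls outside the image.
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r, c], image[r2, c2]] += 1
    return matrix


def glcm_contrast(matrix):
    """Contrast: sum over (i, j) of p(i, j) * (i - j)^2, with p normalized."""
    p = matrix / matrix.sum()
    i, j = np.indices(matrix.shape)
    return float(np.sum(p * (i - j) ** 2))


# Toy 4x4 image with four gray levels (0..3).
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
])

counts = glcm(image, levels=4)      # horizontal neighbor, one pixel to the right
contrast = glcm_contrast(counts)
print(counts)
print(contrast)
```

A feature vector built from several such statistics, computed per slice or per volume, is the kind of texture descriptor a downstream classifier could consume.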


FusionDeepMF: A Dual Embedding based Deep Fusion Model for Recommendation

Mandal, Supriyo, Maiti, Abyayananda

arXiv.org Artificial Intelligence

Traditional Collaborative Filtering (CF) based methods are applied to understand the personal preferences of users/customers for items or products from the rating matrix. Usually, the rating matrix is sparse in nature, so some improved variants of the CF method use the increasing amount of side information to handle the sparsity problem. Most available recommendation work applies either only a linear kernel or only a non-linear kernel to learn user-item latent feature embeddings from data. Either kind of kernel alone is not sufficient to learn complex user-item features from the side information of users. Recently, some researchers have focused on hybrid models that learn some features with non-linear kernels and other features with linear kernels. But it is very difficult to determine which features can be learned accurately with linear kernels and which with non-linear kernels. To overcome this problem, we propose a novel deep fusion model named FusionDeepMF, whose novel contributions are i) learning the user-item rating matrix and side information through linear and non-linear kernels simultaneously, and ii) applying a tuning parameter that determines the trade-off between the dual embeddings generated from the linear and non-linear kernels. Extensive experiments on online review datasets establish that FusionDeepMF can be remarkably futuristic compared to other baseline approaches. Empirical evidence also shows that FusionDeepMF achieves better performance than the linear kernels of Matrix Factorization (MF) and the non-linear kernels of Multi-layer Perceptron (MLP).
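
One plausible reading of the trade-off parameter is a convex combination of a linear (dot-product) score and a non-linear (MLP) score, each computed from its own pair of embeddings. The abstract does not state the actual combination rule, so the formula, the name `alpha`, the network shape, and all values below are assumptions for illustration only.

```python
import numpy as np


def linear_score(user_vec, item_vec):
    """Matrix-factorization style score: dot product of the embeddings."""
    return float(user_vec @ item_vec)


def nonlinear_score(user_vec, item_vec, w_hidden, w_out):
    """One-hidden-layer MLP over the concatenated embeddings (ReLU)."""
    hidden = np.maximum(0.0, np.concatenate([user_vec, item_vec]) @ w_hidden)
    return float(hidden @ w_out)


def fused_score(alpha, lin, nonlin):
    """Assumed fusion rule: alpha weights the linear branch, 1 - alpha the other."""
    return alpha * lin + (1.0 - alpha) * nonlin


rng = np.random.default_rng(1)
# Dual embeddings: separate user/item vectors for each branch (hypothetical sizes).
user_lin, item_lin = rng.normal(size=3), rng.normal(size=3)
user_mlp, item_mlp = rng.normal(size=3), rng.normal(size=3)
w_hidden = rng.normal(size=(6, 5))
w_out = rng.normal(size=5)

lin = linear_score(user_lin, item_lin)
nonlin = nonlinear_score(user_mlp, item_mlp, w_hidden, w_out)
prediction = fused_score(0.7, lin, nonlin)
print(prediction)
```

With `alpha = 1.0` the sketch reduces to plain matrix factorization, and with `alpha = 0.0` to the MLP branch alone, which is what makes a single scalar a convenient knob between the two.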


Apple's Deep Fusion photography comes to iPhone 11 in iOS 13.2 beta (updated)

#artificialintelligence

You now have a chance to try Apple's machine learning-based Deep Fusion photography if you're willing to live on the bleeding edge. Apple is releasing an iOS 13.2 developer beta (a public beta is likely to follow soon) that makes Deep Fusion available to iPhone 11 and iPhone 11 Pro owners. The technique uses machine learning to create highly detailed, sharper and more natural-looking photos on the primary and telephoto lenses by combining the results of multiple shots. Deep Fusion takes an underexposed photo for sharpness, and blends that with three neutral pictures and a long high-exposure image on a per-pixel level to achieve a highly customized result. The machine learning system examines the context of the picture to understand where a pixel sits on the frequency spectrum.


Apple's Deep Fusion hands-on: AI sharpens photos like HDR fixes colors

#artificialintelligence

Digital photographers coined the term "pixel peepers" years ago to denote -- mostly with scorn -- people who focused on flaws in the individual dots that create photos rather than the entirety of the images. Zooming in to 100%, it was said, is nothing but a recipe for perpetual disappointment; instead, judge each camera by the overall quality of the photo it takes, and don't get too mired in the details. Until now, Apple's approach to digital photography has been defined by its commitment to improving the quality of the big picture without further compromising pixel-level quality. I say "further" because there's no getting around the fact that tiny phone camera sensors are physically incapable of matching the pixel-level results of full-frame DSLR camera sensors in a fair fight. Bigger sensors can capture more light and almost invariably more actual pixels than the iPhone's 12-megapixel cameras.


Apple's New iPhone 11 Pro Has the First Artificially Intelligent Camera

#artificialintelligence

Apple's new high-end iPhone will make any traditional camera manufacturer tremble. The iPhone 11 Pro, unveiled at a special Apple event on Tuesday, not only has three cameras in the back--each having its own functions--but also for the first time utilizes artificial intelligence to take a photo. Yes, the next time you feel proud of snapping a perfect pic, it may have actually been the little robot living inside your phone. Here's how it works: On the iPhone 11 Pro, every time you are about to take a picture, the cameras will quickly take eight images of the object before you press the shutter. When you actually take a photo, the phone will compare your image against the eight previously taken ones and merge the best pixels of each image into one final product.


Deep Fusion is the iPhone's take on AI photography

#artificialintelligence

In announcing the iPhones 11 Pro, Phil Schiller tipped us off to a new feature that'll come to the flagship smartphones in the next year. Deep Fusion is a system which Schiller describes as "computational photography mad science," which is likely to be the company's answer, more or less, to Google's Night Sight. As Schiller explained, when you're about to take an image with the new iPhone 11 Pro, the camera will snap 8 images before you press the shutter. When you do, it'll then take one long exposure, and then stitch a new image together, "pixel-by-pixel" to create one with lots of detail and very little noise. It's not specifically designed for shooting in the dark, but it's clear that Apple is parking its tanks on Google's lawn. Night Sight has been one of the strengths of the last few Pixel phones, using machine learning to create well-lit images in dark environments.


A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition

Toshniwal, Shubham, Kannan, Anjuli, Chiu, Chung-Cheng, Wu, Yonghui, Sainath, Tara N, Livescu, Karen

arXiv.org Artificial Intelligence

Attention-based recurrent neural encoder-decoder models present an elegant solution to the automatic speech recognition problem. This approach folds the acoustic model, pronunciation model, and language model into a single network and requires only a parallel corpus of speech and text for training. However, unlike in conventional approaches that combine separate acoustic and language models, it is not clear how to use additional (unpaired) text. While there has been previous work on methods addressing this problem, a thorough comparison among methods is still lacking. In this paper, we compare a suite of past methods and some of our own proposed methods for using unpaired text data to improve encoder-decoder models. For evaluation, we use the medium-sized Switchboard data set and the large-scale Google voice search and dictation data sets. Our results confirm the benefits of using unpaired text across a range of methods and data sets. Surprisingly, for first-pass decoding, the rather simple approach of shallow fusion performs best across data sets. However, for Google data sets we find that cold fusion has a lower oracle error rate and outperforms other approaches after second-pass rescoring on the Google voice search data set.
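
Shallow fusion, as the term is commonly used, combines the encoder-decoder model's log-probability for each candidate token with a weighted log-probability from a separately trained language model at decoding time. The sketch below shows that scoring rule for a single decoding step. The toy probabilities, the weight value, and the function name are assumptions for illustration and are not taken from the paper's experiments.

```python
import math


def shallow_fusion_scores(asr_log_probs, lm_log_probs, lm_weight):
    """Per-token score: log p_asr(token) + lm_weight * log p_lm(token)."""
    return {
        token: asr_log_probs[token] + lm_weight * lm_log_probs[token]
        for token in asr_log_probs
    }


# Toy next-token distributions for one decoding step (hypothetical values).
asr = {"their": math.log(0.5), "there": math.log(0.4), "they're": math.log(0.1)}
lm = {"their": math.log(0.1), "there": math.log(0.8), "they're": math.log(0.1)}

fused = shallow_fusion_scores(asr, lm, lm_weight=0.5)
best_without_lm = max(asr, key=asr.get)
best_with_lm = max(fused, key=fused.get)
print(best_without_lm, best_with_lm)
```

In this toy step the acoustic model alone prefers "their", while the fused score prefers "there" because the language model strongly favors it. In a real decoder the same rule is applied to every hypothesis inside beam search.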