Weight Initialization and Variance Dynamics in Deep Neural Networks and Large Language Models

Han, Yankun

arXiv.org Artificial Intelligence

Weight initialization governs signal propagation and gradient flow at the start of training. This paper offers a theory-grounded and empirically validated study across two regimes: compact ReLU multilayer perceptrons and GPT-2-style transformers. First, a logarithmic sweep of the initial standard deviation maps the vanishing and exploding regimes and identifies a broad stability band with standard deviations between 1e-2 and 1e-1. Second, a controlled comparison shows that Kaiming (fan-in) initialization converges faster and more stably than Xavier under ReLU, consistent with variance-preserving theory. Third, in a from-scratch 12-layer GPT-2-style model, this paper tracks layerwise Q/K/V weight variance through pretraining and observes depth-dependent equilibration into narrow bands: shallow layers expand rapidly while deeper layers change more gradually. Together, these results connect classic initialization principles with modern transformer behavior and yield simple, practical recipes for robust training.
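The variance-preserving argument behind the Kaiming-vs-Xavier result can be checked numerically: under ReLU, a fan-in standard deviation of sqrt(2/fan_in) keeps the activation scale roughly constant with depth, while a Xavier-style sqrt(1/fan_in) halves the second moment at every layer. A minimal NumPy sketch (the width, depth, and sample count below are illustrative, not the paper's settings):

```python
import numpy as np

def forward_variance(init_std_fn, depth=20, width=256, n_samples=512, seed=0):
    """Propagate random inputs through a random ReLU MLP and return the
    per-layer standard deviation of the activations."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, width))
    stds = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * init_std_fn(width)
        x = np.maximum(x @ W.T, 0.0)  # linear layer followed by ReLU
        stds.append(x.std())
    return stds

xavier = lambda fan_in: np.sqrt(1.0 / fan_in)   # Glorot, fan-in form
kaiming = lambda fan_in: np.sqrt(2.0 / fan_in)  # He: corrects for ReLU halving

stds_x = forward_variance(xavier)
stds_k = forward_variance(kaiming)
print(f"layer-20 activation std: Xavier={stds_x[-1]:.4f}, Kaiming={stds_k[-1]:.4f}")
```

With Xavier the activation scale decays by roughly half per layer (vanishing signal after 20 layers), while Kaiming holds it near a constant, matching the stability-band picture in the abstract.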



THOR: A Generic Energy Estimation Approach for On-Device Training

Zhang, Jiaru, Wang, Zesong, Wang, Hao, Song, Tao, Su, Huai-an, Chen, Rui, Hua, Yang, Zhou, Xiangwei, Ma, Ruhui, Pan, Miao, Guan, Haibing

arXiv.org Artificial Intelligence

Battery-powered mobile devices (e.g., smartphones, AR/VR glasses, and various IoT devices) are increasingly being used for AI training due to their growing computational power and easy access to valuable, diverse, and real-time data. On-device training is highly energy-intensive, making accurate energy consumption estimation crucial for effective job scheduling and sustainable AI. However, the heterogeneity of devices and the complexity of models challenge the accuracy and generalizability of existing estimation methods. This paper proposes THOR, a generic approach for energy consumption estimation in deep neural network (DNN) training. First, we examine the layer-wise energy additivity property of DNNs and strategically partition the entire model into layers for fine-grained energy consumption profiling. Then, we fit Gaussian Process (GP) models to learn from layer-wise energy consumption measurements and estimate a DNN's overall energy consumption based on its layer-wise energy additivity property. We conduct extensive experiments with various types of models across different real-world platforms. The results demonstrate that THOR reduces the Mean Absolute Percentage Error (MAPE) by up to 30%. Moreover, THOR has been applied to guide energy-aware pruning, successfully reducing energy consumption by 50% and further demonstrating its generality and potential.
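THOR's central assumption, layer-wise energy additivity, means a new model's energy can be estimated by summing per-layer predictions learned from profiled measurements. The sketch below is illustrative only: the energy numbers are made up, and a least-squares line stands in for THOR's Gaussian Process models.

```python
import numpy as np

# Hypothetical per-layer energy measurements (joules per training step),
# keyed by (layer_type, parameter_count). Stand-ins for real profiling data.
measurements = {
    ("conv",   100_000): 0.8,
    ("conv",   400_000): 3.1,
    ("linear",  50_000): 0.3,
    ("linear", 200_000): 1.1,
}

def fit_linear_profile(layer_type):
    """Least-squares fit of energy vs. parameter count for one layer type
    (a crude stand-in for THOR's Gaussian Process regressor)."""
    pts = [(n, e) for (t, n), e in measurements.items() if t == layer_type]
    n = np.array([p[0] for p in pts], dtype=float)
    e = np.array([p[1] for p in pts])
    slope, intercept = np.polyfit(n, e, 1)
    return lambda params: slope * params + intercept

profiles = {t: fit_linear_profile(t) for t in ("conv", "linear")}

# Additivity: whole-model energy = sum of per-layer estimates.
model_layers = [("conv", 200_000), ("conv", 300_000), ("linear", 100_000)]
total = sum(profiles[t](n) for t, n in model_layers)
print(f"estimated energy per step: {total:.2f} J")
```

The additive decomposition is what makes the approach generic: any architecture built from profiled layer types can be estimated without measuring the whole model end to end.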


Democratizing MLLMs in Healthcare: TinyLLaVA-Med for Efficient Healthcare Diagnostics in Resource-Constrained Settings

Mir, Aya El, Luoga, Lukelo Thadei, Chen, Boyuan, Hanif, Muhammad Abdullah, Shafique, Muhammad

arXiv.org Artificial Intelligence

Multimodal Large Language Models (MLLMs) integrate Large Language Models (LLMs) with Vision Encoders, thus possessing capabilities that extend beyond textual understanding and analysis to include image processing. This enables them to simultaneously interpret both textual data and medical images, facilitating more accurate and comprehensive diagnostics and decision-making in healthcare. By rapidly processing and synthesizing diverse data types, these models can significantly advance patient care, enabling quicker, more precise diagnoses and personalized treatment plans, thus transforming healthcare into a more efficient, effective, and patient-centered service [5] [6]. The deployment of these MLLMs in healthcare is, however, hindered by their high computational demands and significant memory requirements, which are particularly challenging for resource-constrained devices like the Nvidia Jetson Xavier. This problem is particularly evident in remote medical settings where advanced diagnostics are needed but resources are limited. In this paper, we introduce an optimization method for the general-purpose MLLM TinyLLaVA, which we have adapted and renamed TinyLLaVA-Med. This adaptation involves instruction-tuning and fine-tuning TinyLLaVA on a medical dataset by drawing inspiration from the LLaVA-Med training.


On Initializing Transformers with Pre-trained Embeddings

Kim, Ha Young, Balasubramanian, Niranjan, Kang, Byungkon

arXiv.org Artificial Intelligence

It has become common practice now to use random initialization schemes, rather than pre-trained embeddings, when training transformer-based models from scratch. Indeed, we find that pre-trained word embeddings from GloVe, and some sub-word embeddings extracted from language models such as T5 and mT5, fare much worse compared to random initialization. This is counter-intuitive given the well-known representational and transfer-learning advantages of pre-training. Interestingly, we also find that BERT and mBERT embeddings fare better than random initialization, showing the advantages of pre-trained representations. In this work, we posit two potential factors that contribute to these mixed results: the model sensitivity to parameter distribution and the embedding interactions with position encodings. We observe that pre-trained GloVe, T5, and mT5 embeddings have a wider distribution of values. As argued in the initialization studies, such large-value initializations can lead to poor training because of saturated outputs. Further, the larger embedding values can, in effect, absorb the smaller position encoding values when added together, thus losing position information. Standardizing the pre-trained embeddings to a narrow range (e.g., as prescribed by Xavier) leads to substantial gains for GloVe, T5, and mT5 embeddings. On the other hand, BERT pre-trained embeddings, while larger, are still relatively close to the Xavier initialization range, which may allow them to transfer the pre-trained knowledge effectively.
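The remedy the abstract describes, rescaling pre-trained embeddings into a Xavier-style range, can be sketched as follows. The choice of sqrt(2/(fan_in + fan_out)) as the target standard deviation, and treating the embedding matrix shape (vocab, dim) as (fan_in, fan_out), are assumptions of this illustration, not the authors' exact procedure.

```python
import numpy as np

def standardize_to_xavier(emb):
    """Zero-center an embedding matrix and rescale it to the standard
    deviation that Xavier initialization would prescribe for its shape."""
    fan_in, fan_out = emb.shape                      # (vocab_size, embed_dim)
    target_std = np.sqrt(2.0 / (fan_in + fan_out))   # Xavier std for this shape
    centered = emb - emb.mean()
    return centered * (target_std / centered.std())

rng = np.random.default_rng(0)
# Hypothetical "wide" pre-trained embeddings (GloVe-like spread of values).
pretrained = rng.normal(loc=0.1, scale=0.4, size=(5000, 128))
rescaled = standardize_to_xavier(pretrained)
print(f"std before: {pretrained.std():.3f}   std after: {rescaled.std():.4f}")
```

After rescaling, the embedding values are small enough that the additive position encodings are no longer swamped, which is the mechanism the abstract proposes for the observed gains.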


First Heuristic Then Rational: Dynamic Use of Heuristics in Language Model Reasoning

Aoki, Yoichi, Kudo, Keito, Kuribayashi, Tatsuki, Sone, Shusaku, Taniguchi, Masaya, Sakaguchi, Keisuke, Inui, Kentaro

arXiv.org Artificial Intelligence

Multi-step reasoning is widely adopted in the community to improve the performance of language models (LMs). We report on the systematic strategy that LMs use in this process. Our controlled experiments reveal that LMs rely more heavily on heuristics, such as lexical overlap, in the earlier stages of reasoning, when more steps are required to reach an answer. Conversely, as LMs progress closer to the final answer, their reliance on heuristics decreases. This suggests that LMs track only a limited number of future steps and dynamically combine heuristic strategies with logical ones in tasks involving multi-step reasoning.

Figure 1: Illustration of the systematic strategy we discovered in language models (LMs). When the goal is distant from the current state in a multi-step reasoning process, the models tend to rely on heuristics, such as superficial overlap, which can lead them in the wrong direction. In contrast, when the goal is within a limited distance, the models are more likely to take rational actions.
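One simple way to quantify the "lexical overlap" heuristic signal mentioned above is Jaccard overlap between token sets; this particular metric is an illustration, not necessarily the paper's exact measure.

```python
def lexical_overlap(a, b):
    """Jaccard overlap between the token sets of two strings: a simple
    proxy for the superficial-overlap signal a model might exploit."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# Hypothetical reasoning state and goal statement.
state = "the red key opens the red door"
goal = "open the blue chest with the blue key"
print(f"overlap: {lexical_overlap(state, goal):.2f}")
```

A model leaning on this heuristic would prefer next steps that score high against the goal, even when that overlap points in the wrong direction, which is exactly the early-stage behavior the paper reports.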


On Limitations of the Transformer Architecture

Peng, Binghui, Narayanan, Srini, Papadimitriou, Christos

arXiv.org Machine Learning

What are the root causes of hallucinations in large language models (LLMs)? We use Communication Complexity to prove that the Transformer layer is incapable of composing functions (e.g., identify a grandparent of a person in a genealogy) if the domains of the functions are large enough; we show through examples that this inability is already empirically present when the domains are quite small. We also point out that several mathematical tasks that are at the core of the so-called compositional tasks thought to be hard for LLMs are unlikely to be solvable by Transformers, for large enough instances and assuming that certain well accepted conjectures in the field of Computational Complexity are true.
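For contrast, the grandparent-composition task the paper uses as its running example is trivial to state programmatically, which is what makes the impossibility result striking: the hard part for a Transformer layer is the two-hop lookup over a large domain, not the function itself. A hypothetical genealogy:

```python
# Two-hop function composition: the "grandparent" task cited as an example
# of what a single Transformer layer provably cannot do over large domains.
# The genealogy below is hypothetical.
parent = {
    "alice": "bob",
    "bob": "carol",
    "carol": "dana",
}

def grandparent(person):
    """Compose parent-of with itself: g(x) = parent(parent(x))."""
    return parent.get(parent.get(person))

print(grandparent("alice"))  # carol
```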


Principled Weight Initialization for Hypernetworks

Chang, Oscar, Flokas, Lampros, Lipson, Hod

arXiv.org Artificial Intelligence

Hypernetworks are meta neural networks that generate weights for a main neural network in an end-to-end differentiable manner. Despite extensive applications ranging from multi-task learning to Bayesian deep learning, the problem of optimizing hypernetworks has not been studied to date. We observe that classical weight initialization methods like Glorot & Bengio (2010) and He et al. (2015), when applied directly on a hypernet, fail to produce weights for the mainnet in the correct scale. We develop principled techniques for weight initialization in hypernets, and show that they lead to more stable mainnet weights, lower training loss, and faster convergence. Meta-learning describes a broad family of techniques in machine learning that deals with the problem of learning to learn. An emerging branch of meta-learning involves the use of hypernetworks, which are meta neural networks that generate the weights of a main neural network to solve a given task in an end-to-end differentiable manner. Hypernetworks were originally introduced by Ha et al. (2016) as a way to induce weight-sharing and achieve model compression by training the same meta network to learn the weights belonging to different layers in the main network.
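The failure mode can be reproduced with a one-layer hypernet: if the hypernet's output layer gets a standard He initialization, the variance of the generated mainnet weights depends on the hypernet embedding size rather than on the mainnet's fan-in. The corrected scale below, chosen so the generated weights come out He-scaled for the mainnet, follows the spirit of the paper's fix but is an illustrative sketch, not its exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 64            # hypernet embedding size
fan_in = 1024     # mainnet layer fan-in
n_weights = 4096  # number of mainnet weights the hypernet generates

e = rng.standard_normal(k)  # layer embedding, unit variance

# Naive: He init applied to the hypernet's own output layer (fan-in = k).
# Generated weight variance is then ~2 * Var(e), independent of the mainnet.
H_naive = rng.standard_normal((n_weights, k)) * np.sqrt(2.0 / k)
w_naive = H_naive @ e

# Principled: pick the hypernet output scale so the *generated* mainnet
# weights come out with variance 2 / fan_in (He scale for the mainnet).
H_fixed = rng.standard_normal((n_weights, k)) * np.sqrt(2.0 / (fan_in * k))
w_fixed = H_fixed @ e

target_std = np.sqrt(2.0 / fan_in)
print(f"target {target_std:.4f} | naive {w_naive.std():.4f} | fixed {w_fixed.std():.4f}")
```

The naive weights come out orders of magnitude too large for a 1024-fan-in mainnet layer, which is exactly the wrong-scale failure the abstract describes.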


Spanish hospital carries out lung transplant using 4-armed robot dubbed 'Da Vinci'

FOX News

A Spanish hospital carried out a lung transplant using a pioneering technique with a robot and a new access route that no longer requires separating the ribs and opening up the chest, experts said on Monday. Surgeons at Vall d'Hebron hospital in Barcelona used a four-arm robot dubbed "Da Vinci" to cut a small section of the patient's skin, fat and muscle to remove the damaged lung and insert a new one through an eight-centimetre incision in the lower part of the sternum, just above the diaphragm. The new procedure is less painful for the patient, they said, as the wound closes easily, and is safer than the traditional method which requires a 30-centimetre incision and a very delicate post-operative period.