Goto

Collaborating Authors

 fiber





Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Neural Information Processing Systems

Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones to better capture multimodal interactions. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is released at https://github.com/microsoft/FIBER.


LYNX: Learning Dynamic Exits for Confidence-Controlled Reasoning

Akgül, Ömer Faruk, Kalaycı, Yusuf Hakan, Kannan, Rajgopal, Neiswanger, Willie, Prasanna, Viktor

arXiv.org Artificial Intelligence

Large reasoning models achieve strong performance on complex tasks by generating extended chains of thought, but they often "overthink": continuing to reason long after they have enough information to answer correctly. This wastes inference-time compute and can hurt accuracy. Existing attempts to stop early either manipulate decoding with extra sampling and heuristics, rely on auxiliary verifier models, or operate only as post-hoc analysis pipelines without formal guarantees. We introduce LYNX, an online early-exit mechanism that turns a model's own hidden-state awareness into confidence-controlled stopping decisions. LYNX attaches exit decisions to naturally occurring reasoning cues (e.g., "hmm", "wait") during generation, trains a lightweight probe on hidden states at those cue tokens using supervision from forced exits, and wraps the resulting scores in split conformal prediction to obtain distribution-free control over premature exits. Crucially, we train and calibrate this probe once on a generic mathematical corpus and reuse it unchanged across benchmarks, decoding temperatures, and even non-mathematical tasks. Across three model families spanning 1.5B to 32B parameters, a single mathematically trained probe per base model yields strong accuracy--efficiency tradeoffs. On GSM8K, LYNX matches or improves baseline accuracy while reducing tokens by 40--65\%; on MATH-500 it improves accuracy by up to 12 points with roughly 35--60\% fewer tokens; on AIME 2024 it recovers baseline accuracy with more than 50\% token savings; and on CommonsenseQA, a non-math benchmark, it transfers zero-shot with modest accuracy gains and up to 70\% fewer tokens. Compared to state-of-the-art early-exit methods, LYNX offers competitive or superior Pareto frontiers while remaining fully online, requiring no proxy models at inference, and providing explicit, user-tunable confidence guarantees.


Self-Improving AI Agents through Self-Play

Chojecki, Przemyslaw

arXiv.org Artificial Intelligence

We extend the moduli-theoretic framework of psychometric batteries to the domain of dynamical systems. While previous work established the AAI capability score as a static functional on the space of agent representations, this paper formalizes the agent as a flow $ν_r$ parameterized by computational resource $r$, governed by a recursive Generator-Verifier-Updater (GVU) operator. We prove that this operator generates a vector field on the parameter manifold $Θ$, and we identify the coefficient of self-improvement $κ$ as the Lie derivative of the capability functional along this flow. The central contribution of this work is the derivation of the Variance Inequality, a spectral condition that is sufficient (under mild regularity) for the stability of self-improvement. We show that a sufficient condition for $κ> 0$ is that, up to curvature and step-size effects, the combined noise of generation and verification must be small enough. We then apply this formalism to unify the recent literature on Language Self-Play (LSP), Self-Correction, and Synthetic Data bootstrapping. We demonstrate that architectures such as STaR, SPIN, Reflexion, GANs and AlphaZero are specific topological realizations of the GVU operator that satisfy the Variance Inequality through filtration, adversarial discrimination, or grounding in formal systems.


Application of Graph Based Vision Transformers Architectures for Accurate Temperature Prediction in Fiber Specklegram Sensors

Sebastian, Abhishek

arXiv.org Artificial Intelligence

Fiber Specklegram Sensors (FSS) are highly effective for environmental monitoring, particularly for detecting temperature variations. However, the nonlinear nature of specklegram data presents significant challenges for accurate temperature prediction. This study investigates the use of transformer-based architectures, including Vision Transformers (ViTs), Swin Transformers, and emerging models such as Learnable Importance Non-Symmetric Attention Vision Transformers (LINA-ViT) and Multi-Adaptive Proximity Vision Graph Attention Transformers (MAP-ViGAT), to predict temperature from specklegram data over a range of 0 to 120 Celsius. The results show that ViTs achieved a Mean Absolute Error (MAE) of 1.15, outperforming traditional models such as CNNs. GAT-ViT and MAP-ViGAT variants also demonstrated competitive accuracy, highlighting the importance of adaptive attention mechanisms and graph-based structures in capturing complex modal interactions and phase shifts in specklegram data. Additionally, this study incorporates Explainable AI (XAI) techniques, including attention maps and saliency maps, to provide insights into the decision-making processes of the transformer models, improving interpretability and transparency. These findings establish transformer architectures as strong benchmarks for optical fiber-based temperature sensing and offer promising directions for industrial monitoring and structural health assessment applications.


DMVFC: Deep Learning Based Functionally Consistent Tractography Fiber Clustering Using Multimodal Diffusion MRI and Functional MRI

Guo, Bocheng, Wang, Jin, Li, Yijie, Wang, Junyi, Gao, Mingyu, Feng, Puming, Chen, Yuqian, Rushmore, Jarrett, Makris, Nikos, Rathi, Yogesh, O'Donnell, Lauren J, Zhang, Fan

arXiv.org Artificial Intelligence

Tractography fiber clustering using diffusion MRI (dMRI) is a crucial method for white matter (WM) parcellation to enable analysis of brains structural connectivity in health and disease. Current fiber clustering strategies primarily use the fiber geometric characteristics (i.e., the spatial trajectories) to group similar fibers into clusters, while neglecting the functional and microstructural information of the fiber tracts. There is increasing evidence that neural activity in the WM can be measured using functional MRI (fMRI), providing potentially valuable multimodal information for fiber clustering to enhance its functional coherence. Furthermore, microstructural features such as fractional anisotropy (FA) can be computed from dMRI as additional information to ensure the anatomical coherence of the clusters. In this paper, we develop a novel deep learning fiber clustering framework, namely Deep Multi-view Fiber Clustering (DMVFC), which uses joint multi-modal dMRI and fMRI data to enable functionally consistent WM parcellation. DMVFC can effectively integrate the geometric and microstructural characteristics of the WM fibers with the fMRI BOLD signals along the fiber tracts. DMVFC includes two major components: (1) a multi-view pretraining module to compute embedding features from each source of information separately, including fiber geometry, microstructure measures, and functional signals, and (2) a collaborative fine-tuning module to simultaneously refine the differences of embeddings. In the experiments, we compare DMVFC with two state-of-the-art fiber clustering methods and demonstrate superior performance in achieving functionally meaningful and consistent WM parcellation results.


Accelerating Frontier MoE Training with 3D Integrated Optics

Bernadskiy, Mikhail, Carson, Peter, Graham, Thomas, Groves, Taylor, Lee, Ho John, Yeh, Eric

arXiv.org Artificial Intelligence

--The unabated growth in AI workload demands is driving the need for concerted advances in compute, memory, and interconnect performance. As traditional semiconductor scaling slows, high-speed interconnects have emerged as the new scaling engine, enabling the creation of larger logical GPUs by linking many GPUs into a single, low-latency, high-bandwidth compute domain. While initial scale-up fabrics leveraged copper interconnects for their power and cost advantages, the maximum reach of passive electrical interconnects (approximately 1 meter) effectively limits the scale-up domain to within a single rack. The advent of 3D-stacked optics and logic offers a transformative, power-efficient scale-up solution for connecting hundreds of GPU packages (thousands of GPUs) across multiple data center racks. This work explores the design tradeoffs of scale-up technologies and demonstrates how frontier LLMs necessitate novel photonic solutions to achieve aggressive power and performance targets. We model the benefits of 3D CPO (Passage) enabled GPUs and switches within the scale-up domain when training Frontier Mixture of Experts (MoE) models exceeding one trillion parameters. Our results show that the substantial increases in bandwidth and radix enabled by 3D CPO allow for an 8X increase in scale-up capability. The race to build larger, more sophisticated AI models is pushing the limits of existing infrastructure. At the chip and package level, GPUs are constrained by shoreline, yields and power. These challenges have led to the development of large high-bandwidth, low-latency scale-up pods. These pods effectively combine hundreds of GPUs into a single logical GPU to facilitate a variety of parallelism strategies (e.g. Approaches like Mixture of Experts (MoE) [1] have pushed scale-up networks to their limits due to copper reach (1 meter), which constrains the number of GPUs that can be connected within a single network hop. With MoEs, an ensemble of specialized sub-networks work together through sparse activations to increase model capacity without significantly increasing computational requirements. The output of the selected experts are combined to create the final result.


From Minimal Existence to Human Definition: The CES-IMU-HSG Theoretical Framework

Itoh, Kei

arXiv.org Artificial Intelligence

This study presents an inter-universal mathematical-logical framework constructed upon the minimal axiom Cogito, ergo sum (CES), integrating the Intermediate Meta-Universe (IMU) and the Hierarchical State Grid (HSG). The CES defines existence as a reflexive correspondence --'to be' and 'to be sayable'--and positions any formal system, including ZFC or HoTT, as an attachable extension atop this minimal structure. The IMU functions as a registry of axiomatic dependencies that connect heterogeneous theories, employing the Institution-theoretic framework to ensure coherent inter-theoretical linkages. The HSG concretizes these ideas through categorical construction, defined by three orthogonal axes: the state-depth axis, the mapping-hierarchy axis, and the temporal axis incorporating the principle of 'no future reference.' Through these, the identity of 'definition = state' is formally established as a categorical property. Extending this structure to biological systems, the neural system is implemented as a 0-3D complex of neuron-function fields on the HSG, while its categorical extensions via fiberization over the material base enable the parallel integration of multiple physiological universes-neural, endocrine, learning, genetic, and input/output systems-into a coherent adjoint ensemble. Within this framework, human behavior and cognition emerge as temporal compositions of inter-universal algorithms constrained by the material base. Finally, by contrasting human cognition, which relies on external CES, with machine existence, this study introduces the concept of internal CES, wherein a machine grounds its own logic upon the factuality of its operation. This internal self-axiomatization establishes a continuous bridge between philosophical ontology and engineering implementation, providing a new foundation for the autonomous and self-defining existence of artificial intelligence.