 foundational model


Sensing and Understanding the World over Air: A Large Multimodal Model for Mobile Networks

Duan, Zhuoran, Wei, Yuhao, Nan, Guoshun, Wang, Zijun, Yan, Yan, Xiong, Lihua, Ran, Yuhan, Zhang, Ji, Li, Jian, Cui, Qimei, Tao, Xiaofeng, Quek, Tony Q. S.

arXiv.org Artificial Intelligence

Large models (LMs), such as ChatGPT, have made a significant impact across diverse domains and hold great potential to facilitate the evolution of network intelligence. Wireless-native multi-modal large models (WMLMs) can sense and understand the physical world through multi-modal data, serving as a key enabler that integrates communication, sensing, and intelligence, and thus they can boost various smart services for billions of users. However, research on WMLMs remains in its infancy, and the construction of domain-specific multi-modal large models for wireless networks is still underexplored. In this paper, we outline the key characteristics of WMLMs and summarize existing methods, on the basis of which a wireless-native multimodal training paradigm is proposed. Specifically, we constructed a GPT-style WMLM and trained it on a real-world large-scale dataset, leveraging wireless signals as an anchor modality for contrastive learning. Our approach demonstrates outstanding performance compared with existing small-scale models and large multi-modal models, validating the feasibility of using wireless signals as a universal modality and highlighting WMLM's potential to emerge as a new paradigm for future wireless networks. The advent of large AI models (LMs) such as ChatGPT has propelled network intelligence into a new evolutionary phase. These remarkable enablers are poised to revolutionize future wireless networks through their advanced performance and generalization capability.
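The idea of using one modality as an anchor for contrastive learning can be illustrated with a generic InfoNCE-style loss. The sketch below is not the authors' training code: the function names, the temperature value, and the toy embeddings are assumptions chosen for illustration. It treats the i-th wireless-signal embedding as the anchor and the i-th embedding of another modality as its positive, with the remaining items in the batch acting as negatives.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchor_embs, other_embs, temperature=0.07):
    """Mean InfoNCE loss with anchor_embs[i] paired to other_embs[i].

    anchor_embs: embeddings of the anchor modality (here: wireless signals).
    other_embs:  embeddings of a second modality, aligned by index.
    temperature: softmax temperature (0.07 is an assumed, commonly used value).
    """
    anchors = [l2_normalize(a) for a in anchor_embs]
    others = [l2_normalize(o) for o in other_embs]
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Cosine similarity of this anchor to every candidate in the batch.
        logits = [
            sum(x * y for x, y in zip(anchor, other)) / temperature
            for other in others
        ]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching index as the correct class.
        total += log_denom - logits[i]
    return total / len(anchors)


# Toy batch: two hypothetical signal embeddings and two paired embeddings.
signal = [[1.0, 0.0], [0.0, 1.0]]
paired = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = info_nce(signal, paired)
misaligned_loss = info_nce(signal, swapped)
```

With correctly paired embeddings the loss is close to zero, while swapping the pairs makes it large, which is the signal a contrastive objective uses to pull matching modalities together.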


On Verifiable Legal Reasoning: A Multi-Agent Framework with Formalized Knowledge Representations

Sadowski, Albert, Chudziak, Jarosław A.

arXiv.org Artificial Intelligence

Legal reasoning requires both precise interpretation of statutory language and consistent application of complex rules, presenting significant challenges for AI systems. This paper introduces a modular multi-agent framework that decomposes legal reasoning into distinct knowledge acquisition and application stages. In the first stage, specialized agents extract legal concepts and formalize rules to create verifiable intermediate representations of statutes. The second stage applies this knowledge to specific cases through three steps: analyzing queries to map case facts onto the ontology schema, performing symbolic inference to derive logically entailed conclusions, and generating final answers using a programmatic implementation that operationalizes the ontological knowledge. This bridging of natural language understanding with symbolic reasoning provides explicit and verifiable inspection points, significantly enhancing transparency compared to end-to-end approaches. Evaluation on statutory tax calculation tasks demonstrates substantial improvements, with foundational models achieving 76.4% accuracy compared to 18.8% baseline performance, effectively narrowing the performance gap between reasoning and foundational models. These findings suggest that modular architectures with formalized knowledge representations can make sophisticated legal reasoning more accessible through computationally efficient models while enhancing consistency and explainability in AI legal reasoning, establishing a foundation for future research into more transparent, trustworthy, and effective AI systems for the legal domain.


Contextual Augmentation for Entity Linking using Large Language Models

Vollmers, Daniel, Zahera, Hamada M., Moussallem, Diego, Ngomo, Axel-Cyrille Ngonga

arXiv.org Artificial Intelligence

Entity Linking involves detecting and linking entity mentions in natural language texts to a knowledge graph. Traditional methods use a two-step process with separate models for entity recognition and disambiguation, which can be computationally intensive and less effective. We propose a fine-tuned model that jointly integrates entity recognition and disambiguation in a unified framework. Furthermore, our approach leverages large language models to enrich the context of entity mentions, yielding better performance in entity disambiguation. We evaluated our approach on benchmark datasets and compared it with several baselines. The evaluation results show that our approach achieves state-of-the-art performance on out-of-domain datasets.


Trust in foundation models and GenAI: A geographic perspective

McKenzie, Grant, Janowicz, Krzysztof, Kessler, Carsten

arXiv.org Artificial Intelligence

Large-scale pre-trained machine learning models have reshaped our understanding of artificial intelligence across numerous domains, including our own field of geography. As with any new technology, trust has taken on an important role in this discussion. In this chapter, we examine the multifaceted concept of trust in foundation models, particularly within a geographic context. As these models are increasingly relied upon for critical decision-making, trust, while essential, has become a fractured concept. Here we categorize trust into three types: epistemic trust in the training data, operational trust in the model's functionality, and interpersonal trust in the model developers. Each type of trust brings with it unique implications for geographic applications. Topics such as cultural context, data heterogeneity, and spatial relationships are fundamental to the spatial sciences and play an important role in developing trust. The chapter continues with a discussion of the challenges posed by different forms of biases, the importance of transparency and explainability, and ethical responsibilities in model development. Finally, the novel perspective of geographic information scientists is emphasized with a call for further transparency, bias mitigation, and regionally-informed policies. Simply put, this chapter aims to provide a conceptual starting point for researchers, practitioners, and policy-makers to better understand trust in (generative) GeoAI.


SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling

Hacioglu, Kadri, E, Manjunath K, Stolcke, Andreas

arXiv.org Artificial Intelligence

Slot filling is a crucial subtask in spoken language understanding (SLU), traditionally implemented as a cascade of speech recognition followed by one or more natural language understanding (NLU) components. The recent advent of speech-based large language models (speechLLMs), which integrate speech and textual foundation models, has opened new avenues for achieving speech understanding tasks in a more unified, generative, and instruction-following manner while promising data and compute efficiency with zero-shot abilities, generalizing to unseen slot labels. We address the slot-filling task by creating an empirical upper bound for the task, identifying performance, robustness, and generalization gaps, and proposing improvements to the training data, architecture, and training strategies to narrow the gap with the upper bound result. We show that each of these measures improves performance substantially, while highlighting practical challenges and providing empirical guidance and insights for harnessing these emerging models.


Scattering Transformer: A Training-Free Transformer Architecture for Heart Murmur Detection

Zewail, Rami

arXiv.org Artificial Intelligence

In an attempt to address the need for skilled clinicians in heart sound interpretation, recent research efforts on automating cardiac auscultation have explored deep learning approaches. The majority of these approaches have been based on supervised learning, which is challenged whenever training data is limited. More recently, there has been growing interest in the potential of pre-trained self-supervised audio foundation models for biomedical end tasks. Despite exhibiting promising results, these foundational models are typically computationally intensive. Within the context of automatic cardiac auscultation, this study explores a lightweight alternative to these general-purpose audio foundation models by introducing the Scattering Transformer, a novel, training-free transformer architecture for heart murmur detection. The proposed method leverages standard wavelet scattering networks by introducing contextual dependencies in a transformer-like architecture without any backpropagation. We evaluate our approach on the public CirCor DigiScope dataset, directly comparing it against leading general-purpose foundational models. The Scattering Transformer achieves a Weighted Accuracy (WAR) of 0.786 and an Unweighted Average Recall (UAR) of 0.697, demonstrating performance highly competitive with contemporary state-of-the-art methods. This study establishes the Scattering Transformer as a viable and promising alternative in resource-constrained setups.
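For readers unfamiliar with the UAR figure quoted above, the sketch below shows the standard definition of unweighted average recall: the per-class recalls averaged with equal weight, regardless of class size. The labels and predictions are made-up toy data, not results from the paper, and the exact Weighted Accuracy formula used in the study is not given in the abstract, so it is not reproduced here.

```python
def unweighted_average_recall(y_true, y_pred):
    """Average of per-class recalls, each class weighted equally.

    Classes are taken from y_true, so every class has at least one example.
    """
    classes = sorted(set(y_true))
    recalls = []
    for cls in classes:
        # Number of examples that truly belong to this class.
        support = sum(1 for t in y_true if t == cls)
        # Of those, how many were predicted as this class.
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)


# Toy, imbalanced example (hypothetical labels): 2 "present", 8 "absent".
y_true = ["present", "present"] + ["absent"] * 8
y_pred = ["present", "absent"] + ["absent"] * 8

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
```

In this toy case plain accuracy is 0.9 while UAR is 0.75, because the missed minority-class example counts for half of one class's recall. This is why UAR is often reported alongside accuracy-style metrics on imbalanced clinical data.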


MobiGPT: A Foundation Model for Mobile Wireless Networks

Qi, Xiaoqian, Chai, Haoye, Li, Yong

arXiv.org Artificial Intelligence

With the rapid development of mobile communication technologies, future mobile networks will offer vast services and resources for commuting, production, daily life, and entertainment. Accurate and efficient forecasting of mobile data (e.g., cell traffic, user behavior, channel quality) helps operators monitor network state changes, orchestrate wireless resources, and schedule infrastructure and users, thereby improving supply efficiency and service quality. However, current forecasting paradigms rely on customized designs with tailored models for exclusive data types. Such approaches increase complexity and deployment costs under large-scale, heterogeneous networks involving base stations, users, and channels. In this paper, we design a foundation model for mobile data forecasting, MobiGPT, with a unified structure capable of forecasting three data types: base station traffic, user app usage, and channel quality. We propose a soft-prompt learning method to help the model understand features of different data types, and introduce a temporal masking mechanism to guide the model through three forecasting tasks: short-term prediction, long-term prediction, and distribution generation, supporting diverse optimization scenarios. Evaluations on real-world datasets with over 100,000 samples show that MobiGPT achieves accurate multi-type forecasting. Compared to existing models, it improves forecasting accuracy by 27.37%, 20.08%, and 7.27%, reflecting strong generalization. Moreover, MobiGPT exhibits superior zero/few-shot performance in unseen scenarios, with over 21.51% improvement, validating its strong transferability as a foundation model.


UniSino: Physics-Driven Foundational Model for Universal CT Sinogram Standardization

Ai, Xingyu, Wang, Shaoyu, Jia, Zhiyuan, Xu, Ao, Shan, Hongming, Ma, Jianhua, Liu, Qiegen

arXiv.org Artificial Intelligence

During raw-data acquisition in CT imaging, diverse factors can degrade the collected sinograms, with undersampling and noise leading to severe artifacts and noise in reconstructed images and compromising diagnostic accuracy. Conventional correction methods rely on manually designed algorithms or fixed empirical parameters, but these approaches often lack generalizability across heterogeneous artifact types. To address these limitations, we propose UniSino, a foundation model for universal CT sinogram standardization. Unlike existing foundational models that operate in the image domain, UniSino directly standardizes data in the projection domain, which enables stronger generalization across diverse undersampling scenarios. Its training framework incorporates the physical characteristics of sinograms, enhancing generalization and enabling robust performance across multiple subtasks spanning four benchmark datasets. The code is available at: https://github.com/yqx7150/UniSino. In CT imaging, the sinogram represents the raw projection data before image reconstruction. However, in practical clinical data acquisition, raw sinogram data are frequently degraded by a multitude of factors, including hardware limitations, environmental variability, and patient-induced factors, resulting in complex and heterogeneous data corruption [2]. Without effective preprocessing, such imperfections are readily amplified through the reconstruction process, which leads to severe image artifacts [3], including detector-induced ring patterns [4], beam hardening from metal implants [5], geometric distortions from miscalibration [6], and motion-induced inconsistencies [7, 8]. These artifacts not only degrade visual quality but critically compromise diagnostic reliability.


PinFM: Foundation Model for User Activity Sequences at a Billion-scale Visual Discovery Platform

Chen, Xiangyi, Rajesh, Kousik, Lawhon, Matthew, Wang, Zelun, Li, Hanyu, Li, Haomiao, Joshi, Saurabh Vishwas, Eksombatchai, Pong, Yang, Jaewon, Hsu, Yi-Ping, Xu, Jiajing, Rosenberg, Charles

arXiv.org Artificial Intelligence

User activity sequences have emerged as one of the most important signals in recommender systems. We present a foundational model, PinFM, for understanding user activity sequences across multiple applications at a billion-scale visual discovery platform. We pretrain a transformer model with 20B+ parameters using extensive user activity data, then fine-tune it for specific applications, efficiently coupling it with existing models. While this pretraining-and-fine-tuning approach has been popular in other domains, such as Vision and NLP, its application in industrial recommender systems presents numerous challenges. The foundational model must be scalable enough to score millions of items every second while meeting tight cost and latency constraints imposed by these systems. Additionally, it should capture the interactions between user activities and other features and handle new items that were not present during the pretraining stage. We developed innovative techniques to address these challenges. Our infrastructure and algorithmic optimizations, such as the Deduplicated Cross-Attention Transformer (DCAT), improved our throughput by 600% on Pinterest internal data. We demonstrate that PinFM can learn interactions between user sequences and candidate items by altering input sequences, leading to a 20% increase in engagement with new items. PinFM is now deployed to help improve the experience of more than half a billion users across various applications.


Evaluation of a Foundational Model and Stochastic Models for Forecasting Sporadic or Spiky Production Outages of High-Performance Machine Learning Services

Yim, Keun Soo

arXiv.org Artificial Intelligence

Time series forecasting models have diverse real-world applications (e.g., from electricity metrics to software workload). The latest foundational models trained for time series forecasting show strengths (e.g., for long sequences and in zero-shot settings). However, foundational models have not yet been used for forecasting rare, spiky events, a challenging target because such events are a corner case of extreme events. In this paper, we optimize a state-of-the-art foundational model to forecast sporadic or spiky production outages of high-performance machine learning services powering billions of client devices. We evaluate the forecasting errors of the foundational model compared with classical stochastic forecasting models (e.g., moving average and autoregressive). The analysis helps us understand how each of the evaluated models performs for the sporadic or spiky events. For example, it identifies the key patterns in the target data that are well tracked by the foundational model vs. each of the stochastic models. We use the models with optimal parameters to estimate year-long outage statistics for a particular root cause with less than 6% value errors.
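The two classical baselines named in this abstract, moving average and autoregressive models, can be sketched in a few lines. The code below is a generic illustration, not the paper's implementation: the function names, the window size, the choice of a first-order AR model fitted by least squares, and the toy series are all assumptions.

```python
def moving_average_forecast(series, window):
    """One-step forecast: mean of the last `window` observations."""
    recent = series[-window:]
    return sum(recent) / len(recent)


def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1].

    Assumes the lagged values are not all identical (non-zero variance).
    """
    lagged = series[:-1]
    current = series[1:]
    n = len(lagged)
    mean_lag = sum(lagged) / n
    mean_cur = sum(current) / n
    variance = sum((a - mean_lag) ** 2 for a in lagged)
    covariance = sum(
        (a - mean_lag) * (b - mean_cur) for a, b in zip(lagged, current)
    )
    phi = covariance / variance
    c = mean_cur - phi * mean_lag
    return c, phi


def ar1_forecast(series):
    """One-step forecast from the fitted AR(1) model."""
    c, phi = fit_ar1(series)
    return c + phi * series[-1]


# Sanity check on a series that doubles at each step (x[t] = 2 * x[t-1]).
doubling = [1.0, 2.0, 4.0, 8.0, 16.0]
ma_doubling = moving_average_forecast(doubling, window=2)
ar_doubling = ar1_forecast(doubling)

# Hypothetical spiky outage counts: mostly zero with isolated spikes.
spiky = [0.0, 0.0, 9.0, 0.0, 0.0, 0.0, 7.0, 0.0]
ma_spiky = moving_average_forecast(spiky, window=4)
```

On the spiky toy series the moving average spreads each spike across the window, giving a small positive forecast even when the latest value is zero. This smoothing behavior is one reason sporadic events are a hard target for such baselines.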