data engineering
NAEx: A Plug-and-Play Framework for Explaining Network Alignment
Saxena, Shruti, Khan, Arijit, Chandra, Joydeep
Network alignment (NA) identifies corresponding nodes across multiple networks, with applications in domains like social networks, co-authorship, and biology. Despite advances in alignment models, their interpretability remains limited, making it difficult to understand alignment decisions and posing challenges in building trust, particularly in high-stakes domains. To address this, we introduce NAEx, a plug-and-play, model-agnostic framework that explains alignment models by identifying key subgraphs and features influencing predictions. NAEx addresses the key challenge of preserving the joint cross-network dependencies on alignment decisions by: (1) jointly parameterizing graph structures and feature spaces through learnable edge and feature masks, and (2) introducing an optimization objective that ensures explanations are both faithful to the original predictions and enable meaningful comparisons of structural and feature-based similarities between networks. NAEx is an inductive framework that efficiently generates NA explanations for previously unseen data. We introduce evaluation metrics tailored to alignment explainability and demonstrate NAEx's effectiveness and efficiency on benchmark datasets by integrating it with four representative NA models.
- North America > United States (0.04)
- Europe > Denmark > North Jutland > Aalborg (0.04)
- Asia > India > Bihar > Patna (0.04)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)
- Information Technology > Artificial Intelligence > Natural Language > Explanation & Argumentation (0.46)
UniDyG: A Unified and Effective Representation Learning Approach for Large Dynamic Graphs
Xu, Yuanyuan, Zhang, Wenjie, Lin, Xuemin, Zhang, Ying
--Dynamic graphs, which capture time-evolving edges between nodes, are formulated in continuous-time or discrete-time dynamic graphs. They differ in temporal granularity: Continuous-Time Dynamic Graphs (CTDGs) exhibit rapid, localized changes, while Discrete-Time Dynamic Graphs (DTDGs) show gradual, global updates. This difference leads to isolated developments in representation learning for each type. T o advance dynamic graph representation learning, recent research attempts to design a unified model capable of handling both CTDGs and DTDGs, achieving promising results. However, it typically focuses on local dynamic propagation for temporal structure learning in the time domain, failing to accurately capture the underlying structural evolution associated with each temporal granularity and thus compromising model effectiveness. In addition, existing works-whether specific or unified-often overlook the issue of temporal noise, compromising the model's robustness. T o better model both types of dynamic graphs, we propose UniDyG, a unified and effective representation learning approach, which can scale to large dynamic graphs. Specifically, we first propose a novel Fourier Graph Attention (FGA T) mechanism that can model local and global structural correlations based on recent neighbors and complex-number selective aggregation, while theoretically ensuring consistent representations of dynamic graphs over time. Based on approximation theory, we demonstrate that FGA T is well-suited to capture the underlying structures in both CTDGs and DTDGs. We further enhance FGA T to resist temporal noise by designing an energy-gated unit, which adaptively filters out high-frequency noise according to the energy. Last, we leverage our proposed FGA T mechanisms for temporal structure learning and employ the frequency-enhanced linear function for node-level dynamic updates, facilitating the generation of high-quality temporal embeddings. Extensive experiments show that our UniDyG achieves an average improvement of 14. 4% over sixteen baselines across nine dynamic graphs while exhibiting superior robustness in noisy scenarios. YNAMIC graphs serve as a crucial data modality for representing time-evolving relationships (edges) between entities (nodes). Y uanyuan Xu and Wenjie Zhang are with the School of Computer Science and Engineering, The University of New South Wales, Sydney, NSW 2052, Australia (e-mail: yuanyuan.xu@unsw.edu.au; Xuemin Lin is with Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai 200052, china (e-mail: xuemin.lin@gmail.com). Ying Zhang is with the School of Statistics and Mathematics, School of Computer Science, Zhejiang Gongshang University, Hangzhou, Zhejiang 310018, China (e-mail: ying.zhang@zjgsu.edu.cn).
- Information Technology (0.68)
- Education (0.54)
A Comprehensive Survey on Spectral Clustering with Graph Structure Learning
Berahmand, Kamal, Saberi-Movahed, Farid, Sheikhpour, Razieh, Li, Yuefeng, Jalili, Mahdi
Spectral clustering is a powerful technique for clustering high-dimensional data, utilizing graph-based representations to detect complex, non-linear structures and non-convex clusters. The construction of a similarity graph is essential for ensuring accurate and effective clustering, making graph structure learning (GSL) central for enhancing spectral clustering performance in response to the growing demand for scalable solutions. Despite advancements in GSL, there is a lack of comprehensive surveys specifically addressing its role within spectral clustering. To bridge this gap, this survey presents a comprehensive review of spectral clustering methods, emphasizing on the critical role of GSL. We explore various graph construction techniques, including pairwise, anchor, and hypergraph-based methods, in both fixed and adaptive settings. Additionally, we categorize spectral clustering approaches into single-view and multi-view frameworks, examining their applications within one-step and two-step clustering processes. We also discuss multi-view information fusion techniques and their impact on clustering data. By addressing current challenges and proposing future research directions, this survey provides valuable insights for advancing spectral clustering methodologies and highlights the pivotal role of GSL in tackling large-scale and high-dimensional data clustering tasks.
- Asia > Middle East > Jordan (0.14)
- Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
- Oceania > Australia > Queensland (0.04)
- (6 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Three Pillars Towards Next-Generation Routing System
Li, Lei, Zhang, Mengxuan, Xu, Zizhuo, Xu, Yehong, Zhou, XIaofang
The routing results are playing an increasingly important role in transportation efficiency, but they could generate traffic congestion unintentionally. This is because the traffic condition and routing system are disconnected components in the current routing paradigm. In this paper, we propose a next-generation routing paradigm that could reduce traffic congestion by considering the influence of the routing results in real-time. Specifically, we regard the routing results as the root cause of the future traffic flow, which at the same time is identified as the root cause of traffic conditions. To implement such a system, we identify three essential components: 1) the traffic condition simulation that establishes the relation between traffic flow and traffic condition with guaranteed accuracy; 2) the future route management that supports efficient simulation with dynamic route update; 3) the global routing optimization that improves the overall transportation system efficiency. Preliminary design and experimental results will be presented, and the corresponding challenges and research directions will also be discussed.
- Transportation > Infrastructure & Services (1.00)
- Transportation > Ground > Road (1.00)
- Consumer Products & Services > Travel (1.00)
A Unified Model for Spatio-Temporal Prediction Queries with Arbitrary Modifiable Areal Units
Chen, Liyue, Fang, Jiangyi, Liu, Tengfei, Cao, Shaosheng, Wang, Leye
Spatio-Temporal (ST) prediction is crucial for making informed decisions in urban location-based applications like ride-sharing. However, existing ST models often require region partition as a prerequisite, resulting in two main pitfalls. Firstly, location-based services necessitate ad-hoc regions for various purposes, requiring multiple ST models with varying scales and zones, which can be costly to support. Secondly, different ST models may produce conflicting outputs, resulting in confusing predictions. In this paper, we propose One4All-ST, a framework that can conduct ST prediction for arbitrary modifiable areal units using only one model. To reduce the cost of getting multi-scale predictions, we design an ST network with hierarchical spatial modeling and scale normalization modules to efficiently and equally learn multi-scale representations. To address prediction inconsistencies across scales, we propose a dynamic programming scheme to solve the formulated optimal combination problem, minimizing predicted error through theoretical analysis. Besides, we suggest using an extended quad-tree to index the optimal combinations for quick response to arbitrary modifiable areal units in practical online scenarios. Extensive experiments on two real-world datasets verify the efficiency and effectiveness of One4All-ST in ST prediction for arbitrary modifiable areal units. The source codes and data of this work are available at https://github.com/uctb/One4All-ST.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- North America > United States > Alaska > Anchorage Municipality > Anchorage (0.04)
- (5 more...)
- Transportation > Ground > Road (0.88)
- Transportation > Passenger (0.66)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Data Engineering for Scaling Language Models to 128K Context
Fu, Yao, Panda, Rameswar, Niu, Xinyao, Yue, Xiang, Hajishirzi, Hannaneh, Kim, Yoon, Peng, Hao
We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular \textit{the ability to utilize information at arbitrary input locations}, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than seen during training~(e.g., 4K to 128K) through lightweight continual pretraining on appropriate data mixture. We investigate the \textit{quantity} and \textit{quality} of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results equally emphasize \textit{domain balance} and \textit{length upsampling}. Concretely, we find that naively upsampling longer data on certain domains like books, a common practice of existing work, gives suboptimal performance, and that a balanced domain mixture is important. We demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.
- North America > United States > Ohio (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Illinois (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
What About the Data? A Mapping Study on Data Engineering for AI Systems
AI systems cannot exist without data. Now that AI models (data science and AI) have matured and are readily available to apply in practice, most organizations struggle with the data infrastructure to do so. There is a growing need for data engineers that know how to prepare data for AI systems or that can setup enterprise-wide data architectures for analytical projects. But until now, the data engineering part of AI engineering has not been getting much attention, in favor of discussing the modeling part. In this paper we aim to change this by perform a mapping study on data engineering for AI systems, i.e., AI data engineering. We found 25 relevant papers between January 2019 and June 2023, explaining AI data engineering activities. We identify which life cycle phases are covered, which technical solutions or architectures are proposed and which lessons learned are presented. We end by an overall discussion of the papers with implications for practitioners and researchers. This paper creates an overview of the body of knowledge on data engineering for AI. This overview is useful for practitioners to identify solutions and best practices as well as for researchers to identify gaps.
Natural Language Interfaces for Tabular Data Querying and Visualization: A Survey
Zhang, Weixu, Wang, Yifei, Song, Yuanfeng, Wei, Victor Junqiu, Tian, Yuxing, Qi, Yiyan, Chan, Jonathan H., Wong, Raymond Chi-Wing, Yang, Haiqin
The emergence of natural language processing has revolutionized the way users interact with tabular data, enabling a shift from traditional query languages and manual plotting to more intuitive, language-based interfaces. The rise of large language models (LLMs) such as ChatGPT and its successors has further advanced this field, opening new avenues for natural language processing techniques. This survey presents a comprehensive overview of natural language interfaces for tabular data querying and visualization, which allow users to interact with data using natural language queries. We introduce the fundamental concepts and techniques underlying these interfaces with a particular emphasis on semantic parsing, the key technology facilitating the translation from natural language to SQL queries or data visualization commands. We then delve into the recent advancements in Text-to-SQL and Text-to-Vis problems from the perspectives of datasets, methodologies, metrics, and system designs. This includes a deep dive into the influence of LLMs, highlighting their strengths, limitations, and potential for future improvements. Through this survey, we aim to provide a roadmap for researchers and practitioners interested in developing and applying natural language interfaces for data interaction in the era of large language models.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- (38 more...)
A Survey on Service Route and Time Prediction in Instant Delivery: Taxonomy, Progress, and Prospects
Wen, Haomin, Lin, Youfang, Wu, Lixia, Mao, Xiaowei, Cai, Tianyue, Hou, Yunfeng, Guo, Shengnan, Liang, Yuxuan, Jin, Guangyin, Zhao, Yiji, Zimmermann, Roger, Ye, Jieping, Wan, Huaiyu
Instant delivery services, such as food delivery and package delivery, have achieved explosive growth in recent years by providing customers with daily-life convenience. An emerging research area within these services is service Route\&Time Prediction (RTP), which aims to estimate the future service route as well as the arrival time of a given worker. As one of the most crucial tasks in those service platforms, RTP stands central to enhancing user satisfaction and trimming operational expenditures on these platforms. Despite a plethora of algorithms developed to date, there is no systematic, comprehensive survey to guide researchers in this domain. To fill this gap, our work presents the first comprehensive survey that methodically categorizes recent advances in service route and time prediction. We start by defining the RTP challenge and then delve into the metrics that are often employed. Following that, we scrutinize the existing RTP methodologies, presenting a novel taxonomy of them. We categorize these methods based on three criteria: (i) type of task, subdivided into only-route prediction, only-time prediction, and joint route\&time prediction; (ii) model architecture, which encompasses sequence-based and graph-based models; and (iii) learning paradigm, including Supervised Learning (SL) and Deep Reinforcement Learning (DRL). Conclusively, we highlight the limitations of current research and suggest prospective avenues. We believe that the taxonomy, progress, and prospects introduced in this paper can significantly promote the development of this field.
- Asia > China > Beijing > Beijing (0.04)
- Asia > China > Zhejiang Province > Hangzhou (0.04)
- South America > Peru > Madre de Dios Department (0.04)
- (7 more...)
- Research Report (1.00)
- Overview (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)
Towards Generalising Neural Topical Representations
Yang, Xiaohao, Zhao, He, Phung, Dinh, Du, Lan
Topic models have evolved from conventional Bayesian probabilistic models to Neural Topic Models (NTMs) over the last two decays. Although NTMs have achieved promising performance when trained and tested on a specific corpus, their generalisation ability across corpora is rarely studied. In practice, we often expect that an NTM trained on a source corpus can still produce quality topical representation for documents in a different target corpus without retraining. In this work, we aim to improve NTMs further so that their benefits generalise reliably across corpora and tasks. To do so, we propose to model similar documents by minimising their semantical distance when training NTMs. Specifically, similar documents are created by data augmentation during training; The semantical distance between documents is measured by the Hierarchical Topic Transport Distance (HOTT), which computes the Optimal Transport (OT) distance between the topical representations. Our framework can be readily applied to most NTMs as a plug-and-play module. Extensive experiments show that our framework significantly improves the generalisation ability regarding neural topical representation across corpora.
- Oceania > Australia (0.04)
- South America > Ecuador (0.04)
- Europe > Poland (0.04)
- (4 more...)