Goto

Collaborating Authors

 Oceania


A novel approach to the relationships between data features -- based on comprehensive examination of mathematical, technological, and causal methodology

arXiv.org Artificial Intelligence

The expansion of artificial intelligence (AI) has raised concerns about transparency, accountability, and interpretability, with counterfactual reasoning emerging as a key approach to addressing these issues. However, current mathematical, technological, and causal methodologies rely on externalization techniques that normalize feature relationships within a single coordinate space, often distorting intrinsic interactions. This study proposes the Convergent Fusion Paradigm (CFP) theory, a framework integrating mathematical, technological, and causal perspectives to provide a more precise and comprehensive analysis of feature relationships. CFP theory introduces Hilbert space and backward causation to reinterpret the feature relationships as emergent structures, offering a potential solution to the common cause problem -- a fundamental challenge in causal modeling. From a mathematical -- technical perspective, it utilizes a Riemannian manifold-based framework, thereby improving the structural representation of high- and low-dimensional data interactions. From a causal inference perspective, CFP theory adopts abduction as a methodological foundation, employing Hilbert space for a dynamic causal reasoning approach, where causal relationships are inferred abductively, and feature relationships evolve as emergent properties. Ultimately, CFP theory introduces a novel AI modeling methodology that integrates Hilbert space, backward causation, and Riemannian geometry, strengthening AI governance and transparency in counterfactual reasoning.


Show Me Your Code! Kill Code Poisoning: A Lightweight Method Based on Code Naturalness

arXiv.org Artificial Intelligence

Neural code models (NCMs) have demonstrated extraordinary capabilities in code intelligence tasks. Meanwhile, the security of NCMs and NCMs-based systems has garnered increasing attention. In particular, NCMs are often trained on large-scale data from potentially untrustworthy sources, providing attackers with the opportunity to manipulate them by inserting crafted samples into the data. This type of attack is called a code poisoning attack (also known as a backdoor attack). It allows attackers to implant backdoors in NCMs and thus control model behavior, which poses a significant security threat. However, there is still a lack of effective techniques for detecting various complex code poisoning attacks. In this paper, we propose an innovative and lightweight technique for code poisoning detection named KillBadCode. KillBadCode is designed based on our insight that code poisoning disrupts the naturalness of code. Specifically, KillBadCode first builds a code language model (CodeLM) on a lightweight $n$-gram language model. Then, given poisoned data, KillBadCode utilizes CodeLM to identify those tokens in (poisoned) code snippets that will make the code snippets more natural after being deleted as trigger tokens. Considering that the removal of some normal tokens in a single sample might also enhance code naturalness, leading to a high false positive rate (FPR), we aggregate the cumulative improvement of each token across all samples. Finally, KillBadCode purifies the poisoned data by removing all poisoned samples containing the identified trigger tokens. The experimental results on two code poisoning attacks and four code intelligence tasks demonstrate that KillBadCode significantly outperforms four baselines. More importantly, KillBadCode is very efficient, with a minimum time consumption of only 5 minutes, and is 25 times faster than the best baseline on average.


Unveiling Attractor Cycles in Large Language Models: A Dynamical Systems View of Successive Paraphrasing

arXiv.org Artificial Intelligence

Dynamical systems theory provides a framework for analyzing iterative processes and evolution over time. Within such systems, repetitive transformations can lead to stable configurations, known as attractors, including fixed points and limit cycles. Applying this perspective to large language models (LLMs), which iteratively map input text to output text, provides a principled approach to characterizing long-term behaviors. Successive paraphrasing serves as a compelling testbed for exploring such dynamics, as paraphrases re-express the same underlying meaning with linguistic variation. Although LLMs are expected to explore a diverse set of paraphrases in the text space, our study reveals that successive paraphrasing converges to stable periodic states, such as 2-period attractor cycles, limiting linguistic diversity. This phenomenon is attributed to the self-reinforcing nature of LLMs, as they iteratively favour and amplify certain textual forms over others. This pattern persists with increasing generation randomness or alternating prompts and LLMs. These findings underscore inherent constraints in LLM generative capability, while offering a novel dynamical systems perspective for studying their expressive potential.


MONSTER: Monash Scalable Time Series Evaluation Repository

arXiv.org Artificial Intelligence

We introduce Monster--the MONash Scalable Time Series E valuation R epository--a collection of large datasets for time series classification. The field of time series classification has benefitted from common benchmarks set by the UCR and UEA time series classification repositories. However, the datasets in these benchmarks are small, with median sizes of 217 and 255 examples, respectively. In consequence they favour a narrow subspace of models that are optimised to achieve low classification error on a wide variety of smaller datasets, that is, models that minimise variance, and give little weight to computational issues such as scalability. Our hope is to diversify the field by introducing benchmarks using larger datasets. We believe that there is enormous potential for new progress in the field by engaging with the theoretical and practical challenges of learning effectively from larger quantities of data.


Leveraging ChatGPT for Sponsored Ad Detection and Keyword Extraction in YouTube Videos

arXiv.org Artificial Intelligence

Brice Valentin Kok - Shun Department of Information Systems and Operations Management University of Auckland Auckland, New Zealand 0000 - 0001 - 9923 - 5042 Johnny Chan Department of Information Systems and Operations Management University of Auckland Auckland, New Zealand 0000 - 0002 - 3535 - 4533 Abstract -- This work - in - progress paper presents a novel approach to detecting sponsored advertisement segments in YouTube videos and comparing the advertisement with the main content. Our methodology involves the collect ion of 421 auto - generated and manual transcripts which are then fed into a prompt - engineered GPT - 4o for ad detection, a KeyBERT for keyword extraction, and another iteration of ChatGPT for ca tegory identification . The results revealed a significant prevalence of product - related ads across vari ous educational topics, with ad categories refined using GPT - 4 o into succinct 9 content and 4 advertisement categories . This approach provides a scalable and efficient alternative to traditional ad detection methods while offering new insights into the types and relevance of ads embedded within educational content. T his study highlights the potential of LLMs in transforming ad detection processes and improving our understanding of ad vertisement strategies in digital media. In recent years, video - sharing platforms like YouTube have become dominant sources of entertainment, education, and information [1] . YouTube is invaluable for content creators, marketers, and advertisers. One of the key features of YouTube's revenue model is the integration of sponsored advertisement (ad) segments, which allows content creators to monetize their videos while providing advertisers a direct route to target specific audiences [2] .


Voter Model Meets Rumour Spreading: A Study of Consensus Protocols on Graphs with Agnostic Nodes [Extended Version]

arXiv.org Artificial Intelligence

Problems of consensus in multi-agent systems are often viewed as a series of independent, simultaneous local decisions made between a limited set of options, all aimed at reaching a global agreement. Key challenges in these protocols include estimating the likelihood of various outcomes and finding bounds for how long it may take to achieve consensus, if it occurs at all. To date, little attention has been given to the case where some agents have no initial opinion. In this paper, we introduce a variant of the consensus problem which includes what we call `agnostic' nodes and frame it as a combination of two known and well-studied processes: voter model and rumour spreading. We show (1) a martingale that describes the probability of consensus for a given colour, (2) bounds on the number of steps for the process to end using results from rumour spreading and voter models, (3) closed formulas for the probability of consensus in a few special cases, and (4) that the computational complexity of estimating the probability with a Markov chain Monte Carlo process is $O(n^2 \log n)$ for general graphs and $O(n\log n)$ for Erd\H{o}s-R\'enyi graphs, which makes it an efficient method for estimating probabilities of consensus. Furthermore, we present experimental results suggesting that the number of runs needed for a given standard error decreases when the number of nodes increases.


TimeDistill: Efficient Long-Term Time Series Forecasting with MLP via Cross-Architecture Distillation

arXiv.org Artificial Intelligence

Transformer-based and CNN-based methods demonstrate strong performance in long-term time series forecasting. However, their high computational and storage requirements can hinder large-scale deployment. To address this limitation, we propose integrating lightweight MLP with advanced architectures using knowledge distillation (KD). Our preliminary study reveals different models can capture complementary patterns, particularly multi-scale and multi-period patterns in the temporal and frequency domains. Based on this observation, we introduce TimeDistill, a cross-architecture KD framework that transfers these patterns from teacher models (e.g., Transformers, CNNs) to MLP. Additionally, we provide a theoretical analysis, demonstrating that our KD approach can be interpreted as a specialized form of mixup data augmentation. TimeDistill improves MLP performance by up to 18.6%, surpassing teacher models on eight datasets. It also achieves up to 7X faster inference and requires 130X fewer parameters. Furthermore, we conduct extensive evaluations to highlight the versatility and effectiveness of TimeDistill.


Graph in the Vault: Protecting Edge GNN Inference with Trusted Execution Environment

arXiv.org Artificial Intelligence

--Wide deployment of machine learning models on edge devices has rendered the model intellectual property (IP) and data privacy vulnerable. We propose GNNV ault, the first secure Graph Neural Network (GNN) deployment strategy based on Trusted Execution Environment (TEE). GNNV ault follows the design of "partition-before-training" and includes a private GNN rectifier to complement with a public backbone model. This way, both critical GNN model parameters and the private graph used during inference are protected within secure TEE compartments. Real-world implementations with Intel SGX demonstrate that GNNV ault safeguards GNN inference against state-of-the-art link stealing attacks with a negligible accuracy degradation ( < 2 %). On-device machine learning has emerged as an important paradigm for tasks requiring low latency and high privacy [1]. This trend has also extended to Graph Neural Networks (GNNs) [4], [5], ensuring the privacy of user data during inference for tasks such as community detection [6], e-commerce personaliza-tion [7], and recommender systems [8]. However, local GNN inference grants users significant privileges to local models and data, introducing additional security vulnerabilities [9].


Understanding the Design Principles of Link Prediction in Directed Settings

arXiv.org Artificial Intelligence

Link prediction is a widely studied task in Graph Representation Learning (GRL) for modeling relational data. The early theories in GRL were based on the assumption of a symmetric adjacency matrix, reflecting an undirected setting. As a result, much of the following state-of-the-art research has continued to operate under this symmetry assumption, even though real-world data often involve crucial information conveyed through the direction of relationships. This oversight limits the ability of these models to fully capture the complexity of directed interactions. In this paper, we focus on the challenge of directed link prediction by evaluating key heuristics that have been successful in undirected settings. We propose simple but effective adaptations of these heuristics to the directed link prediction task and demonstrate that these modifications produce competitive performance compared to the leading Graph Neural Networks (GNNs) originally designed for undirected graphs. Through an extensive set of experiments, we derive insights that inform the development of a novel framework for directed link prediction, which not only surpasses baseline methods but also outperforms state-of-the-art GNNs on multiple benchmarks.


LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers

arXiv.org Artificial Intelligence

We introduce methods to quantify how Large Language Models (LLMs) encode and store contextual information, revealing that tokens often seen as minor (e.g., determiners, punctuation) carry surprisingly high context. Notably, removing these tokens -- especially stopwords, articles, and commas -- consistently degrades performance on MMLU and BABILong-4k, even if removing only irrelevant tokens. Our analysis also shows a strong correlation between contextualization and linearity, where linearity measures how closely the transformation from one layer's embeddings to the next can be approximated by a single linear mapping. These findings underscore the hidden importance of filler tokens in maintaining context. For further exploration, we present LLM-Microscope, an open-source toolkit that assesses token-level nonlinearity, evaluates contextual memory, visualizes intermediate layer contributions (via an adapted Logit Lens), and measures the intrinsic dimensionality of representations. This toolkit illuminates how seemingly trivial tokens can be critical for long-range understanding.