Koniusz, Piotr
Graph Self-Supervised Learning with Learnable Structural and Positional Encodings
Wijesinghe, Asiri, Zhu, Hao, Koniusz, Piotr
Traditional Graph Self-Supervised Learning (GSSL) struggles to capture complex structural properties well. This limitation stems from two main factors: (1) the inadequacy of conventional Graph Neural Networks (GNNs) in representing sophisticated topological features, and (2) the focus of self-supervised learning solely on final graph representations. To address these issues, we introduce GenHopNet, a GNN framework that integrates a k-hop message-passing scheme, enhancing its ability to capture local structural information without explicit substructure extraction. We theoretically demonstrate that GenHopNet surpasses the expressiveness of the classical Weisfeiler-Lehman (WL) test for graph isomorphism. Furthermore, we propose a structural- and positional-aware GSSL framework that incorporates topological information throughout the learning process. This approach enables the learning of representations that are both sensitive to graph topology and invariant to specific structural and feature augmentations. Comprehensive experiments on graph classification datasets, including those designed to test structural sensitivity, show that our method consistently outperforms existing approaches while maintaining computational efficiency. Our work significantly advances GSSL's capability to distinguish graphs with similar local structures but different global topologies. This paper is accepted at The World Wide Web Conference (WWW) 2025.
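The abstract does not spell out the GenHopNet architecture, so the following is only a minimal, generic sketch of k-hop message passing in PyTorch for intuition; the class name KHopAggregation, the per-hop linear maps, and the row-normalized propagation are assumptions, not the authors' design.

```python
import torch
import torch.nn as nn

class KHopAggregation(nn.Module):
    """Toy k-hop message-passing layer (illustrative only, not the paper's GenHopNet).

    Each hop distance 1..K gets its own linear transform; hop-wise neighbourhood
    aggregates are summed, so a node sees structure beyond its 1-hop neighbourhood
    without explicit substructure extraction.
    """
    def __init__(self, in_dim, out_dim, num_hops=3):
        super().__init__()
        self.num_hops = num_hops
        self.self_lin = nn.Linear(in_dim, out_dim)
        self.hop_lins = nn.ModuleList(nn.Linear(in_dim, out_dim) for _ in range(num_hops))

    def forward(self, x, adj):
        # Row-normalise the adjacency so repeated propagation stays well scaled.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        a_norm = adj / deg
        out = self.self_lin(x)
        h = x
        for k in range(self.num_hops):
            h = a_norm @ h   # propagate one more hop (mixes features over length-(k+1) walks)
            out = out + self.hop_lins[k](h)
        return torch.relu(out)

# Tiny usage example on a 4-node path graph.
adj = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [0., 0., 1., 0.]])
x = torch.randn(4, 8)
layer = KHopAggregation(in_dim=8, out_dim=16, num_hops=3)
print(layer(x, adj).shape)  # torch.Size([4, 16])
```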
When Spatial meets Temporal in Action Recognition
Chen, Huilin, Wang, Lei, Chen, Yifan, Gedeon, Tom, Koniusz, Piotr
Video action recognition has made significant strides, but challenges remain in effectively using both spatial and temporal information. While existing methods often focus on either spatial features (e.g., object appearance) or temporal dynamics (e.g., motion), they rarely address the need for a comprehensive integration of both. Capturing the rich temporal evolution of video frames, while preserving their spatial details, is crucial for improving accuracy. In this paper, we introduce the Temporal Integration and Motion Enhancement (TIME) layer, a novel preprocessing technique designed to incorporate temporal information. The TIME layer generates new video frames by rearranging the original sequence, preserving temporal order while embedding $N^2$ temporally evolving frames into a single spatial grid of size $N \times N$. This transformation creates new frames that balance both spatial and temporal information, making them compatible with existing video models. When $N=1$, the layer captures rich spatial details, similar to existing methods. As $N$ increases ($N\geq2$), temporal information becomes more prominent, while the spatial information decreases to ensure compatibility with model inputs. We demonstrate the effectiveness of the TIME layer by integrating it into popular action recognition models, such as ResNet-50, Vision Transformer, and Video Masked Autoencoders, for both RGB and depth video data. Our experiments show that the TIME layer enhances recognition accuracy, offering valuable insights for video processing tasks.
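As a rough illustration of the rearrangement the abstract describes, the sketch below tiles $N^2$ consecutive frames into one $N \times N$ spatial grid per output frame; the function name, the row-major frame ordering, and the bilinear resize back to the model's input resolution are assumptions rather than the authors' exact TIME layer.

```python
import torch
import torch.nn.functional as F

def time_rearrange(video, n, out_size=224):
    """Tile n*n consecutive frames into one n x n spatial grid per output frame.

    video: (T, C, H, W); for simplicity, assumes T is a multiple of n*n.
    Returns (T // (n*n), C, out_size, out_size). Illustrative sketch only,
    not the authors' exact TIME layer.
    """
    t, c, h, w = video.shape
    assert t % (n * n) == 0, "T must be divisible by n^2 in this toy version"
    chunks = video.view(t // (n * n), n, n, c, h, w)                      # group frames
    grid = chunks.permute(0, 3, 1, 4, 2, 5).reshape(-1, c, n * h, n * w)  # tile n x n grid
    # Resize each grid frame back to the model's expected input resolution.
    return F.interpolate(grid, size=(out_size, out_size), mode="bilinear",
                         align_corners=False)

clip = torch.randn(16, 3, 112, 112)     # 16 RGB frames
frames_n2 = time_rearrange(clip, n=2)   # 4 grid frames, each holding 4 originals
print(frames_n2.shape)                  # torch.Size([4, 3, 224, 224])
```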
Inductive Graph Few-shot Class Incremental Learning
Li, Yayong, Moghadam, Peyman, Peng, Can, Ye, Nan, Koniusz, Piotr
Node classification with Graph Neural Networks (GNNs) under a fixed set of labels is well studied, in contrast to Graph Few-Shot Class Incremental Learning (GFSCIL), which involves learning a GNN classifier as graph nodes and classes grow sporadically over time. We introduce inductive GFSCIL, which continually learns novel classes with newly emerging nodes while maintaining performance on old classes without accessing previous data. This addresses the practical concern with transductive GFSCIL, which requires storing the entire graph with historical data. Compared to the transductive setting, inductive GFSCIL exacerbates catastrophic forgetting because previous data are inaccessible during incremental training, in addition to the overfitting caused by label sparsity. Thus, we propose a novel method called Topology-based class Augmentation and Prototype calibration (TAP). Specifically, it first introduces a triple-branch multi-topology class augmentation method to enhance model generalization. As each incremental session receives a disjoint subgraph with nodes of novel classes, the multi-topology class augmentation method replicates such a setting in the base session to boost backbone versatility. During incremental learning, given the limited number of novel-class samples, we propose an iterative prototype calibration to improve the separation of class prototypes. Furthermore, as backbone fine-tuning causes feature distribution drift, prototypes of old classes degrade over time; we therefore propose a prototype shift method for old classes to compensate for the drift. We showcase the proposed method on four datasets.
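The abstract does not detail how the prototype shift is computed, so the sketch below only illustrates the general idea under a strong assumption: old prototypes are translated by a drift vector estimated from currently available nodes embedded with the backbone before and after fine-tuning. The function name and the soft-assignment weighting are hypothetical, not the paper's exact TAP procedure.

```python
import torch

def shift_old_prototypes(old_protos, feats_before, feats_after):
    """Illustrative prototype-shift idea (not the paper's exact method).

    old_protos:   (C, D) class prototypes computed with the previous backbone.
    feats_before: (N, D) features of currently available nodes, previous backbone.
    feats_after:  (N, D) features of the same nodes after fine-tuning.

    Each old prototype is translated by a drift vector estimated from the nodes
    most similar to it, compensating the feature-distribution drift caused by
    fine-tuning without revisiting old data.
    """
    drift = feats_after - feats_before                          # (N, D) per-node drift
    # Soft assignment of nodes to prototypes based on pre-drift similarity.
    sim = torch.softmax(feats_before @ old_protos.t(), dim=1)   # (N, C)
    weights = sim / sim.sum(dim=0, keepdim=True).clamp(min=1e-8)
    proto_drift = weights.t() @ drift                           # (C, D) weighted drift
    return old_protos + proto_drift

protos = torch.randn(5, 64)                       # 5 old-class prototypes
f_before = torch.randn(100, 64)                   # current-session nodes, old backbone
f_after = f_before + 0.1 * torch.randn(100, 64)   # same nodes after fine-tuning
print(shift_old_prototypes(protos, f_before, f_after).shape)  # torch.Size([5, 64])
```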
PACE: marrying generalization in PArameter-efficient fine-tuning with Consistency rEgularization
Ni, Yao, Zhang, Shan, Koniusz, Piotr
Parameter-Efficient Fine-Tuning (PEFT) effectively adapts pre-trained vision transformers to downstream tasks. However, optimizing for task performance often comes at the cost of generalizability in fine-tuned models. To address this issue, we theoretically connect smaller weight gradient norms during training, and larger datasets, to improved model generalization. Motivated by this connection, we propose reducing gradient norms to enhance generalization, and aligning the fine-tuned model with its pre-trained counterpart to retain knowledge from large-scale pre-training data. Yet, naive alignment does not guarantee gradient reduction and can even cause gradient explosion, complicating efforts to manage gradients. To address these issues, we propose PACE, marrying generalization of PArameter-efficient fine-tuning with Consistency rEgularization. We perturb features learned from the adapter with multiplicative noise and ensure that the fine-tuned model remains consistent for the same sample under different perturbations. Theoretical analysis shows that PACE not only implicitly regularizes gradients for enhanced generalization, but also implicitly aligns the fine-tuned and pre-trained models to retain knowledge. Experimental evidence supports our theories. PACE outperforms existing PEFT methods on four visual adaptation tasks: VTAB-1k, FGVC, few-shot learning and domain adaptation. Code will be available at https://github.com/MaxwellYaoNi/PACE
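A minimal sketch of the two ingredients named in the abstract: multiplicative noise on adapter features and a consistency term between two perturbed passes of the same sample. The adapter shape, noise scale, and loss weighting are assumptions, not the released PACE code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyAdapter(nn.Module):
    """Bottleneck adapter whose residual output is perturbed by multiplicative noise."""
    def __init__(self, dim, bottleneck=16, sigma=0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.sigma = sigma

    def forward(self, x):
        h = self.up(F.gelu(self.down(x)))
        if self.training:
            h = h * (1.0 + self.sigma * torch.randn_like(h))  # multiplicative noise
        return x + h

def pace_style_loss(adapter, head, x, y, lam=1.0):
    """Task loss plus consistency between two independently perturbed passes."""
    z1, z2 = head(adapter(x)), head(adapter(x))   # two noise draws for the same batch
    task = F.cross_entropy(z1, y)
    consistency = F.mse_loss(z1, z2)
    return task + lam * consistency

adapter = NoisyAdapter(dim=32).train()
head = nn.Linear(32, 10)
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
print(pace_style_loss(adapter, head, x, y).item())
```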
Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs
Zhang, Yifei, Zhu, Hao, Liu, Aiwei, Yu, Han, Koniusz, Piotr, King, Irwin
Fine-tuning Large Language Models (LLMs) has become a crucial technique for adapting pre-trained models to downstream tasks. However, the enormous size of LLMs poses significant challenges in terms of computational complexity and resource requirements. Low-Rank Adaptation (LoRA) has emerged as a promising solution, yet there remains a gap between the practical performance of low-rank adaptations and their theoretical optimum. In this work, we propose eXtreme Gradient Boosting LoRA (XGBLoRA), a novel framework that bridges this gap by leveraging the power of ensemble learning. Inspired by gradient boosting, XGBLoRA iteratively learns and merges a sequence of LoRA adaptations to refine model predictions. It achieves better performance than standard LoRA while enjoying the computational efficiency of rank-1 adaptations. We provide theoretical analysis showing the convergence and optimality of our approach, and conduct extensive experiments on a range of natural language processing tasks. The results demonstrate that XGBLoRA consistently outperforms standard LoRA and achieves performance comparable to full fine-tuning with significantly fewer trainable parameters. This work advances parameter-efficient fine-tuning for LLMs and offers a promising solution for adapting LLMs to downstream tasks while balancing performance and efficiency.
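A toy sketch of the boosting-style loop the abstract describes: a rank-1 LoRA update is trained while the base weight stays frozen, merged into the weight, and then replaced by a fresh adapter. The toy regression objective, learning rate, and round counts are illustrative assumptions, not the XGBLoRA training recipe.

```python
import torch

torch.manual_seed(0)
d_in, d_out, steps_per_round, rounds = 16, 8, 200, 5
X, Wtrue = torch.randn(256, d_in), torch.randn(d_in, d_out)
Y = X @ Wtrue

W = torch.randn(d_in, d_out) * 0.1          # "pre-trained" weight, frozen within each round
for r in range(rounds):
    # A fresh rank-1 adapter (outer product a @ b) is trained while W stays fixed.
    a = torch.zeros(d_in, 1, requires_grad=True)
    b = torch.randn(1, d_out, requires_grad=True)
    opt = torch.optim.Adam([a, b], lr=1e-2)
    for _ in range(steps_per_round):
        loss = ((X @ (W + a @ b) - Y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    W = (W + a @ b).detach()                 # merge the rank-1 update, then start over
    print(f"round {r}: loss {((X @ W - Y) ** 2).mean().item():.4f}")
```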
LEGO: Learnable Expansion of Graph Operators for Multi-Modal Feature Fusion
Ding, Dexuan, Wang, Lei, Zhu, Liyun, Gedeon, Tom, Koniusz, Piotr
In computer vision tasks, features often come from diverse representations, domains, and modalities, such as text, images, and videos. Effectively fusing these features is essential for robust performance, especially with the availability of powerful pre-trained models such as vision-language models. However, common fusion methods, such as concatenation, element-wise operations, and non-linear techniques, often fail to capture structural relationships and deep feature interactions, and suffer from inefficiency or misalignment of features across domains. In this paper, we shift from a high-dimensional feature space to a lower-dimensional, interpretable graph space by constructing similarity graphs that encode feature relationships at different levels, e.g., clip, frame, patch, or token. To capture deeper interactions, we use graph power expansions and introduce a learnable graph fusion operator to combine these graph powers for more effective fusion. Our approach is relationship-centric, operates in a homogeneous space, and is mathematically principled, resembling element-wise similarity score aggregation via multilinear polynomials. We demonstrate the effectiveness of our graph-based fusion method on video anomaly detection, showing strong performance across multi-representational, multi-modal, and multi-domain feature fusion tasks.
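A small sketch of the graph-power idea under stated assumptions: a cosine-similarity graph is built from item-level features, its powers are computed, and a learnable softmax-weighted sum fuses them. The class name and the choice of cosine similarity are assumptions, not the exact LEGO operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphPowerFusion(nn.Module):
    """Fuse powers of a similarity graph with learnable weights (illustrative sketch).

    Given features for N items (e.g., frames or patches), build a cosine-similarity
    graph A and combine I, A, A^2, ..., A^K with softmax-normalised learnable
    coefficients, so deeper relational interactions can be weighted in.
    """
    def __init__(self, max_power=3):
        super().__init__()
        self.coeffs = nn.Parameter(torch.zeros(max_power + 1))  # weights for A^0..A^K

    def forward(self, feats):
        a = F.normalize(feats, dim=1) @ F.normalize(feats, dim=1).t()  # cosine graph
        powers, p = [torch.eye(a.shape[0])], torch.eye(a.shape[0])
        for _ in range(len(self.coeffs) - 1):
            p = p @ a                                    # next graph power
            powers.append(p)
        w = torch.softmax(self.coeffs, dim=0)
        return sum(wi * pi for wi, pi in zip(w, powers))  # fused relation graph

fusion = GraphPowerFusion(max_power=3)
frame_feats = torch.randn(20, 128)     # e.g., 20 frame-level features of one video
print(fusion(frame_feats).shape)       # torch.Size([20, 20])
```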
Motion meets Attention: Video Motion Prompts
Chen, Qixiang, Wang, Lei, Koniusz, Piotr, Gedeon, Tom
Videos contain rich spatio-temporal information. Traditional methods for extracting motion, used in tasks such as action recognition, often rely on visual content rather than precise motion features. We refer to this as 'blind motion extraction', which is inefficient at capturing motions of interest due to the lack of motion-guided cues. Recently, attention mechanisms have enhanced many computer vision tasks by effectively highlighting salient visual areas. Inspired by this, we propose using a modified sigmoid function with learnable slope and shift parameters as an attention mechanism to activate and modulate motion signals derived from frame-differencing maps. This approach generates a sequence of attention maps that enhance the processing of motion-related video content. To ensure temporal continuity and smoothness of the attention maps, we apply a pair-wise temporal attention variation regularization that removes unwanted motions (e.g., noise) while preserving important ones. We then compute the Hadamard product between each attention map and the corresponding original video frame to highlight the evolving motions of interest over time. These highlighted frames, termed video motion prompts, are subsequently used as inputs to the model instead of the original video frames. We formalize this process as a motion prompt layer and incorporate the regularization term into the loss function to learn better motion prompts. This layer serves as an adapter between the model and the video data, bridging the gap between traditional 'blind motion extraction' and the extraction of relevant motions of interest.
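A compact sketch of the motion prompt layer as described: frame differencing, a sigmoid with learnable slope and shift producing attention maps, a Hadamard product with the frames, and a pair-wise temporal variation regularizer. The exact parameterization and regularizer weighting are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MotionPromptLayer(nn.Module):
    """Toy motion prompt layer: learnable sigmoid over frame-difference maps."""
    def __init__(self, slope=5.0, shift=0.0):
        super().__init__()
        self.slope = nn.Parameter(torch.tensor(slope))
        self.shift = nn.Parameter(torch.tensor(shift))

    def forward(self, video):
        # video: (T, C, H, W); frame differencing gives T-1 motion maps.
        diff = (video[1:] - video[:-1]).abs().mean(dim=1, keepdim=True)  # (T-1, 1, H, W)
        attn = torch.sigmoid(self.slope * (diff - self.shift))           # attention maps
        prompts = attn * video[1:]                # Hadamard product highlights motion
        # Pair-wise temporal attention variation regulariser (encourages smoothness).
        reg = (attn[1:] - attn[:-1]).pow(2).mean()
        return prompts, reg

layer = MotionPromptLayer()
clip = torch.randn(8, 3, 56, 56)
prompts, reg = layer(clip)
print(prompts.shape, reg.item())   # torch.Size([7, 3, 56, 56])
```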
CHAIN: Enhancing Generalization in Data-Efficient GANs via lipsCHitz continuity constrAIned Normalization
Ni, Yao, Koniusz, Piotr
Generative Adversarial Networks (GANs) have significantly advanced image generation, but their performance heavily depends on abundant training data. In scenarios with limited data, GANs often struggle with discriminator overfitting and unstable training. Batch Normalization (BN), despite being known for enhancing generalization and training stability, has rarely been used in the discriminator of data-efficient GANs. Our work addresses this gap by identifying a critical flaw in BN: the tendency for gradient explosion during the centering and scaling steps. To tackle this issue, we present CHAIN (lipsCHitz continuity constrAIned Normalization), which replaces the conventional centering step with zero-mean regularization and integrates a Lipschitz continuity constraint in the scaling step. CHAIN further enhances GAN training by adaptively interpolating the normalized and unnormalized features, effectively avoiding discriminator overfitting. Our theoretical analyses firmly establish CHAIN's effectiveness in reducing gradients in latent features and weights, improving stability and generalization in GAN training. Empirical evidence supports our theory. CHAIN achieves state-of-the-art results in data-limited scenarios on CIFAR-10/100, ImageNet, five low-shot and seven high-resolution few-shot image datasets. Code: https://github.com/MaxwellYaoNi/CHAIN
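CHAIN's precise formulation is not given in the abstract, so the following is one plausible reading only: the centering step is replaced by a zero-mean penalty on batch means, features are divided by the batch standard deviation, and normalized and unnormalized features are adaptively interpolated. The class name and all details are assumptions, not the released code.

```python
import torch
import torch.nn as nn

class ChainLikeNorm(nn.Module):
    """Rough sketch of a CHAIN-style normalisation (not the paper's exact method).

    Instead of subtracting the batch mean, a zero-mean penalty on the batch mean is
    returned as a regulariser; features are divided by the batch std (the scaling
    step), and the result is adaptively interpolated with the unnormalised input.
    """
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.mix = nn.Parameter(torch.zeros(num_features))  # interpolation logits

    def forward(self, x):
        # x: (B, F) discriminator features.
        mean = x.mean(dim=0)
        std = x.std(dim=0, unbiased=False) + self.eps
        x_scaled = x / std                        # scaling only, no centering step
        p = torch.sigmoid(self.mix)               # per-feature interpolation weight
        out = p * x_scaled + (1.0 - p) * x
        zero_mean_reg = mean.pow(2).mean()        # penalise non-zero batch means
        return out, zero_mean_reg

norm = ChainLikeNorm(num_features=64)
h = torch.randn(32, 64)
out, reg = norm(h)
print(out.shape, reg.item())
```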
Meet JEANIE: a Similarity Measure for 3D Skeleton Sequences via Temporal-Viewpoint Alignment
Wang, Lei, Liu, Jun, Zheng, Liang, Gedeon, Tom, Koniusz, Piotr
Video sequences exhibit significant nuisance variations (undesired effects) in the speed of actions, temporal locations, and subjects' poses, leading to temporal-viewpoint misalignment. Thus, we propose Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE) for sequence pairs. In particular, we focus on 3D skeleton sequences whose camera and subjects' poses can be easily manipulated in 3D. We evaluate JEANIE on skeletal Few-shot Action Recognition (FSAR), where matching temporal blocks of support-query sequence pairs well is essential given the limited samples of novel classes. Given a query sequence, we create several of its views by simulating several camera locations. For a support sequence, we match it against the view-simulated query sequences: each support temporal block can be matched to the query temporal block with the same or adjacent (next) temporal index, and to adjacent camera views, achieving joint local temporal-viewpoint warping. JEANIE selects the smallest distance among matching paths with different temporal-viewpoint variations. We also propose an unsupervised FSAR model, akin to clustering of sequences, with JEANIE as the distance measure. JEANIE achieves state-of-the-art results on NTU-60, NTU-120, Kinetics-skeleton and UWA3D Multiview Activity II for supervised and unsupervised FSAR, and for their meta-learning inspired fusion.
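To make the alignment concrete, the sketch below implements a simplified hard-min dynamic program over query time, support time, and simulated viewpoints, allowing temporal warping steps and moves to adjacent views; the actual JEANIE uses a soft minimum and learned block features, so this is an illustration rather than the authors' algorithm, and all names are assumptions.

```python
import torch

def jeanie_like_distance(query_views, support):
    """Simplified hard-min variant of joint temporal-viewpoint alignment.

    query_views: (V, Tq, D) query temporal blocks under V simulated camera views.
    support:     (Ts, D)    support temporal blocks.
    A DTW-style dynamic program where each step may advance the query index,
    the support index, or both, and may also move to an adjacent viewpoint.
    """
    v_num, tq, d = query_views.shape
    ts = support.shape[0]
    cost = torch.cdist(query_views.reshape(-1, d), support).reshape(v_num, tq, ts)
    inf = torch.tensor(float("inf"))
    acc = torch.full((v_num, tq, ts), float("inf"))
    for i in range(tq):
        for j in range(ts):
            for v in range(v_num):
                if i == 0 and j == 0:
                    acc[v, i, j] = cost[v, i, j]
                    continue
                best = inf
                for dv in (-1, 0, 1):                        # adjacent viewpoints
                    pv = v + dv
                    if not (0 <= pv < v_num):
                        continue
                    for di, dj in ((1, 0), (0, 1), (1, 1)):  # temporal warping steps
                        pi, pj = i - di, j - dj
                        if pi >= 0 and pj >= 0:
                            best = torch.minimum(best, acc[pv, pi, pj])
                acc[v, i, j] = cost[v, i, j] + best
    return acc[:, -1, -1].min()                              # best path over end viewpoints

q = torch.randn(3, 6, 32)   # 3 simulated views, 6 query blocks, 32-dim block features
s = torch.randn(5, 32)      # 5 support blocks
print(jeanie_like_distance(q, s).item())
```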
High-order Tensor Pooling with Attention for Action Recognition
Wang, Lei, Sun, Ke, Koniusz, Piotr
We aim to capture high-order statistics of feature vectors formed by a neural network, and propose end-to-end second- and higher-order pooling to form a tensor descriptor. Tensor descriptors require a robust similarity measure due to the low number of aggregated vectors and the burstiness phenomenon, where a given feature appears more or less frequently than statistically expected. The Heat Diffusion Process (HDP) on a graph Laplacian is closely related to the Eigenvalue Power Normalization (EPN) of the covariance/autocorrelation matrix, whose inverse forms a loopy graph Laplacian. We show that HDP and EPN play the same role, i.e., boosting or dampening the magnitude of the eigenspectrum, thus preventing burstiness. We equip higher-order tensors with EPN, which acts as a spectral detector of higher-order occurrences to prevent burstiness. We also prove that for a tensor of order r built from d-dimensional feature descriptors, such a detector gives the likelihood that at least one higher-order occurrence is 'projected' into one of the $\binom{d}{r}$ subspaces represented by the tensor, thus forming a tensor power normalization metric endowed with $\binom{d}{r}$ such 'detectors'. As experimental contributions, we apply several second- and higher-order pooling variants to action recognition, provide previously unreported comparisons of such pooling variants, and show state-of-the-art results on HMDB-51, YUP++ and MPII Cooking Activities.
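For the second-order (matrix) case, Eigenvalue Power Normalization amounts to raising the eigenvalues of the autocorrelation matrix to a power $\gamma \in (0, 1]$; a minimal sketch follows, with the higher-order tensor case omitted and the value of $\gamma$ assumed.

```python
import torch

def epn_pooling(features, gamma=0.5, eps=1e-6):
    """Second-order pooling with Eigenvalue Power Normalisation (illustrative).

    features: (N, D) feature vectors from a network.
    Returns the D x D autocorrelation matrix whose eigenvalues are raised to a
    power gamma in (0, 1], which dampens dominant (bursty) spectral directions.
    """
    m = features.t() @ features / features.shape[0]     # (D, D) autocorrelation
    evals, evecs = torch.linalg.eigh(m)                  # symmetric eigendecomposition
    evals = evals.clamp(min=eps).pow(gamma)              # spectral power normalisation
    return evecs @ torch.diag(evals) @ evecs.t()

x = torch.randn(64, 32)          # 64 aggregated feature vectors of dimension 32
desc = epn_pooling(x, gamma=0.5)
print(desc.shape)                # torch.Size([32, 32])
```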