Oceania
A Misclassification Network-Based Method for Comparative Genomic Analysis
He, Wan, Eliassi-Rad, Tina, Scarpino, Samuel V.
Classifying genome sequences based on metadata has been an active area of research in comparative genomics for decades with many important applications across the life sciences. Established methods for classifying genomes can be broadly grouped into sequence alignment-based and alignment-free models. Conventional alignment-based models rely on genome similarity measures calculated based on local sequence alignments or consistent ordering among sequences. However, such methods are computationally expensive when dealing with large ensembles of even moderately sized genomes. In contrast, alignment-free (AF) approaches measure genome similarity based on summary statistics in an unsupervised setting and are efficient enough to analyze large datasets. However, both alignment-based and AF methods typically assume fixed scoring rubrics that lack the flexibility to assign varying importance to different parts of the sequences based on prior knowledge. In this study, we integrate AI and network science approaches to develop a comparative genomic analysis framework that addresses these limitations. Our approach, termed the Genome Misclassification Network Analysis (GMNA), simultaneously leverages misclassified instances, a learned scoring rubric, and label information to classify genomes based on associated metadata and better understand potential drivers of misclassification. We evaluate the utility of the GMNA using Naive Bayes and convolutional neural network models, supplemented by additional experiments with transformer-based models, to construct SARS-CoV-2 sampling location classifiers using over 500,000 viral genome sequences and study the resulting network of misclassifications. We demonstrate the global health potential of the GMNA by leveraging the SARS-CoV-2 genome misclassification networks to investigate the role human mobility played in structuring geographic clustering of SARS-CoV-2.
Absolute Risk Prediction for Cannabis Use Disorder Using Bayesian Machine Learning
Wang, Tingfang, Boden, Joseph M., Biswas, Swati, Choudhary, Pankaj K.
Introduction: Substance use disorders (SUDs) have emerged as a pressing public health crisis in the United States, with adolescent substance use often leading to SUDs in adulthood. Effective strategies are needed to prevent this progression. To help in filling this need, we develop a novel and the first-ever absolute risk prediction model for cannabis use disorder (CUD) for adolescent or young adult cannabis users. Methods: We train a Bayesian machine learning model that provides a personalized CUD absolute risk for adolescent or young adult cannabis users using data from the National Longitudinal Study of Adolescent to Adult Health. Model performance is assessed using 5-fold cross-validation (CV) with area under the curve (AUC) and ratio of the expected to observed number of cases (E/O). External validation of the final model is conducted using two independent datasets. Results: The proposed model has five risk factors: biological sex, delinquency, and scores on personality traits of conscientiousness, neuroticism, and openness. For predicting CUD risk within five years of first cannabis use, AUC and E/O, computed via 5-fold CV, were 0.68 and 0.95, respectively. For the same type of prediction in external validation, AUC values were 0.64 and 0.75, with E/O values of 0.98 and 1, indicating good discrimination and calibration performances of the model. Discussion and Conclusion: The proposed model is the first absolute risk prediction model for an SUD. It can aid clinicians in identifying adolescent/youth substance users at a high risk of developing CUD in future for clinically appropriate interventions.
Applying the maximum entropy principle to neural networks enhances multi-species distribution models
Ryckewaert, Maxime, Marcos, Diego, Botella, Christophe, Servajean, Maximilien, Bonnet, Pierre, Joly, Alexis
The rapid expansion of citizen science initiatives has led to a significant growth of biodiversity databases, and particularly presence-only (PO) observations. PO data are invaluable for understanding species distributions and their dynamics, but their use in a Species Distribution Model (SDM) is curtailed by sampling biases and the lack of information on absences. Poisson point processes are widely used for SDMs, with Maxent being one of the most popular methods. Maxent maximises the entropy of a probability distribution across sites as a function of predefined transformations of variables, called features. In contrast, neural networks and deep learning have emerged as a promising technique for automatic feature extraction from complex input variables. Arbitrarily complex transformations of input variables can be learned from the data efficiently through backpropagation and stochastic gradient descent (SGD). In this paper, we propose DeepMaxent, which harnesses neural networks to automatically learn shared features among species, using the maximum entropy principle. To do so, it employs a normalised Poisson loss where for each species, presence probabilities across sites are modelled by a neural network. We evaluate DeepMaxent on a benchmark dataset known for its spatial sampling biases, using PO data for calibration and presence-absence (PA) data for validation across six regions with different biological groups and covariates. Our results indicate that DeepMaxent performs better than Maxent and other leading SDMs across all regions and taxonomic groups. The method performs particularly well in regions of uneven sampling, demonstrating substantial potential to increase SDM performances. In particular, our approach yields more accurate predictions than traditional single-species models, which opens up new possibilities for methodological enhancement.
Free-Knots Kolmogorov-Arnold Network: On the Analysis of Spline Knots and Advancing Stability
Zheng, Liangwewi Nathan, Zhang, Wei Emma, Yue, Lin, Xu, Miao, Maennel, Olaf, Chen, Weitong
Kolmogorov-Arnold Neural Networks (KANs) have gained significant attention in the machine learning community. However, their implementation often suffers from poor training stability and heavy trainable parameter. Furthermore, there is limited understanding of the behavior of the learned activation functions derived from B-splines. In this work, we analyze the behavior of KANs through the lens of spline knots and derive the lower and upper bound for the number of knots in B-spline-based KANs. To address existing limitations, we propose a novel Free Knots KAN that enhances the performance of the original KAN while reducing the number of trainable parameters to match the trainable parameter scale of standard Multi-Layer Perceptrons (MLPs). Additionally, we introduce new a training strategy to ensure $C^2$ continuity of the learnable spline, resulting in smoother activation compared to the original KAN and improve the training stability by range expansion. The proposed method is comprehensively evaluated on 8 datasets spanning various domains, including image, text, time series, multimodal, and function approximation tasks. The promising results demonstrates the feasibility of KAN-based network and the effectiveness of proposed method.
AI-based Identity Fraud Detection: A Systematic Review
Zhang, Chuo Jun, Gill, Asif Q., Liu, Bo, Anwar, Memoona J.
With the rapid development of digital services, a large volume of personally identifiable information (PII) is stored online and is subject to cyberattacks such as Identity fraud. Most recently, the use of Artificial Intelligence (AI) enabled deep fake technologies has significantly increased the complexity of identity fraud. Fraudsters may use these technologies to create highly sophisticated counterfeit personal identification documents, photos and videos. These advancements in the identity fraud landscape pose challenges for identity fraud detection and society at large. There is a pressing need to review and understand identity fraud detection methods, their limitations and potential solutions. This research aims to address this important need by using the well-known systematic literature review method. This paper reviewed a selected set of 43 papers across 4 major academic literature databases. In particular, the review results highlight the two types of identity fraud prevention and detection methods, in-depth and open challenges. The results were also consolidated into a taxonomy of AI-based identity fraud detection and prevention methods including key insights and trends. Overall, this paper provides a foundational knowledge base to researchers and practitioners for further research and development in this important area of digital identity fraud.
Data-Driven Network Neuroscience: On Data Collection and Benchmark
This paper presents a comprehensive and quality collection of functional human brain network data for potential research in the intersection of neuroscience, machine learning, and graph analytics. Anatomical and functional MRI images have been used to understand the functional connectivity of the human brain and are particularly important in identifying underlying neurodegenerative conditions such as Alzheimer's, Parkinson's, and Autism. Recently, the study of the brain in the form of brain networks using machine learning and graph analytics has become increasingly popular, especially to predict the early onset of these conditions. A brain network, represented as a graph, retains rich structural and positional information that traditional examination methods are unable to capture. However, the lack of publicly accessible brain network data prevents researchers from data-driven explorations.
Efficient Traffic Prediction Through Spatio-Temporal Distillation
Zhang, Qianru, Gao, Xinyi, Wang, Haixin, Yiu, Siu-Ming, Yin, Hongzhi
Graph neural networks (GNNs) have gained considerable attention in recent years for traffic flow prediction due to their ability to learn spatio-temporal pattern representations through a graph-based message-passing framework. Although GNNs have shown great promise in handling traffic datasets, their deployment in real-life applications has been hindered by scalability constraints arising from high-order message passing. Additionally, the over-smoothing problem of GNNs may lead to indistinguishable region representations as the number of layers increases, resulting in performance degradation. To address these challenges, we propose a new knowledge distillation paradigm termed LightST that transfers spatial and temporal knowledge from a high-capacity teacher to a lightweight student. Specifically, we introduce a spatio-temporal knowledge distillation framework that helps student MLPs capture graph-structured global spatio-temporal patterns while alleviating the over-smoothing effect with adaptive knowledge distillation. Extensive experiments verify that LightST significantly speeds up traffic flow predictions by 5X to 40X compared to state-of-the-art spatio-temporal GNNs, all while maintaining superior accuracy.
DNN-Powered MLOps Pipeline Optimization for Large Language Models: A Framework for Automated Deployment and Resource Management
Krishnamoorthy, Mahesh Vaijainthymala, Palavesam, Kuppusamy Vellamadam, Arcot, Siva Venkatesh, Kuppuswami, Rajarajeswari Chinniah
The exponential growth in the size and complexity of Large Language Models (LLMs) has introduced unprecedented challenges in their deployment and operational management. Traditional MLOps approaches often fail to efficiently handle the scale, resource requirements, and dynamic nature of these models. This research presents a novel framework that leverages Deep Neural Networks (DNNs) to optimize MLOps pipelines specifically for LLMs. Our approach introduces an intelligent system that automates deployment decisions, resource allocation, and pipeline optimization while maintaining optimal performance and cost efficiency. Through extensive experimentation across multiple cloud environments and deployment scenarios, we demonstrate significant improvements: 40% enhancement in resource utilization, 35% reduction in deployment latency, and 30% decrease in operational costs compared to traditional MLOps approaches. The framework's ability to adapt to varying workloads and automatically optimize deployment strategies represents a significant advancement in automated MLOps management for large-scale language models. Our framework introduces several novel components including a multi-stream neural architecture for processing heterogeneous operational metrics, an adaptive resource allocation system that continuously learns from deployment patterns, and a sophisticated deployment orchestration mechanism that automatically selects optimal strategies based on model characteristics and environmental conditions. The system demonstrates robust performance across various deployment scenarios, including multi-cloud environments, high-throughput production systems, and cost-sensitive deployments. Through rigorous evaluation using production workloads from multiple organizations, we validate our approach's effectiveness in reducing operational complexity while improving system reliability and cost efficiency.
Enhancing the De-identification of Personally Identifiable Information in Educational Data
Shen, Y., Ji, Z., Lin, J., Koedginer, K. R.
Protecting Personally Identifiable Information (PII), such as names, is a critical requirement in learning technologies to safeguard student and teacher privacy and maintain trust. Accurate PII detection is an essential step toward anonymizing sensitive information while preserving the utility of educational data. Motivated by recent advancements in artificial intelligence, our study investigates the GPT-4o-mini model as a cost-effective and efficient solution for PII detection tasks. We explore both prompting and fine-tuning approaches and compare GPT-4o-mini's performance against established frameworks, including Microsoft Presidio and Azure AI Language. Our evaluation on two public datasets, CRAPII and TSCC, demonstrates that the fine-tuned GPT-4o-mini model achieves superior performance, with a recall of 0.9589 on CRAPII. Additionally, fine-tuned GPT-4o-mini significantly improves precision scores (a threefold increase) while reducing computational costs to nearly one-tenth of those associated with Azure AI Language. Furthermore, our bias analysis reveals that the fine-tuned GPT-4o-mini model consistently delivers accurate results across diverse cultural backgrounds and genders. The generalizability analysis using the TSCC dataset further highlights its robustness, achieving a recall of 0.9895 with minimal additional training data from TSCC. These results emphasize the potential of fine-tuned GPT-4o-mini as an accurate and cost-effective tool for PII detection in educational data. It offers robust privacy protection while preserving the data's utility for research and pedagogical analysis. Our code is available on GitHub: https://github.com/AnonJD/PrivacyAI
PolyLUT: Ultra-low Latency Polynomial Inference with Hardware-Aware Structured Pruning
Andronic, Marta, Li, Jiawen, Constantinides, George A.
Standard deep neural network inference involves the computation of interleaved linear maps and nonlinear activation functions. Prior work for ultra-low latency implementations has hardcoded these operations inside FPGA lookup tables (LUTs). However, FPGA LUTs can implement a much greater variety of functions. In this paper, we propose a novel approach to training DNNs for FPGA deployment using multivariate polynomials as the basic building block. Our method takes advantage of the flexibility offered by the soft logic, hiding the polynomial evaluation inside the LUTs with minimal overhead. By using polynomial building blocks, we achieve the same accuracy using considerably fewer layers of soft logic than by using linear functions, leading to significant latency and area improvements. LUT-based implementations also face a significant challenge: the LUT size grows exponentially with the number of inputs. Prior work relies on a priori fixed sparsity, with results heavily dependent on seed selection. To address this, we propose a structured pruning strategy using a bespoke hardware-aware group regularizer that encourages a particular sparsity pattern that leads to a small number of inputs per neuron. We demonstrate the effectiveness of PolyLUT on three tasks: network intrusion detection, jet identification at the CERN Large Hadron Collider, and MNIST.