Goto

Collaborating Authors

 Education


Machine learning for fraud detection in digital banking: a systematic literature review REVIEW

arXiv.org Artificial Intelligence

This systematic literature review examines the role of machine learning in fraud detection within digital banking, synthesizing evidence from 118 peer-reviewed studies and institutional reports. Following the PRISMA guidelines, the review applied a structured identification, screening, eligibility, and inclusion process to ensure methodological rigor and transparency. The findings reveal that supervised learning methods, such as decision trees, logistic regression, and support vector machines, remain the dominant paradigm due to their interpretability and established performance, while unsupervised anomaly detection approaches are increasingly adopted to address novel fraud patterns in highly imbalanced datasets. Deep learning architectures, particularly recurrent and convolutional neural networks, have emerged as transformative tools capable of modeling sequential transaction data and detecting complex fraud typologies, though challenges of interpretability and real-time deployment persist. Hybrid models that combine supervised, unsupervised, and deep learning strategies demonstrate superior adaptability and detection accuracy, highlighting their potential as convergent solutions.


Linguistic Characteristics of AI-Generated Text: A Survey

arXiv.org Artificial Intelligence

Large language models (LLMs) are solidifying their position in the modern world as effective tools for the automatic generation of text. Their use is quickly becoming commonplace in fields such as education, healthcare, and scientific research. There is a growing need to study the linguistic features present in AI-generated text, as the increasing presence of such texts has profound implications in various disciplines such as corpus linguistics, computational linguistics, and natural language processing. Many observations have already been made, however a broader synthesis of the findings made so far is required to provide a better understanding of the topic. The present survey paper aims to provide such a synthesis of extant research. We categorize the existing works along several dimensions, including the levels of linguistic description, the models included, the genres analyzed, the languages analyzed, and the approach to prompting. Additionally, the same scheme is used to present the findings made so far and expose the current trends followed by researchers. Among the most-often reported findings is the observation that AI-generated text is more likely to contain a more formal and impersonal style, signaled by the increased presence of nouns, determiners, and adpositions and the lower reliance on adjectives and adverbs. AI-generated text is also more likely to feature a lower lexical diversity, a smaller vocabulary size, and repetitive text. Current research, however, remains heavily concentrated on English data and mostly on text generated by the GPT model family, highlighting the need for broader cross-linguistic and cross-model investigation. In most cases authors also fail to address the issue of prompt sensitivity, leaving much room for future studies that employ multiple prompt wordings in the text generation phase.


Characterizing Model Behavior Under Synthetic Data Training: An Empirical Study Across Scales and Mixing Ratios

arXiv.org Artificial Intelligence

Synthetic data generated by large language models has become integral to modern NLP training pipelines, from bootstrapping reasoning capabilities to augmenting instruction-following datasets. While recent work demonstrates successful applications maintaining high external data ratios, systematic understanding of how synthetic data proportion affects model behavior across different scales remains limited. This paper presents a controlled empirical study examining model performance, calibration, and output characteristics when trained on varying synthetic-to-external data ratios. Using the Pythia model suite (410M-12B parameters) across five diverse tasks, we evaluate models after one to three training iterations with synthetic data proportions ranging from 0-50\%. Our key findings include: models maintain stable performance with up to 20\% synthetic data, but degradation accelerates beyond 30\%; larger models (6.9B-12B) show greater robustness to synthetic data than smaller models (410M-1.4B); calibration degradation precedes accuracy loss, providing an early warning signal; and task characteristics matter, with reasoning tasks degrading faster than retrieval tasks under synthetic data training. Importantly, we find that current best practices, such as those employed in STaR and Self-Instruct systems that maintain greater than 80\% external data, operate well within safe regimes identified by our experiments. We provide practical guidance for practitioners on synthetic data budgets based on model scale and task requirements, alongside detailed comparison with concurrent work including Shumailov et al.'s model collapse findings.


Rationale-Augmented Retrieval with Constrained LLM Re-Ranking for Task Discovery

arXiv.org Artificial Intelligence

Head Start programs utilizing GoEngage face significant challenges when new or rotating staff attempt to locate appropriate Tasks (modules) on the platform homepage. These difficulties arise from domain-specific jargon (e.g., IFPA, DRDP), system-specific nomenclature (e.g., Application Pool), and the inherent limitations of lexical search in handling typos and varied word ordering. We propose a pragmatic hybrid semantic search system that synergistically combines lightweight typo-tolerant lexical retrieval, embedding-based vector similarity, and constrained large language model (LLM) re-ranking. Our approach leverages the organization's existing Task Repository and Knowledge Base infrastructure while ensuring trustworthiness through low false-positive rates, evolvability to accommodate terminological changes, and economic efficiency via intelligent caching, shortlist generation, and graceful degradation mechanisms. We provide a comprehensive framework detailing required resources, a phased implementation strategy with concrete milestones, an offline evaluation protocol utilizing curated test cases (Hit@K, Precision@K, Recall@K, MRR), and an online measurement methodology incorporating query success metrics, zero-result rates, and dwell-time proxies.


MuFFIN: Multifaceted Pronunciation Feedback Model with Interactive Hierarchical Neural Modeling

arXiv.org Artificial Intelligence

Computer-assisted pronunciation training (CAPT) manages to facilitate second-language (L2) learners to practice pronunciation skills by offering timely and instructive feedback. To examine pronunciation proficiency from multiple facets, existing methods for CAPT broadly fall into two categories: mispronunciation detection and diagnosis (MDD) as well as automatic pronunciation assessment (APA). The former aims to pinpoint phonetic pronunciation errors and provide diagnostic feedback, while the latter seeks instead to quantify pronunciation proficiency pertaining to various aspects. Despite the natural complementarity between MDD and APA, researchers and practitioners, however, often treat them as independent tasks with disparate modeling paradigms. In light of this, we in this paper first introduce MuFFIN, a Multi-Faceted pronunciation Feedback model with an Interactive hierarchical Neural architecture, to jointly address the tasks of MDD and APA. To better capture the nuanced distinctions between phonemes in the feature space, a novel phoneme-contrastive ordinal regularization mechanism is then put forward to optimize the proposed model to generate more phoneme-discriminative features while factoring in the ordinality of the aspect scores. In addition, to address the intricate data imbalance problem in MDD, we design a simple yet effective training objective, which is specifically tailored to perturb the outputs of a phoneme classifier with the phoneme-specific variations, so as to better render the distribution of predicted phonemes meanwhile considering their mispronunciation characteristics. A series of experiments conducted on the Speechocean762 benchmark dataset demonstrates the efficacy of our method in relation to several cutting-edge baselines, showing state-of-the-art performance on both the APA and MDD tasks.


Harnessing LLM for Noise-Robust Cognitive Diagnosis in Web-Based Intelligent Education Systems

arXiv.org Artificial Intelligence

Cognitive diagnostics in the Web-based Intelligent Education System (WIES) aims to assess students' mastery of knowledge concepts from heterogeneous, noisy interactions. Recent work has tried to utilize Large Language Models (LLMs) for cognitive diagnosis, yet LLMs struggle with structured data and are prone to noise-induced misjudgments. Specially, WIES's open environment continuously attracts new students and produces vast amounts of response logs, exacerbating the data imbalance and noise issues inherent in traditional educational systems. To address these challenges, we propose DLLM, a Diffusion-based LLM framework for noise-robust cognitive diagnosis. DLLM first constructs independent subgraphs based on response correctness, then applies relation augmentation alignment module to mitigate data imbalance. The two subgraph representations are then fused and aligned with LLM-derived, semantically augmented representations. Importantly, before each alignment step, DLLM employs a two-stage denoising diffusion module to eliminate intrinsic noise while assisting structural representation alignment. Specifically, unconditional denoising diffusion first removes erroneous information, followed by conditional denoising diffusion based on graph-guided to eliminate misleading information. Finally, the noise-robust representation that integrates semantic knowledge and structural information is fed into existing cognitive diagnosis models for prediction. Experimental results on three publicly available web-based educational platform datasets demonstrate that our DLLM achieves optimal predictive performance across varying noise levels, which demonstrates that DLLM achieves noise robustness while effectively leveraging semantic knowledge from LLM.


Report of the 2025 Workshop on Next-Generation Ecosystems for Scientific Computing: Harnessing Community, Software, and AI for Cross-Disciplinary Team Science

arXiv.org Artificial Intelligence

This report summarizes insights from the 2025 Workshop on Next-Generation Ecosystems for Scientific Computing: Harnessing Community, Software, and AI for Cross-Disciplinary Team Science, which convened more than 40 experts from national laboratories, academia, industry, and community organizations to chart a path toward more powerful, sustainable, and collaborative scientific software ecosystems. To address urgent challenges at the intersection of high-performance computing (HPC), AI, and scientific software, participants envisioned agile, robust ecosystems built through socio-technical co-design--the intentional integration of social and technical components as interdependent parts of a unified strategy. This approach combines advances in AI, HPC, and software with new models for cross-disciplinary collaboration, training, and workforce development. Key recommendations include building modular, trustworthy AI-enabled scientific software systems; enabling scientific teams to integrate AI systems into their workflows while preserving human creativity, trust, and scientific rigor; and creating innovative training pipelines that keep pace with rapid technological change. Pilot projects were identified as near-term catalysts, with initial priorities focused on hybrid AI/HPC infrastructure, cross-disciplinary collaboration and pedagogy, responsible AI guidelines, and prototyping of public-private partnerships. This report presents a vision of next-generation ecosystems for scientific computing where AI, software, hardware, and human expertise are interwoven to drive discovery, expand access, strengthen the workforce, and accelerate scientific progress.


Generating High-Quality Datasets for Code Editing via Open-Source Language Models

arXiv.org Artificial Intelligence

Code editing plays a vital role in software engineering, requiring developers to adjust existing code according to natural language instructions while keeping functionality intact and avoiding unnecessary modifications. However, commit-based datasets commonly used for this task are often noisy, lack diversity, and fail to reflect the style of real-world edit instructions. To address this, we introduce OpenCodeEdit, an open-source pipeline that leverages multiple LLMs to synthesize realistic code-edit triplets. The pipeline produces both concise "lazy" instructions and more detailed "descriptive" ones, and applies filtering based on diffs and topics to guarantee data quality and variety. Using this process, we construct OCEDataFT, a curated dataset of 20K samples. Fine-tuning three advanced base models on OCEDataFT leads to significant performance boosts on the CanItEdit benchmark, with relative pass@1 improvements ranging from 4.50% to 20.79%. Notably, the resulting models achieve performance close to closed-source systems, narrowing the gap to GPT-4 to just 3.54%, without relying on proprietary resources or manual annotation.


COSPADI: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning

arXiv.org Artificial Intelligence

Post-training compression of large language models (LLMs) largely relies on low-rank weight approximation, which represents each column of a weight matrix in a shared low-dimensional subspace. While this is a computationally efficient strategy, the imposed structural constraint is rigid and can lead to a noticeable model accuracy drop. In this work, we propose CoSpaDi (Compression via Sparse Dictionary Learning), a novel training-free compression framework that replaces low-rank decomposition with a more flexible structured sparse factorization in which each weight matrix is represented with a dense dictionary and a column-sparse coefficient matrix. This formulation enables a union-of-subspaces representation: different columns of the original weight matrix are approximated in distinct subspaces spanned by adaptively selected dictionary atoms, offering greater expressiveness than a single invariant basis. Crucially, CoSpaDi leverages a small calibration dataset to optimize the factorization such that the output activations of compressed projection layers closely match those of the original ones, thereby minimizing functional reconstruction error rather than mere weight approximation. This data-aware strategy preserves better model fidelity without any fine-tuning under reasonable compression ratios. Moreover, the resulting structured sparsity allows efficient sparse-dense matrix multiplication and is compatible with post-training quantization for further memory and latency gains. We evaluate CoSpaDi across multiple Llama and Qwen models under per-layer and per-group settings at 20-50\% compression ratios, demonstrating consistent superiority over state-of-the-art data-aware low-rank methods both in accuracy and perplexity. Our results establish structured sparse dictionary learning as a powerful alternative to conventional low-rank approaches for efficient LLM deployment.


GEM-Bench: A Benchmark for Ad-Injected Response Generation within Generative Engine Marketing

arXiv.org Artificial Intelligence

Generative Engine Marketing (GEM) is an emerging ecosystem for monetizing generative engines, such as LLM-based chatbots, by seamlessly integrating relevant advertisements into their responses. At the core of GEM lies the generation and evaluation of ad-injected responses. However, existing benchmarks are not specifically designed for this purpose, which limits future research. To address this gap, we propose GEM-Bench, the first comprehensive benchmark for ad-injected response generation in GEM. GEM-Bench includes three curated datasets covering both chatbot and search scenarios, a metric ontology that captures multiple dimensions of user satisfaction and engagement, and several baseline solutions implemented within an extensible multi-agent framework. Our preliminary results indicate that, while simple prompt-based methods achieve reasonable engagement such as click-through rate, they often reduce user satisfaction. In contrast, approaches that insert ads based on pre-generated ad-free responses help mitigate this issue but introduce additional overhead. These findings highlight the need for future research on designing more effective and efficient solutions for generating ad-injected responses in GEM. The benchmark and all related resources are publicly available at https://gem-bench.org/.