Deep Reinforcement Learning for Phishing Detection with Transformer-Based Semantic Features

Faisal, Aseer Al

arXiv.org Artificial Intelligence

Phishing is a cybercrime in which individuals are deceived into revealing personal information, often resulting in financial loss. These attacks commonly occur through fraudulent messages, misleading advertisements, and compromised legitimate websites. This study proposes a Quantile Regression Deep Q-Network (QR-DQN) approach that integrates RoBERTa semantic embeddings with handcrafted lexical features to enhance phishing detection while accounting for uncertainties. Unlike traditional DQN methods that estimate single scalar Q-values, QR-DQN leverages quantile regression to model the distribution of returns, improving stability and generalization on unseen phishing data. A diverse dataset of 105,000 URLs was curated from PhishTank, OpenPhish, Cloudflare, and other sources, and the model was evaluated using an 80/20 train-test split. The QR-DQN framework achieved a test accuracy of 99.86%, precision of 99.75%, recall of 99.96%, and F1-score of 99.85%, demonstrating high effectiveness. Compared to standard DQN with lexical features, the hybrid QR-DQN with lexical and semantic features reduced the generalization gap from 1.66% to 0.04%, indicating significant improvement in robustness. Five-fold cross-validation confirmed model reliability, yielding a mean accuracy of 99.90% with a standard deviation of 0.04%. These results suggest that the proposed hybrid approach effectively identifies phishing threats, adapts to evolving attack strategies, and generalizes well to unseen data.
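The distributional element that distinguishes QR-DQN from a standard DQN can be illustrated with the quantile Huber loss from the QR-DQN literature. This is a generic NumPy sketch with our own variable names, not the authors' implementation, and it uses a scalar Bellman target for brevity (the full algorithm compares against an entire target quantile distribution):

```python
import numpy as np

def quantile_huber_loss(pred_quantiles, target, kappa=1.0):
    """Quantile Huber loss used to train QR-DQN.

    pred_quantiles: N predicted quantile values for one (state, action) pair.
    target: scalar Bellman target r + gamma * max_a' Q(s', a').
    """
    n = len(pred_quantiles)
    tau = (2 * np.arange(n) + 1) / (2 * n)   # quantile midpoints (2i+1)/2N
    u = target - pred_quantiles              # per-quantile TD errors
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    # The asymmetric weight |tau - 1{u < 0}| pushes each output
    # toward its assigned quantile of the return distribution.
    return np.mean(np.abs(tau - (u < 0)) * huber)
```

When all predicted quantiles equal the target the loss vanishes; otherwise each quantile head is penalized asymmetrically, which is what lets the network model a full return distribution rather than a single scalar Q-value.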


Multilingual Lexical Feature Analysis of Spoken Language for Predicting Major Depression Symptom Severity

Tokareva, Anastasiia, Dineley, Judith, Firth, Zoe, Conde, Pauline, Matcham, Faith, Siddi, Sara, Lamers, Femke, Carr, Ewan, Oetzmann, Carolin, Leightley, Daniel, Zhang, Yuezhou, Folarin, Amos A., Haro, Josep Maria, Penninx, Brenda W. J. H., Bailon, Raquel, Vairavan, Srinivasan, Wykes, Til, Dobson, Richard J. B., Narayan, Vaibhav A., Hotopf, Matthew, Cummins, Nicholas, Consortium, The RADAR-CNS

arXiv.org Artificial Intelligence

Background: Captured between clinical appointments using mobile devices, spoken language has potential for objective, more regular assessment of symptom severity and earlier detection of relapse in major depressive disorder. However, research to date has largely been in non-clinical cross-sectional samples of written language using complex machine learning (ML) approaches with limited interpretability. Methods: We describe an initial exploratory analysis of longitudinal speech data and PHQ-8 assessments from 5,836 recordings of 586 participants in the UK, Netherlands, and Spain, collected in the RADAR-MDD study. We sought to identify interpretable lexical features associated with MDD symptom severity using linear mixed-effects modelling. Interpretable features and high-dimensional vector embeddings were also used to test the prediction performance of four ML regression models. Results: In English data, MDD symptom severity was associated with 7 features, including lexical diversity measures and absolutist language. In Dutch, associations were observed with words per sentence and positive word frequency; no associations were observed in recordings collected in Spain. The predictive power of lexical features and vector embeddings was near chance level across all languages. Limitations: Smaller samples in non-English speech and methodological choices, such as the elicitation prompt, may also have limited the observable effect sizes. A lack of NLP tools in languages other than English restricted our feature choice. Conclusion: To understand the value of lexical markers in clinical research and practice, further research is needed in larger samples across several languages, using improved protocols and ML models that account for within- and between-individual variations in language.
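The interpretable features the study names (lexical diversity, words per sentence, absolutist language) are straightforward to compute. The sketch below uses a tiny illustrative absolutist-word list and a naive tokenizer, not the study's actual lexicon or pipeline:

```python
import re

# Tiny illustrative absolutist-word list; the study's actual lexicon differs.
ABSOLUTIST = {"always", "never", "completely", "totally", "nothing", "everything"}

def lexical_profile(text: str) -> dict:
    """Compute a few interpretable lexical features from a transcript."""
    tokens = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = len(tokens)
    return {
        # Type-token ratio: a simple lexical diversity measure.
        "type_token_ratio": len(set(tokens)) / n if n else 0.0,
        "words_per_sentence": n / len(sentences) if sentences else 0.0,
        "absolutist_rate": sum(t in ABSOLUTIST for t in tokens) / n if n else 0.0,
    }

profile = lexical_profile("I never feel rested. Nothing seems to help.")
```

In a study like this, features of this kind would then enter a mixed-effects model as fixed effects, with participant as a random effect.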


Blending Learning to Rank and Dense Representations for Efficient and Effective Cascades

Nardini, Franco Maria, Perego, Raffaele, Tonellotto, Nicola, Trani, Salvatore

arXiv.org Artificial Intelligence

We investigate the exploitation of both lexical and neural relevance signals for ad-hoc passage retrieval. Our exploration involves a large-scale training dataset in which dense neural representations of MS-MARCO queries and passages are complemented and integrated with 253 hand-crafted lexical features extracted from the same corpus. Blending of the relevance signals from the two different groups of features is learned by a classical Learning-to-Rank (LTR) model based on a forest of decision trees. To evaluate our solution, we employ a pipelined architecture in which a dense neural retriever serves as the first stage and performs a nearest-neighbor search over the neural representations of the documents. Our LTR model acts as the second stage, re-ranking the set of candidates retrieved by the first stage to enhance effectiveness. The results of reproducible experiments conducted with state-of-the-art dense retrievers on publicly available resources show that the proposed solution significantly enhances end-to-end ranking performance with only a minimal impact on efficiency. Specifically, we achieve a boost in nDCG@10 of up to 11% with an increase in average query latency of only 4.3%. This confirms the advantage of seamlessly combining two distinct families of signals that mutually contribute to retrieval effectiveness.
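The cascade can be sketched end-to-end with toy data: a dense first stage retrieves candidates by inner product, and a second stage re-ranks them over blended signals. In the paper the second stage is a learned forest of decision trees; the fixed linear blend below is only a placeholder for that model, and all data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy corpus: a dense vector plus two "lexical" features per passage.
dense = rng.normal(size=(100, 16))
lexical = rng.normal(size=(100, 2))   # stand-ins for hand-crafted features
query = rng.normal(size=16)

# Stage 1: dense retrieval by inner product, keep the top-k candidates.
k = 10
scores1 = dense @ query
cand = np.argsort(-scores1)[:k]

# Stage 2: re-rank candidates with a blended score. The paper learns this
# blend with a tree-based LTR model; a fixed linear blend stands in here.
w_dense, w_lex = 1.0, np.array([0.5, 0.3])
scores2 = scores1[cand] * w_dense + lexical[cand] @ w_lex
reranked = cand[np.argsort(-scores2)]
```

The efficiency argument in the abstract follows from this structure: the expensive blended model only scores the k candidates, not the whole corpus.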


2d6cc4b2d139a53512fb8cbb3086ae2e-Reviews.html

Neural Information Processing Systems

First provide a summary of the paper, and then address the following criteria: quality, clarity, originality, and significance. This paper proposes a model for labeling images with classes for which no examples appear in the training set, based on a combination of word and image embeddings and novelty detection. Using distances in the embedding space between test images and the seen and unseen class labels, the approach assigns a probability that a new image belongs to an unseen class. This probability is then used to decide which classifier to apply (one designed for seen classes, the other for unseen ones). Results on CIFAR-10 are provided.
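The routing mechanism the review describes, where distance in the embedding space decides whether an image belongs to an unseen class, can be sketched as follows. The names and the hard threshold are illustrative; the reviewed paper derives a probability rather than a binary cutoff:

```python
import numpy as np

def novelty_score(img_emb, seen_class_embs):
    """Distance from an image embedding to the nearest *seen* class embedding.
    A large score suggests the image belongs to an unseen class."""
    return np.linalg.norm(seen_class_embs - img_emb, axis=1).min()

def route(img_emb, seen_class_embs, threshold):
    """Pick which classifier handles the image: the seen-class model
    or the zero-shot model for unseen classes."""
    return "unseen" if novelty_score(img_emb, seen_class_embs) > threshold else "seen"

seen = np.array([[0.0, 0.0], [1.0, 1.0]])   # toy seen-class embeddings
```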


On the Contribution of Lexical Features to Speech Emotion Recognition

Combei, David

arXiv.org Artificial Intelligence

Although paralinguistic cues are often considered the primary drivers of speech emotion recognition (SER), we investigate the role of lexical content extracted from speech and show that it can achieve competitive and in some cases higher performance compared to acoustic models. On the MELD dataset, our lexical-based approach obtains a weighted F1-score (WF1) of 51.5%, compared to 49.3% for an acoustic-only pipeline with a larger parameter count. Furthermore, we analyze different self-supervised (SSL) speech and text representations, conduct a layer-wise study of transformer-based encoders, and evaluate the effect of audio denoising.
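The reported metric, weighted F1 (WF1), averages per-class F1 scores weighted by class support, which matters on an emotion dataset like MELD where class frequencies are highly skewed. A minimal pure-Python version of the standard definition:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted F1: per-class F1 averaged by class frequency."""
    support = Counter(y_true)
    total, wf1 = len(y_true), 0.0
    for cls, n in support.items():
        tp = sum(t == p == cls for t, p in zip(y_true, y_pred))
        fp = sum(p == cls and t != cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        wf1 += (n / total) * f1
    return wf1
```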


LinguaSynth: Heterogeneous Linguistic Signals for News Classification

Zhang, Duo, Mo, Junyi

arXiv.org Artificial Intelligence

Deep learning has significantly advanced NLP, but its reliance on large black-box models introduces critical interpretability and computational efficiency concerns. This paper proposes LinguaSynth, a novel text classification framework that strategically integrates five complementary linguistic feature types: lexical, syntactic, entity-level, word-level semantics, and document-level semantics within a transparent logistic regression model. Unlike transformer-based architectures, LinguaSynth maintains interpretability and computational efficiency, achieving an accuracy of 84.89 percent on the 20 Newsgroups dataset and surpassing a robust TF-IDF baseline by 3.32 percent. Through rigorous feature interaction analysis, we show that syntactic and entity-level signals provide essential disambiguation and effectively complement distributional semantics. LinguaSynth sets a new benchmark for interpretable, resource-efficient NLP models and challenges the prevailing assumption that deep neural networks are necessary for high-performing text classification.
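The design, heterogeneous feature groups concatenated into one transparent logistic-regression model, can be sketched with synthetic stand-ins for the five feature types; in the real system these would come from NLP pipelines (TF-IDF, parsers, NER, embeddings), and the group names below are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Synthetic stand-ins for the five feature groups.
groups = {name: rng.normal(size=(n, d)) for name, d in
          [("lexical", 4), ("syntactic", 3), ("entity", 2),
           ("word_sem", 5), ("doc_sem", 5)]}
X = np.hstack(list(groups.values()))
true_w = rng.normal(size=X.shape[1])
y = (X @ true_w > 0).astype(float)        # toy linearly separable labels

# Plain logistic regression by gradient descent: one weight per feature,
# so each linguistic signal's contribution stays inspectable.
w = np.zeros(X.shape[1])
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / n

acc = ((X @ w > 0) == (y == 1)).mean()    # training accuracy on toy data
```

The interpretability claim rests on exactly this structure: every feature keeps its own weight, so the learned `w` can be sliced back into the five groups for the kind of feature-interaction analysis the paper reports.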


UKTA: Unified Korean Text Analyzer

Ahn, Seokho, Park, Junhyung, Go, Ganghee, Kim, Chulhui, Jung, Jiho, Shin, Myung Sun, Kim, Do-Guk, Seo, Young-Duk

arXiv.org Artificial Intelligence

Evaluating writing quality is complex and time-consuming, often delaying feedback to learners. While automated writing evaluation tools are effective for English, Korean automated writing evaluation tools face challenges due to their inability to address multi-view analysis, error propagation, and evaluation explainability. To overcome these challenges, we introduce UKTA (Unified Korean Text Analyzer), a comprehensive Korean text analysis and writing evaluation system. UKTA provides accurate low-level morpheme analysis, key lexical features for mid-level explainability, and transparent high-level rubric-based writing scores. Our approach enhances accuracy and quadratic weighted kappa over existing baselines, positioning UKTA as a leading multi-perspective tool for Korean text analysis and writing evaluation.
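Quadratic weighted kappa, the agreement statistic used for the rubric-based writing scores, penalizes disagreements by the squared distance between score levels, so being one rubric level off costs far less than being three off. A compact generic NumPy version (not UKTA's code):

```python
import numpy as np

def quadratic_weighted_kappa(a, b, n_classes):
    """Quadratic weighted kappa between two integer score sequences."""
    O = np.zeros((n_classes, n_classes))          # observed confusion matrix
    for i, j in zip(a, b):
        O[i, j] += 1
    # Quadratic disagreement weights: (i - j)^2 / (N - 1)^2.
    w = np.subtract.outer(np.arange(n_classes), np.arange(n_classes)) ** 2
    w = w / (n_classes - 1) ** 2
    # Expected matrix under chance agreement (outer product of marginals).
    E = np.outer(O.sum(1), O.sum(0)) / O.sum()
    return 1 - (w * O).sum() / (w * E).sum()
```

Perfect agreement gives 1.0, chance-level agreement gives 0, and systematic disagreement goes negative.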


A Survey on Pedophile Attribution Techniques for Online Platforms

Fallatah, Hiba, Suen, Ching, Ormandjieva, Olga

arXiv.org Artificial Intelligence

The anonymity afforded by social media has increased its popularity among users of all ages, and the availability of public Wi-Fi networks has broadened access to online content, including social media applications. Although anonymity and ease of access are convenient for users, they make it difficult to protect vulnerable users against sexual predators. An automated identification system that can attribute predators to their text would make that protection more attainable. In this survey, we review methods of pedophile attribution used on social media platforms. We examine the effect of the size of the suspect set and the length of the text on the attribution task. Moreover, we review the most-used datasets, features, classification techniques, and performance measures for attributing sexual predators. We found that few studies have proposed tools to mitigate the risk of online sexual predators, and none of them provides suspect attribution. Finally, we list several open research problems.


Large Language Models for Dysfluency Detection in Stuttered Speech

Wagner, Dominik, Bayerl, Sebastian P., Baumann, Ilja, Riedhammer, Korbinian, Nöth, Elmar, Bocklet, Tobias

arXiv.org Artificial Intelligence

Accurately detecting dysfluencies in spoken language can help to improve the performance of automatic speech and language processing components and support the development of more inclusive speech and language technologies. Inspired by the recent trend towards the deployment of large language models (LLMs) as universal learners and processors of non-lexical inputs, such as audio and video, we approach the task of multi-label dysfluency detection as a language modeling problem. We present hypothesis candidates generated by an automatic speech recognition system, together with acoustic representations extracted from an audio encoder model, to an LLM, and fine-tune the system to predict dysfluency labels on three datasets containing English and German stuttered speech. The experimental results show that our system effectively combines acoustic and lexical information and achieves competitive results on the multi-label stuttering detection task.
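The language-modeling framing can be illustrated with a toy prompt builder: ASR hypothesis candidates go into the prompt and the model's completion is parsed into a label set (the paper additionally feeds acoustic representations to the LLM). The label names and prompt format here are illustrative, not the authors':

```python
# Illustrative dysfluency label set; the datasets in the paper define their own.
LABELS = ["block", "prolongation", "sound_repetition",
          "word_repetition", "interjection"]

def build_prompt(asr_hypotheses):
    """Turn N-best ASR hypotheses into a multi-label classification prompt."""
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(asr_hypotheses))
    return (f"ASR hypotheses:\n{hyps}\n"
            f"Which dysfluency labels apply? Options: {', '.join(LABELS)}\n"
            "Labels:")

def parse_labels(completion):
    """Map the LLM's free-text completion back to the closed label set."""
    return sorted(l for l in LABELS if l in completion)

prompt = build_prompt(["i i want to go", "i want want to go"])
```

Fine-tuning then teaches the LLM to emit the correct subset of labels as its completion, which is what makes the task multi-label rather than single-class.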


How Lexical is Bilingual Lexicon Induction?

Kohli, Harsh, Feng, Helian, Dronen, Nicholas, McCarter, Calvin, Moeini, Sina, Kebarighotbi, Ali

arXiv.org Artificial Intelligence

In contemporary machine learning approaches to bilingual lexicon induction (BLI), a model learns a mapping between the embedding spaces of a language pair. Recently, a retrieve-and-rank approach to BLI has achieved state-of-the-art results on the task. However, the problem remains challenging in low-resource settings due to the paucity of data, and is further complicated by factors such as lexical variation across languages. We argue that incorporating additional lexical information into the recent retrieve-and-rank approach should improve lexicon induction. We demonstrate the efficacy of our proposed approach on XLING, improving over the previous state of the art by an average of 2% across all language pairs.
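The embedding-space mapping step common to BLI systems can be sketched with orthogonal Procrustes on toy data; the retrieve-and-rank refinement and the lexical signals the paper adds sit on top of a base retrieval like this. All data below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 50
src = rng.normal(size=(n, d))                 # source-language word vectors
R_true, _ = np.linalg.qr(rng.normal(size=(d, d)))
tgt = src @ R_true                            # toy target space: rotated source

# Learn an orthogonal map W from a seed dictionary (first 20 word pairs)
# via orthogonal Procrustes: W = U V^T where U S V^T = svd(X^T Y).
U, _, Vt = np.linalg.svd(src[:20].T @ tgt[:20])
W = U @ Vt

# Retrieve: nearest target neighbor of each mapped source vector
# is the induced translation candidate.
mapped = src @ W
dists = np.linalg.norm(mapped[:, None, :] - tgt[None, :, :], axis=2)
nn = dists.argmin(axis=1)
```

On this noiseless toy example the seed pairs recover the rotation exactly, so every word retrieves its true translation; real BLI adds a ranking stage precisely because real embedding spaces are not related by a clean rotation.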