Deep Reinforcement Learning for Phishing Detection with Transformer-Based Semantic Features

Faisal, Aseer Al

arXiv.org Artificial Intelligence

Phishing is a cybercrime in which individuals are deceived into revealing personal information, often resulting in financial loss. These attacks commonly occur through fraudulent messages, misleading advertisements, and compromised legitimate websites. This study proposes a Quantile Regression Deep Q-Network (QR-DQN) approach that integrates RoBERTa semantic embeddings with handcrafted lexical features to enhance phishing detection while accounting for uncertainties. Unlike traditional DQN methods that estimate single scalar Q-values, QR-DQN leverages quantile regression to model the distribution of returns, improving stability and generalization on unseen phishing data. A diverse dataset of 105,000 URLs was curated from PhishTank, OpenPhish, Cloudflare, and other sources, and the model was evaluated using an 80/20 train-test split. The QR-DQN framework achieved a test accuracy of 99.86%, precision of 99.75%, recall of 99.96%, and F1-score of 99.85%, demonstrating high effectiveness. Compared to standard DQN with lexical features, the hybrid QR-DQN with lexical and semantic features reduced the generalization gap from 1.66% to 0.04%, indicating significant improvement in robustness. Five-fold cross-validation confirmed model reliability, yielding a mean accuracy of 99.90% with a standard deviation of 0.04%. These results suggest that the proposed hybrid approach effectively identifies phishing threats, adapts to evolving attack strategies, and generalizes well to unseen data.
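The distributional element that distinguishes QR-DQN from a standard DQN can be illustrated with the quantile Huber loss from the QR-DQN literature. This is a generic NumPy sketch with our own variable names, not the authors' implementation, and it uses a scalar Bellman target for brevity (the full algorithm compares against an entire target quantile distribution):

```python
import numpy as np

def quantile_huber_loss(pred_quantiles, target, kappa=1.0):
    """Quantile Huber loss used to train QR-DQN.

    pred_quantiles: N predicted quantile values for one (state, action) pair.
    target: scalar Bellman target r + gamma * max_a' Q(s', a').
    """
    n = len(pred_quantiles)
    tau = (2 * np.arange(n) + 1) / (2 * n)   # quantile midpoints (2i+1)/2N
    u = target - pred_quantiles              # per-quantile TD errors
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    # The asymmetric weight |tau - 1{u < 0}| pushes each output
    # toward its assigned quantile of the return distribution.
    return np.mean(np.abs(tau - (u < 0)) * huber)
```

When all predicted quantiles equal the target the loss vanishes; otherwise each quantile head is penalized asymmetrically, which is what lets the network model a full return distribution rather than a single scalar Q-value.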


Multilingual Lexical Feature Analysis of Spoken Language for Predicting Major Depression Symptom Severity

Tokareva, Anastasiia, Dineley, Judith, Firth, Zoe, Conde, Pauline, Matcham, Faith, Siddi, Sara, Lamers, Femke, Carr, Ewan, Oetzmann, Carolin, Leightley, Daniel, Zhang, Yuezhou, Folarin, Amos A., Haro, Josep Maria, Penninx, Brenda W. J. H., Bailon, Raquel, Vairavan, Srinivasan, Wykes, Til, Dobson, Richard J. B., Narayan, Vaibhav A., Hotopf, Matthew, Cummins, Nicholas, Consortium, The RADAR-CNS

arXiv.org Artificial Intelligence

Background: Captured between clinical appointments using mobile devices, spoken language has potential for objective, more regular assessment of symptom severity and earlier detection of relapse in major depressive disorder. However, research to date has largely been in non-clinical cross-sectional samples of written language using complex machine learning (ML) approaches with limited interpretability. Methods: We describe an initial exploratory analysis of longitudinal speech data and PHQ-8 assessments from 5,836 recordings of 586 participants in the UK, Netherlands, and Spain, collected in the RADAR-MDD study. We sought to identify interpretable lexical features associated with MDD symptom severity using linear mixed-effects modelling. Interpretable features and high-dimensional vector embeddings were also used to test the prediction performance of four ML regression models. Results: In English data, MDD symptom severity was associated with 7 features, including lexical diversity measures and absolutist language. In Dutch, associations were observed with words per sentence and positive word frequency; no associations were observed in recordings collected in Spain. The predictive power of lexical features and vector embeddings was near chance level across all languages. Limitations: Smaller samples in non-English speech and methodological choices, such as the elicitation prompt, may also have limited the observable effect sizes. A lack of NLP tools in languages other than English restricted our feature choice. Conclusion: To understand the value of lexical markers in clinical research and practice, further research is needed in larger samples across several languages, using improved protocols and ML models that account for within- and between-individual variations in language.
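The interpretable features the study names (lexical diversity, words per sentence, absolutist language) are straightforward to compute. The sketch below uses a tiny illustrative absolutist-word list and a naive tokenizer, not the study's actual lexicon or pipeline:

```python
import re

# Tiny illustrative absolutist-word list; the study's actual lexicon differs.
ABSOLUTIST = {"always", "never", "completely", "totally", "nothing", "everything"}

def lexical_profile(text: str) -> dict:
    """Compute a few interpretable lexical features from a transcript."""
    tokens = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = len(tokens)
    return {
        # Type-token ratio: a simple lexical diversity measure.
        "type_token_ratio": len(set(tokens)) / n if n else 0.0,
        "words_per_sentence": n / len(sentences) if sentences else 0.0,
        "absolutist_rate": sum(t in ABSOLUTIST for t in tokens) / n if n else 0.0,
    }

profile = lexical_profile("I never feel rested. Nothing seems to help.")
```

In a study like this, features of this kind would then enter a mixed-effects model as fixed effects, with participant as a random effect.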


Blending Learning to Rank and Dense Representations for Efficient and Effective Cascades

Nardini, Franco Maria, Perego, Raffaele, Tonellotto, Nicola, Trani, Salvatore

arXiv.org Artificial Intelligence

We investigate the exploitation of both lexical and neural relevance signals for ad-hoc passage retrieval. Our exploration involves a large-scale training dataset in which dense neural representations of MS-MARCO queries and passages are complemented and integrated with 253 hand-crafted lexical features extracted from the same corpus. Blending of the relevance signals from the two different groups of features is learned by a classical Learning-to-Rank (LTR) model based on a forest of decision trees. To evaluate our solution, we employ a pipelined architecture in which a dense neural retriever serves as the first stage and performs a nearest-neighbor search over the neural representations of the documents. Our LTR model acts as the second stage, re-ranking the set of candidates retrieved by the first stage to enhance effectiveness. The results of reproducible experiments conducted with state-of-the-art dense retrievers on publicly available resources show that the proposed solution significantly enhances end-to-end ranking performance with only a minimal impact on efficiency. Specifically, we achieve a boost in nDCG@10 of up to 11% with an increase in average query latency of only 4.3%. This confirms the advantage of seamlessly combining two distinct families of signals that mutually contribute to retrieval effectiveness.
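The cascade can be sketched end-to-end with toy data: a dense first stage retrieves candidates by inner product, and a second stage re-ranks them over blended signals. In the paper the second stage is a learned forest of decision trees; the fixed linear blend below is only a placeholder for that model, and all data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy corpus: a dense vector plus two "lexical" features per passage.
dense = rng.normal(size=(100, 16))
lexical = rng.normal(size=(100, 2))   # stand-ins for hand-crafted features
query = rng.normal(size=16)

# Stage 1: dense retrieval by inner product, keep the top-k candidates.
k = 10
scores1 = dense @ query
cand = np.argsort(-scores1)[:k]

# Stage 2: re-rank candidates with a blended score. The paper learns this
# blend with a tree-based LTR model; a fixed linear blend stands in here.
w_dense, w_lex = 1.0, np.array([0.5, 0.3])
scores2 = scores1[cand] * w_dense + lexical[cand] @ w_lex
reranked = cand[np.argsort(-scores2)]
```

The efficiency argument in the abstract follows from this structure: the expensive blended model only scores the k candidates, not the whole corpus.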


2d6cc4b2d139a53512fb8cbb3086ae2e-Reviews.html

Neural Information Processing Systems

First provide a summary of the paper, and then address the following criteria: quality, clarity, originality, and significance. This paper proposes a model for labeling images with classes for which no examples appear in the training set, based on a combination of word and image embeddings and novelty detection. Using distances in the embedding space between test images and the seen and unseen class labels, the approach assigns a probability that a new image belongs to an unseen class. This probability is then used to decide which classifier to apply (one designed for seen classes, the other for unseen ones). Results on CIFAR-10 are provided.
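The routing mechanism the review describes, where distance in the embedding space decides whether an image belongs to an unseen class, can be sketched as follows. The names and the hard threshold are illustrative; the reviewed paper derives a probability rather than a binary cutoff:

```python
import numpy as np

def novelty_score(img_emb, seen_class_embs):
    """Distance from an image embedding to the nearest *seen* class embedding.
    A large score suggests the image belongs to an unseen class."""
    return np.linalg.norm(seen_class_embs - img_emb, axis=1).min()

def route(img_emb, seen_class_embs, threshold):
    """Pick which classifier handles the image: the seen-class model
    or the zero-shot model for unseen classes."""
    return "unseen" if novelty_score(img_emb, seen_class_embs) > threshold else "seen"

seen = np.array([[0.0, 0.0], [1.0, 1.0]])   # toy seen-class embeddings
```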


On the Contribution of Lexical Features to Speech Emotion Recognition

Combei, David

arXiv.org Artificial Intelligence

Although paralinguistic cues are often considered the primary drivers of speech emotion recognition (SER), we investigate the role of lexical content extracted from speech and show that it can achieve competitive and in some cases higher performance compared to acoustic models. On the MELD dataset, our lexical-based approach obtains a weighted F1-score (WF1) of 51.5%, compared to 49.3% for an acoustic-only pipeline with a larger parameter count. Furthermore, we analyze different self-supervised (SSL) speech and text representations, conduct a layer-wise study of transformer-based encoders, and evaluate the effect of audio denoising.
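The reported metric, weighted F1 (WF1), averages per-class F1 scores weighted by class support, which matters on an emotion dataset like MELD where class frequencies are highly skewed. A minimal pure-Python version of the standard definition:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted F1: per-class F1 averaged by class frequency."""
    support = Counter(y_true)
    total, wf1 = len(y_true), 0.0
    for cls, n in support.items():
        tp = sum(t == p == cls for t, p in zip(y_true, y_pred))
        fp = sum(p == cls and t != cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        wf1 += (n / total) * f1
    return wf1
```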


LinguaSynth: Heterogeneous Linguistic Signals for News Classification

Zhang, Duo, Mo, Junyi

arXiv.org Artificial Intelligence

Deep learning has significantly advanced NLP, but its reliance on large black-box models introduces critical interpretability and computational efficiency concerns. This paper proposes LinguaSynth, a novel text classification framework that strategically integrates five complementary linguistic feature types: lexical, syntactic, entity-level, word-level semantics, and document-level semantics within a transparent logistic regression model. Unlike transformer-based architectures, LinguaSynth maintains interpretability and computational efficiency, achieving an accuracy of 84.89 percent on the 20 Newsgroups dataset and surpassing a robust TF-IDF baseline by 3.32 percent. Through rigorous feature interaction analysis, we show that syntactic and entity-level signals provide essential disambiguation and effectively complement distributional semantics. LinguaSynth sets a new benchmark for interpretable, resource-efficient NLP models and challenges the prevailing assumption that deep neural networks are necessary for high-performing text classification.
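The design, heterogeneous feature groups concatenated into one transparent logistic-regression model, can be sketched with synthetic stand-ins for the five feature types; in the real system these would come from NLP pipelines (TF-IDF, parsers, NER, embeddings), and the group names below are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Synthetic stand-ins for the five feature groups.
groups = {name: rng.normal(size=(n, d)) for name, d in
          [("lexical", 4), ("syntactic", 3), ("entity", 2),
           ("word_sem", 5), ("doc_sem", 5)]}
X = np.hstack(list(groups.values()))
true_w = rng.normal(size=X.shape[1])
y = (X @ true_w > 0).astype(float)        # toy linearly separable labels

# Plain logistic regression by gradient descent: one weight per feature,
# so each linguistic signal's contribution stays inspectable.
w = np.zeros(X.shape[1])
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / n

acc = ((X @ w > 0) == (y == 1)).mean()    # training accuracy on toy data
```

The interpretability claim rests on exactly this structure: every feature keeps its own weight, so the learned `w` can be sliced back into the five groups for the kind of feature-interaction analysis the paper reports.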


UKTA: Unified Korean Text Analyzer

Ahn, Seokho, Park, Junhyung, Go, Ganghee, Kim, Chulhui, Jung, Jiho, Shin, Myung Sun, Kim, Do-Guk, Seo, Young-Duk

arXiv.org Artificial Intelligence

Evaluating writing quality is complex and time-consuming, often delaying feedback to learners. While automated writing evaluation tools are effective for English, Korean automated writing evaluation tools face challenges due to their inability to address multi-view analysis, error propagation, and evaluation explainability. To overcome these challenges, we introduce UKTA (Unified Korean Text Analyzer), a comprehensive Korean text analysis and writing evaluation system. UKTA provides accurate low-level morpheme analysis, key lexical features for mid-level explainability, and transparent high-level rubric-based writing scores. Our approach enhances accuracy and quadratic weighted kappa over existing baselines, positioning UKTA as a leading multi-perspective tool for Korean text analysis and writing evaluation.
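Quadratic weighted kappa, the agreement statistic used for the rubric-based writing scores, penalizes disagreements by the squared distance between score levels, so being one rubric level off costs far less than being three off. A compact generic NumPy version (not UKTA's code):

```python
import numpy as np

def quadratic_weighted_kappa(a, b, n_classes):
    """Quadratic weighted kappa between two integer score sequences."""
    O = np.zeros((n_classes, n_classes))          # observed confusion matrix
    for i, j in zip(a, b):
        O[i, j] += 1
    # Quadratic disagreement weights: (i - j)^2 / (N - 1)^2.
    w = np.subtract.outer(np.arange(n_classes), np.arange(n_classes)) ** 2
    w = w / (n_classes - 1) ** 2
    # Expected matrix under chance agreement (outer product of marginals).
    E = np.outer(O.sum(1), O.sum(0)) / O.sum()
    return 1 - (w * O).sum() / (w * E).sum()
```

Perfect agreement gives 1.0, chance-level agreement gives 0, and systematic disagreement goes negative.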


A Survey on Pedophile Attribution Techniques for Online Platforms

Fallatah, Hiba, Suen, Ching, Ormandjieva, Olga

arXiv.org Artificial Intelligence

The anonymity afforded by social media has increased its popularity among users of all ages, and the availability of public Wi-Fi networks has broadened access to online content, including social media applications. Although anonymity and ease of access are convenient for users, they make it difficult to protect vulnerable users against sexual predators. An automated identification system that can attribute predators to their text would make that protection more attainable. In this survey, we review methods of pedophile attribution used on social media platforms. We examine the effect of the size of the suspect set and the length of the text on the attribution task. Moreover, we review the most-used datasets, features, classification techniques, and performance measures for attributing sexual predators. We found that few studies have proposed tools to mitigate the risk of online sexual predators, and none of them provides suspect attribution. Finally, we list several open research problems.


Large Language Models for Dysfluency Detection in Stuttered Speech

Wagner, Dominik, Bayerl, Sebastian P., Baumann, Ilja, Riedhammer, Korbinian, Nöth, Elmar, Bocklet, Tobias

arXiv.org Artificial Intelligence

Accurately detecting dysfluencies in spoken language can help to improve the performance of automatic speech and language processing components and support the development of more inclusive speech and language technologies. Inspired by the recent trend towards the deployment of large language models (LLMs) as universal learners and processors of non-lexical inputs, such as audio and video, we approach the task of multi-label dysfluency detection as a language modeling problem. We present hypothesis candidates generated by an automatic speech recognition system, together with acoustic representations extracted from an audio encoder model, to an LLM, and fine-tune the system to predict dysfluency labels on three datasets containing English and German stuttered speech. The experimental results show that our system effectively combines acoustic and lexical information and achieves competitive results on the multi-label stuttering detection task.
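The language-modeling framing can be illustrated with a toy prompt builder: ASR hypothesis candidates go into the prompt and the model's completion is parsed into a label set (the paper additionally feeds acoustic representations to the LLM). The label names and prompt format here are illustrative, not the authors':

```python
# Illustrative dysfluency label set; the datasets in the paper define their own.
LABELS = ["block", "prolongation", "sound_repetition",
          "word_repetition", "interjection"]

def build_prompt(asr_hypotheses):
    """Turn N-best ASR hypotheses into a multi-label classification prompt."""
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(asr_hypotheses))
    return (f"ASR hypotheses:\n{hyps}\n"
            f"Which dysfluency labels apply? Options: {', '.join(LABELS)}\n"
            "Labels:")

def parse_labels(completion):
    """Map the LLM's free-text completion back to the closed label set."""
    return sorted(l for l in LABELS if l in completion)

prompt = build_prompt(["i i want to go", "i want want to go"])
```

Fine-tuning then teaches the LLM to emit the correct subset of labels as its completion, which is what makes the task multi-label rather than single-class.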


How Lexical is Bilingual Lexicon Induction?

Kohli, Harsh, Feng, Helian, Dronen, Nicholas, McCarter, Calvin, Moeini, Sina, Kebarighotbi, Ali

arXiv.org Artificial Intelligence

In contemporary machine learning approaches to bilingual lexicon induction (BLI), a model learns a mapping between the embedding spaces of a language pair. Recently, a retrieve-and-rank approach to BLI has achieved state-of-the-art results on the task. However, the problem remains challenging in low-resource settings due to the paucity of data, and is further complicated by factors such as lexical variation across languages. We argue that incorporating additional lexical information into the recent retrieve-and-rank approach should improve lexicon induction. We demonstrate the efficacy of our proposed approach on XLING, improving over the previous state of the art by an average of 2% across all language pairs.
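The embedding-space mapping step common to BLI systems can be sketched with orthogonal Procrustes on toy data; the retrieve-and-rank refinement and the lexical signals the paper adds sit on top of a base retrieval like this. All data below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 50
src = rng.normal(size=(n, d))                 # source-language word vectors
R_true, _ = np.linalg.qr(rng.normal(size=(d, d)))
tgt = src @ R_true                            # toy target space: rotated source

# Learn an orthogonal map W from a seed dictionary (first 20 word pairs)
# via orthogonal Procrustes: W = U V^T where U S V^T = svd(X^T Y).
U, _, Vt = np.linalg.svd(src[:20].T @ tgt[:20])
W = U @ Vt

# Retrieve: nearest target neighbor of each mapped source vector
# is the induced translation candidate.
mapped = src @ W
dists = np.linalg.norm(mapped[:, None, :] - tgt[None, :, :], axis=2)
nn = dists.argmin(axis=1)
```

On this noiseless toy example the seed pairs recover the rotation exactly, so every word retrieves its true translation; real BLI adds a ranking stage precisely because real embedding spaces are not related by a clean rotation.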