AITopics | keyword extraction

keyword extraction

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Context-Aware Pragmatic Metacognitive Prompting for Sarcasm Detection

Iskandardinata, Michael, Christian, William, Suhartono, Derwin

arXiv.org Artificial Intelligence, Nov-27-2025

Abstract: Detecting sarcasm remains a challenging task in Natural Language Processing (NLP) despite recent advances in neural network approaches. Currently, Pre-trained Language Models (PLMs) and Large Language Models (LLMs) are the preferred approach for sarcasm detection. However, the complexity of sarcastic text, combined with linguistic diversity and cultural variation across communities, has made the task more difficult even for PLMs and LLMs. Beyond that, those models also exhibit unreliable detection of words or tokens that require extra grounding for analysis. Building on a state-of-the-art prompting method in LLMs for sarcasm detection called Pragmatic Metacognitive Prompting (PMP), we introduce a retrieval-aware approach that incorporates retrieved contextual information for each target text. Our pipeline explores two complementary ways to provide context: adding non-parametric knowledge using web-based retrieval when the model lacks necessary background, and eliciting the model's own internal knowledge for a self-knowledge awareness strategy. We evaluated our approach on three datasets: Twitter Indonesia Sarcastic, SemEval-2018 Task 3, and MUStARD. Non-parametric retrieval resulted in a significant 9.87% macro-F1 improvement on Twitter Indonesia Sarcastic compared to the original PMP method. Self-knowledge retrieval improves macro-F1 by 3.29% on SemEval and by 4.08% on MUStARD. Future work will focus on optimizing the retrieval of relevant contextual information and examining how retrieval quality affects performance.
In the field of machine learning, natural language processing (NLP) tasks have been shown to be a crucial part of human life. NLP tasks revolve around processing text in a specific manner and producing a useful output, with examples such as text classification, text generation, and information retrieval.
Sarcasm detection, sometimes called verbal irony detection, is an NLP task that automatically classifies text (and, in extended forms, images, audio, or video) as either sarcastic or not. Systems designed for sarcasm detection are becoming more important in the 21st century due to the growing use of sarcasm in media such as social media and television. Additionally, enhancing these automatic systems could become crucial for interpreting the intended meaning of a text.
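The results in the abstract above are reported as macro-F1, which averages the per-class F1 scores with equal weight, so a rare "sarcastic" class counts as much as the majority class. The following is a minimal sketch of that metric only; the function name and the toy labels are illustrative assumptions and are not taken from the paper's evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never occurs
        scores.append(0.0 if denom == 0 else 2 * tp / denom)
    return sum(scores) / len(scores)


# Toy example: 1 = sarcastic, 0 = not sarcastic (made-up labels)
gold = [1, 1, 0, 0]
pred = [1, 0, 0, 0]
score = macro_f1(gold, pred)
```

Here the sarcastic class has F1 = 2/3 and the non-sarcastic class has F1 = 4/5, so the macro average is 11/15.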

information retrieval, large language model, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2511.21066

Country:

Asia > Indonesia (0.69)
North America > United States (0.46)
Asia > Middle East > UAE (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

PromptGuard at BLP-2025 Task 1: A Few-Shot Classification Framework Using Majority Voting and Keyword Similarity for Bengali Hate Speech Detection

Hossan, Rakib, Dipta, Shubhashis Roy

arXiv.org Artificial Intelligence, Nov-19-2025

The BLP-2025 Task 1A requires Bengali hate speech classification into six categories. Traditional supervised approaches need extensive labeled datasets that are expensive for low-resource languages. We developed PromptGuard, a few-shot framework combining chi-square statistical analysis for keyword extraction with adaptive majority voting for decision-making. We explore statistical keyword selection versus random approaches and adaptive voting mechanisms that extend classification based on consensus quality. Chi-square keywords provide consistent improvements across categories, while adaptive voting benefits ambiguous cases requiring extended classification rounds. PromptGuard achieves a micro-F1 of 67.61, outperforming n-gram baselines (60.75) and random approaches (14.65). Ablation studies confirm chi-square-based keywords show the most consistent impact across all categories.
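The two ingredients named in the abstract above, chi-square keyword selection and adaptive majority voting, can be sketched roughly as follows. This is a sketch under stated assumptions, not PromptGuard's implementation: the function names, the whitespace tokenisation, the vote thresholds, and the `classify` callable (a stand-in for one few-shot LLM call) are all hypothetical.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class table.

    a: in-class docs with the term,  b: out-of-class docs with the term,
    c: in-class docs without it,     d: out-of-class docs without it.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def top_keywords(docs, labels, target, k=3):
    """Rank terms by chi-square association with the `target` label."""
    in_class = [set(d.split()) for d, y in zip(docs, labels) if y == target]
    out_class = [set(d.split()) for d, y in zip(docs, labels) if y != target]
    vocab = set().union(*in_class, *out_class)
    scores = {}
    for term in vocab:
        a = sum(term in d for d in in_class)
        b = sum(term in d for d in out_class)
        # Keep only terms that are over-represented in the target class.
        if a * len(out_class) > b * len(in_class):
            scores[term] = chi_square(a, b, len(in_class) - a, len(out_class) - b)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def majority_vote(votes):
    """Return the most common label and its share of the votes."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def adaptive_vote(classify, text, min_votes=3, max_votes=7, threshold=0.6):
    """Request extra votes while consensus stays below `threshold` (assumed rule)."""
    votes = [classify(text) for _ in range(min_votes)]
    label, share = majority_vote(votes)
    while share < threshold and len(votes) < max_votes:
        votes.append(classify(text))
        label, share = majority_vote(votes)
    return label, share, len(votes)


# Toy English data standing in for labelled Bengali examples.
docs = ["you are stupid idiot", "stupid people everywhere",
        "nice weather today", "lovely day today"]
labels = ["abuse", "abuse", "none", "none"]
best = top_keywords(docs, labels, "abuse", k=1)
```

On the toy data, "stupid" appears in both "abuse" documents and in no "none" document, so it receives the highest chi-square score.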

category, large language model, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2510.09771

Country:

North America > Mexico (0.28)
North America > United States > Maryland (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.97)

INSIGHT: Bridging the Student-Teacher Gap in Times of Large Language Models

Thys, Jarne, Vanbrabant, Sebe, Vanacken, Davy, Ruiz, Gustavo Rovelo

arXiv.org Artificial Intelligence, Jul-1-2025

The rise of AI, especially Large Language Models, presents challenges and opportunities to integrate such technology into the classroom. AI has the potential to revolutionize education by helping teaching staff with various tasks, such as personalizing their teaching methods, but it also raises concerns, for example, about the degradation of student-teacher interactions and user privacy. Based on interviews with teaching staff, this paper introduces INSIGHT, a proof of concept to combine various AI tools to assist teaching staff and students in the process of solving exercises. INSIGHT has a modular design that allows it to be integrated into various higher education courses. We analyze students' questions to an LLM by extracting keywords, which we use to dynamically build an FAQ from students' questions and provide new insights for the teaching staff to use for more personalized face-to-face support. Future work could build upon INSIGHT by using the collected data to provide adaptive learning and adjust content based on student progress and learning styles to offer a more interactive and inclusive learning experience.
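The idea of building an FAQ from keywords in student questions can be illustrated with a deliberately naive sketch. The abstract does not say which keyword extractor INSIGHT uses, so the stopword filter, the `min_count` threshold, and all names below are assumptions made for illustration only.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"how", "does", "why", "what", "the", "and", "for", "are", "can"}


def keywords(question):
    """Lowercase word tokens minus stopwords and very short words."""
    tokens = re.findall(r"[a-z]+", question.lower())
    return {t for t in tokens if t not in STOPWORDS and len(t) > 2}


def build_faq(questions, min_count=2):
    """Group questions under every keyword shared by at least `min_count` of them."""
    by_keyword = defaultdict(list)
    for q in questions:
        for kw in keywords(q):
            by_keyword[kw].append(q)
    frequent = {kw: qs for kw, qs in by_keyword.items() if len(qs) >= min_count}
    # Most frequently asked topics first, ties broken alphabetically.
    return dict(sorted(frequent.items(), key=lambda kv: (-len(kv[1]), kv[0])))


questions = [
    "How does recursion work?",
    "Why is my recursion infinite?",
    "What is a pointer?",
]
faq = build_faq(questions)
```

With these three made-up questions, only "recursion" is shared by two questions, so it becomes the single FAQ topic.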

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2504.17677

Country:

Europe (1.00)
Asia > Singapore (0.14)

Genre:

Instructional Material > Course Syllabus & Notes (0.69)
Research Report (0.64)

Industry:

Information Technology > Security & Privacy (1.00)
Education > Educational Setting > Higher Education (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Keyword Extraction, and Aspect Classification in Sinhala, English, and Code-Mixed Content

Rizvi, F. A., Navojith, T., Adhikari, A. M. N. H., Senevirathna, W. P. U., Kasthurirathna, Dharshana, Abeywardhana, Lakmini

arXiv.org Artificial Intelligence, Apr-16-2025

Brand reputation in the banking sector is maintained through insightful analysis of customer opinion on code-mixed and multilingual content. Conventional NLP models misclassify or ignore code-mixed text, particularly when it mixes in low-resource languages as in Sinhala-English, and fail to capture domain-specific knowledge. This study introduces a hybrid NLP method to improve keyword extraction, content filtering, and aspect-based classification of banking content. Keyword extraction in English is performed with a hybrid approach comprising a fine-tuned SpaCy NER model, FinBERT-based KeyBERT embeddings, YAKE, and EmbedRank, which results in a combined accuracy of 91.2%. Code-mixed and Sinhala keywords are extracted using a fine-tuned XLM-RoBERTa model integrated with a domain-specific Sinhala financial vocabulary, which results in an accuracy of 87.4%. To ensure data quality, irrelevant comment filtering was performed using several models, with the BERT-base-uncased model achieving 85.2% for English and XLM-RoBERTa 88.1% for Sinhala, which was better than GPT-4o, SVM, and keyword-based filtering. Aspect classification followed the same pattern, with the BERT-base-uncased model achieving 87.4% for English and XLM-RoBERTa 85.9% for Sinhala, both exceeding GPT-4 and keyword-based approaches. These findings confirm that fine-tuned transformer models outperform traditional methods in multilingual financial text analysis. The present framework offers an accurate and scalable solution for brand reputation monitoring in code-mixed and low-resource banking environments.

information retrieval, large language model, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2504.10679

Country: Asia > Sri Lanka (0.17)

Genre: Research Report (0.70)

Industry: Banking & Finance (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Leveraging ChatGPT for Sponsored Ad Detection and Keyword Extraction in YouTube Videos

Kok-Shun, Brice Valentin, Chan, Johnny

arXiv.org Artificial Intelligence, Feb-20-2025

Brice Valentin Kok-Shun, Department of Information Systems and Operations Management, University of Auckland, Auckland, New Zealand, 0000-0001-9923-5042; Johnny Chan, Department of Information Systems and Operations Management, University of Auckland, Auckland, New Zealand, 0000-0002-3535-4533
Abstract: This work-in-progress paper presents a novel approach to detecting sponsored advertisement segments in YouTube videos and comparing the advertisement with the main content. Our methodology involves the collection of 421 auto-generated and manual transcripts which are then fed into a prompt-engineered GPT-4o for ad detection, a KeyBERT for keyword extraction, and another iteration of ChatGPT for category identification. The results revealed a significant prevalence of product-related ads across various educational topics, with ad categories refined using GPT-4o into succinct 9 content and 4 advertisement categories. This approach provides a scalable and efficient alternative to traditional ad detection methods while offering new insights into the types and relevance of ads embedded within educational content. This study highlights the potential of LLMs in transforming ad detection processes and improving our understanding of advertisement strategies in digital media.
In recent years, video-sharing platforms like YouTube have become dominant sources of entertainment, education, and information [1]. YouTube is invaluable for content creators, marketers, and advertisers. One of the key features of YouTube's revenue model is the integration of sponsored advertisement (ad) segments, which allows content creators to monetize their videos while providing advertisers a direct route to target specific audiences [2].

ad segment, category, transcript, (12 more...)

arXiv.org Artificial Intelligence

2502.15102

Country:

Oceania > New Zealand > North Island > Auckland Region > Auckland (0.85)
North America > United States > New York > New York County > New York City (0.04)
North America > Canada > British Columbia > Vancouver Island > Capital Regional District > Victoria (0.04)
(6 more...)

Genre:

Research Report (0.84)
Overview > Innovation (0.34)

Industry: Marketing (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

SEKE: Specialised Experts for Keyword Extraction

Martinc, Matej, Tran, Hanh Thi Hong, Pollak, Senja, Koloski, Boshko

arXiv.org Artificial Intelligence, Dec-18-2024

Keyword extraction involves identifying the most descriptive words in a document, allowing automatic categorisation and summarisation of large quantities of diverse textual data. Relying on the insight that real-world keyword detection often requires handling of diverse content, we propose a novel supervised keyword extraction approach based on the mixture of experts (MoE) technique. MoE uses a learnable routing sub-network to direct information to specialised experts, allowing them to specialize in distinct regions of the input space. SEKE, a mixture of Specialised Experts for supervised Keyword Extraction, uses DeBERTa as the backbone model and builds on the MoE framework, where experts attend to each token, by integrating it with a recurrent neural network (RNN), to allow successful extraction even on smaller corpora, where specialisation is harder due to lack of training data. The MoE framework also provides an insight into inner workings of individual experts, enhancing the explainability of the approach. We benchmark SEKE on multiple English datasets, achieving state-of-the-art performance compared to strong supervised and unsupervised baselines. Our analysis reveals that depending on data size and type, experts specialize in distinct syntactic and semantic components, such as punctuation, stopwords, parts-of-speech, or named entities. Code is available at: https://github.com/matejMartinc/SEKE_keyword_extraction
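The routing idea described in the abstract above, where a learnable sub-network decides which expert handles each token, can be sketched generically as follows. This is a plain top-k softmax gate in pure Python and not SEKE's architecture: it omits the DeBERTa backbone and the RNN, and the router weights and expert functions are hypothetical stand-ins for learned components.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(token_vec, router_weights, experts, top_k=1):
    """Send one token vector to its top-k experts and mix their outputs.

    router_weights holds one weight row per expert; in a trained model these
    rows would be learned, here they are fixed for illustration.
    """
    logits = [sum(w * x for w, x in zip(row, token_vec)) for row in router_weights]
    gates = softmax(logits)
    chosen = sorted(range(len(experts)), key=lambda i: -gates[i])[:top_k]
    norm = sum(gates[i] for i in chosen)  # renormalise over the chosen experts
    out = [0.0] * len(token_vec)
    for i in chosen:
        expert_out = experts[i](token_vec)
        for j, value in enumerate(expert_out):
            out[j] += gates[i] / norm * value
    return out, chosen


# Two toy "experts": one doubles the vector, the other negates it.
experts = [lambda v: [2 * x for x in v], lambda v: [-x for x in v]]
router_weights = [[1.0, 0.0], [0.0, 1.0]]
output, chosen = route([3.0, 1.0], router_weights, experts, top_k=1)
```

Because the chosen expert indices are returned alongside the output, one can log which expert handled which token, which is the kind of inspection the abstract refers to when it mentions explainability.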

information retrieval, large language model, machine learning, (22 more...)

arXiv.org Artificial Intelligence

2412.14087

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Japan (0.04)
North America > Canada (0.04)
(14 more...)

Genre: Research Report (1.00)

Industry: Government > Military (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Back-of-the-Book Index Automation for Arabic Documents

Haidar, Nawal, Zaraket, Fadi A.

arXiv.org Artificial Intelligence, Oct-14-2024

Back-of-the-book indexes are crucial for book readability. Their manual creation is laborious and error prone. In this paper, we consider automating back-of-the-book index extraction for Arabic books to help simplify both the creation and review tasks. Given a back-of-the-book index, we aim to check and identify the accurate occurrences of index terms relative to the associated pages. To achieve this, we first define a pool of candidates for each term by extracting all possible noun phrases from paragraphs appearing on the relevant index pages. These noun phrases, identified through part-of-speech analysis, are stored in a vector database for efficient retrieval. We use several metrics, including exact matches, lexical similarity, and semantic similarity, to determine the most appropriate occurrence. The candidate with the highest score based on these metrics is chosen as the occurrence of the term. We fine-tuned a heuristic method that considers the above metrics and achieves an F1-score of 0.966 (precision = 0.966, recall = 0.966). These excellent results open the door for future work related to automation of back-of-the-book index generation and checking.
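The candidate-selection step described above (score each noun phrase by exact match, lexical similarity, and semantic similarity, then keep the highest scorer) might look roughly like this. The weights, the use of `difflib.SequenceMatcher` for lexical similarity, and the hand-written embedding vectors are all assumptions for illustration; the abstract does not specify the paper's heuristic, and English strings stand in for Arabic terms.

```python
import math
from difflib import SequenceMatcher


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)


def score(term, term_vec, cand, cand_vec, weights=(0.5, 0.25, 0.25)):
    """Weighted mix of exact, lexical, and semantic similarity (assumed weights)."""
    exact = 1.0 if term == cand else 0.0
    lexical = SequenceMatcher(None, term, cand).ratio()
    semantic = cosine(term_vec, cand_vec)
    w_exact, w_lex, w_sem = weights
    return w_exact * exact + w_lex * lexical + w_sem * semantic


def best_occurrence(term, term_vec, candidates):
    """Pick the candidate phrase with the highest combined score.

    `candidates` is a list of (phrase, embedding) pairs; in the paper these
    would come from a vector database, here they are toy values.
    """
    return max(candidates, key=lambda c: score(term, term_vec, c[0], c[1]))[0]


candidates = [
    ("neural networks", [0.9, 0.1]),
    ("deep learning", [0.5, 0.5]),
    ("network", [0.2, 0.8]),
]
match = best_occurrence("neural network", [1.0, 0.0], candidates)
```

In this toy pool, "neural networks" is closest to the index term on both the lexical and the semantic measure, so it is selected even though no candidate matches exactly.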

extraction, information retrieval, natural language, (18 more...)

arXiv.org Artificial Intelligence

2410.10286

Country:

Europe > United Kingdom (0.14)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)

Genre: Research Report (0.65)

Industry: Media > News (0.38)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.70)

Cross-Domain Keyword Extraction with Keyness Patterns

Zhou, Dongmei, Tang, Xuri

arXiv.org Artificial Intelligence, Sep-27-2024

Domain dependence and annotation subjectivity pose challenges for supervised keyword extraction. Based on the premises that second-order keyness patterns exist at the community level and are learnable from annotated keyword extraction datasets, this paper proposes a supervised ranking approach to keyword extraction that ranks keywords with keyness patterns consisting of independent features (such as sublanguage domain and term length) and three categories of dependent features: heuristic features, specificity features, and representavity features. The approach uses two convolutional-neural-network based models to learn keyness patterns from keyword datasets and overcomes annotation subjectivity by training the two models with a bootstrap sampling strategy. Experiments demonstrate that the approach not only achieves state-of-the-art performance on ten keyword datasets in general supervised keyword extraction, with an average top-10-F-measure of 0.316, but also delivers robust cross-domain performance, with an average top-10-F-measure of 0.346 on four datasets that are excluded from the training process. Such cross-domain robustness is attributed to the fact that community-level keyness patterns are limited in number and temperately independent of language domains, the distinction between independent features and dependent features, and the sampling training strategy that balances excess risk and lack of negative training data.
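The top-10-F-measure quoted above scores only the ten highest-ranked keywords of each document against the gold annotations. A minimal sketch of one common way to compute it follows; the paper may differ in details such as stemming, how precision is normalised, or how scores are averaged across documents, so the function name and conventions here are assumptions.

```python
def f_at_k(predicted, gold, k=10):
    """F-measure of the top-k predicted keywords against gold keywords.

    Assumes `predicted` is ranked best-first and contains no duplicates.
    Precision is taken over the keywords actually returned (at most k).
    """
    top = predicted[:k]
    gold_set = set(gold)
    hits = sum(1 for kw in top if kw in gold_set)
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Made-up ranking and gold annotations for a single document.
predicted = ["keyword extraction", "neural network", "keyness", "bootstrap"]
gold = ["keyword extraction", "keyness", "annotation subjectivity"]
f10 = f_at_k(predicted, gold)
```

Two of the four predictions are correct (precision 1/2) and two of the three gold keywords are found (recall 2/3), giving an F-measure of 4/7. A dataset-level figure such as 0.316 would then be an average of this value over documents.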

information retrieval, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2409.18724

Country:

Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > China > Hubei Province > Wuhan (0.04)
North America > United States > New York > New York County > New York City (0.04)
(2 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Media (0.67)
Transportation (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

An Improved Method for Class-specific Keyword Extraction: A Case Study in the German Business Registry

Meisenbacher, Stephen, Schopf, Tim, Yan, Weixin, Holl, Patrick, Matthes, Florian

arXiv.org Artificial Intelligence, Jul-19-2024

The task of keyword extraction is often an important initial step in unsupervised information extraction, forming the basis for tasks such as topic modeling or document classification. While recent methods have proven to be quite effective in the extraction of keywords, the identification of class-specific keywords, or only those pertaining to a predefined class, remains challenging. In this work, we propose an improved method for class-specific keyword extraction, which builds upon the popular KeyBERT library to identify only keywords related to a class described by seed keywords. We test this method using a dataset of German business registry entries, where the goal is to classify each business according to an economic sector. Our results reveal that our method greatly improves upon previous approaches, setting a new standard for class-specific keyword extraction.

extraction, keyword, seed keyword, (13 more...)

arXiv.org Artificial Intelligence

2407.14085

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Austria > Vienna (0.14)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
(8 more...)

Genre: Research Report (0.84)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Large Language Model Enhanced Clustering for News Event Detection

Tarekegn, Adane Nega

arXiv.org Artificial Intelligence, Jul-6-2024

The news landscape is continuously evolving, with an ever-increasing volume of information from around the world. Automated event detection within this vast data repository is essential for monitoring, identifying, and categorizing significant news occurrences across diverse platforms. This paper presents an event detection framework that leverages Large Language Models (LLMs) combined with clustering analysis to detect news events from the Global Database of Events, Language, and Tone (GDELT). The framework enhances event clustering through both pre-event detection tasks (keyword extraction and text embedding) and post-event detection tasks (event summarization and topic labelling). We also evaluate the impact of various textual embeddings on the quality of clustering outcomes, ensuring robust news categorization. Additionally, we introduce a novel Cluster Stability Assessment Index (CSAI) to assess the validity and robustness of clustering results. CSAI utilizes multiple feature vectors to provide a new way of measuring clustering quality. Our experiments indicate that the use of LLM embedding in the event detection framework has significantly improved the results, demonstrating greater robustness in terms of CSAI scores. Moreover, post-event detection tasks generate meaningful insights, facilitating effective interpretation of event clustering results. Overall, our experimental results indicate that the proposed framework offers valuable insights and could enhance the accuracy in news analysis and reporting.

algorithm, csai score, llm, (13 more...)

arXiv.org Artificial Intelligence

2406.10552

Country:

North America > United States > Colorado (0.04)
Europe > Norway > Western Norway > Vestland > Bergen (0.04)

Genre: Research Report > New Finding (0.88)

Industry: Media > News (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
