keyword extraction
Context-Aware Pragmatic Metacognitive Prompting for Sarcasm Detection
Iskandardinata, Michael, Christian, William, Suhartono, Derwin
Abstract--Detecting sarcasm remains a challenging task in Natural Language Processing (NLP) despite recent advances in neural network approaches. Currently, Pre-trained Language Models (PLMs) and Large Language Models (LLMs) are the preferred approach for sarcasm detection. However, the complexity of sarcastic text, combined with linguistic diversity and cultural variation across communities, makes the task difficult even for PLMs and LLMs. These models are also unreliable on words or tokens that require extra grounding for analysis. Building on a state-of-the-art prompting method in LLMs for sarcasm detection called Pragmatic Metacognitive Prompting (PMP), we introduce a retrieval-aware approach that incorporates retrieved contextual information for each target text. Our pipeline explores two complementary ways to provide context: adding non-parametric knowledge using web-based retrieval when the model lacks necessary background, and eliciting the model's own internal knowledge for a self-knowledge awareness strategy. We evaluated our approach on three datasets: Twitter Indonesia Sarcastic, SemEval-2018 Task 3, and MUStARD. Non-parametric retrieval resulted in a significant 9.87% macro-F1 improvement on Twitter Indonesia Sarcastic compared to the original PMP method. Self-knowledge retrieval improves macro-F1 by 3.29% on SemEval-2018 Task 3 and by 4.08% on MUStARD. Future work will focus on optimizing the retrieval of relevant contextual information and examining how retrieval quality affects performance.
In the field of machine learning, natural language processing (NLP) tasks have been shown to be a crucial part of human life. NLP tasks revolve around processing text in a specific manner to produce a useful output; examples include text classification, text generation, and information retrieval.
Sarcasm detection, also called verbal irony detection, is an NLP task that automatically classifies text (and, in extended forms, images, audio, or video) as either sarcastic or not. Systems designed for sarcasm detection are becoming more important in the 21st century, owing to the growing use of sarcasm in media such as social media and television and the corresponding growth of sarcasm detection datasets. Additionally, improving these automatic systems could become crucial for interpreting the intended meaning of a text.
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Asia > Indonesia > Borneo > Kalimantan > East Kalimantan > Nusantara (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
PromptGuard at BLP-2025 Task 1: A Few-Shot Classification Framework Using Majority Voting and Keyword Similarity for Bengali Hate Speech Detection
Hossan, Rakib, Dipta, Shubhashis Roy
The BLP-2025 Task 1A requires Bengali hate speech classification into six categories. Traditional supervised approaches need extensive labeled datasets that are expensive for low-resource languages. We developed PromptGuard, a few-shot framework combining chi-square statistical analysis for keyword extraction with adaptive majority voting for decision-making. We explore statistical keyword selection versus random approaches and adaptive voting mechanisms that extend classification based on consensus quality. Chi-square keywords provide consistent improvements across categories, while adaptive voting benefits ambiguous cases requiring extended classification rounds. PromptGuard achieves a micro-F1 of 67.61, outperforming n-gram baselines (60.75) and random approaches (14.65). Ablation studies confirm chi-square-based keywords show the most consistent impact across all categories.
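The chi-square keyword selection mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `chi_square_keywords`, the whitespace tokenization, and the toy English data are assumptions made for illustration. It scores each token with the standard 2x2 chi-square statistic over document counts and keeps only tokens that are positively associated with the target class.

```python
from collections import Counter


def chi_square_keywords(docs, labels, target, top_k=3):
    """Rank tokens by chi-square association with the `target` label.

    Hypothetical helper: whitespace tokenization and document-level
    presence counts are simplifying assumptions.
    """
    n = len(docs)
    n_target = sum(1 for y in labels if y == target)
    in_target, in_other = Counter(), Counter()
    for text, y in zip(docs, labels):
        tokens = set(text.lower().split())  # count each token once per document
        (in_target if y == target else in_other).update(tokens)

    scores = {}
    for tok in set(in_target) | set(in_other):
        a = in_target[tok]           # target docs containing tok
        b = in_other[tok]            # other docs containing tok
        c = n_target - a             # target docs without tok
        d = (n - n_target) - b       # other docs without tok
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        if denom == 0:
            continue                 # token appears in every doc or one class is empty
        if a * d > b * c:            # keep only positive association with target
            scores[tok] = n * (a * d - b * c) ** 2 / denom
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


docs = ["you idiot fool", "idiot go away", "nice day today", "lovely day friend"]
labels = ["hate", "hate", "none", "none"]
print(chi_square_keywords(docs, labels, target="hate", top_k=1))
```

In a few-shot setting, keywords selected this way could be placed in the prompt as category hints; how PromptGuard actually uses them is not specified beyond the abstract.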
- Asia > Bangladesh (0.04)
- North America > United States > Maryland > Baltimore County (0.04)
- North America > United States > Maryland > Baltimore (0.04)
Using Artificial Intuition in Distinct, Minimalist Classification of Scientific Abstracts for Management of Technology Portfolios
Ranka, Prateek, Morstatter, Fred, Graddy-Reed, Alexandra, Belz, Andrea
Classification of scientific abstracts is useful for strategic activities but challenging to automate because the sparse text provides few contextual clues. Metadata associated with the scientific publication can be used to improve performance but still often requires a semi-supervised setting. Moreover, such schemes may generate labels that lack distinction -- namely, they overlap and thus do not uniquely define the abstract. In contrast, experts label and sort these texts with ease. Here we describe an application of a process we call artificial intuition to replicate the expert's approach, using a Large Language Model (LLM) to generate metadata. We use publicly available abstracts from the United States National Science Foundation to create a set of labels, and then we test this on a set of abstracts from the Chinese National Natural Science Foundation to examine funding trends. We demonstrate the feasibility of this method for research portfolio management, technology scouting, and other strategic activities.
- North America > United States > California (0.15)
- Asia > China (0.05)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Asia > Singapore (0.04)
INSIGHT: Bridging the Student-Teacher Gap in Times of Large Language Models
Thys, Jarne, Vanbrabant, Sebe, Vanacken, Davy, Ruiz, Gustavo Rovelo
The rise of AI, especially Large Language Models, presents challenges and opportunities to integrate such technology into the classroom. AI has the potential to revolutionize education by helping teaching staff with various tasks, such as personalizing their teaching methods, but it also raises concerns, for example, about the degradation of student-teacher interactions and user privacy. Based on interviews with teaching staff, this paper introduces INSIGHT, a proof of concept to combine various AI tools to assist teaching staff and students in the process of solving exercises. INSIGHT has a modular design that allows it to be integrated into various higher education courses. We analyze students' questions to an LLM by extracting keywords, which we use to dynamically build an FAQ from students' questions and provide new insights for the teaching staff to use for more personalized face-to-face support. Future work could build upon INSIGHT by using the collected data to provide adaptive learning and adjust content based on student progress and learning styles to offer a more interactive and inclusive learning experience.
- North America > United States > Virginia (0.04)
- Europe > Switzerland (0.04)
- Europe > Italy > Sicily > Palermo (0.04)
- Instructional Material > Course Syllabus & Notes (0.69)
- Research Report (0.64)
- Information Technology > Security & Privacy (1.00)
- Education > Educational Setting > Higher Education (0.49)
Keyword Extraction, and Aspect Classification in Sinhala, English, and Code-Mixed Content
Rizvi, F. A., Navojith, T., Adhikari, A. M. N. H., Senevirathna, W. P. U., Kasthurirathna, Dharshana, Abeywardhana, Lakmini
Brand reputation in the banking sector is maintained through insightful analysis of customer opinion on code-mixed and multilingual content. Conventional NLP models misclassify or ignore code-mixed text, particularly when it involves low-resource languages as in Sinhala-English content, and fail to capture domain-specific knowledge. This study introduces a hybrid NLP method to improve keyword extraction, content filtering, and aspect-based classification of banking content. Keyword extraction in English is performed with a hybrid approach comprising a fine-tuned SpaCy NER model, FinBERT-based KeyBERT embeddings, YAKE, and EmbedRank, which results in a combined accuracy of 91.2%. Code-mixed and Sinhala keywords are extracted using a fine-tuned XLM-RoBERTa model integrated with a domain-specific Sinhala financial vocabulary, which results in an accuracy of 87.4%. To ensure data quality, irrelevant comments were filtered using several models, with the BERT-base-uncased model achieving 85.2% for English and XLM-RoBERTa 88.1% for Sinhala, both better than GPT-4o, SVM, and keyword-based filtering. Aspect classification followed the same pattern, with the BERT-base-uncased model achieving 87.4% for English and XLM-RoBERTa 85.9% for Sinhala, both exceeding GPT-4 and keyword-based approaches. These findings confirm that fine-tuned transformer models outperform traditional methods in multilingual financial text analysis. The present framework offers an accurate and scalable solution for brand reputation monitoring in code-mixed and low-resource banking environments.
- Asia > Sri Lanka > Western Province > Colombo > Colombo (0.05)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Leveraging ChatGPT for Sponsored Ad Detection and Keyword Extraction in YouTube Videos
Kok-Shun, Brice Valentin, Chan, Johnny
Brice Valentin Kok-Shun, Department of Information Systems and Operations Management, University of Auckland, Auckland, New Zealand, 0000-0001-9923-5042
Johnny Chan, Department of Information Systems and Operations Management, University of Auckland, Auckland, New Zealand, 0000-0002-3535-4533
Abstract--This work-in-progress paper presents a novel approach to detecting sponsored advertisement segments in YouTube videos and comparing the advertisement with the main content. Our methodology involves the collection of 421 auto-generated and manual transcripts which are then fed into a prompt-engineered GPT-4o for ad detection, a KeyBERT for keyword extraction, and another iteration of ChatGPT for category identification. The results revealed a significant prevalence of product-related ads across various educational topics, with ad categories refined using GPT-4o into succinct 9 content and 4 advertisement categories. This approach provides a scalable and efficient alternative to traditional ad detection methods while offering new insights into the types and relevance of ads embedded within educational content. This study highlights the potential of LLMs in transforming ad detection processes and improving our understanding of advertisement strategies in digital media.
In recent years, video-sharing platforms like YouTube have become dominant sources of entertainment, education, and information [1]. YouTube is invaluable for content creators, marketers, and advertisers. One of the key features of YouTube's revenue model is the integration of sponsored advertisement (ad) segments, which allows content creators to monetize their videos while providing advertisers a direct route to target specific audiences [2].
- Oceania > New Zealand > North Island > Auckland Region > Auckland (0.85)
- North America > United States > New York > New York County > New York City (0.04)
- North America > Canada > British Columbia > Vancouver Island > Capital Regional District > Victoria (0.04)
- Research Report (0.84)
- Overview > Innovation (0.34)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
SEKE: Specialised Experts for Keyword Extraction
Martinc, Matej, Tran, Hanh Thi Hong, Pollak, Senja, Koloski, Boshko
Keyword extraction involves identifying the most descriptive words in a document, allowing automatic categorisation and summarisation of large quantities of diverse textual data. Relying on the insight that real-world keyword detection often requires handling of diverse content, we propose a novel supervised keyword extraction approach based on the mixture of experts (MoE) technique. MoE uses a learnable routing sub-network to direct information to specialised experts, allowing them to specialize in distinct regions of the input space. SEKE, a mixture of Specialised Experts for supervised Keyword Extraction, uses DeBERTa as the backbone model and builds on the MoE framework, where experts attend to each token, by integrating it with a recurrent neural network (RNN), to allow successful extraction even on smaller corpora, where specialisation is harder due to lack of training data. The MoE framework also provides an insight into inner workings of individual experts, enhancing the explainability of the approach. We benchmark SEKE on multiple English datasets, achieving state-of-the-art performance compared to strong supervised and unsupervised baselines. Our analysis reveals that depending on data size and type, experts specialize in distinct syntactic and semantic components, such as punctuation, stopwords, parts-of-speech, or named entities. Code is available at: https://github.com/matejMartinc/SEKE_keyword_extraction
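The routing idea described in this abstract can be sketched in a few lines. This is a generic, dense (soft) mixture-of-experts layer for a single token vector, not SEKE's architecture: the function names, the linear experts, and the toy weights are all assumptions, and the DeBERTa backbone and RNN integration are omitted entirely.

```python
import math


def softmax(zs):
    """Numerically stable softmax over a list of floats."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def moe_token_score(x, gate_weights, expert_weights):
    """Score one token vector `x` with a soft mixture of linear experts.

    gate_weights:   one weight vector per expert for the routing sub-network
    expert_weights: one weight vector per (hypothetical) linear expert
    Returns the gated score and the routing distribution.
    """
    gates = softmax([dot(g, x) for g in gate_weights])      # routing probabilities
    expert_outs = [dot(w, x) for w in expert_weights]        # each expert's score
    score = sum(g * o for g, o in zip(gates, expert_outs))   # gate-weighted sum
    return score, gates


# Toy example: a uniform router over two experts.
score, gates = moe_token_score(
    x=[1.0, 0.0],
    gate_weights=[[0.0, 0.0], [0.0, 0.0]],
    expert_weights=[[2.0, 0.0], [4.0, 0.0]],
)
print(score, gates)
```

Inspecting `gates` per token is what makes this kind of model partly interpretable: it shows which expert the router favours for a given input, in line with the abstract's remark that experts specialize in components such as punctuation or named entities.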
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Japan (0.04)
- North America > Canada (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Back-of-the-Book Index Automation for Arabic Documents
Haidar, Nawal, Zaraket, Fadi A.
Back-of-the-book indexes are crucial for book readability. Their manual creation is laborious and error-prone. In this paper, we consider automating back-of-the-book index extraction for Arabic books to help simplify both the creation and review tasks. Given a back-of-the-book index, we aim to check and identify the accurate occurrences of index terms relative to the associated pages. To achieve this, we first define a pool of candidates for each term by extracting all possible noun phrases from paragraphs appearing on the relevant index pages. These noun phrases, identified through part-of-speech analysis, are stored in a vector database for efficient retrieval. We use several metrics, including exact matches, lexical similarity, and semantic similarity, to determine the most appropriate occurrence. The candidate with the highest score based on these metrics is chosen as the occurrence of the term. We fine-tuned a heuristic method that considers the above metrics and achieves an F1-score of 0.966 (precision = 0.966, recall = 0.966). These excellent results open the door for future work related to automation of back-of-the-book index generation and checking.
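The candidate-selection step described above (exact match, lexical similarity, and semantic similarity combined into one score) can be sketched as follows. This is an illustrative assumption, not the paper's heuristic: the weights, the use of `difflib.SequenceMatcher` for lexical similarity, the cosine over precomputed vectors, and the English toy data are all hypothetical stand-ins.

```python
import math
from difflib import SequenceMatcher


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


def best_occurrence(term, term_vec, candidates, weights=(0.5, 0.25, 0.25)):
    """Pick the candidate noun phrase that best matches an index term.

    candidates: list of (phrase, embedding) pairs; embeddings are assumed
                to come from some external model or vector database.
    weights:    hypothetical weights for (exact, lexical, semantic) scores.
    """
    w_exact, w_lex, w_sem = weights

    def score(candidate):
        phrase, vec = candidate
        exact = 1.0 if phrase == term else 0.0
        lexical = SequenceMatcher(None, term, phrase).ratio()
        semantic = cosine(term_vec, vec)
        return w_exact * exact + w_lex * lexical + w_sem * semantic

    return max(candidates, key=score)[0]


candidates = [("neural networks", [0.9, 0.1]), ("river bank", [0.0, 1.0])]
print(best_occurrence("neural network", [1.0, 0.0], candidates))
```

Tuning the weights against a labeled set of index terms would be one way to "fine-tune" such a heuristic, though the abstract does not say how the authors did so.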
- Europe > United Kingdom (0.14)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
- Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)
Cross-Domain Keyword Extraction with Keyness Patterns
Domain dependence and annotation subjectivity pose challenges for supervised keyword extraction. Based on the premises that second-order keyness patterns exist at the community level and are learnable from annotated keyword extraction datasets, this paper proposes a supervised ranking approach to keyword extraction that ranks keywords with keyness patterns consisting of independent features (such as sublanguage domain and term length) and three categories of dependent features: heuristic features, specificity features, and representativity features. The approach uses two convolutional-neural-network based models to learn keyness patterns from keyword datasets and overcomes annotation subjectivity by training the two models with a bootstrap sampling strategy. Experiments demonstrate that the approach not only achieves state-of-the-art performance on ten keyword datasets in general supervised keyword extraction, with an average top-10 F-measure of 0.316, but also delivers robust cross-domain performance, with an average top-10 F-measure of 0.346 on four datasets that are excluded from the training process. Such cross-domain robustness is attributed to the fact that community-level keyness patterns are limited in number and temperately independent of language domains, to the distinction between independent features and dependent features, and to the sampling training strategy that balances excess risk and lack of negative training data.
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Asia > China > Hubei Province > Wuhan (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Media (0.67)
- Transportation (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
An Improved Method for Class-specific Keyword Extraction: A Case Study in the German Business Registry
Meisenbacher, Stephen, Schopf, Tim, Yan, Weixin, Holl, Patrick, Matthes, Florian
The task of keyword extraction is often an important initial step in unsupervised information extraction, forming the basis for tasks such as topic modeling or document classification. While recent methods have proven to be quite effective in the extraction of keywords, the identification of class-specific keywords, or only those pertaining to a predefined class, remains challenging. In this work, we propose an improved method for class-specific keyword extraction, which builds upon the popular KeyBERT library to identify only keywords related to a class described by seed keywords. We test this method using a dataset of German business registry entries, where the goal is to classify each business according to an economic sector. Our results reveal that our method greatly improves upon previous approaches, setting a new standard for class-specific keyword extraction.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Austria > Vienna (0.14)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)