Clickbait


An Interpretable Benchmark for Clickbait Detection and Tactic Attribution

Nofar, Lihi, Portal, Tomer, Elbaz, Aviv, Apartsin, Alexander, Aperstein, Yehudit

arXiv.org Artificial Intelligence

The proliferation of clickbait headlines poses significant challenges to the credibility of information and user trust in digital media. While recent advances in machine learning have improved the detection of manipulative content, the lack of explainability limits their practical adoption. This paper presents a model for explainable clickbait detection that not only identifies clickbait titles but also attributes them to specific linguistic manipulation strategies. We introduce a synthetic dataset generated by systematically augmenting real news headlines using a predefined catalogue of clickbait strategies. This dataset enables controlled experimentation and detailed analysis of model behaviour. We present a two-stage framework for automatic clickbait analysis comprising detection and tactic attribution. In the first stage, we compare a fine-tuned BERT classifier with large language models (LLMs), specifically GPT-4.0 and Gemini 2.4 Flash, under both zero-shot prompting and few-shot prompting enriched with illustrative clickbait headlines and their associated persuasive tactics. In the second stage, a dedicated BERT-based classifier predicts the specific clickbait strategies present in each headline. We share the dataset with the research community at https://github.com/LLM-HITCS25S/ClickbaitTacticsDetection

The widespread use of clickbait headlines in digital media has become a pervasive challenge, undermining the credibility of information and exploiting user attention through manipulative linguistic techniques. While automated systems for detecting clickbait have improved in recent years, their focus has remained mainly on binary classification, simply labelling content as clickbait or not. However, effective mitigation of such content requires going beyond detection to understanding how and why certain headlines manipulate readers. Specifically, it is crucial to evaluate whether current AI models can accurately recognize and distinguish the diverse linguistic styles and persuasive strategies commonly employed in clickbait.
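
The two-stage structure described above (binary detection, then tactic attribution) can be sketched in miniature. The paper's fine-tuned BERT models are not reproduced here; this sketch substitutes a TF-IDF/logistic-regression stand-in, and the headlines and tactic names are illustrative, not taken from the paper's catalogue:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy training data; labels and tactic names are hypothetical examples.
headlines = [
    "You won't believe what this doctor found",
    "10 secrets celebrities don't want you to know",
    "This one trick will change your life forever",
    "Government publishes annual budget report",
    "Local council approves new road maintenance plan",
    "Central bank holds interest rates steady",
]
is_clickbait = [1, 1, 1, 0, 0, 0]
tactics = [["curiosity_gap"], ["listicle", "curiosity_gap"], ["exaggeration"], [], [], []]

# Stage 1: binary clickbait detection.
detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(headlines, is_clickbait)

# Stage 2: multi-label tactic attribution, trained only on the clickbait examples.
mlb = MultiLabelBinarizer()
tactic_y = mlb.fit_transform([t for t, y in zip(tactics, is_clickbait) if y])
clickbait_only = [h for h, y in zip(headlines, is_clickbait) if y]
attributor = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
attributor.fit(clickbait_only, tactic_y)

def analyze(headline):
    """Return the predicted tactic labels, or an empty list if not clickbait."""
    if detector.predict([headline])[0] == 0:
        return []
    return list(mlb.inverse_transform(attributor.predict([headline]))[0])
```

Training the attributor only on positives mirrors the pipeline's division of labour: the second-stage classifier never has to model non-clickbait headlines.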


What Makes You CLIC: Detection of Croatian Clickbait Headlines

Anđelić, Marija, Šipek, Dominik, Majer, Laura, Šnajder, Jan

arXiv.org Artificial Intelligence

Online news outlets operate predominantly on an advertising-based revenue model, compelling journalists to create headlines that are often scandalous, intriguing, and provocative -- commonly referred to as clickbait. Automatic detection of clickbait headlines is essential for preserving information quality and reader trust in digital media and requires both contextual understanding and world knowledge. For this task, particularly in less-resourced languages, it remains unclear whether fine-tuned methods or in-context learning (ICL) yield better results. In this paper, we compile CLIC, a novel dataset for clickbait detection of Croatian news headlines spanning a 20-year period and encompassing mainstream and fringe outlets. We fine-tune the BERTić model on this task and compare its performance to LLM-based ICL methods with prompts both in Croatian and English. Finally, we analyze the linguistic properties of clickbait. We find that nearly half of the analyzed headlines contain clickbait, and that fine-tuned models deliver better results than general LLMs.


Te Ahorré Un Click: A Revised Definition of Clickbait and Detection in Spanish News

Mordecki, Gabriel, Moncecchi, Guillermo, Couto, Javier

arXiv.org Artificial Intelligence

We revise the definition of clickbait, which lacks current consensus, and argue that the creation of a curiosity gap is the key concept that distinguishes clickbait from related phenomena such as sensationalism and headlines that do not deliver what they promise or diverge from the article. We therefore propose a new definition: clickbait is a technique for generating headlines and teasers that deliberately omit part of the information with the goal of raising readers' curiosity, capturing their attention and enticing them to click. We introduce a new approach to creating clickbait detection datasets, refining the concept's limits and annotation criteria to minimize subjectivity in the decision as much as possible. Following this approach, we create and release TA1C (for Te Ahorré Un Click, Spanish for Saved You A Click), the first open-source dataset for clickbait detection in Spanish. It consists of 3,500 tweets from 18 well-known media sources, manually annotated with a Fleiss' κ inter-annotator agreement of 0.825. We implement strong baselines that achieve an F1-score of 0.84.
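
The reported 0.825 Fleiss' κ is a chance-corrected agreement statistic computed from a matrix of per-item category counts. A small self-contained implementation of the standard Fleiss' κ formula (not the authors' code) shows how it is derived:

```python
import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' kappa for an (n_items, n_categories) matrix of rating counts.

    Each row holds, for one item, how many raters assigned each category;
    every item is assumed to be rated by the same number of raters.
    """
    ratings = np.asarray(ratings, dtype=float)
    n = ratings.sum(axis=1)[0]                        # raters per item
    p_j = ratings.sum(axis=0) / ratings.sum()         # overall category proportions
    p_i = (np.square(ratings).sum(axis=1) - n) / (n * (n - 1))  # per-item agreement
    p_bar = p_i.mean()                                # observed agreement
    p_e = np.square(p_j).sum()                        # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)
```

For example, three items each rated identically by three annotators (`[[3, 0], [0, 3], [3, 0]]`) yield κ = 1.0, while maximally split ratings drive κ below zero.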


Baitradar: A Multi-Model Clickbait Detection Algorithm Using Deep Learning

Gamage, Bhanuka, Labib, Adnan, Joomun, Aisha, Lim, Chern Hong, Wong, KokSheik

arXiv.org Artificial Intelligence

Following the rising popularity of YouTube, an emerging problem on the platform is clickbait, which provokes users into clicking on videos through attractive titles and thumbnails. As a result, users end up watching videos whose content does not match what the title advertises. This study addresses the issue by proposing an algorithm called BaitRadar, a deep learning approach in which six inference models are jointly consulted to make the final classification decision. These models focus on different attributes of the video, including the title, comments, thumbnail, tags, video statistics and audio transcript. The final classification is obtained by averaging the outputs of the individual models, providing a robust and accurate result even in situations where data is missing. The proposed method is tested on 1,400 YouTube videos. On average, a test accuracy of 98% is achieved with an inference time of less than 2 s.
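
The fusion step described above, averaging six per-modality models while tolerating missing inputs, can be sketched as follows. This is a hedged illustration, assuming each model emits a probability in [0, 1]; the function and modality names are hypothetical, not BaitRadar's actual code:

```python
def fuse_modality_scores(model_scores, threshold=0.5):
    """Average the available per-modality clickbait probabilities.

    model_scores: dict mapping modality name -> probability in [0, 1],
    or None when that input is unavailable (e.g. comments disabled).
    Returns the average score and the resulting clickbait verdict.
    """
    available = [p for p in model_scores.values() if p is not None]
    if not available:
        raise ValueError("no modality produced a score")
    avg = sum(available) / len(available)
    return avg, avg >= threshold
```

Skipping `None` entries rather than imputing them is what lets the ensemble stay robust when a video lacks comments or tags: the decision simply rests on the modalities that are present.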


Multimodal Clickbait Detection by De-confounding Biases Using Causal Representation Inference

Yu, Jianxing, Wang, Shiqi, Yin, Han, Sun, Zhenlong, Xie, Ruobing, Zhang, Bo, Rao, Yanghui

arXiv.org Artificial Intelligence

This paper focuses on detecting clickbait posts on the Web. These posts often use eye-catching disinformation across mixed modalities to mislead users into clicking for profit. This harms the user experience, so such posts are typically blocked by content providers. To escape detection, malicious creators add irrelevant non-bait content to bait posts, dressing them up as legitimate to fool the detector. This content often has biased correlations with non-bait labels, yet traditional detectors tend to make predictions based on simple co-occurrence rather than grasping the inherent factors that drive malicious behavior. Such spurious bias easily causes misjudgments. To address this problem, we propose a new debiased method based on causal inference. We first employ a set of features in multiple modalities to characterize the posts. Since these features are often mixed up with unknown biases, we then disentangle three kinds of latent factors from them: the invariant factor that indicates intrinsic bait intention, the causal factor that reflects deceptive patterns in a given scenario, and non-causal noise. By eliminating the noise that causes bias, we can use the invariant and causal factors to build a robust model with good generalization ability. Experiments on three popular datasets show the effectiveness of our approach.


What Drives Online Popularity: Author, Content or Sharers? Estimating Spread Dynamics with Bayesian Mixture Hawkes

Calderon, Pio, Rizoiu, Marian-Andrei

arXiv.org Artificial Intelligence

The spread of content on social media is shaped by intertwining factors on three levels: the source, the content itself, and the pathways of content spread. At the lowest level, the popularity of the sharing user determines the content's eventual reach. However, higher-level factors such as the nature of the online item and the credibility of its source also play crucial roles in determining how widely and rapidly the online item spreads. In this work, we propose the Bayesian Mixture Hawkes (BMH) model to jointly learn the influence of source, content and spread. We formulate the BMH model as a hierarchical mixture model of separable Hawkes processes, accommodating different classes of Hawkes dynamics and the influence of feature sets on these classes. We test the BMH model on two learning tasks, cold-start popularity prediction and temporal profile generalization performance, applying it to two real-world retweet cascade datasets referencing articles from controversial and traditional media publishers. The BMH model outperforms the state-of-the-art models and predictive baselines on both datasets and utilizes cascade- and item-level information better than the alternatives. Lastly, we perform a counter-factual analysis in which we apply the trained publisher-level BMH models to a set of article headlines and show that the effectiveness of headline writing style (neutral, clickbait, inflammatory) varies across publishers. The BMH model unveils differences in style effectiveness between controversial and reputable publishers: we find clickbait to be notably more effective for reputable publishers than for controversial ones, which links to the latter's overuse of clickbait.


Generating clickbait spoilers with an ensemble of large language models

Woźny, Mateusz, Lango, Mateusz

arXiv.org Artificial Intelligence

Clickbait posts are a widespread problem on the web. The generation of spoilers, i.e. short texts that neutralize clickbait by providing the information that satisfies the curiosity it induces, is one proposed solution to the problem. Current state-of-the-art methods are based on passage retrieval or question answering approaches and are limited to generating spoilers only in the form of a phrase or a passage. In this work, we propose an ensemble of fine-tuned large language models for clickbait spoiler generation. Our approach is not limited to phrase or passage spoilers, but can also generate multipart spoilers that refer to several non-consecutive parts of the text. Experimental evaluation demonstrates that the proposed ensemble outperforms the baselines in terms of BLEU, METEOR and BERTScore metrics.


Mitigating Clickbait: An Approach to Spoiler Generation Using Multitask Learning

Pal, Sayantan, Das, Souvik, Srihari, Rohini K.

arXiv.org Artificial Intelligence

This study introduces 'clickbait spoiling', a novel technique designed to detect, categorize, and generate spoilers as succinct text responses, countering the curiosity induced by clickbait content. By leveraging a multi-task learning framework, our model's generalization capabilities are significantly enhanced, effectively addressing the pervasive issue of clickbait. The crux of our research lies in generating appropriate spoilers, be it a phrase, an extended passage, or multiple spoilers, depending on the spoiler type required. Our methodology integrates two crucial techniques: a refined spoiler categorization method and a modified version of the Question Answering (QA) mechanism, incorporated within a multi-task learning paradigm for optimized spoiler extraction from context. Notably, we include fine-tuning methods for models capable of handling longer sequences, to accommodate the generation of extended spoilers. This research highlights the potential of sophisticated text processing techniques in tackling the omnipresent issue of clickbait, promising an enhanced user experience in the digital realm.


Maintaining Journalistic Integrity in the Digital Age: A Comprehensive NLP Framework for Evaluating Online News Content

Bojic, Ljubisa, Prodanovic, Nikola, Samala, Agariadne Dwinggo

arXiv.org Artificial Intelligence

The rapid growth of online news platforms has led to an increased need for reliable methods to evaluate the quality and credibility of news articles. This paper proposes a comprehensive framework to analyze online news texts using natural language processing (NLP) techniques, particularly a language model specifically trained for this purpose, alongside other well-established NLP methods. The framework incorporates ten journalism standards (objectivity; balance and fairness; readability and clarity; sensationalism and clickbait; ethical considerations; public interest and value; source credibility; relevance and timeliness; factual accuracy; and attribution and transparency) to assess the quality of news articles. By establishing these standards, researchers, media organizations, and readers can better evaluate and understand the content they consume and produce. The proposed method has some limitations, such as potential difficulty in detecting subtle biases and the need for continuous updating of the language model to keep pace with evolving language patterns.


Not all Fake News is Written: A Dataset and Analysis of Misleading Video Headlines

Sung, Yoo Yeon, Boyd-Graber, Jordan, Hassan, Naeemul

arXiv.org Artificial Intelligence

Polarization and the marketplace for impressions have conspired to make navigating information online difficult for users, and while there has been a significant effort to detect false or misleading text, multimodal datasets have received considerably less attention. To complement existing resources, we present multimodal Video Misleading Headline (VMH), a dataset that consists of videos and whether annotators believe the headline is representative of the video's contents. After collecting and annotating this dataset, we analyze multimodal baselines for detecting misleading headlines. Our annotation process also focuses on why annotators view a video as misleading, allowing us to better understand the interplay of annotators' background and the content of the videos.