Goto

Collaborating Authors

 body text


Empowering Interdisciplinary Research with BERT-Based Models: An Approach Through SciBERT-CNN with Topic Modeling

arXiv.org Artificial Intelligence

Researchers must stay current in their fields by regularly reviewing academic literature, a task complicated by the daily publication of thousands of papers. Traditional multi-label text classification methods often ignore semantic relationships and fail to address the inherent class imbalances. This paper introduces a novel approach using the SciBERT model and CNNs to systematically categorize academic abstracts from the Elsevier OA CC-BY corpus. We use a multi-segment input strategy that processes abstracts, body text, titles, and keywords obtained via BERT topic modeling through SciBERT. Here, the [CLS] token embeddings capture the contextual representation of each segment, concatenated and processed through a CNN. The CNN uses convolution and pooling to enhance feature extraction and reduce dimensionality, optimizing the data for classification. Additionally, we incorporate class weights based on label frequency to address the class imbalance, significantly improving the classification F1 score and enhancing text classification systems and literature review efficiency.


ISQA: Informative Factuality Feedback for Scientific Summarization

arXiv.org Artificial Intelligence

We propose Iterative Facuality Refining on Informative Scientific Question-Answering (ISQA) feedback\footnote{Code is available at \url{https://github.com/lizekai-richard/isqa}}, a method following human learning theories that employs model-generated feedback consisting of both positive and negative information. Through iterative refining of summaries, it probes for the underlying rationale of statements to enhance the factuality of scientific summarization. ISQA does this in a fine-grained manner by asking a summarization agent to reinforce validated statements in positive feedback and fix incorrect ones in negative feedback. Our findings demonstrate that the ISQA feedback mechanism significantly improves the factuality of various open-source LLMs on the summarization task, as evaluated across multiple scientific datasets.


How Good Are SOTA Fake News Detectors

arXiv.org Artificial Intelligence

Automatic fake news detection with machine learning can prevent the dissemination of false statements before they gain many views. Several datasets labeling statements as legitimate or false have been created since the 2016 United States presidential election for the prospect of training machine learning models. We evaluate the robustness of both traditional and deep state-of-the-art models to gauge how well they may perform in the real world. We find that traditional models tend to generalize better to data outside the distribution it was trained on compared to more recently-developed large language models, though the best model to use may depend on the specific task at hand.


Emulating Reader Behaviors for Fake News Detection

arXiv.org Artificial Intelligence

The wide dissemination of fake news has affected our lives in many aspects, making fake news detection important and attracting increasing attention. Existing approaches make substantial contributions in this field by modeling news from a single-modal or multi-modal perspective. However, these modal-based methods can result in sub-optimal outcomes as they ignore reader behaviors in news consumption and authenticity verification. For instance, they haven't taken into consideration the component-by-component reading process: from the headline, images, comments, to the body, which is essential for modeling news with more granularity. To this end, we propose an approach of Emulating the behaviors of readers (Ember) for fake news detection on social media, incorporating readers' reading and verificating process to model news from the component perspective thoroughly. Specifically, we first construct intra-component feature extractors to emulate the behaviors of semantic analyzing on each component. Then, we design a module that comprises inter-component feature extractors and a sequence-based aggregator. This module mimics the process of verifying the correlation between components and the overall reading and verification sequence. Thus, Ember can handle the news with various components by emulating corresponding sequences. We conduct extensive experiments on nine real-world datasets, and the results demonstrate the superiority of Ember.


Detecting Contextomized Quotes in News Headlines by Contrastive Learning

arXiv.org Artificial Intelligence

Quotes are critical for establishing credibility in news articles. A direct quote enclosed in quotation marks has a strong visual appeal and is a sign of a reliable citation. Unfortunately, this journalistic practice is not strictly Figure 1: The central idea of QuoteCSE is based on followed, and a quote in the headline is often journalism principles, where quotes from news headlines "contextomized." Such a quote uses words out and body text should be matched. The proposed of context in a way that alters the speaker's contrastive learning framework maximizes the intention so that there is no semantically semantic similarity between the headline quote and the matching quote in the body text. We present matched quote in the body text while minimizing the QuoteCSE, a contrastive learning framework similarity for other unmatched quotes in the same or that represents the embedding of news quotes other articles.


Incongruity Detection between Bangla News Headline and Body Content through Graph Neural Network

arXiv.org Artificial Intelligence

Incongruity between news headlines and the body content is a common method of deception used to attract readers. Profitable headlines pique readers' interest and encourage them to visit a specific website. This is usually done by adding an element of dishonesty, using enticements that do not precisely reflect the content being delivered. As a result, automatic detection of incongruent news between headline and body content using language analysis has gained the research community's attention. However, various solutions are primarily being developed for English to address this problem, leaving low-resource languages out of the picture. Bangla is ranked 7th among the top 100 most widely spoken languages, which motivates us to pay special attention to the Bangla language. Furthermore, Bangla has a more complex syntactic structure and fewer natural language processing resources, so it becomes challenging to perform NLP tasks like incongruity detection and stance detection. To tackle this problem, for the Bangla language, we offer a graph-based hierarchical dual encoder (BGHDE) model that learns the content similarity and contradiction between Bangla news headlines and content paragraphs effectively. The experimental results show that the proposed Bangla graph-based neural network model achieves above 90% accuracy on various Bangla news datasets.


Combination Of Convolution Neural Networks And Deep Neural Networks For Fake News Detection

arXiv.org Artificial Intelligence

Nowadays, People prefer to follow the latest news on social media, as it is cheap, easily accessible, and quickly disseminated. However, it can spread fake or unreliable, low-quality news that intentionally contains false information. The spread of fake news can have a negative effect on people and society. Given the seriousness of such a problem, researchers did their best to identify patterns and characteristics that fake news may exhibit to design a system that can detect fake news before publishing. In this paper, we have described the Fake News Challenge stage #1 (FNC-1) dataset and given an overview of the competitive attempts to build a fake news detection system using the FNC-1 dataset. The proposed model was evaluated with the FNC-1 dataset. A competitive dataset is considered an open problem and a challenge worldwide. This system's procedure implies processing the text in the headline and body text columns with different natural language processing techniques. After that, the extracted features are reduced using the elbow truncated method, finding the similarity between each pair using the soft cosine similarity method. The new feature is entered into CNN and DNN deep learning approaches. The proposed system detects all the categories with high accuracy except the disagree category. As a result, the system achieves up to 84.6 % accuracy, classifying it as the second ranking based on other competitive studies regarding this dataset.


Machine Learning-based Automatic Annotation and Detection of COVID-19 Fake News

arXiv.org Artificial Intelligence

COVID-19 impacted every part of the world, although the misinformation about the outbreak traveled faster than the virus. Misinformation spread through online social networks (OSN) often misled people from following correct medical practices. In particular, OSN bots have been a primary source of disseminating false information and initiating cyber propaganda. Existing work neglects the presence of bots that act as a catalyst in the spread and focuses on fake news detection in 'articles shared in posts' rather than the post (textual) content. Most work on misinformation detection uses manually labeled datasets that are hard to scale for building their predictive models. In this research, we overcome this challenge of data scarcity by proposing an automated approach for labeling data using verified fact-checked statements on a Twitter dataset. In addition, we combine textual features with user-level features (such as followers count and friends count) and tweet-level features (such as number of mentions, hashtags and urls in a tweet) to act as additional indicators to detect misinformation. Moreover, we analyzed the presence of bots in tweets and show that bots change their behavior over time and are most active during the misinformation campaign. We collected 10.22 Million COVID-19 related tweets and used our annotation model to build an extensive and original ground truth dataset for classification purposes. We utilize various machine learning models to accurately detect misinformation and our best classification model achieves precision (82%), recall (96%), and false positive rate (3.58%). Also, our bot analysis indicates that bots generated approximately 10% of misinformation tweets. Our methodology results in substantial exposure of false information, thus improving the trustworthiness of information disseminated through social media platforms.


Less is Less: When Are Snippets Insufficient for Human vs Machine Relevance Estimation?

arXiv.org Artificial Intelligence

Traditional information retrieval (IR) ranking models process the full text of documents. Newer models based on Transformers, however, would incur a high computational cost when processing long texts, so typically use only snippets from the document instead. The model's input based on a document's URL, title, and snippet (UTS) is akin to the summaries that appear on a search engine results page (SERP) to help searchers decide which result to click. This raises questions about when such summaries are sufficient for relevance estimation by the ranking model or the human assessor, and whether humans and machines benefit from the document's full text in similar ways. To answer these questions, we study human and neural model based relevance assessments on 12k query-documents sampled from Bing's search logs. We compare changes in the relevance assessments when only the document summaries and when the full text is also exposed to assessors, studying a range of query and document properties, e.g., query type, snippet length. Our findings show that the full text is beneficial for humans and a BERT model for similar query and document types, e.g., tail, long queries. A closer look, however, reveals that humans and machines respond to the additional input in very different ways. Adding the full text can also hurt the ranker's performance, e.g., for navigational queries.


POSHAN: Cardinal POS Pattern Guided Attention for News Headline Incongruence

arXiv.org Artificial Intelligence

Automatic detection of click-bait and incongruent news headlines is crucial to maintaining the reliability of the Web and has raised much research attention. However, most existing methods perform poorly when news headlines contain contextually important cardinal values, such as a quantity or an amount. In this work, we focus on this particular case and propose a neural attention based solution, which uses a novel cardinal Part of Speech (POS) tag pattern based hierarchical attention network, namely POSHAN, to learn effective representations of sentences in a news article. In addition, we investigate a novel cardinal phrase guided attention, which uses word embeddings of the contextually-important cardinal value and neighbouring words. In the experiments conducted on two publicly available datasets, we observe that the proposed methodgives appropriate significance to cardinal values and outperforms all the baselines. An ablation study of POSHAN shows that the cardinal POS-tag pattern-based hierarchical attention is very effective for the cases in which headlines contain cardinal values.