AITopics | textual data

textual data

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

e2c054ffc0467962d3fd3b2f17df910c-Paper-Conference.pdf

Neural Information Processing Systems | Feb-18-2026, 11:05:20 GMT

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

Asia > South Korea > Seoul > Seoul (0.04)
Europe > Spain > Andalusia > Granada Province > Granada (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(4 more...)

Detection of Cyberbullying in GIF using AI

Dave, Pal, Yuan, Xiaohong, Siddula, Madhuri, Roy, Kaushik

arXiv.org Artificial Intelligence | Dec-10-2025

Cyberbullying is a well-known social issue, and it is escalating day by day. Due to the vigorous development of the internet, social media provide many different ways for users to express their opinions and exchange information. Cyberbullying occurs on social media through text messages, comments, shared images, GIFs or stickers, and audio and video. Much research has addressed detecting cyberbullying in textual data, and some work covers images. Very few studies address detecting cyberbullying in GIFs/stickers. We collected a GIF dataset from Twitter and applied a deep learning model to detect cyberbullying in the dataset. First, we extracted hashtags related to cyberbullying using Twitter. We used these hashtags to download GIF files using the publicly available GIPHY API. We collected over 4100 GIFs, including cyberbullying and non-cyberbullying examples. We applied the pre-trained deep learning model VGG16 to detect cyberbullying. The deep learning model achieved an accuracy of 97%. Our work provides the GIF dataset for researchers working in this area.

artificial intelligence, machine learning, social media, (16 more...)

arXiv.org Artificial Intelligence

2512.07838

Country: North America > United States (0.46)

Genre: Research Report (1.00)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Future of AI Models: A Computational perspective on Model collapse

Satharasi, Trivikram, Iyengar, S Sitharama

arXiv.org Artificial Intelligence | Nov-11-2025

Artificial Intelligence, especially Large Language Models (LLMs), has transformed domains such as software engineering, journalism, creative writing, academia, and media (Naveed et al. 2025; arXiv:2307.06435). Diffusion models like Stable Diffusion generate high-quality images and videos from text. Evidence shows rapid expansion: 74.2% of newly published webpages now contain AI-generated material (Ryan Law 2025), 30-40% of the active web corpus is synthetic (Spennemann 2025; arXiv:2504.08755), 52% of U.S. adults use LLMs for writing, coding, or research (Staff 2025), and audits find AI involvement in 18% of financial complaints and 24% of press releases (Liang et al. 2025). The underlying neural architectures, including Transformers (Vaswani et al. 2023; arXiv:1706.03762), RNNs, LSTMs, GANs, and diffusion networks, depend on large, diverse, human-authored datasets (Shi & Iyengar 2019). As synthetic content dominates, recursive training risks eroding linguistic and semantic diversity, producing Model Collapse (Shumailov et al. 2024; arXiv:2307.15043; Dohmatob et al. 2024; arXiv:2402.07712). This study quantifies and forecasts collapse onset by examining year-wise semantic similarity in English-language Wikipedia (filtered Common Crawl) from 2013 to 2025 using Transformer embeddings and cosine similarity metrics. Results reveal a steady rise in similarity before public LLM adoption, likely driven by early RNN/LSTM translation and text-normalization pipelines, though the rise was modest because of the smaller scale. Observed fluctuations reflect irreducible linguistic diversity, variable corpus size across years, and finite sampling error. After the public adoption of LLMs, similarity rises exponentially. These findings provide a data-driven estimate of when recursive AI contamination may significantly threaten data richness and model generalization.
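The year-wise similarity measurement described in this abstract can be illustrated with a minimal sketch. It assumes each year's documents have already been turned into embedding vectors and that the per-year score is the mean cosine similarity over all document pairs; the abstract does not state the aggregation, so that choice, the function names, and the toy vectors are all hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


def similarity_by_year(embeddings_by_year):
    """Map each year to its mean pairwise similarity, in chronological order."""
    return {
        year: mean_pairwise_similarity(vectors)
        for year, vectors in sorted(embeddings_by_year.items())
    }


# Toy 2-D "embeddings" (hypothetical): the later year is far more homogeneous.
toy = {
    2013: [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    2025: [[1.0, 0.1], [1.0, 0.2], [1.0, 0.0]],
}
trend = similarity_by_year(toy)
```

A rising value across years in `trend` would correspond to the loss of semantic diversity the abstract associates with model collapse. Real embeddings would come from a Transformer encoder, which this sketch leaves out.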

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2511.05535

Country: North America > United States (1.00)

Genre: Research Report (1.00)

Industry: Media > News (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

The "Right" Discourse on Migration: Analysing Migration-Related Tweets in Right and Far-Right Political Movements

Chatterjee, Nishan, Bajt, Veronika, Vitez, Ana Zwitter, Pollak, Senja

arXiv.org Artificial Intelligence | Oct-27-2025

The rise of right-wing populism in Europe has brought to the forefront the significance of analysing social media discourse to understand the dissemination of extremist ideologies and their impact on political outcomes. Twitter, as a platform for interaction and mobilisation, provides a unique window into the everyday communication of far-right supporters. In this paper, we propose a methodology that uses state-of-the-art natural language processing techniques with sociological insights to analyse the MIGR-TWIT corpus of far-right tweets in English and French. We aim to uncover patterns of discourse surrounding migration, hate speech, and persuasion techniques employed by right and far-right actors. By integrating linguistic, sociological, and computational approaches, we seek to offer cross-disciplinary insights into societal dynamics and contribute to a better understanding of contemporary challenges posed by right-wing extremism on social media platforms.

discourse, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2510.2122

Country: Europe > United Kingdom (1.00)

Genre: Research Report (1.00)

Industry:

Information Technology (1.00)
Government > Regional Government > Europe Government > United Kingdom Government (1.00)
Government > Immigration & Customs (1.00)
(5 more...)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

LLM-Integrated Bayesian State Space Models for Multimodal Time-Series Forecasting

Cho, Sungjun, Shin, Changho, Jo, Suenggwan, Yan, Xinya, Chaudhuri, Shourjo Aditya, Sala, Frederic

arXiv.org Artificial Intelligence | Oct-27-2025

Forecasting in the real world requires integrating structured time-series data with unstructured textual information, but existing methods are architecturally limited by fixed input/output horizons and are unable to model or quantify uncertainty. We address this challenge by introducing LLM-integrated Bayesian State space models (LBS), a novel probabilistic framework for multimodal temporal forecasting. At a high level, LBS consists of two components: (1) a state space model (SSM) backbone that captures the temporal dynamics of latent states from which both numerical and textual observations are generated and (2) a pretrained large language model (LLM) that is adapted to encode textual inputs for posterior state estimation and decode textual forecasts consistent with the latent trajectory. This design enables flexible lookback and forecast windows, principled uncertainty quantification, and improved temporal generalization thanks to the well-suited inductive bias of SSMs toward modeling dynamical systems. Experiments on the TextTimeCorpus benchmark demonstrate that LBS improves the previous state-of-the-art by 13.20% while providing human-readable summaries of each forecast. Our work is the first to unify LLMs and SSMs for joint numerical and textual prediction, offering a novel foundation for multimodal temporal reasoning.
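To make the state space side of this abstract concrete, here is a minimal sketch of posterior state estimation and forecasting with uncertainty in the simplest linear Gaussian case, a one-dimensional local-level model filtered with a Kalman filter. This is not the LBS architecture: it has no LLM component, and the model, function names, and parameter values are assumptions chosen only for illustration.

```python
def kalman_filter_1d(observations, process_var, obs_var,
                     prior_mean=0.0, prior_var=1.0):
    """Filter a local-level model: x_t = x_{t-1} + w_t, y_t = x_t + v_t.

    Returns one (posterior_mean, posterior_var) pair per time step.
    An observation of None is treated as missing: predict without updating.
    """
    mean, var = prior_mean, prior_var
    posteriors = []
    for y in observations:
        # Predict: a random-walk state keeps its mean; uncertainty grows.
        var = var + process_var
        if y is not None:
            # Update: weight the observation by the Kalman gain.
            gain = var / (var + obs_var)
            mean = mean + gain * (y - mean)
            var = (1.0 - gain) * var
        posteriors.append((mean, var))
    return posteriors


def forecast_state(mean, var, process_var, horizon):
    """Forecast the latent state; variance grows linearly with the horizon."""
    return [(mean, var + h * process_var) for h in range(1, horizon + 1)]


# Any lookback length works, and a missing observation is handled in place.
posteriors = kalman_filter_1d([1.0, 1.2, None, 0.9],
                              process_var=0.1, obs_var=0.5)
last_mean, last_var = posteriors[-1]
predictions = forecast_state(last_mean, last_var, process_var=0.1, horizon=3)
```

Because the filter carries a mean and a variance rather than a fixed-size window, the lookback and the forecast horizon can both vary freely, which is the property the abstract attributes to the SSM backbone.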

forecasting, large language model, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2510.20952

Country: North America > United States (0.93)

Genre: Research Report (0.64)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Multi-Label Clinical Text Eligibility Classification and Summarization System

Yerramsetty, Surya Tejaswi, Fathimah, Almas

arXiv.org Artificial Intelligence | Oct-16-2025

Clinical trials are central to medical progress because they help improve understanding of human health and the healthcare system. They play a key role in discovering new ways to detect, prevent, or treat diseases, and it is essential that clinical trials include participants with appropriate and diverse medical backgrounds. In this paper, we propose a system that leverages Natural Language Processing (NLP) and Large Language Models (LLMs) to automate multi-label clinical text eligibility classification and summarization. The system combines feature extraction methods such as word embeddings (Word2Vec) and named entity recognition to identify relevant medical concepts, along with traditional vectorization techniques such as count vectorization and TF-IDF (Term Frequency-Inverse Document Frequency). We further explore weighted TF-IDF word embeddings that integrate both count-based and embedding-based strengths to capture term importance effectively. Multi-label classification using Random Forest and SVM models is applied to categorize documents based on eligibility criteria. Summarization techniques including TextRank, Luhn, and GPT-3 are evaluated to concisely summarize eligibility requirements. Evaluation with ROUGE scores demonstrates the effectiveness of the proposed methods. This system shows potential for automating clinical trial eligibility assessment using data-driven approaches, thereby improving research efficiency.
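One way to read "weighted TF-IDF word embeddings" is as a document vector formed by averaging word vectors, with each term weighted by its TF-IDF score. The sketch below follows that reading. The smoothed IDF formula, the function names, the toy eligibility texts, and the two-dimensional vectors are all assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def inverse_document_frequency(tokenized_docs):
    """Smoothed IDF per term: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_weighted_embedding(tokens, idf, word_vectors, dim):
    """Average the word vectors of known terms, weighted by TF-IDF."""
    counts = Counter(t for t in tokens if t in word_vectors and t in idf)
    total = [0.0] * dim
    weight_sum = 0.0
    for term, count in counts.items():
        weight = (count / len(tokens)) * idf[term]
        weight_sum += weight
        for i, x in enumerate(word_vectors[term]):
            total[i] += weight * x
    if weight_sum == 0.0:
        return total  # no known terms: fall back to the zero vector
    return [x / weight_sum for x in total]


# Toy eligibility criteria and 2-D word vectors (hypothetical values).
docs = [
    ["age", "over", "18", "diabetes"],
    ["age", "over", "65", "hypertension"],
    ["no", "prior", "chemotherapy"],
]
vectors = {
    "age": [1.0, 0.0],
    "diabetes": [0.0, 1.0],
    "hypertension": [0.5, 0.5],
}
idf = inverse_document_frequency(docs)
doc_vec = tfidf_weighted_embedding(docs[0], idf, vectors, dim=2)
```

In the first document the rarer term "diabetes" receives a higher IDF than "age", so the document vector leans toward the "diabetes" direction. The resulting vectors could then be passed to a multi-label classifier such as the Random Forest or SVM models the abstract mentions.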

classification, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2510.13115

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area (0.95)
Health & Medicine > Pharmaceuticals & Biotechnology (0.92)
Health & Medicine > Health Care Technology > Medical Record (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.66)

Textual Training for the Hassle-Free Removal of Unwanted Visual Data: Case Studies on OOD and Hateful Image Detection

Lee, Saehyung

Neural Information Processing Systems | Oct-10-2025, 19:29:03 GMT

Furthermore, HFTT employs a clever textual data synthesis method, effectively emulating the integration of unknown visual data distribution into the training process at no extra cost.

dataset, detection, hftt, (13 more...)

Neural Information Processing Systems

Country:

Asia > South Korea > Seoul > Seoul (0.04)
Europe > Spain > Andalusia > Granada Province > Granada (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(4 more...)

8e7768122f3eeec6d77cd2b424b72413-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems | Oct-10-2025, 09:20:06 GMT

dataset, multi 0, uni 0, (15 more...)

Neural Information Processing Systems

Country:

North America > United States (1.00)
North America > Mexico (0.14)

Genre: Research Report > Experimental Study (0.93)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Health & Medicine > Epidemiology (1.00)
(6 more...)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(4 more...)

Fidel-TS: A High-Fidelity Benchmark for Multimodal Time Series Forecasting

Xu, Zhijian, Cai, Wanxu, Dai, Xilin, Deng, Zhaorong, Xu, Qiang

arXiv.org Machine Learning | Sep-30-2025

The evaluation of time series forecasting models is hindered by a critical lack of high-quality benchmarks, leading to a potential illusion of progress. Existing datasets suffer from issues ranging from pre-training data contamination in the age of LLMs to the causal and description leakage prevalent in early multimodal designs. To address this, we formalize the core principles of high-fidelity benchmarking, focusing on data sourcing integrity, strict causal soundness, and structural clarity. We introduce Fidel-TS, a new large-scale benchmark built from the ground up on these principles by sourcing data from live APIs. Our extensive experiments validate this approach by exposing the critical biases and design limitations of prior benchmarks. Furthermore, we conclusively demonstrate that the causal relevance of textual information is the key factor in unlocking genuine performance gains in multimodal forecasting.
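The leakage concern raised in this abstract can be illustrated with a small sketch. It assumes that "causal soundness" implies, at minimum, that a forecast may only use text published before the forecast origin; the abstract does not define the term operationally, so that rule, the record layout, and the function name are hypothetical.

```python
from datetime import datetime


def causally_valid_texts(text_records, forecast_origin):
    """Keep only texts published strictly before the forecast origin.

    `text_records` is a list of (published_at, text) pairs.
    """
    return [
        text
        for published_at, text in text_records
        if published_at < forecast_origin
    ]


records = [
    (datetime(2025, 1, 1, 8), "Heatwave warning issued for tomorrow."),
    (datetime(2025, 1, 2, 9), "Record load observed during yesterday's heatwave."),
]
origin = datetime(2025, 1, 2, 0)
visible = causally_valid_texts(records, origin)
```

The second record describes the outcome after the fact, so pairing it with a forecast made at `origin` would leak the answer. The filter drops it and keeps only the warning published beforehand.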

benchmark, dataset, forecasting, (16 more...)

arXiv.org Machine Learning

2509.24789

Country: